|
|
Natsume (2006-01-15) Find duplicate files. Synopsis perl natsume.pl <files> find -type f | perl natsume.pl Description Natsume detects duplicates in a set of files, taken from command line arguments if available, or stdin otherwise. For each duplicate, Natsume will print a shell command for ln(1). That is, the first quoted string is the relative path to the duplicate file, and the second quoted string is duplicate file itself. For each empty file, Natsume will print the command to link to /dev/null. Thus, if your system has all the right utilities installed, you should be able to do something like: find -type f | perl natsume.pl | sh ... and see no difference in file contents, other than the decrease in disk space usage. To be safe, you should always check what Natsume thinks are duplicate files first before changing them to symlinks blindly. Files are always processed in sorted order, after the paths are normalized (removing excess . and .. components), so the output is deterministic even if the input files are specified in random order. Absolute paths should work, but mixing absolute path names with relative path names is probably a bad idea. That feature, or really, the whole program -- use at your own risk. Implementation Duplicate files are detected in 3 level hierarchy of hash tables: 1. Check that two files have difference sizes. 2. Check that MD5 for the first 1K of data are different. 3. Check that MD5 for the entire file is different. Each level is not evaluated until the previous level detects a collision. That is, MD5 for header part of a file is not computed at all until a second file with the same size is seen, and MD5 for the entire file is not computed until a second file with the same size and same header MD5 is seen. This saves a lot of file reads if there are few duplicate files, which is generally the case. Natsume can generally process gigabytes of files in a few seconds. Because full file comparisons are never made, there is a chance of false positives. Also, MD5 collisions is less of an assurance for duplicate files ever since someone found a fast way to crack it in 2005. This doesn't affect most files generated in the usual fashion (e.g. images downloaded from internet), but thought I should mention it. The same program was implemented in Perl, Python, OCaml, and C. Obviously the C version took the most effort since I had to implement all the data structures and MD5 function myself. OCaml was next, but for the ratio of coding effort versus performance, OCaml had the best deal. Python took the least time thanks to its extensive collection of standard libraries, but is also the slowest of them all. Packaged executable is compiled from the C implementation. Miscellaneous Template is based on Konno Natsume from Mahoraba. Natsume is one of Kozue's multiple personalities, and ends every other sentence with "maybe". This utility is somewhat similar in spirit, maybe. -- omoikane@uguu.org - http://uguu.org/
RAR | ZIP | / | |
2025-08-26 |