Natsume (2006-01-15)
Find duplicate files.
Synopsis
perl natsume.pl <files>
find -type f | perl natsume.pl
Description
Natsume detects duplicates in a set of files, taken from command
line arguments if available, or stdin otherwise.
For each duplicate, Natsume will print a shell command for ln(1).
That is, the first quoted string is the relative path to the
duplicate file, and the second quoted string is duplicate file
itself. For each empty file, Natsume will print the command to
link to /dev/null.
Thus, if your system has all the right utilities installed, you
should be able to do something like:
find -type f | perl natsume.pl | sh
... and see no difference in file contents, other than the decrease
in disk space usage. To be safe, you should always check what
Natsume thinks are duplicate files first before changing them to
symlinks blindly.
Files are always processed in sorted order, after the paths are
normalized (removing excess . and .. components), so the output is
deterministic even if the input files are specified in random
order. Absolute paths should work, but mixing absolute path names
with relative path names is probably a bad idea. That feature, or
really, the whole program -- use at your own risk.
Implementation
Duplicate files are detected in 3 level hierarchy of hash tables:
1. Check that two files have difference sizes.
2. Check that MD5 for the first 1K of data are different.
3. Check that MD5 for the entire file is different.
Each level is not evaluated until the previous level detects a
collision. That is, MD5 for header part of a file is not computed
at all until a second file with the same size is seen, and MD5 for
the entire file is not computed until a second file with the same
size and same header MD5 is seen. This saves a lot of file reads
if there are few duplicate files, which is generally the case.
Natsume can generally process gigabytes of files in a few seconds.
Because full file comparisons are never made, there is a chance of
false positives. Also, MD5 collisions is less of an assurance for
duplicate files ever since someone found a fast way to crack it in
2005. This doesn't affect most files generated in the usual
fashion (e.g. images downloaded from internet), but thought I
should mention it.
The same program was implemented in Perl, Python, OCaml, and C.
Obviously the C version took the most effort since I had to
implement all the data structures and MD5 function myself. OCaml
was next, but for the ratio of coding effort versus performance,
OCaml had the best deal. Python took the least time thanks to its
extensive collection of standard libraries, but is also the slowest
of them all. Packaged executable is compiled from the C
implementation.
Miscellaneous
Template is based on Konno Natsume from Mahoraba. Natsume is one
of Kozue's multiple personalities, and ends every other sentence
with "maybe". This utility is somewhat similar in spirit, maybe.
--
omoikane@uguu.org - http://uguu.org/