Natsume (2006-01-15)

Find duplicate files.


   perl <files>

   find -type f | perl


   Natsume detects duplicates in a set of files, taken from command
   line arguments if available, or stdin otherwise.

   For each duplicate, Natsume will print a shell command for ln(1).
   That is, the first quoted string is the relative path to the
   duplicate file, and the second quoted string is duplicate file
   itself.  For each empty file, Natsume will print the command to
   link to /dev/null.

   Thus, if your system has all the right utilities installed, you
   should be able to do something like:

      find -type f | perl | sh

   ... and see no difference in file contents, other than the decrease
   in disk space usage.  To be safe, you should always check what
   Natsume thinks are duplicate files first before changing them to
   symlinks blindly.

   Files are always processed in sorted order, after the paths are
   normalized (removing excess . and .. components), so the output is
   deterministic even if the input files are specified in random
   order.  Absolute paths should work, but mixing absolute path names
   with relative path names is probably a bad idea.  That feature, or
   really, the whole program -- use at your own risk.


   Duplicate files are detected in 3 level hierarchy of hash tables:
   1. Check that two files have difference sizes.
   2. Check that MD5 for the first 1K of data are different.
   3. Check that MD5 for the entire file is different.

   Each level is not evaluated until the previous level detects a
   collision.  That is, MD5 for header part of a file is not computed
   at all until a second file with the same size is seen, and MD5 for
   the entire file is not computed until a second file with the same
   size and same header MD5 is seen.  This saves a lot of file reads
   if there are few duplicate files, which is generally the case.
   Natsume can generally process gigabytes of files in a few seconds.

   Because full file comparisons are never made, there is a chance of
   false positives.  Also, MD5 collisions is less of an assurance for
   duplicate files ever since someone found a fast way to crack it in
   2005.  This doesn't affect most files generated in the usual
   fashion (e.g. images downloaded from internet), but thought I
   should mention it.

   The same program was implemented in Perl, Python, OCaml, and C.
   Obviously the C version took the most effort since I had to
   implement all the data structures and MD5 function myself.  OCaml
   was next, but for the ratio of coding effort versus performance,
   OCaml had the best deal.  Python took the least time thanks to its
   extensive collection of standard libraries, but is also the slowest
   of them all.  Packaged executable is compiled from the C


   Template is based on Konno Natsume from Mahoraba.  Natsume is one
   of Kozue's multiple personalities, and ends every other sentence
   with "maybe".  This utility is somewhat similar in spirit, maybe.

-- -