Find duplicate files in Linux

October 1, 2010

Let’s say you have a folder with 5000 MP3 files you want to check for duplicates, or a directory containing thousands of EPUB files, all with different names, and you have a hunch some of them might be duplicates. Open a console, cd into that folder, and run:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

This will output a list of files that are duplicates, grouped by their MD5 hash.
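
For anyone who wants to see what each stage does, here is the same pipeline split into a small shell function. Treat it as a sketch: it assumes GNU find, xargs, md5sum and uniq, the name find_dupes is made up for illustration, and the -r on the second xargs is an addition so that md5sum is not run when there are no candidates.

```shell
#!/bin/sh
# find_dupes DIR: the one-liner above, one stage per line.
# find_dupes is a hypothetical helper name; GNU tools are assumed.
find_dupes() {
    dir="${1:-.}"
    # 1. Print the size in bytes of every non-empty regular file.
    find "$dir" -not -empty -type f -printf '%s\n' |
    # 2. Sort the sizes and keep only the ones that occur more than once.
    sort -rn | uniq -d |
    # 3. For each repeated size, list the files with exactly that size.
    xargs -I{} find "$dir" -type f -size {}c -print0 |
    # 4. Hash only those candidates (-r: do nothing if the list is empty).
    xargs -0 -r md5sum |
    # 5. Sort by hash and print groups whose first 32 characters (the MD5) match.
    sort | uniq -w32 --all-repeated=separate
}
```

Calling find_dupes ~/Music would then print groups of identical files separated by blank lines. Comparing sizes first means only files that share a size with another file get hashed.
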
Another way is to install fdupes and run:

fdupes -r ./folder > duplicates_list.txt

The -r option makes fdupes recurse into subdirectories. Open duplicates_list.txt in a text editor afterwards to see the groups of duplicate files.
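
A slightly fuller fdupes run might look like the sketch below. The wrapper name report_dupes and its default output file are assumptions for illustration. The flags themselves (-r, -S, -m) are standard fdupes options.

```shell
#!/bin/sh
# report_dupes DIR OUTFILE: hypothetical wrapper around fdupes.
report_dupes() {
    dir="${1:-.}"
    out="${2:-duplicates_list.txt}"
    # Bail out early if fdupes is not available on this machine.
    if ! command -v fdupes >/dev/null 2>&1; then
        echo "fdupes is not installed" >&2
        return 127
    fi
    # -r recurses into subdirectories, -S shows the size of each duplicate set.
    fdupes -r -S "$dir" > "$out"
    # -m prints only a summary of how many duplicates were found.
    fdupes -r -m "$dir"
}
```

For example, report_dupes ./folder duplicates_list.txt writes the full list to the file and prints the summary to the console. fdupes can also delete duplicates interactively with -d, which is worth trying on a copy of the folder first.
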

7 thoughts on “Find duplicate files in Linux”

  1. Nate Chapman

    Wow this is awesome, thanks! Is there any way you could send me an email that explains this code? That would be a huge help.

  2. Ross

    FYI, on cygwin, I had to modify the command to be:

    find . -type f -printf '%s\n' | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

  3. Ross

    The above one-liner is O(n**2) in the number of nodes of the filesystem. In addition to the original `find` command, a separate `find` command must be run on each found duplicate. This is problematic if you’re running on a slow filesystem and/or through cygwin.

    Here is a modified command line that only invokes one `find` command total:

    find . -type f -exec stat --printf='%32s ' {} \; -exec md5sum {} \; | sort -rn | uniq -d -w65 --all-repeated=separate

  4. Kwami

    +1 for fdupes! I was not aware of this tool, and it is VERY handy! It also allows you to delete the duplicates on the fly 🙂

  5. pbl

    Thanks. fdupes is just what I needed to seek clutter. Didn’t know it, but I knew there ought to be a nice way to do that recursively.

