Find duplicate files in Linux

Let’s say you have a folder with 5000 MP3 files you want to check for duplicates. Or a directory containing thousands of EPUB files, all with different names, but you have a hunch some of them might be duplicates. In the console, cd into that particular folder and then run:

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate

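This will output a list of files that are duplicates according to their MD5 hash: files whose contents are identical, whatever their names. Each group of duplicates is separated from the next by a blank line.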
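For anyone who wants to see what each stage of that one-liner does, here is the same pipeline broken out step by step. Treat it as a sketch rather than a polished script: the function name find_dupes and its optional directory argument are my own additions, and it assumes GNU find, xargs and coreutils (the -printf, -r and -w options are GNU extensions).

```shell
#!/usr/bin/env bash
# find_dupes [DIR]: print groups of files with identical MD5 hashes,
# one blank line between groups. The name and argument are illustrative.
find_dupes() {
    local dir="${1:-.}"

    # 1. Print the size in bytes of every non-empty regular file.
    find "$dir" -not -empty -type f -printf '%s\n' |
        # 2. Keep only the sizes that occur more than once.
        sort -rn | uniq -d |
        # 3. For each such size, list the files of exactly that many bytes
        #    (NUL-separated, so odd file names survive).
        xargs -I{} find "$dir" -type f -size {}c -print0 |
        # 4. Hash only those candidates; -r skips md5sum if there are none.
        xargs -0 -r md5sum |
        # 5. Sort by hash and print every line whose first 32 characters
        #    (the MD5 digest) repeat.
        sort | uniq -w32 --all-repeated=separate
}
```

Calling find_dupes ~/Music would check that folder; with no argument it checks the current directory. Comparing sizes first means md5sum only has to read files that could possibly be duplicates.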
Another way is to install fdupes and run:

fdupes -r ./folder > duplicates_list.txt

The -r option makes fdupes recurse into subdirectories. Open duplicates_list.txt afterwards in a text editor to see the list of duplicate files.
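If the list turns out to be long, a quick count of the duplicate sets can help before you open it. The snippet below is only a sketch and relies on fdupes' default output, where each set of duplicates is followed by a blank line; the function name count_dupe_groups is made up for this example.

```shell
#!/usr/bin/env bash
# count_dupe_groups FILE: count the blank-line-separated groups in an
# fdupes listing (hypothetical helper, not part of fdupes itself).
count_dupe_groups() {
    # RS="" puts awk in paragraph mode: each run of non-blank lines is
    # one record, so NR at the end is the number of groups.
    awk 'BEGIN { RS = "" } END { print NR }' "$1"
}
```

Running count_dupe_groups duplicates_list.txt prints a single number, the count of duplicate sets found.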

7 thoughts on “Find duplicate files in Linux”

Ronald Baljeu December 5, 2010

Thanks so much for posting this! This is exactly what I was looking for. Very useful.


Nate Chapman March 1, 2011

Wow this is awesome, thanks! Is there any way you could send me an email that explains this code? That would be a huge help.


Ross October 17, 2011

FYI, on cygwin, I had to modify the command to be:

find . -type f -printf '%s\n' | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate


Ross October 18, 2011

The above one-liner is O(n**2) in the number of nodes of the filesystem. In addition to the original `find` command, a separate `find` command must be run on each found duplicate. This is problematic if you’re running on a slow filesystem and/or through cygwin.

Here is a modified command line that only invokes one `find` command total:

find . -type f -exec stat --printf='%32s ' {} \; -exec md5sum {} \; | sort -rn | uniq -d -w65 --all-repeated=separate
