Let’s say you have a folder with 5000 MP3 files you want to check for duplicates. Or a directory containing thousands of EPUB files, all with different names, but you have a hunch some of them might be duplicates. You can cd your way up to that particular folder in the console and then do a
find -not -empty -type f -printf '%s\n' | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
This will output a list of duplicate files, grouped according to their MD5 hash.
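If you want to see what each step does, here is the same pipeline spread over several lines. This is just a readability sketch assuming the GNU versions of find, sort, uniq and md5sum; the explicit ./ starting path is only added for clarity:

# list sizes of non-empty files, keep sizes that occur more than once,
# then hash only the files with those sizes and group identical hashes
find . -not -empty -type f -printf '%s\n' \
  | sort -rn \
  | uniq -d \
  | xargs -I{} -n1 find . -type f -size {}c -print0 \
  | xargs -0 md5sum \
  | sort \
  | uniq -w32 --all-repeated=separate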
Another way is to install fdupes and do a
fdupes -r ./folder > duplicates_list.txt
The -r flag makes it recursive. Afterwards, open duplicates_list.txt in a text editor to see the list of duplicate files.
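fdupes prints each set of matching files together, with a blank line between sets, so the text file ends up looking something like this (file names made up for illustration):

./folder/song.mp3
./folder/backup/song (copy).mp3

./folder/book.epub
./folder/old/book-1.epub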
Thanks so much for posting this! This is exactly what I was looking for. Very useful.
Wow this is awesome, thanks! Is there any way you could send me an email that explains this code? That would be a huge help.
FYI, on cygwin, I had to modify the command to be:
find . -type f -printf '%s\n' | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
The above one-liner is O(n**2) in the number of nodes of the filesystem. In addition to the original `find` command, a separate `find` command must be run on each found duplicate. This is problematic if you’re running on a slow filesystem and/or through cygwin.
Here is a modified command line that only invokes one `find` command total:
find . -type f -exec stat --printf='%32s ' {} \; -exec md5sum {} \; | sort -rn | uniq -d -w65 --all-repeated=separate
I’ve further optimized this to eliminate md5sums for all but the files that match other files in size.
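A rough sketch of that idea, assuming GNU find, awk, xargs and md5sum (not the exact script, just one way to do it; it breaks on file names containing newlines):

# single find pass: record every file by size, then md5sum only files whose size repeats
find . -type f -printf '%s\t%p\n' \
  | awk -F'\t' '{count[$1]++; files[$1] = files[$1] $0 "\n"}
      END {for (s in count) if (count[s] > 1) printf "%s", files[s]}' \
  | cut -f2- \
  | xargs -r -d '\n' md5sum \
  | sort \
  | uniq -w32 --all-repeated=separate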
+1 for fdupes! I was not aware of this tool, and it is VERY handy! It also allows you to delete the duplicates on the fly 🙂
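For anyone curious, the flags for that are -d (delete, prompting for which copy to keep in each set) and -N (no prompt, keep the first copy in each set). Double-check man fdupes on your system before trying it, since the deletion is not reversible. Something like:

fdupes -rd ./folder
fdupes -rdN ./folder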
Thanks. fdupes is just what I needed to seek out clutter. Didn’t know about it, but I knew there ought to be a nice way to do that recursively.