Cutting down duplicate files

UNIX tools be saving me time…

March 20th, 2017

In this locale, people tend to rely too heavily on sneakernets. By this I don’t mean your regular pirated media files, but actual documents that really ought to be version-controlled.

Spring-cleaning my hard drive turned up several folders of PDF slides with names you’d expect sneakernet-ed files to have (think “New”, “Update New”, and other devious combinations).

Now the slides were old, but still relevant, and they’d probably be useful later. I moved all of the files from those folders into a single one…

$ cd slides
$ ls | wc -l 
28

Ugh. 28 files. And by the looks of it, a whole bunch of “versioned” files.

I don’t really relish the idea of going through each and every file just to determine whether it’s a duplicate of another, especially when the changes might be buried halfway through an 80-page document.

UNIX tools to the rescue! Drum-roll please…

The plan: compute a checksum for every file, then look for checksums that show up more than once.

So, how does that look in shell?

$ md5sum * | tee slides.sums
f23310e777825103e547d42d46fa9ec8  CHAPTER-03-v1.pdf
47c0cbdcf66996bdc2d3c62e6c80b65c  CHAPTER-03-v2.pdf
...
8a4e4506ab29b421ddeb04a8546c943b  CHAPTER-07-v7.pdf
71869d372f8f8d264fea6291e89d6ccd  CHAPTER-08.pdf
$ cut -d ' ' -f 1 slides.sums | sort | uniq -d | tee dup_slides.sums
05b2f3169f2eda0861db5f33a57d845d
14287febecded35be79b88990bbac75f
15fc012b129ff7766af49cbb41108a37
62e2fa7dbf2b20f1cc6ebe91c96ec605
6d8c58baf463eb0038c3b5bc4ab43864
71869d372f8f8d264fea6291e89d6ccd
8a4e4506ab29b421ddeb04a8546c943b
bb31637676c57bdaa0b1f2dd5746fef2
c969dfa9a54997635d9fd15858764d0c
ffa4b026518ce7de68077d6587041be5

If two or more files share the same MD5 hash, you can safely assume their contents are identical; the odds of an accidental collision are negligible. The -d flag on uniq prints only the lines that appear more than once, so every hash it outputs belongs to at least two identical files.
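
If you want to be extra careful, a flagged pair can be confirmed byte-for-byte with cmp before anything gets deleted. This is just a spot-check sketch, not part of the original run; cmp prints nothing and exits 0 when the files match (the pair below is one of the matches listed further down):

$ cmp CHAPTER-04.pdf CHAPTER-04-v2.pdf && echo 'byte-for-byte identical'
byte-for-byte identical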

Now we need to find which files have these hashes:

$ for l in $(<dup_slides.sums)
> do
>   echo "Files with md5 hash ${l}: "
>   grep "$l" slides.sums | cut -d ' ' -f3
>   echo ''
> done

Files with md5 hash 05b2f3169f2eda0861db5f33a57d845d: 
CHAPTER-04-v2.pdf
CHAPTER-04.pdf

Files with md5 hash 14287febecded35be79b88990bbac75f: 
CHAPTER-05-v3.pdf
CHAPTER-05-v5.pdf

Files with md5 hash 15fc012b129ff7766af49cbb41108a37: 
CHAPTER-07.pdf
CHAPTER-07-v3.pdf

Files with md5 hash 62e2fa7dbf2b20f1cc6ebe91c96ec605: 
CHAPTER-14-v1.pdf
CHAPTER-14.pdf

Files with md5 hash 6d8c58baf463eb0038c3b5bc4ab43864: 
CHAPTER-14-v3.pdf
CHAPTER-02-v1.pdf

Files with md5 hash 71869d372f8f8d264fea6291e89d6ccd: 
CHAPTER-08-v1.pdf
CHAPTER-08.pdf

Files with md5 hash 8a4e4506ab29b421ddeb04a8546c943b: 
CHAPTER-07-v5.pdf
CHAPTER-07-v7.pdf

Files with md5 hash bb31637676c57bdaa0b1f2dd5746fef2: 
CHAPTER-06-v2.pdf
CHAPTER-06-v6.pdf

Files with md5 hash c969dfa9a54997635d9fd15858764d0c: 
CHAPTER-05-v4.pdf
CHAPTER-05-v6.pdf

Files with md5 hash ffa4b026518ce7de68077d6587041be5: 
CHAPTER-14-v6.pdf
CHAPTER-14-v8.pdf

Now let’s tweak that to keep the first file for each hash and list the rest for removal:

$ for l in $(<dup_slides.sums)
> do
>   grep "$l" slides.sums | cut -d ' ' -f3 | sed '1d'
> done | xargs rm
$ ls | wc -l
18
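
For what it’s worth, the whole clean-up could probably be collapsed into a single pipeline. A rough sketch, assuming filenames without spaces or newlines (true here), where awk’s seen[$1]++ trick stands in for the sort | uniq -d step:

$ md5sum * | awk 'seen[$1]++ { print $2 }' | xargs rm

Same result, but the intermediate slides.sums file is handy if you want to eyeball what’s about to be deleted first.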

Ten duplicates gone! All with the power of the shell. Thanks, UNIX! Now for some Descent.