Cutting down duplicate files

UNIX tools be saving me time…

March 20th, 2017

In this locale, people tend to rely too heavily on sneakernets. By this I don’t mean your regular pirated media files, but actual documents that really ought to be version-controlled.

Spring-cleaning my hard drive turned up several folders of PDF slides with names you’d expect sneakernet-ed files to have (think “New”, “Update New”, and other devious combinations).

Now the slides were old, but still relevant, and they’d probably be useful later. I moved all of the files from those folders into a single one…

$ cd slides
$ ls | wc -l 
28

Ugh. 28 files. And by the looks of it, a whole bunch of “versioned” files.

I don’t really relish the idea of going through each and every file just to determine whether it’s a duplicate of another, especially when the changes might be buried halfway through an 80-page document.

UNIX tools to the rescue! Drum-roll please…

The plan: compute a checksum for every file, then look for checksums that show up more than once.

So, how does that look in shell?

$ md5sum * | tee slides.sums
f23310e777825103e547d42d46fa9ec8  CHAPTER-03-v1.pdf
47c0cbdcf66996bdc2d3c62e6c80b65c  CHAPTER-03-v2.pdf
...
8a4e4506ab29b421ddeb04a8546c943b  CHAPTER-07-v7.pdf
71869d372f8f8d264fea6291e89d6ccd  CHAPTER-08.pdf
$ cut -d ' ' -f 1 slides.sums | sort | uniq -d | tee dup_slides.sums
05b2f3169f2eda0861db5f33a57d845d
14287febecded35be79b88990bbac75f
15fc012b129ff7766af49cbb41108a37
62e2fa7dbf2b20f1cc6ebe91c96ec605
6d8c58baf463eb0038c3b5bc4ab43864
71869d372f8f8d264fea6291e89d6ccd
8a4e4506ab29b421ddeb04a8546c943b
bb31637676c57bdaa0b1f2dd5746fef2
c969dfa9a54997635d9fd15858764d0c
ffa4b026518ce7de68077d6587041be5

If two or more files share the same MD5 hash, you can safely assume their contents are identical; the odds of an accidental collision are negligible. The -d flag on uniq prints only the lines that appear more than once, so every hash it outputs belongs to at least two identical files.
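
If you want to be extra careful, a flagged pair can be confirmed byte-for-byte with cmp before anything gets deleted. This is just a spot-check sketch, not part of the original run; cmp prints nothing and exits 0 when the files match (the pair below is one of the matches listed further down):

$ cmp CHAPTER-04.pdf CHAPTER-04-v2.pdf && echo 'byte-for-byte identical'
byte-for-byte identical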

Now we need to find which files have these hashes:

$ for l in $(<dup_slides.sums)
> do
>   echo "Files with md5 hash ${l}: "
>   grep "$l" slides.sums | cut -d ' ' -f3
>   echo ''
> done

Files with md5 hash 05b2f3169f2eda0861db5f33a57d845d: 
CHAPTER-04-v2.pdf
CHAPTER-04.pdf

Files with md5 hash 14287febecded35be79b88990bbac75f: 
CHAPTER-05-v3.pdf
CHAPTER-05-v5.pdf

Files with md5 hash 15fc012b129ff7766af49cbb41108a37: 
CHAPTER-07.pdf
CHAPTER-07-v3.pdf

Files with md5 hash 62e2fa7dbf2b20f1cc6ebe91c96ec605: 
CHAPTER-14-v1.pdf
CHAPTER-14.pdf

Files with md5 hash 6d8c58baf463eb0038c3b5bc4ab43864: 
CHAPTER-14-v3.pdf
CHAPTER-02-v1.pdf

Files with md5 hash 71869d372f8f8d264fea6291e89d6ccd: 
CHAPTER-08-v1.pdf
CHAPTER-08.pdf

Files with md5 hash 8a4e4506ab29b421ddeb04a8546c943b: 
CHAPTER-07-v5.pdf
CHAPTER-07-v7.pdf

Files with md5 hash bb31637676c57bdaa0b1f2dd5746fef2: 
CHAPTER-06-v2.pdf
CHAPTER-06-v6.pdf

Files with md5 hash c969dfa9a54997635d9fd15858764d0c: 
CHAPTER-05-v4.pdf
CHAPTER-05-v6.pdf

Files with md5 hash ffa4b026518ce7de68077d6587041be5: 
CHAPTER-14-v6.pdf
CHAPTER-14-v8.pdf

Now let’s tweak that to keep the first file for each hash and list the rest for removal:

$ for l in $(<dup_slides.sums)
> do
>   grep "$l" slides.sums | cut -d ' ' -f3 | sed '1d'
> done | xargs rm
$ ls | wc -l
18
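
For what it’s worth, the whole clean-up could probably be collapsed into a single pipeline. A rough sketch, assuming filenames without spaces or newlines (true here), where awk’s seen[$1]++ trick stands in for the sort | uniq -d step:

$ md5sum * | awk 'seen[$1]++ { print $2 }' | xargs rm

Same result, but the intermediate slides.sums file is handy if you want to eyeball what’s about to be deleted first.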

Ten duplicates gone! All with the power of the shell. Thanks, UNIX! Now for some Descent.