Duplicate File Finder for UNIX

POSIX-compatible script working also on FreeBSD

I needed to find duplicate files on an embedded NAS4free install, which lacks developer tools (gcc) as well as the GNU utilities; the latter often offer additional features compared to the bare POSIX tools (see for example xargs, grep, sed, find). This meant that I couldn't use existing tools like FSlint, DupeGuru or FDupes.

Using my limited knowledge and a lot of StackExchange, I put together a script that scans multiple directories and lists potentially duplicate files.

The script uses either the file size (switch "-s") or the SHA512 checksum (the default) as the selection criterion.

The script uses find to list all the files and xargs with wc to print their sizes. After sorting by size, awk keeps only the files whose size occurs more than once. Unless the size-only mode is selected, these candidates are then checksummed, sorted again, and filtered once more to remove non-duplicates.
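The filtering step can be tried in isolation. The following is a minimal sketch, not the script itself: the function names keep_duplicates and sample, and the "size path" lines, are made up for illustration. This variant clears the held-back first line once it has been printed, so each file is listed only once.

```shell
#!/bin/sh
# Keep only lines whose first field occurs more than once.
# The first line of each group is held back until a second one shows up.
keep_duplicates() {
    awk '{
        if ($1 in first) {
            if (first[$1] != "") { print first[$1]; first[$1] = "" }
            print
        } else {
            first[$1] = $0
        }
    }'
}

# Hypothetical "size path" listing, in the shape produced by wc -c.
sample() {
    printf '%s\n' \
        '100 ./a.txt' \
        '100 ./b.txt' \
        '100 ./c.txt' \
        '250 ./d.txt' \
        '300 ./e.txt' \
        '300 ./f.txt'
}

sample | keep_duplicates
```

With this input, ./d.txt is dropped because its size is unique, while the three 100-byte files and the two 300-byte files are kept as candidates.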

It can happen that some files have the same size but different content, like MP3 files that differ only in a few characters of their ID3 tags. In this case, the "-s" mode helps to detect those as well.
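A small sketch of that situation, using two throwaway files (the names and contents are invented): both have the same size, so a size-based comparison pairs them, while a byte-by-byte comparison with cmp reports that they differ.

```shell
#!/bin/sh
# Two hypothetical files of equal length that differ in a few characters.
tmp=$(mktemp -d)
printf 'title=AAAA' > "$tmp/one"
printf 'title=BBBB' > "$tmp/two"

# wc -c may pad its output with blanks, so strip them before comparing.
size_one=$(wc -c < "$tmp/one" | tr -d ' ')
size_two=$(wc -c < "$tmp/two" | tr -d ' ')

# cmp -s is silent and only reports through its exit status.
if cmp -s "$tmp/one" "$tmp/two"; then
    same_content=yes
else
    same_content=no
fi

echo "sizes: $size_one $size_two, identical content: $same_content"
rm -rf "$tmp"
```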

No operation on the files is performed, since I may have different needs each time I run the script.

For more complete detection and easier selection, Duplicate File Finder is a great tool, even in its free version, but it requires Windows (accessing the NAS via SMB shares).

Download it.

#!/bin/bash

sizeOnly=0
if [ "$1" = "-s" ]; then
        shift # remove the 1st parameter
        sizeOnly=1
fi

if [ $# -eq 0 ]; then
        echo "Usage: $0 [-s] path(s)."
        echo "  -s only uses size for duplicates detection."
        exit 1
fi

# Keep only lines whose first field (size or checksum) occurs more than once.
# The first line of a group is held back and printed once, when a second one appears.
keepDuplicates='{if ($1 in used) {if (used[$1] != "") {print used[$1] ; used[$1]=""} print ;} else used[$1]=$0 ;}'

# "$@" passes every path as its own argument, so no eval or manual quoting is needed.
if [ $sizeOnly -eq 1 ]; then
        find "$@" -type f -print0 | \
                xargs -0 -P2 wc -c | \
                sort | \
                awk "$keepDuplicates"
else
        # sed strips only the size column, so file names starting with digits stay intact.
        find "$@" -type f -print0 | \
                xargs -0 -P2 wc -c | \
                sort | \
                awk "$keepDuplicates" | \
                sed -E 's/^[[:blank:]]*[[:digit:]]+[[:blank:]]//' | \
                tr '\n' '\0' | \
                xargs -L1 -P4 -0 sha512 -r | \
                sort | \
                awk "$keepDuplicates"
fi

The -P parameters of xargs determine how many parallel processes are used. For wc on my NAS (N40L processor, dual core), 2 parallel tasks were already optimal. For sha512, the number should be chosen according to the speed of the HDD relative to the speed of the processor. In my case one core can checksum about 85 MB/s with sha512, so I multiplied it by 4 to saturate the RAID1 of two disks.
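The same reasoning can be written as a small calculation. This is only a sketch: the helper name procs_needed is made up, and the 340 MB/s array throughput is an assumed figure (4 times 85), not a measurement; replace both numbers with values measured on your own hardware.

```shell
#!/bin/sh
# Number of parallel hashing processes needed to keep up with the disks:
# disk throughput divided by per-process throughput, rounded up.
procs_needed() {
    disk_mb_s=$1   # read throughput of the array, in MB/s (assumed value)
    core_mb_s=$2   # sha512 throughput of one process, in MB/s
    echo $(( (disk_mb_s + core_mb_s - 1) / core_mb_s ))
}

# 85 MB/s per process is the figure quoted in the text; 340 MB/s is assumed.
echo "xargs -P$(procs_needed 340 85)"
```

Going above the number of cores can still pay off when the processes spend part of their time waiting for the disks, which is why the value is best confirmed by timing a real run.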

Author: Olaf Marzocchi

First revision: 2017-01-01.
Last revision: 2017-01-02.