@peterjaap
Created April 8, 2025 08:20
Efficiently detect content-duplicate files between two directories (app/ and vendor/) regardless of path. This script computes SHA256 hashes and file sizes to identify files in app/ that are binary-equivalent to any file in vendor/. It avoids expensive cmp comparisons by leveraging sorted hash lists and join. Ideal for large codebases with poten…
#!/bin/bash
# Step 1: Hash all files in app/ and vendor/, keyed on "sha256_size" so that
# both the content hash and the file size must agree for a match.
# (Assumes GNU coreutils `stat`; filenames must not contain newlines, and the
# join/awk stage below additionally assumes no embedded spaces.)
find app -type f -exec stat --format="%s %n" {} + | while read -r size file; do
  hash=$(sha256sum "$file" | cut -d' ' -f1)
  echo "${hash}_${size} $file"
done | sort > /tmp/app_files.txt

find vendor -type f -exec stat --format="%s %n" {} + | while read -r size file; do
  hash=$(sha256sum "$file" | cut -d' ' -f1)
  echo "${hash}_${size} $file"
done | sort > /tmp/vendor_files.txt

# Step 2: Join the sorted lists on the combined hash+size key.
# Joining on size alone would pair files that merely happen to be the same
# length, and GNU join takes its -o field list as a single comma-separated
# argument.
join -j 1 -o 1.2,2.2 /tmp/app_files.txt /tmp/vendor_files.txt | awk '
{
  printf "MATCH:\n app: %s\n vendor: %s\n\n", $1, $2
}
'
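To sanity-check the hash+size join, here is a minimal, self-contained sketch (GNU coreutils assumed; the directory layout and file names are made up for the demo) that builds a throwaway app/ and vendor/ tree containing one content-duplicate pair and runs the same matching step:

```shell
# Hedged demo: verify that the hash+size join reports exactly the one
# content-duplicate pair, ignoring the file that exists only in app/.
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p app/x vendor/y
echo "shared content" > app/x/a.txt      # duplicated content
echo "shared content" > vendor/y/b.txt   # same bytes, different path
echo "only in app"    > app/x/c.txt      # unique to app/

# Build the sorted "hash_size path" lists, as in the script above.
for dir in app vendor; do
  find "$dir" -type f -exec stat --format="%s %n" {} + | while read -r size file; do
    hash=$(sha256sum "$file" | cut -d' ' -f1)
    echo "${hash}_${size} $file"
  done | sort > "$tmp/${dir}_files.txt"
done

# Join on the combined hash+size key; print the app path and the vendor path.
result=$(join -j 1 -o 1.2,2.2 "$tmp/app_files.txt" "$tmp/vendor_files.txt")
echo "$result"
# Prints: app/x/a.txt vendor/y/b.txt
```

The unique file `app/x/c.txt` produces a key with no counterpart in the vendor list, so `join` drops it without any per-file `cmp` call.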