Created
April 8, 2025 08:20
-
-
Save peterjaap/0aa89b208b7c0fd82b43d337f3456c29 to your computer and use it in GitHub Desktop.
Efficiently detect content-duplicate files between two directories (app/ and vendor/) regardless of path. This script computes SHA256 hashes and file sizes to identify files in app/ that are binary-equivalent to any file in vendor/. It avoids expensive cmp comparisons by leveraging sorted hash lists and join. Ideal for large codebases with poten…
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/bin/bash | |
# Step 1: Hash all files in app/ and vendor/ using sha256 and file size | |
find app -type f -exec stat --format="%s %n" {} + | while read -r size file; do | |
hash=$(sha256sum "$file" | cut -d' ' -f1) | |
echo "$size $hash $file" | |
done | sort > /tmp/app_files.txt | |
find vendor -type f -exec stat --format="%s %n" {} + | while read -r size file; do | |
hash=$(sha256sum "$file" | cut -d' ' -f1) | |
echo "$size $hash $file" | |
done | sort > /tmp/vendor_files.txt | |
# Step 2: Match based on size + hash | |
join -j 1 -o 1.1 1.2 1.3 2.3 -t' ' /tmp/app_files.txt /tmp/vendor_files.txt | awk ' | |
{ | |
printf "MATCH:\n app: %s\n vendor: %s\n\n", $3, $4 | |
} | |
' |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment