- Web Crawler
- katana (https://github.com/projectdiscovery/katana) : really powerful, but several key features (such as headless crawling) have known issues.
- Uploaders to the Wayback Machine
- spn.sh (https://github.com/overcast07/wayback-machine-spn-scripts) : Shell Scripts
- wabarc/wayback (https://github.com/wabarc/wayback) : Golang
- palewire/savepagenow (https://github.com/palewire/savepagenow) : Python
Each uploader has different pros and cons; none of the tools above is perfect.
# the exact options will differ depending on the website and the situation
# when using headless mode, add `-headless -concurrency 1` (see the headless variant below)
# standard
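# -ignore-query-params: skip re-crawling the same path with different query-param values
# -field-scope fqdn: keep the crawl scoped to the exact host (FQDN)
# -strategy breadth-first: visit pages breadth-first; -extension-match html,htm: output only .html/.htm URLs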
katana -u https://example.com/ -ignore-query-params -field-scope fqdn -debug -output urls.txt -timeout 30 -strategy breadth-first -extension-match html,htm
# another example: keep only the output URLs under /user/
katana -u http://example.com/user/ -ignore-query-params -field-scope fqdn -debug -output urls.txt -timeout 30 -strategy breadth-first -extension-match html,htm -match-regex 'https?://example\.com/user/.*'
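# headless variant, a sketch based on the note above; headless crawling has known issues, so adjust as needed
katana -u https://example.com/ -headless -concurrency 1 -ignore-query-params -field-scope fqdn -debug -output urls.txt -timeout 30 -strategy breadth-first -extension-match html,htm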
# wabarc/wayback
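# --ia sends the capture to the Internet Archive; the sleep throttles submissions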
cat urls.txt | xargs -P3 -I{} sh -c 'wayback --ia {} && sleep 5s'
# spn.sh
# The API keys come from https://archive.org/account/s3.php
# The options below configure the following behavior:
# - Skip checking whether a capture is a first; this makes captures run faster.
# - Capture a page only if its latest capture in the Archive is older than 7 days (604800 seconds).
# - Wait 10 seconds after starting a capture job before starting the next one.
spn.sh -a accesskey:secret -d 'skip_first_archive=1&if_not_archived_within=604800' -w 10 urls.txt
# palewire/savepagenow
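# savepagenow reads SAVEPAGENOW_ACCESS_KEY / SAVEPAGENOW_SECRET_KEY from the environment,
# and -a (--authenticate) submits the capture with those credentials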
echo -e "SAVEPAGENOW_ACCESS_KEY=xxxxxxxxxx\nSAVEPAGENOW_SECRET_KEY=xxxxxxxxxx" > ~/.env
cat urls.txt | xargs -P3 -I{} uv run --env-file ~/.env savepagenow -a {}