@stepney141
Last active April 14, 2025 09:12
My CLI Snippets to Save Web Pages to Wayback Machine (Internet Archive)

Requirements

Each uploader has its own pros and cons; none of the tools above is perfect.

Get URL List

# the exact options will differ depending on the website and the situation
# in headless mode, add `-headless -concurrency 1`

# standard
katana -u https://example.com/ -ignore-query-params -field-scope fqdn -debug -output urls.txt -timeout 30 -strategy breadth-first -extension-match html,htm

# another example
katana -u http://example.com/user/ -ignore-query-params -field-scope fqdn -debug -output urls.txt -timeout 30 -strategy breadth-first -extension-match html,htm -match-regex 'https?://example\.com/user/.*'
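Katana's output can contain near-duplicates (URL fragments, trailing slashes). A small cleanup pass before archiving — my own addition, not part of the original workflow — keeps the number of capture jobs down:

```shell
# normalize and deduplicate the crawled URL list before archiving
# (assumes urls.txt was produced by one of the katana commands above)
sed -e 's/#.*$//' -e 's:/$::' urls.txt | sort -u > urls.clean.txt
wc -l urls.clean.txt
```

Then feed `urls.clean.txt` to the uploaders below instead of `urls.txt`.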

Save

# wabarc/wayback
cat urls.txt | xargs -P3 -I{} sh -c 'wayback --ia {} && sleep 5s'
# spn.sh
# the API keys are from https://archive.org/account/s3.php
# These options produce the following behavior:
# - Skip checking whether a capture is the first one for the URL; this makes captures run faster.
# - Capture a page only if the latest existing capture in the Archive is more than 7 days old (604800 s = 7 days).
# - Wait 10 seconds after starting a capture job before starting the next one.
spn.sh -a accesskey:secret -d 'skip_first_archive=1&if_not_archived_within=604800' -w 10 urls.txt
# palewire/savepagenow
echo -e "SAVEPAGENOW_ACCESS_KEY=xxxxxxxxxx\nSAVEPAGENOW_SECRET_KEY=xxxxxxxxxx" > ~/.env
cat urls.txt | xargs -P3 -I{} uv run --env-file ~/.env savepagenow -a {}
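The Save Page Now endpoint occasionally fails transiently (rate limits, 5xx responses). A generic retry wrapper can be put in front of any of the uploaders above; this is a sketch of my own, and the `try_archive` helper name is hypothetical, not part of any of those tools:

```shell
# retry a command up to 3 times with exponential backoff (5 s, 10 s, 20 s)
# usage: try_archive wayback --ia "$url"
try_archive() {
  delay=5
  for attempt in 1 2 3; do
    "$@" && return 0
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
  done
  return 1
}
```

For example, `cat urls.txt | xargs -P3 -I{}` can invoke a small script that sources this function instead of calling the uploader directly.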