Crawl a site to find 404, 301, 302, 500, etc. responses
# Crawl a site's public URLs to produce a CSV list of URLs and response codes.
# This could be reduced to a single command, but I find it helpful to keep a list of all URLs.
# Overview: crawl the site and write one URL per line to a text file.
# NOTE: this must run to completion before the next step.
# wget mirrors the site (including static files).
# grep keeps only the request log lines (they start with "--" and contain the URL).
# awk grabs the 3rd field (space-separated), which is the URL, and writes it to urls.txt.
wget --mirror -p https://domain.com/ 2>&1 | grep '^--' | awk '{ print $3 }' > urls.txt
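# Optional follow-up (a small sketch, assuming the urls.txt produced above):
# wget can log the same URL more than once, so deduplicating before testing
# avoids redundant requests in the next step.
sort -u urls.txt -o urls.txt
wc -l urls.txt # number of unique URLs that will be tested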
# Overview: given a file with one URL per line, output a CSV of URLs and response codes.
# cat reads the file.
# xargs runs curl for each URL.
# -n 1 passes one URL per curl invocation so the requests can actually run in parallel.
# -P 10 runs up to 10 parallel processes.
# --user-agent matches Google's bot.
# --head requests only the response headers (HTTP HEAD).
# --write-out replaces the default output with our own format. Check out the available variables: https://ec.haxx.se/usingcurl-verbose.html#available---write-out-variables
# tee writes the piped content to a file while still printing it to the terminal.
cat urls.txt | xargs -n 1 -P 10 curl --user-agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" -o /dev/null --silent --head --write-out '%{url_effective};%{http_code};\n' | tee tested-urls.csv
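# A short follow-up sketch (assumes the tested-urls.csv produced above, with
# ';' as the field separator): keep only the URLs that did not return 200,
# i.e. the 404/301/302/500 responses this gist is meant to find.
awk -F';' '$2 != "200"' tested-urls.csv > problem-urls.csv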