Created
June 2, 2024 19:17
Filter non-200 status codes on a daily basis
import os
import datetime

import pandas as pd

# Today's date in UTC, used to name the output file (datetime.UTC requires Python 3.11+).
today = datetime.datetime.now(datetime.UTC).strftime('%Y_%m_%d')

# Read every JSON lines file in the status codes folder into one DataFrame.
url_status_time = pd.concat(
    pd.read_json(f'/path/to/status_codes/{file}',
                 lines=True)
    for file in os.listdir('/path/to/status_codes'))

# Keep only the rows whose status is not 200 and save them to a dated CSV file.
(url_status_time
 [url_status_time['status'].ne(200)]
 [['url', 'status', 'crawl_time']]
 .to_csv(f'/path/to/non_200_codes/{today}.csv',
         index=False))
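
To see what the filter in the script above keeps, here is a minimal sketch that applies the same `status` filter to a tiny in-memory sample instead of files on disk. The sample URLs, statuses, and crawl times are made up for illustration; only the column names `url`, `status`, and `crawl_time` come from the script.

```python
import io

import pandas as pd

# Hypothetical crawl output in JSON lines format (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"url": "https://example.com/", "status": 200, "crawl_time": "2024-06-01 10:00:00"}\n'
    '{"url": "https://example.com/missing", "status": 404, "crawl_time": "2024-06-01 10:00:01"}\n'
    '{"url": "https://example.com/old", "status": 301, "crawl_time": "2024-06-01 10:00:02"}\n'
)

url_status_time = pd.read_json(sample_jsonl, lines=True)

# Same filter as the script: keep rows whose status is not equal to 200,
# then select the three columns of interest.
non_200 = (url_status_time
           [url_status_time['status'].ne(200)]
           [['url', 'status', 'crawl_time']])

# Print the CSV text instead of writing it to a dated file.
print(non_200.to_csv(index=False))
```

Only the 404 and 301 rows should remain; the 200 row is dropped.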
Create a daily cron job that runs after the previous script, for example at 00:30 every day.
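
As a rough illustration of the schedule, the sketch below formats a crontab entry for 00:30 every day. The helper `cron_line`, the interpreter path `/usr/bin/python3`, and the script name `filter_non_200.py` are assumptions made for this example; adjust them to your own setup.

```python
def cron_line(minute: int, hour: int, command: str) -> str:
    """Format a crontab entry that runs `command` once a day at hour:minute.

    Crontab fields are: minute, hour, day of month, month, day of week.
    """
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("minute must be 0-59 and hour must be 0-23")
    return f"{minute} {hour} * * * {command}"


# Hypothetical interpreter and script paths, used only for illustration.
entry = cron_line(30, 0, "/usr/bin/python3 /path/to/filter_non_200.py")
print(entry)
```

The printed line can be pasted into the file opened by `crontab -e` on the server.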
Synchronize the filtered files with your local machine using `rsync`. This will synchronize the files in the folder `/non_200_codes/` to a folder of your choosing on your local machine.
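
One possible way to script the synchronization from the local machine is to call `rsync` through Python's `subprocess` module. This is only a sketch: the remote `user@example-server`, both folder paths, and the helper names `build_rsync_command` and `sync` are hypothetical, and the `-a` (archive) and `-z` (compress) flags are a common choice rather than something the original specifies.

```python
import subprocess


def build_rsync_command(remote: str, remote_dir: str, local_dir: str) -> list[str]:
    """Build an rsync command that copies remote_dir from remote into local_dir.

    A trailing slash on remote_dir copies the folder's contents
    rather than the folder itself.
    """
    return ["rsync", "-az", f"{remote}:{remote_dir}", local_dir]


def sync(remote: str, remote_dir: str, local_dir: str) -> None:
    """Run rsync and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_rsync_command(remote, remote_dir, local_dir), check=True)


# Hypothetical remote host and paths, used only for illustration.
cmd = build_rsync_command(
    "user@example-server",
    "/path/to/non_200_codes/",
    "/path/to/local/non_200_codes/",
)
print(" ".join(cmd))
```

Calling `sync(...)` with the same arguments would run the command; it requires `rsync` to be installed locally and SSH access to the server.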