underwood thunderpoot

@thunderpoot
thunderpoot / create_cc_index_table.sql
Created November 12, 2024 19:37
SQL script that creates an external table in Amazon Athena (or a similar query engine). Contains the schema for Common Crawl's columnar index
CREATE EXTERNAL TABLE IF NOT EXISTS commoncrawl_index -- let’s create a new table with the following columns:
(
  url_surtkey                 STRING, -- Sort-friendly URI Reordering Transform
  url                         STRING, -- the URL (duh) including protocol (http or https)
  url_host_name               STRING, -- the hostname, including subdomain(s)
  url_host_tld                STRING, -- the top-level domain such as `.org`
  url_host_registered_domain  STRING, -- the registered domain name
  url_host_private_domain     STRING, -- private domain such as `example.com`
  url_host_public_suffix      STRING, -- public suffix of the domain such as `.co.uk` or `.edu`
  url_protocol                STRING, -- the transfer protocol used (http or https)
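The `url_surtkey` column holds a Sort-friendly URI Reordering Transform of the URL. As a rough illustration of the idea only, the sketch below reverses the host labels and appends the path. The function name `simple_surt` is hypothetical, and the canonicalization that produces the real index keys does more than this, so treat it as an approximation rather than the index's exact algorithm.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Build a simplified SURT-style key (hypothetical helper, not the real canonicalizer)."""
    parts = urlsplit(url)
    # urlsplit().hostname is already lowercased; fall back to "" if missing.
    host = parts.hostname or ""
    # Reverse the host labels so keys sort by domain first, then subdomain.
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return reversed_host + ")" + path + query


if __name__ == "__main__":
    print(simple_surt("https://commoncrawl.org/blog"))  # org,commoncrawl)/blog
```

Because the most significant part of the host comes first, sorting on such a key groups all pages of a domain and its subdomains together.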
thunderpoot / fetch_subdomains.sh
Created November 6, 2024 13:37
Shell script using curl and jq to retrieve all subdomains for a given domain from a given Common Crawl index
#!/bin/bash
# Shell script using curl and jq to retrieve all subdomains for a given domain
# from Common Crawl's most recent index or a specified crawl ID. This script
# dynamically retrieves the latest crawl ID if none is provided, fetches data
# (across multiple pages if necessary), retries failed requests, and extracts
# unique subdomains.
# Usage:
# bash fetch_subdomains.sh <domain> [crawl_id]
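The final step described above, extracting unique subdomains, might look roughly like this in Python. It is a sketch under assumptions: the index response is taken to be one JSON object per line with a `url` field, and `unique_subdomains` is a hypothetical name, not something from the script itself. Fetching, paging, and retries are left out.

```python
import json
from urllib.parse import urlsplit


def unique_subdomains(json_lines, domain):
    """Return the sorted, de-duplicated hostnames under `domain`.

    Assumes each non-empty line is a JSON object with a "url" field.
    """
    hosts = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        host = urlsplit(record.get("url", "")).hostname
        # Keep the domain itself and anything ending in ".<domain>".
        if host and (host == domain or host.endswith("." + domain)):
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    sample = [
        '{"url": "https://www.example.com/"}',
        '{"url": "https://blog.example.com/post"}',
        '{"url": "https://www.example.com/about"}',
    ]
    print(unique_subdomains(sample, "example.com"))
```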
thunderpoot / cc-get-page.sh
Created October 23, 2024 20:11
A shell script to retrieve a single HTML page from a Common Crawl archive
#!/bin/bash
# This script retrieves WARC (Web ARChive) data from Common Crawl based on a specified URL.
# It fetches the metadata for the URL, downloads the relevant segment of the WARC file, and extracts the HTML content.
# The script can also fetch the latest crawl data from Common Crawl's collection info.
# It uses Python's warcio library to extract HTML content and can open the result in the user's default browser.
# Usage: ./cc-get-page.sh [URL] [optional: crawl name]
# If no crawl name is provided, the latest crawl is automatically selected.
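Downloading only "the relevant segment of the WARC file" is typically done with an HTTP Range request. The Python sketch below shows how such a header could be built, assuming the index metadata supplies a byte offset and a record length; the helper name `warc_range_header` is hypothetical and is not taken from the script.

```python
def warc_range_header(offset, length):
    """Build an HTTP Range header covering `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive at both ends, hence the "- 1".
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    end = offset + length - 1
    return {"Range": "bytes={}-{}".format(offset, end)}


if __name__ == "__main__":
    # Example values only; real ones would come from the index record.
    print(warc_range_header(100, 50))  # {'Range': 'bytes=100-149'}
```

The resulting dictionary can be passed as request headers to whichever HTTP client is in use.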
thunderpoot / vinyl.py
Created July 11, 2024 22:37
Simple Python program to simulate playing a track at a different speed on a turntable
#!/usr/bin/env python3
# _ _
# __ __ (_) _ __ _ _ | | _ __ _ _
# \ \ / / | | | '_ \ | | | | | | | '_ \ | | | |
# \ V / | | | | | | | |_| | | | _ | |_) | | |_| |
# \_/ |_| |_| |_| \__, | |_| (_) | .__/ \__, |
# |___/ |_| |___/
# This command-line program allows you to change the playback speed of an
thunderpoot / describe_parquet.py
Last active March 2, 2025 12:45
Parquet Examples
import os
import pyarrow.parquet as pq
def describe_parquet(file_path):
    file_size = os.path.getsize(file_path)
    print(f"File Size: {file_size} bytes")
    table = pq.read_table(file_path)
    columns = table.column_names
thunderpoot / ghostbuster
Created January 30, 2024 22:39
Mosh: You have N detached Mosh sessions on this server
#!/bin/bash
# You know that really annoying message that pops up...
# Mosh: You have 3 detached Mosh sessions on this server, with PIDs:
# - mosh [2294539]
# - mosh [1874313]
# - mosh [2294805]
# I often find myself copying this list of PIDs in order to kill them manually
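For illustration, pulling the PIDs out of a message in the format shown above could be sketched in Python as follows. This is not the script's own approach (which is a shell script); `mosh_pids` is a hypothetical helper, and the regular expression assumes the `mosh [PID]` layout quoted in the comment.

```python
import re


def mosh_pids(message):
    """Extract PIDs from lines of the form '- mosh [1234]'."""
    return [int(pid) for pid in re.findall(r"mosh \[(\d+)\]", message)]


if __name__ == "__main__":
    notice = (
        "Mosh: You have 3 detached Mosh sessions on this server, with PIDs:\n"
        "- mosh [2294539]\n"
        "- mosh [1874313]\n"
        "- mosh [2294805]\n"
    )
    print(mosh_pids(notice))  # [2294539, 1874313, 2294805]
```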
thunderpoot / cc_fetch_page.py
Last active November 8, 2024 22:33
An example of fetching a page from Common Crawl using the Common Crawl Index
import requests
import json
# For URL-encoding the target URL:
from urllib.parse import quote_plus
# For parsing WARC records:
from warcio.archiveiterator import ArchiveIterator
# The URL of the Common Crawl Index server
thunderpoot / findwords.pl
Last active March 2, 2025 12:45
Simple Perl script to find words containing only letters provided as argument
#!/usr/bin/env perl
# This script is useful when proofing with only some glyphs completed…
# Usage example:
# $ perl findwords.pl qwertyasdf
# Searching for words in /usr/share/dict/words containing only q, w, e, r, t, y, a, s, d, f
# westerwards
# afterstate
# aftertaste
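The same filter, keeping only words built entirely from the supplied letters, can be sketched in Python. The function name `words_from_letters` is hypothetical, and the word list is passed in directly rather than read from `/usr/share/dict/words`, so the example runs anywhere.

```python
def words_from_letters(words, letters):
    """Return the words whose characters all appear in `letters` (case-insensitive)."""
    allowed = set(letters.lower())
    # A word qualifies when its set of letters is a subset of the allowed set.
    return [w for w in words if w and set(w.lower()) <= allowed]


if __name__ == "__main__":
    candidates = ["aftertaste", "hello", "westerwards", "afterstate"]
    print(words_from_letters(candidates, "qwertyasdf"))
```

A set-subset test keeps the check independent of word length and letter repetition, which matches the "containing only" wording in the usage example.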
thunderpoot / gifserver
Created January 29, 2021 20:29
a small telnet gif server using gif-for-cli and netcat
#!/bin/bash
if [ $# -eq 0 ]
then
    echo "%usage: $0 <id> [options]"
    exit
fi
echo "[$0] 🌶 Now servin' up hot GIFs!"
thunderpoot / skipping_stones.py
Created November 2, 2020 17:46
...ported from Perl version
#!/usr/bin/python
import random
import time
import sys
class Unbuffered( object ) :
    def __init__( self, stream ) :
        self.stream = stream
    def write( self, data ) :