underwood thunderpoot

⚡

💨

Principal Technologist at the Common Crawl Foundation

66 followers · 91 following

@commoncrawl
London, United Kingdom
https://underwood.network
https://lpx.org.uk


thunderpoot / create_cc_index_table.sql

Created November 12, 2024 19:37

SQL script to create an external table in Amazon Athena (or a similar query engine). Contains the schema for Common Crawl's columnar index

	CREATE EXTERNAL TABLE IF NOT EXISTS commoncrawl_index -- let’s create a new table with the following columns:
	(
	    url_surtkey STRING, -- Sort-friendly URI Reordering Transform
	    url STRING, -- the URL (duh) including protocol (http or https)
	    url_host_name STRING, -- the hostname, including subdomain(s)
	    url_host_tld STRING, -- the top-level domain such as `.org`
	    url_host_registered_domain STRING, -- the registered domain name
	    url_host_private_domain STRING, -- private domain such as `example.com`
	    url_host_public_suffix STRING, -- public suffix of the domain such as `.co.uk` or `.edu`
	    url_protocol STRING, -- the transfer protocol used (http or https)
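The `url_surtkey` column holds a SURT key, which reverses the host labels so that URLs from the same domain sort next to each other. The following is a simplified sketch of that idea, not the canonicalization Common Crawl actually applies: the function name `simple_surt` is hypothetical, and dropping a leading `www.` is an assumption borrowed from common SURT implementations.

```python
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Build a simplified SURT-style key: reversed host labels, then path and query."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: strip a leading "www." so www and non-www hosts share a key.
    if host.startswith("www."):
        host = host[4:]
    # Reverse the labels: "blog.example.co.uk" -> "uk,co,example,blog"
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


if __name__ == "__main__":
    for u in ("https://www.example.com/a/b?x=1", "http://blog.example.co.uk"):
        print(simple_surt(u))
```

Sorting on such a key groups every page of a domain together, ahead of its subdomains, which is what makes prefix queries on the index cheap.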

thunderpoot / fetch_subdomains.sh

Created November 6, 2024 13:37

Shell script using curl and jq to retrieve all subdomains for a given domain from a given Common Crawl index

	#!/bin/bash

	# Shell script using curl and jq to retrieve all subdomains for a given domain
	# from Common Crawl's most recent index or a specified crawl ID. This script
	# dynamically retrieves the latest crawl ID if none is provided, fetches data
	# (across multiple pages if necessary), retries failed requests, and extracts
	# unique subdomains.

	# Usage:
	# bash fetch_subdomains.sh <domain> [crawl_id]
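The extraction step described above can be sketched in Python instead of jq. This is only an illustration under assumptions: it expects index responses as one JSON object per line with a `url` field, and the helper name `unique_subdomains` is hypothetical. It makes no network requests.

```python
import json
from urllib.parse import urlsplit


def unique_subdomains(index_lines, domain):
    """Return the sorted, de-duplicated hostnames under `domain` found in JSON lines."""
    hosts = set()
    for line in index_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the whole run
        if not isinstance(record, dict):
            continue
        host = urlsplit(record.get("url", "")).hostname
        # Keep the domain itself and anything ending in ".<domain>"
        if host and (host == domain or host.endswith("." + domain)):
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    sample = [
        '{"url": "https://a.example.com/x"}',
        '{"url": "http://b.example.com/"}',
        '{"url": "https://a.example.com/y"}',
    ]
    print(unique_subdomains(sample, "example.com"))
```

Matching on `"." + domain` rather than a bare suffix keeps unrelated hosts such as `notexample.com` out of the result.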

thunderpoot / cc-get-page.sh

Created October 23, 2024 20:11

A shell script to retrieve a single HTML page from a Common Crawl archive

	#!/bin/bash

	# This script retrieves WARC (Web ARChive) data from Common Crawl based on a specified URL.
	# It fetches the metadata for the URL, downloads the relevant segment of the WARC file, and extracts the HTML content.
	# The script can also fetch the latest crawl data from Common Crawl's collection info.
	# It uses Python's warcio library to extract HTML content and can open the result in the user's default browser.

	# Usage: ./script.sh [URL] [optional: crawl name]
	# If no crawl name is provided, the latest crawl is automatically selected.
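Downloading only "the relevant segment of the WARC file" comes down to an HTTP range request built from the record's offset and length. The sketch below shows that arithmetic in Python; the function name `warc_range_request` is hypothetical, and it assumes the index record supplies `filename`, `offset` and `length` (often as strings) and that files are served from `https://data.commoncrawl.org/`.

```python
WARC_BASE_URL = "https://data.commoncrawl.org/"  # assumed public download prefix


def warc_range_request(filename, offset, length):
    """Return (url, headers) for fetching one WARC record by byte range."""
    offset = int(offset)  # index records often carry these as strings
    length = int(length)
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return WARC_BASE_URL + filename, headers


if __name__ == "__main__":
    url, headers = warc_range_request("crawl-data/example.warc.gz", "100", "50")
    print(url, headers)
```

The returned pair can be passed to any HTTP client; the response body is a single gzip member that a WARC parser such as warcio can read directly.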

thunderpoot / vinyl.py

Created July 11, 2024 22:37

Simple Python program to simulate playing a track at a different speed on a turntable

	#!/usr/bin/env python3

	# vinyl.py (ASCII-art banner)

	# This command-line program allows you to change the playback speed of an
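Playing a record at the wrong speed changes tempo and pitch together, which can be modelled as plain resampling. The sketch below is not the gist's implementation: the helper names are hypothetical, and linear interpolation over a list of samples is an assumption chosen to keep the example dependency-free.

```python
import math


def speed_ratio(recorded_rpm: float, played_rpm: float) -> float:
    """How much faster (>1) or slower (<1) the record plays."""
    return played_rpm / recorded_rpm


def pitch_shift_semitones(ratio: float) -> float:
    """Pitch change implied by a speed ratio; doubling the speed is +12 semitones."""
    return 12 * math.log2(ratio)


def resample(samples, ratio):
    """Read through `samples` at `ratio` times normal speed using linear interpolation."""
    n_out = int(len(samples) / ratio)
    out = []
    for i in range(n_out):
        pos = i * ratio                      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


if __name__ == "__main__":
    ratio = speed_ratio(100 / 3, 45)  # a 33 1/3 rpm record played at 45 rpm
    print(ratio, pitch_shift_semitones(ratio))
```

Writing the resampled data back at the original sample rate yields audio that is both shorter and higher in pitch, as on a real turntable.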

thunderpoot / describe_parquet.py

Last active March 2, 2025 12:45

Parquet Examples

	import os
	import pyarrow.parquet as pq

	def describe_parquet(file_path):
	    file_size = os.path.getsize(file_path)
	    print(f"File Size: {file_size} bytes")

	    table = pq.read_table(file_path)
	    columns = table.column_names

thunderpoot / ghostbuster

Created January 30, 2024 22:39

Mosh: You have N detached Mosh sessions on this server

	#!/bin/bash

	# You know that really annoying message that pops up...

	# Mosh: You have 3 detached Mosh sessions on this server, with PIDs:
	# - mosh [2294539]
	# - mosh [1874313]
	# - mosh [2294805]

	# I often find myself copying this list of PIDs in order to kill them manually
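The manual copying step can be avoided by pulling the PIDs out of the message text. The following Python sketch shows one way to do that, assuming the message format quoted above; the function name `mosh_pids` is hypothetical and this is not the gist's own shell code.

```python
import re


def mosh_pids(message: str) -> list[int]:
    """Extract PIDs from lines of the form '- mosh [12345]'."""
    return [int(pid) for pid in re.findall(r"mosh \[(\d+)\]", message)]


if __name__ == "__main__":
    text = (
        "Mosh: You have 3 detached Mosh sessions on this server, with PIDs:\n"
        "- mosh [2294539]\n"
        "- mosh [1874313]\n"
        "- mosh [2294805]\n"
    )
    print(mosh_pids(text))
```

Each PID could then be passed to `os.kill(pid, signal.SIGTERM)`, ideally after checking that the process really is a detached mosh server.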

thunderpoot / cc_fetch_page.py

Last active November 8, 2024 22:33

An example of fetching a page from Common Crawl using the Common Crawl Index

	import requests
	import json

	# For parsing URLs:
	from urllib.parse import quote_plus

	# For parsing WARC records:
	from warcio.archiveiterator import ArchiveIterator

	# The URL of the Common Crawl Index server
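Looking a page up in the index starts with a query URL for one crawl. The sketch below builds that URL with `quote_plus`, as imported above. The server address `https://index.commoncrawl.org/`, the `-index` suffix and the helper name `build_index_query` are assumptions for illustration, and the crawl ID is only an example value.

```python
from urllib.parse import quote_plus

INDEX_SERVER = "https://index.commoncrawl.org/"  # assumed index server address


def build_index_query(crawl_id: str, target_url: str) -> str:
    """Return the index query URL for `target_url` within one crawl."""
    # quote_plus percent-encodes "/" and ":" so the target fits in a query string.
    encoded = quote_plus(target_url)
    return f"{INDEX_SERVER}{crawl_id}-index?url={encoded}&output=json"


if __name__ == "__main__":
    # "CC-MAIN-2024-33" is an example crawl ID, not necessarily the latest one.
    print(build_index_query("CC-MAIN-2024-33", "commoncrawl.org/faq"))
```

With `output=json` the server is expected to answer with one JSON record per line, each carrying the WARC filename, offset and length needed to fetch the page.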

thunderpoot / findwords.pl

Last active March 2, 2025 12:45

Simple Perl script to find words containing only letters provided as argument

	#!/usr/bin/env perl

	# This script is useful when proofing with only some glyphs completed…
	# Usage example:

	# $ perl findwords.pl qwertyasdf
	# Searching for words in /usr/share/dict/words containing only q, w, e, r, t, y, a, s, d, f
	# westerwards
	# afterstate
	# aftertaste
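The filter behind this output is a subset test: a word qualifies when every one of its letters is in the allowed set. A minimal Python sketch of the same idea follows; it is not a port of the Perl script, the function name `words_using_only` is hypothetical, and lowercasing the words before comparison is an assumption.

```python
def words_using_only(letters, words):
    """Return the words whose characters all come from `letters`."""
    allowed = set(letters.lower())
    # A word passes when its set of characters is a subset of the allowed set.
    return [w for w in words if w and set(w.lower()) <= allowed]


if __name__ == "__main__":
    candidates = ["westerwards", "afterstate", "aftertaste", "hello"]
    print(words_using_only("qwertyasdf", candidates))
```

To scan a system word list, pass the stripped lines of a dictionary file such as `/usr/share/dict/words` as `words`.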

thunderpoot / gifserver

Created January 29, 2021 20:29

A small telnet GIF server using gif-for-cli and netcat

	#!/bin/bash

	if [ $# -eq 0 ]
	then
	    echo "%usage: $0 <id> [options]"
	    exit
	fi

	echo "[$0] 🌶 Now servin' up hot GIFs!"

thunderpoot / skipping_stones.py

Created November 2, 2020 17:46

...ported from Perl version

	#!/usr/bin/python

	import random
	import time
	import sys

	class Unbuffered( object ) :
	    def __init__( self, stream ) :
	        self.stream = stream
	    def write( self, data ) :
