Imagine you are processing 10,000 source files by faceting them into 500+ languages.
- You have 10,000 files of 2,000 IDs each, for a total of about 20M IDs.
- Each row contains one or more languages (the field is never null, so every ID is kept when you facet by language).
- Rare languages make up 0.01% of the IDs, while common ones like English account for around 1%.
- You partition your dataset into per-language subsets and hand each one to a reliable uploader CLI tool (see the partitioning sketch after this list).
- You get no information back about whether an upload succeeded.
- You can write any record you like to assist your own verification of the multipart upload (see the manifest sketch after this list).
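
Since a row can carry more than one language, the faceting step copies an ID into every subset its languages name. Below is a minimal sketch of that partitioning, assuming records arrive as dicts with `id` and `languages` fields; those field names are illustrative, not from the original pipeline:

```python
from collections import defaultdict

def partition_by_language(records):
    """Facet records into per-language ID lists.

    A record with N languages lands in N subsets, so the subset sizes
    sum to at least the total ID count.
    """
    subsets = defaultdict(list)
    for rec in records:
        # The languages field is never empty, so every ID is kept.
        for lang in rec["languages"]:
            subsets[lang].append(rec["id"])
    return subsets

# Example: two records, one bilingual, yields three subset entries.
demo = [
    {"id": "a1", "languages": ["en"]},
    {"id": "b2", "languages": ["en", "fr"]},
]
print(dict(partition_by_language(demo)))  # {'en': ['a1', 'b2'], 'fr': ['b2']}
```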
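
With no feedback from the uploader, the verification record from the last bullet can be a manifest line written before each upload: the language, the ID count, and a hash of exactly what was sent, so the remote copy can be checked later. A sketch, assuming a JSONL manifest and a hypothetical `uploader` CLI whose real interface may differ:

```python
import hashlib
import json
import subprocess

def upload_with_manifest(lang, ids, manifest_path="manifest.jsonl"):
    """Record what was sent, then fire off the fire-and-forget upload."""
    payload = "\n".join(sorted(ids)).encode("utf-8")
    record = {
        "language": lang,
        "id_count": len(ids),
        # Recompute this hash against the remote copy to verify the upload.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    # Append one JSON line per subset so the manifest survives partial runs.
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    # Hypothetical CLI invocation; the tool reports nothing back, so the
    # manifest written above is the only record of what went out.
    subprocess.run(["uploader", "--language", lang], input=payload, check=False)
```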