Peter pszemraj

LLM Foundry Configuration Reference

This document provides a comprehensive reference for all configuration options available in LLM Foundry YAML files. Configuration files are used for training, fine-tuning, and evaluating large language models.

The Enshittification of Closed-Weight Frontier Models

Ruminations on Theory and Motivations

The Concept of Enshittification: Coined by Cory Doctorow, the term describes a pattern in which platforms first offer great value to attract users, then lock them in, and finally degrade the service for users while shifting value to business customers (advertisers, etc.) or, in this case, to the platform owner itself through reduced costs.
Applying it to Frontier AI Chatbots:
- Phase 1: Attract Users: Release a groundbreaking model (e.g., initial GPT-4, Claude 3 Opus). Offer free access or affordable subscriptions. Generate massive hype and positive press. Users are amazed by the capabilities (complex reasoning, creativity, coding).

- Phase 2: Lock-in Users: Users integrate the tool into their daily workflows, studies, or creative processes. They become accustomed to its abilities and interface. Subscription models create a direct financial lock-in.

	# -*- coding: utf-8 -*-
	"""gemma-3n-test

	pip install -U -q git+https://github.com/huggingface/transformers.git
	pip install -U -q git+https://github.com/huggingface/pytorch-image-models.git
	"""

	from transformers import pipeline
	import torch

	#!/usr/bin/env python3
	"""
	Slice a (possibly very tall) image into fixed-height chunks.

	Creates a sibling directory called <image stem>_slices/
	and writes slice_000.png, slice_001.png, … inside it.
	"""

	import argparse
	from pathlib import Path

	"""
	Create & save an hf dataset with train/test/val splits from dir w/ text files

	Ideal structure:

	root / section_name_1 / file 1
	root / section_name_1 / file 2
	root / section_name_1 / file YYY
	root / section_name_2 / file 1
	root / section_name_2 / file ZZZ

	#!/usr/bin/env python
	# -*- coding: utf-8 -*-
	"""
	Standalone Asynchronous Nanonets-OCR-s Inference Script using vLLM and PyMuPDF.

	This script processes PDF files from an input directory using the
	nanonets/Nanonets-OCR-s model served locally by vLLM via its OpenAI-compatible API.
	It renders each page, sends API requests concurrently for OCR, extracts the
	structured markdown/HTML text, and saves the combined text for each PDF into a
	corresponding .txt file in the specified output directory.

	from dataclasses import dataclass
	from typing import List, Optional, Tuple

	import torch
	import torch.nn as nn


	@dataclass
	class _LayerSummary:
	    """A dataclass to hold summary information for a single layer."""

	"""
	WaveNet: An Ultra-Small Language Model (PyTorch Implementation)

	Based on the paper: https://arxiv.org/abs/2411.02674
	Hugging Face Transformers compatible implementation.
	"""
	import math
	from typing import Dict, Optional, Tuple, Union

	import torch

	from typing import Dict, List, Optional

	import datasets

	# Optional: Keep the version print outside the function if desired
	# print(f"Using datasets library version: {datasets.__version__}")


	def create_interleaved_streaming_dataset(
	    dataset_path: str = "Zyphra/Zyda-2",

	#!/usr/bin/env python
	"""
	create_unified_mcqa.py – “batteries‑included” multiple‑choice aggregator
	✅ Handles all datasets listed in the conversation
	✅ Survives missing/renamed columns
	✅ Converts every `label` to pure int64 to avoid ClassLabel clashes
	✅ Explicitly casts features to ensure concatenation compatibility
	✅ Improved error handling and skipping for malformed examples
	✅ Limits warning/info messages per dataset
	✅ Fixes column mismatch error during cast
