This document provides a comprehensive reference for all configuration options available in LLM Foundry YAML files. Configuration files are used for training, fine-tuning, and evaluating large language models.
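Before the option-by-option reference, a rough sense of the overall shape helps. The snippet below is an illustrative sketch only: it builds a handful of commonly seen top-level keys with OmegaConf (the library LLM Foundry uses to load YAML). The key names and values are examples, not an exhaustive or authoritative schema.

# Illustrative sketch: a few representative top-level keys, not a complete schema.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "max_seq_len": 2048,
    "global_train_batch_size": 256,
    "model": {"name": "mpt_causal_lm", "d_model": 768, "n_layers": 12},  # example values
    "tokenizer": {"name": "EleutherAI/gpt-neox-20b"},
    "optimizer": {"name": "decoupled_adamw", "lr": 6.0e-4},
    "max_duration": "10ba",  # Composer-style duration string (example)
})
print(OmegaConf.to_yaml(cfg))  # renders the equivalent YAML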
# -*- coding: utf-8 -*-
"""gemma-3n-test
pip install -U -q git+https://github.com/huggingface/transformers.git
pip install -U -q git+https://github.com/huggingface/pytorch-image-models.git
"""
from transformers import pipeline
import torch
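The preview stops at the imports; a minimal continuation might look like the following. The task name, checkpoint id, and message format are assumptions based on the multimodal pipeline API, so check the model card before relying on them.

# Assumed checkpoint id and task; verify against the model card.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E2B-it",  # assumption: one of the Gemma 3n instruction-tuned checkpoints
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"])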
#!/usr/bin/env python3
"""
Slice a (possibly very tall) image into fixed-height chunks.
Creates a sibling directory called <image stem>_slices/
and writes slice_000.png, slice_001.png, … inside it.
"""
import argparse
from pathlib import Path
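The core of such a script can be small. The helper below is a sketch under the assumption that Pillow does the cropping (the preview only shows argparse and pathlib); the function name is hypothetical.

from PIL import Image  # assumption: Pillow handles the actual image I/O

def slice_image(image_path: Path, slice_height: int = 1024) -> Path:
    """Write fixed-height strips of image_path into <stem>_slices/ (hypothetical helper)."""
    img = Image.open(image_path)
    out_dir = image_path.parent / f"{image_path.stem}_slices"
    out_dir.mkdir(exist_ok=True)
    for i, top in enumerate(range(0, img.height, slice_height)):
        box = (0, top, img.width, min(top + slice_height, img.height))  # left, upper, right, lower
        img.crop(box).save(out_dir / f"slice_{i:03d}.png")
    return out_dir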
""" | |
Create & save an hf dataset with train/test/val splits from dir w/ text files | |
Ideal structure: | |
root / section_name_1 / file 1 | |
root / section_name_1 / file 2 | |
root / section_name_1 / file YYY | |
root / section_name_2 / file 1 | |
root / section_name_2 / file ZZZ |
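A compact way to realize that layout with the datasets library is sketched below; the split ratios and helper name are illustrative, not taken from the original script.

from pathlib import Path
from datasets import Dataset, DatasetDict  # assumption: built with the `datasets` library

def build_text_dataset(root: str, seed: int = 42) -> DatasetDict:
    # One record per text file, tagged with its section (parent directory) name.
    records = [
        {"section": p.parent.name, "text": p.read_text(encoding="utf-8")}
        for p in Path(root).rglob("*.txt")
    ]
    ds = Dataset.from_list(records)
    train_rest = ds.train_test_split(test_size=0.2, seed=seed)                    # 80% train
    val_test = train_rest["test"].train_test_split(test_size=0.5, seed=seed)      # 10% val / 10% test
    return DatasetDict(train=train_rest["train"], validation=val_test["train"], test=val_test["test"])

# build_text_dataset("root").save_to_disk("my_text_dataset")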
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Standalone Asynchronous Nanonets-OCR-s Inference Script using vLLM and PyMuPDF.
This script processes PDF files from an input directory using the
nanonets/Nanonets-OCR-s model served locally by vLLM via its OpenAI-compatible API.
It renders each page, sends API requests concurrently for OCR, extracts the
structured markdown/HTML text, and saves the combined text for each PDF into a
corresponding .txt file in the specified output directory.
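Condensed, the described flow is roughly the following. Everything here is a sketch: the endpoint, DPI, prompt text, and function names are placeholders, and the real script adds concurrency limits, retries, and error handling.

import asyncio
import base64
import fitz  # PyMuPDF
from openai import AsyncOpenAI

# vLLM's OpenAI-compatible server; the URL and key are placeholders.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def ocr_page(page) -> str:
    png = page.get_pixmap(dpi=200).tobytes("png")  # render the page to PNG bytes
    data_uri = "data:image/png;base64," + base64.b64encode(png).decode()
    resp = await client.chat.completions.create(
        model="nanonets/Nanonets-OCR-s",
        messages=[{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_uri}},
            {"type": "text", "text": "Return the page content as structured markdown."},  # placeholder prompt
        ]}],
    )
    return resp.choices[0].message.content

async def ocr_pdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    pages = await asyncio.gather(*(ocr_page(p) for p in doc))  # one concurrent request per page
    return "\n\n".join(pages)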
from dataclasses import dataclass
from typing import List, Optional, Tuple
import torch
import torch.nn as nn

@dataclass
class _LayerSummary:
    """A dataclass to hold summary information for a single layer."""
""" | |
WaveNet: An Ultra-Small Language Model (PyTorch Implementation) | |
Based on the paper: https://arxiv.org/abs/2411.02674 | |
Hugging Face Transformers compatible implementation. | |
""" | |
import math | |
from typing import Dict, Optional, Tuple, Union | |
import torch |
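For context on what "Transformers compatible" usually entails, here is generic scaffolding only: a PretrainedConfig/PreTrainedModel pair with placeholder names and a trivial forward pass. It is not the WaveNet architecture from the paper.

import torch
import torch.nn as nn
from transformers import PretrainedConfig, PreTrainedModel

class TinyLMConfig(PretrainedConfig):
    model_type = "tiny_lm_example"  # hypothetical identifier

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 256, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

class TinyLMModel(PreTrainedModel):
    config_class = TinyLMConfig

    def __init__(self, config: TinyLMConfig):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids: torch.LongTensor, **kwargs):
        hidden = self.embed(input_ids)  # a real model would mix token information here
        return self.lm_head(hidden)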
Ruminations on Theory and Motivations
- The Concept of Enshittification: Coined by Cory Doctorow, it describes the pattern where platforms initially offer great value to attract users, then lock them in, and finally extract value by degrading the service for users while increasing value extraction for business customers (advertisers, etc.) or, in this case, for the platform owner itself by reducing costs.
- Applying it to Frontier AI Chatbots:
  - Phase 1: Attract Users: Release a groundbreaking model (e.g., initial GPT-4, Claude 3 Opus). Offer free access or affordable subscriptions. Generate massive hype and positive press. Users are amazed by the capabilities (complex reasoning, creativity, coding).
  - Phase 2: Lock-in Users: Users integrate the tool into their daily workflows, studies, or creative processes. They become accustomed to its abilities and interface. Subscription models create a direct financial lock-in.
from typing import Dict, List, Optional
import datasets

# Optional: Keep the version print outside the function if desired
# print(f"Using datasets library version: {datasets.__version__}")
def create_interleaved_streaming_dataset(
    dataset_path: str = "Zyphra/Zyda-2",
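The body presumably combines several streaming subsets; a sketch of that pattern is below. The subset names passed to name= are placeholders (check the Zyda-2 dataset card for the real config names), and the 50/50 mixing weights are arbitrary.

from datasets import load_dataset, interleave_datasets

def interleaved_stream_sketch(seed: int = 42):
    # Placeholder subset names; replace with the dataset's actual configs.
    a = load_dataset("Zyphra/Zyda-2", name="subset_a", split="train", streaming=True)
    b = load_dataset("Zyphra/Zyda-2", name="subset_b", split="train", streaming=True)
    return interleave_datasets([a, b], probabilities=[0.5, 0.5], seed=seed)

# for example in interleaved_stream_sketch().take(3):
#     print(example)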
#!/usr/bin/env python
"""
create_unified_mcqa.py – “batteries‑included” multiple‑choice aggregator
✅ Handles all datasets listed in the conversation
✅ Survives missing/renamed columns
✅ Converts every `label` to pure int64 to avoid ClassLabel clashes
✅ Explicitly casts features to ensure concatenation compatibility
✅ Improved error handling and skipping for malformed examples
✅ Limits warning/info messages per dataset
✅ Fixes column mismatch error during cast
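The mechanics behind those bullet points reduce to mapping every source onto one schema and casting before concatenation. The sketch below is illustrative; the column names and helper signature are not taken from the real script.

from datasets import Dataset, Features, Sequence, Value, concatenate_datasets

UNIFIED = Features({
    "question": Value("string"),
    "choices": Sequence(Value("string")),
    "label": Value("int64"),  # plain int64, so differing ClassLabel sets cannot clash
})

def normalize(ds: Dataset, question_col: str, choices_col: str, label_col: str) -> Dataset:
    def to_unified(row):
        return {
            "question": str(row[question_col]),
            "choices": [str(c) for c in row[choices_col]],
            "label": int(row[label_col]),
        }
    # Dropping the original columns avoids the column-mismatch errors mentioned above.
    return ds.map(to_unified, remove_columns=ds.column_names).cast(UNIFIED)

# unified = concatenate_datasets([normalize(ds, *cols) for ds, cols in prepared_sources])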