Peter pszemraj

@pszemraj
pszemraj / load_zyda2.py
Last active April 27, 2025 21:25
load zyda 2 with streaming
from typing import Dict, List, Optional
import datasets
# Optional: Keep the version print outside the function if desired
# print(f"Using datasets library version: {datasets.__version__}")
def create_interleaved_streaming_dataset(
dataset_path: str = "Zyphra/Zyda-2",
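The preview above is cut off before the function body, so the author's implementation is not visible here. As a rough illustration of what interleaving several streams means, the sketch below mixes plain Python iterables by sampling a source at each step. The names `interleave_streams`, `toy_streams`, `subset_a`, and `subset_b` are hypothetical, and this stdlib version only stands in for the idea; the `datasets` library offers its own `interleave_datasets` function with `probabilities` and `seed` arguments.

```python
import random
from typing import Dict, Iterable, Iterator, List


def interleave_streams(
    streams: Dict[str, Iterable[dict]],
    probabilities: Dict[str, float],
    seed: int = 42,
) -> Iterator[dict]:
    """Yield examples by sampling a source at each step until all are exhausted."""
    rng = random.Random(seed)
    iterators = {name: iter(stream) for name, stream in streams.items()}
    while iterators:
        names: List[str] = list(iterators)
        weights = [probabilities[name] for name in names]
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            example = next(iterators[name])
        except StopIteration:
            # Drop the exhausted source and keep sampling from the rest.
            del iterators[name]
            continue
        yield {"source": name, **example}


# Toy data standing in for streamed subsets (hypothetical names).
toy_streams = {
    "subset_a": [{"text": f"a{i}"} for i in range(3)],
    "subset_b": [{"text": f"b{i}"} for i in range(2)],
}
mixed = list(interleave_streams(toy_streams, {"subset_a": 0.7, "subset_b": 0.3}))
```

Because each source is consumed lazily through an iterator, nothing is loaded ahead of time, which is the property that makes streaming practical for very large corpora.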
@pszemraj
pszemraj / create_unified_mcqa.py
Created April 18, 2025 19:20
multiple‑choice dataset aggregator
#!/usr/bin/env python
"""
create_unified_mcqa.py – “batteries‑included” multiple‑choice aggregator
✅ Handles all datasets listed in the conversation
✅ Survives missing/renamed columns
✅ Converts every `label` to pure int64 to avoid ClassLabel clashes
✅ Explicitly casts features to ensure concatenation compatibility
✅ Improved error handling and skipping for malformed examples
✅ Limits warning/info messages per dataset
✅ Fixes column mismatch error during cast
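The feature list mentions converting every `label` to a plain integer and skipping malformed examples. The script's own code is not shown in this preview, so the following is only a minimal sketch of that idea. The helper name `normalize_label` is hypothetical, and the sketch assumes that numeric labels are already zero-based, which will not hold for every source dataset.

```python
from typing import Optional, Union


def normalize_label(label: Union[int, str, None], num_choices: int) -> Optional[int]:
    """Map a raw answer label to a zero-based int index, or None if malformed."""
    if label is None or isinstance(label, bool):
        return None
    if isinstance(label, int):
        index = label
    else:
        text = str(label).strip()
        if len(text) == 1 and text.isalpha():
            index = ord(text.upper()) - ord("A")  # "A" -> 0, "B" -> 1, ...
        elif text.isascii() and text.isdigit():
            index = int(text)  # assumption: digit labels are already zero-based
        else:
            return None
    # Reject anything that does not point at one of the available choices.
    return index if 0 <= index < num_choices else None
```

Returning `None` for unusable labels lets the caller drop the example and count it, instead of failing partway through a long aggregation run.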
@pszemraj
pszemraj / async_pipeline.py
Last active April 7, 2025 22:51
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Standalone Asynchronous RolmOCR Inference Script using vLLM and PyMuPDF.
This script processes PDF files from an input directory using the
reducto/RolmOCR model served locally by vLLM via its OpenAI-compatible API.
It renders each page, sends API requests concurrently for OCR, extracts plain
text, and saves the combined text for each PDF into a corresponding .txt file
in the specified output directory.
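The description above (concurrent requests per page, then combined text per PDF) can be sketched with `asyncio` alone. This is not the script's code: `fake_ocr_request` is a hypothetical stand-in for the real API call, page rendering is omitted, and the `max_concurrent` limit is an assumed parameter.

```python
import asyncio
from typing import List


async def fake_ocr_request(page_index: int) -> str:
    """Hypothetical stand-in for one OCR API call; returns dummy text."""
    await asyncio.sleep(0.01)
    return f"text of page {page_index}"


async def ocr_pages(num_pages: int, max_concurrent: int = 4) -> str:
    """OCR every page concurrently, capped at max_concurrent requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page_index: int) -> str:
        async with semaphore:
            return await fake_ocr_request(page_index)

    # gather() returns results in submission order, so page order is preserved.
    page_texts: List[str] = await asyncio.gather(
        *(bounded(i) for i in range(num_pages))
    )
    return "\n\n".join(page_texts)


combined_text = asyncio.run(ocr_pages(num_pages=5))
```

The semaphore keeps the number of in-flight requests bounded, so a long PDF does not flood the local server, while `gather` keeps the pages in their original order when the text is joined.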
@pszemraj
pszemraj / alternate_attn_report.md
Created April 4, 2025 14:39
deep research report by gpt-4.5

Alternate Attention Mechanisms for Sequence Modeling (2023–2025)

Transformer-style self-attention has been central to recent advances in language modeling, but its $\mathcal{O}(L^2)$ complexity (for sequence length $L$) motivates research into more efficient alternate attention mechanisms. This report surveys state-of-the-art methods from 2023–2025 that replace or augment standard self-attention in language sequence models. We organize methods by broad families – from linear approximations and sparsity-based variants to convolutional, state-space, and recurrent mechanisms – outlining each method’s motivation, technical formulation, empirical performance on language tasks, and efficiency characteristics.

Contents:

@pszemraj
pszemraj / fix_extensions.py
Created March 31, 2025 22:50
File Extension Fixer using Magika
#!/usr/bin/env python3
"""
File Extension Fixer using Magika
This script analyzes files using Google's Magika deep learning model to identify
their actual content types and fix incorrect file extensions.
pip install -U joblib magika tqdm
"""
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
git clone https://github.com/allenai/olmocr.git --depth 1
cd olmocr
pip install -q ninja
pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
# clean up
pip cache purge && sudo apt-get autoremove -y
@pszemraj
pszemraj / layernorm_scaling.py
Last active March 26, 2025 03:08
LayerNorm Scaling implementation to mitigate the Curse of Depth in LLMs.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class LayerNormScaling(nn.Module):
"""
LayerNorm Scaling implementation to mitigate the Curse of Depth in LLMs.
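The class body is truncated in this preview. As commonly described, LayerNorm Scaling multiplies the output of layer normalization by the inverse square root of the layer's depth. The sketch below shows that scaling on plain Python lists rather than tensors. It is an assumption-based illustration, not the gist's `nn.Module`: the function names are hypothetical, the learnable gain and bias are omitted, and `layer_index` is assumed to start at 1.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize a vector to zero mean and unit variance (no learnable affine)."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def scaled_layer_norm(x: List[float], layer_index: int) -> List[float]:
    """LayerNorm output multiplied by 1/sqrt(layer_index); layer_index starts at 1."""
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [v * scale for v in layer_norm(x)]


hidden = [0.5, -1.0, 2.0, 0.25]
first_layer = scaled_layer_norm(hidden, layer_index=1)
fourth_layer = scaled_layer_norm(hidden, layer_index=4)
```

With this scaling, the fourth layer's normalized output has half the magnitude of the first layer's, so deeper layers contribute progressively smaller normalized activations.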
| Model | Average CR ⬆️ | AGIEval Mean (Min, Max) | AGIEval CR | MMLU-Pro Mean (Min, Max) | MMLU-Pro CR | Math Mean (Min, Max) | Math CR | #Params (B) |
|---|---|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-70B-Instruct | 72.39 | 72.43, (65.34, 74.66) | 81.79 | 66.63, (55.16, 70.68) | 73.19 | 65.88, (64.58, 67.86) | 62.18 | 0 |
| mistralai/Mistral-Large-Instruct-2407 | 71.93 | 68.78, (61.41, 74.49) | 75.77 | 65.1, (50.28, 69.23) | 72.31 | 71.04, (69.66, 72.72) | 67.71 | 0 |
| meta-llama/Meta-Llama-3-70B-Instruct | 69.11 | 69.71, (60.77, 71.2) | 83.13 | 58.75, (49.3, 63.16) | 75.24 | 51.29, (49.66, 54.2) | 48.96 | 0 |
| 01-ai/Yi-1.5-34B-Chat | 58.43 | 63.89 | | | | | | |
@pszemraj
pszemraj / tensorboard_inspect.py
Last active March 11, 2025 00:33
CLI utility to quickly inspect the latest scalar values from TensorBoard logs.
#!/usr/bin/env python
"""
CLI utility to quickly inspect the latest scalar values from TensorBoard logs.
Dependencies:
pip install tbparse pandas fire tqdm
Usage:
python tensorboard_inspect.py --logdir ./path/to/logs
"""