Skip to content

Instantly share code, notes, and snippets.

@gengwg
Created May 7, 2025 03:09
Show Gist options
  • Save gengwg/74e0331acfd810d0ad12b1baef17cff9 to your computer and use it in GitHub Desktop.
Save gengwg/74e0331acfd810d0ad12b1baef17cff9 to your computer and use it in GitHub Desktop.
notes for the **System & Reliability @scale Conference**, structured for clarity and impact:

Core Principles for Scaling Systems

  1. Embrace Failure as Inevitable

    • Design for graceful degradation, automatic recovery, and chaos testing.
    • Example: "Like brushing your teeth" – build resilience into daily workflows.
  2. Simplify to Scale

    • Zoom out: Prioritize high-impact components.
    • Declutter services: Eliminate redundant systems to reduce failure points.
  3. Co-Design Physics and Software

    • Data centers are where hardware constraints (physics) meet software optimization.
    • Optimize for energy efficiency, cooling, and hardware-software synergy.

Key Challenges in Scaling

  1. Distributed Training

    • Checkpointing limits: Frequent saves risk I/O bottlenecks; explore incremental or differential strategies.
    • SPMD limitations: Move beyond Single Program Multiple Data with hybrid parallelism (e.g., pipeline + tensor sharding).
    • Asynchronous training: Balance staleness vs. throughput tradeoffs.
  2. Distributed Inference

    • Hockey stick demand: Design for sudden traffic spikes (e.g., global smart routing).
    • Cost optimization: Smaller models for high-scale tasks; leverage quantization, pruning.
    • Latency vs. reliability: Ensure end-to-end (E2E) SLAs with fault-tolerant pipelines.
  3. Reasoning Models at Scale

    • Reduce hallucinations: Improve grounding via retrieval-augmented generation (RAG) or constrained decoding.

Case Studies & Strategies

  1. Scaling Reinforcement Learning (RL)

    • Decouple training and inference workloads dynamically (e.g., move work to idle clusters).
  2. Twine Gang Scheduling

    • Benefits:
      • Reliability: Atomic task orchestration.
      • Efficiency: Optimize resource packing.
      • Scalability: Unified API for heterogenous workloads.
  3. Microservices at Scale

    • Modularize systems (e.g., "1 MV ≈ 1000 homes" – decompose monoliths into domain-specific units).

Future Directions

  1. Hardware-Software Co-Innovation

    • Maximize new architectures (e.g., TPU v6, Grace Hopper GPUs) for distributed workflows.
  2. Traffic-Aware Routing

    • Global smart routing: Go beyond ping (consider cost, carbon footprint, latency).
  3. Shift-Left for Inference

    • Design systems with inference constraints upfront (memory, latency, cost).

Conclusion

Scaling systems demands proactive design for failure, ruthless simplification, and co-evolution of hardware, software, and operational practices. Focus on traction – iterative wins that compound reliability and efficiency at scale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment