@sallyom
Created July 31, 2025 15:42

# llm-d Observability: PromQL Queries and Trace Spans

Based on your observability needs, here's a comprehensive mapping of PromQL queries and complementary trace spans:

## Tier 1: Immediate Failure & Saturation Indicators

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| Overall Error Rate (Platform-wide) | `sum(rate(inference_model_request_error_total[5m])) / sum(rate(inference_model_request_total[5m]))` | `gateway.request` with error status codes and error messages |
| Per-Model Error Rate | `sum by(model) (rate(inference_model_request_error_total[5m])) / sum by(model) (rate(inference_model_request_total[5m]))` | `gateway.request` with `gen_ai.request.model` attribute |
| Request Preemptions (per vLLM instance) | `sum by(pod, instance) (rate(vllm:num_preemptions_total[5m]))` | `vllm.request.preemption` with reason and KV cache state |
| Overall Latency P90/P99 | `histogram_quantile(0.99, sum by(le) (rate(inference_model_request_duration_seconds_bucket[5m])))` | `gateway.request` → `vllm.inference` full trace |
| Model-Specific TTFT P99 | `histogram_quantile(0.99, sum by(model_name, le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))` | `vllm.prefill` span duration |
| Model-Specific TPOT P99 | `histogram_quantile(0.99, sum by(model_name, le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))` | `vllm.decode.step` per-token spans |
| Scheduler Health | `up{job="inference-scheduler"} * on(pod) group_left() (1 - rate(container_restarts_total{container="scheduler"}[5m]))` | `epp.scheduler.health_check` periodic spans |
| GPU Utilization | `avg by(gpu, node) (DCGM_FI_DEV_GPU_UTIL or nvidia_gpu_duty_cycle)` | N/A - hardware metric |

Note: `histogram_quantile` requires the `le` label, so bucket aggregations must preserve it (`sum by(le)` rather than a bare `sum`).
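When wiring these queries into dashboards or alert generators, it helps to build them programmatically rather than copy-pasting variants. The sketch below is illustrative: the metric and label names come from the table above, but the `promql_error_rate` helper and its parameters are hypothetical conveniences, not part of llm-d.

```python
from typing import Optional


def promql_error_rate(model: Optional[str] = None, window: str = "5m") -> str:
    """Build the error-rate ratio query from the Tier 1 table.

    Platform-wide by default; with `model` set, both sides are grouped
    by the `model` label and filtered to that one model.
    """
    selector = f'{{model="{model}"}}' if model else ""
    errors = f"rate(inference_model_request_error_total{selector}[{window}])"
    total = f"rate(inference_model_request_total{selector}[{window}])"
    if model is None:
        return f"sum({errors}) / sum({total})"
    return f"sum by(model) ({errors}) / sum by(model) ({total})"


print(promql_error_rate())
print(promql_error_rate(model="granite"))  # model name is a placeholder
```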

## Tier 2: Diagnostic Drill-Down

### Path A: Basic Model Serving & Scaling

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| KV Cache Utilization | `avg by(pod, model_name) (vllm:kv_cache_usage_perc)` | `kv-cache-manager.GetPodScores` with `llm_d.kv_cache.utilization` |
| Request Queue Lengths | `sum by(pod, model_name) (vllm:num_requests_waiting)` | `vllm.queue.enqueue` and `vllm.queue.dequeue` timing |
| Model Throughput (Requests/sec) | `sum by(model_name, pod) (rate(vllm:request_success_total[5m]))` | `gateway.request` count aggregation |
| Model Throughput (Tokens/sec) | `sum by(model_name, pod) (rate(vllm:generation_tokens_total[5m]))` | `vllm.generation` with token count attributes |

### Path B: Intelligent Routing & Load Balancing

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| Request Distribution (QPS per instance) | `sum by(pod) (rate(inference_model_request_total{target_model!=""}[5m]))` | `epp.routing.decision` with selected pod |
| Token Distribution | `sum by(pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))` | `routing_proxy.request` with token counts |
| Idle GPU Time | `1 - avg by(pod) (rate(vllm:iteration_tokens_total[5m]) > bool 0)` | `vllm.engine.step` gaps between iterations |
| Routing Decision Latency | `histogram_quantile(0.99, sum by(le) (rate(inference_extension_scheduler_plugin_duration_seconds_bucket[5m])))` | `epp.scheduler.score_pods` duration |
| Routing Rule Hits | `sum by(rule) (increase(inference_extension_routing_rule_hits_total[5m]))` | `epp.routing.rule_evaluation` with rule name |

Note: the idle-time query uses `> bool 0` so the comparison yields a 0/1 activity indicator rather than filtering out idle series.

### Path C: Prefix Caching

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| Prefix Cache Hit Rate | `sum(rate(vllm:prefix_cache_hits[5m])) / sum(rate(vllm:prefix_cache_queries[5m]))` | `vllm.prefix_cache.lookup` with hit/miss |
| Per-Instance Hit Rate | `sum by(pod) (rate(vllm:prefix_cache_hits[5m])) / sum by(pod) (rate(vllm:prefix_cache_queries[5m]))` | `kv-cache-manager.FindTokens` with cache results |
| Cache Memory Usage (GiB) | `sum by(pod) (vllm:prefix_cache_memory_bytes / 1024 / 1024 / 1024)` | `vllm.prefix_cache.allocation` with size |
| Cache Eviction Rate | `sum by(pod) (rate(vllm:prefix_cache_evictions_total[5m]))` | `vllm.prefix_cache.evict` with eviction reason |
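The hit-rate ratio above can also be sanity-checked offline from raw counter samples, e.g. in a test harness. A minimal sketch (the helper name and sample values are hypothetical); it mirrors `sum(rate(hits)) / sum(rate(queries))`, where the shared time interval cancels and only the counter deltas matter:

```python
def hit_rate(hits_before, hits_after, queries_before, queries_after):
    """Prefix-cache hit rate over a window, from two counter samples.

    Equivalent to the PromQL ratio of rates: both rates divide by the
    same window length, so the ratio reduces to delta(hits)/delta(queries).
    """
    d_hits = hits_after - hits_before
    d_queries = queries_after - queries_before
    if d_queries <= 0:
        return None  # no queries in the window; the ratio is undefined
    return d_hits / d_queries


print(hit_rate(100, 180, 200, 300))  # 80 hits over 100 queries -> 0.8
```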

### Path D: P/D Disaggregation

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| KV Cache Transfer Time | `histogram_quantile(0.99, sum by(le) (rate(vllm:kv_cache_transfer_duration_seconds_bucket[5m])))` | `pd.kv_transfer` prefill→decode transfer |
| Prefill Worker Utilization | `avg by(pod) (vllm:num_requests_running{phase="prefill"} / vllm:max_concurrent_requests)` | `vllm.prefill.batch` with batch size |
| Decode Worker Utilization | `avg by(pod) (vllm:kv_cache_usage_perc{phase="decode"})` | `vllm.decode.batch` with active sequences |
| Prefill Queue Length | `sum by(pod) (vllm:num_requests_waiting{phase="prefill"})` | `pd.prefill.queue_time` duration |

## Comparison with Existing Grafana Dashboard

### Existing Dashboard vs. Proposed Queries

| Dashboard Panel | Current Query | Proposed Enhancement |
|---|---|---|
| E2E Request Latency | Uses `vllm:e2e_request_latency_seconds` | ✅ Good; add `inference_model_request_duration_seconds` for the gateway perspective |
| Token Throughput | Tracks prompt & generation tokens separately | ✅ Good; consider adding a per-pod breakdown |
| Scheduler State | Shows running/waiting/swapped | ⚠️ Missing preemption tracking; add `vllm:num_preemptions_total` |
| Cache Utilization | Uses deprecated `gpu_cache_usage_perc` | ⚠️ Update to `kv_cache_usage_perc` |
| TTFT Latency | Has P50-P99 percentiles | ✅ Good coverage |
| Queue Time | Uses rate of sum | ⚠️ Consider histogram quantiles for better percentile accuracy |
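The last caveat, preferring histogram quantiles over a rate of sums, can be made concrete: `histogram_quantile` locates the bucket containing the target rank and linearly interpolates within it, which only works if bucket boundaries (`le`) survive aggregation. The toy reimplementation below is a hypothetical sketch for a single histogram, not the Prometheus source:

```python
import math


def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile for one histogram.

    `buckets` maps upper bound (le) -> cumulative count and must include
    float('inf'). Finds the bucket holding the q-th rank and linearly
    interpolates between its bounds, as Prometheus does.
    """
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for b in bounds:
        count = buckets[b]
        if count >= rank:
            if math.isinf(b):
                return prev_bound  # rank falls in the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return b
            return prev_bound + (b - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = b, count
    return bounds[-1]


# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 2s.
buckets = {0.1: 50, 0.5: 90, 2.0: 100, float("inf"): 100}
print(histogram_quantile(0.99, buckets))  # interpolates inside the 0.5-2.0s bucket
```

Averaging pre-summed latencies, by contrast, discards the distribution entirely, so tail percentiles cannot be recovered.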

### Key Gaps in Current Dashboard

  1. No Error Rate Tracking - Critical for SRE alerting
  2. No Scheduler Health Metrics - Missing restart/OOM tracking
  3. No Routing Distribution Metrics - Can't see load imbalance
  4. No Prefix Cache Metrics - Missing hit rate and evictions
  5. No P/D Disaggregation Metrics - If enabled, need transfer times

## Recommended Trace Spans for Maximum Insight

### 1. Request Lifecycle Trace

```
gateway.request
└── epp.routing.decision
    └── kv-cache-manager.GetPodScores
        └── vllm.inference
            ├── vllm.prefill
            └── vllm.decode (multiple spans)
```
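Any OpenTelemetry-compatible tracer produces this hierarchy through span nesting. The minimal stand-in below is a hypothetical toy tracer (not the OTel SDK) that just records parent-child relationships to show the intended shape of the lifecycle trace:

```python
import contextlib


class ToyTracer:
    """Records (span_name, parent_name) pairs to illustrate span nesting."""

    def __init__(self):
        self.spans = []   # (name, parent) in start order
        self._stack = []  # currently open spans

    @contextlib.contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self.spans.append((name, parent))
        self._stack.append(name)
        try:
            yield
        finally:
            self._stack.pop()


tracer = ToyTracer()
# Mirror the request lifecycle trace from the tree above.
with tracer.span("gateway.request"):
    with tracer.span("epp.routing.decision"):
        with tracer.span("kv-cache-manager.GetPodScores"):
            with tracer.span("vllm.inference"):
                with tracer.span("vllm.prefill"):
                    pass
                for _ in range(2):  # decode emits one span per step
                    with tracer.span("vllm.decode"):
                        pass

for name, parent in tracer.spans:
    print(f"{name} (parent: {parent})")
```

In a real deployment the same nesting falls out of context propagation (Implementation Note 3): each component starts its span under the context it received from upstream.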

### 2. P/D Disaggregation Trace

```
epp.pd_prerequest
├── pd.prefill.schedule
├── vllm.prefill (on prefill worker)
├── pd.kv_transfer
└── vllm.decode (on decode worker)
```

### 3. Cache Operation Trace

```
kv-cache-manager.FindTokens
├── redis.query
└── vllm.prefix_cache.lookup
    └── vllm.prefix_cache.evict (if needed)
```

These traces provide critical timing and causality information that metrics alone cannot capture, enabling root cause analysis of complex issues like routing inefficiencies or cache thrashing.

## Additional Metrics from Appendix

### Path A - Extra Metrics

| Metric Need | PromQL Query | Trace Spans |
|---|---|---|
| GPU Memory Bandwidth | `DCGM_FI_DEV_MEM_COPY_UTIL` or `nvidia_gpu_memory_bandwidth_utilization` | N/A - hardware metric |
| Model Load Time | `histogram_quantile(0.99, sum by(model, le) (rate(vllm:model_load_duration_seconds_bucket[5m])))` | `vllm.model.load` with model size |

### Path B - Extra Metrics

| Metric Need | PromQL Query | Trace Spans |
|---|---|---|
| Load Imbalance Index | `stddev by(model) (sum by(pod, model) (rate(inference_model_request_total[5m]))) / avg by(model) (sum by(pod, model) (rate(inference_model_request_total[5m])))` | `epp.load_balancer.score` with imbalance factor |
| Routing Retries | `sum by(model) (rate(inference_extension_routing_retries_total[5m]))` | `epp.routing.retry` with retry reason |
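The Load Imbalance Index above is a coefficient of variation (stddev / mean) over per-pod request rates. A quick reference implementation for validating dashboard output (the helper name is hypothetical; Prometheus's `stddev` aggregator is the population standard deviation, so `pstdev` is used to match):

```python
import statistics


def load_imbalance_index(per_pod_rates):
    """Coefficient of variation of per-pod request rates for one model.

    0.0 means perfectly even load; larger values mean more skew.
    """
    mean = statistics.fmean(per_pod_rates)
    if mean == 0:
        return 0.0  # no traffic; treat as balanced rather than divide by zero
    return statistics.pstdev(per_pod_rates) / mean


print(load_imbalance_index([10.0, 10.0, 10.0]))  # 0.0 -- even load
print(load_imbalance_index([30.0, 5.0, 5.0]))    # skewed toward one pod
```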

## Implementation Notes

1. **Metric Naming**: All vLLM metrics use the `vllm:` prefix; gateway metrics use the `inference_` prefix
2. **Label Consistency**: Ensure `model_name`, `pod`, and `namespace` labels are consistent across metrics
3. **Trace Context**: Always propagate trace context through all components for end-to-end visibility
4. **Sampling Strategy**: Consider adaptive sampling based on error status and latency thresholds
5. **Retention**: Balance metric retention with storage costs; consider downsampling older data
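Note 4 can be sketched as a per-request sampling decision: always keep traces for errors and slow requests, and keep a small baseline fraction of healthy traffic. The thresholds, names, and defaults below are illustrative assumptions, not llm-d configuration:

```python
import random


def should_sample(is_error, latency_seconds,
                  latency_threshold=2.0, base_rate=0.01, rng=random):
    """Tail-sampling decision for one completed request.

    Keeps all errors and all requests slower than the threshold;
    otherwise keeps a small baseline fraction for healthy-path visibility.
    """
    if is_error:
        return True
    if latency_seconds >= latency_threshold:
        return True
    return rng.random() < base_rate


print(should_sample(is_error=True, latency_seconds=0.1))   # True
print(should_sample(is_error=False, latency_seconds=5.0))  # True
```

Because the decision depends on the request's outcome, it is a tail-sampling policy: the full trace must be buffered until the request completes, then kept or dropped.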