@sallyom
Created July 31, 2025 15:42

# llm-d Observability: PromQL Queries and Trace Spans

Based on your observability needs, here's a comprehensive mapping of PromQL queries and complementary trace spans:

## Tier 1: Immediate Failure & Saturation Indicators

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| Overall Error Rate (Platform-wide) | `sum(rate(inference_model_request_error_total[5m])) / sum(rate(inference_model_request_total[5m]))` | `gateway.request` with error status codes and error messages |
| Per-Model Error Rate | `sum by(model) (rate(inference_model_request_error_total[5m])) / sum by(model) (rate(inference_model_request_total[5m]))` | `gateway.request` with `gen_ai.request.model` attribute |
| Request Preemptions (per vLLM instance) | `sum by(pod, instance) (rate(vllm:num_preemptions_total[5m]))` | `vllm.request.preemption` with reason and KV cache state |
| Overall Latency P90/P99 | `histogram_quantile(0.99, sum by(le) (rate(inference_model_request_duration_seconds_bucket[5m])))` | `gateway.request` → `vllm.inference` full trace |
| Model-Specific TTFT P99 | `histogram_quantile(0.99, sum by(model_name, le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))` | `vllm.prefill` span duration |
| Model-Specific TPOT P99 | `histogram_quantile(0.99, sum by(model_name, le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))` | `vllm.decode.step` per-token spans |
| Scheduler Health | `up{job="inference-scheduler"} * on(pod) group_left() (1 - rate(container_restarts_total{container="scheduler"}[5m]))` | `epp.scheduler.health_check` periodic spans |
| GPU Utilization | `avg by(gpu, node) (DCGM_FI_DEV_GPU_UTIL or nvidia_gpu_duty_cycle)` | N/A - hardware metric |

Note: `histogram_quantile` requires the `le` label, so bucket aggregations must preserve it (`sum by(le)` rather than a bare `sum`).
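When wiring these queries into dashboards or alert generators, it helps to build them programmatically rather than copy-pasting variants. The sketch below is illustrative: the metric and label names come from the table above, but the `promql_error_rate` helper and its parameters are hypothetical conveniences, not part of llm-d.

```python
from typing import Optional


def promql_error_rate(model: Optional[str] = None, window: str = "5m") -> str:
    """Build the error-rate ratio query from the Tier 1 table.

    Platform-wide by default; with `model` set, both sides are grouped
    by the `model` label and filtered to that one model.
    """
    selector = f'{{model="{model}"}}' if model else ""
    errors = f"rate(inference_model_request_error_total{selector}[{window}])"
    total = f"rate(inference_model_request_total{selector}[{window}])"
    if model is None:
        return f"sum({errors}) / sum({total})"
    return f"sum by(model) ({errors}) / sum by(model) ({total})"


print(promql_error_rate())
print(promql_error_rate(model="granite"))  # model name is a placeholder
```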

## Tier 2: Diagnostic Drill-Down

### Path A: Basic Model Serving & Scaling

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| KV Cache Utilization | `avg by(pod, model_name) (vllm:kv_cache_usage_perc)` | `kv-cache-manager.GetPodScores` with `llm_d.kv_cache.utilization` |
| Request Queue Lengths | `sum by(pod, model_name) (vllm:num_requests_waiting)` | `vllm.queue.enqueue` and `vllm.queue.dequeue` timing |
| Model Throughput (Requests/sec) | `sum by(model_name, pod) (rate(vllm:request_success_total[5m]))` | `gateway.request` count aggregation |
| Model Throughput (Tokens/sec) | `sum by(model_name, pod) (rate(vllm:generation_tokens_total[5m]))` | `vllm.generation` with token count attributes |

### Path B: Intelligent Routing & Load Balancing

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| Request Distribution (QPS per instance) | `sum by(pod) (rate(inference_model_request_total{target_model!=""}[5m]))` | `epp.routing.decision` with selected pod |
| Token Distribution | `sum by(pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))` | `routing_proxy.request` with token counts |
| Idle GPU Time | `1 - avg by(pod) (rate(vllm:iteration_tokens_total[5m]) > bool 0)` | `vllm.engine.step` gaps between iterations |
| Routing Decision Latency | `histogram_quantile(0.99, sum by(le) (rate(inference_extension_scheduler_plugin_duration_seconds_bucket[5m])))` | `epp.scheduler.score_pods` duration |
| Routing Rule Hits | `sum by(rule) (increase(inference_extension_routing_rule_hits_total[5m]))` | `epp.routing.rule_evaluation` with rule name |

Note: the idle-time query uses `> bool 0` so the comparison yields a 0/1 activity indicator rather than filtering out idle series.

### Path C: Prefix Caching

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| Prefix Cache Hit Rate | `sum(rate(vllm:prefix_cache_hits[5m])) / sum(rate(vllm:prefix_cache_queries[5m]))` | `vllm.prefix_cache.lookup` with hit/miss |
| Per-Instance Hit Rate | `sum by(pod) (rate(vllm:prefix_cache_hits[5m])) / sum by(pod) (rate(vllm:prefix_cache_queries[5m]))` | `kv-cache-manager.FindTokens` with cache results |
| Cache Memory Usage (GiB) | `sum by(pod) (vllm:prefix_cache_memory_bytes / 1024 / 1024 / 1024)` | `vllm.prefix_cache.allocation` with size |
| Cache Eviction Rate | `sum by(pod) (rate(vllm:prefix_cache_evictions_total[5m]))` | `vllm.prefix_cache.evict` with eviction reason |
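The hit-rate ratio above can also be sanity-checked offline from raw counter samples, e.g. in a test harness. A minimal sketch (the helper name and sample values are hypothetical); it mirrors `sum(rate(hits)) / sum(rate(queries))`, where the shared time interval cancels and only the counter deltas matter:

```python
def hit_rate(hits_before, hits_after, queries_before, queries_after):
    """Prefix-cache hit rate over a window, from two counter samples.

    Equivalent to the PromQL ratio of rates: both rates divide by the
    same window length, so the ratio reduces to delta(hits)/delta(queries).
    """
    d_hits = hits_after - hits_before
    d_queries = queries_after - queries_before
    if d_queries <= 0:
        return None  # no queries in the window; the ratio is undefined
    return d_hits / d_queries


print(hit_rate(100, 180, 200, 300))  # 80 hits over 100 queries -> 0.8
```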

### Path D: P/D Disaggregation

| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
|---|---|---|
| KV Cache Transfer Time | `histogram_quantile(0.99, sum by(le) (rate(vllm:kv_cache_transfer_duration_seconds_bucket[5m])))` | `pd.kv_transfer` prefill→decode transfer |
| Prefill Worker Utilization | `avg by(pod) (vllm:num_requests_running{phase="prefill"} / vllm:max_concurrent_requests)` | `vllm.prefill.batch` with batch size |
| Decode Worker Utilization | `avg by(pod) (vllm:kv_cache_usage_perc{phase="decode"})` | `vllm.decode.batch` with active sequences |
| Prefill Queue Length | `sum by(pod) (vllm:num_requests_waiting{phase="prefill"})` | `pd.prefill.queue_time` duration |

## Comparison with Existing Grafana Dashboard

### Existing Dashboard vs. Proposed Queries

| Dashboard Panel | Current Query | Proposed Enhancement |
|---|---|---|
| E2E Request Latency | Uses `vllm:e2e_request_latency_seconds` | ✅ Good; add `inference_model_request_duration_seconds` for the gateway perspective |
| Token Throughput | Tracks prompt & generation tokens separately | ✅ Good; consider adding a per-pod breakdown |
| Scheduler State | Shows running/waiting/swapped | ⚠️ Missing preemption tracking; add `vllm:num_preemptions_total` |
| Cache Utilization | Uses deprecated `gpu_cache_usage_perc` | ⚠️ Update to `kv_cache_usage_perc` |
| TTFT Latency | Has P50-P99 percentiles | ✅ Good coverage |
| Queue Time | Uses rate of sum | ⚠️ Consider histogram quantiles for better percentile accuracy |
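The last caveat, preferring histogram quantiles over a rate of sums, can be made concrete: `histogram_quantile` locates the bucket containing the target rank and linearly interpolates within it, which only works if bucket boundaries (`le`) survive aggregation. The toy reimplementation below is a hypothetical sketch for a single histogram, not the Prometheus source:

```python
import math


def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile for one histogram.

    `buckets` maps upper bound (le) -> cumulative count and must include
    float('inf'). Finds the bucket holding the q-th rank and linearly
    interpolates between its bounds, as Prometheus does.
    """
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for b in bounds:
        count = buckets[b]
        if count >= rank:
            if math.isinf(b):
                return prev_bound  # rank falls in the +Inf bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return b
            return prev_bound + (b - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = b, count
    return bounds[-1]


# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 2s.
buckets = {0.1: 50, 0.5: 90, 2.0: 100, float("inf"): 100}
print(histogram_quantile(0.99, buckets))  # interpolates inside the 0.5-2.0s bucket
```

Averaging pre-summed latencies, by contrast, discards the distribution entirely, so tail percentiles cannot be recovered.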

### Key Gaps in Current Dashboard

  1. No Error Rate Tracking - Critical for SRE alerting
  2. No Scheduler Health Metrics - Missing restart/OOM tracking
  3. No Routing Distribution Metrics - Can't see load imbalance
  4. No Prefix Cache Metrics - Missing hit rate and evictions
  5. No P/D Disaggregation Metrics - If enabled, need transfer times

## Recommended Trace Spans for Maximum Insight

### 1. Request Lifecycle Trace

```
gateway.request
└── epp.routing.decision
    └── kv-cache-manager.GetPodScores
        └── vllm.inference
            ├── vllm.prefill
            └── vllm.decode (multiple spans)
```
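Any OpenTelemetry-compatible tracer produces this hierarchy through span nesting. The minimal stand-in below is a hypothetical toy tracer (not the OTel SDK) that just records parent-child relationships to show the intended shape of the lifecycle trace:

```python
import contextlib


class ToyTracer:
    """Records (span_name, parent_name) pairs to illustrate span nesting."""

    def __init__(self):
        self.spans = []   # (name, parent) in start order
        self._stack = []  # currently open spans

    @contextlib.contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self.spans.append((name, parent))
        self._stack.append(name)
        try:
            yield
        finally:
            self._stack.pop()


tracer = ToyTracer()
# Mirror the request lifecycle trace from the tree above.
with tracer.span("gateway.request"):
    with tracer.span("epp.routing.decision"):
        with tracer.span("kv-cache-manager.GetPodScores"):
            with tracer.span("vllm.inference"):
                with tracer.span("vllm.prefill"):
                    pass
                for _ in range(2):  # decode emits one span per step
                    with tracer.span("vllm.decode"):
                        pass

for name, parent in tracer.spans:
    print(f"{name} (parent: {parent})")
```

In a real deployment the same nesting falls out of context propagation (Implementation Note 3): each component starts its span under the context it received from upstream.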

### 2. P/D Disaggregation Trace

```
epp.pd_prerequest
├── pd.prefill.schedule
├── vllm.prefill (on prefill worker)
├── pd.kv_transfer
└── vllm.decode (on decode worker)
```

### 3. Cache Operation Trace

```
kv-cache-manager.FindTokens
├── redis.query
└── vllm.prefix_cache.lookup
    └── vllm.prefix_cache.evict (if needed)
```

These traces provide critical timing and causality information that metrics alone cannot capture, enabling root cause analysis of complex issues like routing inefficiencies or cache thrashing.

## Additional Metrics from Appendix

### Path A - Extra Metrics

| Metric Need | PromQL Query | Trace Spans |
|---|---|---|
| GPU Memory Bandwidth | `DCGM_FI_DEV_MEM_COPY_UTIL` or `nvidia_gpu_memory_bandwidth_utilization` | N/A - hardware metric |
| Model Load Time | `histogram_quantile(0.99, sum by(model, le) (rate(vllm:model_load_duration_seconds_bucket[5m])))` | `vllm.model.load` with model size |

### Path B - Extra Metrics

| Metric Need | PromQL Query | Trace Spans |
|---|---|---|
| Load Imbalance Index | `stddev by(model) (sum by(pod, model) (rate(inference_model_request_total[5m]))) / avg by(model) (sum by(pod, model) (rate(inference_model_request_total[5m])))` | `epp.load_balancer.score` with imbalance factor |
| Routing Retries | `sum by(model) (rate(inference_extension_routing_retries_total[5m]))` | `epp.routing.retry` with retry reason |
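The Load Imbalance Index above is a coefficient of variation (stddev / mean) over per-pod request rates. A quick reference implementation for validating dashboard output (the helper name is hypothetical; Prometheus's `stddev` aggregator is the population standard deviation, so `pstdev` is used to match):

```python
import statistics


def load_imbalance_index(per_pod_rates):
    """Coefficient of variation of per-pod request rates for one model.

    0.0 means perfectly even load; larger values mean more skew.
    """
    mean = statistics.fmean(per_pod_rates)
    if mean == 0:
        return 0.0  # no traffic; treat as balanced rather than divide by zero
    return statistics.pstdev(per_pod_rates) / mean


print(load_imbalance_index([10.0, 10.0, 10.0]))  # 0.0 -- even load
print(load_imbalance_index([30.0, 5.0, 5.0]))    # skewed toward one pod
```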

## Implementation Notes

1. **Metric Naming**: All vLLM metrics use the `vllm:` prefix; gateway metrics use the `inference_` prefix
2. **Label Consistency**: Ensure `model_name`, `pod`, and `namespace` labels are consistent across metrics
3. **Trace Context**: Always propagate trace context through all components for end-to-end visibility
4. **Sampling Strategy**: Consider adaptive sampling based on error status and latency thresholds
5. **Retention**: Balance metric retention with storage costs; consider downsampling older data
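Note 4 can be sketched as a per-request sampling decision: always keep traces for errors and slow requests, and keep a small baseline fraction of healthy traffic. The thresholds, names, and defaults below are illustrative assumptions, not llm-d configuration:

```python
import random


def should_sample(is_error, latency_seconds,
                  latency_threshold=2.0, base_rate=0.01, rng=random):
    """Tail-sampling decision for one completed request.

    Keeps all errors and all requests slower than the threshold;
    otherwise keeps a small baseline fraction for healthy-path visibility.
    """
    if is_error:
        return True
    if latency_seconds >= latency_threshold:
        return True
    return rng.random() < base_rate


print(should_sample(is_error=True, latency_seconds=0.1))   # True
print(should_sample(is_error=False, latency_seconds=5.0))  # True
```

Because the decision depends on the request's outcome, it is a tail-sampling policy: the full trace must be buffered until the request completes, then kept or dropped.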