llm-d Observability: PromQL Queries and Trace Spans
Based on your observability needs, here's a comprehensive mapping of PromQL queries and complementary trace spans:
Tier 1: Immediate Failure & Saturation Indicators
| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
| --- | --- | --- |
| Overall Error Rate (Platform-wide) | `sum(rate(inference_model_request_error_total[5m])) / sum(rate(inference_model_request_total[5m]))` | `gateway.request` with error status codes and error messages |
| Per-Model Error Rate | `sum by(model) (rate(inference_model_request_error_total[5m])) / sum by(model) (rate(inference_model_request_total[5m]))` | `gateway.request` with the `gen_ai.request.model` attribute |
| Request Preemptions (per vLLM instance) | `sum by(pod, instance) (rate(vllm:num_preemptions_total[5m]))` | `vllm.request.preemption` with reason and KV cache state |
| Overall Latency P90/P99 | `histogram_quantile(0.99, sum by(le) (rate(inference_model_request_duration_seconds_bucket[5m])))` | `gateway.request` → `vllm.inference` full trace |
| Model-Specific TTFT P99 | `histogram_quantile(0.99, sum by(model_name, le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))` | `vllm.prefill` span duration |
| Model-Specific TPOT P99 | `histogram_quantile(0.99, sum by(model_name, le) (rate(vllm:time_per_output_token_seconds_bucket[5m])))` | `vllm.decode.step` per-token spans |
| Scheduler Health | `up{job="inference-scheduler"} * on(pod) group_left() (1 - rate(container_restarts_total{container="scheduler"}[5m]))` | `epp.scheduler.health_check` periodic spans |
| GPU Utilization | `avg by(gpu, node) (DCGM_FI_DEV_GPU_UTIL or nvidia_gpu_duty_cycle)` | N/A (hardware metric) |
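To make the Tier 1 queries actionable, the two highest-signal ones can be wired into Prometheus alerting rules. The sketch below uses standard Prometheus rule-file syntax; the group name, alert names, thresholds, and severities are illustrative choices, not llm-d defaults.

```yaml
groups:
  - name: llm-d-tier1  # hypothetical group name
    rules:
      - alert: HighPlatformErrorRate
        # Overall error rate from the Tier 1 table; the 5% threshold is an example, tune per SLO
        expr: |
          sum(rate(inference_model_request_error_total[5m]))
            / sum(rate(inference_model_request_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Platform-wide inference error rate above 5% for 5 minutes"
      - alert: VLLMPreemptionSpike
        # Sustained preemptions usually indicate KV cache pressure on the instance
        expr: sum by(pod) (rate(vllm:num_preemptions_total[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "vLLM pod {{ $labels.pod }} is preempting requests"
```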
Tier 2: Diagnostic Drill-Down
Path A: Basic Model Serving & Scaling
| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
| --- | --- | --- |
| KV Cache Utilization | `avg by(pod, model_name) (vllm:kv_cache_usage_perc)` | `kv-cache-manager.GetPodScores` with `llm_d.kv_cache.utilization` |
| Request Queue Lengths | `sum by(pod, model_name) (vllm:num_requests_waiting)` | `vllm.queue.enqueue` and `vllm.queue.dequeue` timing |
| Model Throughput (Requests/sec) | `sum by(model_name, pod) (rate(vllm:request_success_total[5m]))` | `gateway.request` count aggregation |
| Model Throughput (Tokens/sec) | `sum by(model_name, pod) (rate(vllm:generation_tokens_total[5m]))` | `vllm.generation` with token count attributes |
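Because several Path A queries are reused across dashboards and autoscaling signals, pre-aggregating them as Prometheus recording rules keeps panels cheap and consistent. This is a minimal sketch; the recording-rule and group names are hypothetical.

```yaml
groups:
  - name: llm-d-path-a-recording  # hypothetical group name
    rules:
      # Per-model, per-pod generation tokens/sec, pre-aggregated for dashboards
      - record: model:generation_tokens:rate5m
        expr: sum by(model_name, pod) (rate(vllm:generation_tokens_total[5m]))
      # Mean KV cache utilization per pod; a useful scale-out signal when sustained near 1.0
      - record: pod:kv_cache_usage:avg
        expr: avg by(pod, model_name) (vllm:kv_cache_usage_perc)
```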
Path B: Intelligent Routing & Load Balancing
| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
| --- | --- | --- |
| Request Distribution (QPS per instance) | `sum by(pod) (rate(inference_model_request_total{target_model!=""}[5m]))` | `epp.routing.decision` with selected pod |
| Token Distribution | `sum by(pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))` | `routing_proxy.request` with token counts |
| Idle GPU Time | `1 - avg by(pod) (rate(vllm:iteration_tokens_total[5m]) > bool 0)` | `vllm.engine.step` gaps between iterations |
| Routing Decision Latency | `histogram_quantile(0.99, sum by(le) (rate(inference_extension_scheduler_plugin_duration_seconds_bucket[5m])))` | `epp.scheduler.score_pods` duration |
| Routing Rule Hits | `sum by(rule) (increase(inference_extension_routing_rule_hits_total[5m]))` | `epp.routing.rule_evaluation` with rule name |
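The Routing Decision Latency query can also serve as an early warning for EPP scheduler regressions. The alert below is a sketch; the 50ms budget, group name, and alert name are examples rather than llm-d defaults.

```yaml
groups:
  - name: llm-d-path-b  # hypothetical group name
    rules:
      - alert: SlowRoutingDecisions
        # P99 of the EPP scheduler plugin duration histogram; 50ms is an illustrative budget
        expr: |
          histogram_quantile(0.99,
            sum by(le) (rate(inference_extension_scheduler_plugin_duration_seconds_bucket[5m]))) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Routing decision P99 latency exceeds 50ms"
```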
Path C: Prefix Caching
| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
| --- | --- | --- |
| Prefix Cache Hit Rate | `sum(rate(vllm:prefix_cache_hits[5m])) / sum(rate(vllm:prefix_cache_queries[5m]))` | `vllm.prefix_cache.lookup` with hit/miss |
| Per-Instance Hit Rate | `sum by(pod) (rate(vllm:prefix_cache_hits[5m])) / sum by(pod) (rate(vllm:prefix_cache_queries[5m]))` | `kv-cache-manager.FindTokens` with cache results |
| Cache Memory Usage (GiB) | `sum by(pod) (vllm:prefix_cache_memory_bytes / 1024 / 1024 / 1024)` | `vllm.prefix_cache.allocation` with size |
| Cache Eviction Rate | `sum by(pod) (rate(vllm:prefix_cache_evictions_total[5m]))` | `vllm.prefix_cache.evict` with eviction reason |
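The hit-rate ratio appears in both cluster-wide and per-pod form, so it is a good candidate for recording rules; pairing a low hit rate with a high eviction rate also gives a usable cache-thrashing signal. The rule names and thresholds in this sketch are assumptions to adapt to your workload.

```yaml
groups:
  - name: llm-d-prefix-cache  # hypothetical group name
    rules:
      - record: cluster:prefix_cache_hit_rate:ratio5m
        expr: |
          sum(rate(vllm:prefix_cache_hits[5m]))
            / sum(rate(vllm:prefix_cache_queries[5m]))
      - record: pod:prefix_cache_hit_rate:ratio5m
        expr: |
          sum by(pod) (rate(vllm:prefix_cache_hits[5m]))
            / sum by(pod) (rate(vllm:prefix_cache_queries[5m]))
      - alert: PrefixCacheThrashing
        # A low hit rate together with sustained evictions suggests the cache is churning; thresholds are examples
        expr: |
          pod:prefix_cache_hit_rate:ratio5m < 0.2
            and on(pod) sum by(pod) (rate(vllm:prefix_cache_evictions_total[5m])) > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prefix cache on {{ $labels.pod }} has a low hit rate with sustained evictions"
```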
Path D: P/D Disaggregation
| Metric Need | PromQL Query | Trace Spans to Enhance Insight |
| --- | --- | --- |
| KV Cache Transfer Time | `histogram_quantile(0.99, sum by(le) (rate(vllm:kv_cache_transfer_duration_seconds_bucket[5m])))` | `pd.kv_transfer` prefill→decode transfer |
| Prefill Worker Utilization | `avg by(pod) (vllm:num_requests_running{phase="prefill"} / vllm:max_concurrent_requests)` | `vllm.prefill.batch` with batch size |
| Decode Worker Utilization | `avg by(pod) (vllm:kv_cache_usage_perc{phase="decode"})` | `vllm.decode.batch` with active sequences |
| Prefill Queue Length | `sum by(pod) (vllm:num_requests_waiting{phase="prefill"})` | `pd.prefill.queue_time` duration |
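If P/D disaggregation is enabled, the KV transfer histogram is the first thing to alert on, since slow transfers directly inflate time to first token on decode workers. The 500ms budget and rule names below are illustrative.

```yaml
groups:
  - name: llm-d-pd-disaggregation  # hypothetical group name
    rules:
      - alert: SlowKVCacheTransfer
        # P99 prefill→decode KV cache transfer time; the 500ms budget is an example
        expr: |
          histogram_quantile(0.99,
            sum by(le) (rate(vllm:kv_cache_transfer_duration_seconds_bucket[5m]))) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P/D KV cache transfer P99 above 500ms"
```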
Comparison with Existing Grafana Dashboard
Existing Dashboard vs. Proposed Queries
| Dashboard Panel | Current Query | Proposed Enhancement |
| --- | --- | --- |
| E2E Request Latency | Uses `vllm:e2e_request_latency_seconds` | ✅ Good; add `inference_model_request_duration_seconds` for the gateway perspective |
| Token Throughput | Tracks prompt & generation tokens separately | ✅ Good; consider adding a per-pod breakdown |
| Scheduler State | Shows running/waiting/swapped | ⚠️ Missing preemption tracking; add `vllm:num_preemptions_total` |
| Cache Utilization | Uses deprecated `gpu_cache_usage_perc` | ⚠️ Update to `kv_cache_usage_perc` |
| TTFT Latency | Has P50-P99 percentiles | ✅ Good coverage |
| Queue Time | Uses rate of sum | ⚠️ Consider histogram quantiles for better percentile accuracy |
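For the Cache Utilization panel, a migration-friendly expression can serve mixed fleets where some vLLM versions still export the deprecated series. The deprecated metric is assumed here to be exported as `vllm:gpu_cache_usage_perc`; adjust the fallback to whatever name your current dashboard queries.

```yaml
# Recording rule sketch for the Cache Utilization panel migration.
# Assumes the deprecated series is exported as vllm:gpu_cache_usage_perc; adjust if your fleet differs.
groups:
  - name: llm-d-dashboard-migration  # hypothetical group name
    rules:
      - record: pod:kv_cache_usage:current
        # Prefer the new metric, fall back to the deprecated one per series
        expr: vllm:kv_cache_usage_perc or vllm:gpu_cache_usage_perc
```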
Key Gaps in Current Dashboard
- No Error Rate Tracking - critical for SRE alerting
- No Scheduler Health Metrics - missing restart/OOM tracking
- No Routing Distribution Metrics - can't see load imbalance
- No Prefix Cache Metrics - missing hit rate and evictions
- No P/D Disaggregation Metrics - if enabled, transfer times are needed
Recommended Trace Spans for Maximum Insight
1. Request Lifecycle Trace
```
gateway.request
├── epp.routing.decision
│   └── kv-cache-manager.GetPodScores
└── vllm.inference
    ├── vllm.prefill
    └── vllm.decode (multiple spans)
```
2. P/D Disaggregation Trace
```
epp.pd_prerequest
├── pd.prefill.schedule
├── vllm.prefill (on prefill worker)
├── pd.kv_transfer
└── vllm.decode (on decode worker)
```
3. Prefix Cache Trace
```
kv-cache-manager.FindTokens
├── redis.query
└── vllm.prefix_cache.lookup
    └── vllm.prefix_cache.evict (if needed)
```
These traces provide critical timing and causality information that metrics alone cannot capture, enabling root cause analysis of complex issues like routing inefficiencies or cache thrashing.
Additional Metrics from Appendix
Path A - Extra Metrics
| Metric Need | PromQL Query | Trace Spans |
| --- | --- | --- |
| GPU Memory Bandwidth | `DCGM_FI_DEV_MEM_COPY_UTIL or nvidia_gpu_memory_bandwidth_utilization` | N/A (hardware metric) |
| Model Load Time | `histogram_quantile(0.99, sum by(model, le) (rate(vllm:model_load_duration_seconds_bucket[5m])))` | `vllm.model.load` with model size |
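Model load time is mainly worth alerting on during rollouts, where a slow load delays pod readiness. A sketch using the query above follows; the 120-second budget, group name, and alert name are examples rather than llm-d defaults.

```yaml
groups:
  - name: llm-d-model-lifecycle  # hypothetical group name
    rules:
      - alert: SlowModelLoad
        # P99 model load duration; 120s is an example budget for large checkpoints
        expr: |
          histogram_quantile(0.99,
            sum by(model, le) (rate(vllm:model_load_duration_seconds_bucket[5m]))) > 120
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model {{ $labels.model }} P99 load time exceeds 2 minutes"
```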
Path B - Extra Metrics
| Metric Need | PromQL Query | Trace Spans |
| --- | --- | --- |
| Load Imbalance Index | `stddev by(model) (sum by(pod, model) (rate(inference_model_request_total[5m]))) / avg by(model) (sum by(pod, model) (rate(inference_model_request_total[5m])))` | `epp.load_balancer.score` with imbalance factor |
| Routing Retries | `sum by(model) (rate(inference_extension_routing_retries_total[5m]))` | `epp.routing.retry` with retry reason |
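The Load Imbalance Index is a coefficient of variation (the standard deviation of per-pod request rates divided by their mean), so 0 means a perfectly even distribution and larger values mean more skew. Recording it and alerting on sustained skew is sketched below; the 0.5 threshold and rule names are assumptions.

```yaml
groups:
  - name: llm-d-load-balance  # hypothetical group name
    rules:
      # Coefficient of variation of per-pod request rates per model
      - record: model:request_rate_imbalance:cv5m
        expr: |
          stddev by(model) (sum by(pod, model) (rate(inference_model_request_total[5m])))
            / avg by(model) (sum by(pod, model) (rate(inference_model_request_total[5m])))
      - alert: UnevenRequestDistribution
        # A CV above 0.5 for 15 minutes is an example threshold for investigating routing skew
        expr: model:request_rate_imbalance:cv5m > 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Requests for {{ $labels.model }} are unevenly distributed across pods"
```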
Implementation Notes
- Metric Naming: all vLLM metrics use the `vllm:` prefix; gateway metrics use the `inference_` prefix
- Label Consistency: ensure the `model_name`, `pod`, and `namespace` labels are consistent across metrics
- Trace Context: always propagate trace context through all components for end-to-end visibility
- Sampling Strategy: consider adaptive sampling based on error status and latency thresholds (see the collector sketch below)
- Retention: balance metric retention with storage costs; consider downsampling older data
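For the sampling strategy note above, one concrete option is tail-based sampling in an OpenTelemetry Collector: keep every error trace and every slow trace, and sample the rest probabilistically. This sketch assumes the Collector's `tail_sampling` processor (from collector-contrib) is deployed in the trace pipeline; the decision wait, latency threshold, and baseline percentage are examples.

```yaml
# OpenTelemetry Collector tail-based sampling sketch: keep all error traces,
# all slow traces, and a small probabilistic baseline. Thresholds are examples.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```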