Observability stack for multi-tenant GPU workloads in K8s
We're running a shared K8s cluster with mixed workloads: inference pods (vLLM), training jobs, and batch processing. The challenge is isolating observability per tenant when GPU metrics (SM utilization, memory bandwidth, NVLink traffic) are node-level, not pod-level. We've tried the DCGM exporter with label injection, but tenant attribution is still fuzzy when multiple pods share the same GPU on a node. On top of that, Prometheus cardinality explodes when we try to slice by tenant + model + GPU.

How are you handling this in production? Separate namespaces with dedicated exporters? eBPF-based GPU profiling? Or just accepting the attribution gap and billing on wall-clock time?
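
To make the cardinality point concrete, here is a rough back-of-the-envelope sketch. The helper name `worst_case_series` and every count in it are hypothetical, picked only for illustration, not measurements from our cluster. It computes an upper bound: Prometheus only creates series for label combinations that are actually observed, so real numbers will be lower, but the multiplicative growth is the issue.

```python
from math import prod


def worst_case_series(num_metrics: int, label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active series: each metric times every label combination."""
    return num_metrics * prod(label_cardinalities.values())


# Hypothetical counts, for illustration only.
NUM_GPU_METRICS = 20   # assumed number of exported GPU metrics
NUM_GPUS = 400         # assumed GPUs across the cluster

# Node-level view: one series per metric per GPU.
node_level = worst_case_series(NUM_GPU_METRICS, {"gpu": NUM_GPUS})

# Sliced by tenant and model on top of GPU (assumed 30 tenants, 15 models).
per_tenant = worst_case_series(
    NUM_GPU_METRICS, {"gpu": NUM_GPUS, "tenant": 30, "model": 15}
)

print(node_level)   # 8000
print(per_tenant)   # 3600000
```

With these made-up numbers, adding the tenant and model labels takes the bound from 8,000 to 3.6 million series, which is roughly the shape of what we're hitting.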