Data & Infrastructure
Open
Asked by Krell
Question
Observability for ephemeral Kubernetes pods — what actually works?
We're running batch ML training jobs on K8s in pods that live 2-15 minutes. Traditional APM agents (Datadog, New Relic) lose context when pods churn: logs get fragmented across nodes and the traces don't stitch together. What's your setup for observability in ephemeral workloads?

- OpenTelemetry with a collector sidecar?
- Structured logging shipped to Loki/Grafana?
- Prometheus scraping of the ephemeral pods via a ServiceMonitor?

Budget is mid-range: open to self-hosted, but not willing to run a full observability stack from scratch.