Detecting silent data corruption in async ETL pipelines without full checksums
We're running async ETL pipelines (Python + asyncpg) that ingest ~2M rows/day from third-party APIs. Occasionally, fields get silently truncated or type-coerced (e.g., int64 → float with precision loss) without any exception.

Current approach: full MD5 checksums on every batch. This adds ~15% overhead and blocks on I/O.

Question: How do you detect silent corruption in high-throughput async pipelines without paying the full checksum tax?

We've considered:

- Sampling-based CRC32 on hot paths
- Schema validation at ingestion (Pydantic, but it's slow on large batches)
- Database-level constraints (PostgreSQL CHECK) as last line of defense

What's your production setup for catching these before they propagate downstream?
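
To make the sampling option concrete, here is a rough sketch of what we have in mind; it is not something we run in production. It assumes each row carries an integer primary key in its first column and that the same rows can be read back from PostgreSQL after the load. The names `row_crc`, `is_sampled`, `find_mismatches` and the 1-in-100 rate are hypothetical, chosen only for illustration.

```python
import zlib

SAMPLE_EVERY = 100  # hypothetical sampling rate: check roughly 1% of rows


def row_crc(row) -> int:
    """CRC32 over a type-tagged encoding, so that 1 and 1.0 hash differently."""
    encoded = "\x1f".join(f"{type(v).__name__}:{v!r}" for v in row)
    return zlib.crc32(encoded.encode("utf-8"))


def is_sampled(key: int, every: int = SAMPLE_EVERY) -> bool:
    """Sample deterministically on the key so source and sink pick the same rows."""
    return key % every == 0


def sample_digests(rows, key_index: int = 0) -> dict:
    """Map primary key -> CRC32 for the sampled subset of rows."""
    return {
        row[key_index]: row_crc(row)
        for row in rows
        if is_sampled(row[key_index])
    }


def find_mismatches(source_rows, sink_rows, key_index: int = 0) -> list:
    """Return sampled keys whose digest differs (or is missing) on the sink side."""
    src = sample_digests(source_rows, key_index)
    dst = sample_digests(sink_rows, key_index)
    return sorted(k for k in src if dst.get(k) != src[k])


# Rows as parsed from the API vs. the same rows read back after loading.
source = [(0, 2**53 + 1, "abc"), (100, 5, "hello"), (101, 7, "x")]
sink = [(0, float(2**53 + 1), "abc"), (100, 5, "hel"), (101, 7.0, "x")]
print(find_mismatches(source, sink))
```

Key 0 shows the int64 → float coercion and key 100 a truncated string; key 101 is also coerced but falls outside the sample, which is exactly the trade-off we're unsure about. The type tag in the encoding matters because a plain string comparison of values would not distinguish `7` from `7.0` in every serialization, and types whose representation changes on the round trip (e.g. timestamps or decimals) would presumably need a canonical encoding first.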