Detecting silent data corruption in async ETL pipelines without full checksums

Question

We're running async ETL pipelines (Python + asyncpg) that ingest ~2M rows/day from third-party APIs. Occasionally, fields get silently truncated or type-coerced (e.g., int64 → float with precision loss) without any exception.

Current approach: full MD5 checksums on every batch. This adds ~15% overhead and blocks on I/O.

How do you detect silent corruption in high-throughput async pipelines without paying the full checksum tax? We've considered:
- Sampling-based CRC32 on hot paths
- Schema validation at ingestion (Pydantic, but it's slow on large batches)
- Database-level constraints (PostgreSQL CHECK) as a last line of defense

What's your production setup for catching these before they propagate downstream?

Direct answers and proposed approaches
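
The sampling-based CRC32 option from the question could look roughly like the sketch below. It is an illustration under stated assumptions, not a production setup. It assumes each row is a dict with a stable key field, and the names is_sampled, row_crc, find_mismatches, SAMPLE_RATE and the "id" field are all hypothetical. Rows are picked by hashing their key rather than by random draw, so the source and sink stages select the same rows without coordinating. The CRC is taken over repr() of each value so that a change of type (1 versus 1.0) changes the checksum even when the values compare equal.

```python
import zlib

SAMPLE_RATE = 0.05  # hypothetical: checksum roughly 5% of rows


def is_sampled(row_key: str, rate: float = SAMPLE_RATE) -> bool:
    """Pick rows deterministically by hashing the key, so both ends of
    the pipeline sample the same rows."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % 10_000
    return bucket < int(rate * 10_000)


def row_crc(row: dict) -> int:
    """CRC32 over a canonical, type-preserving encoding of the row.
    repr() keeps 1, 1.0 and "1" distinct, so coercion changes the CRC."""
    canonical = "|".join(f"{k}={row[k]!r}" for k in sorted(row))
    return zlib.crc32(canonical.encode("utf-8"))


def sample_crcs(rows, key_field="id", rate=SAMPLE_RATE) -> dict:
    """Map key -> CRC for the sampled subset of rows."""
    return {
        str(r[key_field]): row_crc(r)
        for r in rows
        if is_sampled(str(r[key_field]), rate)
    }


def find_mismatches(source_rows, sink_rows, key_field="id", rate=SAMPLE_RATE):
    """Return keys of sampled source rows whose CRC differs at the sink,
    or which are missing from the sink entirely."""
    before = sample_crcs(source_rows, key_field, rate)
    after = sample_crcs(sink_rows, key_field, rate)
    return sorted(k for k, crc in before.items() if after.get(k) != crc)
```

The trade-off is coverage. If a fault corrupts k rows and rows are sampled at rate p, the chance that at least one corrupted row lands in the sample is roughly 1 - (1 - p)^k, assuming the hash spreads keys evenly. At p = 0.05 that is about 99% for a fault touching 100 rows, but only 5% for a single bad row, so sampling is better suited to catching systematic coercion than one-off damage.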

Risks, gaps, and constructive pushback
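
One gap worth flagging: a checksum, sampled or full, only detects changes between the two points where it is computed. If the int64 → float coercion happens while the API response is being parsed, before the first checksum is taken, both ends agree on the already-damaged value. A cheap type-aware guard at ingestion may cover that case at lower cost than full schema validation. The sketch below is a hypothetical illustration: coercion_issues, expected_types and max_lengths are made-up names, and the length check is only a heuristic, since a string that exactly fills its limit can be legitimate.

```python
from typing import Optional

MAX_SAFE_INT = 2**53  # above this magnitude, float64 cannot represent every integer


def coercion_issues(row: dict, expected_types: dict,
                    max_lengths: Optional[dict] = None) -> list:
    """Return human-readable warnings for fields that look coerced or truncated."""
    issues = []
    for field, expected in expected_types.items():
        value = row.get(field)
        if value is None:
            continue  # nullability is left to database constraints
        if expected is int and isinstance(value, float):
            note = f"{field}: int arrived as float"
            if abs(value) >= MAX_SAFE_INT:
                note += " (precision may already be lost)"
            issues.append(note)
        elif type(value) is not expected:
            issues.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Heuristic: a string that reaches its length limit may have been cut off.
    for field, limit in (max_lengths or {}).items():
        value = row.get(field)
        if isinstance(value, str) and len(value) >= limit:
            issues.append(
                f"{field}: length {len(value)} reaches limit {limit}, possible truncation"
            )
    return issues
```

For example, coercion_issues({"id": 9007199254740992.0}, {"id": int}) returns one warning for the id field, while a row whose id is a plain int returns an empty list. Because the check uses exact type comparison, a bool in an int field is also reported.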