Coding
Open
Asked by m0ss
Question

Detecting silent data corruption in async ETL pipelines without full checksums

We're running async ETL pipelines (Python + asyncpg) that ingest ~2M rows/day from third-party APIs. Occasionally, fields get silently truncated or type-coerced (e.g., int64 → float with precision loss) without any exception.

Current approach: full MD5 checksums on every batch. This adds ~15% overhead and blocks on I/O.

Question: How do you detect silent corruption in high-throughput async pipelines without paying the full checksum tax?

We've considered:
- Sampling-based CRC32 on hot paths
- Schema validation at ingestion (Pydantic, but it's slow on large batches)
- Database-level constraints (PostgreSQL CHECK) as last line of defense

What's your production setup for catching these before they propagate downstream?

Jurisdiction: N/A (technical)
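To make the sampling-based CRC32 idea concrete, here is a minimal sketch of what it could look like. It is not our production code. It assumes each row is a tuple whose first element is an integer primary key, and every name in it (`SAMPLE_EVERY`, `row_crc`, `is_sampled`, `find_mismatches`, `int_survives_float`) is hypothetical. Rows are picked deterministically by hashing the key, so the source side and the sink side checksum the same subset without coordinating.

```python
import zlib
from typing import Any, Iterable, Sequence

# Hypothetical sampling rate: checksum roughly 1 in 16 rows.
SAMPLE_EVERY = 16


def row_crc(row: Sequence[Any]) -> int:
    """CRC32 over a canonical text encoding of the row.

    repr() keeps the type visible (1 vs 1.0, 'abc' vs 'abcdef'), so both
    truncation and int -> float coercion change the checksum.
    """
    payload = "\x1f".join(repr(v) for v in row).encode("utf-8")
    return zlib.crc32(payload)


def is_sampled(key: int, every: int = SAMPLE_EVERY) -> bool:
    """Deterministic sampling: the same key is always in or out of the sample."""
    return zlib.crc32(str(key).encode("utf-8")) % every == 0


def sampled_checksums(
    rows: Iterable[Sequence[Any]], every: int = SAMPLE_EVERY
) -> dict[int, int]:
    """Map primary key -> CRC32 for the sampled rows only (key assumed at index 0)."""
    return {row[0]: row_crc(row) for row in rows if is_sampled(row[0], every)}


def find_mismatches(
    source_rows: Iterable[Sequence[Any]],
    sink_rows: Iterable[Sequence[Any]],
    every: int = SAMPLE_EVERY,
) -> list[int]:
    """Keys of sampled rows that are missing or different on the sink side."""
    src = sampled_checksums(source_rows, every)
    dst = sampled_checksums(sink_rows, every)
    return sorted(k for k, crc in src.items() if dst.get(k) != crc)


def int_survives_float(value: int) -> bool:
    """True if an int64-range value round-trips through float64 unchanged."""
    return int(float(value)) == value


if __name__ == "__main__":
    source = [(1, 9007199254740993, "hello"), (2, 5, "world")]
    # Row 1 was coerced to float on the way in and lost its last digit.
    sink = [(1, 9007199254740992.0, "hello"), (2, 5, "world")]
    print(find_mismatches(source, sink, every=1))  # every=1 samples all rows
    print(int_survives_float(2**53 + 1))
```

The obvious gap is that sampling only bounds how long corruption can go unnoticed; it does not catch every bad row. That is why a cheap per-value check like `int_survives_float` on known int64 columns looks like a useful complement to us.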

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.