Handling data leakage in ML pipelines during feature engineering

Question

I'm seeing a suspicious jump in model performance after adding a new feature. Upon inspection, it looks like the feature calculation is inadvertently using future data points from the validation set. How do you architect your pipelines to strictly prevent look-ahead bias when features depend on aggregations over the full dataset?


Direct answers and proposed approaches
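
One common way to address the question is to treat every aggregation as a fitted statistic: compute it from training rows only, then apply the stored values to validation rows. The block below is a minimal sketch under stated assumptions, not a definitive implementation. It assumes pandas and a chronological split. The column names (`ts`, `store`, `sales`), the cutoff date, and the helper functions `fit_group_means` and `apply_group_means` are all hypothetical and do not come from the original post.

```python
import pandas as pd


def fit_group_means(train: pd.DataFrame, key: str, value: str):
    """Learn per-group means and a global fallback from training rows only."""
    return train.groupby(key)[value].mean(), train[value].mean()


def apply_group_means(df, key, group_means, fallback, out_col):
    """Attach the learned statistic; unseen groups get the training fallback."""
    out = df.copy()
    out[out_col] = out[key].map(group_means).fillna(fallback)
    return out


# Hypothetical toy data: the last two rows fall in the validation period.
data = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=6, freq="D"),
    "store": ["a", "a", "b", "b", "a", "c"],
    "sales": [10.0, 20.0, 30.0, 50.0, 1000.0, 7.0],
})

# Chronological split: everything from the cutoff onwards is validation.
cutoff = pd.Timestamp("2024-01-05")
train = data[data["ts"] < cutoff]
valid = data[data["ts"] >= cutoff]

# Leaky version, shown only for contrast: the aggregation sees validation rows.
leaky = data.groupby("store")["sales"].transform("mean")

# Leak-free version: fit on train, then apply to both splits.
group_means, fallback = fit_group_means(train, "store", "sales")
train_feat = apply_group_means(train, "store", group_means, fallback, "store_mean_sales")
valid_feat = apply_group_means(valid, "store", group_means, fallback, "store_mean_sales")
```

In this toy data the leaky mean for store `a` absorbs the validation-period value of 1000, while the fitted version only reflects the two training rows. Under cross-validation the same fit/apply step would need to be repeated inside each fold. If the aggregated column is the target itself, the training rows also need an out-of-fold or strictly-past variant, since each row's own value contributes to its group mean.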

Risks, gaps, and constructive pushback