Measuring whether feature-flag experiments actually move the needle — what's your baseline?

Question

We have been running A/B tests behind feature flags for two years. The problem: most experiments show statistically significant results but the effect sizes are tiny, and the business impact is unclear.
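
To make the gap between significance and effect size concrete, here is a minimal sketch using only the Python standard library. The counts are made up for illustration (they are not our data), and the function names and the `min_effect` threshold are hypothetical. It runs a two-sided two-proportion z-test and then checks whether the lower bound of the 95% confidence interval clears a minimum effect we would actually care about.

```python
import math


def two_proportion_test(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-sided z-test for a difference in conversion rates.

    Returns (absolute difference, p-value, confidence interval).
    z_crit=1.96 corresponds to a 95% interval.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error, used for the test of "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability

    # Unpooled standard error, used for the interval around the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, p_value, ci


def ship_decision(p_value, ci, min_effect, alpha=0.05):
    """Separate 'statistically significant' from 'practically meaningful'.

    min_effect is a hypothetical business threshold in absolute terms
    (0.005 means half a percentage point of conversion).
    """
    significant = p_value < alpha
    meaningful = ci[0] >= min_effect  # even the pessimistic estimate clears the bar
    return significant, meaningful


# Illustrative numbers only: 1M users per arm, 10.0% vs 10.1% conversion.
diff, p_value, ci = two_proportion_test(100_000, 1_000_000, 101_000, 1_000_000)
significant, meaningful = ship_decision(p_value, ci, min_effect=0.005)
print(f"diff={diff:.4f} p={p_value:.4f} ci=({ci[0]:.5f}, {ci[1]:.5f})")
print(f"significant={significant} meaningful={meaningful}")
```

With arms this large, a difference of a tenth of a percentage point passes p < 0.05 but falls well short of a half-point threshold, which is the situation we keep ending up in.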

I am trying to establish a baseline for what constitutes a meaningful experiment result. Right now our primary metric is conversion rate, but the secondary metrics (retention, session duration, support tickets) often move in the opposite direction.

Questions:

1. Do you use sequential testing (always-valid p-values) to avoid peeking bias, or fixed-horizon tests only?
2. How do you handle the multiple-comparisons problem when one flag affects 5+ downstream metrics?
3. What minimum effect size do you consider worth shipping? We currently ship on p < 0.05 regardless of effect size.
4. Do you track the cost of running the experiment (engineering time, user exposure) vs. the expected lift?
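
For question 2, the simplest option I am aware of is a Holm-Bonferroni step-down correction across all metrics a flag touches. The sketch below is a minimal standard-library version, not something we run today. The p-values are invented for illustration and the function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values, keyed by metric name.

    p_values: dict mapping metric name -> raw p-value.
    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each is
    capped at 1.0.
    """
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        candidate = min(1.0, (m - rank) * p)
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted


# Invented p-values for the four metrics mentioned above.
raw = {
    "conversion": 0.004,
    "retention": 0.03,
    "session_duration": 0.04,
    "support_tickets": 0.20,
}
adjusted = holm_adjust(raw)
for name, p_adj in adjusted.items():
    verdict = "keep" if p_adj < 0.05 else "drop"
    print(f"{name}: raw={raw[name]:.3f} adjusted={p_adj:.3f} -> {verdict}")
```

In this made-up case three of the four metrics are below 0.05 before correction, but only conversion stays below 0.05 after it. Holm controls the family-wise error rate; teams that care more about the false discovery rate would presumably use a different procedure such as Benjamini-Hochberg.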

Looking for practical frameworks from teams that ship 20+ experiments per quarter.


Direct answers and proposed approaches

Risks, gaps, and constructive pushback