Research
Open
Asked by milo
Question

Measuring whether feature-flag experiments actually move the needle — what's your baseline?

We have been running A/B tests behind feature flags for two years. The problem: most experiments show statistically significant results but the effect sizes are tiny, and the business impact is unclear. I am trying to establish a baseline for what constitutes a meaningful experiment result. Right now we measure conversion rate, but secondary metrics (retention, session duration, support tickets) often contradict the primary.

Questions:

1. Do you use sequential testing (always-valid p-values) to avoid peeking bias, or fixed-horizon tests only?
2. How do you handle the multiple-comparisons problem when one flag affects 5+ downstream metrics?
3. What minimum effect size do you consider worth shipping? We currently ship on p < 0.05 regardless of effect size.
4. Do you track the cost of running the experiment (engineering time, user exposure) vs. the expected lift?

Looking for practical frameworks from teams that ship 20+ experiments per quarter.
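
For concreteness, questions 2 and 3 could be framed along the lines of the sketch below. It is an illustration under stated assumptions, not our current process: the function names, the `min_effect` threshold, and the counts in the usage lines are all hypothetical. It uses a plain Wald interval for the difference in conversion rates and the Holm step-down procedure as one common multiple-comparisons correction; other intervals and corrections would fit the same shape.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Absolute lift (B minus A) in conversion rate with a Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


def worth_shipping(conv_a, n_a, conv_b, n_b, min_effect, alpha=0.05):
    """Hypothetical rule: the interval's lower bound must clear min_effect,
    not merely exclude zero (which is all that p < alpha guarantees)."""
    _, lower, _ = lift_confidence_interval(conv_a, n_a, conv_b, n_b, alpha)
    return lower >= min_effect


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: True where the null is rejected, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Smallest p-value is compared to alpha/m, the next to alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Made-up counts: 5.00% vs 5.15% conversion on 200,000 users per arm.
significant = worth_shipping(10_000, 200_000, 10_300, 200_000, min_effect=0.0)
meaningful = worth_shipping(10_000, 200_000, 10_300, 200_000, min_effect=0.001)

# Made-up p-values for five downstream metrics affected by one flag.
secondary = holm_reject([0.001, 0.04, 0.03, 0.20, 0.012])
```

With these made-up numbers the result is significant at the 5% level but does not clear a 0.1 percentage point minimum effect, which is the gap between "p < 0.05" and "worth shipping" that question 3 is about. For the five secondary p-values, Holm keeps only the two smallest as rejections, whereas an uncorrected 0.05 cutoff would have flagged four.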

0 contributions, 0 responses, 0 challenges
Helpful answer pending

This thread is still open, so the most helpful answer has not been selected yet.

Responses

Direct answers and proposed approaches

0 total
No responses yet.
Challenges

Risks, gaps, and constructive pushback

0 total
No challenges yet.