Experimentation Glossary 2025: Terms Every Product Team Should Know
Published on February 24, 2026
by Zoë Oakes

Experimentation is no longer a growth hack; it's how modern product organizations make decisions. The teams moving fastest in 2025 share one thing: a common language for testing, statistics, and scalable experimentation.
At ABsmartly, we work with companies running hundreds of experiments per year. The difference between teams that learn and teams that guess usually comes down to how well they understand the fundamentals below.
This glossary is structured around the four pillars of modern experimentation:
Core Statistics
Modern Methods
Scaling Challenges
Product Strategy Layer
Pillar 1: Core Statistics
These terms form the foundation of trustworthy A/B testing.
A/B Testing
A controlled experiment comparing two versions of an experience (control vs. variant) to determine which performs better against a defined metric.
Also called split testing.
Conversion Rate
The percentage of users who complete a desired action.
Formula:
\[
\text{Conversion Rate} = \frac{\text{Conversions}}{\text{Users}} \times 100
\]
Lift (Uplift)
The percentage improvement of a variant over control.
\[
\text{Lift} = \frac{\text{Variant} - \text{Control}}{\text{Control}} \times 100
\]
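The two formulas above can be computed directly. A minimal sketch in Python, using hypothetical example numbers:

```python
def conversion_rate(conversions: int, users: int) -> float:
    """Conversion rate as a percentage: (conversions / users) * 100."""
    return conversions / users * 100

def lift(variant_rate: float, control_rate: float) -> float:
    """Relative improvement of the variant over control, as a percentage."""
    return (variant_rate - control_rate) / control_rate * 100

# Example: control converts 200 of 10,000 users; the variant converts 230.
control = conversion_rate(200, 10_000)   # 2.0%
variant = conversion_rate(230, 10_000)   # 2.3%
print(f"Lift: {lift(variant, control):.1f}%")  # Lift: 15.0%
```

Note that lift is relative: the absolute difference here is only 0.3 percentage points, but the relative improvement is 15%.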
Statistical Significance
Indicates that an observed result is unlikely to be caused by random variation alone, based on a predefined threshold.
Significance does not mean the result is large, important, or permanent.
P-Value
The probability of observing results at least as extreme as those measured, assuming no real effect exists.
Common threshold: p < 0.05
Confidence Level
The complement of the significance threshold: a 95% confidence level corresponds to p < 0.05. It describes how often the testing procedure avoids a false positive, not the probability that a specific result is real.
Typical levels:
90% (exploratory)
95% (standard)
99% (high-risk decisions)
Power (Statistical Power)
The probability that an experiment detects a real effect if it exists.
Low-powered tests lead to missed wins and unreliable conclusions.
Sample Size
The number of users required to detect an effect with adequate power and confidence.
Minimum Detectable Effect (MDE)
The smallest effect size an experiment is designed to reliably detect.
A non-significant result may simply mean the effect is smaller than the MDE.
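Sample size, MDE, power, and confidence level are all connected by one formula. As an illustrative sketch (not ABsmartly's planner), the standard two-proportion sample size calculation, assuming a two-sided test and equal group sizes:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, mde_relative: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect a relative MDE on a
    conversion rate, via the two-proportion z-test formula."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Detecting a 10% relative lift on a 5% baseline at 95% confidence, 80% power:
print(sample_size_per_variant(0.05, 0.10))  # roughly 31,000 users per variant
```

Halving the MDE roughly quadruples the required sample size, which is why chasing tiny effects is so expensive.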
Type I Error (False Positive)
Concluding a change works when it does not.
Type II Error (False Negative)
Failing to detect a real improvement.
Pillar 2: Modern Methods
These techniques help teams move faster without sacrificing statistical rigor.
Bayesian Statistics
A framework that expresses results as probabilities and updates beliefs as data arrives.
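As an illustrative sketch (not ABsmartly's internal method), a Beta-Binomial model turns the raw counts from the earlier example into a direct statement like "probability the variant beats control", estimated here by Monte Carlo sampling with uniform priors:

```python
import random

def prob_variant_beats_control(conv_a: int, n_a: int, conv_b: int, n_b: int,
                               draws: int = 100_000, seed: int = 42) -> float:
    """Monte Carlo estimate of P(variant rate > control rate) under
    uniform Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)  # control posterior
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)  # variant posterior
        wins += rate_b > rate_a
    return wins / draws

# Control: 200/10,000 conversions; variant: 230/10,000.
p_beat = prob_variant_beats_control(200, 10_000, 230, 10_000)
print(f"P(variant > control) = {p_beat:.2f}")
```

The output is a probability a stakeholder can act on directly, rather than a p-value that is easy to misread.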
CUPED (Controlled Experiments Using Pre-Experiment Data)
A variance reduction method using historical data to make experiments more sensitive and faster.
Sequential Testing
Analyzing results while the test is running without inflating false positives — when done correctly.
Group Sequential Testing
A structured form of sequential testing with planned interim analyses and statistical stopping boundaries.
Allows early stopping for:
Clear wins
Futility
Safety
Maintains control over false positive rates.
Fixed Horizon Testing
The traditional approach where:
Sample size is pre-defined
No interim analysis occurs
Results are analyzed only after completion
Peeking early invalidates results.
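To see why uncorrected peeking invalidates results, a small A/A simulation (illustrative only, with arbitrary parameters) compares the false positive rate with and without interim looks:

```python
import random
from statistics import NormalDist

def z_test_p(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    p_pool = (succ_a + succ_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (succ_a / n_a - succ_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(runs: int = 400, n: int = 2_000, looks: int = 5, seed: int = 1):
    rng = random.Random(seed)
    peeking_fp = fixed_fp = 0
    for _ in range(runs):
        # A/A test: both arms have the same 5% true conversion rate.
        a = [rng.random() < 0.05 for _ in range(n)]
        b = [rng.random() < 0.05 for _ in range(n)]
        checkpoints = [n * k // looks for k in range(1, looks + 1)]
        ps = [z_test_p(sum(a[:c]), c, sum(b[:c]), c) for c in checkpoints]
        peeking_fp += any(p < 0.05 for p in ps)  # stop at the first "significant" look
        fixed_fp += ps[-1] < 0.05                # analyze only once, at the end
    return peeking_fp / runs, fixed_fp / runs

peek_rate, fixed_rate = simulate()
print(peek_rate, fixed_rate)  # peeking inflates the false positive rate well above 5%
```

Even though neither arm is better, checking five times and stopping at the first significant look roughly doubles or triples the false positive rate compared with a single final analysis. Group sequential methods fix this by widening the significance boundaries at each interim look.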
Multivariate Testing (MVT)
Testing multiple elements and combinations at once to detect interaction effects.
Counterfactual
What would have happened if the experiment had not been run. Control groups approximate this.
Pillar 3: Scaling Challenges
These terms matter once teams run many experiments simultaneously.
Allocation (Traffic Allocation)
The percentage of users assigned to each variant.
Experiment Collision
When multiple experiments affect the same users and interfere with each other’s results.
Randomization Unit
The entity being randomized:
User
Session
Account
Device
Incorrect choice causes bias.
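Whatever unit is chosen, assignment must be deterministic so the same unit always sees the same variant. A common sketch (a hypothetical helper, not ABsmartly's assignment code) hashes the unit ID together with the experiment name:

```python
import hashlib

def assign_variant(experiment: str, unit_id: str,
                   variants: tuple[str, ...] = ("control", "variant")) -> str:
    """Deterministically map a randomization unit to a variant by hashing
    the experiment name together with the unit ID."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same unit always lands in the same variant for a given experiment:
print(assign_variant("new-checkout", "user-42") ==
      assign_variant("new-checkout", "user-42"))  # True
```

Including the experiment name in the hash also decorrelates assignments across experiments, which reduces the systematic overlap that causes experiment collisions.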
Multiple Testing (Multiple Comparisons Problem)
The increased false positive risk when many tests or metrics are evaluated.
False Discovery Rate (FDR)
A statistical method to control the proportion of false positives across many tests.
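The Benjamini-Hochberg procedure is the classic way to control FDR. A compact sketch, shown with made-up p-values:

```python
def benjamini_hochberg(p_values: list[float], fdr: float = 0.05) -> list[int]:
    """Return indices of the p-values that remain significant after
    controlling the false discovery rate at the given level."""
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(ranked, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank  # largest rank passing the BH criterion
    return sorted(ranked[:cutoff])

# Ten metrics evaluated at once: only the strongest results survive correction.
ps = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(ps))  # [0, 1]
```

Note that p-values just under 0.05 (like 0.039 here) no longer count as discoveries once ten metrics are tested, which is exactly the multiple comparisons problem being controlled.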
Guardrail Metrics
Metrics used to ensure an experiment does not harm critical areas (e.g., performance, churn).
Holdout Group
A persistent control group excluded from experiments to measure long-term program impact.
Novelty Effect
Temporary performance spikes caused by newness rather than true improvement.
Winner’s Curse
Early winning experiments often overestimate true effect size due to random variation.
Feature Flag
A system for enabling or disabling features for specific users without deploying new code.
Experimentation Platform
Software that manages randomization, analysis, and experiment safety at scale, like ABsmartly.
Pillar 4: Product Strategy Layer
Experimentation is not just statistics; it's how product teams learn.
Hypothesis
A testable prediction:
If we change X, then Y metric will change because Z reason.
North Star Metric
The primary long-term value metric guiding product strategy.
Metric Sensitivity
How quickly a metric responds to changes.
Heterogeneous Treatment Effects (HTE)
When an experiment affects different user segments differently.
Experiment Velocity
How quickly a team can design, launch, and conclude experiments.
A key signal of experimentation maturity.
FAQ
What is A/B testing?
A/B testing is a controlled experiment comparing two versions of a product experience to determine which performs better.
What is the difference between fixed horizon and sequential testing?
Fixed horizon tests analyze results only after completion. Sequential testing allows interim analysis with statistical correction.
What is MDE in experimentation?
Minimum Detectable Effect (MDE) is the smallest improvement an experiment is designed to detect reliably.
Why are guardrail metrics important?
They ensure improvements in one metric do not cause hidden harm elsewhere.
Final Takeaway
Teams that master this language avoid false wins, reduce risk, and learn faster. Those that don’t often mistake noise for insight.
At ABsmartly, we build experimentation infrastructure that enforces statistical rigor while supporting high experiment velocity, because modern product development runs on learning, not guesses.