Experimentation Glossary 2025: Terms Every Product Team Should Know
The Four Pillars of Modern Experimentation
Experimentation is no longer just for “growth hacking.” It’s how modern product organizations make high-quality decisions that get them to their goals faster. The teams that make progress the fastest share one thing: a common language for testing, statistics, and scalable experimentation.
At ABsmartly, we work with companies running hundreds of experiments per year. And we’ve learned that the difference between teams that learn and teams that guess usually comes down to how well they understand the fundamentals below.
This glossary is structured around the four pillars of modern experimentation:
- Core Statistics
- Modern Statistical Methods
- Scaling Challenges
- Product Strategy Layer
Pillar 1: Core Statistics
These foundational terms come up constantly on teams that practice trustworthy A/B testing.
A/B Testing
An A/B test is a controlled experiment that compares two versions of an experience (control vs. variant) to learn if one performs better than the other against a specific metric.
A/B testing is also commonly known as “split testing.”
Conversion Rate
The percentage of users who complete a specific action, such as making a purchase, signing up for an email list, or booking an appointment.
This is the formula to calculate the conversion rate:
Conversion Rate = (Users who completed the action ÷ Total users) × 100
Lift (Uplift)
The percentage improvement of a treatment over the control.
Here’s the formula to calculate lift:
Lift (%) = ((Variant conversion rate − Control conversion rate) ÷ Control conversion rate) × 100
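To make both formulas concrete, here’s a minimal Python sketch with made-up numbers:

```python
# Illustrative numbers only: 10,000 users per group.
control_conversions, control_users = 500, 10_000
variant_conversions, variant_users = 560, 10_000

control_rate = control_conversions / control_users   # 5.0%
variant_rate = variant_conversions / variant_users   # 5.6%

lift = (variant_rate - control_rate) / control_rate * 100
print(f"Control: {control_rate:.1%}, Variant: {variant_rate:.1%}, Lift: {lift:.1f}%")
# Control: 5.0%, Variant: 5.6%, Lift: 12.0%
```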
Statistical Significance
Indicates that an observed result is unlikely to be caused by random chance alone, based on a predefined threshold.
Significance does not mean the result is large, important, or permanent.
P-Value (a.k.a. Probability Value)
The probability of observing results at least as extreme as those measured, assuming no real effect exists. The most common threshold is p < 0.05, and the smaller the p-value, the stronger the evidence that the effect is real.
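As an illustration, here’s how a p-value for the conversion numbers above could be computed with statsmodels (a common Python statistics library; the counts are made up):

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [500, 560]        # control, variant
visitors = [10_000, 10_000]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"p-value: {p_value:.3f}")
# About 0.058 here: suggestive, but not significant at the 0.05 threshold.
```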
Confidence Level
How demanding your significance threshold is, expressed as 1 − α. At a 95% confidence level, a test of a change with no real effect will falsely reach significance at most 5% of the time.
Typical levels many product teams use are:
- 90% (exploratory)
- 95% (standard)
- 99% (high-risk decisions)
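Confidence levels pair with confidence intervals: at 95% confidence, the interval construction captures the true value 95% of the time. A quick sketch using statsmodels (counts are illustrative):

```python
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=500, nobs=10_000, alpha=0.05, method="wilson")
print(f"95% CI for a 5.0% conversion rate: [{low:.2%}, {high:.2%}]")
# Roughly [4.6%, 5.4%] with 10,000 users.
```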
Power (a.k.a. Statistical Power, 1 − β)
The probability that an experiment detects a real effect (if one exists). Note that β itself is the Type II error rate; power is its complement.
Under-powered tests lead to missed wins and unreliable conclusions. A common target when designing experiments is 80%.
Sample Size
The number of users needed to detect an effect with enough power and confidence.
Minimum Detectable Effect (a.k.a. MDE)
The smallest effect size an experiment is designed to reliably detect.
Good to Know: A non-significant result doesn’t necessarily mean that an effect doesn’t exist. It may simply mean the effect is smaller than the MDE.
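Power, sample size, and MDE are planned together before a test starts. Here’s a sketch of a standard sample size calculation using statsmodels (the baseline rate and MDE below are assumptions for illustration):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                      # assumed 5% baseline conversion rate
mde_relative = 0.10                  # want to detect a 10% relative lift

effect = proportion_effectsize(baseline, baseline * (1 + mde_relative))
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.80)
print(f"~{n:,.0f} users per variant")
# About 15,600 per variant for these inputs.
```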
Type I Error (False Positive)
When a test leads you to conclude that a real effect exists when—in fact—it does not.
Type II Error (False Negative)
When a test fails to detect a real effect.
Pillar 2: Modern Statistical Methods
These techniques are commonly used in product experimentation programs.
Bayesian Statistics
A framework that expresses results as probabilities and updates beliefs as data arrives.
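For intuition, here’s a minimal Beta-Binomial sketch in Python (a common Bayesian approach for conversion rates; the counts are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Beta(1, 1) priors updated with observed conversions / non-conversions.
control = rng.beta(1 + 500, 1 + 9_500, size=100_000)   # 500 of 10,000
variant = rng.beta(1 + 560, 1 + 9_440, size=100_000)   # 560 of 10,000

print(f"P(variant beats control) ≈ {(variant > control).mean():.0%}")
# About 97% for these counts.
```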
CUPED (Controlled Experiments Using Pre-Experiment Data)
A variance reduction method that uses historical data to make experiments more sensitive and faster.
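The core adjustment is only a few lines. A minimal sketch, assuming a pre-experiment covariate such as each user’s spend before the test:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Adjust in-experiment metric y using pre-experiment covariate x."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# The more y correlates with x, the more variance the adjustment removes,
# which shrinks confidence intervals without biasing the treatment effect.
```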
Sequential Testing
A method that allows experimenters to analyze results while the test runs without inflating false positives (when done correctly).
Group Sequential Testing (a.k.a. GST)
A structured form of sequential testing with planned interim analyses and statistical stopping boundaries. It allows early stopping for:
- Clear wins
- Futility
- Safety
It also maintains control over false positive rates.
Fixed Horizon Testing
A traditional, rigorous approach where:
- Sample size is pre-defined
- No interim analysis occurs
- Analysis happens only after test completion
Peeking at the results early invalidates the test’s statistical guarantees, as the simulation below shows.
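To see why peeking is dangerous, here’s a small simulation of A/A tests (no real effect exists, so any “win” is a false positive) where the experimenter checks for significance after every 1,000 users:

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, peeks, false_positives = 2_000, 10, 0

for _ in range(n_sims):
    a, b = rng.normal(size=10_000), rng.normal(size=10_000)
    for i in range(1, peeks + 1):
        n = i * 1_000
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:            # "significant" at the 0.05 threshold
            false_positives += 1
            break

print(f"False positive rate with 10 peeks: {false_positives / n_sims:.1%}")
# Well above the nominal 5%: typically 15-20% in runs like this.
```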
Multivariate Test (MVT)
A test that changes multiple elements and analyzes different combinations at once to detect interaction effects. Like an A/B test, it’s also a type of controlled experiment.
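A toy sketch of why MVTs need far more traffic than A/B tests: every combination of elements becomes its own cell, and each cell needs its own sample.

```python
from itertools import product

headlines = ["current headline", "new headline"]
buttons = ["blue button", "green button"]

# Each combination is a separate cell that must be measured on its own.
for i, cell in enumerate(product(headlines, buttons), start=1):
    print(f"Cell {i}: {cell}")
# 2 elements with 2 variations each = 4 cells; 3x3 would already be 9.
```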
Pillar 3: Scaling Challenges
These terms matter more as product teams mature: running many experiments simultaneously, tackling more complex tests, and holding reporting to a higher standard of accuracy.
Allocation (Traffic Allocation)
The percentage of users assigned to each variant. A 50/50 split on 100% of traffic is the most common setup.
Interaction Effect (a.k.a. Experiment Collision)
When two or more experiments affect the same users in a way that impacts each other’s results.
Randomization Unit
The entity whose treatment assignment is randomized. Common randomization units are:
- User
- Session
- Account
- Device
Choosing a poor randomization unit can bias your results.
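A common way to implement user-level randomization is deterministic hashing, so the same user always lands in the same variant. A minimal sketch (the function and experiment names are illustrative, not any specific platform’s API):

```python
import hashlib

def assign_variant(experiment: str, user_id: str) -> str:
    """Deterministic 50/50 assignment: same user, same variant, every time."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 100 < 50 else "variant"

print(assign_variant("new_checkout", "user_42"))  # stable across calls
```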
False Discovery Rate (FDR)
Put simply, a False Discovery Rate is the percentage of your “winning” experiments that are actually statistical flukes.
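When you run many tests, controlling the FDR matters. Here’s a sketch of the standard Benjamini-Hochberg procedure via statsmodels (the p-values are made up):

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.012, 0.034, 0.047, 0.21]
rejected, adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(list(zip(adjusted.round(3), rejected)))
# Only the first two results survive at a 5% false discovery rate.
```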
Guardrail Metrics
Metrics used to ensure an experiment doesn’t harm anything important (e.g., performance, churn rate, customer service contacts, cancellations). They help teams spot unintended side effects of a change that looks like a win on its primary metric.
Holdout Group
A persistent control group excluded from experiments to measure long-term experimentation program impact.
Novelty Effect
Temporary performance spikes caused by newness rather than true improvement. For example, if you put a new piece of UI on the screen, some people will interact with it just to understand what it does—not to get any value from it.
Winner’s Curse
Winning experiments tend to overestimate the true effect size, because variants that happened to get a lucky random draw are more likely to cross the significance threshold.
Feature Flag
A way to enable (or disable) features for specific users without deploying new (or reverting) code.
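A toy sketch of the idea (real flagging systems add remote configuration, targeting rules, and gradual rollouts; all names here are hypothetical):

```python
# Hypothetical flag config: the feature is on, but only for certain plans.
FLAGS = {"new_checkout": {"enabled": True, "allowed_plans": {"pro", "team"}}}

def is_enabled(flag: str, user_plan: str) -> bool:
    config = FLAGS.get(flag)
    return bool(config and config["enabled"] and user_plan in config["allowed_plans"])

print(is_enabled("new_checkout", "pro"))   # True
print(is_enabled("new_checkout", "free"))  # False: feature stays hidden
```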
Experimentation Platform
Software that manages randomization, statistical analysis, documentation, governance, and experiment safety at scale.
Pillar 4: Product Strategy Layer
Experimentation isn’t just statistics—it’s how product teams learn. So, testing thoughtfully to answer important questions about customer behavior and business impact becomes a core focus. (Instead of, for example, haphazardly optimizing based on “best practices.”)
Hypothesis
A testable prediction.
For example: “If we do X, then Y metric will be impacted because Z reason.”
North Star Metric
The primary long-term value metric that guides a product strategy.
Metric Sensitivity
How quickly and easily a metric responds to changes.
Heterogeneous Treatment Effects (HTE)
When an experiment affects different user segments differently.
Experiment Velocity
How quickly a team can design, launch, and conclude experiments. It’s a key signal of experimentation maturity: running many tests quickly shows the process has been streamlined and made cheap to repeat.
Final Takeaway
Teams that master this language avoid false wins, reduce risk, and learn faster. Those that don’t often mistake noise for insight.
At ABsmartly, we build experimentation infrastructure that enforces statistical rigor while supporting high experiment velocity, because modern product development runs on learning—not guesses.
Want to learn more? Get a demo.