Experimentation Glossary 2025: Terms Every Product Team Should Know

Published on February 24, 2026

by Zoë Oakes

Experimentation is no longer a growth hack; it's how modern product organizations make decisions. The teams moving fastest in 2025 share one thing: a common language for testing, statistics, and scalable experimentation.

At ABsmartly, we work with companies running hundreds of experiments per year. The difference between teams that learn and teams that guess usually comes down to how well they understand the fundamentals below.

This glossary is structured around the four pillars of modern experimentation:

  1. Core Statistics

  2. Modern Methods

  3. Scaling Challenges

  4. Product Strategy Layer

Pillar 1: Core Statistics

These terms form the foundation of trustworthy A/B testing.

A/B Testing

A controlled experiment comparing two versions of an experience (control vs. variant) to determine which performs better against a defined metric.

Also called split testing.

Conversion Rate

The percentage of users who complete a desired action.

Formula:
Conversion Rate = (Conversions ÷ Users) × 100

Lift (Uplift)

The percentage improvement of a variant over control.

Lift = ((Variant − Control) ÷ Control) × 100

Statistical Significance

Indicates that an observed result is unlikely to be caused by random variation alone, based on a predefined threshold.

Significance does not mean the result is large, important, or permanent.

P-Value

The probability of observing results at least as extreme as those measured, assuming no real effect exists.

Common threshold: p < 0.05
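For two conversion rates, a p-value is commonly computed with a pooled two-proportion z-test. A minimal standard-library sketch (the counts are illustrative, not from any real experiment):

```python
import math

def two_proportion_p_value(conversions_a, users_a, conversions_b, users_b):
    """Two-sided p-value from a pooled two-proportion z-test
    (normal approximation)."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / se
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
```

With 500 vs. 580 conversions out of 10,000 users each, the p-value comes out below 0.05, so the difference would clear the common threshold.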

Confidence Level

The complement of the significance threshold (1 − α). At 95% confidence, intervals constructed this way would capture the true effect in 95% of repeated experiments; it is not the probability that any single observed result is correct.

Typical levels:

  • 90% (exploratory)

  • 95% (standard)

  • 99% (high-risk decisions)

Power (Statistical Power)

The probability that an experiment detects a real effect if it exists.

Low-powered tests lead to missed wins and unreliable conclusions.

Sample Size

The number of users required to detect an effect with adequate power and confidence.

Minimum Detectable Effect (MDE)

The smallest effect size an experiment is designed to reliably detect.

A non-significant result may simply mean the effect is smaller than the MDE.
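Sample size, power, confidence, and MDE are linked by one standard formula for comparing two proportions. A sketch using the usual normal approximation (the baseline rate and relative MDE below are assumed example inputs):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(base_rate, relative_mde, alpha=0.05, power=0.8):
    """Users needed per variant to detect a relative lift of `relative_mde`
    over `base_rate` with a two-sided test (normal approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    spread = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * spread / (p2 - p1) ** 2)
```

A 5% baseline with a 10% relative MDE needs roughly 31,000 users per variant; halving the MDE roughly quadruples the requirement, which is why small effects are expensive to detect.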

Type I Error (False Positive)

Concluding a change works when it does not.

Type II Error (False Negative)

Failing to detect a real improvement.

Pillar 2: Modern Methods

These techniques help teams move faster without sacrificing statistical rigor.

Bayesian Statistics

A framework that expresses results as probabilities and updates beliefs as data arrives.
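A minimal sketch of the Bayesian approach for conversion rates, assuming uniform Beta(1, 1) priors and made-up counts: it estimates the probability that the variant's true rate beats control's by sampling from each posterior.

```python
import random

def prob_variant_beats_control(conv_a, n_a, conv_b, n_b,
                               draws=20_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate) under
    independent Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        theta_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)  # control posterior
        theta_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)  # variant posterior
        wins += theta_b > theta_a
    return wins / draws
```

The output reads naturally as "there is an X% chance the variant is better," which many teams find easier to act on than a p-value.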

CUPED (Controlled Experiments Using Pre-Experiment Data)

A variance reduction method using historical data to make experiments more sensitive and faster.
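One common way to implement CUPED, sketched here with the standard least-squares theta (covariance of metric and covariate over the covariate's variance); the covariate is typically the same metric measured before the experiment started:

```python
from statistics import fmean, variance

def cuped_adjust(metric, covariate):
    """Subtract from each metric value the part predicted by a
    pre-experiment covariate; theta = cov(metric, covariate) / var(covariate)."""
    x_mean, y_mean = fmean(covariate), fmean(metric)
    cov = sum((x - x_mean) * (y - y_mean)
              for x, y in zip(covariate, metric)) / (len(metric) - 1)
    theta = cov / variance(covariate)
    return [y - theta * (x - x_mean) for y, x in zip(metric, covariate)]
```

The adjusted values keep the same mean as the originals, but their variance shrinks in proportion to how well the covariate predicts the metric, so the same experiment reaches significance with fewer users.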

Sequential Testing

Analyzing results while the test is running without inflating false positives — when done correctly.

Group Sequential Testing

A structured form of sequential testing with planned interim analyses and statistical stopping boundaries.

Allows early stopping for:

  • Clear wins

  • Futility

  • Safety

Maintains control over false positive rates.

Fixed Horizon Testing

The traditional approach where:

  • Sample size is pre-defined

  • No interim analysis occurs

  • Results are analyzed only after completion

Peeking early invalidates results.
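To see why, here is a small hypothetical A/A simulation (pure noise, no real effect): checking an unadjusted z-statistic against the nominal 1.96 cutoff at several interim looks inflates the false-positive rate well above the intended 5%.

```python
import math
import random

def false_positive_rate(peeks, sims=2000, n=1000, seed=0):
    """A/A simulation: share of null experiments wrongly declared
    significant when |z| > 1.96 is checked at `peeks` evenly spaced looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        checkpoints = {n * k // peeks for k in range(1, peeks + 1)}
        running_sum = 0.0
        for i in range(1, n + 1):
            running_sum += rng.gauss(0, 1)   # per-user difference under the null
            if i in checkpoints and abs(running_sum / math.sqrt(i)) > 1.96:
                hits += 1
                break
    return hits / sims
```

With one look at the end the rate stays near 5%; with ten interim looks it roughly triples, which is exactly the problem sequential methods are designed to correct.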

Multivariate Testing (MVT)

Testing multiple elements and combinations at once to detect interaction effects.

Counterfactual

What would have happened if the experiment had not been run. Control groups approximate this.

Pillar 3: Scaling Challenges

These terms matter once teams run many experiments simultaneously.

Allocation (Traffic Allocation)

The percentage of users assigned to each variant.

Experiment Collision

When multiple experiments affect the same users and interfere with each other’s results.

Randomization Unit

The entity being randomized:

  • User

  • Session

  • Account

  • Device

Choosing the wrong unit biases results; for example, randomizing by session while measuring per-user metrics lets the same user land in both variants.
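Assignment is usually made deterministic by hashing the randomization unit together with the experiment name, so the same unit always lands in the same variant without storing any state. A hypothetical sketch:

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants=("control", "variant")) -> str:
    """Deterministic bucketing: hash the randomization unit with the
    experiment name so the same unit always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return variants[int(bucket * len(variants))]
```

Salting the hash with the experiment name keeps assignments independent across experiments, so a user in the variant of one test is not systematically in the variant of another.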

Multiple Testing (Multiple Comparisons Problem)

The increased false positive risk when many tests or metrics are evaluated.

False Discovery Rate (FDR)

A statistical method to control the proportion of false positives across many tests.

Guardrail Metrics

Metrics used to ensure an experiment does not harm critical areas (e.g., performance, churn).

Holdout Group

A persistent control group excluded from experiments to measure long-term program impact.

Novelty Effect

Temporary performance spikes caused by newness rather than true improvement.

Winner’s Curse

Early winning experiments often overestimate true effect size due to random variation.

Feature Flag

A system for enabling or disabling features for specific users without deploying new code.

Experimentation Platform

Software that manages randomization, analysis, and experiment safety at scale, like ABsmartly.

Pillar 4: Product Strategy Layer

Experimentation is not just statistics; it's how product teams learn.

Hypothesis

A testable prediction:

If we change X, then Y metric will change because Z reason.

North Star Metric

The primary long-term value metric guiding product strategy.

Metric Sensitivity

How quickly a metric responds to changes.

Heterogeneous Treatment Effects (HTE)

When an experiment affects different user segments differently.

Experiment Velocity

How quickly a team can design, launch, and conclude experiments.

A key signal of experimentation maturity.

FAQ 

What is A/B testing?
A/B testing is a controlled experiment comparing two versions of a product experience to determine which performs better.

What is the difference between fixed horizon and sequential testing?
Fixed horizon tests analyze results only after completion. Sequential testing allows interim analysis with statistical correction.

What is MDE in experimentation?
Minimum Detectable Effect (MDE) is the smallest improvement an experiment is designed to detect reliably.

Why are guardrail metrics important?
They ensure improvements in one metric do not cause hidden harm elsewhere.

Final Takeaway

Teams that master this language avoid false wins, reduce risk, and learn faster. Those that don’t often mistake noise for insight.

At ABsmartly, we build experimentation infrastructure that enforces statistical rigor while supporting high experiment velocity, because modern product development runs on learning, not guesses.