Experimentation Glossary 2025: Terms Every Product Team Should Know

The Four Pillars of Modern Experimentation

Experimentation is no longer just for “growth hacking.” It’s how modern product organizations make high-quality decisions that get them to their goals faster. The teams that make progress the fastest share one thing: a common language for testing, statistics, and scalable experimentation.

At ABsmartly, we work with companies running hundreds of experiments per year. And we’ve learned that the difference between teams that learn and teams that guess usually comes down to how well they understand the fundamentals below.

This glossary is structured around the four pillars of modern experimentation:

  1. Core Statistics
  2. Modern Statistical Methods
  3. Scaling Challenges
  4. Product Strategy Layer

Pillar 1: Core Statistics

These are the foundational terms used by teams that practice trustworthy A/B testing.

A/B Testing

An A/B test is a controlled experiment that compares two versions of an experience (control vs. variant) to learn if one performs better than the other against a specific metric.

A/B testing is also commonly known as “split testing.”

Conversion Rate

The percentage of users who complete a specific action, such as making a purchase, signing up for an email list, or booking an appointment.

This is the formula to calculate the conversion rate:

Conversion Rate (%) = (Users who completed the action ÷ Total users) × 100
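As a quick illustration (the function name and numbers here are ours, not from the article), the calculation in Python:

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Percentage of visitors who completed the action."""
    return conversions * 100 / visitors

# 120 purchases out of 4,000 visitors
print(conversion_rate(120, 4000))  # 3.0
```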

Lift (Uplift)

The percentage improvement of a treatment over the control.

Here’s the formula to calculate lift:

Lift (%) = ((Treatment conversion rate − Control conversion rate) ÷ Control conversion rate) × 100
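In code, that works out to (a minimal Python sketch with example numbers of our own):

```python
def lift(control_rate: float, treatment_rate: float) -> float:
    """Relative improvement of the treatment over the control, in percent."""
    return (treatment_rate - control_rate) / control_rate * 100

# Going from a 4.0% to a 4.6% conversion rate:
print(lift(0.040, 0.046))  # ≈ 15% relative lift
```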

Statistical Significance

Indicates that an observed result is unlikely to be caused by random chance alone, based on a predefined threshold.

Significance does not mean the result is large, important, or permanent.

P-Value (a.k.a. Probability Value)

The probability of observing results at least as extreme as those measured, assuming no real effect exists. The most common significance threshold is p < 0.05. The smaller the p-value, the stronger the evidence against the assumption of no effect.
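One common way to compute a p-value for conversion rates is a two-proportion z-test. Here’s a sketch using only Python’s standard library (the function name and example numbers are ours; your experimentation platform likely uses a more sophisticated method):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a two-proportion z-test (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Probability of a result at least this extreme if no real effect exists
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 4.0% vs. 4.6% conversion, 10,000 users per variant
p = two_proportion_p_value(400, 10_000, 460, 10_000)
print(f"p = {p:.4f}")  # below 0.05, so significant at the 95% level
```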

Confidence Level

Informally, how confident you can be that a significant result isn’t a fluke. Formally, it equals 1 − α, where α is the significance threshold (so a 95% confidence level corresponds to p < 0.05).

Typical levels many product teams use are:

  • 90% (exploratory)
  • 95% (standard)
  • 99% (high-risk decisions)

Power (a.k.a. Statistical Power, or 1 − β)

The probability that an experiment detects a real effect (if it exists).

Under-powered tests lead to missed wins and unreliable conclusions. A common target when designing experiments is 80% power.

Sample Size

The number of users needed to detect an effect with enough power and confidence.

Minimum Detectable Effect (a.k.a. MDE)

The smallest effect size an experiment is designed to reliably detect.

Good to Know: A non-significant result doesn’t necessarily mean that an effect doesn’t exist. It may simply mean the effect is smaller than the MDE.
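Sample size, MDE, power, and confidence all come together in a standard sample size calculation. Here’s a sketch using a common normal approximation for comparing two conversion rates (the function name and example numbers are ours):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, mde_relative, alpha=0.05, power=0.80):
    """Approximate users per variant to detect a relative MDE with given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p = baseline_rate
    delta = p * mde_relative                       # absolute effect to detect
    n = 2 * (z_alpha + z_power) ** 2 * p * (1 - p) / delta ** 2
    return ceil(n)

# Detecting a 10% relative lift on a 4% baseline at 95% confidence, 80% power
print(sample_size_per_variant(0.04, 0.10))  # tens of thousands per variant
```

Notice how quickly the required sample size grows as the MDE shrinks: halving the MDE roughly quadruples the users you need.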

Type I Error (False Positive)

When a test leads you to conclude that a real effect exists when—in fact—it does not.

Type II Error (False Negative)

When a test fails to detect a real effect.

Pillar 2: Modern Statistical Methods

These techniques are commonly used in product experimentation programs.

Bayesian Statistics

A framework that expresses results as probabilities and updates beliefs as data arrives.

CUPED (Controlled Experiments Using Pre-Experiment Data)

A variance reduction method that uses historical data to make experiments more sensitive and faster.
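Here’s a minimal sketch of the CUPED adjustment, assuming a numeric metric and a correlated pre-experiment covariate (this is the textbook adjustment, not any specific platform’s implementation):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: adjust in-experiment metric y using pre-experiment covariate x.

    theta is chosen to minimize the variance of the adjusted metric.
    """
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

rng = np.random.default_rng(0)
pre = rng.normal(10, 2, 5_000)        # pre-experiment metric per user
post = pre + rng.normal(0, 1, 5_000)  # correlated in-experiment metric
adjusted = cuped_adjust(post, pre)
print(np.var(post), np.var(adjusted))  # adjusted variance is much smaller
```

Lower variance means narrower confidence intervals, which is why CUPED lets you detect the same effect with fewer users.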

Sequential Testing

A method that allows experimenters to analyze results while the test runs without inflating false positives (when done correctly).

Group Sequential Testing (a.k.a. GST)

A structured form of sequential testing with planned interim analyses and statistical stopping boundaries. It allows early stopping for:

  • Clear wins
  • Futility
  • Safety

It also maintains control over false positive rates.

Fixed Horizon Testing

A traditional, rigorous approach where:

  • Sample size is pre-defined
  • No interim analysis occurs
  • Analysis happens only after test completion

If you “peek” early, it invalidates the results.
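A small simulation makes the danger of peeking concrete. In A/A tests (where no real effect exists), naively checking for significance at every interim look inflates the false positive rate far beyond the nominal 5% (simulation parameters here are illustrative):

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_peeking(n_sims=1000, looks=10, n_per_look=100, alpha=0.05, seed=42):
    """A/A simulation: true effect is zero, so any 'significant' result is a false positive."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    rng = random.Random(seed)
    fp_fixed = fp_peeking = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        hit_any_look = False
        for _ in range(looks):
            for _ in range(n_per_look):
                total += rng.gauss(0, 1)           # per-user metric noise, no real effect
                n += 1
            if abs(total / sqrt(n)) > z_crit:      # naive significance check at this look
                hit_any_look = True
        fp_peeking += hit_any_look
        fp_fixed += abs(total / sqrt(n)) > z_crit  # single check at the planned horizon
    return fp_fixed / n_sims, fp_peeking / n_sims

fixed, peeking = simulate_peeking()
print(f"fixed-horizon false positive rate: {fixed:.1%}")    # close to the nominal 5%
print(f"peek-every-look false positive rate: {peeking:.1%}")  # far above 5%
```

This is exactly the problem that sequential methods like GST are designed to solve: they let you look early while keeping the overall false positive rate controlled.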

Multivariate Test (MVT)

A test that changes multiple elements and analyzes different combinations at once to detect interaction effects. Like an A/B test, it’s also a type of controlled experiment.

Pillar 3: Scaling Challenges

These terms matter more as product teams mature: they run many experiments simultaneously, take on more complex tests, and need their reporting to stay accurate.

Allocation (Traffic Allocation)

The percentage of users assigned to each variant. A 50/50 split across 100% of traffic is the most common allocation.

Interaction Effect (a.k.a. Experiment Collision)

When two or more experiments affect the same users in a way that impacts each other’s results.

Randomization Unit

The thing that you randomize on to run your A/B test. Common randomization units are:

  • User
  • Session
  • Account
  • Device

Choosing a poor randomization unit can bias your results.
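Assignment is typically done by hashing the unit’s ID together with the experiment name, so the same unit always lands in the same variant without storing any state. A hypothetical sketch:

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a randomization unit (e.g. a user ID) into a variant."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user, same experiment: always the same variant
print(assign_variant("user-123", "new-checkout"))
print(assign_variant("user-123", "new-checkout"))
```

Hashing the experiment name in as well means the same user can land in different variants across different experiments, which keeps experiments independent.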

False Discovery Rate (FDR)

Put simply, the False Discovery Rate is the proportion of your “winning” experiments that are actually statistical flukes.
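When you run many experiments at once, procedures like Benjamini-Hochberg can keep the FDR in check. A sketch in Python (the example p-values are made up):

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg procedure: return the indices of tests
    that count as discoveries while controlling the FDR."""
    ranked = sorted(enumerate(p_values), key=lambda kv: kv[1])
    m = len(p_values)
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * fdr:  # compare each p-value to its rank-scaled threshold
            cutoff_rank = rank
    return sorted(idx for idx, _ in ranked[:cutoff_rank])

# Six experiments; only the first two survive FDR control at 5%
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))  # [0, 1]
```

Note that two of these p-values are below 0.05 on their own, yet don’t survive the correction: that gap is exactly what the FDR is about.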

Guardrail Metrics

Metrics used to ensure an experiment doesn’t harm anything important (e.g., performance, churn rate, customer service contacts, cancellations). They help teams spot unwanted side effects of a change that otherwise looks like a win on the primary metric.

Holdout Group

A persistent control group excluded from experiments to measure long-term experimentation program impact.

Novelty Effect

Temporary performance spikes caused by newness rather than true improvement. For example, if you put a new piece of UI on the screen, some people will interact with it just to understand what it does—not to get any value from it.

Winner’s Curse

Early winning experiments often overestimate true effect size due to random variation.

Feature Flag

A way to enable (or disable) features for specific users without deploying new (or reverting) code.
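Here’s a minimal, hypothetical sketch of a flag check (real feature flag systems add targeting rules, percentage rollouts, and persistence):

```python
def feature_enabled(flag: str, user_id: str, rollout: dict) -> bool:
    """Check an in-memory flag table: rollout maps flag name -> set of enabled user IDs."""
    return user_id in rollout.get(flag, set())

flags = {"new-checkout": {"user-1", "user-7"}}
print(feature_enabled("new-checkout", "user-1", flags))  # True
print(feature_enabled("new-checkout", "user-2", flags))  # False
```

Because the flag is evaluated at runtime, the new experience can be switched on, off, or restricted to a test group without any deploy.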

Experimentation Platform

Software that manages randomization, statistical analysis, documentation, governance, and experiment safety at scale.

Pillar 4: Product Strategy Layer

Experimentation isn’t just statistics—it’s how product teams learn. So, testing thoughtfully to answer important questions about customer behavior and business impact becomes a core focus. (Instead of, for example, haphazardly optimizing based on “best practices.”)

Hypothesis

A testable prediction.
For example: “If we do X, then Y metric will be impacted because Z reason.”

North Star Metric

The primary long-term value metric that guides a product strategy.

Metric Sensitivity

How quickly and easily a metric responds to changes.

Heterogeneous Treatment Effects (HTE)

When an experiment affects different user segments differently.

Experiment Velocity

How quickly a team can design, launch, and conclude experiments. It’s a key signal of experimentation maturity: sustaining a high testing cadence means the process has been streamlined and made cheap to repeat.

Final Takeaway

Teams that master this language avoid false wins, reduce risk, and learn faster. Those that don’t often mistake noise for insight.

At ABsmartly, we build experimentation infrastructure that enforces statistical rigor while supporting high experiment velocity, because modern product development runs on learning—not guesses.

Want to learn more? Get a demo.

Written By

ABsmartly

ABsmartly is a leading experimentation platform built for smart teams.
