Test and Learn Glossary: Key Terms Every Beginner Should Know
Table of Contents
- Experiment Design Terms
- Statistical Terms Explained Simply
- Retail-Specific Testing Vocabulary
- Terms You Will Use With Your Team
- The Bottom Line
Every discipline has its own language, and test and learn is no exception. Walk into a conversation between a retail merchant and an analytics team discussing a live experiment, and you will encounter a mix of experiment design terminology, statistical vocabulary, and retail-specific jargon that can feel like three languages being spoken at once.
This glossary is for the people in that room who want to follow along — and eventually lead the conversation. It is designed for merchants, store operators, marketers, and senior leaders who are new to structured experimentation and want a reliable reference for the terms they will encounter most often.
The definitions here are written in plain language with retail context throughout. Statistical depth is kept to a minimum; the companion article, Advanced Test and Learn Glossary, covers the more technical vocabulary you will encounter once you start working closely with data scientists and experimentation platforms.
Experiment Design Terms
These are the building blocks of every test and learn experiment. Understanding them is the foundation for everything else.
Hypothesis A specific, testable prediction about what will happen if you make a change — and why. A strong hypothesis is not “let’s see if this promotion drives sales.” It is: “If we reduce the price of this item by 10%, units sold will increase by at least 15%, because our sales data suggests this product is highly price sensitive.” The quality of your hypothesis determines the quality of your experiment. Vague hypotheses produce vague results — and vague results do not drive confident rollout decisions.
Test Group (Treatment Group) The stores, customers, or markets that experience the change you are testing. If you are testing a new end-cap display, the test group is the stores where the new display goes up. Everything you learn about the effect of your change comes from what happens in this group.
Control Group The stores, customers, or markets where nothing changes. The control group is your baseline — the yardstick you measure against to determine whether any difference in results was actually caused by your change, rather than by something external like a competitor promotion, a weather pattern, or normal seasonal variation. A test without a proper control group is not really a test. It is an observation.
Variable The specific element you are changing in a test. In a simple A/B test, sound experiment design means changing only one variable at a time. If you simultaneously change the price, the placement, and the promotional messaging on an item and sales go up, you will have no idea which change drove the result. Changing several variables on purpose calls for a multivariate test, which is designed to separate their effects.
Variant (Variation) A specific version of the thing being tested. In a standard A/B test, you have two variants: Version A (usually the current state, or control) and Version B (the new version you are testing). More complex experiments may test several variants simultaneously.
A/B Test The most common form of controlled experiment. You split your test population into two groups — one experiences Version A, the other experiences Version B — and measure the outcome in both. As Optimizely defines it, A/B testing is a methodology for comparing two versions against each other to determine which performs better — a definition that applies as cleanly to retail store tests as it does to digital experience optimization.
Multivariate Test (MVT) A more complex experiment that tests multiple variables simultaneously to understand how different changes interact with each other. MVTs require larger sample sizes and more analytical expertise, but they can reveal interaction effects — situations where two changes together produce a different result than either would produce alone — that a simple A/B test would never surface.
Randomization The process of assigning stores, customers, or markets to test and control groups in a way that is not systematically biased. Proper randomization ensures the two groups are genuinely comparable before the test begins, so that any difference in outcomes during the test can be attributed to your change rather than to a pre-existing difference between the groups.
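As a minimal sketch of what random assignment can look like in practice, the Python snippet below shuffles a list of stores and splits it in half. The store IDs, the fixed seed, and the function name randomize_stores are illustrative assumptions, not a prescribed method.

```python
import random

def randomize_stores(store_ids, seed=42):
    """Shuffle the stores and split them into test and control halves."""
    rng = random.Random(seed)      # fixed seed makes the assignment reproducible
    shuffled = list(store_ids)     # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"test": sorted(shuffled[:half]), "control": sorted(shuffled[half:])}

# Hypothetical fleet of 20 stores.
stores = [f"store_{i:03d}" for i in range(1, 21)]
assignment = randomize_stores(stores)
print(assignment["test"])
print(assignment["control"])
```

Recording the seed alongside the test design lets anyone reproduce the assignment later and confirm that no one hand-picked the test stores.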
Matched Store Panel A method of store selection used commonly in retail experimentation where test stores are paired with control stores that are as similar as possible — in size, sales volume, customer demographics, geography, and competitive environment. Good matching reduces the risk that differences between stores will distort your results. Poor matching is one of the most common sources of error in retail experiments.
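One simple way to build a matched panel is to pair each test store with the most similar unused candidate store. The sketch below does this with a greedy nearest-neighbor search on standardized features. The feature names, the store data, and the greedy approach itself are assumptions for illustration; real matching tools often use more dimensions and more careful methods.

```python
import math

def match_stores(test_stores, candidate_stores, features):
    """Pair each test store with its closest unused candidate store."""
    everyone = list(test_stores.values()) + list(candidate_stores.values())
    # Standardize each feature so large-valued ones (e.g. sales) do not dominate.
    spread = {}
    for f in features:
        values = [s[f] for s in everyone]
        avg = sum(values) / len(values)
        sd = math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))
        spread[f] = sd or 1.0  # avoid dividing by zero if a feature never varies

    def distance(a, b):
        return math.sqrt(sum(((a[f] - b[f]) / spread[f]) ** 2 for f in features))

    available = dict(candidate_stores)
    pairs = {}
    for test_id, test_store in test_stores.items():
        best = min(available, key=lambda cid: distance(test_store, available[cid]))
        pairs[test_id] = best
        del available[best]  # each control store is used only once
    return pairs

# Hypothetical stores described by weekly sales and floor area.
test_stores = {
    "T1": {"weekly_sales": 100_000, "sq_ft": 20_000},
    "T2": {"weekly_sales": 45_000, "sq_ft": 9_000},
}
candidates = {
    "C1": {"weekly_sales": 47_000, "sq_ft": 9_500},
    "C2": {"weekly_sales": 98_000, "sq_ft": 21_000},
    "C3": {"weekly_sales": 70_000, "sq_ft": 15_000},
}
pairs = match_stores(test_stores, candidates, ["weekly_sales", "sq_ft"])
print(pairs)
```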
Test Duration The length of time a test runs before results are evaluated. Duration matters because too short a run produces unreliable data, while too long a run delays a decision unnecessarily. The right duration depends on sample size requirements, expected effect size, and the need to account for day-of-week patterns, seasonal variation, and the novelty effect.
Novelty Effect The tendency for customers or store associates to behave differently in response to a change simply because it is new — not because the change itself is better. A new display or process often generates inflated early results that fade as the novelty wears off. Tests need to run long enough for behavior to stabilize before results can be trusted.
Statistical Terms Explained Simply
You do not need a statistics degree to run good experiments. But you do need to understand these terms well enough to ask the right questions and interpret results honestly.
Statistical Significance A measure of how unlikely your test result would be if your change had no real effect. At the common 95% confidence level, a result is considered statistically significant when a difference that large would show up less than 5% of the time through random variation alone. This is one of the most misunderstood concepts in retail testing: statistical significance does not tell you whether the effect is large enough to matter commercially. It only tells you that the result is unlikely to be noise.
P-Value The probability that you would see a result as large as yours, or larger, if your change had no effect at all. A p-value of 0.05 means that if the change truly did nothing, random variation alone would produce a result at least this large about 5% of the time. It is not the probability that your result is noise. Lower p-values mean stronger evidence against the no-effect assumption. In retail experimentation, most organizations use a p-value threshold of 0.05 or 0.10 depending on the stakes of the decision.
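The definition above can be made concrete with a permutation test: shuffle the test and control labels many times and count how often chance alone produces a difference at least as large as the one observed. This is a sketch of one approach, not the method any particular platform uses, and the sales figures and function name are hypothetical.

```python
import random

def permutation_p_value(test_values, control_values, n_permutations=10_000, seed=0):
    """One-sided p-value: share of random relabelings with a difference >= observed."""
    rng = random.Random(seed)
    n_test = len(test_values)
    n_control = len(control_values)
    observed = sum(test_values) / n_test - sum(control_values) / n_control
    pooled = list(test_values) + list(control_values)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the labels were assigned by chance
        diff = sum(pooled[:n_test]) / n_test - sum(pooled[n_test:]) / n_control
        if diff >= observed:
            as_extreme += 1
    return as_extreme / n_permutations

# Hypothetical weekly sales per store, in thousands.
test_sales = [105, 112, 98, 120, 110, 108, 115, 102]
control_sales = [100, 97, 103, 95, 101, 99, 104, 96]
p_value = permutation_p_value(test_sales, control_sales)
print(f"p-value: {p_value:.4f}")
```

A small p-value here means that very few random relabelings matched the observed gap, which is exactly the sense in which the result is hard to explain by chance.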
Confidence Level The complement of the significance threshold (the p-value cutoff), expressed as a percentage. A 95% confidence level corresponds to a threshold of 0.05: you accept a 5% chance of declaring an effect when the change actually did nothing. In retail, most organizations use 90% or 95% as their standard threshold. Higher-stakes decisions, like major pricing changes or large capital investments, typically warrant 95% or 99%.
Confidence Interval A range within which the true effect of your change is likely to fall. If your test shows a lift of 8% with a 95% confidence interval of 4% to 12%, it means you are 95% confident the true lift is somewhere between 4% and 12%. Confidence intervals communicate uncertainty in a way that point estimates alone cannot — and they matter when you are deciding how much to commit to a rollout.
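To show where such a range comes from, the sketch below computes a confidence interval for the difference in average sales between test and control stores. It assumes a normal approximation, which is a simplification for small samples (analysts often use a t-distribution instead), and the data and function name are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def difference_ci(test_values, control_values, confidence=0.95):
    """Normal-approximation confidence interval for mean(test) - mean(control)."""
    diff = mean(test_values) - mean(control_values)
    # Standard error of the difference between two independent group means.
    se = math.sqrt(variance(test_values) / len(test_values)
                   + variance(control_values) / len(control_values))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return diff - z * se, diff + z * se

# Hypothetical weekly sales per store, in thousands.
test_sales = [105, 112, 98, 120, 110, 108, 115, 102]
control_sales = [100, 97, 103, 95, 101, 99, 104, 96]
low, high = difference_ci(test_sales, control_sales)
print(f"95% CI for the difference: {low:.2f} to {high:.2f}")
```

Asking for a lower confidence level, such as 90%, produces a narrower interval: the price of a tighter range is a greater chance that it misses the true effect.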
Statistical Power The probability that your test will detect a real effect if one actually exists. A test with low power may miss real effects — producing a negative result not because your change did not work, but because you did not have enough data to see it. Most statisticians recommend designing tests with at least 80% power. Scribbr’s explanation of statistical power and Type I and Type II errors is one of the clearest plain-language treatments of this concept available online.
Sample Size The number of stores, customers, or transactions included in your test. Sample size is one of the most important design decisions in any experiment — too small and your results will not be statistically reliable; too large and you are spending more time and resources than the decision warrants. The right sample size depends on the expected effect size, your confidence level target, and your statistical power requirement.
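The relationship between effect size, confidence level, power, and sample size can be sketched with the standard two-group normal-approximation formula. This assumes independent stores, equal variability in both groups, and a two-sided test; the baseline figures and the function name stores_per_group are hypothetical.

```python
import math
from statistics import NormalDist

def stores_per_group(baseline_mean, baseline_sd, expected_lift, alpha=0.05, power=0.80):
    """Approximate stores needed in each group to detect the expected lift."""
    delta = baseline_mean * expected_lift          # absolute effect to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * baseline_sd / delta) ** 2
    return math.ceil(n)

# Hypothetical: stores average 50,000 in weekly sales with a standard deviation of 8,000.
n_for_5_pct = stores_per_group(50_000, 8_000, expected_lift=0.05)
n_for_10_pct = stores_per_group(50_000, 8_000, expected_lift=0.10)
print(n_for_5_pct, n_for_10_pct)
```

Halving the lift you want to detect roughly quadruples the number of stores required, which is why small expected effects are so expensive to test.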
Null Hypothesis The starting assumption that your change has no effect. Statistical testing begins by assuming the null hypothesis is true and gathering evidence to determine whether you can reject it. If results are statistically significant, you reject the null hypothesis and conclude the change had a real effect. If not, you fail to reject it — which does not mean the change did nothing, only that you could not prove it did with the data you collected.
Type I Error (False Positive) Concluding that your change had an effect when it actually did not. In retail terms: rolling out an initiative because your test showed a lift that was actually just random variation. Type I errors are controlled by your significance threshold — the higher your confidence requirement, the lower your false positive risk.
Type II Error (False Negative) Failing to detect a real effect — concluding your change did nothing when it actually worked. In retail terms: abandoning a good idea because the test came back inconclusive, when in reality the test just did not have enough data to detect the lift. Type II errors are controlled by statistical power.
Retail-Specific Testing Vocabulary
These terms appear constantly in retail experimentation conversations. Some are unique to the retail context. Others have meanings that differ subtly but importantly from how they are used elsewhere.
Incrementality The additional sales, visits, or outcomes that would not have occurred without your change. Incrementality is the gold standard of retail measurement because it isolates the true causal impact of a change. Nielsen describes it as the lift above native demand — the sales you genuinely drove that would not have happened on their own. A promotion that drove volume may not be incremental if those customers would have purchased at full price anyway.
Baseline The expected level of performance in the absence of any change. Establishing a reliable baseline — typically derived from pre-test performance or the control group’s results during the test period — is essential for measuring lift accurately. Without a clean baseline, you cannot determine how much of your result is truly incremental.
Lift The percentage improvement in your target metric observed in the test group relative to the control group. A lift of 10% means the test group performed 10% better on the metric being measured during the test period. Lift is the most commonly reported outcome in retail experiments — but always ask whether it is total lift or incremental lift, because the two can look very different.
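The arithmetic behind lift is simple, but the choice of baseline changes the answer. The sketch below compares a raw test-versus-control lift with a lift measured against a baseline projected from pre-test performance. The projection method shown is one common approach and an assumption here, as are the sales figures and function names.

```python
def simple_lift(test_during, control_during):
    """Lift of the test group over the control group during the test period."""
    return (test_during - control_during) / control_during

def baseline_adjusted_lift(test_pre, test_during, control_pre, control_during):
    """Lift versus what the test group would have done if it had moved like control."""
    expected_test = test_pre * (control_during / control_pre)  # projected baseline
    return (test_during - expected_test) / expected_test

# Hypothetical average weekly sales per store.
raw = simple_lift(test_during=112_200, control_during=105_000)
adjusted = baseline_adjusted_lift(
    test_pre=102_000, test_during=112_200,
    control_pre=100_000, control_during=105_000,
)
print(f"raw lift: {raw:.1%}, baseline-adjusted lift: {adjusted:.1%}")
```

In this made-up example the test stores were already selling slightly more than the control stores before the test, so the raw comparison overstates the effect of the change.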
Cannibalization When a change drives sales of one product or category at the expense of another, without growing the total. A promotion that lifts units on a private label item might simply pull customers away from the branded version rather than growing the category. Measuring cannibalization is one of the most important things to do in any retail experiment involving promotions, pricing, or assortment changes.
Halo Effect The positive spillover a change creates on nearby or related products — the opposite of cannibalization. A strong display for a flagship product might lift sales of complementary items around it. Capturing halo effects alongside direct lift gives a more complete picture of what an experiment actually delivered.
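Cannibalization and halo effects can be read together as a net category result. The sketch below takes the unit change per item (test minus control) and separates the direct effect, the units lost on other items, and the units gained on other items. The item names, numbers, and function name are hypothetical, and treating every negative change as cannibalization is a simplifying assumption.

```python
def summarize_category_impact(unit_changes, promoted_item):
    """Split per-item unit changes into direct, cannibalized, halo, and net effects."""
    direct = unit_changes[promoted_item]
    others = [v for item, v in unit_changes.items() if item != promoted_item]
    cannibalized = sum(v for v in others if v < 0)  # units lost on other items
    halo = sum(v for v in others if v > 0)          # units gained on other items
    return {
        "direct": direct,
        "cannibalized": cannibalized,
        "halo": halo,
        "net": direct + cannibalized + halo,
    }

# Hypothetical unit changes (test minus control) during a private label promotion.
impact = summarize_category_impact(
    {"private_label_cola": 400, "branded_cola": -250, "chips": 60},
    promoted_item="private_label_cola",
)
print(impact)
```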
Pilot A small-scale implementation of a change, typically in a limited set of stores or markets, designed to evaluate operational feasibility and customer response before broader rollout. The word “pilot” is often used interchangeably with “test,” but the distinction matters: a pilot tends to focus more on operational learning, while a structured test is focused on measuring causal impact with statistical confidence.
Rollout The process of implementing a change across the full fleet, channel, or customer base following a successful test. A disciplined rollout involves defining what “success” looks like before the test begins, confirming results hold at scale after deployment, and documenting the full process for future reference.
Store Matching The methodology used to select control stores that are as comparable as possible to test stores on relevant dimensions — volume, format, demographics, geography, and competitive context. Poor store matching is one of the most consistent sources of error in retail experimentation. If the two groups are not truly comparable before the test starts, the results cannot be trusted regardless of how well everything else is executed.
Terms You Will Use With Your Team
These are the terms that come up most often in experiment reviews, results readouts, and conversations with analytics partners.
Pre-Period Analysis An evaluation of how test and control stores performed relative to each other before the test began. Pre-period analysis validates that your matched store panel is well-constructed — if the groups were already diverging before the test started, any difference during the test may reflect a pre-existing trend rather than the effect of your change.
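A basic pre-period check is to track the ratio of test to control sales week by week and see whether it drifts. The sketch below compares the average ratio in the second half of the pre-period with the first half. The weekly figures, the function name, and any cutoff you might apply to the result are assumptions for illustration.

```python
from statistics import mean

def pre_period_drift(test_weekly, control_weekly):
    """Change in the test/control sales ratio between the two halves of the pre-period."""
    ratios = [t / c for t, c in zip(test_weekly, control_weekly)]
    half = len(ratios) // 2
    return mean(ratios[half:]) - mean(ratios[:half])

# Hypothetical weekly sales (in thousands) before the test began.
stable = pre_period_drift(
    [102, 101, 103, 102, 101, 103],
    [100, 99, 101, 100, 99, 101],
)
diverging = pre_period_drift(
    [100, 102, 104, 106, 108, 110],
    [100, 100, 100, 100, 100, 100],
)
print(f"stable panel drift: {stable:.3f}, diverging panel drift: {diverging:.3f}")
```

A drift near zero suggests the two groups were moving together before the test. A clearly positive or negative drift is a warning that the panel needs to be rebuilt before results can be trusted.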
Holdout Group A subset of stores or customers deliberately held back from an initiative to serve as a long-term control. Unlike a standard control group, which typically runs only for the duration of a specific test, a holdout group may be maintained for an extended period to allow ongoing measurement of an initiative’s incremental value after full rollout.
Guardrail Metric A secondary metric monitored during a test to catch unintended negative consequences. If your primary metric is basket size, guardrail metrics might include transaction frequency, customer return rate, and margin per transaction. Guardrail metrics ensure that optimizing for one outcome is not creating hidden problems somewhere else.
Test Registry A shared, searchable record of every experiment your organization has run — what was tested, when, how it was designed, what the results showed, and what was decided. A test registry is one of the most underrated assets a retail experimentation program can build. It prevents repeated experiments, accelerates hypothesis development, and creates institutional memory that survives organizational change.
Rollout Criteria The pre-defined conditions a test must meet before a full rollout is approved. Defining these criteria before the test begins — not after results come in — is one of the most important disciplines in retail experimentation. It prevents confirmation bias from influencing the decision and gives the organization a clear, defensible basis for acting on results.
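One way to enforce this discipline is to write the criteria down as data before the test starts and evaluate results against them mechanically. The sketch below is illustrative only: the class and function names, the thresholds, and the guardrail metric names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RolloutCriteria:
    """Conditions agreed before the test starts (all values hypothetical)."""
    min_lift: float      # smallest lift worth rolling out
    max_p_value: float   # significance threshold
    guardrail_floors: dict = field(default_factory=dict)  # worst acceptable change per guardrail

def evaluate_rollout(criteria, observed_lift, p_value, guardrail_changes):
    """Return (approved, reasons) by checking results against the pre-set criteria."""
    reasons = []
    if observed_lift < criteria.min_lift:
        reasons.append(f"lift {observed_lift:.1%} is below the {criteria.min_lift:.1%} minimum")
    if p_value > criteria.max_p_value:
        reasons.append(f"p-value {p_value} exceeds {criteria.max_p_value}")
    for metric, floor in criteria.guardrail_floors.items():
        change = guardrail_changes.get(metric)
        if change is None:
            reasons.append(f"guardrail {metric} was not measured")
        elif change < floor:
            reasons.append(f"guardrail {metric} changed {change:.1%}, below the {floor:.1%} floor")
    return (not reasons), reasons

criteria = RolloutCriteria(
    min_lift=0.03,
    max_p_value=0.05,
    guardrail_floors={"margin_per_transaction": -0.01, "transaction_frequency": -0.02},
)
approved, reasons = evaluate_rollout(
    criteria, observed_lift=0.045, p_value=0.02,
    guardrail_changes={"margin_per_transaction": -0.004, "transaction_frequency": 0.001},
)
print(approved, reasons)
```

Because the criteria object is frozen, it cannot be quietly adjusted after results arrive, which mirrors the point about defining the bar before the test begins.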
The Bottom Line
Learning the language of test and learn is not about becoming a statistician. It is about being able to participate fully in conversations that will increasingly shape how your organization makes decisions. When you understand the difference between lift and incrementality, you can push back on promotional ROI claims that are overstating true impact. When you understand statistical significance, you can ask whether a result is trustworthy before you commit to scaling it. When you understand what a control group actually does, you can spot the tests that should never have been trusted in the first place.
The goal is a shared vocabulary that lets merchants, operators, analysts, and leaders talk about experiments in the same language. That shared understanding is one of the most practical and underrated prerequisites for building an experimentation program that actually changes how decisions get made.