Advanced Test and Learn Glossary: Statistical and Analytical Terms for Retail Experimenters
Table of Contents
- Advanced Statistical Concepts
- Advanced Experiment Design Concepts
- Advanced Analytical Methods
- Advanced Retail-Specific Concepts
- The Bottom Line
If you have been running retail experiments for a while — or if you are starting to work more closely with data scientists, experimentation platforms, or a dedicated analytics team — you will quickly encounter a layer of vocabulary that goes well beyond what the beginner glossary covers. These are the terms that show up in experiment design reviews, results readouts, and conversations about methodology. They are also the terms that most commonly create a gap between technical and non-technical stakeholders.
This glossary covers the statistical and analytical vocabulary that defines mature retail experimentation practice. It assumes familiarity with the concepts in the Beginner Glossary — terms like hypothesis, control group, statistical significance, and lift. The definitions here go deeper, covering the techniques, error types, and analytical methods that separate a rigorous experimentation program from one that produces results you cannot fully trust.
Advanced Statistical Concepts
Minimum Detectable Effect (MDE) The smallest effect size your test is designed to reliably detect, given its sample size, confidence level, and statistical power target. MDE is one of the most important and most overlooked design parameters in retail experimentation. If you design a test with an MDE of 5% but the true effect of your change is 2%, your test is unlikely to detect it: not because the change does not work, but because the test was not built to see something that small. Always define your MDE before finalizing your sample size, and make sure it is realistic given what you know about the expected magnitude of your change.
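To make the link between MDE and sample size concrete, here is a minimal sketch using the textbook normal approximation for a two-sided, two-group comparison. The function name `stores_per_group` and the sales figures are hypothetical, and the formula assumes equal group sizes and equal variance, which real store data may not satisfy.

```python
from math import ceil
from statistics import NormalDist


def stores_per_group(mde_abs, std_dev, alpha=0.05, power=0.80):
    """Approximate units needed in EACH group for a two-sided, two-sample z-test.

    mde_abs: minimum detectable effect, in the metric's own units
    std_dev: standard deviation of the metric across units (assumed equal in both groups)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile matching the power target
    n = 2 * ((z_alpha + z_power) * std_dev / mde_abs) ** 2
    return ceil(n)


# Hypothetical inputs: weekly sales per store average 100,000 with a standard deviation of 10,000,
# so a 5% MDE is 5,000 and a 2% MDE is 2,000.
n_for_5_percent = stores_per_group(mde_abs=5_000, std_dev=10_000)
n_for_2_percent = stores_per_group(mde_abs=2_000, std_dev=10_000)
print(n_for_5_percent, n_for_2_percent)
```

Under these assumptions, moving from a 5% MDE to a 2% MDE multiplies the required stores per group by roughly six, which is why the MDE needs to be settled before the sample size is.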
Statistical Power The probability that your test will detect a real effect if one actually exists, expressed as a percentage. Power is determined by your sample size, effect size, significance threshold, and the variability of your metric. The standard target for most retail experiments is 80% power — meaning a 20% chance of missing a real effect. Scribbr’s guide to statistical power explains the relationship between power, sample size, and significance level clearly. Power is often the first thing to cut when test design is under time or resource pressure — and the most common reason tests come back inconclusive.
Type I Error (False Positive) Concluding that your change had an effect when it did not, in other words declaring a winner from noise. The risk of a Type I error is controlled by your significance threshold (alpha). Set alpha at 0.05 and you accept a 5% chance of a false positive whenever the change truly has no effect. The practical consequence in retail is rolling out an initiative because it “tested well” when the test result was not real, an expensive outcome that erodes confidence in the testing program over time. Scribbr’s treatment of Type I and Type II errors includes a clear visual representation of how the two error types trade off against each other.
Type II Error (False Negative) Failing to detect a real effect — concluding a change did nothing when it actually worked. Type II errors are controlled by statistical power. The practical consequence in retail is abandoning a good initiative because the test came back flat, when in reality the test was simply underpowered. Underpowered tests are far more common in retail experimentation than most practitioners realize — particularly when effect sizes are small and sample sizes are constrained by the number of available stores.
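The relationship between power and the Type II error rate can be sketched with the same normal approximation. This is an illustrative calculation, not a replacement for a proper power analysis: `approx_power` is a hypothetical helper, the store counts and sales figures are made up, and equal group sizes and variances are assumed.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(true_effect, std_dev, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes."""
    std_error = std_dev * sqrt(2 / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(true_effect) / std_error
    # Probability that the test statistic lands beyond either critical value.
    return NormalDist().cdf(shift - z_alpha) + NormalDist().cdf(-shift - z_alpha)


# Hypothetical design: 30 stores per group, a true lift of 2,000 on a metric
# whose standard deviation across stores is 10,000.
power = approx_power(true_effect=2_000, std_dev=10_000, n_per_group=30)
type_ii_error_rate = 1 - power
print(f"power={power:.2f}, chance of missing the effect={type_ii_error_rate:.2f}")
```

With these hypothetical numbers the design is badly underpowered: a real effect would be missed far more often than it would be found, which is the situation the definition above warns about.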
Multiple Comparisons Problem When you test many hypotheses simultaneously, or measure many metrics in a single experiment, the probability of finding at least one false positive increases with every additional comparison. If you run 20 independent tests of changes that truly do nothing at a 95% confidence threshold, you should expect about one false positive on average, and the chance of seeing at least one is roughly 64%. In practice, this means that retailers who measure a large number of metrics in every experiment, or who run many simultaneous tests without controlling for multiple comparisons, will regularly see significant-looking results that are not real. Techniques like the Bonferroni correction and false discovery rate adjustment are designed to address this problem.
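The arithmetic behind this problem, and the two corrections named above, can be sketched in a few lines. The function names and the example p-values are hypothetical. The family-wise calculation assumes the tests are independent, and the false discovery rate adjustment shown is the Benjamini-Hochberg procedure, one common choice among several.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests with no real effects."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold that caps the family-wise error rate at alpha."""
    return alpha / num_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking which p-values survive the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[idx] = True
    return keep


# Hypothetical p-values for five metrics measured in one experiment.
p_values = [0.001, 0.012, 0.025, 0.045, 0.20]
bonferroni_pass = [p <= bonferroni_threshold(len(p_values)) for p in p_values]
bh_pass = benjamini_hochberg(p_values)
print(round(family_wise_error_rate(20), 3), bonferroni_pass, bh_pass)
```

Bonferroni is the stricter of the two: in this made-up example it keeps only the smallest p-value, while Benjamini-Hochberg keeps the three smallest.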
Regression to the Mean The statistical tendency for extreme results to move back toward average over time. In retail, stores that performed unusually well or unusually poorly in a pre-period may show apparent improvement or decline during a test simply because their performance is normalizing — not because of anything you changed. Regression to the mean is particularly dangerous when test stores are selected based on recent performance rather than through proper randomization or matching.
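A small simulation shows how this plays out. Everything here is made up for illustration: 1,000 hypothetical stores each have a stable underlying sales level plus random period-to-period noise, and no intervention happens at any point.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Each hypothetical store has a stable underlying sales level plus period-specific noise.
true_levels = [rng.gauss(100, 10) for _ in range(1000)]
pre_period = [level + rng.gauss(0, 10) for level in true_levels]
post_period = [level + rng.gauss(0, 10) for level in true_levels]  # nothing was changed

# Pick the 100 best pre-period performers, as a performance-based selection process might.
top_stores = sorted(range(1000), key=lambda i: pre_period[i], reverse=True)[:100]

top_pre_avg = mean(pre_period[i] for i in top_stores)
top_post_avg = mean(post_period[i] for i in top_stores)
apparent_decline = top_pre_avg - top_post_avg
print(f"pre={top_pre_avg:.1f}, post={top_post_avg:.1f}, apparent decline={apparent_decline:.1f}")
```

The selected stores appear to decline even though nothing happened to them, because part of what made them look exceptional in the pre-period was noise that does not repeat.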
Effect Size The magnitude of the difference you are measuring — how big the change in your target metric is. Effect size is distinct from statistical significance: a result can be statistically significant but commercially meaningless if the effect is tiny, and it can be commercially significant but statistically undetectable if the sample is too small. In retail experimentation, always evaluate results in terms of both statistical significance and practical effect size before making a rollout decision.
Variance A measure of how spread out your data is around the average. High variance means individual data points are scattered widely, which makes it harder to detect the signal of your change against the background noise. Reducing variance in your experiment metrics, whether through better store matching, longer test duration, or analytical techniques like CUPED, is one of the most effective ways to make your tests more sensitive and reduce the sample sizes required.
Bayesian vs. Frequentist Statistics Two fundamentally different frameworks for interpreting experiment results that come up frequently in conversations with data scientists and experimentation platform vendors. The frequentist approach — the most common in retail experimentation — defines probability in terms of long-run frequency and uses p-values and confidence intervals to assess results. The Bayesian approach defines probability as a degree of belief that is updated as new data comes in, producing “probability of being best” metrics rather than p-values. Neither is universally correct. Frequentist methods are more conservative and easier to explain to non-technical stakeholders. Bayesian methods are often more intuitive and better suited to sequential decision-making. Knowing the difference helps you ask the right questions when a vendor or data team presents results.
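As one illustration of what a “probability of being best” number can mean, the sketch below estimates the chance that a test conversion rate exceeds the control rate. It assumes a conversion-style metric, uniform Beta(1, 1) priors, and Monte Carlo sampling. The function name and the counts are hypothetical, and a vendor's actual calculation may differ.

```python
import random


def prob_test_beats_control(test_conv, test_n, control_conv, control_n,
                            draws=10_000, seed=7):
    """Monte Carlo estimate of P(test rate > control rate) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate is Beta(1 + successes, 1 + failures).
        test_rate = rng.betavariate(1 + test_conv, 1 + test_n - test_conv)
        control_rate = rng.betavariate(1 + control_conv, 1 + control_n - control_conv)
        wins += test_rate > control_rate
    return wins / draws


# Hypothetical counts: 260 of 4,000 customers converted in test, 200 of 4,000 in control.
p_best = prob_test_beats_control(260, 4_000, 200, 4_000)
print(f"Estimated probability that test beats control: {p_best:.3f}")
```

The output is a direct statement about which variant is likely better, which is why many stakeholders find it easier to read than a p-value. It still depends on the prior and the model, so it is worth asking what those were.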
Advanced Experiment Design Concepts
CUPED (Controlled-experiment Using Pre-Experiment Data) A variance reduction technique that uses historical data from before the experiment to adjust the results measured during the experiment and reduce noise. Originally developed by Microsoft’s experimentation team in 2013, CUPED has become an industry standard at companies running experiments at scale. The core idea: if you know how a store or customer behaved before the test, you can use that knowledge to filter out natural variability and see the effect of your change more clearly. In practical terms, CUPED can allow you to reach statistical significance considerably faster, or with fewer stores, than a standard test. Optimizely’s explanation of CUPED covers how it works and when to use it.
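The core adjustment is short enough to sketch directly. This is a minimal version with a single pre-period covariate, where theta is the regression slope of the in-test metric on the pre-period metric, pooled across test and control. The function name and store figures are hypothetical, and production implementations typically handle more than this.

```python
from statistics import mean, pvariance


def cuped_adjust(post_values, pre_values):
    """Return CUPED-adjusted outcomes: y - theta * (x - mean(x)).

    theta is the slope from regressing the in-test metric (y) on the
    pre-experiment metric (x), pooled across all units.
    """
    x_bar, y_bar = mean(pre_values), mean(post_values)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pre_values, post_values))
    var_x = sum((x - x_bar) ** 2 for x in pre_values)
    theta = cov_xy / var_x
    return [y - theta * (x - x_bar) for x, y in zip(pre_values, post_values)]


# Hypothetical weekly sales for six stores, before and during the test.
pre = [80, 100, 120, 90, 110, 130]
post = [84, 103, 125, 96, 118, 137]
adjusted = cuped_adjust(post, pre)
print(f"variance before: {pvariance(post):.1f}, after: {pvariance(adjusted):.1f}")
```

The adjustment leaves the overall average unchanged but strips out the part of each store's result that its own history already predicted. You would then compare test and control on the adjusted values instead of the raw ones.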
Stratification A sampling technique where you divide your test population into subgroups — called strata — based on relevant characteristics (store size, format, geography, sales volume) before randomly assigning to test and control. Stratified randomization ensures that both groups contain a proportional mix of each subgroup, reducing the risk that random assignment produces imbalanced groups. It is particularly valuable in retail experiments with small sample sizes, where standard randomization may occasionally produce groups that are unrepresentative simply by chance.
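Stratified assignment can be sketched as a shuffle within each stratum followed by an even split. The function name, store identifiers, and strata below are hypothetical. When a stratum has an odd number of stores, this version sends the extra store to control, which is one arbitrary choice among several reasonable ones.

```python
import random
from collections import defaultdict


def stratified_assignment(stores, stratum_of, seed=0):
    """Randomly split stores into test and control within each stratum.

    stores: list of store identifiers
    stratum_of: dict mapping store identifier -> stratum label
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for store in stores:
        by_stratum[stratum_of[store]].append(store)

    assignment = {}
    for label in sorted(by_stratum):
        members = by_stratum[label][:]
        rng.shuffle(members)             # randomize order within the stratum
        half = len(members) // 2
        for store in members[:half]:
            assignment[store] = "test"
        for store in members[half:]:     # any odd store out goes to control
            assignment[store] = "control"
    return assignment


# Hypothetical fleet: four urban and four rural stores.
stores = [f"S{i}" for i in range(1, 9)]
stratum_of = {s: ("urban" if i < 4 else "rural") for i, s in enumerate(stores)}
assignment = stratified_assignment(stores, stratum_of)
print(assignment)
```

Whatever the random shuffle produces, each format ends up with the same number of test and control stores, which simple randomization across all eight stores would not guarantee.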
Blocking A related technique where stores are grouped into blocks based on shared characteristics, and test and control assignments are made within each block. Blocking is a more structured version of stratification that ensures balance on specific dimensions the experimenter cares about most — like store format or geographic region.
Contamination What happens when the boundary between test and control groups breaks down, typically when customers, products, or information moves between the two groups in ways that compromise the integrity of the experiment. In retail, contamination can occur when customers shop at both test and control stores, when promotional information spreads beyond the intended test geography, or when store staff in control locations implement the test change independently. Contamination typically biases results toward zero, making your change look less effective than it actually is.
Washout Period A gap between the end of one test and the beginning of the next, designed to allow any carryover effects from the previous experiment to dissipate before the new one starts. Washout periods are particularly important when testing promotions or pricing changes, where customer expectations and purchase patterns may be affected by the previous test for some time after it ends.
Carryover Effect The lingering influence of one test on the results of a subsequent one. If you test a deep promotional discount in a set of stores and then immediately begin a second test in the same stores, customer purchase behavior may still be affected by the first promotion — compressing demand that would normally have occurred in the second test period. Carryover effects are common in retail and are a key reason for maintaining washout periods between sequential experiments.
Allocation Ratio The proportion of stores or customers assigned to the test group versus the control group. Most experiments use a 50/50 split, which maximizes statistical power for a given total sample size. But other ratios are sometimes appropriate — a 20/80 split, for example, limits the exposure of a risky change to a small number of stores while still allowing measurement. Unequal allocation is a legitimate design choice when the cost of being in the test group is high, but it comes at the expense of statistical efficiency.
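The efficiency cost of an unequal split can be quantified with a standard result: when both groups have the same outcome variance, the variance of the difference in means is proportional to 1/n_test + 1/n_control. The helper below is a hypothetical sketch built on that assumption.

```python
def relative_efficiency(test_share):
    """Precision of an unequal split relative to 50/50 at the same total sample size.

    Assumes equal outcome variance in both groups, so the variance of the
    difference in means is proportional to 1/n_test + 1/n_control.
    """
    if not 0 < test_share < 1:
        raise ValueError("test_share must be strictly between 0 and 1")
    return 4 * test_share * (1 - test_share)


efficiency_20_80 = relative_efficiency(0.20)
# How much larger the total sample must be to match the precision of a 50/50 split.
sample_inflation = 1 / efficiency_20_80
print(f"20/80 efficiency: {efficiency_20_80:.2f}, sample needed: {sample_inflation:.2f}x")
```

Under this assumption a 20/80 split delivers about 64% of the precision of a 50/50 split, so matching it requires roughly 1.56 times as many total stores.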
Advanced Analytical Methods
Difference-in-Differences (DiD) An analytical method that estimates causal effects by comparing the change in outcomes for the test group before and after an intervention against the change in outcomes for the control group over the same period. DiD is powerful because it controls for time-invariant differences between groups and for trends that affect both groups equally. The World Bank’s DIME Wiki describes difference-in-differences as an approach that “facilitates causal inference even when randomization is not possible” — making it particularly useful in retail contexts where perfect store matching is difficult.
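In its simplest two-group, two-period form, the estimate is just a difference of two differences. The sketch below uses hypothetical store-level sales and a hypothetical function name. A real analysis would usually fit a regression so that standard errors and additional controls are available.

```python
from statistics import mean


def diff_in_diff(test_pre, test_post, control_pre, control_post):
    """Change in the test group minus change in the control group."""
    test_change = mean(test_post) - mean(test_pre)
    control_change = mean(control_post) - mean(control_pre)
    return test_change - control_change


# Hypothetical average weekly sales per store, before and after the intervention.
test_pre, test_post = [100, 110, 120], [112, 121, 133]
control_pre, control_post = [90, 100, 110], [94, 105, 113]

did_estimate = diff_in_diff(test_pre, test_post, control_pre, control_post)
print(did_estimate)  # test rose by 12, control rose by 4, so the estimate is 8
```

Note that the test stores started at a higher level than the control stores. That level difference drops out of the calculation, which is the sense in which DiD controls for time-invariant differences between groups.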
Parallel Trends Assumption The foundational assumption underlying difference-in-differences analysis — that in the absence of the intervention, test and control groups would have followed the same trend over time. If pre-period analysis shows the two groups were moving in different directions before the test started, the parallel trends assumption is violated and DiD results cannot be trusted. Pre-period trend analysis is therefore an essential step before relying on DiD for causal inference.
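One informal way to look at pre-period trends is to track the gap between test and control week by week and check whether it is drifting. The helper below is a hypothetical sketch that fits a least-squares slope to that gap. It is a quick diagnostic under made-up data, not a formal statistical test, and a plot of the two series is usually worth reviewing alongside it.

```python
def gap_trend_slope(test_series, control_series):
    """Least-squares slope of the test-minus-control gap across pre-period weeks."""
    gaps = [t - c for t, c in zip(test_series, control_series)]
    n = len(gaps)
    weeks = range(n)
    week_bar = sum(weeks) / n
    gap_bar = sum(gaps) / n
    numerator = sum((w - week_bar) * (g - gap_bar) for w, g in zip(weeks, gaps))
    denominator = sum((w - week_bar) ** 2 for w in weeks)
    return numerator / denominator


# Hypothetical pre-period weekly sales.
control = [90, 92, 94, 96]
parallel_test = [100, 102, 104, 106]    # gap stays at 10 every week
diverging_test = [100, 104, 108, 112]   # gap widens by 2 every week

parallel_slope = gap_trend_slope(parallel_test, control)
diverging_slope = gap_trend_slope(diverging_test, control)
print(parallel_slope, diverging_slope)
```

A slope near zero is consistent with parallel trends. A slope that is clearly non-zero means the groups were already pulling apart before the test began, and a DiD estimate would absorb that drift as if it were a treatment effect.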
Synthetic Control A method for constructing a statistical control group by combining data from multiple comparison units to create a weighted composite that closely matches the test unit’s pre-period behavior. Synthetic control is particularly useful in retail when a clean matched control group is hard to find — for example, when testing a change in a single large market where no individual comparison market is sufficiently similar. Rather than picking one imperfect control market, synthetic control blends several together in proportions that minimize pre-period differences.
Covariate Adjustment The practice of including pre-test characteristics (covariates) in the statistical model used to estimate treatment effects, in order to reduce variance and improve precision. CUPED is one form of covariate adjustment. More generally, covariate adjustment can include store-level factors like pre-period sales, store size, customer demographics, and competitive intensity. Well-chosen covariates can meaningfully increase the statistical power of an experiment without adding stores or extending the test period.
Intent-to-Treat Analysis An analytical approach that evaluates results based on how stores or customers were originally assigned — regardless of whether they actually received the treatment as designed. In retail, this matters when compliance is imperfect: if some test stores did not properly implement the change, or some control stores accidentally received it, an intent-to-treat analysis counts them as they were assigned rather than as they actually behaved. This approach is more conservative but more defensible — it reflects the real-world conditions under which an initiative would be rolled out.
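The difference between analyzing by assignment and analyzing by actual behavior is easy to see with a small example. The store records and function names below are hypothetical. The as-treated calculation is included only for contrast, since dropping or reclassifying non-compliant stores breaks the randomization that makes the comparison fair.

```python
from statistics import mean

# Hypothetical store records: assignment, whether the change was actually implemented, and outcome.
stores = [
    {"assigned": "test", "implemented": True, "sales": 110.0},
    {"assigned": "test", "implemented": True, "sales": 112.0},
    {"assigned": "test", "implemented": False, "sales": 100.0},  # non-compliant test store
    {"assigned": "test", "implemented": True, "sales": 108.0},
    {"assigned": "control", "implemented": False, "sales": 100.0},
    {"assigned": "control", "implemented": False, "sales": 101.0},
    {"assigned": "control", "implemented": False, "sales": 99.0},
    {"assigned": "control", "implemented": False, "sales": 100.0},
]


def intent_to_treat_effect(records):
    """Difference in mean outcome by ORIGINAL assignment, ignoring compliance."""
    test = [r["sales"] for r in records if r["assigned"] == "test"]
    control = [r["sales"] for r in records if r["assigned"] == "control"]
    return mean(test) - mean(control)


def as_treated_effect(records):
    """Difference in mean outcome by what stores actually did (shown for contrast only)."""
    treated = [r["sales"] for r in records if r["implemented"]]
    untreated = [r["sales"] for r in records if not r["implemented"]]
    return mean(treated) - mean(untreated)


itt = intent_to_treat_effect(stores)
as_treated = as_treated_effect(stores)
print(itt, as_treated)
```

In this made-up data the intent-to-treat estimate is smaller than the as-treated one, because it includes the test store that never implemented the change. That smaller number is closer to what a rollout with similar compliance would deliver.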
Sequential Testing A statistical approach that allows you to evaluate results as data accumulates during a test — rather than waiting until the end of a pre-specified test duration — while controlling the risk of false positives. Standard fixed-horizon testing requires you to define your sample size and test duration in advance and not look at results until the end. Sequential testing relaxes this constraint with statistical adjustments that maintain validity despite interim looks. It is more complex to implement but can significantly reduce the time to a confident decision in fast-moving retail environments.
Peeking Looking at test results before the planned evaluation date and making decisions based on incomplete data — without applying the statistical corrections that sequential testing methods provide. Peeking is one of the most common and damaging behaviors in retail experimentation. Because early results are noisy, peeking dramatically inflates the Type I error rate — you are far more likely to declare a winner (or loser) prematurely when results happen to be trending in one direction early in the test. The discipline of not peeking, or using proper sequential testing methods when interim looks are necessary, is a hallmark of a mature experimentation program.
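The inflation caused by peeking can be demonstrated with a simulation of A/A tests, where test and control are drawn from the same distribution and any “winner” is a false positive by construction. The function name and all parameters below are hypothetical, and the outcome is modeled as normal noise with known variance to keep the sketch short.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rates(num_sims=2000, looks=5, n_per_look=20, alpha=0.05, seed=1):
    """Simulate A/A tests (no real effect) and compare two decision rules.

    Returns (fixed_horizon_rate, peeking_rate), where the peeking rule declares
    a winner if ANY interim look crosses the unadjusted significance threshold.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    fixed_hits = peeking_hits = 0
    for _sim in range(num_sims):
        sum_test = sum_control = 0.0
        n = 0
        crossed_at_any_look = False
        significant = False
        for _look in range(looks):
            for _unit in range(n_per_look):
                sum_test += rng.gauss(0, 1)
                sum_control += rng.gauss(0, 1)
            n += n_per_look
            # z-statistic for the difference in means, with known unit variance
            z = (sum_test / n - sum_control / n) / sqrt(2 / n)
            significant = abs(z) > z_crit
            crossed_at_any_look = crossed_at_any_look or significant
        fixed_hits += significant             # only the final look counts
        peeking_hits += crossed_at_any_look   # any look counts
    return fixed_hits / num_sims, peeking_hits / num_sims


fixed_rate, peeking_rate = false_positive_rates()
print(f"fixed horizon: {fixed_rate:.3f}, with peeking: {peeking_rate:.3f}")
```

The fixed-horizon rule should land near the nominal 5%, while the peeking rule, which gets five chances to cross the same threshold, should produce false positives noticeably more often. Sequential testing methods exist to bring that second number back under control.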
Advanced Retail-Specific Concepts
External Validity The degree to which results from a test in one context can be expected to generalize to other contexts. A successful test in urban flagship stores may have limited external validity for rural convenience formats — the customers, competitive environments, and operational realities are different enough that the result may not transfer. Understanding the limits of external validity is critical for making intelligent rollout decisions and for designing experiments that produce findings you can actually use across the full fleet.
Network Effects and Interference What happens when the treatment of one unit — a store or customer — affects outcomes for another unit in the control group. In retail, interference can occur when customers switch between test and control stores in response to a promotion, when a pricing change in test stores causes competitors to respond in ways that affect control stores, or when social sharing spreads awareness of a test offer beyond the intended test population. Interference violates a core assumption of standard A/B testing methodology and can bias results in either direction.
SUTVA (Stable Unit Treatment Value Assumption) The formal statistical assumption that the outcome for any unit in a test depends only on whether that unit received the treatment — not on what treatment other units received. SUTVA is violated when interference or spillover effects are present. It comes up frequently in advanced experimental design conversations and is worth understanding if you are working with data scientists who are concerned about the validity of your store selection or geographic boundaries.
ACV (All Commodity Volume) A measure of retail distribution that represents the percentage of total store sales volume covered by stores carrying a particular product. ACV is foundational vocabulary in retail analytics and appears frequently in experiment design conversations about which stores to include in a test. A test conducted in stores representing 60% ACV means the results come from locations accounting for 60% of total store sales volume, which has implications for how broadly the findings can be applied.
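The calculation weights each store by its total sales volume instead of counting stores equally. The sketch below uses a hypothetical four-store market and a hypothetical function name.

```python
def acv_percent(store_volumes, stores_carrying):
    """Percent ACV: share of total all-commodity sales volume accounted for
    by the stores that carry the product (or are included in the test)."""
    total = sum(store_volumes.values())
    covered = sum(v for store, v in store_volumes.items() if store in stores_carrying)
    return 100 * covered / total


# Hypothetical annual all-commodity sales volume by store, in thousands.
store_volumes = {"A": 500, "B": 300, "C": 150, "D": 50}
stores_carrying = {"A", "C"}

acv = acv_percent(store_volumes, stores_carrying)
share_of_stores = 100 * len(stores_carrying) / len(store_volumes)
print(acv, share_of_stores)
```

Here two of four stores is 50% of the store count but 65% ACV, because one of the two is the largest store in the market. That gap is why ACV, not store count, is the usual way to describe how much of the business a test covers.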
Velocity The sales rate for a product, typically expressed as units sold per store per week. Velocity is the standard way retail merchants and analysts talk about product performance, and it is the metric most commonly used when discussing test results at the item level. Understanding velocity — and how it changes in test versus control — is more informative than total sales volume alone because it controls for differences in store count and test duration.
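A short example shows why velocity can tell a different story from total units. The function name and all figures are hypothetical.

```python
def velocity(units_sold, store_count, weeks):
    """Units sold per store per week."""
    return units_sold / (store_count * weeks)


# Hypothetical item-level results: the control group has more stores and more total units.
test_velocity = velocity(units_sold=4_800, store_count=20, weeks=8)
control_velocity = velocity(units_sold=7_000, store_count=35, weeks=8)
velocity_lift = (test_velocity - control_velocity) / control_velocity
print(test_velocity, control_velocity, f"{velocity_lift:.0%}")
```

The control group sold more units in total, but only because it has more stores. On a per-store, per-week basis the test group is selling faster.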
The Bottom Line
The gap between running experiments and running experiments well lives largely in this vocabulary. When you understand the difference between a Type I and Type II error, you can have an honest conversation about what it means to miss a real effect versus declare a false one, and set your thresholds accordingly. When you understand CUPED, you can ask whether your testing program is getting the most out of the data it already has. When you understand peeking, you can explain to a senior leader why making a call on results two weeks into a longer planned test is not just impatient: without a sequential testing correction, it is statistically invalid.
These concepts are not academic abstractions. They are the practical tools that separate a testing program that produces results people trust from one that produces results people argue about. The retailers who build genuine experimentation capabilities almost always get here eventually — the question is whether they get here by design or by a series of painful mistakes.