Sample Size: How Much Data Do You Actually Need to Trust Your Results?

There is a temptation in retail experimentation to run tests with whatever stores are conveniently available — twenty stores here, thirty there — and trust that if the result looks meaningful, it probably is. It is an understandable impulse. Stores are finite resources. Tests compete with operational calendars. Leadership wants answers quickly.

But the number of stores in your test is not a logistical detail. It is a scientific parameter that determines whether your result is trustworthy. Too few stores and your test cannot reliably detect the effect you are looking for — even if that effect genuinely exists. Too many and you are spending more time and resources than the decision requires. Getting it right requires understanding how sample size connects to the statistical machinery that makes experimental results meaningful.

This article explains what determines the right sample size for a retail experiment, what happens when you get it wrong in either direction, and how to calculate what you actually need before the test begins rather than hoping the results justify the design after the fact.

Why Sample Size Is a Design Decision, Not an Afterthought

In most retail organizations, sample size gets decided by convenience: how many stores are willing to participate, how many the operations team can manage, how many fit into the regional structure of the rollout. The result is a test designed around operational constraints rather than statistical requirements — and the consequence is a program that regularly produces underpowered tests, inconclusive results, and a slow erosion of organizational confidence in experimentation.

Sample size needs to be determined before operational convenience shapes the test design. It is not a parameter to be squeezed into whatever stores happen to be available.

The reason is straightforward. The statistical analysis that will be used to evaluate your results is built on assumptions about how much data is being analyzed. When those assumptions are violated — when the actual store count is lower than the required store count — the analysis does not automatically flag the problem. It produces a result. That result just cannot be trusted the way a properly powered test result can.

An underpowered test that comes back flat does not mean your change did not work. It may mean you did not have enough stores to detect the effect. An underpowered test that comes back positive does not necessarily mean your change worked — small samples produce noisier estimates, and early positive results in underpowered tests are systematically more likely to be inflated and less likely to replicate at rollout. This is what statisticians call the winner’s curse: the effects measured in underpowered studies tend to be larger than the true effect, because only the tests that happened to capture noise-inflated results reach significance.
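The winner’s curse is easy to see in a simulation. The sketch below is illustrative only: the 3% true lift, the 10% store-level standard deviation, the 20 stores per group, and the function name are all assumptions, not figures from any real test. It runs many small tests and averages the measured lift among those that reach significance.

```python
import random
import statistics
from statistics import NormalDist

def simulate_winners_curse(true_lift=0.03, store_sd=0.10, stores_per_group=20,
                           n_tests=5000, alpha=0.05, seed=0):
    """Run many small tests; return (share significant, mean measured lift among them)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference in group means, treating store_sd as known.
    se = store_sd * (2 / stores_per_group) ** 0.5
    winners = []
    for _ in range(n_tests):
        test = [rng.gauss(true_lift, store_sd) for _ in range(stores_per_group)]
        control = [rng.gauss(0.0, store_sd) for _ in range(stores_per_group)]
        diff = statistics.fmean(test) - statistics.fmean(control)
        if diff / se > z_crit:  # significant and positive
            winners.append(diff)
    return len(winners) / n_tests, statistics.fmean(winners)

share_significant, mean_winning_lift = simulate_winners_curse()
print(f"share of tests reaching significance: {share_significant:.0%}")
print(f"average measured lift among them:     {mean_winning_lift:.1%}")
```

Under these assumed inputs, only a minority of the simulated tests reach significance, and the ones that do report an average lift of more than twice the true 3%. A 20-store test of this kind cannot reach significance with a measured lift below roughly 6%, so every "winner" is an overestimate.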

Scribbr’s guide to statistical power explains the relationship clearly: power is the probability that a test will detect a real effect if one actually exists. A test with 80% power has an 80% chance of correctly identifying a genuine lift. A test with 40% power — which is common in underpowered retail experiments — has only a 40% chance of detecting the same genuine lift. The other 60% of the time it will come back inconclusive, and the organization will have no reliable information about whether the change works or not.
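To make the 80% versus 40% contrast concrete, the sketch below approximates power for a simple two-group comparison of store-level means. It assumes equal group sizes, independent stores, and a known standard deviation, and the input values (a 5% lift, a 10% store-level standard deviation) are hypothetical.

```python
from statistics import NormalDist

def approx_power(lift, store_sd, stores_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test on a difference in means.

    lift and store_sd must be in the same units (here, fractions of baseline).
    """
    nd = NormalDist()
    se = store_sd * (2 / stores_per_group) ** 0.5
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return 1 - nd.cdf(z_crit - lift / se)

for n in (20, 63):
    print(n, "stores per group -> power", round(approx_power(0.05, 0.10, n), 2))
```

With these assumed inputs, 63 stores per group gives roughly 80% power, while 20 stores per group gives roughly 35%. The same genuine lift would be missed about two times out of three.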

The Four Factors That Determine Required Sample Size

Sample size in a retail experiment is determined by four factors working together. Understanding each one is essential to understanding why the required store count can vary so dramatically across different test scenarios.

1. Effect size — how large a lift you need to detect.

Effect size is the magnitude of the difference you are trying to measure. If you expect a change to drive a 20% lift in category sales, you need fewer stores to reliably detect it than if you expect a 3% lift. Smaller effects are harder to see against the background noise of natural store-level variability, and you need more data to be confident that a small difference is real rather than random.

This is one of the most important and most frequently neglected considerations in retail experiment design. Before calculating your sample size, you need to define the minimum effect size that would justify a rollout — the lift below which you would not implement the change even if you saw it. That minimum detectable effect (MDE) is the number that goes into your power calculation.

Scribbr’s treatment of effect size puts it directly: a large effect size means a research finding has practical significance, while a small effect size indicates limited practical application. In retail terms, defining your MDE upfront forces you to connect your statistical design to your commercial decision criteria — which is exactly where they should be connected.

2. Metric variability — how much natural fluctuation exists in what you are measuring.

Every retail metric fluctuates naturally week to week and store to store. Category sales vary with traffic patterns, competitive activity, weather, seasonal drift, and dozens of other factors. The more naturally variable your metric, the harder it is to see a genuine treatment effect against the background noise — and the more stores and data points you need to average out the noise.

This is expressed statistically as variance, a measure of how spread out your metric values are around the mean. High variance means you need more stores. Low variance means you can reach equally reliable estimates with fewer stores. Before finalizing your sample size, pull the historical week-to-week variability of your target metric across comparable stores. The variance of your metric is as important to your power calculation as the effect size you are targeting.
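One simple way to summarize historical variability is the coefficient of variation: the standard deviation of per-store average weekly sales divided by their mean. The sketch below is a minimal illustration; the store names and sales figures are made up, and a real analysis would use many more stores and weeks.

```python
import statistics

def store_level_cv(weekly_sales_by_store):
    """Coefficient of variation of per-store average weekly sales."""
    store_means = [statistics.fmean(weeks) for weeks in weekly_sales_by_store.values()]
    return statistics.stdev(store_means) / statistics.fmean(store_means)

# Hypothetical history: four weeks of category sales for three comparable stores.
history = {
    "store_a": [1000, 1100, 900, 1000],
    "store_b": [1200, 1300, 1250, 1250],
    "store_c": [800, 700, 750, 750],
}
print(f"store-level variability: {store_level_cv(history):.0%} of baseline")
```

Expressing variability as a share of baseline puts it in the same units as a percentage lift, which makes it easier to compare against the effect size you are targeting.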

3. Confidence level — how willing you are to accept a false positive.

The standard confidence threshold for most retail experiments is 95%, meaning that if the change truly has no effect, you accept a 5% chance of seeing a significant result anyway. Higher confidence requirements (99%) need larger samples because the statistical threshold for significance is stricter. Lower confidence requirements (90%) need smaller samples. As discussed in detail in What Is Statistical Significance?, the right confidence level depends on the stakes of the decision being made. The connection to sample size is direct: choosing your confidence level before calculating your sample size ensures that your test design is calibrated to the risk tolerance appropriate for the decision.

4. Statistical power — how willing you are to accept a false negative.

Power is the complement of the false negative risk. At 80% power, you accept a 20% chance of missing a real effect — a false negative, where you conclude the change did not work when it actually did. At 90% power, you accept a 10% false negative risk. Higher power requires larger samples.

The standard for most retail experiments is 80% power. This means that if your change truly produces the minimum effect size you defined, your test has an 80% probability of detecting it. That 80% is a floor, not a ceiling — higher-stakes decisions warrant higher power targets — but it is the right baseline for most category-level retail tests.

These four factors interact in a specific mathematical relationship: as effect size decreases, required sample size increases. As variance increases, required sample size increases. As confidence level increases, required sample size increases. As power target increases, required sample size increases. Changing any one of them changes the required store count, and understanding those trade-offs is the foundation of principled sample size planning.
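For readers who want to see the relationship written out, one common textbook approximation for a two-group comparison of means is shown below. It is a sketch under simplifying assumptions (equal group sizes, independent stores, a normally distributed metric with a common standard deviation), not necessarily the formula any particular platform uses.

```latex
n_{\text{per group}} \;\approx\; \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}}
```

Here \sigma is the store-level standard deviation of the metric, \delta is the minimum detectable effect in the same units, \alpha is one minus the confidence level, 1-\beta is the power target, and z_{p} is the p-th quantile of the standard normal distribution. Because \delta is squared in the denominator, halving the effect you want to detect roughly quadruples the required store count.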

What Happens When You Have Too Little Data

The consequences of running underpowered retail experiments are concrete and consistent, and they compound over time in ways that are genuinely damaging to an organization’s ability to make good decisions.

You abandon good initiatives. When an underpowered test comes back inconclusive — showing a small but genuine lift that the test was not large enough to confirm statistically — the organization treats it as a failed test and moves on. The change that would have driven meaningful improvement across the full fleet never gets rolled out, because the test design was not capable of seeing it. This is the false negative at scale.

You roll out initiatives based on inflated estimates. Underpowering also hurts in the other direction, through the winner’s curse. Underpowered tests that do produce significant results tend to overestimate the true effect size, because only the tests that captured noise-inflated results reach the significance threshold. When those overstated results get rolled out, they underperform. The rollout shows a 4% lift when the test showed 9%, and the narrative becomes “testing doesn’t predict real-world results”, which is the wrong lesson. The right lesson is that the test was underpowered.

You erode organizational trust in the testing program. When experiments consistently produce either inconclusive results or rollout underperformance, the organization stops trusting the methodology. Leaders stop asking “what did the test show?” and start relying on their own judgment again. The testing program becomes a box-checking exercise rather than a genuine decision-support tool. Rebuilding that trust requires showing that the tests are properly designed — which starts with proper sample sizing.

What Happens When You Have Too Much Data

It is worth spending a moment on the other direction, which is less discussed but also real.

Overpowered tests — tests with far more stores than the statistical requirements demand — produce reliable results but at unnecessary cost. More stores means more operational overhead, more stores held out of normal operations during the test period, more analytical resources required to process and present results, and a longer time before the test delivers actionable information.

The more subtle problem is the flip side of what makes significance valuable: with a very large sample, you will detect statistically significant effects that are far too small to be commercially meaningful. A test with 500 stores might reliably detect a 0.5% lift — but a 0.5% lift is almost certainly below the practical significance threshold for any implementation decision. When a result is statistically significant but practically trivial, and the organization acts on it anyway, the result is investment in changes that produce no discernible commercial benefit.

The discipline of setting your MDE — defining the smallest effect worth detecting before the test begins — prevents both under-design and over-design. It gives you the number that, combined with your variance estimate, confidence level, and power target, produces a required sample size that is right-sized for the decision, not just the statistics.

Calculating Your Required Store Count: A Practical Approach

Most retail practitioners do not need to work through the mathematics of a power calculation by hand. It is a standard analysis that a statistician or an experimentation platform can compute automatically. What retail leaders do need to understand is the inputs and what drives the output.

The inputs to a retail power calculation are:

  • Baseline metric value: What is the current level of the metric you are measuring? Category sales per store per week, units per transaction, transaction frequency among loyalty members — whatever your target metric is, you need its baseline.
  • Minimum detectable effect: What is the smallest lift that would justify a rollout? Express this as a percentage of the baseline — a 5% lift, a 10% lift, a 15% lift.
  • Metric variance: What is the typical week-to-week standard deviation of your target metric across comparable stores? Your historical data should provide this.
  • Confidence level: What significance threshold are you using? 90%, 95%, or 99%?
  • Power target: What probability of detecting a real effect are you requiring? 80% is standard.

With these five inputs, a power calculation produces the required number of stores per group. Double it for the total test plus control store count.
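As a rough illustration of how the five inputs combine, the sketch below applies the standard normal-approximation formula for comparing two group means. It assumes equal group sizes and independent stores, and it ignores refinements such as pre-period adjustment or matched pairs. The function name and the example figures are hypothetical.

```python
import math
from statistics import NormalDist

def stores_per_group(baseline, mde_pct, store_sd, confidence=0.95, power=0.80):
    """Approximate stores needed in each group (test and control).

    baseline  : current metric value per store (e.g. weekly category sales)
    mde_pct   : minimum detectable effect as a fraction of baseline (0.05 = 5%)
    store_sd  : standard deviation of the metric, same units as baseline
    """
    nd = NormalDist()
    delta = baseline * mde_pct                       # absolute lift to detect
    z_alpha = nd.inv_cdf(1 - (1 - confidence) / 2)   # two-sided significance threshold
    z_beta = nd.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * store_sd / delta) ** 2)

# Hypothetical inputs: $10,000 weekly baseline, 5% MDE, $1,000 standard deviation.
per_group = stores_per_group(baseline=10_000, mde_pct=0.05, store_sd=1_000)
print(per_group, "stores per group,", 2 * per_group, "in total")
```

With these assumed inputs the function returns 63 stores per group, or 126 in total. Halving the MDE to 2.5% raises the requirement to 252 per group, roughly four times as many.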

Optimizely’s guide to sample size calculations for experiments provides a clear walkthrough of how these components interact — framed for digital experimentation but directly applicable to physical retail with the appropriate retail-specific adaptations. The core logic is identical: smaller MDE requires more stores, higher variance requires more stores, higher confidence requires more stores, higher power requires more stores.

For physical retail, the practical guidance is to always run this calculation before finalizing the test design, and to treat the output as a hard constraint rather than a suggestion. If the required store count exceeds what is operationally available, the right response is not to run the test anyway with fewer stores — it is to either expand the available store set, raise the MDE (accept that you are only testing for larger effects), or lower the confidence threshold with explicit acknowledgment of the increased false positive risk.
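When the store count is fixed, the same relationship can be run in reverse to show what the available stores can actually detect. The sketch below uses the normal approximation for two equal-sized groups of independent stores; the function name and example inputs are assumptions for illustration.

```python
from statistics import NormalDist

def detectable_lift_pct(baseline, store_sd, stores_per_group,
                        confidence=0.95, power=0.80):
    """Smallest lift (as a fraction of baseline) detectable with the given stores."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - (1 - confidence) / 2) + nd.inv_cdf(power)
    delta = z_sum * store_sd * (2 / stores_per_group) ** 0.5
    return delta / baseline

# Hypothetical: only 20 stores per group are available.
at_95 = detectable_lift_pct(10_000, 1_000, 20)
at_90 = detectable_lift_pct(10_000, 1_000, 20, confidence=0.90)
print(f"MDE at 95% confidence: {at_95:.1%}; at 90% confidence: {at_90:.1%}")
```

With 20 stores per group and these assumed inputs, the smallest lift the test can reliably detect is close to 9% of baseline. Relaxing confidence to 90% brings it down to a little under 8%, which makes the cost of each trade-off explicit before the test runs.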

Effect Size in Retail: What Lift Are You Actually Trying to Detect?

Setting the right MDE for a retail experiment is both a statistical question and a business question, and getting the business side right is as important as the statistical mechanics.

The MDE should be set at the minimum lift that would meaningfully change the rollout decision. Not the lift you hope to see. Not the lift that would make the initiative financially attractive. The lift below which you would not roll out, even if you observed it.

This distinction matters because it directly determines how many stores you need. If you set your MDE too low — targeting effects smaller than your commercial decision criteria actually require — you will calculate a store requirement that is larger than necessary, consuming more resources than the decision demands. If you set it too high — accepting that you can only detect very large effects — you will miss real but modest improvements that would justify rollout.

A practical approach to setting the MDE for retail experiments:

Start with the financial break-even. For any initiative with an implementation cost, calculate the minimum lift required for the initiative to be financially positive. That break-even lift is your floor for the MDE. There is no point designing a test to detect a lift below the point where implementation makes financial sense.
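The break-even arithmetic can be sketched in a few lines. This is a deliberately simple model: it assumes incremental sales earn the same gross margin as baseline sales and ignores cannibalization and ongoing costs. All parameter names and figures are hypothetical.

```python
def break_even_lift_pct(cost_per_store, weekly_sales_per_store,
                        gross_margin, payback_weeks):
    """Lift (fraction of baseline sales) at which added margin repays the cost."""
    baseline_margin = weekly_sales_per_store * gross_margin * payback_weeks
    return cost_per_store / baseline_margin

# Hypothetical: $5,200 per store to implement, $10,000 weekly category sales,
# 25% gross margin, and a one-year payback window.
floor = break_even_lift_pct(5_200, 10_000, 0.25, 52)
print(f"break-even lift: {floor:.1%}")
```

With these assumed figures the break-even lift is 4%, so an MDE below 4% would be testing for effects that could not justify the rollout.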

Consider historical benchmarks. What lifts have similar initiatives produced in past tests? If prior promotional tests in this category have ranged from 3% to 12%, designing a test with an MDE of 2% is over-engineered. An MDE of 4% to 5% captures most of the realistic range of effects while keeping the store count manageable.

Account for expected effect size decay at rollout. As noted in the discussion of underpowered tests and the winner’s curse, test results typically overestimate true effects because of selection effects and novelty inflation. A 10% lift in the test may translate to a 6% to 8% lift at full rollout. Designing your MDE around the rollout effect rather than the test effect produces a more accurate calibration.

Sample Size Calculators: Useful Tools With Real Limitations

Free sample size calculators are widely available online and are useful starting points for back-of-envelope estimates. They range from simple tools that ask for a baseline conversion rate and desired lift to more sophisticated implementations that incorporate power targets and variance estimates.

Optimizely’s sample size calculator is one of the most widely used, and it is a reasonable tool for digital experimentation scenarios. For physical retail experiments involving store-level metrics — weekly category sales, average basket size, transaction frequency — most online calculators are not directly applicable without adaptation. The metrics, variance structures, and sample units of physical retail are different enough from website conversion rates that direct application of digital experimentation calculators will often produce misleading results.

The most reliable approach for physical retail sample size planning is either to use experimentation software that incorporates physical retail-specific power calculations, or to work with a data scientist who understands both the statistical mechanics and the specific variance characteristics of your retail metrics. The calculation is not difficult — it is a standard power analysis — but it needs to be done with inputs specific to your retail context, not imported from a digital template.

What calculators cannot do, regardless of how sophisticated they are, is substitute for the business judgment required to set appropriate inputs. The MDE, the confidence level, and the power target are all decisions that involve commercial judgment, risk tolerance, and organizational context. The calculator gives you the store count once you have those decisions made. It does not make them for you.

The Relationship Between Sample Size, Duration, and the Full Test Design

Sample size does not live in isolation. It connects directly to test duration — the longer the test runs, the more data accumulates per store, which affects the effective sample for some metric types — and to the store selection decisions covered in the previous article.

For count-based metrics like transactions and units sold, the effective sample grows with time: four weeks of data from 40 stores is a larger effective sample than two weeks from the same 40 stores. This is why the required store count and the required test duration need to be calculated together rather than independently. If you have fewer stores than ideal, a longer test duration can partially compensate by accumulating more data per store — though it cannot fully substitute for more stores, because week-to-week variance within a store is different from store-to-store variance across a fleet.
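The trade-off between stores and weeks can be sketched by splitting variability into a store-to-store part, which does not shrink as the test runs longer, and a week-to-week part, which does. The sketch below assumes weekly values within a store are independent, which real data often violates, and all inputs are hypothetical.

```python
import math
from statistics import NormalDist

def stores_needed(delta, between_sd, within_sd, weeks,
                  confidence=0.95, power=0.80):
    """Approximate stores per group when each store is averaged over `weeks` weeks."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1 - (1 - confidence) / 2) + nd.inv_cdf(power)
    # Variance of one store's average: the between-store part is fixed,
    # the within-store (week-to-week) part shrinks with more weeks.
    var_store_mean = between_sd ** 2 + within_sd ** 2 / weeks
    return math.ceil(2 * z_sum ** 2 * var_store_mean / delta ** 2)

# Hypothetical: detect a $500 lift; $800 store-to-store SD, $1,200 week-to-week SD.
for weeks in (2, 4, 8):
    print(weeks, "weeks ->", stores_needed(500, 800, 1_200, weeks), "stores per group")
```

With these assumed inputs, going from two weeks to four reduces the requirement from 86 stores per group to 63. Even a very long test cannot push it below about 41, because store-to-store variance does not average out over time.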

For rate-based metrics — conversion rate, basket size, promotional attachment rate — the effective sample is determined more by the number of independent observations than by duration alone. More stores with shorter duration may be preferable to fewer stores with longer duration, depending on the specific metric structure.

These nuances are why the right answer to “how many stores do I need?” is always “it depends” — and why the work of characterizing your metric, understanding its variance structure, and connecting it to the decision criteria before the test begins is the most valuable analytical investment your experimentation program can make.

The Bottom Line

Sample size is where statistical theory meets operational reality in retail experimentation. Every decision that determines the required store count — the MDE, the confidence level, the power target, the metric variance — is a decision that has commercial consequences. Getting them right means your test results are trustworthy. Getting them wrong means either wasting resources on an overpowered test or — far more commonly — making decisions based on underpowered results that cannot bear the weight being placed on them.

The most important habit to develop is calculating the required store count before the test design is finalized, not after the operational constraints have already determined how many stores are available. When the calculation and the constraints conflict, that conflict is a signal to resolve the gap before the test runs, not a detail to overlook on the way to a result.
