Understanding P-Values Without a Math Degree
Table of Contents
- What a P-Value Actually Measures
- A Retail Example That Makes It Concrete
- The Five Most Common P-Value Misinterpretations
- The 0.05 Threshold: Convention, Not Law
- What P-Values Cannot Do Alone
- P-Hacking: What It Is and Why It Matters in Retail
- Using P-Values Correctly: A Practical Framework
- When a Non-Significant Result Is Still Informative
- The Bottom Line
If there is one statistical concept that causes more confusion, more misinterpretation, and more bad decisions in retail experimentation than any other, it is the p-value. It appears in virtually every test results readout. It is the number that determines whether a result is declared statistically significant. And in most retail organizations, it is treated as a simple pass/fail gate — below 0.05 means it worked, above 0.05 means it did not — in a way that is technically incorrect and commercially dangerous.
The p-value is not a difficult concept at its core. It is a probability — a single number that answers a specific, carefully worded question about your data. The confusion comes not from the mathematics but from the consistent misstatement of what question it is answering. Most people who use p-values routinely believe they are asking one question when they are actually asking another, and that gap produces systematic errors in how retail experiment results get interpreted and acted on.
This article explains what a p-value actually is, what question it is answering, what it does not tell you, how to use it correctly in a retail context, and the most common misinterpretations that lead otherwise rigorous organizations astray.
What a P-Value Actually Measures
Start with the question a p-value is designed to answer. It is not “did my change work?” It is not “how likely is my result to be real?” It is this:
If my change had absolutely no effect, how likely would I be to see a result as large as the one I observed — or larger — just from random variation in the data?
That is the question. Nothing more, nothing less.
When that probability is low (below your pre-specified threshold, typically 5%), you conclude that a result as large as yours would be very unlikely to occur by chance if your change did nothing. That gives you grounds to reject the null hypothesis, the assumption that your change had no effect, and to treat the observed effect as real. It does not tell you the probability that the effect is real, a distinction covered in the misinterpretations below.
When that probability is high — above your threshold — the data is not giving you sufficient evidence to rule out the possibility that the result happened by chance. You fail to reject the null hypothesis. That does not mean your change did nothing. It means your data is not strong enough to conclude that it did something.
Scribbr’s definition of the p-value puts it cleanly: a p-value tells you how likely you are to have found a particular set of observations if the null hypothesis were true. It is a conditional probability — the probability of your data, given the assumption that nothing happened. Small p-values mean the data is unlikely under that assumption. Large p-values mean the data is consistent with it.
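One way to make this definition concrete is a permutation test, which simulates the "nothing happened" world directly by shuffling which stores are labeled test and which are labeled control. The sketch below is illustrative only: the sales figures are made up, and `permutation_p_value` is a hypothetical helper, not the method any particular testing platform uses.

```python
import random

def permutation_p_value(test, control, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose test-minus-control difference in
    means is at least as large as the observed one (a one-sided p-value)."""
    rng = random.Random(seed)
    n_test = len(test)
    n_control = len(control)
    observed = sum(test) / n_test - sum(control) / n_control
    pooled = list(test) + list(control)
    as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the labels are interchangeable,
        # so every split of the pooled stores is equally likely.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_test]) / n_test - sum(pooled[n_test:]) / n_control
        if diff >= observed:
            as_extreme += 1
    return as_extreme / n_shuffles

# Hypothetical weekly category sales per store (in $000s).
test_stores = [108, 112, 97, 120, 105, 111, 99, 115, 109, 104]
control_stores = [101, 98, 95, 110, 100, 103, 92, 107, 99, 96]

p = permutation_p_value(test_stores, control_stores)
print(f"one-sided permutation p-value: {p:.4f}")
```

The returned number is exactly the quantity described above: how often a difference this large shows up when the labels carry no information.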
A Retail Example That Makes It Concrete
You run a test of a new end-of-aisle display in 50 matched stores for six weeks. At the end of the test, category sales in the test group are 8% higher than in the control group. The statistical analysis returns a p-value of 0.03.
What does that 0.03 mean?
It means: if this display actually had no effect on category sales — if the true lift were exactly zero — there would be only a 3% probability of observing a test-to-control difference as large as 8% (or larger) purely due to random store-level variation. Since 3% is below the standard threshold of 5%, you reject the null hypothesis and conclude the lift is real.
Now run the same test with a p-value of 0.14. That means: if the display had no effect, there would be a 14% probability of seeing an 8% difference just from natural variation in store performance. Since 14% is above the 5% threshold, the data does not give you sufficient grounds to rule out chance as an explanation. The test is inconclusive — not a proof that the display did nothing, but insufficient evidence to conclude it did something.
The p-value does not measure the size of the lift. The same p-value can come from an 8% observed lift or a 2% one, because it depends not only on the observed difference but also on the amount of data, the variability of the stores, and the statistical design of the test. A small p-value means the evidence is strong enough to rule out chance at your chosen threshold. A large p-value means it is not. Neither tells you directly how large the real effect is; that is what confidence intervals are for.
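To see how the same 8% lift can land on either side of the threshold, here is a minimal sketch using a normal approximation for a difference in means. The split of 25 stores per group, the standard deviations, and the helper name `two_sided_p` are all assumptions for illustration; with this few stores a t-test would usually be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(lift, sd, n_per_group):
    """Two-sided p-value for a difference in means (normal approximation).
    lift and sd are in the same units, here percent of control sales."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = lift / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same observed 8% lift; only store-level noise and store count change.
for sd, n in [(12, 25), (20, 25), (20, 100)]:
    p = two_sided_p(8, sd, n)
    print(f"lift = 8%, sd = {sd}%, stores per group = {n}: p = {p:.3f}")
```

Noisier stores push the p-value up, and more stores pull it back down, even though the observed lift never changes.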
The Five Most Common P-Value Misinterpretations
The misuse of p-values is so widespread that the American Statistical Association issued a formal statement in 2016 specifically to address it — the ASA’s Statement on P-Values identified six principles for proper interpretation and warned explicitly that “scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” Here are the five misinterpretations that most commonly show up in retail experimentation contexts.
Misinterpretation 1: “The p-value is the probability that my result is due to chance.” This is the most common misstatement of what a p-value means, and it is wrong. The p-value is not the probability that your result is noise. It is the probability of seeing a result this large if your change had no effect. The distinction is subtle but important. A p-value of 0.03 does not mean there is a 3% chance your result is random. It means that if the null hypothesis were true, there would be a 3% chance of seeing this result. Those are different statements.
Misinterpretation 2: “A low p-value means the effect is large.” Statistical significance and practical significance are entirely separate assessments. A p-value of 0.001 tells you that a result this large would be very unlikely if the change had no effect. It says nothing about whether the effect is large enough to matter commercially. In a test with a very large number of stores, even a trivially small lift, such as 0.5%, can produce a very small p-value. That result is statistically significant but may be commercially meaningless. Always evaluate effect size alongside the p-value.
Misinterpretation 3: “A p-value above 0.05 means the change did nothing.” Failing to reject the null hypothesis is not the same as confirming it. A p-value of 0.09 does not mean your change had no effect. It means your data was not strong enough to meet the 95% confidence threshold. The result could reflect a real effect that your test was underpowered to detect — a false negative — rather than the absence of any effect. As Simply Psychology’s treatment of p-values notes, when a study has low statistical power, researchers may miss detecting an actual effect, which can be mistaken for evidence that no effect exists.
Misinterpretation 4: “The p-value tells me the probability that my hypothesis is true.” P-values say nothing about the probability of a hypothesis being true or false. They say something about the probability of observed data given an assumption about the world. This distinction trips up even experienced practitioners. A p-value of 0.03 does not mean there is a 97% chance that your display improved category sales. It means that a result at least this large would occur only 3% of the time if the display had no effect.
Misinterpretation 5: “I can look at results multiple times and use the p-value each time.” The p-value calculation assumes a single pre-specified evaluation at a pre-specified sample size. Every time you look at results mid-test and evaluate them against the significance threshold, a practice known as peeking, you are conducting an additional hypothesis test that inflates your false positive rate. Run a 95% confidence test and look at results ten times during the test, and your actual false positive rate can be substantially higher than 5%. Mida’s explanation of p-hacking in A/B testing describes the consequence directly: repeatedly testing data until you find significance produces misleading findings and exaggerated statistical results. Significance should be evaluated once, at the pre-specified endpoint.
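The cost of peeking is easy to check by simulation. The sketch below generates many tests in which the change truly does nothing, then counts how often at least one look crosses the two-sided 5% threshold. It assumes evenly spaced looks, normally distributed test-minus-control differences with known variance, and a team that stops at the first "significant" look; `false_positive_rate` and its parameters are hypothetical names.

```python
import random
from math import sqrt

def false_positive_rate(n_looks, obs_per_look=20, n_sims=5000, z_crit=1.96, seed=1):
    """Simulate null tests (no true effect) and return the share in which
    any interim or final look exceeds the two-sided 5% critical value."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Sum of `obs_per_look` new test-minus-control differences, each
            # with mean 0 (no true effect) and standard deviation 1.
            total += rng.gauss(0, sqrt(obs_per_look))
            n += obs_per_look
            z = total / sqrt(n)  # z-statistic on all data collected so far
            if abs(z) > z_crit:
                flagged += 1
                break  # the team stops and declares a winner
    return flagged / n_sims

print(f"1 look:   {false_positive_rate(1):.3f}")
print(f"10 looks: {false_positive_rate(10):.3f}")
```

With a single look the rate stays near the nominal 5%; with ten looks it climbs well above it, even though no individual calculation was done incorrectly.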
The 0.05 Threshold: Convention, Not Law
The 0.05 significance threshold — the standard that produces 95% confidence — is not a scientific law. It is a convention that dates to Ronald Fisher’s 1920s statistical work and has been the default in scientific testing ever since. It is also, in many contexts, the wrong threshold for the decision being made.
In retail experimentation, the right threshold depends on the stakes and reversibility of the rollout decision. A promotional mechanic test that can be quickly reversed if it underperforms? 90% confidence may be sufficient — the cost of a false positive is limited. A permanent store format change that requires capital investment and is operationally difficult to reverse? 99% confidence is more appropriate — the cost of a false positive justifies the additional stores and time required to reach a higher threshold.
The practical implication is that the significance threshold should be set before the test begins, as part of the experiment design — not after results come in. When the threshold is set before results are seen, it is a genuine decision criterion. When it is set after — or when the threshold gets quietly adjusted because results came in at p = 0.06 rather than p = 0.05 — it is a rationalization. The discipline of pre-specification is one of the most important and least observed practices in retail experimentation.
What P-Values Cannot Do Alone
P-values are valuable tools when used correctly. But they are frequently asked to do more than they can. Here is a clear accounting of what they cannot tell you on their own.
They cannot tell you whether the effect is commercially meaningful. A p-value of 0.001 and a p-value of 0.04 both indicate statistical significance at the 95% threshold. They say nothing about whether a 0.5% lift or an 18% lift is what they detected. Effect size and the commercial decision criteria are separate questions.
They cannot tell you whether the result will replicate at rollout. Statistical significance is calculated based on your test group during the test period. Whether that result generalizes to the full fleet, persists over time, and survives the operational realities of full deployment is a question about external validity and novelty effects — not a question the p-value answers.
They cannot compensate for a poorly designed test. A test with the wrong control group, an insufficient sample size, a contaminated store set, or a design that violated any of the core assumptions of the statistical model will produce a p-value — but that p-value will not mean what you think it means. The p-value is only as reliable as the test design that produced it.
They cannot resolve a business decision on their own. The p-value is one input into a rollout decision. It needs to be combined with an assessment of commercial significance, implementation cost, strategic alignment, operational feasibility, and organizational risk tolerance. A statistically significant result for a change that costs more to implement than it will ever recover in lift is not a good rollout decision, regardless of its p-value.
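The gap between statistical and commercial significance shows up clearly when the p-value is reported next to a confidence interval. The following sketch uses a normal approximation with made-up store counts and standard deviations; `summarize` is a hypothetical helper, and both scenarios are chosen so that the p-value falls under 0.05.

```python
from math import sqrt
from statistics import NormalDist

def summarize(lift, sd, n_per_group, confidence=0.95):
    """Return (p_value, ci_low, ci_high) for a difference in means,
    using a normal approximation. lift and sd are in percent."""
    se = sd * sqrt(2 / n_per_group)
    p_value = 2 * (1 - NormalDist().cdf(abs(lift) / se))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p_value, lift - z * se, lift + z * se

# Hypothetical scenarios: (observed lift %, store-level sd %, stores per group)
scenarios = {
    "0.5% lift, 5000 stores per group": (0.5, 10, 5000),
    "18% lift, 10 stores per group": (18, 20, 10),
}
for label, args in scenarios.items():
    p, low, high = summarize(*args)
    print(f"{label}: p = {p:.3f}, 95% CI = [{low:.1f}%, {high:.1f}%]")
```

Both results are "significant", yet the first interval sits entirely below 1% and the second spans a wide range of possible lifts. The p-value alone would not distinguish them.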
P-Hacking: What It Is and Why It Matters in Retail
P-hacking — also called data dredging — is the practice of analyzing and re-analyzing data in different ways until a statistically significant result is found. In retail experimentation, it typically takes one of three forms.
Testing many metrics and reporting only the significant ones. If you measure twenty metrics in a test at the 5% threshold, chance alone would produce about one result with p < 0.05 on average even if none of the metrics were actually affected. So if three show p < 0.05, at least some of them may well be false positives. Reporting only those three as significant is p-hacking: it produces misleading findings because the multiple comparisons problem has not been accounted for.
Segmenting results until something looks significant. A test that shows no overall lift might show a 12% lift in one geographic region, a 15% lift for one customer segment, or a 20% lift for one store format. If those segments were defined after results were seen — not pre-specified as primary analyses — the significance is not reliable. Exploratory segmentation is valuable for generating hypotheses. It is not a substitute for pre-specified primary analysis.
Adjusting the time window after results are observed. Changing the evaluation period from six weeks to four weeks because the four-week result looks more favorable, or extending the window because the six-week result missed significance — both are forms of p-hacking that inflate false positive rates in ways the standard significance calculation does not account for.
P-hacking is not always intentional. Much of it happens through well-meaning analysis — a team genuinely trying to understand their results by looking at them from multiple angles. The safeguard is pre-specification: define your primary metric, your evaluation window, and your analysis approach before the test begins. Document them. Hold to them. Treat anything else as exploratory.
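The arithmetic behind the twenty-metric example can be worked out directly. This sketch assumes the metrics are independent, which retail metrics rarely are, so treat the numbers as a rough guide. It also shows a Bonferroni-adjusted threshold, one common and conservative correction that this article does not otherwise prescribe; `prob_at_least` is a hypothetical helper.

```python
from math import comb

def prob_at_least(k, n_metrics, alpha=0.05):
    """P(at least k of n_metrics show p < alpha) when no metric is truly
    affected, assuming the metrics are independent."""
    return sum(
        comb(n_metrics, i) * alpha**i * (1 - alpha) ** (n_metrics - i)
        for i in range(k, n_metrics + 1)
    )

n_metrics, alpha = 20, 0.05
print(f"expected false positives: {n_metrics * alpha:.1f}")
print(f"P(at least 1 false positive): {prob_at_least(1, n_metrics):.2f}")
print(f"P(at least 3 false positives): {prob_at_least(3, n_metrics):.2f}")
# Bonferroni: divide the threshold by the number of metrics tested.
print(f"Bonferroni per-metric threshold: {alpha / n_metrics:.4f}")
```

Under these assumptions, at least one spurious "win" among twenty unaffected metrics is more likely than not, which is why the primary metric needs to be named before the test starts.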
Using P-Values Correctly: A Practical Framework
With all of the above in mind, here is a practical framework for using p-values correctly in retail experiment evaluation.
Step 1: Set the threshold before the test begins. Decide on your significance threshold — 90%, 95%, or 99% — based on the stakes of the rollout decision. Document it. Do not change it after you see results.
Step 2: Define your primary metric before the test begins. The p-value calculation that determines your rollout decision applies to one pre-specified metric. Secondary metrics provide context but do not drive the rollout decision independently.
Step 3: Do not look at results before the evaluation date. If you must monitor results during the test for operational reasons, use sequential testing methods that account for interim looks statistically. Do not apply a standard fixed-horizon p-value to interim data.
Step 4: Evaluate the p-value alongside effect size and confidence intervals. A p-value below your threshold tells you the result would be unlikely if the change had no effect. The confidence interval tells you the range of plausible true effects. The effect size tells you whether the lift is large enough to matter commercially. All three are needed to make a sound rollout decision.
Step 5: Treat a non-significant result as inconclusive, not as evidence of no effect. If the p-value is above your threshold, the right interpretation is “we did not have sufficient evidence to detect an effect” — not “this change did nothing.” Consider whether the test was adequately powered before drawing conclusions about what the null result means.
Step 6: Document your interpretation, not just the number. A results readout that says “p < 0.001, significant at 95% confidence” is less useful than one that says “p < 0.001, significant at the 95% threshold we set before this test began; the estimated lift is 7% with a 95% confidence interval of 3%–11%; we recommend rollout based on both statistical and commercial significance.” The number alone is not the finding. The interpretation in context is.
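The steps above can be sketched as a small pre-registration record plus a readout function. Everything here is hypothetical: the `ExperimentPlan` fields, the decision rule (significant and at or above a minimum commercially meaningful lift), and the example numbers are assumptions for illustration, not a prescribed standard.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the plan cannot be edited after it is created
class ExperimentPlan:
    primary_metric: str
    alpha: float                # significance threshold, set before launch
    evaluation_weeks: int
    min_meaningful_lift: float  # smallest lift (%) worth rolling out

def readout(plan, p_value, lift, ci_low, ci_high):
    """Turn pre-computed results into an interpretation in context."""
    significant = p_value < plan.alpha
    meaningful = lift >= plan.min_meaningful_lift
    if significant and meaningful:
        verdict = "recommend rollout"
    elif significant:
        verdict = "significant but below the commercial bar"
    else:
        verdict = "inconclusive; review test power before deciding"
    return (
        f"{plan.primary_metric}: p = {p_value:.3f} against a pre-set threshold "
        f"of {plan.alpha}; estimated lift {lift}% "
        f"(95% CI {ci_low}% to {ci_high}%); {verdict}."
    )

plan = ExperimentPlan("category sales", alpha=0.05,
                      evaluation_weeks=6, min_meaningful_lift=3.0)
print(readout(plan, p_value=0.013, lift=5.2, ci_low=1.1, ci_high=9.3))
```

Freezing the plan is a small design choice that mirrors Step 1: an attempt to change `plan.alpha` after results arrive raises an error instead of quietly succeeding.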
When a Non-Significant Result Is Still Informative
A p-value above the threshold is often treated as a failed experiment and filed away. This is one of the most persistent wasteful habits in retail experimentation, and understanding p-values correctly is the antidote.
A non-significant result from a well-powered test is genuinely informative. It tells you that the effect of your change, if any, is likely smaller than the minimum detectable effect your test was designed to find. That is useful information. If your MDE was 8% and the test was properly powered, a non-significant result provides reasonable evidence that the true effect is below 8% — which may be exactly the information you need to deprioritize the initiative and allocate resources elsewhere.
A non-significant result from an underpowered test is less informative — it may reflect the absence of a large effect, or it may reflect insufficient data to detect a moderate one. The distinction matters because the responses are different. An underpowered null result should prompt a conversation about whether to rerun the test with more stores, not a conclusion that the change does not work.
Either way, a non-significant result deserves analysis and documentation — not dismissal. The retailers who treat inconclusive results with the same analytical rigor as significant ones build a richer institutional understanding of what is and is not working in their business. That understanding is one of the most valuable outputs of a mature experimentation program.
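Whether a null result came from a well-powered or an underpowered test can be checked with a quick power calculation. The sketch below uses the standard normal-approximation formulas for a two-sided, two-group comparison of means. The 8% MDE comes from the example above, while the 15% store-level standard deviation, the store counts, and the function names are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()

def power(mde, sd, n_per_group, alpha=0.05):
    """Probability of a significant result if the true lift equals mde."""
    se = sd * sqrt(2 / n_per_group)
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = mde / se
    # Chance of landing in either rejection region of the two-sided test.
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)

def stores_needed(mde, sd, alpha=0.05, target_power=0.80):
    """Approximate stores per group needed to reach the target power."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(target_power)
    return ceil(2 * ((z_crit + z_power) * sd / mde) ** 2)

print(f"power with 25 stores per group: {power(8, 15, 25):.2f}")
print(f"stores per group for 80% power: {stores_needed(8, 15)}")
```

If the power at the store count you actually used is low, a non-significant result says little about the initiative and more about the test, which is the case for rerunning with more stores.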
The Bottom Line
The p-value is one of the most useful and most consistently misused tools in retail experimentation. Used correctly — as a conditional probability that answers a specific question about the likelihood of your data under the null hypothesis, evaluated alongside effect size and confidence intervals, set against a pre-specified threshold, and interpreted with appropriate acknowledgment of what it cannot tell you — it is a powerful instrument for making confident rollout decisions.
Used incorrectly — as a pass/fail gate that substitutes for commercial judgment, applied to data that was analyzed multiple times before results were finalized, or interpreted as the probability that the result is real — it produces a false sense of rigor that is arguably worse than using no statistical framework at all.
The discipline of using p-values well is not technical. It is organizational. It is the discipline of pre-specification, of holding to thresholds, of treating non-significant results as findings rather than failures, and of insisting that the number always be interpreted alongside the context that gives it meaning.