December 22, 2018

# Implications of Multiple Control Matches on A/B Testing Accuracy and Frequency

We conducted a simulation experiment to examine the effect of multiple control matches on A/B testing accuracy and frequency. Our simulation experiment shows that 1-to-many matching offered slight improvements in noise reduction when compared with 1-to-1 matching. However, the improvements are not substantial enough to warrant a decrease in testing frequency, representativeness, and comparability that occurs in 1-to-many matching conditions. We recommend 1-to-1 matching as the best method to optimize return on investment for A/B retail testing. We conclude with recommendations for retailers and future researchers.

### I. INTRODUCTION

Brick-and-mortar retailers need robust statistical methods and rigorous experimental design to conduct accurate A/B tests that appropriately inform decision making (1). These methods could include trade area analysis using drive times to store locations; propensity score matching, which reduces a host of covariates to a single score that can be used to match stores (2); unsupervised machine learning methods that group stores into latent classes; or regression models such as lasso regression, ridge regression, multiple linear regression, or XGBoost regression.

To make the problem more complex, retail data is unique in a number of ways that complicate statistical analysis. For example, retail data is hierarchically nested within customer, store, and time. Time series data is serially autocorrelated, meaning that individual data points are not independent over time. This is particularly true in retail, where a high intraclass correlation coefficient reduces the effective sample size and biases the confidence estimates produced by statistical analyses.
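The effect of intraclass correlation on effective sample size can be made concrete with the standard design-effect approximation for clustered data. This is a textbook formula, not something taken from the simulation in this paper, and the symbols and numbers below are illustrative assumptions: $n$ is the total number of observations, $m$ is the number of observations per cluster (for example, weeks per store), and $\rho$ is the intraclass correlation coefficient.

$$
n_{\text{eff}} = \frac{n}{1 + (m - 1)\,\rho}
$$

For example, 100 stores observed for 52 weeks give $n = 5200$ and $m = 52$. With a hypothetical $\rho = 0.5$:

$$
n_{\text{eff}} = \frac{5200}{1 + 51 \times 0.5} = \frac{5200}{26.5} \approx 196
$$

Under these assumptions, 5,200 store-weeks carry roughly the information of 196 independent observations, which is why analyses that treat every store-week as independent overstate their confidence.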

In this paper, we focus on how the number of control matches used in treatment-control matching impacts the accuracy and testing frequency of an A/B test.

### II. METHOD

*A. Rationale*

In order to compare control matching methods (3), we used a non-parametric block bootstrapping approach. We selected a random time point in the past 2 years as the test start date, selected a random number of weeks for the duration of the test, selected treatment stores using the MarketDial treatment selection method, selected control stores using the MarketDial control store selection method, and then analyzed the test to get lift results. We expected these lifts to be close to zero with a small standard deviation because treatment stores were not impacted by an actual test, thus resulting in no lift. The method that resulted in an average lift closest to zero with the smallest standard deviation was considered the most accurate method.

*B. Simulation Method*

In our simulation we varied the duration of the post-period, the number of products tested, whether the products are clustered within a hierarchy or randomly chosen, the number of controls, whether we select control stores with replacement or without replacement, and how we aggregated control stores in a multiple selection scenario. A detailed description of our simulation can be found below:

We conducted 300 iterations for each condition. For each iteration:

- Use 2 years of historical data
- Use 13 weeks as the pre-period
- Randomly select post-period duration between 4 and 52 weeks
- Randomly select test start date such that pre-period duration and post-period duration fit within the 2 years of historical data
- Select a product set based on a random hierarchy level and a percentage of products from that hierarchy
- Select treatment stores using the MarketDial treatment store selection method
- Vary 1, 2, 5, and 10 control matches per treatment store for this condition
- Vary with and without replacement conditions
- Vary how control stores are aggregated in multiple matching scenarios
- Select control stores using the MarketDial control store selection method
- Analyze the results using the MarketDial lift calculation method
- Save lift and confidence estimate
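The iteration steps above can be sketched roughly as follows. This is a minimal illustration, not the MarketDial implementation: the treatment selection (random), the control selection (nearest stores on pre-period average sales), and the lift formula (treatment growth relative to control growth) are all stand-in assumptions, and the function name `simulate_placebo_lifts` and the synthetic data are hypothetical. Product selection and the replacement and aggregation conditions are omitted.

```python
import numpy as np


def simulate_placebo_lifts(sales, n_treatment=10, n_controls=1, n_iter=300,
                           pre_weeks=13, min_post=4, max_post=52, seed=0):
    """Run placebo tests on historical sales of shape (n_stores, n_weeks)."""
    rng = np.random.default_rng(seed)
    n_stores, n_weeks = sales.shape
    lifts = np.empty(n_iter)
    for i in range(n_iter):
        # Random post-period length, then a start week that leaves room for
        # the pre-period before it and the post-period after it.
        post_weeks = int(rng.integers(min_post, max_post + 1))
        start = int(rng.integers(pre_weeks, n_weeks - post_weeks + 1))
        pre = sales[:, start - pre_weeks:start].mean(axis=1)
        post = sales[:, start:start + post_weeks].mean(axis=1)

        # ASSUMPTION: random treatment stores; controls are the n_controls
        # remaining stores closest in pre-period average sales.
        stores = rng.permutation(n_stores)
        treat, pool = stores[:n_treatment], stores[n_treatment:]
        dist = np.abs(pre[treat][:, None] - pre[pool][None, :])
        ctrl = pool[np.argsort(dist, axis=1)[:, :n_controls]]

        # ASSUMPTION: lift is treatment growth relative to control growth.
        treat_growth = post[treat].sum() / pre[treat].sum()
        ctrl_growth = post[ctrl].sum() / pre[ctrl].sum()
        lifts[i] = treat_growth / ctrl_growth - 1.0
    return lifts


# Synthetic history: 60 stores, 104 weeks, no true treatment effect.
rng = np.random.default_rng(42)
base = rng.uniform(50.0, 150.0, size=60)
sales = base[:, None] * rng.lognormal(0.0, 0.1, size=(60, 104))
lifts = simulate_placebo_lifts(sales)
print(f"mean lift = {lifts.mean():.4f}, sd = {lifts.std(ddof=1):.4f}")
```

Because no treatment is applied, a well-behaved method should produce lifts centered near zero, and the standard deviation of `lifts` is the noise measure compared across conditions.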

The simulation was conducted across two different sets of retail data.

### III. RESULTS

*A. All Methods Comparison*

Table 1 shows the percent change in the standard deviation of lift estimates across the 300 iterations of our simulation as sample size and the number of control matches change. The first row compares the without-replacement and with-replacement conditions (see Appendix A for definitions). In the table, "WO to W" compares without replacement to with replacement, "Random to 1" compares random matching to 1 control match, and 1, 2, 5, and 10 indicate the number of control matches in each comparison. The values in this table were computed as follows:

- Conduct 300 simulation iterations (spanning across all cells in Table 1).
- Compute the standard deviation of the lift estimates resulting from all 300 tests. This metric is a numeric representation of the variability for each method.
- Group and average the standard deviations by the sample size and matching conditions listed in Table 1.
- Compare the first condition listed in each row to the second condition listed in each row (without replacement to with replacement, random to 1 control store match, 1 to 2 control store matches, 1 to 5 control store matches, and 1 to 10 control store matches) and compute a percent difference
- The percent difference between the first and second method on each row is represented in Table 1.
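The comparison described in the steps above might look like the following sketch. The sign convention (negative means the second condition is less noisy), the use of the sample standard deviation, and the names `pct_change_in_std` and `compare_conditions` are assumptions for illustration; the lift values are made up.

```python
import numpy as np


def pct_change_in_std(lifts_first, lifts_second):
    """Percent change in the standard deviation of lifts, first -> second."""
    sd_first = np.std(lifts_first, ddof=1)
    sd_second = np.std(lifts_second, ddof=1)
    return 100.0 * (sd_second - sd_first) / sd_first


def compare_conditions(lifts_by_cell, first, second):
    """lifts_by_cell maps (sample_size, condition) -> sequence of lifts.

    Returns {sample_size: percent change}; sample sizes where either
    condition is missing are skipped (the NA cells of a table).
    """
    sizes = sorted({size for size, _ in lifts_by_cell})
    result = {}
    for size in sizes:
        a = lifts_by_cell.get((size, first))
        b = lifts_by_cell.get((size, second))
        if a is not None and b is not None:
            result[size] = pct_change_in_std(a, b)
    return result


# Made-up lifts: the 5-control condition is half as noisy at sample size 10
# and cannot be built at sample size 70.
cells = {
    (10, "1 control"): [-0.02, 0.0, 0.02],
    (10, "5 controls"): [-0.01, 0.0, 0.01],
    (70, "1 control"): [-0.01, 0.0, 0.01],
}
table_row = compare_conditions(cells, "1 control", "5 controls")
print(table_row)
```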

Table 1 shows that the MarketDial matching method (see Appendix A for a definition) is significantly better than random matching, especially with smaller sample sizes (a 22% and 21.5% decrease in noise in the 10 and 20 sample size conditions, respectively). There is also a marked improvement at higher sample sizes, though not as drastic as at the smaller sample sizes. This is to be expected, because increasing sample size also reduces noise (due to the central limit theorem).

*Table 1.* Standard deviation of lift estimates percent changes across sample size and condition.

There was no consistent difference across sample size between the without and with replacement conditions.

The 2, 5, and 10 control match conditions, on average, had consistent improvements in noise reduction when compared with 1 control match (see the last 3 rows in Table 1), with 6.7%, 12.3%, and 13.2% noise reduction respectively. These noise reduction percentages seem more impactful than they actually are. Figures 1, 2, and 3 show three randomly chosen tests with 1 control store match and 5 control store matches.

Looking at Figures 1, 2, and 3, it is hard to find consistent differences between these methods. The main conclusion from these figures is that a 10% reduction in noise is not a significant improvement in test accuracy and cannot be discerned visually. Also notice that it is not possible to build tests with 10 control matches and a sample size greater than 30, or with 5 control matches and a sample size greater than 60, as shown by NA in Table 1. On top of this, there are downsides that we believe are more impactful than the minor improvement in noise reduction. We next discuss how representativeness and comparability are negatively impacted as the number of control matches increases.

Representativeness is a metric indicating how representative treatment stores are of the final roll-out group for a test (see Appendix A for a more complete definition). Comparability is a metric indicating how comparable control stores are to the treatment group (see Appendix A for a more complete definition). Figures 4 and 5 show the representativeness and comparability charts over time as a retailer builds more tests over the course of a year, broken up by control matching condition. The x-axis indicates the test number as a retailer successively builds more tests. The y-axis is the representativeness or comparability score. These scores range between 0 and 100 with 100 being best.

Notice that representativeness and comparability both decreased over time, and that the decline was more drastic as the number of control matches increased. The decline occurred because treatment stores cannot be simultaneously reused as treatment stores in other tests (tests may impact each other and bias results), so the total consideration set (sample size) decreases. This leaves fewer stores to select from when optimizing the treatment and control groups. This is an important consideration, as lower representativeness scores will result in less consistency between test lift results and actual roll-out lifts. Lower comparability also adds bias to a test, as the control group is no longer as comparable to the treatment group.

Finally, as the number of control matches increased, there were fewer data points along the x-axis. This reflects a decrease in the number of tests that can be conducted, which is explored more fully in III. Results B.

*B. Testing Frequency Impact of 1-to-Many Matching*

Although 1-to-many matching offers slight improvements in noise reduction, these improvements do not overcome its negative effect on testing frequency. Table 3 shows the reduction in testing frequency caused by a 1-to-many matching strategy. The parameters used for Table 3 are 300 store locations, 13 weeks of pre-period, 4 weeks of implementation, 8 weeks of test duration, 13 weeks of burn-down, and 10 treatment stores per test over the course of 1 year.

Table 3 assumes control stores can be selected as long as they are not used as treatment stores in other tests, meaning control stores can be reused as control stores in additional tests. The lower bound in this table refers to the maximum number of tests that could be conducted if you never reused any control stores. Essentially, each test has a unique set of control stores. The upper bound in Table 3 refers to the maximum number of tests that could be conducted if the exact same control stores were used for all tests. Essentially, each test had the exact same unique set of control stores. The upper and lower bounds in this case are far from realistic, based on the MarketDial treatment and control selection method (see Appendix A for more information), so an Average column has been provided as a realistic scenario between the two.
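The logic behind the two bounds can be illustrated with a simplified count of how many tests can hold stores at the same moment. This sketch is an assumption-laden simplification and does not reproduce Table 3: it ignores the pre-period, implementation, test, and burn-down weeks that determine how often stores free up over a year, it assumes the controls within a test are distinct stores, and the function name `concurrent_test_bounds` is hypothetical.

```python
def concurrent_test_bounds(n_stores, n_treatment, controls_per_treatment):
    """Return (lower, upper) bounds on tests that can run at the same time.

    Lower bound: every test uses its own, never-reused control stores.
    Upper bound: every test shares one fixed set of control stores.
    ASSUMPTION: a store is either a treatment store in exactly one test
    or a control store, never both.
    """
    n_controls = n_treatment * controls_per_treatment
    lower = n_stores // (n_treatment + n_controls)
    upper = max((n_stores - n_controls) // n_treatment, 0)
    return lower, upper


# 300 stores and 10 treatment stores per test, as in the Table 3 parameters.
bounds = {k: concurrent_test_bounds(300, 10, k) for k in (1, 2, 5, 10)}
for k, (lower, upper) in bounds.items():
    print(f"{k:>2} controls per treatment store: lower={lower}, upper={upper}")
```

Even in this simplified view, both bounds fall as the number of control matches grows, because every additional control store is a store that can no longer serve as a treatment store.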

*Table 3.* Testing frequency reduction caused by 1-to-many matching with 300 store locations.

Table 4 provides another example, keeping all parameters the same as Table 3 except with 1000 store locations rather than 300 store locations.

*Table 4.* Testing frequency reduction caused by 1-to-many matching with 1000 store locations.

In both cases (the 300-store chain and the 1000-store chain), each increase in the number of control stores matched to each treatment store (from 1 to 2, 2 to 5, or 5 to 10) decreased the potential number of tests that could be conducted by over 10%, and in a few cases by up to 20%.

*C. Summary and Discussion*

By comparing the differences found in Table 1 with the reduction in testing frequency reported in Tables 3 and 4, we conclude that 1-to-1 matching is the best method to maximize return on investment from A/B tests, because it maximizes both test accuracy and the number of potential tests that can be conducted per year.

### IV. LIMITATIONS

There are a number of limitations associated with this simulation analysis. First, the rationale to compare methods according to the mean and standard deviation of the lift estimates resulting from historical simulated tests may not be the best way to compare multiple control store matching techniques. Future research should consider additional ways to compare multiple control store matching techniques.

Second, the MarketDial method for selecting treatment and control stores may have biased the results of this experiment. While we have shown the MarketDial method for selecting treatment and control stores was significantly better than a random sample, we acknowledge that improvements to the MarketDial matching algorithms may result in improved 1-to-many matching results as better matches may improve method accuracy as more control stores are added. Future research should continue to make improvements to the MarketDial matching methods and updates should be made to this control matching study as these matching methods improve.

### V. CONCLUSION

This paper investigated the effect of multiple control store matches on test accuracy and testing frequency. We conducted a rigorous simulation experiment comparing 32 different matching strategies across a number of different test conditions. On average, we found that 1-to-1 matching and 1-to-many matching using the MarketDial treatment and control store selection methods were significantly better than random matching. Furthermore, we found that while 1-to-many matching resulted in slightly more accurate estimates, the reduction in representativeness, comparability, and testing frequency that occurred in 1-to-many matching scenarios makes it inadvisable.

More specifically, we compared the number of tests that can be conducted across all stores over the period of one year. We found that 1-to-many matching (switching from 1 to 2, 2 to 5, or 5 to 10) resulted in at least a 10% decrease and at most a 25% decrease in the number of tests that can be conducted per year. This is a significant decline in potential testing due to using multiple control stores.

We recommend 1-to-1 matching as the best method to maximize return on investment in store marketing campaigns, promotions, pricing tests, new product tests, or other initiatives. 1-to-1 matching is almost as accurate as its 1-to-many counterparts, allows for higher representativeness and comparability metrics, and allows substantially more tests to be conducted per year.

### VI. ACKNOWLEDGMENTS

The authors thank Morgan Davis, Joe Turner, and Josh Baran for their insightful critique of our methods and results.

### VII. REFERENCES

(1) Kirk, R. E. (2007). Experimental design. The Blackwell Encyclopedia of Sociology.

(2) White, H., & Sabarwal, S. (2014). Quasi-experimental design and methods. Methodological Briefs: Impact Evaluation, 8, 1-16.

(3) Gu, X. S., & Rosenbaum, P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4), 405-420.

### VIII. APPENDIX A

*A. Definitions*

*Representativeness:* This score is calculated by comparing the demographics and relevant business metrics of treatment stores to the entire fleet of stores. The more similar the treatment group is to the entire fleet, the higher the representativeness score. The score ranges from 0 to 100, with 100 meaning perfectly representative and 0 meaning not representative at all.

*Comparability:* This score is calculated by comparing the demographics and relevant business metrics of treatment stores to control stores. The more similar the treatment group is to the control stores, the higher the comparability score. The score ranges from 0 to 100, with 100 meaning perfectly comparable and 0 meaning not comparable at all.

*With replacement:* The same control store can be matched to multiple treatment stores in the same test. After you select a control store, you “replace” it back into your consideration set (or the group of stores you are drawing from) so you can select it again if it is the best match.

*Without replacement:* Each control store is only matched to a single treatment store and can only be used once per test. After you select a control store, you do not “replace” it in the consideration set; you leave it out so it is only selected once per test.
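The difference between the two conditions can be shown with a small greedy nearest-neighbor matcher. This is an illustrative sketch only: the Euclidean distance, the greedy processing order, the function name `match_controls`, and the one-feature toy data are assumptions, and it is not the MarketDial control selection method.

```python
import numpy as np


def match_controls(treat_features, pool_features, with_replacement):
    """Match each treatment store to its nearest candidate control store.

    Returns a list with one pool index per treatment store. Treatment
    stores are processed in order, so without replacement the earlier
    stores get first pick (a greedy simplification).
    """
    treat = np.asarray(treat_features, dtype=float)
    pool = np.asarray(pool_features, dtype=float)
    if not with_replacement and len(pool) < len(treat):
        raise ValueError("not enough control candidates without replacement")

    # Euclidean distance between every treatment store and every candidate.
    dist = np.linalg.norm(treat[:, None, :] - pool[None, :, :], axis=2)
    used = np.zeros(len(pool), dtype=bool)
    matches = []
    for row in dist:
        # Without replacement, already chosen controls are excluded.
        candidates = row if with_replacement else np.where(used, np.inf, row)
        best = int(np.argmin(candidates))
        used[best] = True
        matches.append(best)
    return matches


# Two treatment stores and three candidate controls, one feature each.
treat = [[1.0], [1.1]]
pool = [[1.05], [3.0], [0.5]]
with_repl = match_controls(treat, pool, with_replacement=True)
without_repl = match_controls(treat, pool, with_replacement=False)
print(with_repl, without_repl)
```

With replacement, both treatment stores are matched to the same best candidate (index 0). Without replacement, the second treatment store must fall back to its next-closest candidate (index 2).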

*MarketDial treatment selection method:* Algorithm to cluster stores into similar groups based on demographics and relevant business metrics in order to maximize the representativeness between treatment stores and the entire fleet.

*MarketDial control selection method:* Algorithm to compare control stores to treatment stores in order to select the most comparable control stores for each treatment store. Selection criteria include demographic data and sales trend data.
