Policymakers should take steps to ensure successful pilots can be scaled up effectively.
In recent years, citizens and lawmakers have become increasingly enthusiastic about the adoption of evidence-based policies and programs. Social scientists—aided in part by an increased use of field experiments—have delivered evidence of countless interventions that positively impact people’s lives. And yet these programs, when expanded, have not always delivered the dramatic societal impacts promised.
Should we assume, then, that we have been too quick to embrace evidence-based policymaking?
Not at all. Instead, we should acknowledge what is missing: a proper understanding of how, why, and when promising results can be delivered at scale.
Unsurprisingly, when an intervention moves from the research setting to population-wide implementation, the magnitude of the treatment effect—or the benefit-cost estimate—might change. The change can be negative or positive and is influenced by a host of factors. In a series of new papers, my colleagues and I refer to this phenomenon as the “scale-up effect.”
To understand what causes the scale-up effect, we approach the problem through the lens of economics. By recognizing the incentives of various actors in the scientific market of knowledge creation, we identify three key ingredients to understanding the scale-up effect:
What constitutes actionable evidence? Policymakers should carefully evaluate the results of policy experiments, and should avoid scaling programs before there is sufficient evidence of efficacy. The scale-up effect becomes evident when an intervention tested in a small-scale randomized controlled trial suffers from an inference problem, such as a false positive, and then is implemented at a larger scale.
How do the properties of the population affect scaling? Threats to scalability emerge when an experiment’s participants are not representative of the policy population. A non-representative participant pool can be caused by several factors, including researcher incentives. In a competitive scientific marketplace, researchers may be more likely to choose a subject pool with a greater chance of showing a treatment effect than would a random sample or the policy population.
For example, after research found that fortified salt was successful at reducing anemia rates, a program based on that research was rolled out to broader Indian populations and had no effect. Why? The initial studies recruited adolescent women, not a representative population. Although the positive effect remained intact for adolescent women, it was absent for other groups.
How do the properties of the situation affect scaling? Threats also emerge when the experimental situation is not representative of the policy domain. Properties of the situation include specifics of the program, correct dosage, correct delivery, and implementation costs, as well as spillover effects—unanticipated consequences for participants receiving the treatment, individuals in the control group, or even people who are not participating in the experiment at all.
For example, the Student Teacher Achievement Ratio (STAR) project, which reduced class sizes in early grades in Tennessee, delivered promising results. But when California tried to replicate that success by reducing class sizes statewide, it failed to do so. Why? Implementing the program at that scale required California to hire many additional teachers, and the new hires had far less teaching experience than the teachers in the initial STAR project: a significant difference in the situation.
Careful consideration of these three categories of major threats to scalability leads to a set of recommendations for policymakers to reduce the scale-up effect and better understand which policies will scale effectively.
PROPOSAL 1: Before advancing policies, the post-study probability should be at least 0.95—meaning that there should be at least a 95 percent likelihood that the experimental research finding is true.
In our experience, some decision-makers in government and the private sector wish to rush new insights into practice before they are fully tested. This can contribute to the loss of impact at scale. Fortunately, an initially low post-study probability from a single study can be raised substantially if the initial positive finding survives as few as two or three independent, well-powered replications, which usually produces a post-study probability of 0.95 or higher.
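To see why replication is so powerful, consider a minimal sketch of the post-study probability calculation. The Bayesian updating formula below is a standard simplification, and the 5 percent prior, 80 percent power, and 0.05 significance level are illustrative assumptions rather than figures from our papers:

```python
def post_study_probability(prior, power=0.8, alpha=0.05):
    """Probability that a statistically significant finding is true,
    computed as a Bayesian update of the prior that the hypothesis is
    true, given the study's power and significance level."""
    return (power * prior) / (power * prior + alpha * (1 - prior))

# Illustrative: a surprising hypothesis with a 5% prior of being true.
psp = post_study_probability(0.05)  # ~0.46 after the initial study
psp = post_study_probability(psp)   # ~0.93 after one replication
psp = post_study_probability(psp)   # ~0.995 after a second replication
```

Under these assumptions, a single significant result leaves the post-study probability below 0.5, while two successful replications push it past the 0.95 threshold.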
PROPOSAL 2: Reward scholars for prioritizing replication in their research.
Powerful incentives, such as tenure decisions and public grant money, should encourage scholars to attempt to replicate others’ findings, to produce initial results that replicate independently, and to report null results, especially “tight zeros,” since such results contain valuable policy information. A “tight zero” is a study that shows with great precision that the intervention does not improve outcomes.
Proposal 2 is a call to policymakers, funders, and the academic community to begin to recognize the importance of replication studies.
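As a minimal illustration of a “tight zero” (with made-up numbers), a well-powered study would report a point estimate near zero together with a confidence interval narrow enough to rule out any policy-relevant effect:

```python
from scipy import stats

# Hypothetical well-powered study: an effect estimate near zero,
# measured precisely (a "tight zero").
effect, se = 0.002, 0.010
lo, hi = stats.norm.interval(0.95, loc=effect, scale=se)
print(f"estimate: {effect}, 95% CI: [{lo:.3f}, {hi:.3f}]")
# estimate: 0.002, 95% CI: [-0.018, 0.022]
```

The narrow interval is what makes the null informative: it tells policymakers that any true effect is almost certainly too small to matter.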
PROPOSAL 3: Leverage multi-site trials to learn how program impacts vary across both populations and situations.
By using appropriate variation in individual-specific characteristics, multi-site trials can provide empirical evidence on why effects might not scale and hint at where more research is needed before scaling.
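A minimal sketch of this idea, using simulated data and hypothetical sites, estimates the treatment effect separately within each site and inspects how much it varies:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated multi-site trial: three sites, randomized treatment, and a
# true treatment effect that differs by site.
df = pd.DataFrame({
    "site": np.repeat(["A", "B", "C"], 400),
    "treated": rng.integers(0, 2, 1200),
})
true_effect = df["site"].map({"A": 0.30, "B": 0.10, "C": 0.00})
df["y"] = df["treated"] * true_effect + rng.normal(size=1200)

# Difference in means within each site; the spread across sites is a
# first look at how impacts vary across populations and situations.
site_effects = (df.groupby(["site", "treated"])["y"].mean()
                  .unstack("treated")
                  .pipe(lambda m: m[1] - m[0]))
print(site_effects)
```

Large gaps between site-level estimates are a warning sign: they suggest the program’s impact depends on who receives it or how it is delivered, and that more research is needed before scaling.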
PROPOSAL 4: Encourage the use of appropriate technology to promote standardization across experiments.
PROPOSAL 5: When scaling a program, the original scientist should serve on the implementation team to enhance fidelity, to teach policymakers why the result occurs, and to provide general consultation.
PROPOSAL 6: Policymakers must understand which components of a program are non-negotiable, meaning required for the program to work, and which are negotiable and can be altered without affecting the results. Before scaling a program, policymakers should require the scientists behind the research to specify its non-negotiable components as necessary conditions.
PROPOSAL 7: When the program is actually scaled, follow-up studies should use the correct empirical approach to measure efficacy, and continuous measurement should be a priority. The best approach to estimating the effects of the program at scale is to perform a large-scale randomized controlled trial. If this approach is untenable, then policymakers should adopt an empirical approach that allows stakeholders to measure efficacy without unrealistic assumptions.
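As a sketch of that first-best approach, with simulated outcomes standing in for real program data, the at-scale impact can be estimated with a simple difference in means and a confidence interval:

```python
import numpy as np

def ate_with_ci(y_treat, y_ctrl, z=1.96):
    """Difference-in-means estimate of the average treatment effect,
    with a normal-approximation 95% confidence interval."""
    ate = y_treat.mean() - y_ctrl.mean()
    se = np.sqrt(y_treat.var(ddof=1) / y_treat.size +
                 y_ctrl.var(ddof=1) / y_ctrl.size)
    return ate, (ate - z * se, ate + z * se)

rng = np.random.default_rng(1)
y_t = rng.normal(0.10, 1.0, 50_000)  # simulated outcomes, treated arm
y_c = rng.normal(0.00, 1.0, 50_000)  # simulated outcomes, control arm
print(ate_with_ci(y_t, y_c))
```

Rerunning this estimate as new outcome data arrive is one simple way to make continuous measurement a priority rather than a one-time audit.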
These seven proposals are primarily intended for policymakers and program implementers. But researchers also have an important role to play in addressing threats to scalability, and in some cases they can take preemptive steps to avoid these threats altogether.
Our theoretical framework begins to tackle what we consider the most important question facing evidence-based policymaking today: How can we combine economics with the experimental method to inform policy at scale? We hope this framework helps to explain the potential causes of both successes and failures in scaling.
Our goal is to empower policymakers and practitioners to make more informed decisions. We also aim to educate and encourage funders of research, both public and private, to support studies that embrace this approach to the science of scaling.
Providing insights into how results scale to the broader population is critical to ensuring a robust relationship between scientific research and policymaking and a bright future for evidence-based policy. The entire science-based community—from scholars to funders to policymakers—must join forces to tackle the weakest link in successful evidence-based policy: the scale-up effect.
This essay is part of a 13-part series, entitled Using Rigorous Policy Pilots to Improve Governance.