5 Experiments

Bit By Bit - Running experiments - Introduction

In this chapter, the focus shifts from merely observing behavior or asking questions to actively intervening in the world to gather data that can answer cause-and-effect questions.

Cause-and-Effect Questions: These are prevalent in social research. Examples include understanding the impact of teacher salaries on student learning, the effect of minimum wage on employment rates, or how a job applicant’s race influences her chances of getting hired.

Limitations of Existing Data: While one might be tempted to answer these questions by observing patterns in existing data, such an approach can be misleading. For instance, students might perform better in schools with higher teacher salaries, but this doesn’t necessarily mean higher salaries cause better performance. There could be confounders, like socio-economic status, that aren’t accounted for.

Challenges with Confounders: Confounders are unmeasured variables that can distort the perceived relationship between two other variables. Addressing all possible confounders is challenging, as the list can be virtually endless.

Experiments in the Digital Age: While experiments were logistically challenging in the analog age, the digital age offers more flexibility. Not only does it simplify traditional experiments, but it also enables new types of experiments.

Randomized Controlled Experiments: It’s essential to differentiate between general experiments and randomized controlled experiments. In the latter, researchers use randomization to decide who receives an intervention. Randomization creates a fair comparison between those who received the intervention and those who didn’t, which addresses the problem of confounders.

Chapter Overview: The chapter will delve into the logic of experimentation, differentiate between lab and field experiments, discuss the importance of validity, heterogeneity of treatment effects, and mechanisms, and finally offer design advice for digital experiments.

5.1 What Are Experiments?

Randomized controlled experiments consist of four primary components:

  1. Recruitment of participants
  2. Randomization of treatment
  3. Delivery of treatment
  4. Measurement of outcomes

The digital age simplifies the logistics of these experiments. For instance, it’s now feasible to measure the behavior of millions of individuals, something that was challenging in the past.
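The four components can be sketched as a single pipeline. The following is a minimal illustrative sketch (the participant records, `treat`, and `measure` functions are hypothetical, not code from any study discussed here):

```python
import random
import statistics

def run_experiment(participants, treat, measure, seed=0):
    """Minimal sketch of a randomized controlled experiment. The
    participants are assumed to be recruited already (component 1)."""
    rng = random.Random(seed)
    treatment, control = [], []
    for p in participants:
        # Component 2: randomization of treatment by coin flip.
        (treatment if rng.random() < 0.5 else control).append(p)
    for p in treatment:
        treat(p)  # Component 3: delivery of treatment.
    # Component 4: measurement of outcomes, then a difference-in-means estimate.
    t_out = [measure(p) for p in treatment]
    c_out = [measure(p) for p in control]
    return statistics.mean(t_out) - statistics.mean(c_out)
```

With hypothetical participant records and a treatment that adds a fixed boost to an outcome, the function recovers that boost as the estimated effect.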

An experiment by Michael Restivo and Arnout van de Rijt is highlighted. They investigated the impact of barnstars (informal peer rewards) on Wikipedia editorial contributions by randomly awarding barnstars to some deserving editors and not to others. Surprisingly, editors who received barnstars made fewer edits afterward. But because this was a randomized controlled experiment, the researchers could see that contributions decreased in both the treatment and control groups, and that the control group’s contributions decreased much faster: relative to the control group, barnstars actually increased editing. This study underscores the importance of having a control group in experiments.

The experiment by Restivo and van de Rijt also showcases that while the basic logic of experimentation remains consistent, the logistics of digital-age experiments can differ significantly. Digital platforms, like Wikipedia in this case, offer easy ways to deliver treatments and measure outcomes at minimal costs.

In conclusion, while the foundational principles of experimentation remain unchanged, the digital age offers new logistical advantages and challenges, particularly in terms of scale and ethics.

5.2 Two Dimensions of Experiments

Experiments can be categorized along two main dimensions: the lab-field dimension and the analog-digital dimension.

Lab-Field Dimension

  • Lab Experiments: These are typically conducted in controlled environments, often with undergraduate students performing specific tasks for course credit. They allow researchers to create highly controlled settings to test specific theories about social behavior.
  • Field Experiments: These experiments are conducted in more natural settings with more representative groups of participants performing common tasks. They combine the strong design of randomized control experiments with real-world applicability.
  • Both lab and field experiments have their strengths and weaknesses. For instance, lab experiments offer control, but the participants and tasks might not be representative of broader populations or real-world scenarios.

Analog-Digital Dimension

  • Analog Experiments: These do not use digital infrastructure for any of the four main components of experiments (recruitment, randomization, treatment delivery, outcome measurement).
  • Digital Experiments: These make use of digital infrastructure for all four components. They can be conducted online or use digital devices in the physical world. Digital experiments offer the possibility of larger scales, often involving millions of participants.
  • Partially Digital Experiments: These use a mix of analog and digital systems.

The digital age has revolutionized the way experiments are conducted. For instance, Amazon Mechanical Turk (MTurk) has become a popular platform for researchers to recruit participants for online experiments. MTurk matches “employers” with “workers” for short tasks, and researchers can tap into this pool for experimental participants.

Digital systems also enable field-like experiments that combine the control of lab experiments with the diversity and natural settings of field experiments. They can involve millions of participants, allowing for more comprehensive data collection and insights.

Digital Experiments and Pre-treatment Information: Digital field experiments often utilize background information about participants during both the design and analysis stages. This information, termed pre-treatment information, is typically available because digital experiments run on always-on measurement systems. For instance, a researcher at Facebook would have more pre-treatment information about participants than a university researcher conducting an analog field experiment. This pre-treatment information enables:

  • More efficient experimental designs, such as blocking and targeted participant recruitment.
  • Insightful analysis, like estimating heterogeneity of treatment effects and adjusting covariates for improved precision.
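Blocking with pre-treatment information can be sketched as follows; the `block_key` function and participant records are hypothetical:

```python
import random
from collections import defaultdict

def block_randomize(participants, block_key, seed=0):
    """Group participants into blocks defined by pre-treatment information,
    then randomize within each block. This guarantees that treatment and
    control are balanced on the blocking variable. Illustrative sketch."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for p in participants:
        blocks[block_key(p)].append(p)
    treatment, control = [], []
    for members in blocks.values():
        rng.shuffle(members)
        half = len(members) // 2
        treatment += members[:half]  # equal split within every block
        control += members[half:]
    return treatment, control
```

For example, blocking on heavy versus light pre-treatment usage ensures both groups contain (almost exactly) the same number of heavy users, improving precision relative to simple randomization.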

Longitudinal Treatment and Outcome Data: Unlike many analog experiments that measure outcomes in a short time frame, some digital field experiments span longer durations. For example, Restivo and van de Rijt’s experiment measured outcomes daily for 90 days. Another experiment tracked outcomes over three years with minimal cost. These extended durations are possible due to always-on measurement systems.

Limitations of Digital Field Experiments

  • Experiments can’t study the past.
  • They can only estimate effects of manipulable treatments.
  • While experiments are useful for policy guidance, their exact recommendations can be limited due to factors like environmental dependence, compliance issues, and equilibrium effects.
  • Digital field experiments can amplify ethical concerns, which will be discussed later.

In summary, while the foundational principles of experimentation remain consistent, the digital age offers new logistical advantages and challenges, particularly in terms of scale and the blend of lab and field methodologies.

5.2.1 Beyond Simple Experiments

Experiments often start with a basic question: Does a treatment “work”? However, this narrow focus can limit the depth and breadth of insights. To address this, researchers can move beyond simple experiments by considering three key concepts: validity, heterogeneity of treatment effects, and mechanisms.

Simple Experiments: These focus on a specific question, such as the effect of a treatment on a particular outcome. For instance, does changing a website button’s color impact the click-through rate? Such experiments answer specific questions about a treatment’s effect on a particular group at a specific time.

Beyond Simple: An experiment by P. Wesley Schultz and colleagues on the relationship between social norms and energy consumption is highlighted. They used doorhangers with different messages to encourage energy conservation in households. One group received energy-saving tips and information about their energy usage compared to their neighborhood’s average. The results showed that heavy electricity users reduced their consumption, while light users increased theirs, a phenomenon known as the boomerang effect. By adding emoticons that conveyed social approval or disapproval (injunctive norms), the researchers could mitigate this effect, demonstrating the importance of nuanced experimental design.

Experimental Designs

  • Between-subjects designs: Participants are divided into treatment and control groups. For example, one group receives a treatment, and outcomes are compared between the two groups.
  • Within-subjects designs: Participants’ behavior is compared before and after a treatment. This design offers improved statistical precision, but it can be confounded by anything else that changes between the pre-treatment and post-treatment measurements.
  • Mixed designs: Combines the precision of within-subjects designs with the protection against confounding of between-subjects designs. In digital experiments, where pre-treatment information is often available, mixed designs offer improved precision.
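The between-subjects and mixed-design estimators described above can be written out directly. This is a generic sketch, not tied to any particular experiment:

```python
import statistics

def difference_in_means(post_treat, post_control):
    """Between-subjects estimate: compare post-treatment outcomes only."""
    return statistics.mean(post_treat) - statistics.mean(post_control)

def difference_in_differences(pre_treat, post_treat, pre_control, post_control):
    """Mixed-design estimate: compare each group's change from its own
    pre-treatment baseline, which differences out stable between-person
    variation."""
    return ((statistics.mean(post_treat) - statistics.mean(pre_treat)) -
            (statistics.mean(post_control) - statistics.mean(pre_control)))
```

If the groups happen to start from different baselines, the difference-in-means estimate is contaminated by that baseline gap, while the difference-in-differences estimate removes it.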

In conclusion, while simple experiments can provide valuable insights, considering factors like validity, treatment effect variations, and underlying mechanisms can lead to richer, more informative experiments. By keeping these concepts in mind, researchers can design more comprehensive and insightful experiments.



Validity

Validity is crucial in experiments as it determines the extent to which the results support a broader conclusion. There are four main types of validity:

  1. Statistical Conclusion Validity: This pertains to whether the statistical analysis of the experiment was done correctly. For instance, were p-values computed accurately? In the digital age, while the basic statistical principles remain unchanged, new opportunities arise, such as using machine learning to estimate treatment effects.

  2. Internal Validity: This concerns whether the experimental procedures were executed correctly. Questions about internal validity might revolve around randomization, treatment delivery, and outcome measurement. The digital age can help reduce concerns about internal validity because digital systems make it easier to deliver treatments exactly as designed and to measure outcomes for all participants.

  3. Construct Validity: This focuses on the alignment between the data and theoretical constructs. Constructs are abstract concepts that social scientists reason about, but these don’t always have clear definitions and measurements. In digital experiments, especially those partnering with companies, the match between the experiment and theoretical constructs might be less precise, making construct validity a potential concern.

  4. External Validity: This pertains to the generalizability of the experiment’s results to other situations. For instance, would the results hold true in a different setting or with a different group of participants? The digital age allows researchers to empirically address concerns about external validity, especially when experiments are low-cost and outcomes are measured by always-on data systems.
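Returning to statistical conclusion validity (the first type above): one way to compute a p-value without distributional assumptions is a permutation test, which mirrors the experiment’s own randomization. A minimal, generic sketch:

```python
import random
import statistics

def permutation_p_value(y_treat, y_control, n_perm=10_000, seed=0):
    """Randomization-based two-sided p-value: how often does re-shuffling
    group labels produce a difference in means at least as extreme as the
    one observed?"""
    rng = random.Random(seed)
    observed = abs(statistics.mean(y_treat) - statistics.mean(y_control))
    pooled = list(y_treat) + list(y_control)
    n_t = len(y_treat)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-randomize the treatment labels
        diff = abs(statistics.mean(pooled[:n_t]) - statistics.mean(pooled[n_t:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm
```

Widely separated groups yield a small p-value, while identical groups yield a large one; the test inherits its justification directly from the randomization in the design.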

An example highlighted is the Home Energy Reports experiment, which aimed to reduce electricity consumption. The experiment was replicated across various locations, involving about 8.5 million households. These experiments consistently showed that Home Energy Reports reduced average electricity consumption. However, the effect size varied by location, emphasizing the importance of external validity.

In conclusion, the four types of validity provide a framework for researchers to evaluate whether the results of an experiment support a broader conclusion. The digital age offers both advantages and challenges in addressing these validity concerns, especially when partnering with companies.



Heterogeneity of Treatment Effects

Experiments typically measure the average effect of a treatment, but the effect might not be the same for everyone. This variation in effects across different groups or conditions is termed heterogeneity of treatment effects.

In the digital age, with larger sample sizes and more available data about participants, researchers can delve deeper into this heterogeneity. By doing so, they can gain insights into how a treatment works, how it can be improved, and how it can be targeted to those most likely to benefit.

Two examples from research on the Home Energy Reports illustrate this:

  1. Energy Usage by Decile: A study by Allcott (2011) utilized a large sample size (600,000 households) to estimate the effect of the Home Energy Report based on pre-treatment energy usage. The heaviest users (top decile) reduced their energy usage twice as much as someone in the middle of the heavy-user group. Moreover, there was no boomerang effect, even for the lightest users.

  2. Effectiveness Based on Political Ideology: Costa and Kahn (2013) speculated that the effectiveness of the Home Energy Report might vary based on a participant’s political ideology. By merging Opower data with third-party data (including political party registration and donations to environmental organizations), they found that the Home Energy Reports produced broadly similar effects for participants with different ideologies. There was no evidence of any group exhibiting boomerang effects.
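The decile analysis in the first example can be sketched generically; the record layout `(pre_usage, treated, outcome)` is hypothetical, not Allcott’s actual data or estimator:

```python
import statistics

def effect_by_decile(records):
    """Estimate the treatment effect separately within each decile of
    pre-treatment usage. Each record is a (pre_usage, treated, outcome)
    tuple; illustrative sketch only."""
    ranked = sorted(records, key=lambda r: r[0])  # order by pre-treatment usage
    n = len(ranked)
    effects = {}
    for d in range(10):
        chunk = ranked[d * n // 10:(d + 1) * n // 10]
        treated = [r[2] for r in chunk if r[1]]
        control = [r[2] for r in chunk if not r[1]]
        # Difference in means within this decile.
        effects[d] = statistics.mean(treated) - statistics.mean(control)
    return effects
```

On simulated data where the treatment cuts usage proportionally, the estimated effect grows (in absolute size) with pre-treatment usage, the kind of pattern the decile analysis is designed to reveal.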

These examples highlight that in the digital age, researchers can move beyond estimating average treatment effects. They can explore the heterogeneity of treatment effects, which can lead to more targeted and effective interventions and provide valuable insights for theory development.



Mechanisms

Experiments measure the outcomes of certain actions or treatments. However, understanding the mechanisms, the reasons why and how these outcomes occur, is crucial for deeper insights.

Mechanisms explain the pathways through which a treatment produces an effect. They are sometimes referred to as intervening variables or mediating variables. While experiments are adept at estimating causal effects, they aren’t always designed to unveil the underlying mechanisms.

Limes and Scurvy: A historical example is given about limes preventing scurvy. While it was known in the 18th century that limes prevented scurvy, the mechanism (vitamin C) wasn’t identified until the 20th century.

Studying Mechanisms:

  1. Process Data: By collecting data on how a treatment impacts potential mechanisms, researchers can hypothesize about the underlying causes. For instance, while Home Energy Reports were found to reduce electricity usage, a follow-up study revealed that only a small percentage of households upgraded their appliances, indicating that appliance upgrades weren’t the primary mechanism for reduced consumption.

  2. Testing Different Versions of Treatment: By tweaking the treatment components, researchers can isolate which parts are most effective. An experiment on water conservation, for example, tested various combinations of water-saving tips, moral appeals, and peer comparisons. The results showed that tips alone weren’t effective, but combining them with moral appeals and peer comparisons had a lasting impact.

Full Factorial Designs: This approach tests every possible combination of treatment components, allowing researchers to assess the effect of each component individually and in combination. Such designs were challenging in the past due to logistical constraints, but the digital age has made them more feasible.
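A full factorial design can be enumerated in a few lines; the component names below echo the water-conservation example but are otherwise illustrative:

```python
from itertools import product

def full_factorial(components):
    """Enumerate every arm of a full factorial design: each component is
    either on or off, giving 2**k treatment combinations."""
    names = sorted(components)  # sort for a deterministic arm ordering
    return [dict(zip(names, levels))
            for levels in product([False, True], repeat=len(names))]
```

Three components yield eight arms, from the pure control (all off) to the full treatment (all on); randomizing participants across all arms lets each component’s effect be estimated alone and in combination.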

In conclusion, understanding mechanisms is vital for researchers. The digital age, with its ability to collect vast amounts of data and conduct intricate experiments, offers new avenues to explore and identify these mechanisms, leading to richer insights and more effective interventions.

5.3 Making It Happen

Even if you’re not affiliated with a major tech company, it’s possible to run digital experiments. There are two primary approaches to this: doing it yourself or partnering with powerful entities.

Doing It Yourself

There are several ways to conduct experiments on your own:

  • Experiment in Existing Environments: Utilize platforms or environments that already exist.
  • Build Your Own Experiment: Create a unique environment tailored to your experiment.
  • Build Your Own Product: Design a product or platform that allows for repeated experimentation.

Partnering with the Powerful

Collaborate with entities or organizations that have the resources or platforms conducive for experimentation.

These approaches come with trade-offs across four dimensions:

  • Cost: Refers to the researcher’s expenditure in terms of time and money.
  • Control: The ability to manage aspects like participant recruitment, randomization, treatment delivery, and outcome measurement.
  • Realism: How closely the experimental environment mirrors real-life situations. High realism isn’t always essential for theory testing.
  • Ethics: The capacity for researchers to address and manage any ethical challenges that might arise during the experiment.

In essence, while the potential for digital experimentation is vast, the approach you choose should align with your resources, goals, and the specific questions you aim to answer.



Experimenting in Existing Environments

Researchers can conduct digital experiments by overlaying their experimental design on top of existing online environments. This approach doesn’t require partnerships with companies or extensive software development.


Examples:

  1. Racial Discrimination Study: Jennifer Doleac and Luke Stein utilized an online marketplace similar to Craigslist to study racial discrimination. They advertised thousands of iPods and varied the characteristics of the seller, such as the hand holding the iPod (white, black, white with tattoo), the asking price, and the quality of the ad text. The study found that white sellers received more offers and had higher final sale prices than black sellers. The results also showed variations based on other factors, such as the quality of the ad and the location of the advertisement.

  2. Keys to Success: Arnout van de Rijt and colleagues explored the concept of cumulative advantage—where small initial successes can amplify over time. They intervened in four different online systems: Kickstarter (pledging money to random projects), Epinions (positively rating random reviews), Wikipedia (giving awards to random contributors), and change.org (signing random petitions). In all cases, those who received initial success had more subsequent success than their peers.


Advantages:

  • These experiments can be conducted at a relatively low cost.
  • They offer a high degree of realism.


Disadvantages:

  • Limited control over participants, treatments, and outcomes.
  • Effects could be influenced by system-specific dynamics.
  • Ethical concerns when intervening in working systems.

In summary, existing online systems offer a valuable platform for researchers to conduct digital field experiments without the need for extensive resources or partnerships.



Building Your Own Experiment

Building your own experiment can be more costly than using existing platforms, but it offers the advantage of creating a tailored environment for your research.


Advantages:

  • Control: By constructing the experiment, researchers can create the environment and treatments they desire. This allows for opportunities to test theories that might be challenging to assess in naturally occurring settings.
  • Ethical Considerations: Crafting your own experiment can reduce ethical concerns associated with intervening in existing systems.


Challenges:

  • Recruitment: Researchers need a strategy to attract participants to their experiment. While existing systems bring experiments to participants, in custom-built experiments, participants must be brought to the experiment.
  • Realism Concerns: The bespoke environment might lack the realism of a naturally occurring system.
  • Cost and Time: Building a custom experiment can be resource-intensive.


Examples:

  • Voter Decision Making: An experiment by Gregory Huber, Seth Hill, and Gabriel Lenz explored potential biases in voter decision-making. They created a simplified voting environment to study biases like a focus on recent performance, susceptibility to rhetoric, and the influence of unrelated events. The study found that even in this controlled setting, voters struggled to make informed decisions.
  • Networks and Behavior Spread: Damon Centola built a digital field experiment to study the impact of social network structures on behavior spread. He constructed an online health community and introduced a new behavior. The study found that behaviors spread faster in clustered networks than in random ones.

In summary, while building your own experiment can be demanding in terms of resources, it offers unparalleled control, allowing researchers to isolate and study specific processes in a way that might not be possible in pre-existing environments.



Partnering with the Powerful

Partnering with powerful organizations, such as companies, governments, or NGOs, can enable researchers to run experiments that might be challenging to conduct independently. For instance, some experiments can involve millions of participants, a scale individual researchers might find hard to achieve. However, while partnerships can amplify what researchers can do, they also come with constraints. Companies might not allow experiments that could harm their reputation or business. Additionally, when it’s time to publish, partners might exert influence on how results are presented or even attempt to block publication if it portrays them negatively.

Key Points:

  1. Balancing Interests: Successful partnerships hinge on balancing the interests of both parties. Research can be both practically motivated and seek fundamental understanding, as illustrated by Pasteur’s Quadrant. Research that advances both these goals is ideal for collaborations.

  2. Large Companies and A/B Tests: Tech companies have developed infrastructure for running complex experiments, often called A/B tests. These experiments can be used for research that advances scientific understanding. An example is a study conducted by Facebook and the University of California, San Diego, on the effects of different messages on voter turnout. The study involved 61 million Facebook users and examined the impact of social information on voting behavior.

  3. Partnerships Beyond Tech Companies: Partnerships don’t always involve tech giants. For instance, a study partnered with an environmental NGO, the League of Conservation Voters, to test strategies for promoting social mobilization using the NGO’s Twitter account.

  4. Advantages and Disadvantages: Partnering allows researchers to operate on a larger scale and can be easier than building an experiment from scratch. However, it can limit the types of participants, treatments, and outcomes that can be studied. Ethical challenges can also arise.

In summary, while partnerships offer vast potential, they come with their own set of challenges. The key is to find opportunities that align with both the researcher’s interests and those of the partner.

5.4 Advice for Experiments

When conducting experiments, whether independently or in collaboration, there are several key pieces of advice to consider:

  1. Think Before Collecting Data: It’s essential to plan and think through the experiment before collecting any data. This approach contrasts with big data sources where most work is done post-data collection. One effective way to ensure thorough pre-experiment planning is to create and register a pre-analysis plan, detailing the analyses you intend to conduct.

  2. Design a Series of Experiments: No single experiment is perfect. Instead of aiming for one comprehensive experiment, consider designing a series of smaller experiments with complementary strengths. This approach, often termed the armada strategy, involves creating multiple smaller experiments rather than one large one. The affordability of some digital experiments makes this strategy more feasible.

  3. Create Zero Variable Cost Data: This advice is specific to digital age experiments. The idea is to design experiments in a way that the variable cost of collecting additional data is virtually zero.

  4. Integrate Ethics into Your Design: Given the potential for digital experiments to impact large numbers of participants, it’s crucial to ensure that ethical considerations are integrated into the experiment’s design from the outset.

In summary, successful experimentation in the digital age requires careful pre-planning, a series of complementary experiments, and a strong emphasis on ethics.



Creating Zero Variable Cost Data

Creating zero variable cost data is pivotal for running large-scale experiments. The main idea is to minimize the costs associated with adding each additional participant to the experiment. This can be achieved through automation and designing experiments that participants find enjoyable.

  1. Cost Structures in Experiments: Experiments have two primary costs:

    • Fixed costs: These remain constant regardless of the number of participants. For instance, in a lab experiment, renting space and buying furniture might be fixed costs.
    • Variable costs: These change based on the number of participants. In a lab experiment, these might include payments to staff and participants. Analog experiments typically have low fixed costs and high variable costs, while digital experiments often have high fixed costs but low variable costs. The goal is to reduce the variable costs to virtually zero.
  2. Driving Variable Costs to Zero:

    • Payments to Staff: This can be minimized by automating tasks. For instance, if an experiment can run without human intervention (e.g., while the research team is asleep), it indicates successful automation.
    • Payments to Participants: Instead of paying participants, researchers can design experiments that are inherently enjoyable or provide value, so participants willingly take part without monetary compensation.
  3. MusicLab Example: The MusicLab experiment, designed to study the unpredictability of success in cultural products, is a prime example of zero variable cost data. The experiment was fully automated, and participants were compensated with free music, driving the variable cost to zero. The experiment was designed to study how much of a song’s success is due to its quality versus luck. The results showed that luck played a significant role in a song’s success, with the same song ranking differently in parallel worlds.

  4. Benefits and Challenges: Zero variable cost data allows for large-scale experiments, but it also comes with high fixed costs. For instance, the MusicLab experiment required significant web development work, which was possible due to collaboration with a skilled web developer.

In summary, zero variable cost data can revolutionize the scale and design of experiments, allowing researchers to study collective outcomes arising from individual decisions. However, it requires careful planning and potentially high upfront costs.



Replace, Refine, and Reduce

Researchers should make their experiments more humane by incorporating three principles, known as the three R’s:

  1. Replace: Opt for less invasive methods if possible. For instance, instead of running a randomized controlled experiment, consider using a natural experiment. A natural experiment involves situations where external factors, like weather, create conditions that approximate random assignment of treatments. This approach eliminates the need for researchers to intervene directly.

  2. Refine: Make treatments as harmless as possible. For instance, instead of blocking certain content, consider boosting other content. This approach can change the content participants see without causing them to miss important information.

  3. Reduce: Minimize the number of participants in the experiment. In digital experiments, especially those with zero variable costs, there’s no cost constraint on the size of the experiment, which can lead to unnecessarily large experiments. By using more efficient statistical methods, researchers can achieve the same results with fewer participants.

Emotional Contagion Experiment: One of the most debated digital field experiments was conducted by Adam Kramer, Jamie Guillory, and Jeffrey Hancock on Facebook. The experiment aimed to understand how a person’s emotions are impacted by their friends’ emotions. They manipulated the News Feed of about 700,000 users to show more positive or negative posts. The results indicated that users’ emotions were influenced by the content they saw, but the effect size was very small. This experiment raised both scientific and ethical concerns.

The desire to reduce the size of your experiment doesn’t mean you should avoid running large, zero variable cost experiments. Instead, your experiments should be as large as necessary to achieve your scientific objective. A crucial step to ensure an experiment’s appropriate size is to conduct a power analysis. In the analog age, power analysis was typically done to ensure a study wasn’t too small (i.e., under-powered). Now, it’s essential to ensure a study isn’t too large (i.e., over-powered).
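A standard normal-approximation power analysis for a two-arm experiment might look like the following (a textbook formula, not specific to any study discussed here):

```python
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Sample size per arm needed to detect a standardized effect
    (Cohen's d) in a two-sample comparison, using the usual two-sided
    normal approximation: n = 2 * ((z_{1-alpha/2} + z_power) / d)**2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2
```

For a small effect of d = 0.2, this gives roughly 393 participants per arm; running millions of participants for such an effect would be over-powered, which is exactly the situation the reduce principle warns against.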

In conclusion, the three R’s (replace, refine, and reduce) provide principles that can help researchers integrate ethics into their experimental designs. Every suggested change, like those for the Emotional Contagion experiment, introduces trade-offs. For instance, evidence from natural experiments might not be as clear-cut as that from randomized experiments, and boosting content might be harder to implement than blocking content. The goal isn’t to second-guess other researchers’ decisions but to show how the three R’s can be applied in real situations. Trade-offs are common in research design, and in the digital age they will increasingly involve ethical considerations. By keeping the three R’s in mind, researchers can conduct experiments that are both scientifically rigorous and ethically sound. Later chapters delve into principles and ethical frameworks that help researchers understand and discuss these trade-offs.


5.5 Conclusion

The digital age has revolutionized the capacity for researchers to conduct experiments that were previously unfeasible. These experiments can now be executed on a massive scale, leveraging the unique attributes of digital platforms to enhance validity, pinpoint heterogeneity of treatment effects, and determine underlying mechanisms. Such experiments can be conducted in purely digital settings or by integrating digital tools within the physical realm.

Key takeaways from the chapter include:

  • Experiments can be orchestrated in collaboration with influential companies or can be independently managed by researchers. There’s no necessity to be affiliated with a major tech company to initiate a digital experiment.

  • When designing an experiment, it’s possible to minimize variable costs to virtually zero. This allows for large-scale experiments without escalating costs per participant.

  • The three R’s—replace, refine, and reduce—serve as guiding principles to integrate ethics into experimental design. As researchers gain the power to influence the lives of millions, there’s a heightened responsibility to ensure ethical considerations are at the forefront of experimental design.

In essence, the newfound capabilities brought about by the digital age come with increased responsibilities, emphasizing the importance of ethical considerations in research design.



Technical Appendix

  • The potential outcomes framework is a robust method to understand experiments. This framework consists of three main elements: units, treatments, and outcomes. For instance, in Restivo and van de Rijt’s experiment on Wikipedia, the units were deserving editors, treatments were “barnstar” or “no barnstar”, and outcomes were the number of edits made by an editor in either condition.

  • The potential outcomes framework helps define causal effects. For each individual, one can imagine the outcome in the treatment condition and the outcome in the control condition. The difference between these two outcomes is the causal effect for that individual.

  • The Fundamental Problem of Causal Inference arises because, in most cases, we can’t observe both potential outcomes for an individual. However, by having multiple participants, we can estimate the average treatment effect across the population.

  • Randomization ensures that the comparison between treatment and control groups is fair. This balance holds for both observed and unobserved factors, making experiments robust against unknown variables.

  • The Stable Unit Treatment Value Assumption (SUTVA) is crucial. It assumes that the outcome for an individual depends only on their treatment and not on the treatment of others. This assumption can be violated if there are spillover effects, where the treatment of one individual affects another.

  • Another aspect of SUTVA is the assumption that the only relevant treatment is the one delivered by the researcher. This means that any other treatments or factors that could affect the outcome should be considered.

  • The precision of experimental results is essential. By estimating the average treatment effect as the difference between two sample means, the variability of those estimates can be determined. The design of the experiment, including the allocation of participants to treatment and control groups, can influence the precision of the results.

  • In the main text, a difference-in-differences estimator is discussed, which is typically used in a mixed design. This estimator can lead to smaller variance than a difference-in-means estimator, which is typically used in a between-subjects design. If \(X_i\) is the value of the outcome before treatment, the quantity estimated with the difference-in-differences approach is represented by the equation: \[ ATE' = \frac{1}{N} \sum_{i=1}^N ((Y_i(1) - X_i) - (Y_i(0) - X_i)) \]

  • The standard error of that quantity is derived from the equation: \[ SE(\widehat{ATE'}) = \sqrt{\frac{1}{N-1} \left( \text{Var}(Y_i(0) - X_i) + \text{Var}(Y_i(1) - X_i) + 2\text{Cov}(Y_i(0) - X_i, Y_i(1) - X_i) \right)} \]

  • The difference-in-differences approach will have a smaller standard error when: \[ \frac{\text{Cov}(Y_i(0), X_i)}{\text{Var}(X_i)} + \frac{\text{Cov}(Y_i(1), X_i)}{\text{Var}(X_i)} > 1 \]

  • Essentially, when \(X_i\) is very predictive of \(Y_i\), more precise estimates can be obtained from a difference-in-differences approach than from a difference-in-means one. In the context of Restivo and van de Rijt’s experiment, there’s a lot of natural variation in the amount that people edit, which makes comparing the treatment and control conditions challenging. Differencing out this naturally occurring variability leaves much less residual variability, making it easier to detect a small effect.

  • For a more detailed comparison of difference-in-means, difference-in-differences, and ANCOVA-based approaches, especially when there are multiple pre-treatment and post-treatment measurements, see Frison and Pocock (1992), who strongly recommend ANCOVA (not covered in this section). Additionally, McKenzie (2012) discusses the importance of multiple post-treatment outcome measures.
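The precision claim above can be checked by simulation. The sketch below assumes Gaussian data in which the pre-treatment outcome \(X_i\) strongly predicts \(Y_i\), and compares the variability of the two estimators across repeated randomizations:

```python
import random
import statistics

def simulate(n=200, reps=500, effect=1.0, seed=0):
    """Monte Carlo check: when the pre-treatment outcome strongly predicts
    the post-treatment outcome, the difference-in-differences estimator is
    less variable than the difference-in-means estimator."""
    rng = random.Random(seed)
    dim, did = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 5) for _ in range(n)]   # pre-treatment outcomes
        y = [xi + rng.gauss(0, 1) for xi in x]    # X is highly predictive of Y
        idx = list(range(n))
        rng.shuffle(idx)                          # random assignment
        t, c = idx[:n // 2], idx[n // 2:]
        dim.append(statistics.mean(y[i] + effect for i in t) -
                   statistics.mean(y[i] for i in c))
        did.append(statistics.mean(y[i] + effect - x[i] for i in t) -
                   statistics.mean(y[i] - x[i] for i in c))
    return statistics.stdev(dim), statistics.stdev(did)
```

Under these assumed parameters the spread of the difference-in-differences estimates is several times smaller, matching the covariance condition stated above (here \(\text{Cov}(Y_i, X_i)/\text{Var}(X_i) \approx 1\) in each condition, so the sum is well above 1).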