3 Observing Behavior

In the digital age, the behaviors of billions of people are continuously recorded and stored, transforming the landscape of data collection. Here’s an overview:

Transition from Analog to Digital: Previously, in the analog age, collecting behavioral data was expensive and infrequent. Now, digital traces of our actions, such as website clicks, mobile calls, or credit card transactions, are constantly being recorded by businesses. These digital traces, combined with vast records held by governments, are commonly referred to as “big data.”

Abundance of Behavioral Data: The digital era has shifted us from a world of scarce behavioral data to one where such data is abundant.

Observational Data: Big data is a subset of observational data, which is derived from observing a social system without any intervention. Observational data encompasses not just big data but also other sources like newspaper articles and satellite images. It excludes data from surveys (where people are directly questioned) and experiments (where environments are manipulated).

3.1 Big Data

Big data is generated and collected by companies and governments for purposes other than research. This data, when repurposed, can offer valuable insights for researchers.

While many definitions of big data focus on the “3 Vs” (Volume, Variety, and Velocity), the chapter suggests that for social research, it’s more useful to consider the “5 Ws”: Who, What, Where, When, and Why. The most crucial aspect is “Why” - understanding the purpose behind the data’s creation.

Repurposing Data: In the digital age, a significant amount of data is generated by companies and governments for non-research purposes, such as service provision, profit generation, and law administration. However, researchers can repurpose this data for their studies. This repurposing is akin to artists using found objects to create art.

Comparing Data Sources: The chapter contrasts data sources like Twitter, which aims to provide a service and generate profit, with traditional public opinion surveys like the General Social Survey, which is designed for research. Each has its strengths and weaknesses, making neither inherently superior.

Beyond Online Behavior: While many associate big data with online behavior, the chapter emphasizes that big data also includes digital records from the physical world, like supermarket check-out data. Additionally, government-created data, such as tax records and school records, are vital sources of big data.

Advice on Repurposing: The chapter offers two pieces of advice for researchers repurposing data: i) Understand the people and processes that created the data, and ii) Compare the re-purposed data with an ideal dataset for the research problem to identify gaps and potential issues.

Approaches of Social Scientists vs. Data Scientists: Social scientists often focus on the problems of repurposed data, while data scientists emphasize its benefits. The book advocates for a balanced approach, recognizing both the strengths and weaknesses of big data.

Even though the desired characteristics of a data source depend on the research goal, it is helpful to crudely group ten common characteristics of big data into two broad categories:

  • Generally helpful for research: big, always-on, and nonreactive
  • Generally problematic for research: incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive



Big

Size matters, but not always. Many papers emphasize the sheer volume of data they analyzed, but volume alone is not a virtue. For instance, a study of word-use trends in the Google Books corpus emphasized the corpus’s vastness, yet the real question is whether all that data actually contributed to the research findings.


Three ways in which sheer size genuinely helps:

  1. Study of Rare Events: Large datasets can capture infrequent occurrences that smaller samples would miss.
  2. Study of Heterogeneity: Big data allows for detailed analyses across different segments. For example, Raj Chetty and colleagues used tax records from 40 million people to estimate regional differences in intergenerational mobility in the U.S.
  3. Detection of Small Differences: In industry, small differences, like a 0.1% increase in click-through rates, can translate to significant revenue. Similarly, in research, slight variations can have substantial implications when viewed in aggregate.
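As a rough illustration of why tiny differences matter at scale, here is a back-of-the-envelope calculation. All traffic and revenue figures below are invented assumptions, not numbers from the chapter:

```python
# Hypothetical numbers: how a 0.1-percentage-point lift in
# click-through rate compounds at scale.
impressions_per_day = 50_000_000   # assumed daily ad impressions
revenue_per_click = 0.10           # assumed dollars earned per click

baseline_ctr = 0.020               # 2.0% click-through rate
improved_ctr = 0.021               # 2.1% (a 0.1-point lift)

# Extra clicks per day attributable to the small improvement.
extra_clicks = impressions_per_day * (improved_ctr - baseline_ctr)

# Annualized extra revenue from those clicks.
extra_revenue_per_year = extra_clicks * revenue_per_click * 365

print(round(extra_clicks))            # 50000 extra clicks per day
print(round(extra_revenue_per_year))  # 1825000 dollars per year
```

A difference far too small to notice in a small sample becomes a seven-figure effect when aggregated, which is why detecting it requires big data in the first place.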


The biggest pitfall is ignoring how the data were generated. The sheer size of a dataset can lead researchers to overlook how it was created, which can produce systematic errors. One example is a study that used timestamped pager messages from September 11, 2001, to analyze emotional reactions. The vast amount of data seemed to show a clear pattern of increasing anger throughout the day. However, it was later discovered that a large share of the “angry” messages came from a single automated source, skewing the results.

While big datasets can be advantageous, they are not an end in themselves. They can enable specific types of research, but it’s crucial to understand the data’s origin and potential biases. Researchers should avoid being blinded by the size and instead focus on the data’s relevance and utility for their research objectives.



Always-On

Many big data systems are perpetually active, collecting data at all times. This continuous data collection provides researchers with longitudinal data, allowing them to observe patterns and changes over time.


  1. Studying Unexpected Events: The always-on characteristic lets researchers study unforeseen events. An example is the study of the Occupy Gezi protests in Turkey in 2013. Researchers used Twitter data to observe protesters’ behavior before, during, and after the event, allowing for a comprehensive analysis.
  2. Real-time Estimates: Always-on data systems can produce instantaneous estimates, useful in scenarios where real-time responses are crucial, such as natural disasters or economic fluctuations.

Examples of Studies Using Always-on Data:

  • Occupy Gezi movement in Turkey: Twitter data on participants’ behavior over time.
  • Umbrella protests in Hong Kong: Weibo data.
  • Shootings of police in New York City: stop-and-frisk reports.
  • People joining ISIS: Twitter data.
  • September 11, 2001 attacks: livejournal.com and pager messages.


One limitation: while always-on data systems are well suited to studying unexpected events and producing real-time estimates, they are less suited to tracking changes over long periods, because many big data systems are themselves constantly changing.

In summary, the always-on nature of big data systems offers unique opportunities for researchers to study unexpected events and provide real-time insights. However, it’s essential to be aware of the limitations, especially when considering long-term studies.



Nonreactive

Reactivity refers to the phenomenon where people change their behavior when they know they are being observed. For instance, individuals might act more generously in a lab setting compared to real-world scenarios due to the awareness of being watched.

One of the promising aspects of big data is that participants are often unaware that their data is being captured, or they’ve become so used to this data collection that it no longer influences their behavior. This nonreactivity allows researchers to study behaviors that might have been challenging to measure accurately in the past.

Examples of Nonreactive Research: A study by Stephens-Davidowitz used the frequency of racist terms in search engine queries to gauge racial animus across different U.S. regions. The nonreactive and vast nature of the search data enabled measurements that would be challenging using other methods, like surveys.

Limitations and Concerns

  1. Not Always Reflective of True Behavior: Even if data sources are nonreactive, they might not always accurately represent people’s behaviors or attitudes. For instance, people might not share their problems on platforms like Facebook, even if they face them.
  2. Algorithmic Confounding: The behavior captured in big data sources can sometimes be influenced by the platform’s goals, leading to potential biases in the data.
  3. Ethical Concerns: Tracking people’s behavior without their knowledge or consent can raise ethical issues, which will be discussed in detail in chapter 6.

In summary, while the nonreactive nature of big data sources offers advantages for capturing genuine behaviors, researchers should be aware of potential biases and ethical considerations.



Incomplete

Nature of Big Data: Most big data sources are incomplete, primarily because they were created for purposes other than research. This incompleteness can be more pronounced in big data compared to traditional research data.

Three Types of Incompleteness:

  1. Demographic Information: Big data often lacks detailed demographic information about participants.
  2. Behavior on Other Platforms: Big data from one platform might not capture behavior on other platforms.
  3. Data to Operationalize Theoretical Constructs: This is the most challenging type of incompleteness. Theoretical constructs are abstract ideas that researchers study. Operationalizing them means finding observable data that captures these abstract ideas. For instance, while a claim like “people who are more intelligent earn more money” sounds straightforward, defining and measuring “intelligence” is complex.

Construct Validity: This term refers to the match between theoretical constructs and data. For instance, a study claiming that people who use longer words on Twitter are more intelligent might lack construct validity. The chapter emphasizes that more data doesn’t automatically solve problems with construct validity.

Solutions to Incompleteness:

  1. Collect Required Data: Researchers can gather the missing data themselves, which will be discussed in the chapter on surveys.
  2. User-Attribute Inference/Imputation: This involves using available data to infer attributes of other people.
  3. Record Linkage: Combining multiple data sources can help fill gaps. This process can create a detailed portrait of an individual, akin to a “Book of Life.” However, such detailed data can also be a “database of ruin,” with potential unethical uses, which will be discussed in chapter 6 (Ethics).
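The record-linkage idea in point 3 can be sketched in a few lines. The records below are invented; real linkage must also handle fuzzy or missing keys and the ethical concerns noted above:

```python
# Minimal record-linkage sketch (hypothetical data): combine a
# platform's behavioral records with a demographic file via a shared
# identifier, filling in attributes the behavioral source lacks.
activity = [
    {"user_id": 1, "posts": 42},
    {"user_id": 2, "posts": 7},
    {"user_id": 3, "posts": 19},
]
demographics = {
    1: {"age": 34, "region": "NE"},
    3: {"age": 51, "region": "SW"},
}

linked = []
for row in activity:
    demo = demographics.get(row["user_id"])
    if demo is not None:               # keep only records that link
        linked.append({**row, **demo})

print(linked)
# Note that user 2 drops out: linkage itself introduces a new form of
# incompleteness when keys are missing, a bias researchers must track.
```

Even this toy example shows the trade-off: the linked records are richer, but the linked sample is no longer the full sample.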

In summary, while big data offers vast research opportunities, its inherent incompleteness poses challenges. Researchers need to be aware of these limitations and employ strategies to address them effectively.



Inaccessible

Many big data sources, while rich in information, are inaccessible to researchers for various reasons:

Utah Data Center Example: The US National Security Agency’s Utah Data Center is believed to have the capability to store and process vast amounts of communication data, from private emails to Google searches. While this data center represents a treasure trove of information, it remains inaccessible to researchers.

Reasons for Inaccessibility:

  1. Legal, Business, and Ethical Barriers: Data might be restricted due to terms-of-service agreements, potential business risks, or ethical concerns. For instance, sharing data could expose companies to lawsuits or public backlash.
  2. Past Mistakes: The release of supposedly anonymized search queries by AOL in 2006 serves as a cautionary tale. The data was not as anonymous as believed, leading to privacy breaches and significant repercussions for the company and its employees.

Gaining Access:

  1. Government Procedures: Some governments have established procedures allowing researchers to apply for data access.
  2. Partnerships with Companies: Collaborations can sometimes grant researchers access to corporate data. Successful partnerships often involve mutual interests and capabilities from both the researcher and the company.

Downsides of Access Through Partnerships:

  1. Data Sharing Limitations: Researchers might not be able to share their data with peers, limiting the verification and extension of results.
  2. Restricted Research Questions: Companies might not permit research that could portray them negatively.
  3. Perceived Conflicts of Interest: Collaborations can lead to perceptions that research results were influenced by partnerships.

In conclusion, while a lot of big data is rich in insights, it remains inaccessible to researchers due to various legal, business, and ethical barriers. Some pathways, like government procedures or corporate partnerships, can grant access, but they come with their own sets of challenges and limitations.



Nonrepresentative

Representative vs. Nonrepresentative Data: Some social scientists prize data that come from a probabilistic random sample of a well-defined population, termed “representative” data. In contrast, many big data sources are nonrepresentative, leading some to believe that little can be learned from them.

John Snow’s Cholera Study: Snow’s study of the 1853-54 cholera outbreak in London is a classic example of effective use of nonrepresentative data. He compared cholera rates between households served by two different water companies, providing evidence for his theory about cholera’s cause, even though the data wasn’t representative of all of London.

Within-Sample Comparisons vs. Out-of-Sample Generalizations: Nonrepresentative data can be effective for within-sample comparisons but may not be suitable for out-of-sample generalizations. For example, the British Doctors Study, which demonstrated that smoking causes cancer, was based on nonrepresentative data but was effective due to its within-sample comparison.

Transportability of Patterns: Questions arise about whether patterns found in one group (e.g., male British doctors) will hold in other groups. This transportability is a nonstatistical issue and depends on understanding the mechanism linking the variables.

Pitfalls of Nonrepresentative Data: A study of the 2009 German parliamentary election analyzed over 100,000 tweets and found that the proportion of tweets mentioning a political party matched the proportion of votes that party received. However, this analysis was flawed as it excluded the most mentioned party, the Pirate Party. This example highlights the risks of making out-of-sample generalizations from nonrepresentative data.

In conclusion, while many big data sources are nonrepresentative, they can still be powerful for certain research questions, especially within-sample comparisons. However, researchers must be clear about their sample’s characteristics and support claims about transportability with evidence.



Drifting

The chapter highlights the challenges posed by the drifting nature of many big data sources:

Longitudinal Data: Big data sources often collect data over time, providing valuable longitudinal data. However, to reliably measure change, the measurement system itself must remain stable.

Drift: Many big data systems, especially those designed for business, change constantly, a phenomenon termed “drift”. Drift manifests in three primary ways:

  1. Population Drift: Changes in who is using the system. For instance, during the 2012 US Presidential election, the proportion of tweets about politics written by women fluctuated daily. Over the long term, certain demographic groups might adopt or abandon platforms like Twitter.
  2. Behavioral Drift: Changes in how the system is used. An example is the 2013 Occupy Gezi protests in Turkey, where protesters altered their use of hashtags as the protest evolved. Initially, hashtags were used to draw attention to the protest, but as it became a dominant story, the usage decreased.
  3. System Drift: Changes in the system itself. For instance, Facebook has increased the character limit for status updates over time, which can influence any longitudinal study of status updates.

Implications of Drift: Any observed pattern in a big data source could be due to genuine changes in the world or could be influenced by some form of drift. For example, researchers analyzing tweets with protest-related hashtags during the Occupy Gezi protests might mistakenly believe that the discussion of the protest decreased earlier than it actually did due to behavioral drift.

In summary, while big data sources offer rich insights, they are often subject to drift due to changes in user demographics, user behavior, and the systems themselves. This drift complicates the ability of big data sources to track long-term changes accurately.



Algorithmically Confounded

Nature of Behavior in Big Data Systems: Behavior in these systems is not purely “natural”; it is shaped by the engineering goals of the platforms. Although many big data sources are nonreactive, it is a mistake to treat the behavior they record as “naturally occurring.”

Algorithmic Confounding: The way that system designers’ goals introduce patterns into data is termed “algorithmic confounding.” The phenomenon is relatively unknown to social scientists but is a significant concern among data scientists; it is largely invisible and can mislead researchers.


  • Facebook’s Friend Count: A study discovered an anomalously high number of Facebook users with approximately 20 friends. This wasn’t a natural social phenomenon but a result of Facebook encouraging users to make more friends until they reached a count of 20.
  • Transitivity in Online Social Networks: Transitivity, the idea that if you’re friends with Alice and Bob, then Alice and Bob are likely to be friends, was observed both offline and on Facebook. However, Facebook’s “People You May Know” feature, which suggests friends based on transitivity, artificially increases this pattern on the platform.
  • Performativity: This is when a theory changes the world in such a way that it aligns more with the theory. In the case of algorithmic confounding, the theory itself might be baked into how the system works, making it hard to distinguish between natural behavior and system-induced behavior.

Casino Metaphor: A more apt metaphor for big data sources is observing people in a casino. Casinos are engineered environments designed to induce certain behaviors. Ignoring this engineered aspect might lead to incorrect conclusions about human behavior.

Challenges: Algorithmic confounding is hard to address because many features of online systems are proprietary, poorly documented, and constantly changing. For instance, the inner workings of Google’s search algorithm are proprietary, making it challenging to assess certain claims.

In essence, while big data sources can provide valuable insights, the influence of algorithms and platform goals (algorithmic confounding) can introduce biases and patterns that researchers need to be aware of.



Dirty

Big data sources often contain “dirty” data, which can be misleading:

Misconception: Some researchers assume that big data, especially from online sources, is pristine because it’s automatically collected. However, those familiar with big data know that it’s often “dirty,” meaning it includes data that doesn’t reflect genuine actions of interest to researchers.

Dangers of Dirty Data: An example is the study by Back and colleagues (2010) on the emotional response to the September 11, 2001 attacks. They used timestamped pager messages to create an emotional timeline of the day. They found an increase in anger throughout the day. However, a closer look by Cynthia Pury (2011) revealed that many of these “angry” messages came from a single pager, sending automated messages with the word “CRITICAL.” Removing these messages eliminated the observed increase in anger. This example shows how simple analysis of complex data can lead to incorrect conclusions.

Intentional Spam: Some online systems attract spammers who generate fake data. For instance, political activity on Twitter sometimes includes spam, making certain political causes appear more popular than they are. Removing this intentional spam can be challenging.

Dependence on Research Question: What’s considered “dirty” can vary based on the research question. For instance, many Wikipedia edits are made by automated bots. If a researcher is studying Wikipedia’s ecology, these bot edits are relevant. But if the focus is on human contributions, bot edits should be excluded.

Cleaning Dirty Data: There’s no single method to ensure data is sufficiently clean. The best approach is to understand thoroughly how the data was created.
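The kind of cleaning that Pury’s re-analysis suggests can be sketched as a simple filter. The messages and the detection rule below are invented for illustration; real bot detection is far more involved:

```python
# Sketch (invented messages): flag sources whose output looks
# automated, e.g. one sender repeating near-identical text, before
# computing any aggregate emotion scores.
from collections import Counter

messages = [
    ("pager_001", "Reboot NT machine in cabinet 4: CRITICAL"),
    ("pager_001", "Reboot NT machine in cabinet 4: CRITICAL"),
    ("pager_001", "Reboot NT machine in cabinet 4: CRITICAL"),
    ("pager_002", "are you ok? call me"),
    ("pager_003", "meeting moved to 3pm"),
]

# Messages per sender, and distinct texts per sender.
per_sender = Counter(sender for sender, _ in messages)
unique_texts = {s: len({t for snd, t in messages if snd == s})
                for s in per_sender}

# Crude rule: a sender with many messages but only one distinct text
# is probably a machine, not a person.
suspected_bots = {s for s in per_sender
                  if per_sender[s] >= 3 and unique_texts[s] == 1}

clean = [(s, t) for s, t in messages if s not in suspected_bots]
print(suspected_bots)  # {'pager_001'}
print(len(clean))      # 2
```

The point is not this particular rule but the workflow: inspect how the records were produced, identify non-human sources, and check whether the headline finding survives their removal.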

In summary, while big data offers rich insights, it’s essential to be aware of and address the “dirty” nature of some of this data to avoid drawing incorrect conclusions.



Sensitive

Big data sources often contain sensitive information, which poses challenges and ethical concerns:

Nature of Sensitive Data: Companies and governments possess sensitive data. For instance, health insurance companies have detailed medical care records of their customers. While this data can be invaluable for health research, its public release could lead to emotional or economic harm.

Difficulty in Defining Sensitivity: Determining what constitutes sensitive information can be challenging. The Netflix Prize serves as an illustrative example. In 2006, Netflix released 100 million movie ratings from almost 500,000 members for a competition. While the data was anonymized, researchers managed to re-identify some users. One might assume movie ratings are benign, but a closeted lesbian woman joined a lawsuit against Netflix, highlighting that movie ratings can reveal personal struggles with issues like sexuality, mental illness, recovery from addictions, and experiences of abuse.

De-identification Challenges: A common defense to protect sensitive data is de-identification. However, the Netflix example shows that even de-identified data can be re-identified, revealing sensitive information.

Ethical Concerns: Collecting sensitive data without consent, even if it doesn’t cause direct harm, raises ethical issues. Analogously, observing someone without their knowledge is a violation of their privacy. The challenge lies in determining what is deemed sensitive, and collecting such data without consent can lead to potential privacy concerns.

In conclusion, while big data offers rich insights, it often contains sensitive information. Researchers must be cautious and ethical when handling such data, understanding the challenges of de-identification and the potential repercussions of revealing sensitive information.

3.2 Research Strategies


Counting Things

Counting things in research, especially in the age of big data, can be insightful when combined with a good question and appropriate data:

Counting in Social Research: Many social research methodologies involve counting specific phenomena. However, in the era of big data, the sheer volume of data can be overwhelming. The challenge is determining what is worth counting.

Motivation by Absence: Some researchers are motivated to count something simply because no one has counted it before. However, this approach doesn’t always lead to meaningful research. Instead, researchers should focus on questions that are either important or interesting.

Taxi Drivers Study: Henry Farber studied the behavior of New York City taxi drivers to test two competing theories in labor economics. Using data from every taxi trip taken by NYC cabs from 2009 to 2013, Farber found that most drivers work more on days when wages are higher, aligning with the neoclassical theory. The study also revealed that newer drivers gradually adapt to work more hours on high-wage days, and those who behave differently are more likely to quit.

Chinese Censorship Study: Gary King, Jennifer Pan, and Molly Roberts researched online censorship by the Chinese government. They collected data from over 1,000 Chinese social media websites to determine which posts were censored. The study found that posts on highly sensitive topics were censored only slightly more than other topics. The key finding was that only three types of posts were regularly censored: pornography, criticism of censors, and those with collective action potential.

Both studies above demonstrate that simple counting, when combined with a meaningful question and relevant data, can lead to significant insights. However, the data alone isn’t enough; the research question and context are crucial.



Forecasting and Nowcasting

Forecasting in the realm of big data involves predicting future trends, while nowcasting focuses on accurately measuring the current state of the world:

Forecasting vs. Nowcasting: While forecasting aims to predict the future, nowcasting, a term combining “now” and “forecasting,” seeks to “predict the present.” Nowcasting can be particularly useful for entities like governments and companies that need up-to-date and accurate world metrics.

Epidemiology and the Need for Nowcasting: Influenza, for instance, poses a global health threat annually. Governments worldwide have set up surveillance systems to monitor influenza outbreaks. The US Centers for Disease Control and Prevention (CDC) collects data from doctors nationwide. However, this data has a reporting lag, meaning it reflects the situation two weeks prior. In an epidemic, officials need real-time data.

Google’s Contribution: Google collects data that can indicate influenza prevalence, such as search queries related to flu symptoms. Combining this fast but inaccurate data with the slower, accurate CDC data can produce real-time and accurate measurements of influenza prevalence. This project was named Google Flu Trends.

Google Flu Trends’ Limitations:

  1. Performance vs. Simple Models: Google Flu Trends’ predictions weren’t significantly better than simple models that estimated flu prevalence based on the two most recent measurements.
  2. Drift and Algorithmic Confounding: Google Flu Trends faced issues like overestimating flu prevalence during the 2009 Swine Flu outbreak due to changed search behaviors. Also, its performance decayed over time, possibly due to changes in Google’s search algorithms.

Future of Nowcasting: Despite the challenges faced by Google Flu Trends, nowcasting remains promising. Combining big data with traditional research data can provide more timely and accurate measurements. Nowcasting can essentially speed up any measurement made repeatedly over time with some delay.
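The core nowcasting idea is to calibrate the fast, noisy signal against the slow, accurate one, then use the calibrated signal to estimate the not-yet-reported period. A toy least-squares sketch (all numbers are synthetic; this shows the general idea, not Google Flu Trends’ actual model):

```python
# Weeks 1-4: both series available; week 5: only the fast signal.
search_signal = [2.0, 3.0, 5.0, 4.0, 6.0]   # fast, noisy proxy
official_rate = [1.1, 1.6, 2.6, 2.1]        # accurate but lagged

# Fit official = a * signal + b by least squares on overlapping weeks.
n = len(official_rate)
xs, ys = search_signal[:n], official_rate
mx, my = sum(xs) / n, sum(ys) / n
a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
b = my - a * mx

# Apply the fit to the current week's signal to "predict the present".
nowcast = a * search_signal[-1] + b
print(round(nowcast, 2))   # 3.1
```

In practice the calibration must be refit continually, precisely because of the drift and algorithmic confounding that degraded Google Flu Trends.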

In essence, while forecasting and nowcasting offer valuable insights using big data, it’s crucial to be aware of potential pitfalls and the importance of combining different data sources for accurate results.



Approximating Experiments

This section discusses how researchers can approximate experiments they haven’t or can’t conduct. Two primary approaches benefiting from big data sources are highlighted: natural experiments and matching.

Causal Questions: Many scientific and policy questions are causal in nature. For instance, determining the effect of a job training program on wages. A challenge arises when trying to differentiate the effects of the program from pre-existing differences between participants and non-participants.

Randomized Controlled Experiments: The gold standard for estimating causal effects is randomized controlled experiments, where participants are randomly assigned to receive a treatment. However, it’s not always feasible to conduct such experiments.

Natural Experiments: One approach to making causal estimates from non-experimental data is to identify events that have randomly (or nearly randomly) assigned a treatment to some individuals and not to others. These situations are termed “natural experiments.” Some examples:

  1. Vietnam Draft: Joshua Angrist leveraged the Vietnam draft lottery, where birth dates were randomly selected to determine conscription, as a natural experiment. He combined this with earnings data from the Social Security Administration to estimate the effect of military service on earnings.

  2. Supermarket Study: Alexandre Mas and Enrico Moretti used a natural experiment involving cashiers at a supermarket. Due to scheduling processes, cashiers worked with different peers at different times. The supermarket’s digital checkout system provided data on each cashier’s productivity. By combining these, they estimated the effect of working with productive colleagues on a worker’s productivity.

Challenges with Natural Experiments: Even in clear cases like the draft lottery, complications arise. For instance, not everyone drafted served in the military, and not everyone who served was drafted. Assumptions and additional considerations are needed to derive meaningful conclusions.

Other Examples of Natural Experiments: The chapter provides a table summarizing other studies that have used natural experiments combined with big data sources. Some of these include studying peer effects on productivity, friendship formation during hurricanes using Facebook data, and the effect of stress on unborn babies during the 2006 Israel–Hezbollah war using birth records.


Matching Strategy: Another strategy for making causal estimates from non-experimental data is “matching.” In this approach, researchers look through non-experimental data to create pairs of people who are similar, except that one has received the treatment and one has not. This method also involves “pruning” or discarding cases where there are no obvious matches.

eBay Auction Study: Liran Einav and colleagues used matching to study auctions on eBay. They were interested in the effect of auction starting price on outcomes, such as the sale price. They found that the relationship between starting price and sale price is nonlinear. For starting prices between 0.05 and 0.85, the starting price has minimal impact on sale price. They also found that the effect of starting price varies across different categories of items.

Challenges with Matching: Matching can lead to biased results if not all relevant factors are considered. For instance, in the eBay study, if items were sold at different times of the year, seasonal variations could affect the results. Another challenge is that matching results only apply to the matched data and not to unmatched cases.

Benefits of Matching in Big Data: Matching in massive datasets might be superior to a few field experiments when: 1. There’s significant heterogeneity in effects. 2. Important variables needed for matching have been measured.
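The matching-and-pruning procedure can be sketched with exact matching on invented records. The covariates and outcomes below are made up, and Einav and colleagues’ actual method is more sophisticated:

```python
# Exact-matching sketch: pair each treated unit with an untreated unit
# identical on observed covariates, prune the unmatched, and compare
# outcomes within pairs.
records = [
    # (treated?, covariates (category, condition), outcome)
    (True,  ("camera", "used"), 80),
    (False, ("camera", "used"), 70),
    (True,  ("phone",  "new"),  200),
    (False, ("phone",  "new"),  210),
    (True,  ("laptop", "used"), 150),   # no untreated twin: pruned
]

# Index untreated units by their covariates.
controls = {}
for treated, cov, y in records:
    if not treated:
        controls.setdefault(cov, []).append(y)

# Match each treated unit to a control with identical covariates.
pair_diffs = []
for treated, cov, y in records:
    if treated and controls.get(cov):
        pair_diffs.append(y - controls[cov].pop())

att = sum(pair_diffs) / len(pair_diffs)  # effect among matched treated
print(pair_diffs)  # [10, -10]
print(att)         # 0.0
```

Note that the pruned laptop record contributes nothing: the estimate applies only to the matched cases, which is exactly the limitation described above.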

Other Studies Using Matching with Big Data:

  1. Effect of shootings on police violence: Legewie (2016), using stop-and-frisk records.
  2. Effect of September 11, 2001 on families and neighbors: Hersh (2013), using voting records and donation records.
  3. Social contagion: Aral, Muchnik, and Sundararajan (2009), using communication and product-adoption data.

While natural experiments and matching can be powerful strategies for making causal estimates from non-experimental data, both come with challenges and assumptions. Big data sources, however, enhance researchers’ ability to capitalize on both.

3.3 Conclusion

Big data sources are abundant, but their use in social research can be complex. Here are the key takeaways:

Data Characteristics: Big data sources typically exhibit 10 characteristics. Three are generally beneficial for research: they’re big, always-on, and nonreactive. Seven can be problematic: they’re incomplete, inaccessible, nonrepresentative, drifting, algorithmically confounded, dirty, and sensitive. These characteristics arise because big data sources weren’t designed for social research.

Value for Social Research:

Big data can:

  1. Help researchers choose between competing theoretical predictions, as in the studies of New York City taxi drivers’ behavior and censorship in China.
  2. Enhance policy measurement through nowcasting, like Google Flu Trends.
  3. Aid in making causal estimates without experiments, such as studies on peer effects on productivity or the impact of starting prices on eBay auctions.

Data and Theory Relationship: While the chapter has been theory-driven, big data also allows for empirically driven theorizing. Through accumulating empirical facts and patterns, researchers can build new theories. This approach doesn’t negate the need for theory but suggests a rebalancing between data and theory in the digital age.

Future Directions: The chapter emphasizes that while observing people can provide valuable insights, the next chapters will delve into more direct interactions with individuals, such as asking questions, running experiments, and involving them in the research process.



Mathematical Notes

This section delves into the mathematical aspects of making causal inferences from non-experimental data:

Potential Outcomes Framework: This framework is introduced to understand causal inference. It consists of three main elements:

  1. Units: In the context of the example provided, these are people eligible for the 1970 draft in the U.S.
  2. Treatments: These can be “serving in the military” or “not serving in the military.”
  3. Potential Outcomes: These are hypothetical outcomes that could have occurred under different treatment conditions. For instance, the earnings a person would have had if they served in the military versus if they didn’t.

Causal Effect: Defined as the difference between potential outcomes for a person if they received the treatment versus if they didn’t. The average treatment effect (ATE) is the average of these individual effects across the population.
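In the usual potential-outcomes notation, writing \(Y_i(1)\) for person \(i\)'s earnings with military service and \(Y_i(0)\) for their earnings without it, these definitions become:

```latex
% Individual causal effect for person i
\tau_i = Y_i(1) - Y_i(0)

% Average treatment effect across a population of N people
\text{ATE} = \frac{1}{N} \sum_{i=1}^{N} \bigl( Y_i(1) - Y_i(0) \bigr)
```

The fundamental difficulty is that only one of \(Y_i(1)\) and \(Y_i(0)\) is ever observed for any given person, which is why the ATE cannot be computed directly from data.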

Natural Experiments: These are situations where nature or some external factor effectively randomizes a treatment for researchers. An example provided is the draft lottery, which can be seen as a natural experiment determining who serves in the military.

Encouragement Design: This is a scenario where there’s a secondary treatment (like the draft) that encourages people to take a primary treatment (like serving in the military). The analysis method for this situation is called “instrumental variables.”

Complier Average Causal Effect (CACE): This is the effect of the treatment on a specific group of people, called “compliers.” Compliers are those who will take the treatment if encouraged and won’t if not encouraged.
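In the same notation, writing \(Z_i\) for the encouragement (e.g., being drafted) and \(W_i(z)\) for treatment uptake under encouragement status \(z\) (notation assumed here, following common presentations of instrumental variables), the definition can be written:

```latex
% Compliers take the treatment when encouraged and not otherwise:
% W_i(1) = 1 and W_i(0) = 0.

% CACE: the average treatment effect among compliers only
\text{CACE} = E\bigl[\, Y_i(1) - Y_i(0) \;\big|\; W_i(1) = 1,\ W_i(0) = 0 \,\bigr]
```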

Exclusion Restriction: This is a crucial assumption in the instrumental variables approach. It requires that the encouragement (e.g., draft lottery) affects the outcome (e.g., earnings) only through the treatment (e.g., military service). The exclusion restriction could be violated if, for instance, people who were drafted spent more time in school to avoid service, leading to higher earnings.

Estimating CACE: If three conditions are met—random assignment to treatment, no defiers, and the exclusion restriction—then the Complier Average Causal Effect (CACE) can be estimated from observed data. The formula for estimating CACE is: \[ \widehat{CACE} = \frac{\widehat{ITT_Y}}{\widehat{ITT_W}} \] where \(\widehat{ITT_Y}\) is the estimated intent-to-treat effect on the outcome, and \(\widehat{ITT_W}\) is the estimated intent-to-treat effect on the treatment.
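A minimal numerical sketch of this estimator on invented draft-lottery-style data (all field names and numbers are hypothetical):

```python
# Wald / instrumental-variables estimator: CACE = ITT_Y / ITT_W.
# Each row is (encouraged, treated, outcome): e.g., (drafted, served,
# later earnings). The data below are made up for illustration.

def mean(xs):
    return sum(xs) / len(xs)

def estimate_cace(rows):
    enc = [r for r in rows if r[0]]       # encouraged (e.g., drafted)
    ctl = [r for r in rows if not r[0]]   # not encouraged
    itt_y = mean([r[2] for r in enc]) - mean([r[2] for r in ctl])  # effect on outcome
    itt_w = mean([r[1] for r in enc]) - mean([r[1] for r in ctl])  # effect on uptake
    return itt_y / itt_w  # valid only under the assumptions in the text

rows = (
    [(1, 1, 28_000)] * 60 + [(1, 0, 32_000)] * 40 +  # encouraged group
    [(0, 1, 30_000)] * 10 + [(0, 0, 32_000)] * 90    # control group
)
print(estimate_cace(rows))  # → -4400.0 (ITT_Y = -2200, ITT_W = 0.5)
```

Dividing the intent-to-treat effect on the outcome by the effect on uptake attributes the whole outcome difference to the compliers, which is exactly what the exclusion restriction licenses.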


Challenges with Instrumental Variables:

  1. The exclusion restriction is a strong assumption that must be justified on a case-by-case basis. Randomizing the encouragement does not, by itself, justify it.
  2. Weak instruments, where the encouragement has little effect on treatment uptake, pose problems: the estimated CACE is sensitive to small biases in \(\widehat{ITT_Y}\), and this sensitivity grows as \(\widehat{ITT_W}\) shrinks.
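The weak-instrument problem can be seen with simple arithmetic: any bias in \(\widehat{ITT_Y}\) is amplified by a factor of \(1/\widehat{ITT_W}\). A toy illustration (all numbers invented):

```python
# Toy illustration of weak-instrument fragility: a small bias in the
# estimated ITT_Y shifts the CACE estimate by bias / ITT_W.

def cace(itt_y, itt_w):
    # Wald estimator: ratio of the two intent-to-treat effects
    return itt_y / itt_w

true_itt_y = -1000.0   # hypothetical true effect on the outcome
bias = 50.0            # small bias in the estimated ITT_Y
strong = 0.5           # encouragement moves uptake by 50 points
weak = 0.02            # encouragement moves uptake by only 2 points

# Shift in the CACE estimate caused by the same 50-unit bias:
shift_strong = cace(true_itt_y + bias, strong) - cace(true_itt_y, strong)
shift_weak = cace(true_itt_y + bias, weak) - cace(true_itt_y, weak)
print(shift_strong, shift_weak)  # roughly 100 vs. 2500
```

The identical bias distorts the weak-instrument estimate 25 times more, which is why a randomized but nearly ignored encouragement yields fragile estimates.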

Further Reading: The section references various sources for a more in-depth understanding of the discussed topics. For instance, it mentions the traditional econometric approach to instrumental variables, alternative presentations of the approach, and discussions on the exclusion restriction.

In essence, the mathematical notes provide a deeper, more technical understanding of the methods used to make causal inferences from non-experimental data, emphasizing the importance of assumptions and the challenges they present.