# Chapter 6 The Empirical Analysis

Any quantitative research project in economics centers on the analysis we perform on the data we have collected. This is the most crucial part of the paper and will determine whether our work is a success (this is, of course, linked to having a good research question and a plausible hypothesis).

In this section, I provide a set of guidelines on some of the elements to keep in mind when conducting quantitative research. This material is not exhaustive, as there are many elements to take into account, but it may give you some structure regarding the issues we need to keep in mind.

## 6.1 The Data

There are two main types of data. **Experimental** data is collected when an experiment or study is conducted to examine the effects of a given policy or intervention. One example is a study examining whether providing incentives increases vaccination rates: one group may not receive any incentive, another may receive a monetary incentive, and a third an in-kind incentive. Data is collected to ensure that all the arms of the study have a similar configuration, so that when the study is conducted, we can verify that the estimated effects come from the treatment (the incentives) and not from some other factor affecting the composition of the sample.

The most common sort of data, however, is **observational** data. This information is typically collected by administrative sources (think of the U.S. Census Bureau or the World Bank), using surveys or historical records. It is sometimes hard to use this data for econometric analysis: because there is no random assignment of a treatment, it is harder to identify the *true effect*. However, there are multiple tools we can use to deal with these issues and estimate causal effects.

### 6.1.1 Data configuration

#### 6.1.1.1 Cross-sectional Data

Cross-sectional data includes data on different subjects (individuals, households, government units, countries) for *a single time period*. This means that we have only one observation per subject (the *i*). This type of data allows us to learn about the relationships among different variables.

One example of this type of data is the survey of smallholder farmers collected in the Ivory Coast in 2015 by the World Bank, where about 2,500 smallholder farmers were asked about their farming practices, investment, and access to financial services.

#### 6.1.1.2 Time-series Data

Here, data for a single subject is collected during multiple time periods, so the main unit of analysis is time (the *t*).

The most common data used for this type of analysis is macroeconomic data (GDP, unemployment, etc.), and it is widely used for forecasting.

#### 6.1.1.3 Panel Data

Panel, or longitudinal, data includes multiple observations for each subject. Most often, data is collected for the same subject during multiple time periods, so for the same *i* we will have data for multiple *t*'s.

This type of data is widely used in econometrics. One example is the number of violent crimes per county (the *i*) for the period between 2000 and 2020 (the *t*).
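To make the *i*/*t* structure concrete, here is a toy panel built in R; the counties and crime figures are invented purely for illustration:

```r
# A toy panel: the same counties (the i) observed over several years (the t).
panel <- data.frame(
  county = rep(c("A", "B"), each = 3),
  year   = rep(2000:2002, times = 2),
  crimes = c(10, 12, 11, 30, 28, 25)
)

# Each county appears once per year: 2 subjects x 3 periods = 6 rows.
table(panel$county, panel$year)
```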

It is extremely important to understand the configuration of your data, as this will define the type of econometric analysis that you can conduct.

### 6.1.2 Describing your Variables

After we have identified the configuration of our data, we need to think more carefully about the variables we will use in our analysis. It is **crucial** that you identify their characteristics, as well as their distributions. This will help you evaluate whether you need to apply any transformation to your variables, and understand how to interpret the coefficients of your regressions. Here, I include only the most relevant aspects of these steps; you can read Nick Huntington-Klein's book for more details.

#### 6.1.2.1 Types of Variables

**Continuous variables**: In theory, these variables can take any value, but sometimes they are censored in some way (for instance, some variables cannot be negative). Income is one example.

**Count variables**: Most of the time, we treat these variables the same way we treat continuous variables, but they represent how many or how much there is of something (they count). When we plot them, it is clear that they are not continuous.

**Categorical variables**: Surveys often include questions with a pre-set number of values, or where the respondent provides an answer that can then be grouped into a category: for instance, ethnicity, religion, or age group. Many times, these variables are, or can be transformed into, binary (or indicator) variables. Sex is a clear example of a variable that is often already binary, while for a variable like religion, a new set of dichotomous variables can be created to identify whether a person identifies as Christian, Jewish, Muslim, and so forth.

**Qualitative variables**: Sometimes, responses require a more detailed explanation and therefore cannot be grouped into categories (at least not at first sight). For instance, the ACLED data, a source of conflict data, includes a variable that describes the details of a given conflict event.
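As a quick sketch of how a categorical variable can be turned into a set of indicator variables in R (the data and the `religion` variable below are hypothetical, made up for the example):

```r
# Hypothetical data: a categorical 'religion' variable.
dataset <- data.frame(
  religion = c("Christian", "Muslim", "Jewish", "Christian", "None")
)

# One binary (0/1) indicator variable per category:
dataset$christian <- as.integer(dataset$religion == "Christian")
dataset$muslim    <- as.integer(dataset$religion == "Muslim")
dataset$jewish    <- as.integer(dataset$religion == "Jewish")

head(dataset)
```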

### 6.1.3 Visualizing your Data

After you identify the type of variables you are using in your analysis, it is key that you understand their distribution. What are the different values that a variable can take? How often do these values occur?

This can be done in multiple ways. The easiest one is to generate a table for the variable. In Stata, this is done with:

`tab varX`

To tabulate a variable in R, you can use:

`table(dataset$varX)`

You can also plot your variables to obtain a clear visualization of their distribution. You can use *histograms* for non-continuous variables, and *density* plots for continuous variables.

In Stata:

```
* Histogram: you can use the option 'normal' to add a normal density curve to the graph:
histogram varX, normal
* Density plot (for continuous variables):
kdensity varX, normal
```

In R:

```
# Histogram:
varX <- dataset$varX
hist(varX)
# Density plot (for continuous variables):
densX <- density(varX, na.rm = TRUE)
plot(densX)
```

### 6.1.4 Distribution

Many times, it is important to know more about the different moments of the distribution of your variables: mean, variance (or standard deviation), skewness, and sometimes, the kurtosis.

Although a visual representation of your data is very useful in these cases, obtaining a table with this information may also be necessary, to also obtain the range of your data, as well as other important characteristics.

In Stata, you can obtain a set of descriptive statistics using:

`sum varX, d`

In R, you can get a range of descriptive statistics using:

`summary(varX)`

Or, to compute a single statistic (here, the mean) for every column of a data frame:

`sapply(dataset, mean, na.rm=TRUE)`

**Why is this important?** Remember that we are trying to draw inferences from the sample we have and apply them to the whole population we are analyzing. Many times, we have some idea of the *theoretical distribution* of the variables we are interested in. In most cases, it is plausible to assume a normal distribution (remember the **Central Limit Theorem**); this is one of the reasons we prefer larger samples to smaller ones. In some cases, we may get a distribution that is skewed to the right, with a very fat right tail, but that becomes normal once we take the natural logarithm: this is a log-normal distribution. As you proceed with the analysis and do hypothesis testing, remember that you are using a limited sample to learn about a bigger population.
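A quick simulated sketch of the log-normal idea in R (the parameters are arbitrary): a right-skewed variable whose mean sits well above its median becomes roughly symmetric once we take logs.

```r
set.seed(123)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)  # simulated right-skewed "income"
log_income <- log(income)

# Right skew pulls the mean well above the median...
mean(income) > median(income)
# ...but after the log transform, the mean and median nearly coincide.
abs(mean(log_income) - median(log_income))

# Compare the two shapes visually:
hist(income)
hist(log_income)
```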

## 6.2 Initial Description of a Relationship

Once we know how our individual variables are distributed, we may be interested in learning more about how they are linked: we want to see how our independent variable(s) relate to the dependent variable.

The most straightforward way to do this is with a scatterplot, where we plot the independent variable against the dependent variable and see how they correlate.

In Stata:

`twoway scatter varY varX`

In R:

`plot(varX, varY)`

We may also look at some conditional distributions, plotting histograms and scatterplots for a subsample of the data or separately for different groups.
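As a sketch of what this can look like in R (the data, group names, and coefficients below are invented for the example):

```r
# Simulated data with a grouping variable:
set.seed(42)
dataset <- data.frame(
  group = rep(c("treated", "control"), each = 50),
  varX  = runif(100)
)
dataset$varY <- 2 * dataset$varX + (dataset$group == "treated") + rnorm(100, sd = 0.3)

# Scatterplot for one subsample only:
treated <- subset(dataset, group == "treated")
plot(treated$varX, treated$varY)

# Histogram of Y conditional on group:
hist(dataset$varY[dataset$group == "control"])
```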

In addition, we can obtain an initial picture of the relationship between X and Y by running a simple OLS regression (with no control variables). We may even plot the fitted OLS line.

In Stata:

```
twoway scatter varY varX || lfit varY varX
* If we expect a non-linear relationship, we can do:
twoway scatter varY varX || lowess varY varX
```

In R:

```
plot(varX, varY)                        # scatterplot
abline(lm(varY ~ varX), col = "red")    # add the fitted OLS line (y ~ x)
lines(lowess(varX, varY), col = "blue") # add a lowess line (x, y)
# Note: you can also use ggplot2 to create these plots.
```

For more examples and a more detailed description, please check Nick Huntington-Klein's book.