Chapter 4 Introduction to Data Wrangling

In this chapter we present some very basic data handling and processing functions (data wrangling) that will be necessary for doing basic analyses, comparisons, and graphics. Most of the commands presented in this section stress the functions and R packages in the tidyverse - a set or family of packages that have similar syntax and behaviors.

4.1 Tidy Data

What is tidy data? Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

4.2 Subset using filter

Suppose we wish to examine a subset of data for only one country, Jon’s favorite country, Australia!! The following code starts by taking the gapminder dataset and then “pipes” it into the filtering (selecting rows) action so that only dataset rows from Australia are selected. The pipe function is %>% and is similar to a plumbing pipe that goes one direction: from left to right. After the “Australia” rows are selected, the result is “piped” into the head function for display. The head function says show the top 12 rows. When no rows are specified in the head function, the default is 6 rows. Note that the filter function resides in the dplyr package within the tidyverse family.

If the tidyverse or dplyr packages have been loaded, you don’t need to supply the dplyr:: prefix to the filter command.

#gapminder %>% filter(country=="Australia") %>% head(n=12)
gapminder %>% dplyr::filter(country=="Australia") %>% head(n=12)

## # A tibble: 12 x 6
##    country   continent  year lifeExp      pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Australia Oceania    1952    69.1  8691212    10040.
##  2 Australia Oceania    1957    70.3  9712569    10950.
##  3 Australia Oceania    1962    70.9 10794968    12217.
##  4 Australia Oceania    1967    71.1 11872264    14526.
##  5 Australia Oceania    1972    71.9 13177000    16789.
##  6 Australia Oceania    1977    73.5 14074100    18334.
##  7 Australia Oceania    1982    74.7 15184200    19477.
##  8 Australia Oceania    1987    76.3 16257249    21889.
##  9 Australia Oceania    1992    77.6 17481977    23425.
## 10 Australia Oceania    1997    78.8 18565243    26998.
## 11 Australia Oceania    2002    80.4 19546792    30688.
## 12 Australia Oceania    2007    81.2 20434176    34435.

4.3 Subset using multiple conditions

Let’s select observations by continent and year. The head function will then show some of the rows selected. Here the gapminder dataframe is piped to the filter function to select rows to be further piped to the head() function for display. The logical condtion inside filter restricts continent to “Oceania” AND (AND condition is “&”) year to be 1997. Both of these conditions must be TRUE for the row to enter the dataframe to displayed by the head() function.

gapminder %>% 
  dplyr::filter(continent=="Oceania" & year==1997) %>% 
  head()

## # A tibble: 2 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia   Oceania    1997    78.8 18565243    26998.
## 2 New Zealand Oceania    1997    77.6  3676187    21050.

Notice that two filter statements produce the same result.

gapminder %>% 
  dplyr::filter(continent=="Oceania") %>% 
  dplyr::filter(year==1997) %>%
  head()

## # A tibble: 2 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Australia   Oceania    1997    78.8 18565243    26998.
## 2 New Zealand Oceania    1997    77.6  3676187    21050.

The next example uses an “or” condition to specify the desired rows in the first filter expression - the next filter permits only observations from 1997.

gapminder %>% 
  dplyr::filter(continent=="Oceania" | continent =="Americas") %>%
  dplyr::filter(year==1997) %>% 
  head()

## # A tibble: 6 x 6
##   country   continent  year lifeExp       pop gdpPercap
##   <fct>     <fct>     <int>   <dbl>     <int>     <dbl>
## 1 Argentina Americas   1997    73.3  36203463    10967.
## 2 Australia Oceania    1997    78.8  18565243    26998.
## 3 Bolivia   Americas   1997    62.0   7693188     3326.
## 4 Brazil    Americas   1997    69.4 168546719     7958.
## 5 Canada    Americas   1997    78.6  30305843    28955.
## 6 Chile     Americas   1997    75.8  14599929    10118.

The next example selects observations/rows from a list of countries and also restricts year to 1997.

gapminder %>% 
  filter(country %in% c("Australia", "New Zealand","Argentina") & year==1997) %>% 
  head()

## # A tibble: 3 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Argentina   Americas   1997    73.3 36203463    10967.
## 2 Australia   Oceania    1997    78.8 18565243    26998.
## 3 New Zealand Oceania    1997    77.6  3676187    21050.

The next example selects observations by omitting one continent (Oceania is excluded) and then specifies a year. The code that causes “omit” is the “!=” syntax. In the code year==1997, the double equal sign == means make a logical check if year is 1997. Only rows where both aspects of the filter conditions pass through to be displayed by head. Again, the logical operator “AND” is expressed by the & expression.

gapminder %>% 
  filter(continent!="Oceania" & year==1997) %>% 
  head()

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1997    41.8 22227415      635.
## 2 Albania     Europe     1997    73.0  3428038     3193.
## 3 Algeria     Africa     1997    69.2 29072015     4797.
## 4 Angola      Africa     1997    41.0  9875024     2277.
## 5 Argentina   Americas   1997    73.3 36203463    10967.
## 6 Austria     Europe     1997    77.5  8069876    29096.

Please note that in all the above examples, the filter function accepts/rejects rows or observations in a dataframe according to the logical conditions specified inside the filter function.

4.4 Saving as a new dataframe

Here we save the the modified dataset as a new dataframe called gap97.

gap97 <- gapminder %>% 
  filter(continent!="Oceania" & year==1997) 
#
dplyr::glimpse(gap97)

## Rows: 140
## Columns: 6
## $ country   <fct> "Afghanistan", "Albania", "Algeria", "Angola", "Argentina", "Aust…
## $ continent <fct> Asia, Europe, Africa, Africa, Americas, Europe, Asia, Asia, Europ…
## $ year      <int> 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997, 1997,…
## $ lifeExp   <dbl> 41.763, 72.950, 69.152, 40.963, 73.275, 77.510, 73.925, 59.412, 7…
## $ pop       <int> 22227415, 3428038, 29072015, 9875024, 36203463, 8069876, 598561, …
## $ gdpPercap <dbl> 635.3414, 3193.0546, 4797.2951, 2277.1409, 10967.2820, 29095.9207…

4.5 Subset using top_n

Let’s make a dataset based on the ten countries in 1997 with highest gdp.

gapminder %>% filter(year==1997) %>%
  top_n(n = 10, wt = gdpPercap) %>%
  head(n=10)

## # A tibble: 10 x 6
##    country       continent  year lifeExp       pop gdpPercap
##    <fct>         <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Austria       Europe     1997    77.5   8069876    29096.
##  2 Canada        Americas   1997    78.6  30305843    28955.
##  3 Denmark       Europe     1997    76.1   5283663    29804.
##  4 Japan         Asia       1997    80.7 125956499    28817.
##  5 Kuwait        Asia       1997    76.2   1765345    40301.
##  6 Netherlands   Europe     1997    78.0  15604464    30246.
##  7 Norway        Europe     1997    78.3   4405672    41283.
##  8 Singapore     Asia       1997    77.2   3802309    33519.
##  9 Switzerland   Europe     1997    79.4   7193761    32135.
## 10 United States Americas   1997    76.8 272911760    35767.

4.6 Subset using select

The filter function controls the rows of the dataframe. Sometimes we might want to include only a few of the variables (columns) in a dataset. We frequently want to create a data subset with only a few variables when the original dataset has many variables. The select function is used to select and rename variables.

# the next command selects three variables and renames two of them:
gapminder %>% 
  dplyr::select(country, Year=year,LifeExp=lifeExp) %>% 
  head()

## # A tibble: 6 x 3
##   country      Year LifeExp
##   <fct>       <int>   <dbl>
## 1 Afghanistan  1952    28.8
## 2 Afghanistan  1957    30.3
## 3 Afghanistan  1962    32.0
## 4 Afghanistan  1967    34.0
## 5 Afghanistan  1972    36.1
## 6 Afghanistan  1977    38.4

# to change the order of display, puts year first in the list of variables
gapminder %>% 
  select(year,everything()) %>% 
  head()

## # A tibble: 6 x 6
##    year country     continent lifeExp      pop gdpPercap
##   <int> <fct>       <fct>       <dbl>    <int>     <dbl>
## 1  1952 Afghanistan Asia         28.8  8425333      779.
## 2  1957 Afghanistan Asia         30.3  9240934      821.
## 3  1962 Afghanistan Asia         32.0 10267083      853.
## 4  1967 Afghanistan Asia         34.0 11537966      836.
## 5  1972 Afghanistan Asia         36.1 13079460      740.
## 6  1977 Afghanistan Asia         38.4 14880372      786.

The profiling_num command from the funModeling package produces a lot of output, some we might not want. We will show how to modify the output of this command here. The command produces a dataframe which has many columns we might not wish to display or consider further.

We begin by removing some columns of summary statistics that we wish to ignore. Selecting a list of column names with a “minus” - sign in front of the list will remove these items from the dataframe and keep the rest in place. The command below pipes the modified dataframe to the kable command in the knitr package for a more pleasing tabular display.

# Let's observe the contents of profiling_num:
funModeling::profiling_num(gapminder) %>% 
  dplyr::glimpse()

## Rows: 4
## Columns: 16
## $ variable       <chr> "year", "lifeExp", "pop", "gdpPercap"
## $ mean           <dbl> 1.979500e+03, 5.947444e+01, 2.960121e+07, 7.215327e+03
## $ std_dev        <dbl> 1.726533e+01, 1.291711e+01, 1.061579e+08, 9.857455e+03
## $ variation_coef <dbl> 0.008722066, 0.217187544, 3.586268548, 1.366182632
## $ p_01           <dbl> 1952.0000, 33.4926, 154117.9200, 369.2201
## $ p_05           <dbl> 1952.0000, 38.4924, 475458.9000, 547.9964
## $ p_25           <dbl> 1965.750, 48.198, 2793664.000, 1202.060
## $ p_50           <dbl> 1979.5000, 60.7125, 7023595.5000, 3531.8470
## $ p_75           <dbl> 1.993250e+03, 7.084550e+01, 1.958522e+07, 9.325462e+03
## $ p_95           <dbl> 2007.000, 77.437, 89822054.500, 26608.333
## $ p_99           <dbl> 2.007000e+03, 8.023892e+01, 6.319900e+08, 3.678357e+04
## $ skewness       <dbl> 0.0000000, -0.2524798, 8.3328742, 3.8468819
## $ kurtosis       <dbl> 1.783217, 1.873099, 80.716151, 30.431702
## $ iqr            <dbl> 2.750000e+01, 2.264750e+01, 1.679156e+07, 8.123402e+03
## $ range_98       <chr> "[1952, 2007]", "[33.4926, 80.23892]", "[154117.92, 63199000…
## $ range_80       <chr> "[1957, 2002]", "[41.5108, 75.097]", "[946367.1, 54801369.5]…

# now remove unwanted columns from summary display
funModeling::profiling_num(gapminder) %>%
  select(-c("variation_coef","skewness","kurtosis","range_98","range_80","p_01","p_99")) %>%
  knitr::kable()

variable	mean	std_dev	p_05	p_25	p_50	p_75	p_95	iqr
year	1.979500e+03	1.726533e+01	1952.0000	1965.750	1979.5000	1.993250e+03	2007.000	2.750000e+01
lifeExp	5.947444e+01	1.291711e+01	38.4924	48.198	60.7125	7.084550e+01	77.437	2.264750e+01
pop	2.960121e+07	1.061579e+08	475458.9000	2793664.000	7023595.5000	1.958522e+07	89822054.500	1.679156e+07
gdpPercap	7.215327e+03	9.857455e+03	547.9964	1202.060	3531.8470	9.325462e+03	26608.333	8.123402e+03

In the next command we take a different approach - we explicitly select the statistics (columns) we want to keep and display. The most commonly used summaries are chosen.

funModeling::profiling_num(gapminder) %>%
  select(c("variable","mean","std_dev","p_25","p_50","p_75")) %>%
  knitr::kable()

variable	mean	std_dev	p_25	p_50	p_75
year	1.979500e+03	1.726533e+01	1965.750	1979.5000	1.993250e+03
lifeExp	5.947444e+01	1.291711e+01	48.198	60.7125	7.084550e+01
pop	2.960121e+07	1.061579e+08	2793664.000	7023595.5000	1.958522e+07
gdpPercap	7.215327e+03	9.857455e+03	1202.060	3531.8470	9.325462e+03

4.7 Order using arrange

Sometimes we might want to know the countries with the largest or smallest values of some variables. In the following examples we sort/order by the values of life expectancy. In the code below, when we use the command filter(year==1997), the double equal sign means make a logical check if year is 1997, and only allow dataframe rows where this is true to pass through to the next stage of the analysis pipeline. The desc function means the values will be arranged in descending values (large to small) - default is ascending (small to large).

# This command will show the countries with highest life expectancy because 
# the data are arranged in descending order of life expectancy (larger to smaller)
gapminder %>% 
  dplyr::filter(year==1997) %>%   
  dplyr::select(country, continent, lifeExp) %>% 
  dplyr::arrange(desc(lifeExp)) %>% 
  head()

## # A tibble: 6 x 3
##   country          continent lifeExp
##   <fct>            <fct>       <dbl>
## 1 Japan            Asia         80.7
## 2 Hong Kong, China Asia         80  
## 3 Sweden           Europe       79.4
## 4 Switzerland      Europe       79.4
## 5 Iceland          Europe       79.0
## 6 Australia        Oceania      78.8

# This command uses the default ascending (increasing) order with
# respect to life expectancy (order smaller to larger)
gapminder %>% 
  filter(year==1997) %>%  
  select(country, continent, lifeExp) %>% 
  arrange(lifeExp) %>% 
  head()

## # A tibble: 6 x 3
##   country      continent lifeExp
##   <fct>        <fct>       <dbl>
## 1 Rwanda       Africa       36.1
## 2 Sierra Leone Africa       39.9
## 3 Zambia       Africa       40.2
## 4 Angola       Africa       41.0
## 5 Afghanistan  Asia         41.8
## 6 Liberia      Africa       42.2

The top_n function from the dplyr package will select the n rows with the largest values of a variable. This is similar to the code above that orders the rows - then use head function to select the number of desired rows.

This first example uses the default alphabetical ordering of country name.

gapminder %>% 
  filter(year==1997) %>%  
  select(country, continent, lifeExp) %>%
  dplyr::top_n(n=6,wt=lifeExp) %>%
  knitr::kable()

country	continent	lifeExp
Australia	Oceania	78.83
Hong Kong, China	Asia	80.00
Iceland	Europe	78.95
Japan	Asia	80.69
Sweden	Europe	79.39
Switzerland	Europe	79.37

The results can then be ordered by the life expectancy:

gapminder %>% 
  filter(year==1997) %>%  
  select(country, continent, lifeExp) %>%
  dplyr::top_n(n=6,wt=lifeExp) %>%
  dplyr::arrange(desc(lifeExp)) %>%
  knitr::kable()

country	continent	lifeExp
Japan	Asia	80.69
Hong Kong, China	Asia	80.00
Sweden	Europe	79.39
Switzerland	Europe	79.37
Iceland	Europe	78.95
Australia	Oceania	78.83

The countries with the largest life expectancy can then be ordered by another variable like population. Here we find the 6 countries in 1997 with the highest life expectancy - then display them in order of population size.

gapminder %>% 
  filter(year==1997) %>%  
  select(country, continent, lifeExp, pop) %>%
  dplyr::top_n(n=6,wt=lifeExp) %>%
  dplyr::arrange(desc(pop)) %>%
  knitr::kable()

country	continent	lifeExp	pop
Japan	Asia	80.69	125956499
Australia	Oceania	78.83	18565243
Sweden	Europe	79.39	8897619
Switzerland	Europe	79.37	7193761
Hong Kong, China	Asia	80.00	6495918
Iceland	Europe	78.95	271192

4.8 Grouped Filter

Another useful verb in the tidyverse is group_by. Suppose we wanted to view the two countries with the highest life expectancy in 1997, in each continent.

gapminder %>% 
  filter(year==1997) %>%  
  select(country, continent, lifeExp, pop) %>%
  dplyr::group_by(continent) %>%
  dplyr::top_n(n=2,wt=lifeExp) %>%
  dplyr::arrange(continent) %>%
  knitr::kable()

country	continent	lifeExp	pop
Reunion	Africa	74.772	684810
Tunisia	Africa	71.973	9231669
Canada	Americas	78.610	30305843
Costa Rica	Americas	77.260	3518107
Hong Kong, China	Asia	80.000	6495918
Japan	Asia	80.690	125956499
Sweden	Europe	79.390	8897619
Switzerland	Europe	79.370	7193761
Australia	Oceania	78.830	18565243
New Zealand	Oceania	77.550	3676187

4.9 New Variables Using Mutate

In many problems we may wish to create a new variable based on an existing variable. Here we illustrate by making a new variable - the natural logarithm of population - based on the original variable pop using the mutate function.

gapminder %>%
  dplyr::mutate(logpopulation = log(pop)) %>%
  dplyr::glimpse()

## Rows: 1,704
## Columns: 7
## $ country       <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "…
## $ continent     <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, A…
## $ year          <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2…
## $ lifeExp       <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.82…
## $ pop           <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 128…
## $ gdpPercap     <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 9…
## $ logpopulation <dbl> 15.94675, 16.03915, 16.14445, 16.26115, 16.38655, 16.51555, 1…

If I want to change the name of the new variable from logpopulation to something shorter like logPop, we could re-run the mutate command, or use a rename function.

In addition we create a new version of the gapminder dataset that contains the new variable - called gapVers1. This dataframe is now available to be used in the ongoing analysis.

gapVers1 <- gapminder %>%
  dplyr::mutate(logpopulation = log(pop)) %>%
  dplyr::rename(logPop=logpopulation) 
#
  dplyr::glimpse(gapVers1)

## Rows: 1,704
## Columns: 7
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 4…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1288181…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.0…
## $ logPop    <dbl> 15.94675, 16.03915, 16.14445, 16.26115, 16.38655, 16.51555, 16.37…

The next code uses a mutate command with logical conditions to make a new, two-level categorical variable region as a character variable. Then we use mutate again to convert region (character) to a factor variable named regionf. In statistical models, factor variables are preferred, but in data handling stages, character versions are probably easier to manipulate.

The if_else function from dplyr has the form `if_else(logical condition,value if TRUE, value if FALSE).

The next example uses the “T-pipe” function %T>% to break the piping so that the result of the second mutate flows to both glimpse and to head - in this construction, it is understood the output of glimpse does not pipe to head, but rather the original data flow from the second mutate.

gapminder %>% 
  dplyr::mutate(region = if_else(country=="Oceania","Oceania","NotOceania")) %>%
  dplyr::mutate(regionf = as_factor(region)) %T>%
  dplyr::glimpse() %>%
  head()

## Rows: 1,704
## Columns: 8
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 4…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1288181…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.0…
## $ region    <chr> "NotOceania", "NotOceania", "NotOceania", "NotOceania", "NotOcean…
## $ regionf   <fct> NotOceania, NotOceania, NotOceania, NotOceania, NotOceania, NotOc…

## # A tibble: 6 x 8
##   country     continent  year lifeExp      pop gdpPercap region     regionf   
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>      <fct>     
## 1 Afghanistan Asia       1952    28.8  8425333      779. NotOceania NotOceania
## 2 Afghanistan Asia       1957    30.3  9240934      821. NotOceania NotOceania
## 3 Afghanistan Asia       1962    32.0 10267083      853. NotOceania NotOceania
## 4 Afghanistan Asia       1967    34.0 11537966      836. NotOceania NotOceania
## 5 Afghanistan Asia       1972    36.1 13079460      740. NotOceania NotOceania
## 6 Afghanistan Asia       1977    38.4 14880372      786. NotOceania NotOceania

4.10 Simple Counting Using tally() and count()

We frequently wish to know how many observations/rows satisfy a set of conditions. We will filter the observations for the given conditions, then count them using the tally() or count() functions from dplyr.

Essentially, count() is a short-hand for group_by() + tally().

For example, what if we want to know how many observations are from continent ‘Americas’ in 1997.

These examples have no grouping, no group_by is being used.

gapminder %>% dplyr::filter(year==1997) %>%
  dplyr::filter(continent=="Americas") %>%
  dplyr::tally()

## # A tibble: 1 x 1
##       n
##   <int>
## 1    25

gapminder %>% dplyr::filter(year==1997) %>%
  dplyr::filter(continent=="Americas") %>%
  dplyr::count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1    25

Now we group by continent.

gapminder %>% dplyr::filter(year==1997) %>%
  dplyr::group_by(continent) %>%
  dplyr::filter(continent=="Americas") %>%
  dplyr::tally()

## # A tibble: 1 x 2
##   continent     n
##   <fct>     <int>
## 1 Americas     25

#
gapminder %>% dplyr::filter(year==1997) %>%
  dplyr::group_by(continent) %>%
  dplyr::tally()

## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa       52
## 2 Americas     25
## 3 Asia         33
## 4 Europe       30
## 5 Oceania       2

gapminder %>% dplyr::filter(year==1997) %>%
  dplyr::group_by(continent) %>%
  dplyr::filter(continent=="Americas") %>%
  dplyr::count()

## # A tibble: 1 x 2
## # Groups:   continent [1]
##   continent     n
##   <fct>     <int>
## 1 Americas     25

#
gapminder %>% dplyr::filter(year==1997) %>%
  dplyr::count(continent)

## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa       52
## 2 Americas     25
## 3 Asia         33
## 4 Europe       30
## 5 Oceania       2

4.11 Missing Values

If a variable is not complete and contains empty places, these are denoted in R as NA. We will often wish to create a dataframe without any missing values, or discover how many rows contain variables with missing values.

First let’s create a small dataset with missing values:

x <- c(1,2,NA,4)
y <- c(11,12,13,NA)
z <- c(7,8,9,10)
tempdf <- data.frame(x,y,z)
tempdf

##    x  y  z
## 1  1 11  7
## 2  2 12  8
## 3 NA 13  9
## 4  4 NA 10

# count missing values for variable x
tempdf %>%
  dplyr::summarise(count = sum(is.na(x)))

##   count
## 1     1

# count rows with missing y
tempdf %>%
  dplyr::tally(is.na(y))

##   n
## 1 1

# subset of rows with complete data for specified columns
tempdf %>%
  dplyr::select(y,z) %>%
  tidyr::drop_na() %>%
  head()

##    y z
## 1 11 7
## 2 12 8
## 3 13 9

#   drop rows with missing values in all variables
tempdf %>%
  tidyr::drop_na() %>%
  head()

##   x  y z
## 1 1 11 7
## 2 2 12 8

Use base is.na function

tempdf %>%
  filter(!is.na(x),           # remove obs with missing x
         !is.na(y),  # remove obs with missing y
         !is.na(z))                # remove obs with missing z

##   x  y z
## 1 1 11 7
## 2 2 12 8

Some code that will execute a filter that will permit only rows with entirely complete data in x to pass through to the dataset,

tempdf %>%
  filter(x %>% is.na() %>% magrittr::not()) %>%
  head()

##   x  y  z
## 1 1 11  7
## 2 2 12  8
## 3 4 NA 10