What do we mean by this tidyverse thing?
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. [Source]
We’ll open up a new R script and work with the NC Breweries data again. Make sure that you are in the project file for this course. You should see elon-r-datacamp in the upper right project dropdown menu in RStudio.
Our goal: Get the data organized in a way that lets us do our analysis.
Open up a blank R script and bring in the
here. Then, load in the breweries data.
If you are familiar with pivot tables in Excel, a lot of what we’re about to do is the “code” for slicing and dicing your data in a similar fashion. The end goal is often to get our data set up for some type of statistical analysis, creating a plot, etc.
Let’s check and see what R thinks this brewery data is. A
class in R is bascially the type of thing something is. You can read more here at Datacamp. This gets into object oriented programming, or OOP and is beyond the scope of what we’re doing here. However, you’ll need to at least be familiar with these ideas to get the full benefit of using something like R.
class(ncbreweries) #>  "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We’ve read in the data using the
read_csv function, which comes with the
readr package and is included as part of the
tidyverse universe. R thinks that this data file is a special kind of data frame, called a
OK, but what is a data frame? And tibble sounds funny. We need to talk about data.
You might be familiar with data when in something like an Excel file. You have rows and columns. What defines nice and neat, or tidy data?
The characteristics of tidy data are:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
The tidyverse is a series of packages that work well together and will help you get your data “tidy”.
OK, but what are tibbles? Think of a data frame (or,
data.frame to R) as an Excel file with the first row as the names of the variables and then each row underneath that as an observation.
tibbles are just special data frames. They are “spreadsheets,” a place to keep our data.
vignette("tibble") in the Console to see more, or go to the Help panel and search for “tibble.”
I should note that there is a “tidyverse vs. base R” debate. Base R doesn’t include the
tidyverse and sticks to functions, methods, and classes that come with the R you install without any packages.
Many R users think that you should learn the tidyverse first. I agree. Why?
tidyversedocumentation and community support are outstanding. Folks are friendly!
- Having a consistent philosophy and syntax makes it much easier to learn.
- For data cleaning, wrangling and plotting… the tidyverse is really a no-brainer.
One point of convenience is that there is often a direct correspondence between a tidyverse command and its base R equivalent.
These invariably follow a
base::period.case rule. For example,
If you call up the above examples, you’ll see that the tidyverse alternative typically offers some enhancements or other useful options (and sometimes restrictions) over its base counterpart. If you see an underscore,
_, in a function, then you’re in the
There are always many ways to achieve a single goal in R. So, when you start searching for solutions to your coding problems, just keep that in mind.
You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.
Expressed as a set of nested functions in R pseudocode, this would look like… a mess:
park(drive(start_car(find("keys")), to = "campus"))
Writing it out using pipes give it a more natural (and easier to read) structure:
The pipe operator is denoted
%>% and is automatically loaded with the
tidyverse, as part of the
magrittr package. Get it?
Pipes dramatically improve the experience of reading and writing code. Here’s another example:
The first line reads from left to right, exactly how I thought of the operations in my head. - Take this data set (mpg), do this (filter), then do this (group by), then summarize some variable (hwy).
The second line totally inverts this logical order, as the final operation comes first! Who wants to read things inside out?
The piped version of the code is even more readable if we write it over several lines. Here it is again and, this time, I’ll run it for good measure so you can see the output:
Remember: Using vertical space costs nothing and makes for much more readable/writeable code than cramming things horizontally.
To send results to a function argument other than first one or to use the previous result for multiple arguments, use
I just ran a regression using the
starswars data that comes built-in to the
tidyverse package. Turns out taller characters weigh more. Makes sense. This code is taking the
starwars tibble, keeping just the humans, and then running a simple linear regression that uses height to explain mass.
You can see the first few rows of any data set using the
head function. There’s also a similar
head(starwars) #> # A tibble: 6 × 14 #> name height mass hair_color skin_color eye_color #> <chr> <int> <dbl> <chr> <chr> <chr> #> 1 Luke Skywal… 172 77 blond fair blue #> 2 C-3PO 167 75 <NA> gold yellow #> 3 R2-D2 96 32 <NA> white, bl… red #> 4 Darth Vader 202 136 none white yellow #> 5 Leia Organa 150 49 brown light brown #> 6 Owen Lars 178 120 brown, grey light blue #> # … with 8 more variables: birth_year <dbl>, sex <chr>, #> # gender <chr>, homeworld <chr>, species <chr>, #> # films <list>, vehicles <list>, starships <list>
By the way, there are now pipes in Base R.
OK, let’s get back to our breweries data that we should now have in our Global Environment. You can view the names of variables in a data frame/tibble via the
names function. We can also use
glimpse, which will also tell us the number of rows, columns, and variable types.
We haven’t talked about variable types yet. For our purposes, there are characters (
There are many ways to peak at our data. You can double click on the data in the Global Environment to see the whole thing like a spreadsheet. This is the same as the
names(ncbreweries) #>  "name" "city" "type" "beercount" #>  "est" "status" "url" glimpse(ncbreweries) #> Rows: 251 #> Columns: 7 #> $ name <chr> "217 Brew Works", "3rd Rock Brewing Comp… #> $ city <chr> "Wilson", "Trenton", "Cherokee", "Andrew… #> $ type <chr> "Microbrewery", "Microbrewery", "Client … #> $ beercount <dbl> 10, 12, 1, 18, 8, 78, 15, 87, 106, 59, 1… #> $ est <dbl> 2017, 2016, 2018, 2014, 2017, 2013, 2017… #> $ status <chr> "Active", "Active", "Active", "Active", … #> $ url <chr> "https://www.ratebeer.com//brewers/217-b…
We are now going to take this data set and look at some of the
filter: pick rows matching criteria
slice: pick rows using index(es)
select: pick columns by name
pull: grab a column as a vector
arrange: reorder rows
mutate: add new variables
distinct: filter for unique rows
sample_frac: randomly sample rows
summarize: reduce variables to values
- … (many more)
Because all of these functions come from the
dplyr package, then they all follow similar rules.
- First argument is always a data frame, or tibble.
- Subsequent arguments say what to do with that data frame
- Always return a data frame
- This won’t change the original data frame unless you tell it to.
%>% operator in
dplyr functions is called the pipe operator. This means you “pipe” the output of the previous line of code as the first input of the next line of code.
I am not saving any of this to memory. So, when we filter, get rid of missings, etc., we will see the output below. But, we are not changing the original data, or saving the data to a new file, unless we do an extra sep
We would use the assignment operator
<-. For example,
new_data <- our code.
You can find more examples and syntax in Chapter 5 of R for Data Science.
Let’s find breweries in Durham County. To do that, we’ll use
ncbreweries %>% dplyr::filter(city == "Durham") #> # A tibble: 9 × 7 #> name city type beercount est status url #> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> #> 1 Barrel C… Durham Micro… 29 2017 Active https://ww… #> 2 Bull Cit… Durham Brewp… 64 2011 Active https://ww… #> 3 Bull Cit… Durham Comme… 9 2014 Active https://ww… #> 4 Bull Dur… Durham Micro… 7 2015 Active https://ww… #> 5 Durty Bu… Durham Micro… 19 2016 Active https://ww… #> 6 Fullstea… Durham Micro… 131 2009 Active https://ww… #> 7 Ponysaur… Durham Micro… 26 2013 Active https://ww… #> 8 G2B Gast… Durham Brewp… 18 2015 Closed https://ww… #> 9 Triangle… Durham Micro… 21 2007 Closed https://ww…
:: tells R where to look for the
filter function. It lives in
dplyr. You don’t actually need that, though things can go wrong without it, if there are two identically named functions from different packages! Because we loaded
tidyverse and, therefore,
dplyr, last, it will mask other functions of the same name. In other words, when we type
filter, R is going to assume that we want the one from
And, because we are using a
%>%, we start with the data frame. The data frame then gets passed on through. So, you don’t need to specifiy a data frame to
filter. It knows that we want to use
Here’s dplyr::filter for many conditions. Other conditions that you could use include
!= for not equal. Note that equal is
== and not
Let’s find breweries in Durham that were founded in 2014.
Looks like there was just one.
A very common
filter() use case is identifying (or removing) observations with missing data.
is.na returns a logical vector (i.e. a bunch of TRUE and FALSE values) that will be TRUE if a value of the column referenced is missing/not available.
To remove missing observations, simply use negation:
filter(!is.na(city)). This data set doesn’t have any missing city values.
You can also use
regex (or Regular Expressions).
regex is just a series of commands to deal with strings/text/characters. Here’s one basic example.
filter is looking for a True/False from the arguments inside.
grepl returns True if the name column contains the string “Bear” anywhere in it. So, you’ll get a data frame (tibble) that only has breweries with “Bear” in their name.
You can use
dplyr::select to keep just the variables that you want.
ncbreweries %>% filter(city == "Durham", beercount > 10) %>% select(name, type) #> # A tibble: 7 × 2 #> name type #> <chr> <chr> #> 1 Barrel Culture Brewing and Blending Microbrewery #> 2 Bull City Burger and Brewery Brewpub/Brewery #> 3 Durty Bull Brewing Company Microbrewery #> 4 Fullsteam Brewery Microbrewery #> 5 Ponysaurus Brewing Co. Microbrewery #> 6 G2B Gastropub & Brewery Brewpub/Brewery #> 7 Triangle Brewing Company Microbrewery
You can also exclude variables. The
- means “select everything but.”
ncbreweries %>% select(-type) #> # A tibble: 251 × 6 #> name city beercount est status url #> <chr> <chr> <dbl> <dbl> <chr> <chr> #> 1 217 Brew … Wilson 10 2017 Active https://www.ra… #> 2 3rd Rock … Trenton 12 2016 Active https://www.ra… #> 3 7 Clans B… Cherok… 1 2018 Active https://www.ra… #> 4 Andrews B… Andrews 18 2014 Active https://www.ra… #> 5 Angry Tro… Elkin 8 2017 Active https://www.ra… #> 6 Appalachi… Boone 78 2013 Active https://www.ra… #> 7 Archetype… Ashevi… 15 2017 Active https://www.ra… #> 8 Asheville… Ashevi… 87 2003 Active https://www.ra… #> 9 Ass Clown… Cornel… 106 2011 Active https://www.ra… #> 10 Aviator B… Fuquay… 59 2008 Active https://www.ra… #> # … with 241 more rows
Sometimes you have hundreds of variables and you don’t want to type them all out. You can use “first:last” for consecutive columns, or just list out the columns separated by commas.
ncbreweries %>% select(city:beercount) #> # A tibble: 251 × 3 #> city type beercount #> <chr> <chr> <dbl> #> 1 Wilson Microbrewery 10 #> 2 Trenton Microbrewery 12 #> 3 Cherokee Client Brewer 1 #> 4 Andrews Microbrewery 18 #> 5 Elkin Microbrewery 8 #> 6 Boone Microbrewery 78 #> 7 Asheville Microbrewery 15 #> 8 Asheville Brewpub 87 #> 9 Cornelius Microbrewery 106 #> 10 Fuquay Varina Microbrewery 59 #> # … with 241 more rows
You can use
dplyr::select to rename columns, or variables. The new name goes first.
select(contains(PATTERN)) option provides a nice shortcut in relevant cases and gives us another example of using text. Only one variable contains “type” in this case.
ncbreweries %>% select(name, contains("type")) #> # A tibble: 251 × 2 #> name type #> <chr> <chr> #> 1 217 Brew Works Microbrewery #> 2 3rd Rock Brewing Company Microbrewery #> 3 7 Clans Brewing Client Brewer #> 4 Andrews Brewing Company Microbrewery #> 5 Angry Troll Brewing Microbrewery #> 6 Appalachian Mountain Brewery Microbrewery #> 7 Archetype Brewing Microbrewery #> 8 Asheville Brewing Company Brewpub #> 9 Ass Clown Brewing Company Microbrewery #> 10 Aviator Brewing Company Microbrewery #> # … with 241 more rows
select(..., everything()) option is another useful shortcut if you only want to bring some variable(s) to the “front” of a data frame.
ncbreweries %>% select(name, beercount, everything()) #> # A tibble: 251 × 7 #> name beercount city type est status url #> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> #> 1 217 Bre… 10 Wilson Micro… 2017 Active https://ww… #> 2 3rd Roc… 12 Trent… Micro… 2016 Active https://ww… #> 3 7 Clans… 1 Chero… Clien… 2018 Active https://ww… #> 4 Andrews… 18 Andre… Micro… 2014 Active https://ww… #> 5 Angry T… 8 Elkin Micro… 2017 Active https://ww… #> 6 Appalac… 78 Boone Micro… 2013 Active https://ww… #> 7 Archety… 15 Ashev… Micro… 2017 Active https://ww… #> 8 Ashevil… 87 Ashev… Brewp… 2003 Active https://ww… #> 9 Ass Clo… 106 Corne… Micro… 2011 Active https://ww… #> 10 Aviator… 59 Fuqua… Micro… 2008 Active https://ww… #> # … with 241 more rows
dplyr::slice to keep certain row numbers, or observations.
Here are the first five observations of the data frame/tibble. This is a lot like the Base R
ncbreweries %>% slice(1:5) #> # A tibble: 5 × 7 #> name city type beercount est status url #> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> #> 1 217 Bre… Wilson Micro… 10 2017 Active https://www… #> 2 3rd Roc… Trent… Micro… 12 2016 Active https://www… #> 3 7 Clans… Chero… Clien… 1 2018 Active https://www… #> 4 Andrews… Andre… Micro… 18 2014 Active https://www… #> 5 Angry T… Elkin Micro… 8 2017 Active https://www…
You can also grab the last five observations of the tibble. Notice the use of
nrow, a Base R function. I am finding the number of rows in this data frame and then saving that to the variable last_row. Check out the Global Environment.
last_row <- nrow(ncbreweries) ncbreweries %>% slice((last_row - 4):last_row) #> # A tibble: 5 × 7 #> name city type beercount est status url #> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> #> 1 Sweet Tat… Rocky… Brewp… 9 2016 Closed https://w… #> 2 Triangle … Durham Micro… 21 2007 Closed https://w… #> 3 White Rab… Angier Micro… 19 2013 Closed https://w… #> 4 Williamsv… Farmv… Brewp… 5 2000 Closed https://w… #> 5 Wolf Beer… Wilmi… Clien… 2 2009 Closed https://w…
We can also choose a random sample from our data set. Notice that I am now saving, or assigning the resulting data to a new data frame/tibble. I then use the
dim function to get the dimensions (rows and columns) of the new data.
sample_n: randomly sample 5 observations
sample_frac: randomly sample 20% of observations
Sometimes you have duplicate values that you want to get rid of. For example, suppose we just wanted to see the distinct values for city and brewery type in our data? We can use
dplyr::distinct to filter for unique rows
We can then use
arrange to sort our columns. Character columns get sorted alphabetically. We can also arrange items in descending order using
ncbreweries %>% select(city, type) %>% distinct() %>% arrange(type, city) #> # A tibble: 138 × 2 #> city type #> <chr> <chr> #> 1 Andrews Brewpub #> 2 Arden Brewpub #> 3 Asheville Brewpub #> 4 Boone Brewpub #> 5 Chapel Hill Brewpub #> 6 Charlotte Brewpub #> 7 Farmville Brewpub #> 8 Fayetteville Brewpub #> 9 Fuqauy Varina Brewpub #> 10 Greenville Brewpub #> # … with 128 more rows
There are 138 city x type pairs in our data.
dplyr::summarize let’s us calculate things like means, medians, standard deviations, etc. This will create a new data frame. Here’s an example that looks across all of the breweries and calculates the mean number of beers, the standard deviation, and the number of breweries included.
You can do calculations by group using
ddplyr::group_by. This is very common and basically like creating pivot tables in Excel. The
group_by goes first, then the
ungroup() after performing the work in the groups to prevent mistakes later on. If you don’t include it, R will still look at this data frame as grouped.
Including “na.rm = T” is usually a good idea with summarize functions. Otherwise, any missing value will propogate to the summarized value too. You’ll get just a single
NA value. Let’s go back to the
starwars for just a second, as it has some missing data.
OK, that makes more sense!
We can also simply count by groups using
You can also count pairs, or other combinations.
ncbreweries %>% count(city, type) #> # A tibble: 138 × 3 #> city type n #> <chr> <chr> <int> #> 1 Aberdeen Microbrewery 1 #> 2 Andrews Brewpub 1 #> 3 Andrews Microbrewery 2 #> 4 Angier Microbrewery 2 #> 5 Apex Brewpub/Brewery 1 #> 6 Apex Microbrewery 2 #> 7 Arden Brewpub 1 #> 8 Asheboro Microbrewery 1 #> 9 Asheville Brewpub 5 #> 10 Asheville Brewpub/Brewery 5 #> # … with 128 more rows
You will usually be creating a new variable from existing data.
mutate() is order aware, so you can chain multiple mutates in a single call. In other words, you can create a new variable and then use it to create the next variable.
case_when to create a new variable called
est_category that is set to either “old” or “new.” We’ll then create counts by that new category.
When we’re creating new variables, things can get a bit complicated. There are many ways to do the same thing sometimes. Different methods create different types of variables. Here, the logical variable
new is TRUE if the brewery was founded in 2000 or later. The character variable
est_category is either “New” or “Old.” Both tell you the same thing.
To do this, we are using two mutate functions. One for each new variable.
ncbreweries %>% mutate(new = est >= 2000) %>% mutate(est_category = ifelse(est >= 2000, "new", "old")) #> # A tibble: 251 × 9 #> name city type beercount est status url new #> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <lgl> #> 1 217 Br… Wilson Micr… 10 2017 Active https:… TRUE #> 2 3rd Ro… Trent… Micr… 12 2016 Active https:… TRUE #> 3 7 Clan… Chero… Clie… 1 2018 Active https:… TRUE #> 4 Andrew… Andre… Micr… 18 2014 Active https:… TRUE #> 5 Angry … Elkin Micr… 8 2017 Active https:… TRUE #> 6 Appala… Boone Micr… 78 2013 Active https:… TRUE #> 7 Archet… Ashev… Micr… 15 2017 Active https:… TRUE #> 8 Ashevi… Ashev… Brew… 87 2003 Active https:… TRUE #> 9 Ass Cl… Corne… Micr… 106 2011 Active https:… TRUE #> 10 Aviato… Fuqua… Micr… 59 2008 Active https:… TRUE #> # … with 241 more rows, and 1 more variable: #> # est_category <chr> # Same effect, but can choose labels
Once you’ve filtered, mutated, arranged, etc., you likely want to save the resulting data frame. You can overwrite the existing data frame in your global environment. You can also change the name and keep a new copy. We’ve seen how to do that with the
<- assignment operator.
Let’s create those two new variables and overwrite our data in our global environment.
But, this data is not saved to your computer. It is in, however, your global environment. You can save this new data frame back to your data folder as a .csv file using
I am using the here:here() function that we saw earlier.
I changed the name of my output .csv so that I don’t write over my original data. Never overwrite your original data!
dplyr tools, you’re ready to import some data and start looking at it. Of course, to really look at it, we’ll want to plot it. Chapter 7 of R for Data Science goes into the basics of EDA. We’ll plot some data in Section 3.
EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others. [Source]
Two important questions to ask of your data:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
You can answer these questions a variety of ways, some simple (e.g. correlation), while others are more complex (e.g. linear regression). In our research, we are often looking for causal interpretations (i.e. X causes Y). The econometrics resources on our intro page deal with that question.
I want to bring up one additional package that I find useful. When you first import some data, it might be messy. The janitor package can help to clean up and check our data. For example, you can get some quick counts and percent of total observations and keep the 10 cities with most breweries in our data.
library(janitor) #> #> Attaching package: 'janitor' #> The following objects are masked from 'package:stats': #> #> chisq.test, fisher.test ncbreweries %>% tabyl(city) %>% arrange(desc(percent)) %>% slice(1:10) #> city n percent #> Asheville 27 0.10756972 #> Raleigh 24 0.09561753 #> Charlotte 23 0.09163347 #> Wilmington 14 0.05577689 #> Durham 9 0.03585657 #> Greensboro 7 0.02788845 #> Winston-Salem 5 0.01992032 #> Concord 4 0.01593625 #> Cornelius 4 0.01593625 #> Fayetteville 4 0.01593625
Seems reasonable to me!
remove_empty, which gets rid of empty columns or rows. Of course, if you have missing data, you want to know why. Nothing happens to our data, as there are no missing observations.
ncbreweries %>% remove_empty() #> value for "which" not specified, defaulting to c("rows", "cols") #> # A tibble: 251 × 9 #> name city type beercount est status url new #> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <lgl> #> 1 217 Br… Wilson Micr… 10 2017 Active https:… TRUE #> 2 3rd Ro… Trent… Micr… 12 2016 Active https:… TRUE #> 3 7 Clan… Chero… Clie… 1 2018 Active https:… TRUE #> 4 Andrew… Andre… Micr… 18 2014 Active https:… TRUE #> 5 Angry … Elkin Micr… 8 2017 Active https:… TRUE #> 6 Appala… Boone Micr… 78 2013 Active https:… TRUE #> 7 Archet… Ashev… Micr… 15 2017 Active https:… TRUE #> 8 Ashevi… Ashev… Brew… 87 2003 Active https:… TRUE #> 9 Ass Cl… Corne… Micr… 106 2011 Active https:… TRUE #> 10 Aviato… Fuqua… Micr… 59 2008 Active https:… TRUE #> # … with 241 more rows, and 1 more variable: #> # est_category <chr>
We can check that with one of my favorites,
get_dupes(). We get an empty tibble back, since there are no duplicate observations. I always check my data for weird duplicates.
ncbreweries %>% get_dupes() #> No variable names specified - using all columns. #> No duplicate combinations found of: name, city, type, beercount, est, status, url, new, est_category #> # A tibble: 0 × 10 #> # … with 10 variables: name <chr>, city <chr>, type <chr>, #> # beercount <dbl>, est <dbl>, status <chr>, url <chr>, #> # new <lgl>, est_category <chr>, dupe_count <int>
Check out janitor for many more helpful functions!
There is another popular way to handle data in R, that is outside the
tidyverse, that uses the
data.table package. We don’t have time to go over it, but it is a fast, very fast implementation of the Base R
data.frame. There’s even a package called
dtplyr that combines
If you’re handling large data sets, you’ll want to explore all of the choices that R has available for your. Data frames, data tables, and tibbles! All ways for R to store your data.