2 R Refresher

This lesson is a brief refresher on R. You all have had at least one R class and thus most of this should be familiar.

One thing worth noting is that we’ll be using a mix of base R and the tidyverse set of packages. For those of you who haven’t used tidyverse, this set of packages allows for a more natural and easy-to-follow R syntax. One goal of this lesson is to introduce tidyverse to those who haven’t used it, and to introduce more base R to those who learned the tidyverse-centric approach. You can learn more about tidyverse through its wonderful help files.

2.1 Installing and loading packages

To install tidyverse, you can run the following code.

install.packages("tidyverse")

While you only have to install packages once, you need to load them into your environment each time. Thus, the very first lines of code should be loading the packages that you need. You load each package using the library() function.

library(tidyverse)

When you run this, R tells you which packages were loaded, as tidyverse is actually a family of related packages. It also tells you if there are any conflicts. In this case a couple of tidyverse function names overlap with function names in other packages, and the message is just telling you how R is dealing with this.
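If you ever need to be explicit about which version of a conflicting function you want, you can prefix the call with the package name and `::`. Here is a minimal sketch, assuming dplyr (one of the tidyverse packages) is installed. The `toy_df` data frame is made up for illustration.

```r
library(dplyr)  # once loaded, dplyr's filter() masks the filter() from stats

# Made-up data frame, only for illustration
toy_df <- data.frame(x = 1:5)

# dplyr's filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(toy_df, x > 3)
kept

# stats' filter() is a different function entirely: it applies a linear
# filter (here a two-point moving average) to a numeric series
smoothed <- stats::filter(1:5, rep(1 / 2, 2))
smoothed
```

Most of the time you won't need the prefix, but it is handy when a function suddenly behaves in a way you didn't expect.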

2.2 Loading data into your environment

The next step after getting your packages loaded is bringing your data into your local environment. There are several ways to do this. Often, I’ll be providing a URL for a Google spreadsheet that will just load directly. For example:

beer <- read_csv("https://docs.google.com/spreadsheets/d/18Iux-10Ggj2qLNEgH5WJGGUNTKET9Tpy3HHl1gc6L9Y/gviz/tq?tqx=out:csv")
## Parsed with column specification:
## cols(
##   number_of_reviews = col_double(),
##   brewery_name = col_character(),
##   brewery_state = col_character(),
##   beer_name = col_character(),
##   rating = col_double(),
##   style = col_character(),
##   abv = col_double(),
##   year = col_double()
## )

This brings in data that I scraped from the website www.beeradvocate.com. Note that we use the function read_csv(), which is the tidyverse version of base R’s read.csv(). I like the tidyverse version as it is more explicit in telling you what datatype it imported each column as (in this case doubles, i.e. numeric values, and character strings). Also note that we assigned the data to the object beer.

There are other ways to bring data into your environment. One way is to use the file.choose() function inside read_csv(). This will allow a graphical user interface (GUI) to pop up so you can go and manually select your data file wherever you saved it on your computer. For example:

my_df <- read_csv(file.choose())

Alternatively, you can use a filepath specific to your local machine:

my_df <- read_csv('/Users/nick_dirienzo/Documents/R/intro/data/beer.csv')

There are pros and cons to each method.

  • Using the file.choose() to get a GUI is fast, but that means you need to load and reload your data every time you open R or if you make a mistake. I tend to use that if I’m exploring a dataset for someone else quickly and won’t need to do it again.

  • Having your data on Google Sheets is great as you can access it from anywhere on any machine, but it takes a few more steps to set up and has file size limitations. Feel free to check out the tutorial I made on how to do this!

  • Using filepaths is the fastest method in terms of load time, and they also are not limited by file size (as much). The downside is that filepaths are specific to your computer, so it’s a bit more difficult if you use a few machines or want to work with other people.
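One way to soften the filepath downside is to build the path from pieces with file.path() and confirm the file is there with file.exists() before reading it. The sketch below is only an illustration: it writes a made-up file called toy_beer.csv into a temporary folder so that it runs on any machine, and it uses base R's read.csv() (read_csv() accepts the same path).

```r
# Hypothetical folder and file, created here only so the example runs anywhere
data_dir <- tempdir()
csv_path <- file.path(data_dir, "toy_beer.csv")  # joins the pieces with the right separator

# Made-up data written to disk so there is something to read back in
toy <- data.frame(beer_name = c("Alpha", "Bravo"), abv = c(5.0, 7.5))
write.csv(toy, csv_path, row.names = FALSE)

# Check the file exists before trying to load it
file.exists(csv_path)

# Read it back in and take a look
toy_in <- read.csv(csv_path)
toy_in
```

On your own machine you would swap tempdir() for the folder where your data actually live.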

2.3 Exploring our dataframe

OK! Now that we know how to bring in data, let’s take a bit to explore it. Data exploration is a critical step to understanding your data and making sure that any manipulations you’ve done worked how you expected them to. You should always be checking and rechecking your data!

To start, let’s compare two summary functions to explore our data frame: base R’s summary() and tidyverse’s glimpse().

summary(beer)
##  number_of_reviews brewery_name       brewery_state       beer_name        
##  Min.   :    1.0   Length:1350        Length:1350        Length:1350       
##  1st Qu.:    8.0   Class :character   Class :character   Class :character  
##  Median :   33.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  133.6                                                           
##  3rd Qu.:   89.0                                                           
##  Max.   :12618.0                                                           
##      rating        style                abv               year     
##  Min.   :1.82   Length:1350        Min.   :  3.900   Min.   :   0  
##  1st Qu.:3.74   Class :character   1st Qu.:  6.300   1st Qu.:2013  
##  Median :3.96   Mode  :character   Median :  7.800   Median :2014  
##  Mean   :3.90                      Mean   :  8.381   Mean   :2008  
##  3rd Qu.:4.14                      3rd Qu.:  9.500   3rd Qu.:2016  
##  Max.   :4.83                      Max.   :110.500   Max.   :2209
glimpse(beer)
## Rows: 1,350
## Columns: 8
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name      <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state     <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name         <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating            <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style             <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv               <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year              <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...

Of course, there are pros and cons to each method.

  • summary() is nice in that it gives you summary statistics (min, max, mean, etc.) of your numeric data. This is really useful as you can immediately spot some issues (e.g. some strange values in the year column). Still, it tells you very little about your character columns (only their length and class). It also doesn’t give you any info on parameters of the data frame itself (number of rows or columns).

  • glimpse() is great in that it simply shows you values from every column so you get an immediate idea of the data contained within. You also get the number of rows (observations) and columns (variables) up top, as well as the datatype of each column. The downside is that you don’t have an overall summary of what’s contained in each column.

    • Some might say that this is just a fancy version of the head() function, and in a way it is. I’d argue it’s better, though, as head() only shows you as many columns as your console is wide. Go try head(beer) to see what I mean!

Neither of these methods does a perfect job showing you what type of character values are present in the data. summary() shows you nothing, while glimpse() shows only what fits in the console window. Calling the function unique() on a specific column is a great way to see what’s present. Try it out:

unique(beer$style)
## [1] "American Adjunct Lager"           "American Double / Imperial IPA"  
## [3] "American Double / Imperial Stout" "American IPA"                    
## [5] "American Stout"

So we can see that there are five unique styles out of the 1350 observations in our data frame. Go try it on the other columns. When is this method useful? When is it not?
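As a small sketch of what unique() is doing, here it is on a made-up character vector (the values in `toy_style` are invented for illustration). Wrapping it in length() counts the distinct values, and base R's table() shows how often each one occurs.

```r
# Made-up vector standing in for a character column
toy_style <- c("American IPA", "American Stout", "American IPA",
               "American Adjunct Lager", "American IPA")

# Each distinct value once, in order of first appearance
unique(toy_style)

# How many distinct values there are
length(unique(toy_style))

# How often each value occurs
style_counts <- table(toy_style)
style_counts
```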

2.4 Exploring data via plots

Although summary() gives summary statistics of your quantitative columns, it doesn’t provide any information on the distribution of values within those columns. For that you need to actually plot out the data with a histogram.

Histograms in base R are simple. You just use the function hist() and feed it the column you want to use. You can add additional arguments for bin width, but for simple data exploration you can normally skip that.

hist(beer$rating)
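If you do want to control the bins, hist() takes a breaks argument. Here is a quick sketch on simulated values (the `toy_rating` vector is made up, not the real beer data). Note that when you give breaks a single number, R treats it as a suggestion and may pick a slightly different number of bins.

```r
# Simulated ratings between 1 and 5, only for illustration
set.seed(1)
toy_rating <- round(runif(200, min = 1, max = 5), 2)

# Default bins
hist(toy_rating)

# Ask for roughly 20 bins instead
hist(toy_rating, breaks = 20)
```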

You can also use ggplot for histograms. ggplot makes overall much more attractive figures and is easier to use for complex figures than base R. But, for really quick exploration I think ggplot can be a bit slow. Still, just to get you introduced to syntax, below is the code and associated plot.

You can see that there are three parts to any ggplot call:

  1. The ggplot() function with the data frame that you’re calling. In this case beer.

  2. Your aesthetic aes() where you tell it what columns from your data are your x and y.

    • Histograms only have an x axis so that’s all we needed to specify here.

  3. Your geom_plottype(). There are lots of different plot types (e.g. geom_point, geom_line, etc.), but in this case we just need the obvious geom_histogram.

ggplot(beer,
       aes(x = rating)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

So, pretty much the same plot as base R. ggplot gave us a message saying we should select the binwidth rather than using the default, but this is fine for now. Go ahead and make plots for the other numeric columns. What errors do you find from visual inspection?
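If you want to follow ggplot's suggestion, you can set binwidth inside geom_histogram(). This is a minimal sketch on simulated data (the `toy_df` data frame and the bin width of 0.25 are made up for illustration), assuming ggplot2 is installed.

```r
library(ggplot2)

# Simulated ratings between 1 and 5, only for illustration
set.seed(1)
toy_df <- data.frame(rating = round(runif(200, min = 1, max = 5), 2))

# Same three parts as before, now with an explicit bin width
p <- ggplot(toy_df,
            aes(x = rating)) +
  geom_histogram(binwidth = 0.25)

p  # printing the object draws the plot
```

Setting the bin width yourself also makes the message about `bins = 30` go away.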

2.5 Slicing and dicing - data frame manipulation

Being able to manipulate your data frame is a critical skill for any R workflow. Here we’ll go over how to select specific rows and columns, as well as how to make totally new columns.

Here’s a sample of rows from our data frame. You can imagine that we might want to pull out just the columns relating to rating and alcohol by volume (abv). Or we might want to make a new data frame of one specific style (say American IPA), or a certain range of alcohol values. This is something that you’ll need to do all the time. Thus, we’re going to cover how this is done using both base R and tidyverse syntax.

number_of_reviews  brewery_name                 brewery_state   beer_name                       rating  style                             abv   year
77                 Rhinelander Brewing Company  Wisconsin       Export Beer                     2.80    American Adjunct Lager            5.0   2011
54                 Two Beers Brewing Co.        Washington      Forester Double IPA             3.58    American Double / Imperial IPA    7.8   2013
2                  Rhinelander Brewing Company  Wisconsin       Braumeister                     3.05    American Adjunct Lager            4.2   2017
13                 Olde Hickory Brewery         North Carolina  Maple Syruption                 4.12    American Double / Imperial Stout  9.0   2017
4                  Dead Bird Brewing Company    Wisconsin       Moustache Wax                   2.81    American Double / Imperial Stout  10.5  2017
42                 Revolution Brewing           Illinois        Hops For Heroes: Homefront IPA  3.70    American IPA                      6.2   2015
21                 Hop Butcher For The World    Illinois        The Jewels                      4.13    American Double / Imperial IPA    7.5   2018

2.5.1 Accessing rows

All data frames in R can be accessed using what’s called square bracket notation. In such notation the two positions inside square brackets correspond to the rows and columns separated by a comma, like this: [row number, column number]. Leaving one of the two entries blank returns everything along that dimension (all rows or all columns). For example:

beer[5,]
## # A tibble: 1 x 8
##   number_of_revie~ brewery_name brewery_state beer_name rating style   abv  year
##              <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl> <dbl>
## 1              789 Miller Brew~ Wisconsin     Milwauke~   1.86 Amer~   4.8  2001

This prints off every column of the 5th row. Note that you have to put the data frame name before the square brackets so R knows what data you’re trying to access. Also, you must put the comma in there otherwise R will not know if you’re trying to access the row or column.

You can also access a range of row values by entering in a sequence of numbers. Remember you can make sequences by just typing two values separated by a colon i.e. 2:10. Let’s try it:

beer[2:10,]
## # A tibble: 9 x 8
##   number_of_revie~ brewery_name brewery_state beer_name rating style   abv  year
##              <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl> <dbl>
## 1             2521 Miller Brew~ Wisconsin     Miller G~   2.27 Amer~  4.66  2002
## 2             1267 Miller Brew~ Wisconsin     Icehouse    2.05 Amer~  5.5   2001
## 3              924 Hamm's Brew~ Wisconsin     Hamm's      2.77 Amer~  4.6   2001
## 4              789 Miller Brew~ Wisconsin     Milwauke~   1.86 Amer~  4.8   2001
## 5              770 Miller Brew~ Wisconsin     Red Dog     2.01 Amer~  5     2001
## 6              690 Miller Brew~ Wisconsin     Milwauke~   2.08 Amer~  5.9   2001
## 7              599 Jacob Leine~ Wisconsin     Leinenku~   2.94 Amer~  4.6   2002
## 8              518 JOS. Schlit~ Ariz0na       Schlitz ~   3.54 Amer~  4.7   2008
## 9              319 Stevens Poi~ Wisconsin     Point Sp~   3.08 Amer~  4.8   2002

You can also create a vector of values using the combine function c(). Entering c(5,10,20) will make a vector of the values 5, 10, and 20. Enter that into our square bracket:

beer[c(5,10,20),]
## # A tibble: 3 x 8
##   number_of_revie~ brewery_name brewery_state beer_name rating style   abv  year
##              <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl> <dbl>
## 1              789 Miller Brew~ Wisconsin     Milwauke~   1.86 Amer~   4.8  2001
## 2              319 Stevens Poi~ Wisconsin     Point Sp~   3.08 Amer~   4.8  2002
## 3               31 Minhas Craf~ Wisconsin     Boxer Ice   2.36 Amer~   5.5  2011
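To see the pieces of square bracket notation side by side, here is a small sketch on a made-up data frame (`toy_beer` and its values are invented for illustration). The last line also fills in both positions at once, picking rows and columns together.

```r
# Made-up data frame, only for illustration
toy_beer <- data.frame(
  beer_name = c("Alpha", "Bravo", "Charlie", "Delta", "Echo"),
  rating = c(4.2, 3.5, 3.9, 4.0, 4.6),
  abv = c(6.5, 8.0, 7.0, 10.5, 11.0)
)

# A single row, every column
toy_beer[2, ]

# A range of rows, every column
toy_beer[2:4, ]

# Specific rows AND specific columns
toy_beer[c(1, 5), c("beer_name", "rating")]
```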

This is great and all, but what you’ll need to do more commonly is get rows that match a specific value, or set of values. To do this we ask R to return rows where data within a column meet a specific condition. For example, what if we want only beers where the rating is 4.0 or higher? In R syntax that would be beer$rating >= 4.0. That code entered straight into the console will give you a bunch of TRUE and FALSE values, as it’s going through and checking which values in the rating column of beer are greater than or equal to 4.0. So just enter that into your square bracket notation to return just those rows where this is TRUE:

beer[beer$rating >= 4.0,]
## # A tibble: 613 x 8
##    number_of_revie~ brewery_name brewery_state beer_name rating style   abv
##               <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl>
##  1                3 Cervecería ~ Illinois      Gringo H~   4    Amer~   5  
##  2                3 Pabst Milwa~ Wisconsin     Schlitz     4.18 Amer~   4.7
##  3                1 Bare Bones ~ Wisconsin     Chiquita~   4    Amer~   5.3
##  4                1 Imperial Oa~ Illinois      Margarit~   4.49 Amer~   5.5
##  5                1 On Tour Bre~ Illinois      Cities      4.09 Amer~   5  
##  6             2534 Elysian Bre~ Washington    Space Du~   4.08 Amer~   8.2
##  7             2498 Pipeworks B~ Illinois      Ninja Vs~   4.28 Amer~   8  
##  8             1975 New Glarus ~ Wisconsin     Thumbpri~   4.33 Amer~   9  
##  9             1405 Wicked Weed~ North Caroli~ Freak Of~   4.3  Amer~   8.5
## 10             1396 Pipeworks B~ Illinois      Citra       4.4  Amer~   9.5
## # ... with 603 more rows, and 1 more variable: year <dbl>
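It can help to look at the TRUE and FALSE values on their own before using them inside the brackets. Here is a minimal sketch with a made-up vector of ratings (`toy_rating` is invented for illustration):

```r
# Made-up ratings, only for illustration
toy_rating <- c(4.2, 3.5, 3.9, 4.0, 4.6)

# The condition on its own gives one TRUE or FALSE per value
is_high <- toy_rating >= 4.0
is_high
# → TRUE FALSE FALSE TRUE TRUE

# TRUE counts as 1, so sum() tells you how many values pass
sum(is_high)
# → 3

# Using the TRUE/FALSE values inside brackets keeps only the TRUE positions
toy_rating[is_high]
# → 4.2 4.0 4.6
```

Checking sum() first is a quick way to know how many rows to expect back.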

You can do the same in tidyverse syntax using the function filter(). You can see that the syntax is a bit different. For one, you only call the data once. You then have what’s called a ‘pipe’ which is the %>%. The pipe is functionally equivalent to saying ‘then’. Finally, you have the function filter(). Thus, you’re saying ‘take the data frame beer, then filter it for all ratings greater than or equal to 4.0’. Note that you don’t have to call the data frame when asking for the column rating… tidyverse does that for you!

beer %>%
  filter(rating >= 4.0)
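To convince yourself the two approaches agree, you can run both and compare. This is a sketch on a made-up data frame (`toy_beer` is invented for illustration), assuming dplyr is installed.

```r
library(dplyr)

# Made-up data frame, only for illustration
toy_beer <- data.frame(
  beer_name = c("Alpha", "Bravo", "Charlie", "Delta", "Echo"),
  rating = c(4.2, 3.5, 3.9, 4.0, 4.6)
)

# Base R: square brackets with a condition in the row position
base_result <- toy_beer[toy_beer$rating >= 4.0, ]

# tidyverse: pipe the data frame into filter()
tidy_result <- toy_beer %>%
  filter(rating >= 4.0)

# Same beers either way
identical(base_result$beer_name, tidy_result$beer_name)
```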

In both base R and tidyverse you can ask for specific character strings in the same way. What if we just want beers from Illinois?

beer %>%
  filter(brewery_state == 'Illinois')
## # A tibble: 511 x 8
##    number_of_revie~ brewery_name brewery_state beer_name rating style   abv
##               <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl>
##  1                3 4204 Main S~ Illinois      Off Duty~   3.38 Amer~   4.2
##  2                3 Hailstorm B~ Illinois      Hotel Li~   3.42 Amer~   4.7
##  3                3 The Old Bak~ Illinois      Cerveza ~   3.45 Amer~   5  
##  4                3 Cervecería ~ Illinois      Gringo H~   4    Amer~   5  
##  5                2 Goose Islan~ Illinois      Natural ~   3.62 Amer~   4.7
##  6                2 Riggs Beer ~ Illinois      American~   3.89 Amer~   4.9
##  7                2 Blue Nose B~ Illinois      Pipa Cor~   3.58 Amer~   5.2
##  8                1 DryHop Brew~ Illinois      BBQ Igua~   3.5  Amer~   5.5
##  9                1 Imperial Oa~ Illinois      Margarit~   4.49 Amer~   5.5
## 10                1 Corridor Br~ Illinois      Shift Be~   3.79 Amer~   4.7
## # ... with 501 more rows, and 1 more variable: year <dbl>

You can combine conditions using logical operators. & (‘and’) and | (‘or’) are the most commonly used. Let’s get just American IPAs from Illinois:

beer %>%
  filter(brewery_state == 'Illinois' & style == 'American IPA')
## # A tibble: 80 x 8
##    number_of_revie~ brewery_name brewery_state beer_name rating style   abv
##               <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl>
##  1             3154 Revolution ~ Illinois      Anti-Her~   4.1  Amer~   6.7
##  2             1285 Half Acre B~ Illinois      GoneAway~   4.17 Amer~   7  
##  3              896 Goose Islan~ Illinois      Endless ~   3.52 Amer~   5  
##  4              879 Two Brother~ Illinois      Heavy Ha~   3.82 Amer~   6.7
##  5              833 Half Acre B~ Illinois      Vallejo     4.13 Amer~   6.7
##  6              820 Revolution ~ Illinois      Galaxy-H~   4.12 Amer~   7  
##  7              747 Revolution ~ Illinois      Citra-He~   4.21 Amer~   7.5
##  8              689 Goose Islan~ Illinois      Rambler ~   3.59 Amer~   6.7
##  9              600 Revolution ~ Illinois      Crystal-~   4.13 Amer~   7.2
## 10              599 Finch Beer ~ Illinois      Threadle~   3.57 Amer~   6  
## # ... with 70 more rows, and 1 more variable: year <dbl>

Numeric conditions can be combined in the same manner. What if we want beer that is either really good (say a rating > 4.0) or is just really alcoholic (abv > 10)? That would be:

beer %>%
  filter(rating > 4.0 | abv > 10)
## # A tibble: 637 x 8
##    number_of_revie~ brewery_name brewery_state beer_name rating style   abv
##               <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl>
##  1                3 Green Man B~ North Caroli~ Green Ma~   3.79 Amer~ 110. 
##  2                3 Pabst Milwa~ Wisconsin     Schlitz     4.18 Amer~   4.7
##  3                1 Imperial Oa~ Illinois      Margarit~   4.49 Amer~   5.5
##  4                1 On Tour Bre~ Illinois      Cities      4.09 Amer~   5  
##  5             2534 Elysian Bre~ Washington    Space Du~   4.08 Amer~   8.2
##  6             2498 Pipeworks B~ Illinois      Ninja Vs~   4.28 Amer~   8  
##  7             1975 New Glarus ~ Wisconsin     Thumbpri~   4.33 Amer~   9  
##  8             1405 Wicked Weed~ North Caroli~ Freak Of~   4.3  Amer~   8.5
##  9             1396 Pipeworks B~ Illinois      Citra       4.4  Amer~   9.5
## 10             1067 Revolution ~ Illinois      Unsessio~   4.37 Amer~  10  
## # ... with 627 more rows, and 1 more variable: year <dbl>

You can also string tidyverse functions together using extra pipes. So let’s add another filter to also select only American IPAs or American Double / Imperial IPAs:

beer %>%
  filter(rating > 4.0 | abv > 10) %>%
  filter(style == 'American IPA' | style == 'American Double / Imperial IPA')
## # A tibble: 267 x 8
##    number_of_revie~ brewery_name brewery_state beer_name rating style   abv
##               <dbl> <chr>        <chr>         <chr>      <dbl> <chr> <dbl>
##  1             2534 Elysian Bre~ Washington    Space Du~   4.08 Amer~   8.2
##  2             2498 Pipeworks B~ Illinois      Ninja Vs~   4.28 Amer~   8  
##  3             1975 New Glarus ~ Wisconsin     Thumbpri~   4.33 Amer~   9  
##  4             1405 Wicked Weed~ North Caroli~ Freak Of~   4.3  Amer~   8.5
##  5             1396 Pipeworks B~ Illinois      Citra       4.4  Amer~   9.5
##  6             1067 Revolution ~ Illinois      Unsessio~   4.37 Amer~  10  
##  7             1012 Half Acre B~ Illinois      Double D~   4.27 Amer~   8  
##  8              962 Finch Beer ~ Illinois      Hardcore~   4.05 Amer~   9  
##  9              932 Central Wat~ Wisconsin     Illumina~   4.05 Amer~   9  
## 10              807 Pipeworks B~ Illinois      Emerald ~   4.49 Amer~   9.5
## # ... with 257 more rows, and 1 more variable: year <dbl>
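The same result can also be written as a single filter() call. The sketch below uses a made-up data frame (`toy_beer` and `ipa_styles` are names invented for illustration) and assumes dplyr is installed. Two things to notice: the parentheses keep the ‘or’ condition together, since R evaluates & before |, and the %in% operator checks a column against several values at once instead of repeating ==.

```r
library(dplyr)

# Made-up rows, only for illustration
toy_beer <- data.frame(
  beer_name = c("Alpha", "Bravo", "Charlie", "Delta", "Echo"),
  style = c("American IPA", "American Stout", "American IPA",
            "American Double / Imperial IPA", "American Stout"),
  rating = c(4.2, 3.5, 3.9, 4.0, 4.6),
  abv = c(6.5, 8.0, 7.0, 10.5, 11.0)
)

ipa_styles <- c("American IPA", "American Double / Imperial IPA")

# Two chained filters
chained <- toy_beer %>%
  filter(rating > 4.0 | abv > 10) %>%
  filter(style == "American IPA" | style == "American Double / Imperial IPA")

# One filter: parentheses around the 'or', %in% for the list of styles
combined <- toy_beer %>%
  filter((rating > 4.0 | abv > 10) & style %in% ipa_styles)

# Both approaches keep the same beers
identical(chained$beer_name, combined$beer_name)
```

Which version you use is a matter of taste. Chained filters are often easier to read one step at a time.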

Of course, we’re doing all this filtering, but not assigning the resulting data frame anywhere. Let’s take the above filtering operation and assign it to a new data frame called beer_high. Remember, we assign things to objects using the <- operator.

beer_high <- beer %>%
  filter(rating > 4.0 | abv > 10) %>%
  filter(style == 'American IPA'| style == 'American Double / Imperial IPA')

Assigning things to objects will result in them not printing to the console, so it’s good to go and check that it actually worked. Let’s use glimpse() and unique() to make sure that beer_high actually contains what we want it to!

glimpse(beer_high)
## Rows: 267
## Columns: 8
## $ number_of_reviews <dbl> 2534, 2498, 1975, 1405, 1396, 1067, 1012, 962, 93...
## $ brewery_name      <chr> "Elysian Brewing Company", "Pipeworks Brewing Com...
## $ brewery_state     <chr> "Washington", "Illinois", "Wisconsin", "North Car...
## $ beer_name         <chr> "Space Dust IPA", "Ninja Vs. Unicorn", "Thumbprin...
## $ rating            <dbl> 4.08, 4.28, 4.33, 4.30, 4.40, 4.37, 4.27, 4.05, 4...
## $ style             <chr> "American Double / Imperial IPA", "American Doubl...
## $ abv               <dbl> 8.2, 8.0, 9.0, 8.5, 9.5, 10.0, 8.0, 9.0, 9.0, 9.5...
## $ year              <dbl> 2012, 2012, 2014, 2013, 2012, 2014, 2010, 2014, 2...

Looks good at a glance, as there are way fewer rows, just as there should be.

unique(beer_high$style)
## [1] "American Double / Imperial IPA" "American IPA"

And only our two selected styles are present.
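Eyeballing the output works, but you can also ask R to confirm the conditions for you. Here is a minimal sketch using a made-up stand-in for beer_high (the `toy_high` data frame and `wanted_styles` vector are invented for illustration). all() returns TRUE only if every row passes the check.

```r
# Made-up stand-in for a filtered data frame, only for illustration
toy_high <- data.frame(
  style = c("American IPA", "American Double / Imperial IPA", "American IPA"),
  rating = c(4.2, 3.9, 4.5),
  abv = c(6.5, 10.5, 7.0)
)

wanted_styles <- c("American IPA", "American Double / Imperial IPA")

# Does every row have one of the styles we asked for?
styles_ok <- all(toy_high$style %in% wanted_styles)
styles_ok

# Does every row meet the rating-or-abv condition?
values_ok <- all(toy_high$rating > 4.0 | toy_high$abv > 10)
values_ok
```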

2.5.2 Accessing specific columns

You can use square bracket notation to get specific columns as well. You can do this by entering in the numeric value of the column. For example, rating is the 5th column, so we can enter the following to select it:

beer[,5]
## # A tibble: 1,350 x 1
##    rating
##     <dbl>
##  1   2.69
##  2   2.27
##  3   2.05
##  4   2.77
##  5   1.86
##  6   2.01
##  7   2.08
##  8   2.94
##  9   3.54
## 10   3.08
## # ... with 1,340 more rows

But it’s generally not a good idea to select columns by numeric position as that could change if you add/remove columns. It’s better to instead ask for them by name.

beer[,'abv']
## # A tibble: 1,350 x 1
##      abv
##    <dbl>
##  1  4.6 
##  2  4.66
##  3  5.5 
##  4  4.6 
##  5  4.8 
##  6  5   
##  7  5.9 
##  8  4.6 
##  9  4.7 
## 10  4.8 
## # ... with 1,340 more rows

And if you want multiple columns just combine their names into a vector with c()!

beer[,c('abv', 'rating', 'year')]
## # A tibble: 1,350 x 3
##      abv rating  year
##    <dbl>  <dbl> <dbl>
##  1  4.6    2.69  2000
##  2  4.66   2.27  2002
##  3  5.5    2.05  2001
##  4  4.6    2.77  2001
##  5  4.8    1.86  2001
##  6  5      2.01  2001
##  7  5.9    2.08  2001
##  8  4.6    2.94  2002
##  9  4.7    3.54  2008
## 10  4.8    3.08  2002
## # ... with 1,340 more rows

tidyverse makes this a bit more clear using the select() function.

beer %>%
  select('abv', 'rating', 'year')
## # A tibble: 1,350 x 3
##      abv rating  year
##    <dbl>  <dbl> <dbl>
##  1  4.6    2.69  2000
##  2  4.66   2.27  2002
##  3  5.5    2.05  2001
##  4  4.6    2.77  2001
##  5  4.8    1.86  2001
##  6  5      2.01  2001
##  7  5.9    2.08  2001
##  8  4.6    2.94  2002
##  9  4.7    3.54  2008
## 10  4.8    3.08  2002
## # ... with 1,340 more rows
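select() also accepts column names without quotes, and a minus sign drops a column instead of keeping it. Here is a quick sketch on a made-up data frame (`toy_beer` is invented for illustration), assuming dplyr is installed.

```r
library(dplyr)

# Made-up data frame, only for illustration
toy_beer <- data.frame(
  beer_name = c("Alpha", "Bravo", "Charlie"),
  abv = c(6.5, 8.0, 7.0),
  rating = c(4.2, 3.5, 3.9),
  year = c(2012, 2015, 2017)
)

# Quoted names
quoted <- toy_beer %>%
  select('abv', 'rating')

# Bare (unquoted) names give the same columns
bare <- toy_beer %>%
  select(abv, rating)

identical(names(quoted), names(bare))

# A minus sign drops a column and keeps the rest
dropped <- toy_beer %>%
  select(-year)
names(dropped)
```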

2.5.3 Creating new columns

Often one will need to manipulate a column and create a new column with the manipulated data. For example, one might want to change units from minutes to seconds by multiplying by 60, or take the average of several columns. We can do this in both base R and tidyverse.

Let’s consider our beer data again. The year column contains the year in which the beer was added to the database, and thus roughly corresponds to the year it came on the market. Let’s use this column and the number_of_reviews column to create a new column of how many reviews per year a beer has gotten. Given these data were collected in 2018, we 1) need to subtract the year value from 2018 to get years on market, and 2) divide number_of_reviews by this value to get a new variable of reviews per year.

Let’s first add a new column current_year to the data frame that contains just the value 2018. We can do this in base R by assigning a new column with the syntax existing_data_frame_name$new_column_name <- value_for_new_column.

beer$current_year <- 2018

Check it’s there

glimpse(beer)
## Rows: 1,350
## Columns: 9
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name      <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state     <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name         <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating            <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style             <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv               <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year              <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...

Now let’s create a new column called years_on_market by subtracting year from current_year:

beer$years_on_market <- beer$current_year - beer$year

Check!

glimpse(beer)
## Rows: 1,350
## Columns: 10
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name      <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state     <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name         <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating            <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style             <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv               <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year              <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...
## $ years_on_market   <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...

Of course, you don’t need to create the current_year column. Instead, you can just do that when you create the years_on_market column. Let’s do it again but call it years_on_market_2 to show it’s the same.

beer$years_on_market_2 <- 2018 - beer$year

Check!

glimpse(beer)
## Rows: 1,350
## Columns: 11
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name      <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state     <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name         <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating            <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style             <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv               <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year              <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...
## $ years_on_market   <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
## $ years_on_market_2 <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...

tidyverse can do this using the mutate() function. The general syntax is the same as before. Within mutate() you first specify the new column name, then =, then what math or manipulation you want that new column to contain. Note that as with the other tidyverse functions, you only tell it the data frame name once at first, and then you call columns directly within functions (i.e. you don’t need to use the $ operator as you do with base R).

beer <- beer %>%
  mutate(years_on_market_3 = 2018 - year)

Check!

glimpse(beer)
## Rows: 1,350
## Columns: 12
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name      <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state     <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name         <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating            <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style             <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv               <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year              <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year      <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...
## $ years_on_market   <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
## $ years_on_market_2 <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
## $ years_on_market_3 <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
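The second step described at the start of this section, dividing number_of_reviews by years on market, follows the same pattern. Here is a sketch on a made-up data frame (`toy_beer` and the column name `reviews_per_year` are invented for illustration), assuming dplyr is installed. Note that a beer added in 2018 has been on the market for 0 years, so the division gives Inf. How best to handle those beers is a judgment call that this sketch leaves alone.

```r
library(dplyr)

# Made-up data frame, only for illustration
toy_beer <- data.frame(
  beer_name = c("Alpha", "Bravo", "Charlie"),
  number_of_reviews = c(180, 40, 12),
  year = c(2008, 2016, 2018)
)

# mutate() can create several columns at once, and later columns
# can use the ones created just before them
toy_beer <- toy_beer %>%
  mutate(years_on_market = 2018 - year,
         reviews_per_year = number_of_reviews / years_on_market)

toy_beer

# The base R equivalent of the same calculation
toy_beer$reviews_per_year_2 <- toy_beer$number_of_reviews / (2018 - toy_beer$year)
```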

Great! You now have several different ways to create new columns within R. All are useful and will be used at different points. Don’t worry if you don’t see the distinctions between them all. For now, you just need to be aware that there are different ways and be able to apply at least one of them to new data.