2 R Refresher
This lesson is a brief refresher on R. You all have had at least one R class and thus most of this should be familiar.
One thing worth noting is that we’ll be using a mix of base R and the tidyverse set of packages. For those of you who haven’t used tidyverse, this set of packages allows for a more natural and easy-to-follow R syntax. One goal of this lesson is to introduce tidyverse to those who haven’t used it, and to introduce more base R to those who learned the tidyverse-centric approach. You can learn more about tidyverse through its wonderful help files.
2.1 Installing and loading packages
To install tidyverse, you can run the following code.
install.packages("tidyverse")
While you only have to install packages once, you need to load them into your environment each time you start a new R session. Thus, the very first lines of your code should load the packages that you need. You load each package using the library() function.
library(tidyverse)
You can see that it is telling you what packages were loaded, as tidyverse is actually a family of related packages. It also tells you if there are any conflicts. In this case, a couple of tidyverse functions share names with functions in other packages, and the message is just telling you how it’s dealing with this.
2.2 Loading data into your environment
The next step after getting your packages loaded is bringing your data into your local environment. There are several ways to do this. Often, I’ll be providing a URL for a google spreadsheet that will just load directly. For example:
beer <- read_csv("https://docs.google.com/spreadsheets/d/18Iux-10Ggj2qLNEgH5WJGGUNTKET9Tpy3HHl1gc6L9Y/gviz/tq?tqx=out:csv")
## Parsed with column specification:
## cols(
## number_of_reviews = col_double(),
## brewery_name = col_character(),
## brewery_state = col_character(),
## beer_name = col_character(),
## rating = col_double(),
## style = col_character(),
## abv = col_double(),
## year = col_double()
## )
This brings in data that I scraped from the website www.beeradvocate.com. Note that we use the function read_csv(), which is the tidyverse version of base R’s read.csv(). I like the tidyverse version as it is more explicit in telling you what data type it imported each column as (in this case doubles, i.e. numeric values, and character strings). Also note that we assigned the data to the object beer.
There are other ways to bring data into your environment. One way is to use the file.choose() function inside read_csv(). This makes a graphical user interface (GUI) pop up so you can manually select your data file from wherever you downloaded it on your computer. For example:
my_df <- read_csv(file.choose())
Alternatively, you can use a filepath specific to your local machine:
my_df <- read_csv('/Users/nick_dirienzo/Documents/R/intro/data/beer.csv')
There are pros and cons to each method.
- Using file.choose() to get a GUI is fast, but it means you need to manually reselect your data every time you open R or if you make a mistake. I tend to use it if I’m quickly exploring a dataset for someone else and won’t need to do it again.
- Having your data on Google Sheets is great as you can access it from anywhere on any machine, but it takes a few more steps to set up and has file size limitations. Feel free to check out the tutorial I made on how to do this!
- Using filepaths is the fastest method in terms of load time, and it is also less limited by file size. The downside is that filepaths are specific to your computer, so it’s a bit more difficult if you use a few machines or want to work with other people.
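To try the filepath approach without downloading anything, the sketch below writes a tiny made-up table to a temporary file and reads it back. It uses base R’s read.csv() so it runs without tidyverse; the toy values and the names toy and toy_path are illustrative assumptions, not part of the beer data.

```r
# A tiny, made-up data frame standing in for a downloaded file
toy <- data.frame(
  beer_name = c("Toy Lager", "Toy IPA", "Toy Stout"),
  rating    = c(2.9, 4.1, 3.8),
  abv       = c(4.6, 6.7, 9.0)
)

# file.path() joins folder and file names with the right separator
toy_path <- file.path(tempdir(), "toy_beer.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Read the file back in using its path
my_df <- read.csv(toy_path)
str(my_df)
```

Building the path with file.path() rather than typing it out by hand is one way to make a script a little easier to move between machines.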
2.3 Exploring our dataframe
OK! Now that we know how to bring in data, let’s take a bit to explore it. Data exploration is a critical step to understanding your data and making sure that any manipulations you’ve done worked how you expected them to. You should always be checking and rechecking your data!
To start, let’s compare two summary functions to explore our data frame: base R’s summary() and tidyverse’s glimpse().
summary(beer)
## number_of_reviews brewery_name brewery_state beer_name
## Min. : 1.0 Length:1350 Length:1350 Length:1350
## 1st Qu.: 8.0 Class :character Class :character Class :character
## Median : 33.0 Mode :character Mode :character Mode :character
## Mean : 133.6
## 3rd Qu.: 89.0
## Max. :12618.0
## rating style abv year
## Min. :1.82 Length:1350 Min. : 3.900 Min. : 0
## 1st Qu.:3.74 Class :character 1st Qu.: 6.300 1st Qu.:2013
## Median :3.96 Mode :character Median : 7.800 Median :2014
## Mean :3.90 Mean : 8.381 Mean :2008
## 3rd Qu.:4.14 3rd Qu.: 9.500 3rd Qu.:2016
## Max. :4.83 Max. :110.500 Max. :2209
glimpse(beer)
## Rows: 1,350
## Columns: 8
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
Of course, there are pros and cons to each method.
- summary() is nice in that it gives you summary statistics (min, max, mean, etc.) of your numeric data. This is really useful as you can immediately spot some issues (e.g. some strange values in the year column). Still, it tells you almost nothing about your character columns beyond their length and class. It also doesn’t directly report the dimensions of the data frame itself (number of rows or columns).
- glimpse() is great in that it simply shows you values from every column so you get an immediate idea of the data contained within. You also get the number of rows (observations) and columns (variables) up top, as well as the data type of each column. The downside is that you don’t have an overall summary of what’s contained in each column.
  - Some might say that this is just a fancy version of the head() function, and in a way it is. I’d argue it’s better, though, as head() only shows you as many columns as your console is wide. Go try head(beer) to see what I mean!
Neither of these methods does a perfect job of showing you what character values are present in the data. summary() shows you nothing, while glimpse() shows only what fits in the console window. Calling the function unique() on a specific column is a great way to see what’s present. Try it out:
unique(beer$style)
## [1] "American Adjunct Lager" "American Double / Imperial IPA"
## [3] "American Double / Imperial Stout" "American IPA"
## [5] "American Stout"
So we can see that there are five unique styles out of the 1350 observations in our data frame. Go try it on the other columns. When is this method useful? When is it not?
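unique() lists the values but not how often each one occurs. As a small sketch on a made-up vector (the styles object below is illustrative, not the real column), base R’s table() counts each value, which can make a rare misspelling stand out:

```r
# Made-up values, including one deliberately inconsistent entry
styles <- c("American IPA", "American Stout", "American IPA",
            "american ipa", "American Stout", "American IPA")

unique(styles)          # the distinct values, in order of first appearance
length(unique(styles))  # how many distinct values there are

style_counts <- table(styles)  # how many times each value occurs
style_counts
```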
2.4 Exploring data via plots
Although summary() gives summary statistics of your quantitative columns, it doesn’t provide any information on the distribution of values within those columns. For that you need to actually plot out the data with a histogram.
Histograms in base R are simple. You just use the function hist() and feed it the column you want to use. You can add additional arguments to control the bins, but for simple data exploration you can normally skip that.
hist(beer$rating)
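As a hedged sketch of the optional bin argument: hist() accepts breaks, which R treats as a suggestion for the number of bins when it is given a single number. The ratings vector here is simulated purely for illustration.

```r
set.seed(42)  # make the simulated values reproducible
ratings <- rnorm(200, mean = 3.9, sd = 0.4)

# breaks = 20 asks for roughly 20 bins; R may adjust to "pretty" cut points
h <- hist(ratings, breaks = 20, main = "Simulated ratings", xlab = "rating")

# hist() also returns the bin edges and counts it used
h$breaks
h$counts
```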
You can also use ggplot for histograms. ggplot generally makes much more attractive figures and is easier to use for complex figures than base R. For really quick exploration, though, I think ggplot can be a bit slow. Still, just to introduce you to the syntax, below is the code and associated plot.
You can see that there are three parts to any ggplot call:
- The ggplot() function with the data frame that you’re calling. In this case beer.
- Your aesthetic aes() where you tell it what columns from your data are your x and y.
  - Histograms only have an x axis so that’s all we needed to specify here.
- Your geom_plottype(). There are lots of different plot types (e.g. geom_point, geom_line, etc.), but in this case we just need the obvious geom_histogram.
ggplot(beer,
       aes(x = rating)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
So, pretty much the same plot as base R. ggplot gave us a message saying we should select the binwidth rather than using the default, but this is fine for now. Go ahead and make plots for the other numeric columns. What errors do you find from visual inspection?
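To follow that advice, geom_histogram() takes a binwidth argument. Here is a minimal sketch on a made-up data frame (toy and the 0.25 width are assumptions chosen only for illustration):

```r
library(ggplot2)

# Made-up ratings, not the real beer data
toy <- data.frame(rating = c(2.1, 2.7, 3.4, 3.6, 3.9, 4.0, 4.2, 4.5))

# binwidth is in the units of the x variable, so each bar spans 0.25 rating points
p <- ggplot(toy, aes(x = rating)) +
  geom_histogram(binwidth = 0.25)

print(p)
```

A sensible width depends on the spread of the column, so it is worth trying a few values.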
2.5 Slicing and dicing - data frame manipulation
Being able to manipulate your data frame is a critical skill for any R workflow. Here we’ll go over how to select specific rows and columns, as well as how to make totally new columns.
Here’s a sample of rows from our data frame. You can imagine that we might want to pull out just the columns relating to rating and alcohol by volume (abv). Or we might want to make a new data frame of one specific style (say American IPA), or a certain range of alcohol values. This is something that you’ll need to do all the time. Thus, we’re going to cover how this is done using both base R and tidyverse syntax.

number_of_reviews | brewery_name | brewery_state | beer_name | rating | style | abv | year |
---|---|---|---|---|---|---|---|
77 | Rhinelander Brewing Company | Wisconsin | Export Beer | 2.80 | American Adjunct Lager | 5.0 | 2011 |
54 | Two Beers Brewing Co. | Washington | Forester Double IPA | 3.58 | American Double / Imperial IPA | 7.8 | 2013 |
2 | Rhinelander Brewing Company | Wisconsin | Braumeister | 3.05 | American Adjunct Lager | 4.2 | 2017 |
13 | Olde Hickory Brewery | North Carolina | Maple Syruption | 4.12 | American Double / Imperial Stout | 9.0 | 2017 |
4 | Dead Bird Brewing Company | Wisconsin | Moustache Wax | 2.81 | American Double / Imperial Stout | 10.5 | 2017 |
42 | Revolution Brewing | Illinois | Hops For Heroes: Homefront IPA | 3.70 | American IPA | 6.2 | 2015 |
21 | Hop Butcher For The World | Illinois | The Jewels | 4.13 | American Double / Imperial IPA | 7.5 | 2018 |
2.5.1 Accessing rows
All data frames in R can be accessed using what’s called square bracket notation. In this notation the two positions inside the square brackets correspond to the rows and columns, separated by a comma, like this: [row number, column number]. Leaving one of the two positions blank returns everything along that dimension (all rows or all columns). For example:
beer[5,]
## # A tibble: 1 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv year
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 789 Miller Brew~ Wisconsin Milwauke~ 1.86 Amer~ 4.8 2001
This prints off every column of the 5th row. Note that you have to put the data frame name before the square brackets so R knows what data you’re trying to access. Also, you must include the comma; without it, R treats the single value as a column index rather than a row.
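The role of the comma is easiest to see on a small example. The sketch below uses a made-up base R data frame (toy is an illustrative name, not the beer data):

```r
toy <- data.frame(
  beer_name = c("A", "B", "C"),
  rating    = c(3.5, 4.2, 2.8),
  abv       = c(5.0, 6.5, 4.7)
)

toy[2, ]   # row 2, every column
toy[, 2]   # every row of column 2 (a plain vector for a base data frame)
toy[2]     # no comma: 2 is treated as a column index, so this is column 2
```

With a tibble such as beer, the column versions stay as one-column tibbles rather than dropping to a vector.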
You can also access a range of rows by entering a sequence of numbers. Remember you can make sequences by typing two values separated by a colon, e.g. 2:10. Let’s try it:
beer[2:10,]
## # A tibble: 9 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv year
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 2521 Miller Brew~ Wisconsin Miller G~ 2.27 Amer~ 4.66 2002
## 2 1267 Miller Brew~ Wisconsin Icehouse 2.05 Amer~ 5.5 2001
## 3 924 Hamm's Brew~ Wisconsin Hamm's 2.77 Amer~ 4.6 2001
## 4 789 Miller Brew~ Wisconsin Milwauke~ 1.86 Amer~ 4.8 2001
## 5 770 Miller Brew~ Wisconsin Red Dog 2.01 Amer~ 5 2001
## 6 690 Miller Brew~ Wisconsin Milwauke~ 2.08 Amer~ 5.9 2001
## 7 599 Jacob Leine~ Wisconsin Leinenku~ 2.94 Amer~ 4.6 2002
## 8 518 JOS. Schlit~ Ariz0na Schlitz ~ 3.54 Amer~ 4.7 2008
## 9 319 Stevens Poi~ Wisconsin Point Sp~ 3.08 Amer~ 4.8 2002
You can also create a vector using the combine function c(). Entering c(5,10,20) will make a vector of the values 5, 10, and 20. Enter that into our square brackets:
beer[c(5,10,20),]
## # A tibble: 3 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv year
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 789 Miller Brew~ Wisconsin Milwauke~ 1.86 Amer~ 4.8 2001
## 2 319 Stevens Poi~ Wisconsin Point Sp~ 3.08 Amer~ 4.8 2002
## 3 31 Minhas Craf~ Wisconsin Boxer Ice 2.36 Amer~ 5.5 2011
This is great and all, but what you’ll need to do more commonly is get rows that match a specific value, or set of values. To do this we ask R to return rows where data within a column meet a specific condition. For example, what if we want only beers where the rating is greater than or equal to 4.0? In R syntax that would be beer$rating >= 4.0. That code entered straight into the console will give you a bunch of TRUE and FALSE values, as it’s going through and checking which values in the rating column of beer are greater than or equal to 4.0. So just enter that into your square bracket notation to return just those rows where this is TRUE:
beer[beer$rating >= 4.0,]
## # A tibble: 613 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 3 Cervecería ~ Illinois Gringo H~ 4 Amer~ 5
## 2 3 Pabst Milwa~ Wisconsin Schlitz 4.18 Amer~ 4.7
## 3 1 Bare Bones ~ Wisconsin Chiquita~ 4 Amer~ 5.3
## 4 1 Imperial Oa~ Illinois Margarit~ 4.49 Amer~ 5.5
## 5 1 On Tour Bre~ Illinois Cities 4.09 Amer~ 5
## 6 2534 Elysian Bre~ Washington Space Du~ 4.08 Amer~ 8.2
## 7 2498 Pipeworks B~ Illinois Ninja Vs~ 4.28 Amer~ 8
## 8 1975 New Glarus ~ Wisconsin Thumbpri~ 4.33 Amer~ 9
## 9 1405 Wicked Weed~ North Caroli~ Freak Of~ 4.3 Amer~ 8.5
## 10 1396 Pipeworks B~ Illinois Citra 4.4 Amer~ 9.5
## # ... with 603 more rows, and 1 more variable: year <dbl>
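To see the TRUE and FALSE values that drive this kind of subsetting, here is a minimal sketch on a made-up data frame (toy, keep, and high are illustrative names):

```r
toy <- data.frame(
  beer_name = c("A", "B", "C", "D"),
  rating    = c(3.5, 4.0, 2.8, 4.4)
)

keep <- toy$rating >= 4.0   # one TRUE/FALSE per row
keep
sum(keep)                   # TRUE counts as 1, so this is the number of matches

high <- toy[keep, ]         # only the rows where keep is TRUE
high
```

Storing the condition in its own object first is optional, but it makes it easy to check how many rows will be returned before subsetting.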
You can do the same in tidyverse syntax using the function filter(). You can see that the syntax is a bit different. For one, you only call the data once. You then have what’s called a ‘pipe’, which is the %>%. The pipe is functionally equivalent to saying ‘then’. Finally, you have the function filter(). Thus, you’re saying ‘take the data frame beer, then filter it for all ratings greater than or equal to 4.0’. Note that you don’t have to call the data frame when asking for the column rating… tidyverse does that for you!
beer %>%
  filter(rating >= 4.0)
In both base R and tidyverse you can ask for specific character strings in the same way. What if we just want beers from Illinois?
beer %>%
  filter(brewery_state == 'Illinois')
## # A tibble: 511 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 3 4204 Main S~ Illinois Off Duty~ 3.38 Amer~ 4.2
## 2 3 Hailstorm B~ Illinois Hotel Li~ 3.42 Amer~ 4.7
## 3 3 The Old Bak~ Illinois Cerveza ~ 3.45 Amer~ 5
## 4 3 Cervecería ~ Illinois Gringo H~ 4 Amer~ 5
## 5 2 Goose Islan~ Illinois Natural ~ 3.62 Amer~ 4.7
## 6 2 Riggs Beer ~ Illinois American~ 3.89 Amer~ 4.9
## 7 2 Blue Nose B~ Illinois Pipa Cor~ 3.58 Amer~ 5.2
## 8 1 DryHop Brew~ Illinois BBQ Igua~ 3.5 Amer~ 5.5
## 9 1 Imperial Oa~ Illinois Margarit~ 4.49 Amer~ 5.5
## 10 1 Corridor Br~ Illinois Shift Be~ 3.79 Amer~ 4.7
## # ... with 501 more rows, and 1 more variable: year <dbl>
You can combine conditions using additional operators. & and |, or ‘and’ and ‘or’, are the most commonly used. Let’s get just American IPAs from Illinois:
beer %>%
  filter(brewery_state == 'Illinois' & style == 'American IPA')
## # A tibble: 80 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 3154 Revolution ~ Illinois Anti-Her~ 4.1 Amer~ 6.7
## 2 1285 Half Acre B~ Illinois GoneAway~ 4.17 Amer~ 7
## 3 896 Goose Islan~ Illinois Endless ~ 3.52 Amer~ 5
## 4 879 Two Brother~ Illinois Heavy Ha~ 3.82 Amer~ 6.7
## 5 833 Half Acre B~ Illinois Vallejo 4.13 Amer~ 6.7
## 6 820 Revolution ~ Illinois Galaxy-H~ 4.12 Amer~ 7
## 7 747 Revolution ~ Illinois Citra-He~ 4.21 Amer~ 7.5
## 8 689 Goose Islan~ Illinois Rambler ~ 3.59 Amer~ 6.7
## 9 600 Revolution ~ Illinois Crystal-~ 4.13 Amer~ 7.2
## 10 599 Finch Beer ~ Illinois Threadle~ 3.57 Amer~ 6
## # ... with 70 more rows, and 1 more variable: year <dbl>
Numerics can be done in the same manner. What if we want beer that is either really good (say a rating > 4.0) or is just really alcoholic (abv > 10)? That would be:
beer %>%
  filter(rating > 4.0 | abv > 10)
## # A tibble: 637 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 3 Green Man B~ North Caroli~ Green Ma~ 3.79 Amer~ 110.
## 2 3 Pabst Milwa~ Wisconsin Schlitz 4.18 Amer~ 4.7
## 3 1 Imperial Oa~ Illinois Margarit~ 4.49 Amer~ 5.5
## 4 1 On Tour Bre~ Illinois Cities 4.09 Amer~ 5
## 5 2534 Elysian Bre~ Washington Space Du~ 4.08 Amer~ 8.2
## 6 2498 Pipeworks B~ Illinois Ninja Vs~ 4.28 Amer~ 8
## 7 1975 New Glarus ~ Wisconsin Thumbpri~ 4.33 Amer~ 9
## 8 1405 Wicked Weed~ North Caroli~ Freak Of~ 4.3 Amer~ 8.5
## 9 1396 Pipeworks B~ Illinois Citra 4.4 Amer~ 9.5
## 10 1067 Revolution ~ Illinois Unsessio~ 4.37 Amer~ 10
## # ... with 627 more rows, and 1 more variable: year <dbl>
You can also string tidyverse functions together using extra pipes. So let’s add another filter to also select only American IPAs or American Double / Imperial IPAs:
beer %>%
  filter(rating > 4.0 | abv > 10) %>%
  filter(style == 'American IPA' | style == 'American Double / Imperial IPA')
## # A tibble: 267 x 8
## number_of_revie~ brewery_name brewery_state beer_name rating style abv
## <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 2534 Elysian Bre~ Washington Space Du~ 4.08 Amer~ 8.2
## 2 2498 Pipeworks B~ Illinois Ninja Vs~ 4.28 Amer~ 8
## 3 1975 New Glarus ~ Wisconsin Thumbpri~ 4.33 Amer~ 9
## 4 1405 Wicked Weed~ North Caroli~ Freak Of~ 4.3 Amer~ 8.5
## 5 1396 Pipeworks B~ Illinois Citra 4.4 Amer~ 9.5
## 6 1067 Revolution ~ Illinois Unsessio~ 4.37 Amer~ 10
## 7 1012 Half Acre B~ Illinois Double D~ 4.27 Amer~ 8
## 8 962 Finch Beer ~ Illinois Hardcore~ 4.05 Amer~ 9
## 9 932 Central Wat~ Wisconsin Illumina~ 4.05 Amer~ 9
## 10 807 Pipeworks B~ Illinois Emerald ~ 4.49 Amer~ 9.5
## # ... with 257 more rows, and 1 more variable: year <dbl>
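When one column is compared against several values, R’s %in% operator is a common alternative to chaining == with |. The sketch below assumes the dplyr package (part of tidyverse) is installed and uses a made-up data frame; toy, ipa_styles, and the 3.5 cutoff are illustrative only.

```r
library(dplyr)

toy <- data.frame(
  beer_name = c("A", "B", "C", "D"),
  style     = c("American IPA", "American Stout",
                "American Double / Imperial IPA", "American IPA"),
  rating    = c(4.2, 4.5, 3.9, 3.1)
)

# The set of styles we want to keep
ipa_styles <- c("American IPA", "American Double / Imperial IPA")

good_ipas <- toy %>%
  filter(style %in% ipa_styles) %>%  # keep rows whose style is in the set
  filter(rating > 3.5)               # then keep the better-rated ones

good_ipas
```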
Of course, we’re doing all this filtering, but not assigning the resulting data frame anywhere. Let’s take the above filtering operation and assign it to a new data frame called beer_high. Remember, we assign things to objects using the <- operator:
beer_high <- beer %>%
  filter(rating > 4.0 | abv > 10) %>%
  filter(style == 'American IPA' | style == 'American Double / Imperial IPA')
Assigning things to objects will result in them not printing to the console, so it’s good to go and check that it actually worked. Let’s use glimpse() and unique() to make sure that beer_high actually contains what we want it to!
glimpse(beer_high)
## Rows: 267
## Columns: 8
## $ number_of_reviews <dbl> 2534, 2498, 1975, 1405, 1396, 1067, 1012, 962, 93...
## $ brewery_name <chr> "Elysian Brewing Company", "Pipeworks Brewing Com...
## $ brewery_state <chr> "Washington", "Illinois", "Wisconsin", "North Car...
## $ beer_name <chr> "Space Dust IPA", "Ninja Vs. Unicorn", "Thumbprin...
## $ rating <dbl> 4.08, 4.28, 4.33, 4.30, 4.40, 4.37, 4.27, 4.05, 4...
## $ style <chr> "American Double / Imperial IPA", "American Doubl...
## $ abv <dbl> 8.2, 8.0, 9.0, 8.5, 9.5, 10.0, 8.0, 9.0, 9.0, 9.5...
## $ year <dbl> 2012, 2012, 2014, 2013, 2012, 2014, 2010, 2014, 2...
Looks good at a glance, as there are far fewer rows, as there should be.
unique(beer_high$style)
## [1] "American Double / Imperial IPA" "American IPA"
And only our two selected styles are present.
2.5.2 Accessing specific columns
You can use square bracket notation to get specific columns as well. You can do this by entering in the numeric value of the column. For example, rating is the 5th column, so we can enter the following to select it:
beer[, 5]
## # A tibble: 1,350 x 1
## rating
## <dbl>
## 1 2.69
## 2 2.27
## 3 2.05
## 4 2.77
## 5 1.86
## 6 2.01
## 7 2.08
## 8 2.94
## 9 3.54
## 10 3.08
## # ... with 1,340 more rows
But it’s generally not a good idea to select columns by numeric position, as that could change if you add/remove columns. It’s better to instead ask for them by name.
beer[, 'abv']
## # A tibble: 1,350 x 1
## abv
## <dbl>
## 1 4.6
## 2 4.66
## 3 5.5
## 4 4.6
## 5 4.8
## 6 5
## 7 5.9
## 8 4.6
## 9 4.7
## 10 4.8
## # ... with 1,340 more rows
And if you want multiple columns, just make a vector of names!
beer[, c('abv', 'rating', 'year')]
## # A tibble: 1,350 x 3
## abv rating year
## <dbl> <dbl> <dbl>
## 1 4.6 2.69 2000
## 2 4.66 2.27 2002
## 3 5.5 2.05 2001
## 4 4.6 2.77 2001
## 5 4.8 1.86 2001
## 6 5 2.01 2001
## 7 5.9 2.08 2001
## 8 4.6 2.94 2002
## 9 4.7 3.54 2008
## 10 4.8 3.08 2002
## # ... with 1,340 more rows
tidyverse makes this a bit more clear using the select() function.
beer %>%
  select('abv', 'rating', 'year')
## # A tibble: 1,350 x 3
## abv rating year
## <dbl> <dbl> <dbl>
## 1 4.6 2.69 2000
## 2 4.66 2.27 2002
## 3 5.5 2.05 2001
## 4 4.6 2.77 2001
## 5 4.8 1.86 2001
## 6 5 2.01 2001
## 7 5.9 2.08 2001
## 8 4.6 2.94 2002
## 9 4.7 3.54 2008
## 10 4.8 3.08 2002
## # ... with 1,340 more rows
2.5.3 Creating new columns
Often one will need to manipulate a column and create a new column with the manipulated data. For example, one might want to change units from minutes to seconds by multiplying by 60, or take the average of several columns. We can do this in both base R and tidyverse.
Let’s consider our beer data again. The year column contains the year in which the beer was added to the database, and thus roughly corresponds to the year it came on the market. Let’s use this column and the number_of_reviews column to create a new column of how many reviews per year a beer has gotten. Given these data were collected in 2018, we 1) need to subtract the year value from 2018 to get years on market, and 2) divide number_of_reviews by this value to get a new variable of reviews per year.
Let’s first add a new column current_year to the data frame that contains just the value 2018. We can do this in base R by assigning a new column with the syntax existing_data_frame_name$new_column_name <- value_for_new_column
beer$current_year <- 2018
Check it’s there
glimpse(beer)
## Rows: 1,350
## Columns: 9
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...
Now let’s create a new column called years_on_market by subtracting year from current_year
beer$years_on_market <- beer$current_year - beer$year
Check!
glimpse(beer)
## Rows: 1,350
## Columns: 10
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...
## $ years_on_market <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
Of course, you don’t need to create the current_year column. Instead, you can just do that when you create the years_on_market column. Let’s do it again but call it years_on_market_2 to show it’s the same
beer$years_on_market_2 <- 2018 - beer$year
Check!
glimpse(beer)
## Rows: 1,350
## Columns: 11
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...
## $ years_on_market <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
## $ years_on_market_2 <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
tidyverse can do this using the mutate() function. The general syntax is the same as before. Within mutate() you first specify the new column name, then =, then what math or manipulation you want that new column to contain. Note that as with the other tidyverse functions, you only tell it the data frame name once at first, and then you call columns directly within functions (i.e. you don’t need to use the $ operator as you do with base R).
beer <- beer %>%
  mutate(years_on_market_3 = 2018 - year)
Check!
glimpse(beer)
## Rows: 1,350
## Columns: 12
## $ number_of_reviews <dbl> 3851, 2521, 1267, 924, 789, 770, 690, 599, 518, 3...
## $ brewery_name <chr> "Miller Brewing Co.", "Miller Brewing Co.", "Mill...
## $ brewery_state <chr> "Wisconsin", "Wisconsin", "Wisconsin", "Wisconsin...
## $ beer_name <chr> "Miller High Life", "Miller Genuine Draft", "Iceh...
## $ rating <dbl> 2.69, 2.27, 2.05, 2.77, 1.86, 2.01, 2.08, 2.94, 3...
## $ style <chr> "American Adjunct Lager", "American Adjunct Lager...
## $ abv <dbl> 4.60, 4.66, 5.50, 4.60, 4.80, 5.00, 5.90, 4.60, 4...
## $ year <dbl> 2000, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2...
## $ current_year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2...
## $ years_on_market <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
## $ years_on_market_2 <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
## $ years_on_market_3 <dbl> 18, 16, 17, 17, 17, 17, 17, 16, 10, 16, 17, 8, 16...
Great! You now have several different ways to create new columns within R. All are useful and will be used at different points. Don’t worry if you don’t see the distinctions between them all. For now, you just need to be aware that there are different ways and be able to apply at least one of them to new data.
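The reviews-per-year variable described at the start of this section can be built the same way. The sketch below uses a made-up data frame and base R; the column name reviews_per_year and the handling of zero years are assumptions for illustration, not the lesson’s own code.

```r
toy <- data.frame(
  beer_name         = c("A", "B", "C"),
  number_of_reviews = c(360, 80, 12),
  year              = c(2000, 2014, 2018)
)

# Step 1: years on market, given data collected in 2018
toy$years_on_market <- 2018 - toy$year

# Step 2: reviews per year on market
toy$reviews_per_year <- toy$number_of_reviews / toy$years_on_market

# A beer added in 2018 has 0 years on market, so the division gives Inf;
# one option (an assumption here) is to mark those rows as missing instead
toy$reviews_per_year[toy$years_on_market == 0] <- NA

toy
```

On the real data it would be worth checking the year column first, since odd year values would carry straight through into this calculation.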