5.4 Working with columns

5.4.1 select()

To select columns of a data frame, use select(). The first argument to this function is the data frame (measles_us), and the subsequent arguments are the columns to keep, separated by commas. Alternatively, if the columns you want are adjacent to each other, you can use a : to select a range of columns, read as "select columns from ___ to ___."

We also want to save the result as a new data frame so that we can work with it without overwriting our original data frame.

#select the columns we want to work with and save to a new object.

measles_us_modified <-
  select(
    measles_us,
    Admin1Name,
    PeriodStartDate,
    PeriodEndDate,
    PartOfCumulativeCountSeries,
    CountValue
  )

#inspect our new data frame
glimpse(measles_us_modified)
## Rows: 422,051
## Columns: 5
## $ Admin1Name                  <chr> "WISCONSIN", "WISCONSIN", "WISCONSIN", "WI…
## $ PeriodStartDate             <chr> "11/20/1927", "11/27/1927", "12/4/1927", "…
## $ PeriodEndDate               <chr> "11/26/1927", "12/3/1927", "12/10/1927", "…
## $ PartOfCumulativeCountSeries <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ CountValue                  <dbl> 85, 120, 84, 106, 39, 45, 28, 140, 48, 85,…
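The range form of select() mentioned above is not used in this lesson's code, so here is a minimal sketch of it. The tibble toy and its column names are made up for illustration and merely stand in for measles_us.

```r
library(dplyr)

# Hypothetical toy data, standing in for measles_us
toy <- tibble(
  state = c("WISCONSIN", "OHIO"),
  start = c("11/20/1927", "11/27/1927"),
  end   = c("11/26/1927", "12/3/1927"),
  count = c(85, 120),
  notes = c(NA, "late report")
)

# Keep every column from start through count, in their existing order
toy_range <- select(toy, start:count)

names(toy_range)
```

The range follows the order in which the columns appear in the data frame, so start:count keeps start, end, and count, and drops state and notes.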

5.4.2 Working with dates

Notice that the data type of each column is listed next to the column name in the readout above. As we saw earlier, read_csv() sometimes parses the data types of columns incorrectly, and we learned how to use the col_types argument to fix this on import. In our data, we see that PeriodStartDate and PeriodEndDate were read in as character data instead of date data. Dates can often be a challenge when working with data. R has a few ways of dealing with this, but in this lesson we will look at the lubridate package, which is part of the tidyverse.

Below we use $ notation, a handy base R way of extracting a particular column, written as df$column. Then we use the function mdy(). This function expects an input string in month, day, year order, and it is agnostic to the separating punctuation (i.e., it can handle mm-dd-yyyy, mm/dd/yyyy, etc.).

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
#overwrite the old character column with the new date column.
measles_us_modified$PeriodStartDate <-
  mdy(measles_us_modified$PeriodStartDate)

#Do the same for the PeriodEndDate column
measles_us_modified$PeriodEndDate <-
  mdy(measles_us_modified$PeriodEndDate)

#inspect
head(measles_us_modified)
## # A tibble: 6 x 5
##   Admin1Name PeriodStartDate PeriodEndDate PartOfCumulativeCountSeri… CountValue
##   <chr>      <date>          <date>                             <dbl>      <dbl>
## 1 WISCONSIN  1927-11-20      1927-11-26                             0         85
## 2 WISCONSIN  1927-11-27      1927-12-03                             0        120
## 3 WISCONSIN  1927-12-04      1927-12-10                             0         84
## 4 WISCONSIN  1927-12-18      1927-12-24                             0        106
## 5 WISCONSIN  1927-12-25      1927-12-31                             0         39
## 6 WISCONSIN  1928-01-01      1928-01-07                             0         45

Now these columns are recognized as date objects, displayed in the standard YYYY-MM-DD format.
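To see what "agnostic to the separating punctuation" means in practice, here is a small sketch using made-up strings rather than the lesson data. The object names same_day, parsed, and bad are illustrative only.

```r
library(lubridate)

# Made-up strings: the same date written with different separators
same_day <- c("12/4/1927", "12-04-1927", "12.4.1927")
parsed <- mdy(same_day)

# A string in day/month/year order cannot be read as month/day/year;
# quiet = TRUE suppresses the parsing warning and the result is NA
bad <- mdy("20/11/1927", quiet = TRUE)

parsed
bad
```

All three strings in same_day should parse to the same date, while the day-first string fails. If your data are in day, month, year order, lubridate provides dmy() for that case.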

5.4.3 Renaming columns

Sometimes when you receive data, you may find that the column names are not very descriptive or useful, and it may be necessary to rename them. You can assign new names to columns as you select them, using select(newColumnName = OldColumnName), or you can use the rename() function.

measles_us_modified <-
  rename(measles_us_modified, State = Admin1Name)
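The two approaches differ in which columns they keep. The sketch below shows this on a made-up tibble; toy, the Extra column, and the new name Cases are hypothetical and not part of the lesson data.

```r
library(dplyr)

# Hypothetical toy data reusing two column names from the lesson
toy <- tibble(
  Admin1Name = c("WISCONSIN", "OHIO"),
  CountValue = c(85, 120),
  Extra      = c("a", "b")
)

# select() keeps only the columns listed, renaming them as it goes
via_select <- select(toy, State = Admin1Name, Cases = CountValue)

# rename() keeps every column and changes only the names given
via_rename <- rename(toy, State = Admin1Name)

names(via_select)
names(via_rename)
```

Renaming inside select() is convenient when you are already narrowing down the columns; rename() is the safer choice when you want to leave the rest of the data frame untouched.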