10 Assignment 1 - week 2 - data wrangling & import

Setup

This chunk of R code loads the packages that we will be using.

library(tidyverse) # functions using {tidyr}, {dplyr}, and {readr}
library(readxl)    # to read Excel files
library(janitor)   # for data cleaning

YOUR NAME HERE

Introduction

For this homework assignment, please write your answer for each question after the question text but before the line break before the next one.

In some cases, you will have to insert R code chunks, and run them to ensure that you’ve got the right result.

Use all of the R Markdown formatting you want! Bullets, bold text, etc. is welcome. And don’t forget to consider changes to the YAML.

Once you have finished your assignment, create an HTML document by “knitting” the document using either the “Preview” or “Knit” button in the top left of the script window frame.

Questions

1. Religion and income data (USA)

(2 marks)

The table relig_income is in {tidyr}. Load this table, and take a look at it. You can use View(relig_income) to see the table, and to understand the contents of the table you can use the Help function ?tidyr::relig_income (enter that code in the console).

head(tidyr::relig_income)

## # A tibble: 6 × 11
##   religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k` `$100-150k` `>150k`
##   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>       <dbl>   <dbl>
## 1 Agnostic       27        34        60        81        76       137        122         109      84
## 2 Atheist        12        27        37        52        35        70         73          59      74
## 3 Buddhist       27        21        30        34        33        58         62          39      53
## 4 Catholic      418       617       732       670       638      1116        949         792     633
## 5 Don’t kn…      15        14        15        11        10        35         21          17      18
## 6 Evangeli…     575       869      1064       982       881      1486        949         723     414
## # ℹ 1 more variable: `Don't know/refused` <dbl>

What tidy principle(s) does this table violate?
Using the functions in {tidyr}, arrange this table in such a way that it is no longer untidy.

2. Importing data: csv

2.a Read the data

(6 marks)

NOTE:

Variable names with spaces and special symbols are a challenge to work with. They require that you put the name inside a pair of back-ticks. For example, in the dataset below, a variable called Court Region must be inside the back-ticks.

Remember that you have the option to use the clean_names() function in the {janitor} package to resolve this!
An overview of {janitor} functions is here: https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html

Comma delimited files are one of the most common formats used for flat files. The tidyverse has a package, {readr}, which will automatically load with the function library(tidyverse).

The reference page for {readr} is here: https://readr.tidyverse.org/index.html

Read in the file “cy-concluded-cases-5-yr-regional-report-dashboard-2014-to-2018.csv”, which has the number of concluded court cases in B.C. by a variety of dimensions.⁴

Assign the contents of the file to a named object (remember, name things in a descriptive way!)

Read the file again, but specify that the region code should be a character.

2.b Analysis

Now you’ve read the contents of the file, use your data manipulation skills to create a table that shows the total number of cases that were completed in each court region in 2018. Sort the table so the busiest regions are at the top.
Create a table showing the number of cases by court class completed in each region during 2018, with the region as the rows and the court class as the columns.

3. Importing data: Excel

3.a Read the data

(2 marks)

Use the {readxl} package to read in the file “bcmetal.xlsx”

This data is published by the Province of British Columbia, and shows the production of metals from mines in British Columbia from 1888 to 1988. ⁵

There are three sheets in the file. The sheet we are interested in is BCMETAL_PROD, metal production table for 1452 metal mines operating within the province between 1888 and 1988.

The reference page for {readxl} is here: https://readxl.tidyverse.org/

What are the names of the sheets in this file?
Read in the contents of the “BCMETAL_PROD” sheet.
Reading the file without any additional arguments will generate some warnings.

The first warning is

Warning: Expecting numeric in G8425 / R8425C7: got ’TOTALS:

The way to interpret this (and all the other warnings) is that R was busy reading the file, and got some unexpected values. Specifically in this case, what is cell G8425 in the Excel sheet (another way to express this location is row (R) 8425 and column (C) 7 is not of the type that was expected. R was busy reading the 7th column, and for 8424 rows found numbers–but in row 8425 it found a character, “TOTAL:”

The problem to solve: read only those rows that have valid data in them.

Change the options for your read function to address any issues you identified in 3.a.ii

3.b Analysis

(6 marks)

Use your data manipulation skills to answer the following, as separate calculations:

Based on the data here, in what year was the most gold mined in British Columbia?
What B.C. mine holds the record for the most silver extracted in B.C. through the entire life of the mine?
Does that mine hold the record for the total silver mined during a single year in B.C.?

-30-

This file by Martin Monkman is licensed under a Creative Commons Attribution 4.0 International License -->

9 R Markdown

11 Introduction to data visualization