10 Assignment 1 - week 2 - data wrangling & import
Introduction
For this homework assignment, please write your answer for each question after the question text but before the line break before the next one.
In some cases, you will have to insert R code chunks, and run them to ensure that you’ve got the right result.
Use all of the R Markdown formatting you want! Bullets, bold text, etc. is welcome. And don’t forget to consider changes to the YAML.
Once you have finished your assignment, create an HTML document by “knitting” the document using either the “Preview” or “Knit” button in the top left of the script window frame.
Questions
1. Religion and income data (USA)
(2 marks)
The table relig_income
is in {tidyr}. Load this table, and take a look at it. You can use View(relig_income)
to see the table, and to understand the contents of the table you can use the Help function ?tidyr::relig_income
(enter that code in the console).
head(tidyr::relig_income)
## # A tibble: 6 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k` `$100-150k` `>150k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122 109 84
## 2 Atheist 12 27 37 52 35 70 73 59 74
## 3 Buddhist 27 21 30 34 33 58 62 39 53
## 4 Catholic 418 617 732 670 638 1116 949 792 633
## 5 Don’t kn… 15 14 15 11 10 35 21 17 18
## 6 Evangeli… 575 869 1064 982 881 1486 949 723 414
## # ℹ 1 more variable: `Don't know/refused` <dbl>
What tidy principle(s) does this table violate?
Using the functions in {tidyr}, arrange this table in such a way that it is no longer untidy.
2. Importing data: csv
2.a Read the data
(6 marks)
NOTE:
Variable names with spaces and special symbols are a challenge to work with. They require
that you put the name inside a pair of back-ticks. For example, in the dataset below, a variable called Court Region
must be inside the back-ticks.
Remember that you have the option to use the
clean_names()
function in the {janitor} package to resolve this!An overview of {janitor} functions is here: https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html
Comma delimited files are one of the most common formats used for flat files. The tidyverse has a package, {readr}, which will automatically load with the function library(tidyverse)
.
- The reference page for {readr} is here: https://readr.tidyverse.org/index.html
- Read in the file “cy-concluded-cases-5-yr-regional-report-dashboard-2014-to-2018.csv”, which has the number of concluded court cases in B.C. by a variety of dimensions.4
Assign the contents of the file to a named object (remember, name things in a descriptive way!)
- Read the file again, but specify that the region code should be a character.
2.b Analysis
Now you’ve read the contents of the file, use your data manipulation skills to create a table that shows the total number of cases that were completed in each court region in 2018. Sort the table so the busiest regions are at the top.
Create a table showing the number of cases by court class completed in each region during 2018, with the region as the rows and the court class as the columns.
3. Importing data: Excel
3.a Read the data
(2 marks)
Use the {readxl} package to read in the file “bcmetal.xlsx”
This data is published by the Province of British Columbia, and shows the production of metals from mines in British Columbia from 1888 to 1988. 5
There are three sheets in the file. The sheet we are interested in is BCMETAL_PROD, metal production table for 1452 metal mines operating within the province between 1888 and 1988.
- The reference page for {readxl} is here: https://readxl.tidyverse.org/
What are the names of the sheets in this file?
Read in the contents of the “BCMETAL_PROD” sheet.
Reading the file without any additional arguments will generate some warnings.
- The first warning is
Warning: Expecting numeric in G8425 / R8425C7: got ’TOTALS:
The way to interpret this (and all the other warnings) is that R was busy reading the file, and got some unexpected values. Specifically in this case, what is cell G8425 in the Excel sheet (another way to express this location is row (R) 8425 and column (C) 7 is not of the type that was expected. R was busy reading the 7th column, and for 8424 rows found numbers–but in row 8425 it found a character, “TOTAL:”
The problem to solve: read only those rows that have valid data in them.
Change the options for your read function to address any issues you identified in 3.a.ii
3.b Analysis
(6 marks)
Use your data manipulation skills to answer the following, as separate calculations:
Based on the data here, in what year was the most gold mined in British Columbia?
What B.C. mine holds the record for the most silver extracted in B.C. through the entire life of the mine?
Does that mine hold the record for the total silver mined during a single year in B.C.?
-30-