4 Underwater
This brief practical application has two goals. First, I want to make sure that you are comfortable with the R and RStudio environment. Second, I want everyone to make some plots using ggplot
. You can find lots of examples in Chapter 3 of R for Data Science.
We begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic plots.
4.1 Setting Up Our R Environment
The top portion of your R Markdown file (between the three dashed lines) is called the YAML. It stands for “YAML Ain’t Markup Language.” It is a human friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML and that it contains meta information about your document.
Start by opening the project for this course in RStudio. Then, go to the project folder that you created for this course. If you do not have a project folder or a project for the course, see Chapter 8 of our text.
If you haven’t already, let’s do the following:
- Create a folder for this class a folder called
elon-r-datacamp
. - Go to the upper-right of R Studio and click the dropdown project menu. Create a new project in this folder called elon-r-datacamp
- Inside this folder, create a folder called
data
. Go to our course Github page and grab the Zillow data file. It’s the one with the really long name. Save it in theelon-r-datacamp
folder. - Use RStudio to create a new R Notebook inside of the
elon-r-datacamp
folder. Remember, an R Notebook is where you can mix your R code and written text. You format that text using the markdown syntax. Currently, your new notebook is called Untitled1, or something like that. Save your notebook as “underwater-lastname” in yourelon-r-datacamp
folder. Use your last name, of course!
The code generated when creating a new R Notebook is there to provide some syntax guidance.
In the YAML header, add lines for author:
and date:
underneath title:
. Put your name for author and today’s date. Then, hit the preview
button.
If you look in the folder that contains underwater-lastname.rmd
, the R Markdown file you are coding in, you’ll now see a file called underwater-lastname.nb.html
. This is the HTML notebook file that gets generated when you do a preview. This is the formatted file that you can share with people, like you would a Word .doc or PDF. You can save the notebooks as Word .docs or PDFs, actually!
4.1.1 Installing and Loading Packages
We will work with the tidyverse
packages, a collection of packages for doing data analysis in a “tidy” way. In particular, we will make use of ggplot2
and just two pieces of dplyr
. We are also going to use the here
package to navigate folders when loading in our data.
Install these packages by running the following in the console, or by finding and installing the packages manually in the lower-right window. No need to put this in your R Notebook document. If you’ve already installed the tidyverse
and here
, then you can skip this step.
install.packages("tidyverse")
install.packages("here")
Now, let’s get started with some code in the R Notebook. Go ahead and delete the code below the YAML header that you get every time you create a new R Notebook. Then, insert a new R chunk. You can name each chunk and put some options in between the curly brackets. The top line of your R chunk should look like ```{r set-up}
. I like to call my first chunk set-up and then put things like my library()
statements in there. Add code to load the tidyverse
and here
libraries. Remember, you need to do this in every R Notebook where you want to use something from the tidyverse
. As you write more code, your set-up chunk will include more things.
I left the name of the file the way it was when downloaded for a reason. If you change your file names, you are not be able to trace back your raw data to the source. You can read more about research by Zillow in this NPR interview.
Finally, use read_csv
, part of the tidyverse
, to load in the Zillow file. You’ll need to get the path to your folder correct, so we’ll use here::here()
to help us out. Remember, path names are relative to where R is currently looking. The here
library will have R start looking for your file in the same place as your R project.
library(tidyverse)
uw <- read_csv(here::here("data",
"zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv"))
Make sure that you’ve run the code chunk that loads the tidyverse
library and brings in our data. Do you see the cryptic message underneath? That will appear in your document if you don’t get rid of messages. This is an option for that you can add to each code chunk that you create. Go back and change the top line in that code chunk to read ```{r set-up, echo=TRUE, message=FALSE}
This means that the code will go into your document, but the messages R gives you about what happened will not. Any output, like plots, will still appear, however.
You can actually load the data directly from Github without even downloading it. The read_csv
function from readr
in the tidyverse
package can understand URLs.
uw <- read_csv("https://github.com/aaiken1/elon-r-datacamp/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv")
Finally, remember to re-run your chunks every time you change part of it. The output won’t appear in the HTML file unless you’ve run the chunk yourself. They same is true if you change an input into a chunk in another chunk. Run your chunks!
4.2 Data
You can find many different Zillow data sets here. The “uw” here stands for “under water” in a very literal sense. I first thought they were talking about mortgages!
The data frame we will be working with today contains data from Zillow.com, the housing web site, on the number of homes, the median value of homes, etc. at-risk for flooding due to rising sea levels.
Let’s take a glimpse
at our data. The glimpse
function comes from the tibble
package. I am using the package::function
method here, just to make it clear where glimpse
is coming from. You can’t run glimpse
without the tibble
package installed. You get tibble
as part of the tidyverse
.
tibble::glimpse(uw)
#> Rows: 2,610
#> Columns: 24
#> $ RegionType <chr> "Nation", "State", "S…
#> $ RegionName <chr> "United States", "Ala…
#> $ StateName <chr> NA, "Alabama", "Calif…
#> $ MSAName <chr> NA, NA, NA, NA, NA, N…
#> $ AllHomes_Tier1 <dbl> 35461549, 546670, 306…
#> $ AllHomes_Tier2 <dbl> 35452941, 520247, 307…
#> $ AllHomes_Tier3 <dbl> 35484532, 491300, 316…
#> $ AllHomes_AllTiers <dbl> 106399022, 1558217, 9…
#> $ UWHomes_Tier1 <dbl> 594672, 2890, 8090, 4…
#> $ UWHomes_Tier2 <dbl> 542681, 2766, 14266, …
#> $ UWHomes_Tier3 <dbl> 725955, 5046, 17783, …
#> $ UWHomes_AllTiers <dbl> 1863308, 10702, 40139…
#> $ UWHomes_TotalValue_Tier1 <dbl> 122992000000, 3659866…
#> $ UWHomes_TotalValue_Tier2 <dbl> 195513000000, 6291769…
#> $ UWHomes_TotalValue_Tier3 <dbl> 597449000000, 2583039…
#> $ UWHomes_TotalValue_AllTiers <dbl> 915954000000, 3578203…
#> $ UWHomes_MedianValue_AllTiers <dbl> 310306.0, 270254.0, 8…
#> $ AllHomes_Tier1_ShareUW <dbl> 0.016769487, 0.005286…
#> $ AllHomes_Tier2_ShareUW <dbl> 0.015307080, 0.005316…
#> $ AllHomes_Tier3_ShareUW <dbl> 0.020458351, 0.010270…
#> $ AllHomes_AllTiers_ShareUW <dbl> 0.017512454, 0.006868…
#> $ UWHomes_ShareInTier1 <dbl> 0.3191485, 0.2700430,…
#> $ UWHomes_ShareInTier2 <dbl> 0.2912460, 0.2584564,…
#> $ UWHomes_ShareInTier3 <dbl> 0.3896055, 0.4715007,…
The first thing to note about this data is that there are several tiers. Some numbers are at the national level, some at the region, some at the state, and some at the MSA. tidy
data refers to one row, one observation, across different variables (columns) and, possibly, across time. When you have multiple observations of the same firm, state, person, etc. over time, we call that panel data. Panel data is tidy
too.
This data isn’t tidy! Note that some RegionTypes (e.g. Nation, State) are actually summaries of other rows. This is bit confusing, but we will work with it as is.
MSA stands for Metropolitan Statistical Area.
Finally, let’s think about the variable definitions a bit. Zillow places homes into three tiers based on the home’s estimated market value, with Tier3 being the highest value. AllHomes refers to all homes in that tier or across all tiers. UWHomes refers to homes that are at risk for flooding. Note that there are some variables that are counts, some that are dollar values, and others that are percentages.
4.3 Data visualization and summary
With your data loaded and an R Markdown notebook ready to go, let’s make some plots. You can include all of your R code and output in this single R Notebook.
See our notes for how to use filter
and %>%
. This is the only dplyr
that you need for this exercise.
- Start with the full set of data and use
filter
and%>%
to keep only theRegionType
observations that are equal to “MSA.” Then, create a scatter plot ofUWHomes_Tier1
on the x axis andUWHomes_TotalValue_Tier1
on the y. Do all of this in one R chunk. You will then run that code together in the chunk in your notebook. The resulting plot should appear below.
How would you create a new data set with just the MSA RegionType
observations? If you do that, you won’t need to do the filter
and %>%
each time.
An important note: You’ll note that sometimes people put the mapping = aes()
inside of ggplot()
and other times they put it inside of the geom_()
statement. The R for Data Science text tends to put them inside of the geom_x()
statement. ggplot()
gets a data =
. See Section 3.2.3 of the book. But, when you are using pipes, you don’t need the data =
inside ggplot()
. The data gets “passed through” the pipe.
You can put mapping = aes()
inside the geom_x()
with other options following after a comma, like in the textbook. You could also put mapping = aes()
inside ggplot()
.
It is important to get use to seeing the same thing done several ways. There’s never just one way to code something. This will become very clear when you search for help online.
That looks OK, I guess. Go back and add some appropriate labels for the X and Y axis. Give your chart a title. Let’s use the
labs
option, the anx
, ay
, and atitle
. Finally, usegeom_smooth
to add a linear trend line. You can do all of this in the original chunk. No need to write the same code again.Is there anything surprising here?
Let’s create a statistical summary of sorts for each state. Again use
filter
and%>%
to keep only the MSA observations, unless you created a new data set with just those observations, as mentioned above. Usegeom_boxplot()
to summariseUWHomes_Tier1
by state across all MSA (there are multiple MSA observations per state). Again, add appropriate labeling. Anything interesting here?
For Part 4, to make things easy to read, put the state name on the y axis. You can do this by adding coord_flip()
to your ggplot
layers.
Anything change when you create the same plot for
UWHomes_Tier3
? Which states should be worried, assuming that we like Zillow’s model? Finally, addgeom_jitter()
to the new Tier3 graph. What does this do?Create a histogram that shows the distribution of
UWHomes_Tier3
across all MSA regions. Use abinwidth
of 5000 homes. Is that a normal distribution? Why or why not?

- Create two denisty plots on the same graph that compare the distribution of
UWHomes_Tier3
across all MSAs in just CA and NJ. To do this, we need to create a vector that contains the two state names.
The c
function combines the two strings together into one vector that contains the two state names. Strings means characters, or text, in computer lingo. You can use the class
function to see that states
contains character values.
We can then use this vector of names to select observations for just those two states using the %in
operator inside of the filter
function.
Remember to also filter so that you just have MSA observations.
Use scale_fill_manual
to make California this particular grey (“#999999”) and New Jersey this specific orange (“#E69F00”).
4.4 Other Items
- Resize your figures:
Click on the gear icon in on top of the R Markdown document, and select “Output Options…” in the dropdown menu. In the pop up dialogue box go to the Figures tab and change the height and width of the figures, and hit OK when done. Then, preview
your document and see how you like the new sizes. Change and preview
again and again until you’re happy with the figure sizes. Note that these values get saved in the YAML at the top of your Notebook.
You can also use different figure sizes for different figures. To do so click on the gear icon within the chunk where you want to make a change. Changing the figure sizes added new options to these chunks: fig.width
and fig.height
. You can change them by defining different values directly in your R Markdown document as well.

- Change the look of your report:
Once again click on the gear icon at the top of the R Markdown document and select “Output Options…” in the dropdown menu. In the General tab of the pop up dialogue box, try out different Syntax highlighting and theme options. Hit OK and preview
your document to see how it looks. Play around with these until you’re happy with the look.
- Remember to run your chunks!
Getting weird output in your final document? Did you remember to add message=FALSE
to your R chunk options?
The output in your final document will only reflect the last run for each chunk. Note that there is an option to run all chunks above or below your cursor under the Run menu.