Chapter 4 Opening the Data
We can open our containers or “nests” in data using the function unnest from the package tidyr.
library(tidyr)
df <- unnest(df, data)
df
>> # A tibble: 24 × 9
>> source country date file_…¹ conti…² year lifeExp pop gdpPe…³
>> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
>> 1 gapminder afghanist… 2022… csv Asia 1952 28.801 8425… 779.44…
>> 2 gapminder afghanist… 2022… csv Asia 1957 30.332 9240… 820.85…
>> 3 gapminder afghanist… 2022… csv Asia 1962 31.997 1026… 853.10…
>> 4 gapminder afghanist… 2022… csv Asia 1967 34.02 1153… 836.19…
>> 5 gapminder afghanist… 2022… csv Asia 1972 36.088 1307… 739.98…
>> 6 gapminder afghanist… 2022… csv Asia 1977 38.438 1488… 786.11…
>> 7 gapminder afghanist… 2022… csv Asia 1982 39.854 1288… 978.01…
>> 8 gapminder afghanist… 2022… csv Asia 1987 40.822 1386… 852.39…
>> 9 gapminder afghanist… 2022… csv Asia 1992 41.674 1631… 649.34…
>> 10 gapminder afghanist… 2022… csv Asia 1997 41.763 2222… 635.34…
>> # … with 14 more rows, and abbreviated variable names ¹file_type,
>> # ²continent, ³gdpPercap
This data looks clean, but looks can be deceiving. Remember that we set the df object to be as_tibble? A tibble is not only a table in R, but a cleaner kind of table. It shows only 10 rows, and as many columns as can comfortably be displayed. The header and footer around the data describe the table: the header (top line) # A tibble: 24 × 9 tells you how many rows (24) and columns (9) there are. The footer (bottom line) # … with 14 more rows, and abbreviated variable names ¹file_type, ²continent, ³gdpPercap tells you what is missing from, or shortened in, the display.
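The numbers in the header can also be checked directly, and the display limits can be lifted when needed. A minimal sketch, assuming the tibble package is installed and using a hypothetical 12-row tibble named toy in place of df:

```r
library(tibble)

# Hypothetical stand-in for df: 12 rows, 2 character columns.
toy <- tibble(
  year    = as.character(rep(seq(1952, 1977, by = 5), times = 2)),
  country = rep(c("afghanistan", "canada"), each = 6)
)

# The row and column counts reported in the header line.
nrow(toy)
ncol(toy)

# Print every row instead of only the first 10.
print(toy, n = Inf)

# List every column with its type, one column per line,
# so that no column name is abbreviated.
glimpse(toy)
```

The same calls work on a tibble of any size, which fits the goal of practicing techniques that generalize beyond small data.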
Even though this data is small (24 rows by 9 columns can be evaluated using our eyes), it is always best to practice techniques that are generalizable to both small and large data.
4.1 Checking Data
The simplest way to start evaluating data is to check that the values under each column meet expectations. Since we created the first 4 columns using our file names, we can be sure that these 4 columns are clean.
For the next column, continent, we expect values to be continents, capitalized, and spelled correctly. Instead of reading each line with our eyes, we can read each line with our computers, or use a combination of the two. That is exactly how we will start. We will check all unique (i.e. distinct) values under the continent column using distinct. It produces a tibble with only the distinct rows for the column(s) you choose.
distinct(df, continent)
>> # A tibble: 2 × 1
>> continent
>> <chr>
>> 1 Asia
>> 2 Americas
These unique values are perfect, which means every value in the column is perfect, since every value is one of these. But what if they were not capitalized, for example? That is in fact the case with our country column: the values are not capitalized because they came from our file names, and it is good practice not to capitalize file names. To capitalize the values in a column, you can use a function called str_to_title.
str_to_title(df$country)
>> [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
>> [5] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
>> [9] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
>> [13] "Canada" "Canada" "Canada" "Canada"
>> [17] "Canada" "Canada" "Canada" "Canada"
>> [21] "Canada" "Canada" "Canada" "Canada"
df
>> # A tibble: 24 × 9
>> source country date file_…¹ conti…² year lifeExp pop gdpPe…³
>> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
>> 1 gapminder afghanist… 2022… csv Asia 1952 28.801 8425… 779.44…
>> 2 gapminder afghanist… 2022… csv Asia 1957 30.332 9240… 820.85…
>> 3 gapminder afghanist… 2022… csv Asia 1962 31.997 1026… 853.10…
>> 4 gapminder afghanist… 2022… csv Asia 1967 34.02 1153… 836.19…
>> 5 gapminder afghanist… 2022… csv Asia 1972 36.088 1307… 739.98…
>> 6 gapminder afghanist… 2022… csv Asia 1977 38.438 1488… 786.11…
>> 7 gapminder afghanist… 2022… csv Asia 1982 39.854 1288… 978.01…
>> 8 gapminder afghanist… 2022… csv Asia 1987 40.822 1386… 852.39…
>> 9 gapminder afghanist… 2022… csv Asia 1992 41.674 1631… 649.34…
>> 10 gapminder afghanist… 2022… csv Asia 1997 41.763 2222… 635.34…
>> # … with 14 more rows, and abbreviated variable names ¹file_type,
>> # ²continent, ³gdpPercap
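Printing df again shows that country is still lowercase: str_to_title returns a new vector and leaves df untouched. A minimal sketch of keeping the result, assuming the dplyr and stringr packages are installed and using a hypothetical tibble named toy in place of df:

```r
library(tibble)
library(dplyr)
library(stringr)

# Hypothetical stand-in for df with only the two columns of interest.
toy <- tibble(
  country   = c("afghanistan", "afghanistan", "canada", "canada"),
  continent = c("Asia", "Asia", "Americas", "Americas")
)

# Overwrite the country column with its capitalized version and
# assign the result back to toy so that the change is kept.
toy <- mutate(toy, country = str_to_title(country))

# Re-check the distinct values after the change.
distinct(toy, country)
```

Assigning back to the same name is one option; saving to a new name instead would keep the original values available for comparison.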
4.2 Finding Functions
How would you know which function to use if I did not tell you?
4.2.1 Google
Google is a great search engine that most R programmers use when learning the R language. If we search “r capitalize first letter” we see, on 2022-10-26, the following paragraph as the first result:
Convert First letter of every word to Uppercase in R Programming – str_to_title() Function. str_to_title() Function in R Language is used to convert the first letter of every word of a string to Uppercase and the rest of the letters are converted to lower case.
The trick is to always write r at the start of your Google search, before a question like how to capitalize first letter or simply capitalize first letter.
This is a simple example. Often it is difficult to put what you want into words; this will come with time and practice. At first you may find that the Google search results have nothing to do with what you need. That is a sign to re-word your search or, if you’ve already re-worded your search, it may be a sign that there is no dedicated function for what you need, or that a different approach is needed. It’s rare that there will be no dedicated function so long as your goal is simple. You may find that it is effective to break down what you’re doing into simple steps, and then search for how to do each step, as opposed to Googling something long and complicated that involves many steps.
4.2.2 Stack Overflow
Speaking of breaking down something complicated so that a search engine like Google can understand it, this is also necessary for others to understand it. For learning R, allowing others to understand your challenge or need is valuable as the R community is not only willing, but also quickly able to help. R users mainly help each other through Stack Overflow. It is a website that easily allows users to ask or answer questions with code, have their code formatted (look nice), and receive feedback.
The main draw of Stack Overflow is that the person asking the question has one main responsibility, and that is to produce what is called a minimally reproducible example: an example that can be used (reproduced) by someone else seeing the question, and that does not have unnecessary detail irrelevant to the question (minimal).
Knowing how to make an example is the majority of the work involved in asking a question on Stack Overflow.
4.2.2.1 Creating Minimally Reproducible Examples
If your question involves data frames, you need to learn how to build a data frame before asking your question on Stack Overflow. To build a data frame, you can use the tibble function from the tibble package.
If you have 2 columns of numbers (stored here as text, hence <chr>), like in
>> # A tibble: 12 × 2
>> year lifeExp
>> <chr> <chr>
>> 1 1952 28.801
>> 2 1957 30.332
>> 3 1962 31.997
>> 4 1967 34.02
>> 5 1972 36.088
>> 6 1977 38.438
>> 7 1952 68.75
>> 8 1957 69.96
>> 9 1962 71.3
>> 10 1967 72.13
>> 11 1972 72.88
>> 12 1977 74.21
then the first part of your minimal example might look like this:
tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
>> # A tibble: 4 × 2
>> x y
>> <dbl> <dbl>
>> 1 1 3
>> 2 2 4
>> 3 1 2
>> 4 2 2
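And if what you’re trying to achieve is the mean of lifeExp for each year (the attempt below returns NA with warnings, because lifeExp is stored as text rather than as numbers)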
>> Warning in mean.default(lifeExp): argument is not numeric or logical:
>> returning NA
>> Warning in mean.default(lifeExp): argument is not numeric or logical:
>> returning NA
>> Warning in mean.default(lifeExp): argument is not numeric or logical:
>> returning NA
>> Warning in mean.default(lifeExp): argument is not numeric or logical:
>> returning NA
>> Warning in mean.default(lifeExp): argument is not numeric or logical:
>> returning NA
>> Warning in mean.default(lifeExp): argument is not numeric or logical:
>> returning NA
>> # A tibble: 6 × 2
>> year mean_lifeExp
>> <chr> <dbl>
>> 1 1952 NA
>> 2 1957 NA
>> 3 1962 NA
>> 4 1967 NA
>> 5 1972 NA
>> 6 1977 NA
then the second part of your minimal example might look like this:
tibble(x = c(1, 2), mean_y = c(2.5, 3))
>> # A tibble: 2 × 2
>> x mean_y
>> <dbl> <dbl>
>> 1 1 2.5
>> 2 2 3
To summarize, your entire question on Stack Overflow could look like this:
How can I transform the first tibble into the second tibble with a function?
library(tibble)
tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
tibble(x = c(1, 2), mean_y = c(2.5, 3))
To make your question even better, you can format your code by using the reprex function from the reprex package. The curly brackets are needed to tell reprex that you have multiple lines of code.
library(reprex)
reprex({
  library(tibble)
  tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
  tibble(x = c(1, 2), mean_y = c(2.5, 3))
})
>> ℹ Non-interactive session, setting `html_preview = FALSE`.
>> ℹ Rendering reprex...
>> ✔ Reprex output is on the clipboard.
Finally, your question looks friendly:
How can I transform the first tibble into the second tibble with a function?
library(tibble)
tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
#> # A tibble: 4 × 2
#> x y
#> <dbl> <dbl>
#> 1 1 3
#> 2 2 4
#> 3 1 2
#> 4 2 2
tibble(x = c(1, 2), mean_y = c(2.5, 3))
#> # A tibble: 2 × 2
#> x mean_y
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3
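One possible answer to the question above, sketched under the assumption that the dplyr package is installed (the chapter has not named a function for this task): group the rows by x, then average y within each group. The second half applies the same pattern to a hypothetical tibble named life that copies four of the lifeExp values shown earlier; converting the text column with as.numeric first avoids the "argument is not numeric or logical" warnings.

```r
library(tibble)
library(dplyr)

# The input tibble from the minimal example.
input <- tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))

# Group by x, then average y within each group.
answer <- input %>%
  group_by(x) %>%
  summarise(mean_y = mean(y))
answer

# Hypothetical tibble with the values stored as text, as in df.
life <- tibble(
  year    = c("1952", "1957", "1952", "1957"),
  lifeExp = c("28.801", "30.332", "68.75", "69.96")
)

# Convert lifeExp to numbers before averaging; mean() on text
# returns NA with a warning.
life_means <- life %>%
  mutate(lifeExp = as.numeric(lifeExp)) %>%
  group_by(year) %>%
  summarise(mean_lifeExp = mean(lifeExp))
life_means
```

Working out the answer on the small x and y example first makes it easier to confirm by hand that the function does what you expect before applying it to the real columns.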
4.3 Finding Function Documentation for Understanding Functions
Once you’ve found a function (or usually, a set of functions) recommended to you by Google’s search results, or by R users on Stack Overflow, it would be wise to understand how the function(s) work; specifically, the inputs and outputs.
Both Google and Stack Overflow can be overwhelming. Google gives a variety of websites. Which do you choose? A question on Stack Overflow can receive multiple answers, with each using different approaches and functions. Again, which do you choose?
Let’s start with Google then.
4.3.1 Google
Remember, after searching “r capitalize first letter” we saw the following paragraph as the first result:
Convert First letter of every word to Uppercase in R Programming – str_to_title() Function. str_to_title() Function in R Language is used to convert the first letter of every word of a string to Uppercase and the rest of the letters are converted to lower case.
This paragraph is from a website called GeeksforGeeks. I would not recommend using this website. That is, after searching “r capitalize first letter” and seeing the above paragraph, I would not recommend visiting the website to understand the function, for multiple reasons.
- You are not familiar with the format of the website.
- You will find yourself on multiple websites when you need to discover and learn about multiple functions.
- You will then have to navigate the formats of these websites.
- Many things can get in the way of reading the instructions, like pop-ups to sign up for the website’s email list, advertisements for completely unrelated products (everything you need to learn R is FREE), and recommended articles to distract you.
It is more effective to use a single, standardized resource when learning about functions. Thankfully, R has a few.
After reading the above paragraph and learning that the function we need may be str_to_title(), we can now Google search “r str_to_title” instead of “r capitalize first letter”. Again, Google shows multiple websites, but we are looking for one that is standardized. tidyverse.org is one of those websites, so we click the result that has “tidyverse.org” in the website address. This brings us to this page: https://stringr.tidyverse.org/reference/case.html
As standard, the webpage describing a function has multiple sections: Usage, Arguments and Examples. Usage shows the format of the inputs to the function. Any input with an = beside it has a default value. A default value usually indicates that most users will not need to change the value.
The Usage str_to_title(string, locale = "en") tells us that
- string should be an object containing some string(s) or a string itself. It has no default value; we must provide one.
- locale has the default value "en".
The Arguments tell us more about the inputs in case the Usage is not enough. When first learning R, Arguments can be overwhelming; you might quickly find yourself not understanding the words contained therein, and having to continuously look up definitions (or more function documentation) in order to understand.
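The same Usage and Arguments information can be read from within R. A minimal sketch, assuming the stringr package is installed:

```r
library(stringr)

# Open the documentation page inside R (commented out because it
# opens a help window rather than printing a value).
# ?str_to_title

# List the arguments of the function and their default values.
formals(str_to_title)

# string has no default, so we supply it; locale falls back to "en".
str_to_title("afghanistan")

# Naming both arguments explicitly gives the same result.
str_to_title(string = "afghanistan", locale = "en")
```

Trying a function on a single short string like this is a quick way to confirm your reading of the Usage section before applying it to a whole column.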
4.3.2 Stack Overflow
Another way of understanding functions is to be presented with answers from others on Stack Overflow. These answers don’t need to be answers to the questions you have posted on Stack Overflow; they can be answers to questions posted by others.
For example, here is a question from 2019: https://stackoverflow.com/questions/58996293/transforming-a-dataframe-by-multiplying-a-columns-elements-by-the-names-of-th
There are three separate answers that have upvotes (positive feedback, shown as the number at the top left of an answer): one using the data.table package; one using base R (R without packages); and one using tidyr.
Notice how the answer using tidyr is far simpler; it is one line of code. The word tidy keeps popping up, and for good reason: the functions in this package, and more broadly in the tidyverse (the tidy universe), are designed to make coding short and simple.
It is possible to add comments to the answers on Stack Overflow, with further questions about the functions if there is something you don’t understand. Fortunately the tidyverse functions are well documented because of their standardized webpages, and because of multiple, free books on using them for specific tasks.
4.3.3 Books
There are many books describing tidyverse functions. Finding a good book is a strong alternative to using Google or Stack Overflow toward understanding functions. A book can hold your hand throughout each step and provide a narrative. It can also be designed toward a specific task, just as this one is designed toward getting you started as quickly and comfortably as possible.