Chapter 4 Opening the Data

We can open our containers, or “nests”, of data using the unnest function from the tidyr package.

library(tidyr)
df <- unnest(df, data)
df
>> # A tibble: 24 × 9
>>    source    country    date  file_…¹ conti…² year  lifeExp pop   gdpPe…³
>>    <chr>     <chr>      <chr> <chr>   <chr>   <chr> <chr>   <chr> <chr>  
>>  1 gapminder afghanist… 2022… csv     Asia    1952  28.801  8425… 779.44…
>>  2 gapminder afghanist… 2022… csv     Asia    1957  30.332  9240… 820.85…
>>  3 gapminder afghanist… 2022… csv     Asia    1962  31.997  1026… 853.10…
>>  4 gapminder afghanist… 2022… csv     Asia    1967  34.02   1153… 836.19…
>>  5 gapminder afghanist… 2022… csv     Asia    1972  36.088  1307… 739.98…
>>  6 gapminder afghanist… 2022… csv     Asia    1977  38.438  1488… 786.11…
>>  7 gapminder afghanist… 2022… csv     Asia    1982  39.854  1288… 978.01…
>>  8 gapminder afghanist… 2022… csv     Asia    1987  40.822  1386… 852.39…
>>  9 gapminder afghanist… 2022… csv     Asia    1992  41.674  1631… 649.34…
>> 10 gapminder afghanist… 2022… csv     Asia    1997  41.763  2222… 635.34…
>> # … with 14 more rows, and abbreviated variable names ¹​file_type,
>> #   ²​continent, ³​gdpPercap

This data looks clean, but looks can be deceiving. Remember that we converted the df object with as_tibble? A tibble is not only a table in R, but a cleaner kind of table. By default it prints only 10 rows, and as many columns as can comfortably be displayed. Information about the table's size surrounds the data: the header (top line) # A tibble: 24 × 9 tells you how many rows (24) and columns (9) there are. The footer (bottom lines) # … with 14 more rows, and abbreviated variable names ¹file_type, ²continent, ³gdpPercap tells you what is missing from the display: here, 14 rows and the full names of three columns.
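When the default display hides rows or shortens column names, a few functions can show the rest. The sketch below is illustrative only: df_demo is a hypothetical stand-in for df, built here so the example runs on its own.

```r
library(tibble)

# Hypothetical stand-in for df: 24 rows, 2 columns.
df_demo <- tibble(id = 1:24, value = as.character(1:24))

# Number of rows, then number of columns (the same facts as the header line).
dim(df_demo)

# One line per column, so no column name is abbreviated.
glimpse(df_demo)

# Print every row and every column instead of the default 10 rows.
print(df_demo, n = Inf, width = Inf)
```

These calls work the same way on small and large tibbles, although printing every row is only practical when the data is small.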

Even though this data is small (24 rows by 9 columns can be evaluated using our eyes), it is always best to practice techniques that are generalizable to both small and large data.

4.1 Checking Data

The simplest way to start evaluating data is to check that the values under each column meet expectations. Since we created the first 4 columns using our file names, we can be sure that these 4 columns are clean.

For the next column, continent, we expect values to be continents, capitalized, and spelled correctly. Instead of reading each line with our eyes, we can read each line with our computers. Or a combination of the two. That is exactly how we will start. We will check all unique (i.e. distinct) values under the continent column using distinct. It produces a tibble with only the distinct rows for the column(s) you choose.

distinct(df, continent)
>> # A tibble: 2 × 1
>>   continent
>>   <chr>    
>> 1 Asia     
>> 2 Americas
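With only two distinct values, reading them by eye is enough. For a column with many distinct values, one possible approach is to let the computer compare them against a list of values you expect. The sketch below assumes a hypothetical df_demo with one deliberately misspelled value, and an expected list written by hand for this example.

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for df, with one deliberately uncapitalized value.
df_demo <- tibble(continent = c("Asia", "Asia", "Americas", "americas"))

# Values we are prepared to accept (an assumption made for this sketch).
expected <- c("Africa", "Americas", "Asia", "Europe", "Oceania")

# Keep only the distinct values that are NOT in the expected list.
unexpected <- df_demo %>%
  distinct(continent) %>%
  filter(!continent %in% expected)

unexpected
```

An empty result would mean every value in the column met expectations.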

These unique values are perfect, which means every value in the column is perfect, since the unique values represent all of them. But what if they were not capitalized, for example? That is in fact the case with our country column: the values are not capitalized because they came from our file names, and it is good practice not to capitalize file names. To capitalize the values in a column, you can use a function called str_to_title from the stringr package.

str_to_title(df$country)
>>  [1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
>>  [5] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
>>  [9] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
>> [13] "Canada"      "Canada"      "Canada"      "Canada"     
>> [17] "Canada"      "Canada"      "Canada"      "Canada"     
>> [21] "Canada"      "Canada"      "Canada"      "Canada"
df
>> # A tibble: 24 × 9
>>    source    country    date  file_…¹ conti…² year  lifeExp pop   gdpPe…³
>>    <chr>     <chr>      <chr> <chr>   <chr>   <chr> <chr>   <chr> <chr>  
>>  1 gapminder afghanist… 2022… csv     Asia    1952  28.801  8425… 779.44…
>>  2 gapminder afghanist… 2022… csv     Asia    1957  30.332  9240… 820.85…
>>  3 gapminder afghanist… 2022… csv     Asia    1962  31.997  1026… 853.10…
>>  4 gapminder afghanist… 2022… csv     Asia    1967  34.02   1153… 836.19…
>>  5 gapminder afghanist… 2022… csv     Asia    1972  36.088  1307… 739.98…
>>  6 gapminder afghanist… 2022… csv     Asia    1977  38.438  1488… 786.11…
>>  7 gapminder afghanist… 2022… csv     Asia    1982  39.854  1288… 978.01…
>>  8 gapminder afghanist… 2022… csv     Asia    1987  40.822  1386… 852.39…
>>  9 gapminder afghanist… 2022… csv     Asia    1992  41.674  1631… 649.34…
>> 10 gapminder afghanist… 2022… csv     Asia    1997  41.763  2222… 635.34…
>> # … with 14 more rows, and abbreviated variable names ¹​file_type,
>> #   ²​continent, ³​gdpPercap
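Notice that the second printout of df still shows country in lower case. str_to_title returned a new set of values, but nothing stored them back in df. A minimal sketch of keeping the change, using a hypothetical df_demo in place of df and mutate from dplyr as one of several ways to do it:

```r
library(tibble)
library(dplyr)
library(stringr)

# Hypothetical stand-in for df with an uncapitalized country column.
df_demo <- tibble(country = c("afghanistan", "canada"))

# On its own, this only prints the capitalized values; df_demo is unchanged.
str_to_title(df_demo$country)

# To keep the change, store the result back in the country column.
df_demo <- df_demo %>%
  mutate(country = str_to_title(country))

df_demo
```

The same idea applies to most R functions: they return a result and leave the original object alone unless you assign the result.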

4.2 Finding Functions

How would you know which function to use if I did not tell you?

4.2.1 Google

Google is a great search engine that most R programmers use when learning the R language. If we search “r capitalize first letter” we see, on 2022-10-26, the following paragraph as the first result:

Convert First letter of every word to Uppercase in R Programming – str_to_title() Function. str_to_title() Function in R Language is used to convert the first letter of every word of a string to Uppercase and the rest of the letters are converted to lower case.

The trick is to always start your Google search with r, followed by a question or the desired command, like r how to capitalize first letter or simply r capitalize first letter.

This is a simple example. Most of the time it can be difficult to write in English what you want. This will come with time and practice. At first you may find that the Google search results have nothing to do with what you need. That is a sign to re-word your search, or, if you’ve already re-worded your search, it may be a sign that there is no dedicated function for what you need, or that a different approach is needed. It’s rare that there will be no dedicated function so long as your goal is simple. You may find that it is effective to break down what you’re doing into simple steps, and then search for how to do those steps, as opposed to Googling something long and complicated, involving many steps.

4.2.2 Stack Overflow

Breaking down something complicated helps not only a search engine like Google to understand it, but other people too. For learning R, allowing others to understand your challenge or need is valuable, as the R community is not only willing, but also quickly able to help. R users mainly help each other through Stack Overflow. It is a website that easily allows users to ask or answer questions with code, have their code formatted (so that it looks nice), and receive feedback.

The main draw of Stack Overflow is that the person asking the question has one main responsibility, and that is to produce what is called a minimally reproducible example: an example that can be used (reproduced) by someone else seeing the question, and that does not have unnecessary detail irrelevant to the question (minimal).

Knowing how to make an example is the majority of the work involved in asking a question on Stack Overflow.

4.2.2.1 Creating Minimally Reproducible Examples

If your question involves data frames, you need to learn how to build a data frame before asking your question on Stack Overflow. To build a data frame, you can use the tibble function from package tibble.

If you have 2 columns of numbers (stored here as text, <chr>), like in

>> # A tibble: 12 × 2
>>    year  lifeExp
>>    <chr> <chr>  
>>  1 1952  28.801 
>>  2 1957  30.332 
>>  3 1962  31.997 
>>  4 1967  34.02  
>>  5 1972  36.088 
>>  6 1977  38.438 
>>  7 1952  68.75  
>>  8 1957  69.96  
>>  9 1962  71.3   
>> 10 1967  72.13  
>> 11 1972  72.88  
>> 12 1977  74.21

then the first part of your minimal example might look like this:

tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
>> # A tibble: 4 × 2
>>       x     y
>>   <dbl> <dbl>
>> 1     1     3
>> 2     2     4
>> 3     1     2
>> 4     2     2

And if what you’re trying to achieve is the mean of lifeExp for each year (with lifeExp converted to numbers first), like in

>> # A tibble: 6 × 2
>>   year  mean_lifeExp
>>   <chr>        <dbl>
>> 1 1952          48.8
>> 2 1957          50.1
>> 3 1962          51.6
>> 4 1967          53.1
>> 5 1972          54.5
>> 6 1977          56.3
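A per-year mean can only be calculated on numbers, and mean() returns NA with a warning when it is given text. The sketch below shows one way to convert and then average. The tibble life is a hypothetical, shortened stand-in for the year and lifeExp columns, stored as character just as they are printed above.

```r
library(tibble)
library(dplyr)

# Hypothetical, shortened stand-in for the year and lifeExp columns (<chr>).
life <- tibble(
  year    = c("1952", "1957", "1952", "1957"),
  lifeExp = c("28.801", "30.332", "68.75", "69.96")
)

# Convert lifeExp from text to numbers, then average it within each year.
life_means <- life %>%
  mutate(lifeExp = as.numeric(lifeExp)) %>%
  group_by(year) %>%
  summarise(mean_lifeExp = mean(lifeExp))

life_means
```

Leaving out the as.numeric step is what produces the "argument is not numeric or logical" warning.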

then the second part of your minimal example might look like this:

tibble(x = c(1, 2), mean_y = c(2.5, 3))
>> # A tibble: 2 × 2
>>       x mean_y
>>   <dbl>  <dbl>
>> 1     1    2.5
>> 2     2    3

To summarize, your entire question on Stack Overflow could look like this:

How can I transform the first tibble into the second tibble with a function?
library(tibble)
tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
tibble(x = c(1, 2), mean_y = c(2.5, 3))

To make your question even better, you can format your code by using the reprex function from the reprex package. The curly brackets are needed to tell reprex that you have multiple lines of code.

library(reprex)
reprex(
  {
    library(tibble)
    tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
    tibble(x = c(1, 2), mean_y = c(2.5, 3))
  }
)
>> ℹ Non-interactive session, setting `html_preview = FALSE`.
>> ℹ Rendering reprex...
>> ✔ Reprex output is on the clipboard.

Finally, your question looks friendly:

How can I transform the first tibble into the second tibble with a function?
library(tibble)
tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))
#> # A tibble: 4 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2     4
#> 3     1     2
#> 4     2     2
tibble(x = c(1, 2), mean_y = c(2.5, 3))
#> # A tibble: 2 × 2
#>       x mean_y
#>   <dbl>  <dbl>
#> 1     1    2.5
#> 2     2    3
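An answer to a question like this would typically group the rows by x and average y within each group. The sketch below is one possible answer using dplyr. The names first and second are chosen here for illustration, and other packages could do the same job.

```r
library(tibble)
library(dplyr)

# The first tibble from the question.
first <- tibble(x = c(1, 2, 1, 2), y = c(3, 4, 2, 2))

# Group the rows by x, then take the mean of y within each group.
second <- first %>%
  group_by(x) %>%
  summarise(mean_y = mean(y))

second
```

Because the example is so small, anyone reading the answer can check the result by hand: the rows with x equal to 1 have y values 3 and 2, and the rows with x equal to 2 have y values 4 and 2.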

4.3 Finding Function Documentation for Understanding Functions

Once you’ve found a function (or usually, a set of functions) recommended to you by Google’s search results, or by R users on Stack Overflow, it would be wise to understand how the function(s) work; specifically, the inputs and outputs.

Both Google and Stack Overflow can be overwhelming. Google gives a variety of websites. Which do you choose? A question on Stack Overflow can receive multiple answers, with each using different approaches and functions. Again, which do you choose?

Let’s start with Google then.

4.3.1 Google

Remember, after searching “r capitalize first letter” we saw the following paragraph as the first result:

Convert First letter of every word to Uppercase in R Programming – str_to_title() Function. str_to_title() Function in R Language is used to convert the first letter of every word of a string to Uppercase and the rest of the letters are converted to lower case.

This paragraph is from a website called GeeksforGeeks.

I would not recommend using this website. That is, after searching “r capitalize first letter” and seeing the above paragraph, I would not recommend visiting the website to understand the function, for multiple reasons.

  1. You are not familiar with the format of the website.
  2. You will find yourself on multiple websites when you need to discover and learn about multiple functions.
  3. You will then have to navigate the formats of these websites.
  4. Many things can get in the way of reading the instructions, like pop-ups to sign up for the website’s email list, advertisements for completely unrelated products (everything you need to learn R is FREE), and recommended articles to distract you.

It is more effective to use a single, standardized resource when learning about functions. Thankfully, R has a few.

After reading the above paragraph and learning that the function we need may be str_to_title(), we can now Google search “r str_to_title” instead of “r capitalize first letter”. Again, Google shows multiple websites, but we are looking for one that is standardized. tidyverse.org is one of those websites, so we click the result that has “tidyverse.org” in the website address. This brings us to this page: https://stringr.tidyverse.org/reference/case.html

As standard, there are multiple sections to the webpage describing a function: Usage, Arguments and Examples. Usage shows the format of the inputs to the function. Any input with an = beside it has a default value. A default value usually indicates that most users will not need to change the value.

The Usage str_to_title(string, locale = "en") tells us that

  1. string should be an object containing some string(s) or a string itself. It has no default value; we must provide one.
  2. locale has the default value "en".
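The two points above can be tried directly in R. The sketch below calls str_to_title once with only string supplied and once with the default locale written out, then uses the base R function formals to list the inputs and their default values. The example word is chosen here for illustration.

```r
library(stringr)

# Only string is supplied; locale falls back to its default value.
with_default <- str_to_title("afghanistan")

# Writing the default out explicitly gives the same result.
with_locale <- str_to_title("afghanistan", locale = "en")

with_default
with_locale

# formals() lists a function's inputs along with any default values.
formals(str_to_title)
```

Typing ?str_to_title in the R console after loading stringr opens the same documentation without leaving R.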

The Arguments tell us more about the inputs in case the Usage is not enough. When first learning R, Arguments can be overwhelming; you might quickly find yourself not understanding the words contained therein, and having to continuously look up definitions (or more function documentation) in order to understand.

4.3.2 Stack Overflow

Another way of understanding functions is to be presented with answers from others on Stack Overflow. These answers don’t need to be answers to the questions you have posted on Stack Overflow; they can be answers to questions posted by others.

For example, here is a question from 2019: https://stackoverflow.com/questions/58996293/transforming-a-dataframe-by-multiplying-a-columns-elements-by-the-names-of-th

There are three separate answers that have upvotes (positive feedback, shown as the number at the top left of an answer): one using the data.table package, one using base R (R without packages), and one using tidyr.

Notice how the answer using tidyr is far simpler; it is one line of code. The word tidy keeps popping up, and for good reason: the functions in this package, and more broadly in the tidyverse (the tidy universe), are designed to make coding short and simple.

It is possible to add comments to the answers on Stack Overflow, with further questions about the functions if there is something you don’t understand. Fortunately the tidyverse functions are well documented because of their standardized webpages, and because of multiple, free books on using them for specific tasks.

4.3.3 Books

There are many books describing tidyverse functions. Finding a good book is a strong alternative to using Google or Stack Overflow toward understanding functions. A book can hold your hand throughout each step and provide a narrative. It can also be designed toward a specific task, just as this one is designed toward getting you started as quickly and comfortably as possible.