Chapter 3 Overview of a Dataframe

Datasets in R are usually called dataframes or tibbles. The distinction between these names is not important for our purposes, so we will usually refer to a dataset as a dataframe.

3.1 glimpse

Let’s look at what is inside the gapminder dataset using the glimpse command from the dplyr package. The dplyr package is part of the “tidyverse” package that was loaded previously, so glimpse(gapminder) on its own would also execute without any errors. We use the dplyr:: prefix to show readers that the glimpse function resides in the dplyr package.

# the next (commented-out) command would also execute if dplyr or tidyverse
# was previously loaded with library(dplyr) or library(tidyverse)
# glimpse(gapminder)
dplyr::glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 4…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1288181…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.0…

The output shows that gapminder contains economic and demographic information about different countries across years. There are 1,704 rows (observations) and 6 columns (variables).

Each variable name is listed along with a variable type designation.

  • fct: means a factor variable, also known as a categorical variable.
  • int: means a quantitative variable that takes only integer or whole number values.
  • dbl: means double precision, a quantitative variable that is essentially continuous and can take decimal values.
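
The type designations above correspond to what base R reports for each column. The following is a minimal sketch, assuming a small made-up dataframe named toy (hypothetical, not part of gapminder) that mimics the three column types:

```r
# Hypothetical toy dataframe mimicking gapminder's column types
toy <- data.frame(
  country = factor(c("A", "A", "B")),  # fct: categorical
  year    = c(1952L, 1957L, 1952L),    # int: whole numbers (note the L suffix)
  lifeExp = c(28.8, 30.3, 55.2)        # dbl: decimal values
)

# Number of rows, then number of columns
dim(toy)

# Class of each column, as a named character vector
col_classes <- sapply(toy, class)
col_classes
```

Note that base R labels a dbl column as "numeric", while glimpse uses the shorter tibble-style abbreviations.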

3.3 tail

By default, the tail command will show the last 6 rows of the dataset gapminder.

Options to the tail command can change the rows displayed. For example, the n argument sets how many rows are shown.

# default is to show 6 rows
tail(gapminder)
## # A tibble: 6 x 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1982    60.4  7636524      789.
## 2 Zimbabwe Africa     1987    62.4  9216418      706.
## 3 Zimbabwe Africa     1992    60.4 10704340      693.
## 4 Zimbabwe Africa     1997    46.8 11404948      792.
## 5 Zimbabwe Africa     2002    40.0 11926563      672.
## 6 Zimbabwe Africa     2007    43.5 12311143      470.
# show only 4 rows...
tail(gapminder, n = 4)
## # A tibble: 4 x 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1992    60.4 10704340      693.
## 2 Zimbabwe Africa     1997    46.8 11404948      792.
## 3 Zimbabwe Africa     2002    40.0 11926563      672.
## 4 Zimbabwe Africa     2007    43.5 12311143      470.
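
As a further sketch of how the n argument behaves, the example below uses a made-up 10-row dataframe named toy (hypothetical, used only for illustration). A negative n asks tail for everything except the first rows, and head is the mirror image of tail.

```r
# Hypothetical 10-row dataframe used only for illustration
toy <- data.frame(id = 1:10, value = (1:10)^2)

last_three  <- tail(toy, n = 3)   # rows 8, 9, 10
all_but_two <- tail(toy, n = -2)  # drops the first 2 rows, keeps rows 3 to 10
first_three <- head(toy, n = 3)   # the mirror image: rows 1, 2, 3

last_three
all_but_two
first_three
```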

3.4 summary

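The summary command shows a basic summary of the values in each variable: counts per level for factor variables, and the minimum, quartiles, median, mean, and maximum for numeric variables.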

# A basic, base R command
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

The next command illustrates a “pipe”: the dataframe gapminder is “piped” into the summary function to be processed. Note that the same output is produced as with summary(gapminder). The pipe operator %>% comes from the magrittr package, a tidyverse package that becomes available when tidyverse is loaded.

# Same idea, but using tidyverse pipe
gapminder %>% summary()
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
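
The equivalence between the piped and the direct call can be checked in code. The sketch below is self-contained, so it makes two substitutions that are assumptions rather than part of the text above: it uses a made-up dataframe named toy, and it uses the base R native pipe |> (available in R 4.1 and later) in place of the magrittr pipe %>%. For a simple call like this one the two pipes behave the same way.

```r
# Hypothetical toy dataframe
toy <- data.frame(x = c(1, 2, 3, 4), g = factor(c("a", "a", "b", "b")))

direct <- summary(toy)      # ordinary function call
piped  <- toy |> summary()  # native pipe: the same call written left to right

# TRUE when both forms return exactly the same object
identical(direct, piped)
```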

3.5 Dataframe Details: funModeling package

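The funModeling package contains the df_status command, which also summarizes a dataframe. For each variable it reports the quantity (q_) and percentage (p_) of zeros, missing values (NA), and infinite values, along with the variable type and the number of unique values. The data_integrity command from the same package returns a more detailed summary as a list.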

funModeling::df_status(gapminder)
##    variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1   country       0       0    0    0     0     0  factor    142
## 2 continent       0       0    0    0     0     0  factor      5
## 3      year       0       0    0    0     0     0 integer     12
## 4   lifeExp       0       0    0    0     0     0 numeric   1626
## 5       pop       0       0    0    0     0     0 integer   1704
## 6 gdpPercap       0       0    0    0     0     0 numeric   1704
# returns a detailed summary of all variables
di <- funModeling::data_integrity(gapminder)
print(di)
## $vars_num_with_NA
## [1] variable q_na     p_na    
## <0 rows> (or 0-length row.names)
## 
## $vars_cat_with_NA
## [1] variable q_na     p_na    
## <0 rows> (or 0-length row.names)
## 
## $vars_cat_high_card
##         variable unique
## country  country    142
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## character(0)
## 
## $vars_cat
## [1] "country"   "continent"
## 
## $vars_num
## [1] "year"      "lifeExp"   "pop"       "gdpPercap"
## 
## $vars_char
## character(0)
## 
## $vars_factor
## [1] "country"   "continent"
## 
## $vars_other
## character(0)
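
To make the df_status columns more concrete, here is a rough base R sketch that computes similar counts by hand. It assumes a made-up dataframe named toy (hypothetical), and the definitions used here, in particular counting unique values after dropping NA, are this sketch's own choices and may not match funModeling exactly.

```r
# Hypothetical toy dataframe with one zero and one missing value
toy <- data.frame(
  a = c(0, 1, 2, NA),
  b = factor(c("x", "y", "x", "y"))
)

# One row per variable, loosely mirroring some df_status columns
status <- data.frame(
  variable  = names(toy),
  q_zeros   = sapply(toy, function(col) sum(col == 0, na.rm = TRUE)),
  q_na      = sapply(toy, function(col) sum(is.na(col))),
  unique    = sapply(toy, function(col) length(unique(na.omit(col)))),
  row.names = NULL
)
status
```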

3.6 Dataframe Details: skimr package

The skimr package contains many useful functions for summarizing a dataframe. When we supply a dataframe to the skim_without_charts function, dataframe details are separated by variable types.

gapminder %>% 
  skimr::skim_without_charts() 
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             1704      
## Number of columns          6         
## _______________________              
## Column type frequency:               
##   factor                   2         
##   numeric                  4         
## ________________________             
## Group variables            None      
## 
## ── Variable type: factor ────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique
## 1 country               0             1 FALSE        142
## 2 continent             0             1 FALSE          5
##   top_counts                            
## 1 Afg: 12, Alb: 12, Alg: 12, Ang: 12    
## 2 Afr: 624, Asi: 396, Eur: 360, Ame: 300
## 
## ── Variable type: numeric ───────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate       mean          sd      p0       p25
## 1 year                  0             1     1980.         17.3  1952      1966. 
## 2 lifeExp               0             1       59.5        12.9    23.6      48.2
## 3 pop                   0             1 29601212.  106157897.  60011   2793664  
## 4 gdpPercap             0             1     7215.       9857.    241.     1202. 
##         p50        p75         p100
## 1    1980.      1993.        2007  
## 2      60.7       70.8         82.6
## 3 7023596.  19585222.  1318683096  
## 4    3532.      9325.      113523.
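
The idea of separating details by variable type can be sketched in base R. The example below assumes a made-up dataframe named toy (hypothetical); it is not how skimr is implemented, only an illustration of what the p0 to p100 and top_counts columns represent.

```r
# Hypothetical toy dataframe with one factor and one numeric column
toy <- data.frame(
  g = factor(c("a", "b", "a", "b", "a")),
  x = c(1, 2, 3, 4, 5)
)

# Split the column names by type
is_num       <- sapply(toy, is.numeric)
numeric_cols <- names(toy)[is_num]
factor_cols  <- names(toy)[!is_num]

# p0, p25, p50, p75, p100 correspond to these quantiles
x_quantiles <- quantile(toy$x, probs = c(0, 0.25, 0.5, 0.75, 1))

# top_counts corresponds to level frequencies, most frequent first
top_counts <- sort(table(toy$g), decreasing = TRUE)

x_quantiles
top_counts
```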

3.7 describe: Hmisc package

The Hmisc package contains the describe function that gives a helpful overview of numeric and categorical variables.

gapminder %>% 
  Hmisc::describe() 
## . 
## 
##  6  Variables      1704  Observations
## -------------------------------------------------------------------------------------
## country 
##        n  missing distinct 
##     1704        0      142 
## 
## lowest : Afghanistan        Albania            Algeria            Angola             Argentina         
## highest: Vietnam            West Bank and Gaza Yemen, Rep.        Zambia             Zimbabwe          
## -------------------------------------------------------------------------------------
## continent 
##        n  missing distinct 
##     1704        0        5 
## 
## lowest : Africa   Americas Asia     Europe   Oceania 
## highest: Africa   Americas Asia     Europe   Oceania 
##                                                        
## Value        Africa Americas     Asia   Europe  Oceania
## Frequency       624      300      396      360       24
## Proportion    0.366    0.176    0.232    0.211    0.014
## -------------------------------------------------------------------------------------
## year 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
##     1704        0       12    0.993     1980    19.87     1952     1957     1966 
##      .50      .75      .90      .95 
##     1980     1993     2002     2007 
## 
## lowest : 1952 1957 1962 1967 1972, highest: 1987 1992 1997 2002 2007
##                                                                                   
## Value       1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007
## Frequency    142   142   142   142   142   142   142   142   142   142   142   142
## Proportion 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
## -------------------------------------------------------------------------------------
## lifeExp 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
##     1704        0     1626        1    59.47    14.82    38.49    41.51    48.20 
##      .50      .75      .90      .95 
##    60.71    70.85    75.10    77.44 
## 
## lowest : 23.599 28.801 30.000 30.015 30.331, highest: 81.701 81.757 82.000 82.208 82.603
## -------------------------------------------------------------------------------------
## pop 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
##     1704        0     1704        1 29601212 46384459   475459   946367  2793664 
##      .50      .75      .90      .95 
##  7023596 19585222 54801370 89822054 
## 
## lowest :      60011      61325      63149      65345      70787
## highest: 1110396331 1164970000 1230075000 1280400000 1318683096
## -------------------------------------------------------------------------------------
## gdpPercap 
##        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
##     1704        0     1704        1     7215     8573    548.0    687.7   1202.1 
##      .50      .75      .90      .95 
##   3531.8   9325.5  19449.1  26608.3 
## 
## lowest :    241.1659    277.5519    298.8462    299.8503    312.1884
## highest:  80894.8833  95458.1118 108382.3529 109347.8670 113523.1329
## -------------------------------------------------------------------------------------
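
One column in the describe output that may be unfamiliar is Gmd, Gini's mean difference: the average absolute difference between all pairs of distinct observations. The sketch below defines a helper named gmd (a hypothetical name, written here for illustration and not taken from Hmisc) and applies it to a year vector rebuilt from the frequency table in the output above, where each of the 12 years appears 142 times.

```r
# Gini's mean difference: average |x_i - x_j| over all pairs with i != j
# (gmd is a hypothetical helper written for this illustration)
gmd <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

# Small check by hand: pairwise differences are 1, 3 and 2, so the mean is 2
gmd(c(1, 2, 4))

# Rebuild the year column: 12 distinct years, 142 rows each
year <- rep(seq(1952, 2007, by = 5), each = 142)
round(gmd(year), 2)
```

The rounded value for year can be compared with the Gmd entry that describe reports for that variable.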