Chapter 3 Overview of a Dataframe
Datasets in R are usually called dataframes or tibbles. The distinction between these names is not important for our purposes - we will usually refer to a dataset as a dataframe.
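As a quick check, the class of an object shows whether it is a plain dataframe or a tibble. The sketch below assumes the gapminder package is installed. A tibble carries the extra classes tbl_df and tbl in addition to data.frame.

```r
library(gapminder)

# A tibble is still a dataframe, so both checks return TRUE
is.data.frame(gapminder)
inherits(gapminder, "tbl_df")

# The full class vector lists the tibble classes alongside data.frame
class(gapminder)
```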
3.1 glimpse
Let’s look at what is inside the gapminder dataset using the glimpse command from the dplyr package. The dplyr package is part of the “tidyverse” package that was loaded previously, so glimpse(gapminder) on its own would also execute without errors. We use the dplyr:: prefix to inform readers that the glimpse function resides in the dplyr package.
# the next command would also execute if
# dplyr or tidyverse was previously loaded with library(dplyr)
#glimpse(gapminder)
dplyr::glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afgh…
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia,…
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,…
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 4…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1288181…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.0…
This shows it contains economic and demographic information about different countries across years. There are 1704 rows (observations) and 6 columns (variables).
Each variable name is listed along with a variable type designation.
- fct: means a factor variable, also known as a categorical variable.
- int: means a quantitative variable that takes only integer or whole number values.
- dbl: means double precision, a quantitative variable that is essentially continuous - taking decimal values.
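The type designations above can also be checked directly in base R. This is a small sketch, assuming the gapminder package is installed; the columns chosen for the individual checks are just examples.

```r
library(gapminder)

# class() reports the type of each column
sapply(gapminder, class)

# Individual checks corresponding to fct, int, and dbl
is.factor(gapminder$country)
is.integer(gapminder$year)
is.double(gapminder$lifeExp)
```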
3.2 head
By default, the head command shows the first 6 rows of the gapminder dataset. The output labels the gapminder dataframe as a “tibble”, which is a type of dataframe.
The n argument to the head command changes the number of rows displayed; the second output below shows the first 4 rows.
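The commands that produced the two outputs below are not shown. They are consistent with the following sketch, which assumes the gapminder package is installed.

```r
library(gapminder)

# Default: first 6 rows
head(gapminder)

# n sets the number of rows returned, here the first 4
head(gapminder, n = 4)
```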
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## # A tibble: 4 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
3.3 tail
By default, the tail command shows the last 6 rows of the gapminder dataset.
The n argument to the tail command changes the number of rows displayed; the second output below shows the last 4 rows.
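As with head, the original commands are not shown. The outputs below are consistent with this sketch, which assumes the gapminder package is installed.

```r
library(gapminder)

# Default: last 6 rows
tail(gapminder)

# n sets the number of rows returned, here the last 4
tail(gapminder, n = 4)
```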
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Zimbabwe Africa 1982 60.4 7636524 789.
## 2 Zimbabwe Africa 1987 62.4 9216418 706.
## 3 Zimbabwe Africa 1992 60.4 10704340 693.
## 4 Zimbabwe Africa 1997 46.8 11404948 792.
## 5 Zimbabwe Africa 2002 40.0 11926563 672.
## 6 Zimbabwe Africa 2007 43.5 12311143 470.
## # A tibble: 4 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Zimbabwe Africa 1992 60.4 10704340 693.
## 2 Zimbabwe Africa 1997 46.8 11404948 792.
## 3 Zimbabwe Africa 2002 40.0 11926563 672.
## 4 Zimbabwe Africa 2007 43.5 12311143 470.
3.4 summary
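The summary command shows a basic summary of the values in each variable: counts per level for factor variables, and the minimum, quartiles, mean, and maximum for numeric variables.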
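The output below is consistent with a direct call to summary on the dataframe. This is a sketch, assuming the gapminder package is installed.

```r
library(gapminder)

# Per-variable summary of the whole dataframe
summary(gapminder)
```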
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
The next command illustrates a “pipe”: the dataframe gapminder is “piped” into the summary function to be processed. The output is the same as that produced by summary(gapminder). The pipe operator %>% comes from the magrittr package and becomes available when tidyverse is loaded.
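A minimal sketch of the piped form, assuming the gapminder and dplyr packages are installed. Loading dplyr is used here only as one way to make %>% available; loading tidyverse would work as well.

```r
library(gapminder)
library(dplyr)   # makes the %>% pipe available

# gapminder becomes the first argument of summary()
gapminder %>%
  summary()
```

The left-hand side of %>% is passed as the first argument of the function on the right, so this is equivalent to summary(gapminder).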
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
3.5 Dataframe Details: funModeling package
The funModeling package contains the df_status command, which also summarizes a dataframe. For each variable it reports the quantity (q_) and percentage (p_) of zero values, missing values (NA), and infinite values, along with the variable type and the number of unique values.
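A minimal sketch of the call, assuming the funModeling and gapminder packages are installed. The first table below matches what df_status prints. The list-style output after it (elements such as $vars_num_with_NA and $MAX_UNIQUE) looks like the printed result of funModeling's data_integrity function. That second call is an assumption, since the original command is not shown.

```r
library(gapminder)

# One row per variable: zeros, NAs, infinite values, type, and unique count
funModeling::df_status(gapminder)

# Assumption: the list output was produced by printing a data_integrity() result
integrity <- funModeling::data_integrity(gapminder)
print(integrity)
```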
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 country 0 0 0 0 0 0 factor 142
## 2 continent 0 0 0 0 0 0 factor 5
## 3 year 0 0 0 0 0 0 integer 12
## 4 lifeExp 0 0 0 0 0 0 numeric 1626
## 5 pop 0 0 0 0 0 0 integer 1704
## 6 gdpPercap 0 0 0 0 0 0 numeric 1704
## $vars_num_with_NA
## [1] variable q_na p_na
## <0 rows> (or 0-length row.names)
##
## $vars_cat_with_NA
## [1] variable q_na p_na
## <0 rows> (or 0-length row.names)
##
## $vars_cat_high_card
## variable unique
## country country 142
##
## $MAX_UNIQUE
## [1] 35
##
## $vars_one_value
## character(0)
##
## $vars_cat
## [1] "country" "continent"
##
## $vars_num
## [1] "year" "lifeExp" "pop" "gdpPercap"
##
## $vars_char
## character(0)
##
## $vars_factor
## [1] "country" "continent"
##
## $vars_other
## character(0)
3.6 Dataframe Details: skimr package
The skimr package contains many useful functions for summarizing a dataframe. When we supply a dataframe to the skim_without_charts function, dataframe details are separated by variable types.
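The “Piped data” label in the output below suggests the dataframe was piped into the function. The following sketch assumes that form, and that the skimr, dplyr, and gapminder packages are installed.

```r
library(gapminder)
library(dplyr)   # makes the %>% pipe available

# Summary grouped by variable type, without the inline histograms
gapminder %>%
  skimr::skim_without_charts()
```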
## ── Data Summary ────────────────────────
## Values
## Name Piped data
## Number of rows 1704
## Number of columns 6
## _______________________
## Column type frequency:
## factor 2
## numeric 4
## ________________________
## Group variables None
##
## ── Variable type: factor ────────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate ordered n_unique
## 1 country 0 1 FALSE 142
## 2 continent 0 1 FALSE 5
## top_counts
## 1 Afg: 12, Alb: 12, Alg: 12, Ang: 12
## 2 Afr: 624, Asi: 396, Eur: 360, Ame: 300
##
## ── Variable type: numeric ───────────────────────────────────────────────────────────
## skim_variable n_missing complete_rate mean sd p0 p25
## 1 year 0 1 1980. 17.3 1952 1966.
## 2 lifeExp 0 1 59.5 12.9 23.6 48.2
## 3 pop 0 1 29601212. 106157897. 60011 2793664
## 4 gdpPercap 0 1 7215. 9857. 241. 1202.
## p50 p75 p100
## 1 1980. 1993. 2007
## 2 60.7 70.8 82.6
## 3 7023596. 19585222. 1318683096
## 4 3532. 9325. 113523.
3.7 describe: Hmisc package
The Hmisc package contains the describe function that gives a helpful overview of numeric and categorical variables.
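The “.” on the first line of the output below suggests the dataframe was piped into describe. The following sketch assumes that form, and that the Hmisc, dplyr, and gapminder packages are installed. The Hmisc:: prefix avoids confusion with describe functions in other packages.

```r
library(gapminder)
library(dplyr)   # makes the %>% pipe available

# Per-variable overview: counts, distinct values, quantiles, extremes
gapminder %>%
  Hmisc::describe()
```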
## .
##
## 6 Variables 1704 Observations
## -------------------------------------------------------------------------------------
## country
## n missing distinct
## 1704 0 142
##
## lowest : Afghanistan Albania Algeria Angola Argentina
## highest: Vietnam West Bank and Gaza Yemen, Rep. Zambia Zimbabwe
## -------------------------------------------------------------------------------------
## continent
## n missing distinct
## 1704 0 5
##
## lowest : Africa Americas Asia Europe Oceania
## highest: Africa Americas Asia Europe Oceania
##
## Value Africa Americas Asia Europe Oceania
## Frequency 624 300 396 360 24
## Proportion 0.366 0.176 0.232 0.211 0.014
## -------------------------------------------------------------------------------------
## year
## n missing distinct Info Mean Gmd .05 .10 .25
## 1704 0 12 0.993 1980 19.87 1952 1957 1966
## .50 .75 .90 .95
## 1980 1993 2002 2007
##
## lowest : 1952 1957 1962 1967 1972, highest: 1987 1992 1997 2002 2007
##
## Value 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## Frequency 142 142 142 142 142 142 142 142 142 142 142 142
## Proportion 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
## -------------------------------------------------------------------------------------
## lifeExp
## n missing distinct Info Mean Gmd .05 .10 .25
## 1704 0 1626 1 59.47 14.82 38.49 41.51 48.20
## .50 .75 .90 .95
## 60.71 70.85 75.10 77.44
##
## lowest : 23.599 28.801 30.000 30.015 30.331, highest: 81.701 81.757 82.000 82.208 82.603
## -------------------------------------------------------------------------------------
## pop
## n missing distinct Info Mean Gmd .05 .10 .25
## 1704 0 1704 1 29601212 46384459 475459 946367 2793664
## .50 .75 .90 .95
## 7023596 19585222 54801370 89822054
##
## lowest : 60011 61325 63149 65345 70787
## highest: 1110396331 1164970000 1230075000 1280400000 1318683096
## -------------------------------------------------------------------------------------
## gdpPercap
## n missing distinct Info Mean Gmd .05 .10 .25
## 1704 0 1704 1 7215 8573 548.0 687.7 1202.1
## .50 .75 .90 .95
## 3531.8 9325.5 19449.1 26608.3
##
## lowest : 241.1659 277.5519 298.8462 299.8503 312.1884
## highest: 80894.8833 95458.1118 108382.3529 109347.8670 113523.1329
## -------------------------------------------------------------------------------------