1 Preparing Data

1.1 Achievements to unlock

Objectives for chapter 01

SwR Achievements

(~~Observations and variables~~)
Using reproducible research practices (Section 1.2)
(~~Understanding and changing data types~~)
Entering or loading data into R (Section 1.3)
Identifying and treating missing values (Data wrangling) (Section 1.4)
Building a basic bar chart (Replicate Figure 1.1 and 1.2) (Section 1.5)

I will skip the crossed out learning objectives in parenthesis as I know already these procedures. However I will modify and elaborate some of these achievements as mentioned in the parentheses.

I will add other objectives that resulted from questions that arose during reading the book.

Questions that resulted to additional objectives

How to download data directly from the General Social Survey (GSS) web page? {Section 1.7}
How to work with labelled data? {Section 1.8}
Reviewing different packages for creating tables (Section 1.9.1)
Introducing {gtsummary}, a packages for presentation-ready data summary and analytic result tables (see: Section A.37 Section 1.9.2)
Developing timeline graph (1973-2022) about opinions of respondents from GSS surveys about legalizing marijuana (Section 1.10)

Objectives 1.1: Achievements for chapter 01

1.2 Using reproducible research practices

1.2.1 Script files

SwR explains writing script files, but I am using Literate Programming with Quarto. This has the consequence that in addition to short comments inside code cells I have the possibility to write extensively in the other parts of the file about approach, code, results etc.

A practical advice for scripts is to include a prolog. Possible prolog sections:

Project name
Project purpose
Name(s) of data set(s) used in the project
Location(s) of data set(s) used in the project
Code author name (you!)
Date code created
Date last time code was edited

Most of these information are naturally occurring in the writing process of Quarto books.

Resource 1.1 : Literate Statistical Programming

Literate Programming: (Wikipedia)
Introduction to Literate Programming with Quarto (Online Slides)
Reproducibility and literate programming in R (bookdown course)
Introduction to Data Science in R for Biologists (Module on Literate Statistical Programming and Quarto)
Let’s build a blog with Quarto Literate programming in Quarto) by Isabella Velásquez. The site has other material (for Quarto blogs) as well: Migrate from R Markdown, Additional resources
Introduction to literate programming with Quarto and Markdown by Gesis (Slides)

1.2.2 Naming objects

I am used to apply the tidyverse style guide. It requires to use underlines (“snake_code”) as separators in object names. (In contrast to “camelCase” code style). But reading the book I thought it might be a good idea to use special additional styles for certain specific objects.

Naming constants: Prefix name of constants with k_.
Naming variables: Standard snake code.
Naming functions: Prefix name of private functions with a dot .. I had already experienced that didn’t know from which package a function was. Only to learn after looking around for minutes that it was a function I wrote myself!
Naming data frames: Prefix name with df_ for data.frame and dt_ for tibble. I might also use a suffix to refer to the status e.g., _raw (raw data), _clean (cleaned data), _v2 (version number).
Naming files: It could be helpful to add at the start the chapter number e.g. chap02_. And maybe also — as in naming data frames — the status as suffix.

1.3 Import data frames from outside resources

R has many possibilities to import data from other statistical packages.

1.3.1 Some common file extensions

.csv: comma separated values
.txt: text file
.xls or .xlsx: Excel file
.sav: SPSS file
.sasb7dat: SAS file
.xpt: SAS transfer file
.dta: Stata file

1.3.2 Some packages for import data sources

{readr}: Read Rectangular Text Data (see: Section A.71), it is part of {tidyverse}
{vroom}: Read and Write Rectangular Text Data Quickly
{haven}: Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files (see: Section A.38)
{foreign}: Read Data Stored by ‘Minitab’, ‘S’, ‘SAS’, ‘SPSS’, ‘Stata’, ‘Systat’, ‘Weka’, ‘dBase’, …
{readxl}: Read Excel Files
{openxslx}: Read, Write and Edit xslx Files
{readODS}: Read and Write ODS Files (e.g. LibreOffice)
{clipr}: Read and Write from the System Clipboard

I will not go into the import details of all the different packages here, because my focus is on the General Social Survey (GSS) data.

1.3.3 Import data with a .csv file

“While the GSS data can be read into R directly from the GSS website, Kiara had experienced this and knew that it could be frustrating.” (Harris, 2020) (pdf)

The author therefore offers for this chapter a .csv file with the data. In later chapters learner can choose to use the provided files from the SAGE webpage. Even if these data files are not yet cleaned, it is a kind of cheating, because it bypasses downloading data from the original source.

The pb_create_folder() function in the following code chunk checks if the folder already exists. If this is the case it leaves it untouched, otherwise it creates the folder at the desired path. The code for my private functions are in the file helper.R inside the R folder at root level of this working directory.

Example 1.1 : Read data from a .csv file into R

R Code 1.1 : Read .csv file and show data summary

Code

pb_create_folder(base::paste0(here::here(), "/data/"))
pb_create_folder(base::paste0(here::here(), "/data/chap01"))

gss_2016_book <- readr::read_csv(
    file = "data/chap01/legal_weed_age_GSS2016_ch1.csv",
    show_col_types = FALSE)

summary(gss_2016_book)

#>     grass               age           
#>  Length:2867        Length:2867       
#>  Class :character   Class :character  
#>  Mode  :character   Mode  :character

Listing / Output 1.1: Read the provided dataset as a .csv file into R and show base::summary()

Some comments

In contrast to base::read.csv() in the book I used with readr::read_csv() a function from the {tidyverse} package collection.
I added the show_col_types = FALSE argument to prevent a message about the column specification.

The output of base::summary is in this case not very informative. Looking around for appropriate reporting function I developed with my_glance_data() my private command. (See Listing / Output 1.2 in next tab.)

R Code 1.2 Look at data with my private function my_glance_data()

Code

gss_2016_book |>
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)

#> # A tibble: 10 × 3
#>      obs age   grass    
#>    <int> <chr> <chr>    
#>  1     1 47    IAP      
#>  2   172 27    LEGAL    
#>  3   562 70    NOT LEGAL
#>  4   898 60    NOT LEGAL
#>  5  1019 30    LEGAL    
#>  6  1176 80    NOT LEGAL
#>  7  1505 53    IAP      
#>  8  1911 54    NOT LEGAL
#>  9  2622 62    IAP      
#> 10  2867 72    NOT LEGAL

Listing / Output 1.2: Display 8 randomly chosen rows in addition to the first and last row

I developed a private function my_glance_data() to provide a first impression about the data. The function prints the first and last row of the dataset, so you know implicitly how many rows the dataset has. Between the first and last row the function adds randomly N other rows (default is 8). Additionally it provides the row number of the data in a separate column. (The column header obs stands for “observation”.) For reproducibility purposes you can also add a number for the set.seed() command.

The idea of my_glance_data() is to get a first impression of the dataset. Other printing methods show only the first (or last) rows of the dataset. This could be misleading, giving a wrong impression about typical data.

Maybe you are interested to use my_glance_data() yourself? It isn’t available through a package, but you can copy the source code from the next R chunk.

I have saved the function in a .qmd file one level higher as this book (and all my other R projects). With {{< include "../_common.qmd" >}} I have the code integrated in this book. (For the include shortcode see the section Includes of the Quarto documentation.)

Code

### function my_glance_data ##############
my_glance_data <- function(df, N = 8, seed = 42){
    df_temp <- first_and_last_row(df)
    
    set.seed(seed)
    df |> 
    dplyr::mutate(obs = dplyr::row_number()) |> 
    dplyr::relocate(obs) |> 
    dplyr::slice_sample(n = N) |> 
    dplyr::bind_rows(df_temp) |> 
    dplyr::arrange(obs)
} 

first_and_last_row <-  function(df) {
    df |> 
    dplyr::mutate(obs = dplyr::row_number()) |>
    dplyr::filter(dplyr::row_number() %in% base::c(1, dplyr::n()))
}

Listing / Output 1.3: Code of my private function glance_date()

1.4 Data Wrangling

1.4.1 ToDo List

After we saved the data we need to do some data wrangling. To replicate the data structure for the book Figure 1.2 we need to:

filter the dataset to the year 2016 (in the case you work with the full GSS dataset 1972-2022, which we won’t)
select only the variables age and grass to work with
drop all NA’s
convert grass into factor
recode grass labels
convert age from double to numeric
divide age into appropriate age intervals and label them accordingly

(a) Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1)

1.4.2 Replicating data structure for bar charts

Example 1.2 : Replicate data structure for bar charts

R Code 1.3 : Replicate data structure for Figures 1.1 and 1.2 (Book)

Code

gss_2016 <- readr::read_csv(
    file = "data/chap01/legal_weed_age_GSS2016_ch1.csv",
    show_col_types = FALSE)


gss_2016_clean <- gss_2016 |> 

#### (A) rework grass ##################
## convert grass to factor
  dplyr::mutate(grass = forcats::as_factor(grass))  |> 

## change DK and IAP to NA
  dplyr::mutate(grass = dplyr::na_if(x = grass, y = "DK")) |> 
  dplyr::mutate(grass = dplyr::na_if(x = grass, y = "IAP")) |> 

## drop unused levels "DK" and "IAP"
  dplyr::mutate(grass = forcats::fct_drop(grass)) |> 
    
## convert to factor and recode it
  dplyr::mutate(grass = forcats::fct_recode(
      grass, "Yes" = "LEGAL", "No" = "NOT LEGAL")) |> 

#### (B) rework age #########################
## change data types and recode

## turn character type of age into factor and recode "89 OR OLDER" to "89"
  # dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89")) |> 
  dplyr::mutate(age = dplyr::case_match(age, 
                                 "89 OR OLDER" ~ "89",
                                 .default = age)) |> 

## convert data type of age from factor to numeric
  dplyr::mutate(age = base::as.numeric(age)) |> 

## cut age into several defined age groups
  dplyr::mutate(age_cat = base::cut(age, 
                       breaks = c(-Inf, 29, 59, 74, Inf),
                       labels = c("< 30", "30 - 59", "60 - 74", "75+" ))) |>  

## drop NA's
    tidyr::drop_na()

## save cleaned data gss_2016
save_data_file("chap01", gss_2016_clean, "gss_2016_clean.rds")

## (C) check result #########

## summarize
print("************************   SUMMARY   ****************************")
base::summary(gss_2016_clean)

## glance at the data
print("*******************   GLANCE AT SOME DATA  **********************")
gss_2016_clean |>
    dplyr::select(c(age_cat, grass)) |> 
    my_glance_data(N = 8, seed = 2016)

#> [1] "************************   SUMMARY   ****************************"
#>  grass           age           age_cat   
#>  Yes:1123   Min.   :18.00   < 30   :332  
#>  No : 713   1st Qu.:33.00   30 - 59:989  
#>             Median :48.00   60 - 74:364  
#>             Mean   :48.26   75+    :151  
#>             3rd Qu.:61.00                
#>             Max.   :89.00                
#> [1] "*******************   GLANCE AT SOME DATA  **********************"
#> # A tibble: 10 × 3
#>      obs age_cat grass
#>    <int> <fct>   <fct>
#>  1     1 60 - 74 Yes  
#>  2   172 30 - 59 No   
#>  3   562 75+     No   
#>  4   574 30 - 59 Yes  
#>  5  1019 30 - 59 No   
#>  6  1176 30 - 59 Yes  
#>  7  1500 < 30    Yes  
#>  8  1568 75+     No   
#>  9  1803 75+     Yes  
#> 10  1836 60 - 74 No

Listing / Output 1.4: Replicate data structure for Figures 1.1 and 1.2 of the book)

The somewhat strange cut decision for age groups was motivated by the question if there is a difference between young and old voters about marijuana legalization.

Compare the result of the recoding with the original data structure in Listing / Output 1.1.

Some comments

I have included all data wrangling changes for the Figure 1.1 even if they appeared in the book section where the graph is prepared.
I have used |> form the native R pipe instead of %>% exported into tidyverse from the {magrittr} package.
Otherwise whenever possible I have used {tidyverse} code. For base::as.numeric() and base::cut() I couldn’t find {tidyverse} equivalents.

WATCH OUT! Difference between forcats::fct_recode() and dplyr::recode(). Use dplyr::case_match()!

To recode age from “89 OR OLDER” to “89” I used at first forcats::fct_recode(). But this function changes the 72 levels of the factor age and resulted — after I changed age to numeric in the next line — in a data frame where age ranged from 1 to 72!

After I noticed the problem and wanted to replace the {forcats}-line with dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89")) I saw in the help file that

recode() is superseded in favor of case_match(), which handles the most important cases of recode() with a more elegant interface. (Recode values) and

[dplyr::case_match()] allows you to vectorise multiple switch() statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used. (A general verctorized switch())

Listing / Output 1.4 used some important packages:

I used {dplyr} in the above code heavily. {dplyr} (Section A.18) is part of the {tidyverse} core collection (Section A.93), together with {forcats} (Section A.24), {ggplot2} (Section A.29), {readr} (Section A.29), {tibble} (Section A.90), {tidyr} (Section A.91), {stringr} (Section A.87) and {purrr} (Section A.69).

1.5 Reproducing bar charts

Example 1.3 : Reproducing bar charts

Figure 1.1
Figure 1.2

R Code 1.4 a: Reproducing bar chart Figure 1.1 using geom_bar()

Code

gss_2016_clean <-  base::readRDS("data/chap01/gss_2016_clean.rds")

## create figure 1.1 #######
gss_2016_clean |> 
  ggplot2::ggplot(ggplot2::aes(x = grass,
             y = 100 * ggplot2::after_stat(count) / 
                 base::sum(ggplot2::after_stat(count)),
             fill = grass)) +
  ggplot2::geom_bar() +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c("#79a778", '#7662aa'),
                    guide = "none") +
  ggplot2::labs(x = "Should marijuana be legal?",
       y = "Percent of responses")

Graph 1.2: Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1)

I changed the code slightly because of two warnings. Newer versions of {ggplot2} have deprecated some older functions:

Warning: The dot-dot notation (..count..) was deprecated in {ggplot2} version 3.4.0. Please use after_stat(count) instead.
Warning: The guide argument in scale_*() cannot be FALSE. This was deprecated in {ggplot2} version 3.3.4. Please use “none” instead. (I have slightly edited both warnings)

Compare the reproduction with the original Graph 1.1 (a).

R Code 1.5 b: Reproducing bar chart Figure 1.2 using geom_col()

Code

gss_2016_clean <-  base::readRDS("data/chap01/gss_2016_clean.rds")

## create figure 1.2 ########
gss_2016_clean |> 
  dplyr::group_by(grass, age_cat) |> 
  dplyr::count() |> 
  dplyr::group_by(age_cat) |> 
  dplyr::mutate(perc_grass = 100 * n / base::sum(n)) |> 
  ggplot2::ggplot(ggplot2::aes(x = age_cat, fill = grass,
             y = perc_grass)) +
  ggplot2::geom_col(position = 'dodge') +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c('#79a778', '#7662aa'),
                    name = "Should marijuana\nbe legal?") +
  ggplot2::labs(x = "Age group (in years)",
       y = "Percent of responses in age group")

Graph 1.3: Support for marijuana legalization by age group among participants in the 2016 General Social Survey (Figure 1.2)

Graph 1.2 uses ggplot2::geom_bar() whereas this figure here calls the ggplot2::geom_col() function.

There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group … . If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. (Bar charts with {ggplot2})

Compare the reproduction with the original Graph 1.1 (b).

1.6 Exercises

Exercises for the Coder & Hacker edition

SwR objectives

Visit the NHANES website and find the online codebook for the 2013–2014 data (Section 1.6.1)
Open the 2013–2014 NHANES data file saved as nhanes_2013_ch1.csv with the book materials at edge.sagepub.com/harris1e (Achievement 1) (Section 1.6.3 in tab “read_csv()”)
Examine the data types for DUQ200, RIDAGEYR, and RIAGENDR, and fix data types if needed based on the NHANES codebook (Achievements 2 and 3) (Section 1.6.3 in tab “View data” and tab “Codebook”)
Based on the online NHANES codebook, code missing values appropriately for DUQ200, RIDAGEYR, and RIAGENDR (Achievement 4) (Section 1.6.3 in tab “Data cleaning”)
Create a bar chart showing the percentage of NHANES participants answering yes and no to marijuana use (Achievement 5) (Section 1.6.4 in tab “Solution”)
Recode age into a new variable called age_cat with 4 categories: 18–29, 30–39, 40–49, 50–59 (Achievement 7) (Section 1.6.5 in tab “Numbers”)
Create a bar chart of marijuana use by age group (Achievement 6) (Section 1.6.5 in tab “Age”)
~~Add a prolog and comments to your code (Achievement 1)~~ (I am using literate programming in writing this book.)
Create a bar chart of marijuana use by age group and sex with side-by-side bars (Achievement 5) (Section 1.6.5 in tab “Age and sex”)
Following the R code in your code file, use comments to describe what you found. Given what you found and the information in the chapter, what do you predict will happen with marijuana legalization in the next 10 years? Discuss how the omission of older people from the marijuana use question for NHANES influenced your prediction. Write your prediction and discussion in comments at the end of your code file (Section 1.6.6).

My own additional exercises

Download and work with SAS .xpt file data (Demo) {Section 1.6.2}
Show table of frequencies and percentages (Section 1.6.4 in Tab “Numbers”)
Make bars thinner (Section 1.6.4 in Tab “Thin bars”)
Make bars thinner and adjust spacing accordingly (Section 1.6.4 in Tab “Solution”)

1.6.1 Visiting the NHANES website

Use the National Health and Nutrition Examination Survey (NHANES) data to examine marijuana use in the United States. Spend a few minutes looking through the NHANES website (https://www.cdc.gov/nchs/nhanes/index.htm) before you begin, including finding the online codebook for the 2013–2014 data. Complete the following tasks to explore whether age is related to marijuana use in the United States.

The 2013–2014 codebook is on the page NHANES 2013-2014. On this page are the links to different types of data, documentations or codebooks.

You can choose from

Bullet List: NHANES has six different data sections

Demographics Data
Dietary Data
Examination Data
Laboratory Data
Questionnaire Data
Limited Access Data

Bullet List 1.1: NHANES has six different data sections

Clicking of one of these links opens a page where a table contains the following information:

Bullet List: NHANES data pages has download links for documentation and data files

Data File Name
Doc File
Data File
Data Published

Bullet List 1.2: NHANES data pages has download links for documentation and data files

Datasets and Documentation and Frequently Asked Questions (FAQs) are important introductory pages.

NHANES data are saved in a SAS transport (.XPT) file. These file can be read with foreign::read.xport() and haven::read_xpt(). There is also a page with example code to download/import NHANES data files. I will use {haven} instead of the recommended {foreign} package.

1.6.2 Download NHANES SAS .xpt file data (Demo)

Experiment 1.1 : Download and work with NHANES SAS .xpt file data (Demo)

R Code 1.6 : Save a SAS transport (.xpt) file from the NHANES site for demo purposes and display the structure of the file

Code

## download only once (manually)
utils::download.file(
    "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BPXO.XPT",
    tf <- base::tempfile(),
    mode = "wb"
)
blood_pressure_demo <- haven::read_xpt(tf)

save_data_file("chap01", blood_pressure_demo, "blood_pressure_demo.rds")

Listing / Output 1.5: Download a file and save it as R object

(This code chunk has no visible output)

With the exception of using (haven) instead of {foreign}, I followed the instruction from the NHANES recommendation. We see labelled data as it was expected from the import with {haven}.

In Listing / Output 1.10 I have already used the utils::download.file() function: There are some differences:

Listing / Output 1.10 downloaded a STATA .zip file, therefore I used unz() to unzip the file and to read haven::read_dta() instead haven::read_xpt().
More important however was the base::unlink(temp) command to unlink the temporary file, which is not done here.
Another difference was using the argument mode = "wb" for writing binary files. As I understood this is only important for Windows user. On my Mac I could stick with the default value mode = "w". But for reproducibility it would be better to use in the future the more general option with mode = "wb".

Package {nhanesA} facilitates downloading NHANES data

I just learned that there is a {nhanesA} package that facilitates downloading data to R. (2024-01-20) The latest package release is brand-new (2024-01-09).

I am planning to use this package when I am working with again with NHANES data in Chapter 6. Just for me to remember:

The packages is on CRAN
There is a GitHub repository.
A very informative introductory vignette does not only explain the package but also the structure of the NHANES data and codebook.

Important 1.1: Package {nhanesA} facilitates downloading NHANES data

R Code 1.7 : Show data summary of three blood pressure variables with utils::str() (Demo)

Code

blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds")

blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    utils::str()

#> tibble [11,656 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ BPAOARM: chr [1:11656] "R" "R" "R" "R" ...
#>   ..- attr(*, "label")= chr "Arm selected - oscillometric"
#>  $ BPXOSY1: num [1:11656] 109 99 123 102 116 138 104 100 141 NA ...
#>   ..- attr(*, "label")= chr "Systolic - 1st oscillometric reading"
#>  $ BPXODI1: num [1:11656] 67 56 73 65 68 70 58 64 77 NA ...
#>   ..- attr(*, "label")= chr "Diastolic - 1st oscillometric reading"

R Code 1.8 : Show data summary of three blood pressure variables with summary() (Demo)

Code

blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    base::summary()

#>    BPAOARM             BPXOSY1        BPXODI1      
#>  Length:11656       Min.   : 52    Min.   : 31.00  
#>  Class :character   1st Qu.:106    1st Qu.: 63.00  
#>  Mode  :character   Median :117    Median : 71.00  
#>                     Mean   :120    Mean   : 72.04  
#>                     3rd Qu.:130    3rd Qu.: 80.00  
#>                     Max.   :225    Max.   :151.00  
#>                     NA's   :1304   NA's   :1304

R Code 1.9 : Show data summary of three blood pressure variables with skim() (Demo)

Code

blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    skimr::skim()

Data summary
Name	dplyr::select(blood_press…
Number of rows	11656
Number of columns	3
_______________________
Column type frequency:
character	1
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
BPAOARM	0	1	0	1	2152	3	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
BPXOSY1	1304	0.89	120.00	19.95	52	106	117	130	225	▁▇▅▁▁
BPXODI1	1304	0.89	72.04	12.42	31	63	71	80	151	▁▇▃▁▁

R Code 1.10 : Show some blood pressure data with my private function my_glance_data()

Code

blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds")

## display some values
blood_pressure_demo |> 
    dplyr::select(c(2, 4, 5)) |> 
    dplyr::rename(arm_selected = BPAOARM) |> 
    dplyr::rename(systolic = BPXOSY1) |> 
    dplyr::rename(diastolic = BPXODI1) |> 
    my_glance_data()

#> # A tibble: 10 × 4
#>      obs arm_selected systolic diastolic
#>    <int> <chr>           <dbl>     <dbl>
#>  1     1 R                 109        67
#>  2   356 R                 131        64
#>  3  1252 R                 100        54
#>  4  2369 R                 173        82
#>  5  5273 R                 131        77
#>  6  8826 R                 111        74
#>  7  9290 R                 112        69
#>  8 10289 R                 141        74
#>  9 10801 R                 112        71
#> 10 11656 R                 122        55

1.6.3 Import & clean data

Exercise 1.1 : Coder exercises: Import & clean data

R Code 1.11 : Open the 2013–2014 NHANES data file that comes with the book

Code

## load and save it only once (manually) ########
nhanes_2013 <- readr::read_csv("data/chap01/nhanes_2013_ch1.csv")
save_data_file("chap01", nhanes_2013, "nhanes_2013.rds")

(This R Code chunk has no visible output.)

R Code 1.12 : Look at the data with utils::str() and skimr::skim()

Code

nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds")

base::summary(nhanes_2013)

#>       ...1          DUQ200        RIDAGEYR        RIAGENDR    
#>  Min.   :   1   Min.   :1.00   Min.   :18.00   Min.   :1.000  
#>  1st Qu.:1265   1st Qu.:1.00   1st Qu.:30.00   1st Qu.:1.000  
#>  Median :2529   Median :1.00   Median :42.00   Median :2.000  
#>  Mean   :2529   Mean   :1.48   Mean   :42.49   Mean   :1.522  
#>  3rd Qu.:3793   3rd Qu.:2.00   3rd Qu.:56.00   3rd Qu.:2.000  
#>  Max.   :5057   Max.   :9.00   Max.   :69.00   Max.   :2.000  
#>                 NA's   :1356

Code

utils::str(nhanes_2013)

#> spc_tbl_ [5,057 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#>  $ ...1    : num [1:5057] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ DUQ200  : num [1:5057] NA 1 1 NA 1 NA 2 2 1 2 ...
#>  $ RIDAGEYR: num [1:5057] 69 54 56 61 56 65 26 33 32 18 ...
#>  $ RIAGENDR: num [1:5057] 1 1 1 2 2 1 2 2 1 1 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   ...1 = col_double(),
#>   ..   DUQ200 = col_double(),
#>   ..   RIDAGEYR = col_double(),
#>   ..   RIAGENDR = col_double()
#>   .. )
#>  - attr(*, "problems")=<externalptr>

Code

skimr::skim(nhanes_2013)

Data summary
Name	nhanes_2013
Number of rows	5057
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
…1	0	1.00	2529.00	1459.97	1	1265	2529	3793	5057	▇▇▇▇▇
DUQ200	1356	0.73	1.48	0.61	1	1	1	2	9	▇▁▁▁▁
RIDAGEYR	0	1.00	42.49	15.16	18	30	42	56	69	▇▆▇▆▇
RIAGENDR	0	1.00	1.52	0.50	1	1	2	2	2	▇▁▁▁▇

Listing / Output 1.6: Look at the data with utils::str() and skimr::skim()

As one of the packages for displaying summaries I am using the {skimr} package (see Section A.84).

Unfortunately the .csv files does not include labelled data. Therefore I had to look for the variable names in all different NHANES sections as outlined in Bullet List 1.1. Inspecting the accompanying NHANES download pages (see: Bullet List 1.2) I could look for the variable names in the download links for the documentation.

DUQ200: Ever used marijuana or hashish
RIAGENDR: Gender
RIDAGEYR: Age in years at screening

There is an additional column ...1 (see Listing / Output 1.6) providing row numbers. Maybe this column works as an ID, a kind sequence number, so that one could merge other data files identified by this number. (See SEQN: Respondent Sequence Number). Why this column has lost its name, I do not know. Perhaps it has to do with the data import done by the book author?

To facilitate the work I made screenshots from these four variables:

R Code 1.13 : Look at the data after data cleaning

Code

nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds")

nhanes_2013_clean <- nhanes_2013 |> 
    dplyr::rename(nr = ...1) |> 
    dplyr::rename(
        sex = RIAGENDR,
        age = RIDAGEYR,
        grass = DUQ200
        ) |> 
    dplyr::mutate(grass =
          dplyr::na_if(x = grass, y = 7)) |> 
    dplyr::mutate(grass =
          dplyr::na_if(x = grass, y = 9)) |>
    tidyr::drop_na() |> 
    dplyr::mutate(sex = forcats::as_factor(sex)) |> 
    dplyr::mutate(grass = forcats::as_factor(grass)) |> 
    dplyr::mutate(grass = 
          forcats::fct_recode(grass, 
          "Yes" = "1", "No" = "2")) |> 
    dplyr::mutate(sex = 
          forcats::fct_recode(sex, 
          "Male" = "1", "Female" = "2"))

## save cleaned data ###############
save_data_file("chap01", nhanes_2013_clean, "nhanes_2013_clean.rds")

base::summary(nhanes_2013_clean)

#>        nr       grass           age            sex      
#>  Min.   :   2   Yes:1991   Min.   :18.00   Male  :1792  
#>  1st Qu.:1288   No :1699   1st Qu.:26.00   Female:1898  
#>  Median :2546              Median :38.00                
#>  Mean   :2536              Mean   :37.48                
#>  3rd Qu.:3784              3rd Qu.:48.00                
#>  Max.   :5056              Max.   :59.00

Code

skimr::skim(nhanes_2013_clean)

Data summary
Name	nhanes_2013_clean
Number of rows	3690
Number of columns	4
_______________________
Column type frequency:
factor	2
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
grass	0	1	FALSE	2	Yes: 1991, No: 1699
sex	0	1	FALSE	2	Fem: 1898, Mal: 1792

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
nr	0	1	2535.51	1450.33	2	1288.25	2546.5	3784.25	5056	▇▇▇▇▇
age	0	1	37.48	12.43	18	26.00	38.0	48.00	59	▇▆▆▆▆

Recoded gender to sex

In all R code chunks and text passages I have changed the gender category to sex. “Sex” refers only to biology but “gender” is a broader concepts: It refers to identity construction, that include biological, psycho-social, and cultural factors.

Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, behaviours and roles associated with being a woman, man, girl or boy, as well as relationships with each other. As a social construct, gender varies from society to society and can change over time. … Gender interacts with but is different from sex, which refers to the different biological and physiological characteristics of females, males and intersex persons, such as chromosomes, hormones and reproductive organs. Gender and sex are related to but different from gender identity. Gender identity refers to a person’s deeply felt, internal and individual experience of gender, which may or may not correspond to the person’s physiology or designated sex at birth. (WHO)

My learning from the above procedure of data cleaning

It is wrong to put the column name ...1 into accents.
There exists a shortcut for several dplyr::rename() but not for dplyr::na_if() and forcats::as_factor() because here we need to change the column values with the command.
Sequence of commands is important:
- Start with the renaming of variables, This is not mandatory but it helps to address the correct column in the following commands.
- Recode different missing values to NA’s with dplyr::na_if()
- Then drop (all) NA’s with tidyr::drop_na().
forcats::as_factor() needs to rewrite the column as factor with dplyr::mutate()

1.6.4 Bar chart for marijuana use

Exercise 1.2 : Bar chart (with different bar width ans space) showing the percentage of NHANES participants answering yes and no to marijuana use

R Code 1.14 : Show frequencies and percentages of NHANES participants answering yes and no to “Ever used marijuana or hashish”

Code

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

tibble::enframe(table(nhanes_2013_clean$grass)) |> 
    dplyr::rename(`Ever used grass or hash` = name, Freq = value) |> 
    dplyr::mutate(Perc = round(100 * Freq / sum(Freq), 2))

#> # A tibble: 2 × 3
#>   `Ever used grass or hash` Freq        Perc       
#>   <chr>                     <table[1d]> <table[1d]>
#> 1 Yes                       1991        53.96      
#> 2 No                        1699        46.04

Instead of using name and value I could have used the position of the column numbers 1 and 2 (all names without the `-sign.)

This is a manual calculation using the {tidyverse} approach: I am sure there are some packages that may facilitate this computation (e.g., {Janitor} or {DescTools}), but I had difficulties to apply the appropriate functions.

R Code 1.15 : Percentage of NHANES participants answering yes and no to marijuana use

Code

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

ggplot2::ggplot(nhanes_2013_clean,
    ggplot2::aes(grass, 
         y = 100 * ggplot2::after_stat(count) /
             base::sum(ggplot2::after_stat(count)),
             fill = grass
        )) +
    ggplot2::geom_bar() + 
    ggplot2::theme_minimal() +
    ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'),
                    guide = "none") +
    ggplot2::scale_y_continuous(
        breaks = base::seq(0, 50, by = 10),
        labels = scales::label_percent(scale = 1)) +
    ggplot2::labs(x = "Did you ever use marijuana or hashish?",
       y = "Percent of responses")

Graph 1.4: Proportion of marijuana use of the participant in the NHANES 2013 survey

The bars are very thick. I tried with ggplot2::geom_bar(width = 0.5) to make them thinner. But I failed. (See Listing / Output 1.7)

R Code 1.16 : Percentage of NHANES participants answering yes and no to marijuana use

(a) Proportion of marijuana use of the participant in the NHANES 2013 survey

The bars are now thin but the space between didn’t change. I tried several options with the position argument to adjust bar width and spacing:

Dodging preserves the vertical position of an geom while adjusting the horizontal position. position_dodge() requires the grouping variable to be be specified in the global or geom_* layer. Unlike position_dodge(), position_dodge2() works without a grouping variable in a layer. position_dodge2() works with bars and rectangles, but is particulary useful for arranging box plots, which can have variable widths.

All my trials with position_dodge() and position_dodge2() failed. The only two strategies I could reduce the spacing was:

reducing the width of the graphic (and adapting the title of the graphic to fit into the smaller space)
using ggplot2::theme(aspect.ratio = 3/1) as shown in Listing / Output 1.8.

R Code 1.17 : Percentage of NHANES participants answering yes and no to marijuana use

What I learned from creating the bar chart

Even with this simple example of a bar chart I learned some new options:

Use ggplot2::after_stat() inside the aesthetics of the first layer to convert frequencies to percentages.
Adding breaks with n.breaks = <number> or breaks = base::seq(from, to, by = <number>) inside the appropriate scale.
Use the {scales} package to add the %-sign after the values. As I have already calculated the percentages I need to reduce the default value of scale = 100 to scale = 1.
Use ggplot2::theme(aspect.ratio = <number>/1) creating thinner bars and reducing the spacing between the bars accordingly. In my case I changed the aspect ratio to 3/1. A bigger front number compresses the graph, a smaller front number extends it. Aspect ratio of 1/1 is the standard (= no change).

1.6.5 Marijuana use per age and sex

Exercise 1.3 : Exploring the percentage of NHANES participants answering yes or no to marijuana use by age category and sex

R Code 1.18 : Show frequencies and percentages of NHANES participants answering yes and no to “Ever used marijuana or hashish”

Code

nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds")

nhanes_2013_grouped <- nhanes_2013_clean |> 
    ## cut age into several defined age groups
    dplyr::mutate(age_cat = base::cut(age,
                   breaks = c(-Inf, 29, 39, 49, 59),
                   labels = c("18 - 29", "30 - 39", "40 - 49", "50 - 59"))) |> 
    dplyr::select(grass, sex, age_cat)


## older trial, without the knowledge of {gtsummary}
# tibble::enframe(table(nhanes_2013_grouped$age_cat)) |>  
#     dplyr::rename(`Ever used grass or hash` = 1, Freq = 2) |> 
#     dplyr::mutate(Perc = round(100 * Freq / sum(Freq), 2))

nhanes_2013_grouped |> 
    labelled::set_variable_labels(
        grass = "Ever used grass or hash?"
    ) |> 
    gtsummary::tbl_summary(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        type = grass ~ "categorical"
    ) |> 
    gtsummary::bold_labels()

save_data_file("chap01", nhanes_2013_grouped, "nhanes_2013_grouped.rds")

Characteristic	N = 3,690¹
Ever used grass or hash?
Yes	1,991 (54%)
No	1,699 (46%)
Sex
Male	1,792 (49%)
Female	1,898 (51%)
Age Category
18 - 29	1,154 (31%)
30 - 39	859 (23%)
40 - 49	855 (23%)
50 - 59	822 (22%)
¹ n (%)

Instead of using 1 and 2 I could have used the position of the column numbers name and value (all names without the `-sign.)

R Code 1.19 : Table: Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses

Code

nhanes_2013_grouped <- base::readRDS("data/chap01/nhanes_2013_grouped.rds")

nhanes_2013_grouped |> 
    labelled::set_variable_labels(
        grass = "**Ever used grass or hash?**"
    ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = age_cat,
        col = grass,
        percent = "cell"
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    ) |> 
    gtsummary::bold_labels()

Table 1.1: Proportion of marijuana use of the participant by age group in percent of total responses in the NHANES 2013 survey

	Ever used grass or hash?		Total
	Yes	No	Total
Age Category
18 - 29	662 (18%)	492 (13%)	1,154 (31%)
30 - 39	466 (13%)	393 (11%)	859 (23%)
40 - 49	405 (11%)	450 (12%)	855 (23%)
50 - 59	458 (12%)	364 (9.9%)	822 (22%)
Total	1,991 (54%)	1,699 (46%)	3,690 (100%)

I have required {gtsummary} to calculate the percentage with one decimal place. The reason was that the graphic did not match the table. For instance: Without decimal places the “No”-proportion of the age group 18-29 was with 13% equal to the “Yes”-proportion of the age group 30-39. But in the figure these two bars have a different height. The reason was a rounding error: Both values 13.3% and 12.6% were rounded to 13%. But in the figure one could see the difference of 0.7% in the length of the bars.

R Code 1.20 : Figure: Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses

Code

## age groups bar chart 1 ########
nhanes_2013_grouped |> 
    ggplot2::ggplot(ggplot2::aes(
        x = age_cat, 
        fill = grass,
         y = 100 * ggplot2::after_stat(count) / 
             base::sum(ggplot2::after_stat(count))
        )
    ) + 
    ggplot2::geom_bar(position = 'dodge') +    
    ggplot2::theme_minimal() +
    ggplot2::scale_fill_manual(values = c('darkgreen', 'darkred'),
                    name = "Ever used\ngrass or hash?") +
    ggplot2::labs(x = "Age group (in years)",
       y = "Percent of total responses")

Graph 1.5: Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses

R Code 1.21 : Table:: Percentage of NHANES 2013 survey participants by age group answering yes and no to marijuana use in percent of age group

Code

nhanes_2013_grouped <- base::readRDS("data/chap01/nhanes_2013_grouped.rds")

nhanes_2013_grouped |> 
    labelled::set_variable_labels(
            grass = "**Ever used grass or hash?**"
        ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = age_cat,
        col = grass,
        percent = "row",
        margin = "column",
        digits = c(0, 1)
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    ) |> 
    gtsummary::bold_labels()

Table 1.2: Proportion of marijuana use of the participant by age group in the NHANES 2013 survey

	Ever used grass or hash?		Total
	Yes	No	Total
Age Category
18 - 29	662 (57.4%)	492 (42.6%)	1,154 (100.0%)
30 - 39	466 (54.2%)	393 (45.8%)	859 (100.0%)
40 - 49	405 (47.4%)	450 (52.6%)	855 (100.0%)
50 - 59	458 (55.7%)	364 (44.3%)	822 (100.0%)

R Code 1.22 : Figure: Percentage of NHANES participants of the 2013 survey answering yes and no to marijuana use in percent of age group

Code

## age groups bar chart 2 ########
nhanes_2013_grouped |> 
  dplyr::group_by(grass, age_cat) |>
  dplyr::count() |> 
  dplyr::group_by(age_cat) |> 
  dplyr::mutate(perc_grass = 100 * n / base::sum(n)) |> 
  ggplot2::ggplot(ggplot2::aes(x = age_cat, fill = grass,
             y = perc_grass)) +
  ggplot2::geom_col(position = 'dodge') +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c('darkgreen', 'darkred'),
                    name = "Ever used\ngrass or hash?") +
  ggplot2::labs(x = "Age group (in years)",
       y = "Percent of responses in age group")

Graph 1.6: Percentage of NHANES participants of the 2013 survey answering yes and no to marijuana use in percent of age group

R Code 1.23 : Percentage of NHANES participants 2013 survey by age group and sex answering yes and no to marijuana use

Code

nhanes_2013_grouped <- 
    base::readRDS("data/chap01/nhanes_2013_grouped.rds")

## age and sex bar chart ########
nhanes_2013_grouped |> 
  dplyr::group_by(grass, age_cat, sex) |> 
  dplyr::count() |> 
  dplyr::group_by(age_cat, sex) |> 
  dplyr::mutate(perc_grass_2 = 100 * n / base::sum(n)) |> 
  ggplot2::ggplot(ggplot2::aes(
      x = age_cat, 
      y = perc_grass_2,
      fill = sex)) +
  ggplot2::geom_col(position = 'dodge') +
  ggplot2::theme_minimal() +
  ggplot2::scale_fill_manual(values = c('#f98e31', '#730b6d'),
                             name = "sex") +
  ggplot2::labs(
          x = "Age group (in years)",
          y = "Percent of males/females that used marijuana or hashish"
      )

Graph 1.7: Proportion of marijuana use of the participant by age group and sex in the NHANES 2013 survey

With the exception of the oldest age group the graphic shows that the proportion of marijuana / hashish user increases with younger people. One possible explanation for the untypical high use in the oldest age group could be that people between 50-59 of the 2013 survey were born between 1954-1963 and were young adults during the time of the Flower Power generation part of the counterculture of the 1960s-1970s.

Graph 1.7 is a little difficult to interpret. Both bars colors shows the proportion of “Yes”-respondents in the appropriate age group. For instance: About 62% of males and about 51% of women of the age group 18-29 said “Yes”.

Therefore the bars to not sum up to 100%. The figures shows only the “Yes” part of the dichotomous grass variable for men and women in a certain age group. The other “No” part is missing.

To distinguish the sex colors in Graph 1.7 from the Yes/No colors in Graph 1.5 and Graph 1.6 I chose at first some kind of blue for men and red for females. But I learned from An alternative to pink & blue: Colors for gender data that one should avoid this stereotypical colors.

In our western culture, these colors come with the whole gender stereotype baggage. Pink means weak, shy girls who play with dolls and don’t deserve as much as boys. Blue stands for boys who need to be strong & rough. When we create a chart with pink & blue, we endorse gender stereotypes.

1.6.6 Data interpretation

Report 1.1: Marijuana usage and opinions to legalize it (GSS and NHANES data)

Graph 1.3 with GSS data 2016 shows that younger people support the legalizing of marijuana. Therefore we can expect that in the future the proportion of people that approve legalizing will increase. With a small project on my own I have backed up this thesis. Graph 1.12 demonstrates that the percentage of supporters has increased very strongly in the last years and has now hit the 70% mark. This trend is very clear even I have not distinguished in the longitudinal study between different age groups.

The graphs of the NHANES 2013 data in Graph 1.5 and fig-grass-age-group2 show that the highest proportion of people ever used marijuana or hashish lies in youngest age. The age trend is not very clear, because we have only figures for person equal or under 59 years. (Here would be necessary another small project with longitudinal analysis including newer data.)

With the exception of the age group 40-49 the proportion of males ever used in their respective age group is higher than the percentage of women. (See Graph 1.7). But this is only a snapshot and difficult to interpret. Another possible thesis could be that there is a tiny upward trend from old top young in marijuana use. One possible explanation for the untypical higher use in the oldest age group could be that people between 50-59 of the 2013 survey were born between 1954-1963 and were young adults during the time of the Flower Power generation part of the counterculture of the 1960s-1970s.

1.7 Get the GSS data

1.7.1 Working with the GSS

I am very interested how to get GSS data directly from the GSS website, so that I could work on interesting research questions myself.

I have found several resources helping to work with the GSS. The most important one is the {gssr} package (see: Section A.35)

Resource 1.2 : asdfree: Analyze Survey Data for Free

Analyze Survey Data for Free is a bookdown website by Anthony Damico with currently 64 locations to grab free survey data. As expected it features also a description of the GSS including analysis examples with the {survey} package and — especially important for my purpose here — {lodown}, a package on GitHub that is not maintained anymore.

Experiment 1.2 features five different strategies to download GSS data:

Download extract by using the GSS Data Explorer — Tab: “Explorer”
Download files manually — Tab: “by hand”
Download files programmatically — Tab: “automated”
Download via the {lodown} package — Tab: “lodown”
Download via the {gssr} package — Tab: “gssr”

1.7.2 Five strategies to get GSS data

Experiment 1.2 : How to get the General Social Survey (GSS) data

To use all the facilities of the GSS Data Explorer (tagging, tabulating, data extracting) you need to register for a free account. The good thing is: This is a onetime procedure.

Procedure 1.1 : Downloading data extracts with the GSS Data Explorer

Create a free account for the GSS Data Explorer, a tool that allows to browse the data that have been collected in the surveys.
- Fill out the form
- Wait for an email with the verification code
- Confirm the registration with the verification code
Go for the tour to learn the interface (Link “Tour Guide”)
Now you are ready to follow the advises in the slides. If you prefer you can view the slide show in a standalone browser.

As one can see this is a somewhat cumbersome procedure to download the desired data. Following the proposed strategies in the other tabs are much easier for importing GSS data. But using the GSS Data Explorer is very helpful to explore the dataset. Apply the first three steps of the above list to find the correct variable names, to read the exact wording of the question asked and to inspect the different codes that are used for the variable. Otherwise you have to skim the more than 700 pages of the GSS codebook.😭

Another approach is to download the complete dataset (or all variables of those years you are interested in) and manage the data in such a way that it can be easily used for your research question. (See Section 1.4)

Procedure 1.2 : Download GSS individual year data sets (cross-section only)

Visit https://gss.norc.org/Get-The-Data and choose under the section “Download the Data” the “STATA” format. I read elsewhere that this is the preferred format to convert the data into R with the {haven} package.
From the STATA-page choose the appropriate link (2016 in our case) under the section “Individual Year Data Sets (cross-section only)” and download the file 2016_stata.zip (994 MB) into your preferred folder on your hard disk. After you extract the .zip file you will get the STAT file GSS2016.dta (4.4 MB).
You can now apply R Code 1.24.

R Code 1.24 : Import STATA GSS 2016 file into R using (haven)

Code

## run only once (manually) ###########
gss_2016_man <- haven::read_dta("data/chap01/GSS2016.dta")
save_data_file("chap01", gss_2016_man, "gss_2016_man.rds")

(This R code chunk has no visible output)

Code

gss_2016_man <- 
    base::readRDS("data/chap01/gss_2016_man.rds")

gss_2016_man |>
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)

#> # A tibble: 10 × 3
#>      obs age       grass                      
#>    <int> <dbl+lbl> <dbl+lbl>                  
#>  1     1 47        NA(i) [iap]                
#>  2   172 27            1 [should be legal]    
#>  3   562 70            2 [should not be legal]
#>  4   898 60            2 [should not be legal]
#>  5  1019 30            1 [should be legal]    
#>  6  1176 80            2 [should not be legal]
#>  7  1505 53        NA(i) [iap]                
#>  8  1911 54            2 [should not be legal]
#>  9  2622 62        NA(i) [iap]                
#> 10  2867 72            2 [should not be legal]

Listing / Output 1.9: Import STATA GSS 2016 file into R using (haven) and glance at the data

{haven} imports data as labelled vectors

The data structure we have found here is very different from the Excel data file provided with the book.

Labelled vectors is a completely new feature for me. I learned that value labels and other metadata tags that are commonly seen when working with other statistical software like SAS, STATA or SPSS (cf. Survey Research and Datasets in R, here section 3 Labelled Data)

A labelled vector is a common data structure in other statistical environments, allowing you to assign text labels to specific values. This class makes it possible to import such labelled vectors in to R without loss of fidelity. (Create a labelled vector)

I will go into more details in Section 1.8. The important thing here is to notice that the variable grass has labelled values that explain the short code. Code 1 represents the respondent option that marijuana should be legalized and 2 the opposite. We also learn that there is with NA i a special kind of NA value:

.i: Inapplicable (IAP). Respondents who are not asked to answer a specific question are assigned to IAP. (See Alert on the STATA download page)

On the website we see under the “Alert” section that there other kind of NA’s as well. And the 2022 GSS Codebook describes still other, less common missing values.

Additional comments

I chose for file saving the base::saveRDS() option (and not base::save()) because when later reading into R again with base::readRDS() it does not overwrite a variable with the same name respectively I can assign the file to another variable name.

R Code 1.25 : Get year 2016 of GSS data set with base::download.file()

Code

## run only once (manually) ###########
temp <- base::tempfile()
utils::download.file("https://gss.norc.org/documents/stata/2016_stata.zip",temp)
gss_2016_aut <- haven::read_dta(base::unz(temp, "GSS2016.dta"))
base::unlink(temp)
save_data_file("chap01", gss_2016_aut, "gss_2016_aut.rds")

Listing / Output 1.10: Get year 2016 of GSS data set with base::download.file()

(This R code chunk has no visible output)

This time we have the file downloaded programmatically which is much better in term of reproducibility. We don’t need now to import the data {haven} but can call base::readRDS().

Code

gss_2016_aut <- base::readRDS("data/chap01/gss_2016_aut.rds")

gss_2016_aut |> 
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)

#> # A tibble: 10 × 3
#>      obs age       grass                      
#>    <int> <dbl+lbl> <dbl+lbl>                  
#>  1     1 47        NA(i) [iap]                
#>  2   172 27            1 [should be legal]    
#>  3   562 70            2 [should not be legal]
#>  4   898 60            2 [should not be legal]
#>  5  1019 30            1 [should be legal]    
#>  6  1176 80            2 [should not be legal]
#>  7  1505 53        NA(i) [iap]                
#>  8  1911 54            2 [should not be legal]
#>  9  2622 62        NA(i) [iap]                
#> 10  2867 72            2 [should not be legal]

Listing / Output 1.11: Read previously saved .rds file into R and glance at the data

Data now have a more R like appearance, even if the variable classes with “haven_labelled, vctrs_vctr, double” are unfamiliar. But we have now lost some information, especially we have to consult the codebook to know what the codes 1 and 2 mean.

The following strategy I have taken from the bookdown book Analyze Survey Data for Free (asdf.com) It gives step by step instructions to explore public Microdata. Here I refer to the General Social Survey (GSS) section of the book.

Procedure 1.3 : Get the GSS data with the {lodown} package and glance at the data

Working with {lodown} is a three step procedure:

Retrieve a listing of all available extracts for the GSS data.
Choose what files you want to download. In our case data for the year 2016.
Download the specified dataset in the offered SPSS file format, but {lodown} produces with .rds a native R file format with the name 2016.rds.

The second step has to be done manually but I have the result of my inspection already integrated in Listing / Output 1.12.

As additional steps I renamed the downloaded file, so that I can distinguish it from similar files of the other approaches. Finally I glanced at the grass and age data.

R Code 1.26 : Get GSS data via {lodown} package

Code

## can't suppress messages, tried several options

## run only once (manually) ############
my_folder <- base::paste0(here::here(), "/data-raw")

# (1) retrieve a listing of all available extracts for the GSS data
gss_cat <- lodown::get_catalog(data_name = "gss",
                               output_dir = my_folder,
                               "GSS") |> 
## (2) choose the catalog part to download
    dplyr::filter(
        output_filename == base::paste0(my_folder, "/2016.rds"))

## (3) download the GSS microdata as 2016.rds
lodown::lodown("gss" , gss_cat)

## rename dataset to distinguish from other download approaches
old_filename <- base::paste0(my_folder, "/2016.rds")
new_filename <- base::paste0(my_folder, "/gss_2016_cat.rds")
base::file.rename(from = old_filename, to = new_filename)

(This R code chunk has no visible output beside the warning message.)

Code

## load and glance at data
gss_2016_cat <- base::readRDS("data/chap01/gss_2016_cat.rds")
gss_2016_cat |> 
    dplyr::select(c(age, grass)) |> 
    my_glance_data(N = 8, seed = 2016)

#>     obs age grass
#> 1     1  47    NA
#> 2   172  27     1
#> 3   562  70     2
#> 4   898  60     2
#> 5  1019  30     1
#> 6  1176  80     2
#> 7  1505  53    NA
#> 8  1911  54     2
#> 9  2622  62    NA
#> 10 2867  72     2

Listing / Output 1.12: Get GSS data for the year 2016 via the {lodown} package

The result is a pure .rds file where the columns are still of class “haven_labelled, vctrs_vctr, double” as in Listing / Output 1.11.

Finally I will download the 2016 data with the help of the {gssr} package (see Section A.35). This takes some minutes. At first I missed the vignette, so I had to download the package again with the additional argument build_vignettes = TRUE. Whereas the vignette explains how to analyse data the GitHub is very helpful how to get the desired data.

You can quickly get the data for any single GSS year by using gssr::gss_get_yr() to download the data file from NORC and put it directly into a tibble.

After downloaded the file we can — as in the other tabs already done — load the file and glance at the grass/age data.

R Code 1.27 : Get GSS 2016 Year Data Set (cross-section only) and glance at the data

Code

## run only once (manually) ####################
gss_2016_gssr <- gssr::gss_get_yr(year = 2016)
gss_2016_gssr <- gss_2016_gssr |>
   dplyr::select(c(age, grass))

save_data_file("chap01", gss_2016_gssr, "gss_2016_gssr.rds")

Code

## load data ####################
gss_2016_gssr <- base::readRDS("data/chap01/gss_2016_gssr.rds")

## glance at data
gss_2016_gssr |> 
    my_glance_data(N = 8, seed = 2016)

#> # A tibble: 10 × 3
#>      obs age       grass                      
#>    <int> <dbl+lbl> <dbl+lbl>                  
#>  1     1 47        NA(i) [iap]                
#>  2   172 27            1 [should be legal]    
#>  3   562 70            2 [should not be legal]
#>  4   898 60            2 [should not be legal]
#>  5  1019 30            1 [should be legal]    
#>  6  1176 80            2 [should not be legal]
#>  7  1505 53        NA(i) [iap]                
#>  8  1911 54            2 [should not be legal]
#>  9  2622 62        NA(i) [iap]                
#> 10  2867 72            2 [should not be legal]

Listing / Output 1.13: Get GSS 2016 Year Data Set (cross-section only) and glance at the data

Downloading data with {gssr} results in exactly the same format as in listing Listing / Output 1.9 from the manual download. But it has now the advantages from the {gssr} package. For instance with the integrated help it is much easier to

find the variables
to read the question text of the variable
to see in which year the questions was asked
what the code - including the different types of NA’s mean

Six different approaches to get the GSS data

Using the {gssr} packages seems to me by far the best approach.

1.8 Working with Labelled Data

1.8.1 How to work with labelled data? — Resources

During downloading GSS data I mentioned that {haven} imports data as labelled vectors. As a completely new feature for me I looked at the Internet for their advantages and how to work with them. I found four important resources:

Resource 1.3 : How to work with labelled data?

Survey Research Datasets and R

A book accompanying a workshop delivered for the 7th Biennial ACSPRI Social Science Methodology Conference. It provides a practical demonstration of several packages for accessing and working with survey data, associated metadata and official statistics in R. The short book demonstrates

Working with external data sources from common statistical packages (SPSS, SAS, Stata, Excel) and their quirks
Easily working with categorical data in R with the {labelled} R package
Accessing external databases in an R native way using {DBI} and {dbplyr} (The DBI package helps connecting R to database management systems (DBMS). dbplyr is a {dplyr} backend for data bases)
Accessing publicly available data in R scripts via the web
Resources for accessing official statistics data in R

Here I am interested only in the first two demonstration. I will refer especially to the section on Labelled data.

WATCH OUT! {labelled} and {sjlabelled} have similar purposes and functions

{labelled} and {sjlabelled} are very similar. As far as I understand is the main difference that {sjlabelled} supports not only working with labelled data but also offers functions to benefit from these features. {labelled} in contrast manipulates labelled data only as an intermediate data structure that should be converted to common R classes, loosing these kind of meta-data.

1.8.2 Displaying data including labelled vectors

To get comfortable with labelled data Experiment 1.3 I will show how labelled data appear in different viewing functions. To inspect the possibilities of the {labelled} package see Experiment 1.4.

Experiment 1.3 : Different types of data views with labelled data

R Code 1.28 : Display the internal structure of labelled data

Code

utils::str(gss_2016_gssr)

#> tibble [2,867 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ age  : dbl+lbl [1:2867] 47, 61, 72, 43, 55, 53, 50, 23, 45, 71, 33, 86, 32, 6...
#>    ..@ label       : chr "age of respondent"
#>    ..@ format.stata: chr "%29.0g"
#>    ..@ labels      : Named num [1:14] 89 NA NA NA NA NA NA NA NA NA ...
#>    .. ..- attr(*, "names")= chr [1:14] "89 or older" "don't know" "iap" "I don't have a job" ...
#>  $ grass: dbl+lbl [1:2867] NA(i),     1,     2, NA(i),     1,     1, NA(i),     ...
#>    ..@ label       : chr "should marijuana be made legal"
#>    ..@ format.stata: chr "%29.0g"
#>    ..@ labels      : Named num [1:15] 1 2 NA NA NA NA NA NA NA NA ...
#>    .. ..- attr(*, "names")= chr [1:15] "should be legal" "should not be legal" "don't know" "iap" ...

R Code 1.29 : Show attributes of labelled data

Code

base::attributes(gss_2016_gssr$grass)

#> $label
#> [1] "should marijuana be made legal"
#> 
#> $format.stata
#> [1] "%29.0g"
#> 
#> $class
#> [1] "haven_labelled" "vctrs_vctr"     "double"        
#> 
#> $labels
#>               should be legal           should not be legal 
#>                             1                             2 
#>                    don't know                           iap 
#>                            NA                            NA 
#>            I don't have a job                   dk, na, iap 
#>                            NA                            NA 
#>                     no answer                 not imputable 
#>                            NA                            NA 
#>                 not imputable                       refused 
#>                            NA                            NA 
#>                skipped on web                    uncodeable 
#>                            NA                            NA 
#> not available in this release    not available in this year 
#>                            NA                            NA 
#>                  see codebook 
#>                            NA

This works only with vectors (variables), but not with data frames. You can display specific list values with numbers (e.g., attributes(gss_2016_gssr$grass)[[4]]) or names (e.g., attributes(gss_2016_gssr$grass)[["labels"]]).

R Code 1.30 : Show first six records of the labelled variable

Code

head(gss_2016_gssr$grass)

#> <labelled<double>[6]>: should marijuana be made legal
#> [1] NA(i)     1     2 NA(i)     1     1
#> 
#> Labels:
#>  value                         label
#>      1               should be legal
#>      2           should not be legal
#>  NA(d)                    don't know
#>  NA(i)                           iap
#>  NA(j)            I don't have a job
#>  NA(m)                   dk, na, iap
#>  NA(n)                     no answer
#>  NA(p)                 not imputable
#>  NA(q)                 not imputable
#>  NA(r)                       refused
#>  NA(s)                skipped on web
#>  NA(u)                    uncodeable
#>  NA(x) not available in this release
#>  NA(y)    not available in this year
#>  NA(z)                  see codebook

Listing / Output 1.14: Use head() to show the first six records of the labelled variable.

This works only with labelled vectors (variables), but not with data frames.

One immediate advantage of labelled vectors is that value labels are used in data frame printing when using tibble (and by extension the wider tidyverse) and other packages using the pillar printing methods. (Survey Research Datasets and R)

R Code 1.31 : Using tibble resp. pillar printing methods

Code

gss_2016_gssr |> 
    dplyr::count(grass)

#> # A tibble: 3 × 2
#>   grass                           n
#>   <dbl+lbl>                   <int>
#> 1     1 [should be legal]      1126
#> 2     2 [should not be legal]   717
#> 3 NA(i) [iap]                  1024

Listing / Output 1.15: Using tibble resp. pillar printing methods to display NA’s

R Code 1.32 : Get a glimpse of labelled data

Code

dplyr::glimpse(gss_2016_gssr)

#> Rows: 2,867
#> Columns: 2
#> $ age   <dbl+lbl> 47, 61, 72, 43, 55, 53, 50, 23, 45, 71, 33, 86, 32, 60, 76, …
#> $ grass <dbl+lbl> NA(i),     1,     2, NA(i),     1,     1, NA(i),     2, NA(i…

R Code 1.33 : Print values as tibble() with labelled data

Code

base::print(gss_2016_gssr)

#> # A tibble: 2,867 × 2
#>    age       grass                      
#>    <dbl+lbl> <dbl+lbl>                  
#>  1 47        NA(i) [iap]                
#>  2 61            1 [should be legal]    
#>  3 72            2 [should not be legal]
#>  4 43        NA(i) [iap]                
#>  5 55            1 [should be legal]    
#>  6 53            1 [should be legal]    
#>  7 50        NA(i) [iap]                
#>  8 23            2 [should not be legal]
#>  9 45        NA(i) [iap]                
#> 10 71            2 [should not be legal]
#> # ℹ 2,857 more rows

R Code 1.34 : Glance randomly at some records with labelled data

Code

gss_2016_gssr |>
    my_glance_data(N = 8, seed = 2016)

#> # A tibble: 10 × 3
#>      obs age       grass                      
#>    <int> <dbl+lbl> <dbl+lbl>                  
#>  1     1 47        NA(i) [iap]                
#>  2   172 27            1 [should be legal]    
#>  3   562 70            2 [should not be legal]
#>  4   898 60            2 [should not be legal]
#>  5  1019 30            1 [should be legal]    
#>  6  1176 80            2 [should not be legal]
#>  7  1505 53        NA(i) [iap]                
#>  8  1911 54            2 [should not be legal]
#>  9  2622 62        NA(i) [iap]                
#> 10  2867 72            2 [should not be legal]

In RStudio you can see part of the labels by supplying additional information about NAs in R code chunks and variable labels in the RStudio viewer.

(a) Labelled data as a R code chunk result in RStudio

base::summary() with data frames that contain labelled data generates an error when {haven} is not loaded:

! summary.haven_labelled() not implemented.

Labelled meta data not visible in most base R viewing functions

With the exception of base::str() labelled meta data are not visible when viewing at the printed data.

And even more important: You can’t see the different types of NA’s without specialized functions. “Tagged” (labelled) missing values behaves exactly like regular R missing values. For more information see Experiment 1.4.

1.8.3 Working with labelled data

1.8.3.1 Two different approaches

The purpose of the labelled package is to provide functions to manipulate metadata as variable labels, value labels and defined missing values using the haven_labelled and haven_labelled_spss classes introduced in haven package. (Introduction to labelled)

Until now I have not worked in this book with haven_labelled_spss classes. This class comes from imports of SPSS data files via the {haven} package.

This class is only used when user_na = TRUE in read_sav(). It is similar to the labelled() class but it also models SPSS’s user-defined missings, which can be up to three distinct values, or for numeric vectors a range. (Labelled vector for SPSS) ¹

As user-defined missing values are not important here I will stick with STATA imports.

My main interest with Experiment 1.4 is to prepare and/or use the labelled data to work with R. There are principal two approaches:

Procedure 1.4 : Two different procedures to work in R with labelled data

Approach A

In approach A, haven_labelled vectors are converted into factors or into numeric/character vectors just after data import, using unlabelled(), to_factor() or unclass(). Then, data cleaning, recoding and analysis are performed using classic R vector types.

Data import
Conversion to factor / numeric
Data cleaning / data recoding
Data analysis

Approach B

In approach B, haven_labelled vectors are kept for data cleaning and coding, allowing to preserved original recoding, in particular if data should be reexported after that step. Functions provided by labelled will be useful for managing value labels. However, as in approach A, haven_labelled vectors will have to be converted into classic factors or numeric vectors before data analysis (in particular modeling) as this is the way categorical and continuous variables should be coded for analysis functions. (Introduction to labelled)

Data import
Data cleaning / data recoding (-> Data export to SPSS / SAS / STATA )
Conversion to factor / numeric
Data analysis

Graph 1.9: Two different approaches to work in R with labelled data

1.8.3.2 Inspecting labelled data

Experiment 1.4 : Inspecting labelled data

R Code 1.35 : Get variable labels of a labelled data frame

Code

labelled::var_label(gss_2016_gssr)

#> $age
#> [1] "age of respondent"
#> 
#> $grass
#> [1] "should marijuana be made legal"

Listing / Output 1.16: Use labelled::var_label() to get variables labels of a data frame

The labels of the two variable specifies in more detail their content. For the grass variable we get the question asked in the survey. This is very helpful and saves consulting the codebook.

I am using here the short form with labelled::var_label() but there exist also with labelled::get_variable_labels() a long form which has with it plural form a more consistent syntax.

R Code 1.36 : Get value labels of data frame variables

Code

labelled::val_labels(gss_2016_gssr)

#> $age
#>                   89 or older                    don't know 
#>                            89                            NA 
#>                           iap            I don't have a job 
#>                            NA                            NA 
#>                   dk, na, iap                     no answer 
#>                            NA                            NA 
#>                 not imputable                 not imputable 
#>                            NA                            NA 
#>                       refused                skipped on web 
#>                            NA                            NA 
#>                    uncodeable not available in this release 
#>                            NA                            NA 
#>    not available in this year                  see codebook 
#>                            NA                            NA 
#> 
#> $grass
#>               should be legal           should not be legal 
#>                             1                             2 
#>                    don't know                           iap 
#>                            NA                            NA 
#>            I don't have a job                   dk, na, iap 
#>                            NA                            NA 
#>                     no answer                 not imputable 
#>                            NA                            NA 
#>                 not imputable                       refused 
#>                            NA                            NA 
#>                skipped on web                    uncodeable 
#>                            NA                            NA 
#> not available in this release    not available in this year 
#>                            NA                            NA 
#>                  see codebook 
#>                            NA

Listing / Output 1.17: Use labelled::val_labels() to get value labels of data frame variables

Important is the start of the list of value labels.

age: It shows that the value 89 of variable age includes values of 89 and older. This is important for the analysis and saves the work of recoding as done in
grass: For the grass variable we learn that 1 stands for the opinion that Marijuana “should be legal” and two for the opposite.

What follows is a comprehensive list of NA values used in the survey, even if many of these values are not used for the two variables considered here.

We know from R Code 1.31 that the grass variable has 1024 NA’s, but we do not know their composition of different NA values. See Listing / Output 1.19 how to get this information.

WATCH OUT! Inconsistency in the function names

There is an inconsistency in the singular vs. plural form between labelled::var_label() and labelled::val_labels(). Both functions show a list of values (= plural) if there are more than one label available.

For value labels exist also a singular version with labelled::val_label() to get a specific value label of a variable.

R Code 1.37 : Using utils::head() on a labelled variable after {labelled} is loaded

Code

## {**labelled**} is loaded through a call in the previous chunk
utils::head(gss_2016_gssr$grass, 10)

#> <labelled<double>[10]>: should marijuana be made legal
#>  [1] NA(i)     1     2 NA(i)     1     1 NA(i)     2 NA(i)     2
#> 
#> Labels:
#>  value                         label
#>      1               should be legal
#>      2           should not be legal
#>  NA(d)                    don't know
#>  NA(i)                           iap
#>  NA(j)            I don't have a job
#>  NA(m)                   dk, na, iap
#>  NA(n)                     no answer
#>  NA(p)                 not imputable
#>  NA(q)                 not imputable
#>  NA(r)                       refused
#>  NA(s)                skipped on web
#>  NA(u)                    uncodeable
#>  NA(x) not available in this release
#>  NA(y)    not available in this year
#>  NA(z)                  see codebook

Using utils::head() on a variable when {labelled} is loaded prints a nicely formatted summary of the attached metadata, excluding formats. It shows all NA value labels in the data frame used in the survey, even only some of them is used here. (See Listing / Output 1.19 to get type of NA’s used for a variable.)

We see in the first 10 records of the variable grass

three 1 (“should be legal”)
three 2 (“should not be legal”) and
four NA(i) (“iap”)

Until now it is not clear what “iap” for the NA(i) type means. If we want download (STATA) files directly from the GSS website we see on the right side a note about the most important NA types. It explains the the abbreviation “iap” stands for “inapplicable”.

An "Alerts" explains that there are three main missing values in the data; ".i": Inapplicable (IAP). Respondents whor are not asked to answer a specific question are assigned to IAP.".d": Don't know (DK). ".n": No answer (NA). — Graph 1.10: GSS Information about the most important NA types

Two different meanings of “NA”

With the NA(n) type of missing value “NA” does not mean “Not Applicable” but “No answer”.

Returning only the NA(i) type of a missing value does not mean that there aren’t any other NA types in the data. (See Listing / Output 1.18 to get the types of tagged NA’s of a variable.)

This formatted result of labelled data is better readable as with Listing / Output 1.17, because category and value are in the same line. Compare the result of this code ({labelled} is loaded) with Listing / Output 1.14 ({labelled} is not loaded).

haven::print_labels() as equivalent of head() with {labelled} loaded

With the exception of the specified rows in the head() function, you can get the same nicely formatted list of NA types with haven::print_labels(). (See next tab “Print labels”.)

R Code 1.38 : Print labels

Code

haven::print_labels(gss_2016_gssr$grass)

#> 
#> Labels:
#>  value                         label
#>      1               should be legal
#>      2           should not be legal
#>  NA(d)                    don't know
#>  NA(i)                           iap
#>  NA(j)            I don't have a job
#>  NA(m)                   dk, na, iap
#>  NA(n)                     no answer
#>  NA(p)                 not imputable
#>  NA(q)                 not imputable
#>  NA(r)                       refused
#>  NA(s)                skipped on web
#>  NA(u)                    uncodeable
#>  NA(x) not available in this release
#>  NA(y)    not available in this year
#>  NA(z)                  see codebook

For a detailed explication see previous tab “head()”.

R Code 1.39 : Print unique tagged NA’s

Code

labelled::unique_tagged_na(gss_2016_gssr$grass) |> 
    labelled::sort_tagged_na() |> 
    haven::print_tagged_na()

#> [1]     1     2 NA(d) NA(i) NA(n)

Listing / Output 1.18: Show the unique types of NA’s of a variable

We see that variable grass has three types of NA’s:

NA(d): Don’t know. NA(i): Inapplicable, e.g., these respondents were not asked the marijuana question. NA(n): No answer

Here I have used haven::print_tagged_na() to format tagged NA’s as NA(a), NA(b), etc. (See “Tagged” missing values)

It would be interesting to know the composition of these different NA types. One could reason that there is a big difference between “question not asked” and “no answer”. (See Listing / Output 1.19 to get the composition of the NA types.)

R Code 1.40 : Get the composition of the different types of tagged NA’s

Code

gss_2016_gssr |>
  dplyr::count(
    grass,
    haven::na_tag(grass)
  )

#> # A tibble: 5 × 3
#>   grass                       `haven::na_tag(grass)`     n
#>   <dbl+lbl>                   <chr>                  <int>
#> 1     1 [should be legal]     <NA>                    1126
#> 2     2 [should not be legal] <NA>                     717
#> 3 NA(d) [don't know]          d                        110
#> 4 NA(i) [iap]                 i                        911
#> 5 NA(n) [no answer]           n                          3

Listing / Output 1.19: Get the composition of the different types of tagged NA’s

We already know from Listing / Output 1.15 that the variable grass has 1024 NA’s. Now we know also the type composition of these NA’s.

I could not find a similar function in {labelled}, so I have used here haven::na_tag().

1.8.3.3 Working with labelled data

There are two ways to work with labelled data in R:

The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate datastructure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels ({Haven} vignette Conversion semantics)

Experiment 1.5 : How to convert labelled data for data analysis in R

The second variable grass is the labelled original variable. The third variable (last column to the right) is the modified grass variable.

R Code 1.41 : Convert labelled vectors to factors

Code

gss_2016_grass <- gss_2016_gssr |> 
    dplyr::select(grass)

gss_2016_grass |> 
    dplyr::mutate(grass_factor = haven::as_factor(grass)) |> 
    my_glance_data(N = 8, seed = 42)

#> # A tibble: 10 × 3
#>      obs grass                       grass_factor       
#>    <int> <dbl+lbl>                   <fct>              
#>  1     1 NA(i) [iap]                 iap                
#>  2   634 NA(i) [iap]                 iap                
#>  3  1098     1 [should be legal]     should be legal    
#>  4  1152     1 [should be legal]     should be legal    
#>  5  1177 NA(i) [iap]                 iap                
#>  6  1252 NA(i) [iap]                 iap                
#>  7  2097     1 [should be legal]     should be legal    
#>  8  2369     1 [should be legal]     should be legal    
#>  9  2609     1 [should be legal]     should be legal    
#> 10  2867     2 [should not be legal] should not be legal

Here I am using haven::as_factor(). This function also knows how to re-label missing values.

R Code 1.42 : Remove labels

Code

gss_2016_grass |> 
    dplyr::mutate(grass_unlabelled = haven::zap_labels(grass)) |> 
    my_glance_data(N = 8, seed = 42)

#> # A tibble: 10 × 3
#>      obs grass                       grass_unlabelled
#>    <int> <dbl+lbl>                              <dbl>
#>  1     1 NA(i) [iap]                               NA
#>  2   634 NA(i) [iap]                               NA
#>  3  1098     1 [should be legal]                    1
#>  4  1152     1 [should be legal]                    1
#>  5  1177 NA(i) [iap]                               NA
#>  6  1252 NA(i) [iap]                               NA
#>  7  2097     1 [should be legal]                    1
#>  8  2369     1 [should be legal]                    1
#>  9  2609     1 [should be legal]                    1
#> 10  2867     2 [should not be legal]                2

R Code 1.43 : Convert special missings to regular R missings

Code

gss_2016_grass |> 
    dplyr::mutate(grass_rgular_na = haven::zap_missing(grass)) |> 
    my_glance_data(N = 8, seed = 42)

#> # A tibble: 10 × 3
#>      obs grass                       grass_rgular_na         
#>    <int> <dbl+lbl>                   <dbl+lbl>               
#>  1     1 NA(i) [iap]                 NA                      
#>  2   634 NA(i) [iap]                 NA                      
#>  3  1098     1 [should be legal]      1 [should be legal]    
#>  4  1152     1 [should be legal]      1 [should be legal]    
#>  5  1177 NA(i) [iap]                 NA                      
#>  6  1252 NA(i) [iap]                 NA                      
#>  7  2097     1 [should be legal]      1 [should be legal]    
#>  8  2369     1 [should be legal]      1 [should be legal]    
#>  9  2609     1 [should be legal]      1 [should be legal]    
#> 10  2867     2 [should not be legal]  2 [should not be legal]

R Code 1.44 : Remove variable labels from a data frame

Code

print("*******   Original data frame   ********")
labelled::var_label(gss_2016_gssr)

print("*******   Data frame after variable zapped   *******")
gss_2016_gssr |> 
    haven::zap_label() |> 
    labelled::var_label()

#> [1] "*******   Original data frame   ********"
#> $age
#> [1] "age of respondent"
#> 
#> $grass
#> [1] "should marijuana be made legal"
#> 
#> [1] "*******   Data frame after variable zapped   *******"
#> $age
#> NULL
#> 
#> $grass
#> NULL

You can combine all three zap-commands.

1.9 Table excursion

1.9.1 Review several table packages

Creating tables is an important result of preparing data — the main subject of this section. In my research I came up with eight packages that seems to me the most important ones. What follows is a short introduction into these packages.

Experiment 1.6 : Review table packages: {DT}, {flextable}, {gt}, {gtsummary}, {htmlTable}, {kableExtra}, {tableone}, {xtable}

{DT}

DT is an abbreviation of ‘DataTables.’ Data objects in R can be rendered as HTML tables using the JavaScript library ‘DataTables’ (typically via R Markdown or Shiny).

This R package provides an enhanced version of data.frames with more options for authors and users:

Authors: DT supports data wrangling beside subsetting rows and selecting columns (e.g., computing on columns, sort, grouping). I am not sure if this is really an advantage because it requires the somewhat complex square brackets notation. Using {dplyr} with other {tidyverse} packages seems — for me at least — easier to use. Users: The really advantage of {DT} in my opinion is that the interface allows users to change values and display formats interactively. There are some option in the standard installations and much more with extensions.

Standard: Out of the box {DT} allows user to
- search,
- sort,
- choose the number of rows shown per page,
- filter columns
- add/remove row names and even
- edit cells individually.
Extensions: There are also some extension to enhance the interactive options: Extensions allow users to
- edit groups of cells,
- provide buttons for copying or downloading the table in different formats,
- to re-order and/or hide columns,
- to fix columns and/or header if the table is too wide and/or to long and you have to scroll,
- moving around with arrow keys like in Excel tables,
- making the table responsive to smaller screens by collapsing columns,
- grouping values by a column,
- reordering rows,
- scrolling through the rows and
- selecting parts of variables interactively.

Remark 1.1. My conclusion to {DT}

{DT} is my first choice for interactive data exploring and very useful for Explorative Data Analysis (EDA). I have already some experiences with this package but haven’t use it lastly.

Compared with the other table packages {DT} is not in the same category: The following packages are developed as Display Tables (in contrast to Data Tables. Beside of the interactive use {DT}s are Data Tables.

{flextable}

This package helps you to create reporting table from a data frame easily. You can

merge cells,
add headers,
add footers,
change formatting,
set how data in cells is displayed
Table content can also contain mixed types of text and image content.
Tables can be embedded from R Markdown documents into HTML, PDF, Word, and PowerPoint documents and can be embedded using Package Officer for Microsoft Word or PowerPoint documents.
Tables can also be exported as R plots or graphic files, e.g., png, pdf, and jpeg.

It is said that {flextable} — together with the sparsely used {huxtable} — “supports the widest range of output formats … [both] support HTML, $\LaTeX$, and Office formats, and contain most common table features (e.g., conditional formatting).” (Yihui: R Markdown Coobook)

Remark 1.2. My conclusion to {flextable}

{flextable} seems pretty well suited to create sophisticated tables for publications. It is often especially mentioned for exporting into Word files. Maybe I will look into this package in the near future, but currently my priority is {gtsummary}.

{gt}

{gt} stands for “grammar of tables”. It aims to do for tables what {ggplot2} did for graphics. (Hadley Wickham on X)

{gt} offers a different and easy-to-use set of functions that helps to build display tables from tabular data. The {gt} philosophy states that a comprehensive collection of table parts can be used to create a broad range of functional tables.

It allows you to compose a table by putting together different parts of the table, such as

the table header (title and subtitle),
the column labels,
the table body,
row group labels,
and the table footer.

Some parts are optional. You can also format numbers and add background shading to cells.

At the beginning of the development process {gt} supported only HTML, but now you can also use it for LATEX and RTF outputs.

{gt} is developed by the RStudio/Posit team, that stands not only for high quality but also for {tidyverse} compatibility. It was a little surprise for me that the download statistics does not reflect the normally very high figures of RStudio/Posit products. After trying to work with {gt} I believe I know the reasons:

With {gt} you have big flexibility but you have to provide all the details. This could be cumbersome, especially when other table packages are providing not only theses details but incorporate also many other features as calculations, tests etc. {gt} is therefore better used in combination with other packages like {gtsummary}.
{gt} is still very early in the development process. The issue tracker show this in the relation of open / closed issue: 272 open / 633 closed (2024-01-18).

Remark 1.3. My conclusion to {gt}

I tried to work with {gt} but it is cumbersome. I noticed a bug even following the official online tutorial. Maybe {gt} (together with {gtsummary} is better suited for creating easily complex tables in many formats.

The {gtsummary} package provides an efficient way to create publication-ready analytical and summary tables using the R programming language. {gtsummary} summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities.

{gtsummary} “was written to be a companion to the {gt} package from RStudio. But not all output types are supported by the {gt} package. Therefore, [the developers] have made it possible to print {gtsummary} tables with various engines.” (gtsummary + Quarto/R Markdown)

Table of summaries for output formats: The table header reads from left to the right: Print Engine, Function, HTML, Word, PDF, RTF. As print engines feature in this table: gt, flextable, huxtable, kableExtra, kable and tibble. All functions start with 'as_' followed by the name of the package, e.g., as_gt(), as_flex_table(), as_kable_extra() etc. The table states that gt is still working on RTF output, flextable and kableExtra does not support RTF, kableExtra also has no output format for Word. Kable has all four options in a very reduced form (missing indentation, footnotes and spanning headers) and tibble fails on all four formats. — Graph 1.11: Summary of the various Quarto and R Markdown output types and the print engines supporting them

Remark 1.4. My conclusion to {gtsummary}

I have tried out {gtsummary} and I am enthusiastic about it! I wrote an introduction with some examples (Section 1.9.2) Now that I understand how it works I will definitely use it. (By the way: Besides the extensive documentation I found that the most effective introduction for me was watching the Youtube video presentation by the developer Daniel D. Sjoberg).

{htmlTable}

I found {htmlTable} per accident searching the Internet with “r creating complex table”. This package is not mentioned in the literature I have read so far, but it seems not only able to produce complex tables but is also used quite often. Maybe it is the HTML limitation which reduces it application for Online journals because it requires a separate step to produce $\LaTeX$ tables ready for printing.

It is a pretty old package (version 1.0 appeared already in 2014) but still currently in active development. The last change today (2024-01-18) was version 2.4.2 from 2023-10-29.

From the package documentation:

Advanced Tables for Markdown/HTML

Tables with state-of-the-art layout elements such as row spanners, column spanners, table spanners, zebra striping, and more. While allowing advanced layout, the underlying css-structure is simple in order to maximize compatibility with common word processors. The package also contains a few text formatting functions that help outputting text compatible with HTML/LaTeX.

Beside the extensive vignette documentation there are also some blog articles:

Remark 1.5. My conclusion to {htmlTable}

{htmlTable} seems an interesting package not only for creating complex tables but also for exporting them into Word. But to learn this package is at the moment not on my task list.

{kableExtra}

{kableExtra} extends the basic functionality of knitr::kable() tables. Although knitr::kable() is simple by design, it has many features missing which are usually available in other packages. {kableExtra} has filled the gap nicely. One of the best thing about {kableExtra} is that most of its table capabilities work for both HTML and PDF formats.

Remark 1.6. My conclusion to {**kableExtra*}

I have already some experiences with {kableExtra}. I used it quasi as an add-on to knitr::kable() during personal HTML publications (Blog, bookdown, Quarto). I believe that other “Table 1” packages like {gtsummary} or maybe also {tableOne) are better suited for display and publication.

{tableone}

The first time I learned about the existence of {tableone} was in Chapter 2 of Statistics with R. Consulting the online documentation it says:

Creates “Table 1”, i.e., description of baseline patient characteristics, which is essential in every medical research. Supports both continuous and categorical variables, as well as p-values and standardized mean differences. Weighted data are supported via the {survey} package. … Most important functions are CreateTableOne() and svyCreateTableOne().

Remark 1.7. My conclusion to {tableone}

The package is completely new for me. I have to check it out to see how and when to use. But my priority now is {gtsummary} because

{gtsummary} is much more used than {tableone}: over 2000 versus under 500 downloads per day
last release of {gtsummary} was version 1.7.2 from July 15, 2023 versus release 26 from July 26, 2020.

{xtable}

{xtable} is the oldest package and still the table package most used! From the documentation it is clear that it is very useful for the creation of publication ready tables. But I am little hesitant to learn it, because it is not mentioned in current literature and the last change was in April 2019.

Remark 1.8. My conclusion to {xtable}

Learning {xtable} is not on my agenda: It is not clear for me if {xtable} is still in developments and if it is supporting newer approaches. Learning other table packages, especially {gtsummary} probabily together with {gt} has priority for me.

There are other packages as well. But the above eight packages are a first starting point for everybody to learn creating and displaying sophisticated data tables in R. It is good to know about different packages but lastly one has to decide which one to learn and use in daily practice. At the moment my favorites is {gtsummary}.

Resource 1.4 : Other interesting table packages, but currently not considered for my personal purposes

Just for a possible later reference I will add some other table packages I know of.

{reactable}: reactable() creates a data table from tabular data with sorting and pagination by default. The data table is an HTML widget that can be used in R Markdown documents and Shiny applications or viewed from an R console. It is based on the React Table library and made with reactR. Features are:
- It creates a data table with sorting, filtering, and pagination
- It has built-in column formatting
- It supports custom rendering via R or JavaScript
- It works seamlessly within R Markdown documents and the Shiny app
{ractablefmtr}: The package improves the appearance and formatting of tables created using the reactable R library. It includes many conditional formatters that are highly customizable and easy to use.

The authors of R Markdown Cookbook (Yihui Xie, Christophe Dervieux, Emily Riederer) mention also several other table packages in the section Other packages for creating tables:

{rhandsontable}: Interface to the Handsontable.js Library.). Similar to {DT}, and has an Excel feel (e.g., you can edit data directly in the table). Visit https://jrowen.github.io/rhandsontable/ to learn more about it.
{pixiedust}: Tables so Beautifully Fine-Tuned You Will Believe It’s Magic: Features creating tables for models (such as linear models) converted through the {broom} package (Broom: Convert Statistical Objects into Tidy Tibbles). It supports Markdown, HTML, and $\LaTeX$ output formats.
stargazer: (beautiful $\LaTeX$, HTML and ASCII tables from R statistical output): Features formatting regression models and summary statistics tables.
xtable (Dahl et al. 2019): Perhaps the oldest package for creating tables—the first release was made in 2000. It supports both $\LaTeX$ and HTML formats.

I’m not going to introduce the rest of packages, but will just list them here:

tables: Formula-Driven Table Generation),
pander (Daróczi and Tsegelskyi 2022),
tangram (S. Garbett 2023),
ztable (Moon 2021), and
condformat (Introduction to condformat).

Summary of my table packages review

It does not make sense to learn every package. SO I will concentrate on five packages with different usage:

{DT}, see: Section A.19: An important package for interactive HTML tables. It also manages quite good big datasets
{gt}, see Section A.36: Although this package is still lacking many features in follows a systematic approach (“Grammar of Tables”) that allows to make fine grained changes. It is there somewhat cumbersome to apply it in everyday use. But it is especially valuable when you have another package for the standard features and you need some specialized change. I plan to use {gt} together with {gtsummary}.
{gtsummary}, see: Section A.37: I just learned this package and I am enthusiastic about it! I will use it as replacement for {tableone} that is used by Harris for the book to prepare publishing ready Table 1 tables.
{kableExtra}, see: Section A.42: It uses the kable() function from {knitr} (see: Section A.43), which is an essential part of the publishing infrastructure for literate programming. It is pretty flexible and could be used in tandem with {gtsummary}.

For simple tables one can also use knitr::kable() or just printing the tibble.

As a consequence of my table packages review I experimented with the {gtsummary}. I wanted to create a table where I have information by the variables grass, age in four categories and sex with all their levels in absolute numbers and in percents.

1.9.2 Experiments with the gtsummary package

Experiment 1.7 : Experiments with {gtsummary}

Display dataset used for the {gtsummary} introduction.
A very basic summary table without any adaptions or modifications.
Changing dichotomous to a categorical variable.
Ditto with a different method.
Changing variable labels with two different methods.
Output grouped by the grass variable.
Header modified.
two cross tables of summary statistics stacked.
Table stratified by the grass variable.

R Code 1.45 : Data used in this {gtsummary} introduction

Code

nhanes_2013_clean <- 
    base::readRDS("data/chap01/nhanes_2013_clean.rds")

gtsummary_basic <- nhanes_2013_clean |> 
 
    ## cut age into several defined age groups
    dplyr::mutate(age_cat = base::cut(age,
                   breaks = c(-Inf, 29, 39, 49, 59),
                   labels = c("18 - 29", "30 - 39", "40 - 49", "50 - 59"))) |>
    dplyr::select(-c(nr, age))

save_data_file("chap01", gtsummary_basic, "gtsummary_clean.rds")

base::summary(gtsummary_basic)

#>  grass          sex          age_cat    
#>  Yes:1991   Male  :1792   18 - 29:1154  
#>  No :1699   Female:1898   30 - 39: 859  
#>                           40 - 49: 855  
#>                           50 - 59: 822

Code

gtsummary_basic

#> # A tibble: 3,690 × 3
#>    grass sex    age_cat
#>    <fct> <fct>  <fct>  
#>  1 Yes   Male   50 - 59
#>  2 Yes   Male   50 - 59
#>  3 Yes   Female 50 - 59
#>  4 No    Female 18 - 29
#>  5 No    Female 30 - 39
#>  6 Yes   Male   30 - 39
#>  7 No    Male   18 - 29
#>  8 Yes   Female 30 - 39
#>  9 No    Male   50 - 59
#> 10 Yes   Male   18 - 29
#> # ℹ 3,680 more rows

I am going to use this dataset to experiment with {gtsummary}.

R Code 1.46 : A very basic summary without any adaptions or modifications

Code

gtsummary_basic <- 
    base::readRDS("data/chap01/gtsummary_clean.rds")

gtsummary_basic |> 
    gtsummary::tbl_summary()

Characteristic	N = 3,690¹
grass	1,991 (54%)
sex
Male	1,792 (49%)
Female	1,898 (51%)
age_cat
18 - 29	1,154 (31%)
30 - 39	859 (23%)
40 - 49	855 (23%)
50 - 59	822 (22%)
¹ n (%)

This is a very basic summary without any adaptions or modifications. It is strange, that there are for grass not separated levels presented. From a video of an introduction to {gtsummary} by the package author Daniel D. Sjoberg I learned that a categorical variable with two levels is interpreted as dichotomous whenever the two levels are labelled 1 and 0, true and false or Yes and No. And dichotomous variable are different as other categorical variables as they are reported by {gtsummary} only with one number (the 1, true or yes value).

Here I am going to change level names of the grass variable to prevent that it is interpreted as a dichotomous variable.

R Code 1.47 : Numbered R Code Title

Code

gtsummary_basic2 <- 
    base::readRDS("data/chap01/gtsummary_clean.rds") |> 
    dplyr::mutate(grass = 
      forcats::fct_recode(grass,
                          "used" = "Yes", 
                          "not used" = "No")
            ) 

gtsummary_basic2 |>
    gtsummary::tbl_summary()

Characteristic	N = 3,690¹
grass
used	1,991 (54%)
not used	1,699 (46%)
sex
Male	1,792 (49%)
Female	1,898 (51%)
age_cat
18 - 29	1,154 (31%)
30 - 39	859 (23%)
40 - 49	855 (23%)
50 - 59	822 (22%)
¹ n (%)

This time we can see both levels of grass. But we loose the nice Yes/No answer of the survey.

Another strategy to prevent a dichotomous interpretation is to tell {gtsummary} expressively with the type = argument how a variable has to be interpreted.

R Code 1.48 : Preventing a dichotomous interpretation

Code

gtsummary_basic |> 
    gtsummary::tbl_summary(
        type = grass ~ "categorical"
    )

Characteristic	N = 3,690¹
grass
Yes	1,991 (54%)
No	1,699 (46%)
sex
Male	1,792 (49%)
Female	1,898 (51%)
age_cat
18 - 29	1,154 (31%)
30 - 39	859 (23%)
40 - 49	855 (23%)
50 - 59	822 (22%)
¹ n (%)

The same output as in the previous tab, but now with the desired Yes/No answer categories.

If there are labelled data {gtsummary} uses them as variable labels. This is a big advantages, especially if the data are already labelled and imported with {haven}. But you can use the {labelled} package to label the variables.

If there are no labelled metadata then the variable name is taken, e.g., in most of the cases you have to provide a better name for the table display.

I am using both methods in the following code chunk: I provided labelled metadata for the grass variable but used new variable names for age_cat and sex.

R Code 1.49 : Changing variable labels

Code

## variant 1 to label variables
# labelled::var_label(
#     gtsummary_basic$grass) <- "Ever used grass or hash?"

## variant 2 to label variables
gtsummary_basic |> 
    labelled::set_variable_labels(
        grass = "Ever used grass or hash?"
    ) |> 
    gtsummary::tbl_summary(
        # by = grass,
        type = grass ~ "categorical",
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        )
    )

Characteristic	N = 3,690¹
Ever used grass or hash?
Yes	1,991 (54%)
No	1,699 (46%)
Sex
Male	1,792 (49%)
Female	1,898 (51%)
Age Category
18 - 29	1,154 (31%)
30 - 39	859 (23%)
40 - 49	855 (23%)
50 - 59	822 (22%)
¹ n (%)

R Code 1.50 : Summarize data by a grouped variable

Code

gtsummary_group <-  gtsummary_basic |> 
    labelled::set_variable_labels(
        grass = "Ever used grass or hash?"
    ) |>
    gtsummary::tbl_summary(
        # by = grass,
        type = grass ~ "categorical",
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        )
    )

gtsummary_group

Characteristic	N = 3,690¹
Ever used grass or hash?
Yes	1,991 (54%)
No	1,699 (46%)
Sex
Male	1,792 (49%)
Female	1,898 (51%)
Age Category
18 - 29	1,154 (31%)
30 - 39	859 (23%)
40 - 49	855 (23%)
50 - 59	822 (22%)
¹ n (%)

The variable grass is not grouped dplyr::group_by() but inside gtsummary::tbl_summary().

R Code 1.51 : Change table header

Code

gtsummary_group |> 
  gtsummary::modify_header(label ~ "**Variable**") |> 
  gtsummary::modify_spanning_header(
      gtsummary::show_header_names(gtsummary_group) ~ "**Ever used grass or hash?**")

#> Column Name   Header                 level*          N*            n*            p*         
#> label         "**Characteristic**"   Overall <chr>   3,690 <int>   3,690 <int>   1 <dbl>    
#> stat_0        "**N = 3,690**"        Overall <chr>   3,690 <int>   3,690 <int>   1 <dbl>

#> * These values may be dynamically placed into headers (and other locations).
#> ℹ Review the `modify_header()` (`?gtsummary::modify()`) help for examples.

Variable	N = 3,690¹
Ever used grass or hash?
Yes	1,991 (54%)
No	1,699 (46%)
Sex
Male	1,792 (49%)
Female	1,898 (51%)
Age Category
18 - 29	1,154 (31%)
30 - 39	859 (23%)
40 - 49	855 (23%)
50 - 59	822 (22%)
¹ n (%)

R Code 1.52 : Prepare cross tables and stack them

Code

gtsummary_basic <- 
    base::readRDS("data/chap01/gtsummary_clean.rds")

gtsummary_cross1 <- gtsummary_basic |> 
    labelled::set_variable_labels(
            grass = "**Ever used grass or hash?**"
        ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = age_cat,
        col = grass,
        percent = "cell"
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    ) 

gtsummary_cross2 <- gtsummary_basic |> 
    labelled::set_variable_labels(
            grass = "**Ever used grass or hash?**"
        ) |> 
    gtsummary::tbl_cross(
        label = list(
            age_cat = "Age Category",
            sex = "Sex"
        ),
        row = sex,
        col = grass,
        percent = "cell"
    ) |> 
    gtsummary::modify_header(
      stat_1 = '**Yes**',
      stat_2 = '**No**',
      stat_0 = '**Total**'
    )

gtsummary::tbl_stack(
    list(gtsummary_cross1,
         gtsummary_cross2)
    ) |> 
    gtsummary::bold_labels()

	Ever used grass or hash?		Total
	Yes	No	Total
Age Category
18 - 29	662 (18%)	492 (13%)	1,154 (31%)
30 - 39	466 (13%)	393 (11%)	859 (23%)
40 - 49	405 (11%)	450 (12%)	855 (23%)
50 - 59	458 (12%)	364 (9.9%)	822 (22%)
Total	1,991 (54%)	1,699 (46%)	3,690 (100%)
Sex
Male	1,072 (29%)	720 (20%)	1,792 (49%)
Female	919 (25%)	979 (27%)	1,898 (51%)
Total	1,991 (54%)	1,699 (46%)	3,690 (100%)

R Code 1.53 : Stratify tables

Code

gtsummary_basic <- 
    base::readRDS("data/chap01/gtsummary_clean.rds")

gtsummary_basic |> 
    gtsummary::tbl_strata(
        strata = grass,
        ~ .x |> 
    gtsummary::tbl_summary(
        by = sex,
        label = age_cat ~ "Age Category"
        ) |> 
    gtsummary::add_overall() |> 
    gtsummary::modify_header(
        label = "**Variables**",
        gtsummary::all_stat_cols() ~ "**{level}**"
        )
    ) |> 
  gtsummary::bold_labels()

Table 1.3: Did you ever use marijuana or hashish?

Variables	Yes			No
Variables	Overall¹	Male¹	Female¹	Overall¹	Male¹	Female¹
Age Category
18 - 29	662 (33%)	360 (34%)	302 (33%)	492 (29%)	208 (29%)	284 (29%)
30 - 39	466 (23%)	267 (25%)	199 (22%)	393 (23%)	163 (23%)	230 (23%)
40 - 49	405 (20%)	202 (19%)	203 (22%)	450 (26%)	198 (28%)	252 (26%)
50 - 59	458 (23%)	243 (23%)	215 (23%)	364 (21%)	151 (21%)	213 (22%)
¹ n (%)

1.10 Opinions about legalizing marijuana (1973-2022)

Experiment 1.8 : Developing a graph showing the timeline of respondents opinion about legalizing marijuana (1973-2022)

R Code 1.54 : Load the GSS data from 1972-2022 and select appropriate variables

Code

## run only once (manually) #########

library(gssr)   ## load package
data(gss_all)   ## attach GSS data 
data(gss_dict)  ## attach codebook
gss_grass_legal <- gss_all |> 
    dplyr::select(year, age, grass) 

save_data_file("chap01", gss_grass_legal, "gss_grass_legal.rds")

base::unloadNamespace("gssr")

(This R code chunk has no visible output)

R Code 1.55 : Clean grass-legal dataset

Code

## run only once (manually) ##########
gss_grass_legal <- 
    base::readRDS("data/chap01/gss_grass_legal.rds")

gss_grass_legal_clean <- gss_grass_legal |> 
    tidyr::drop_na() |> 
    dplyr::mutate(
        grass = forcats::as_factor(grass),
        grass = forcats::fct_drop(grass),
        year = base::as.numeric(year),
        age = base::as.numeric(age)
    )

save_data_file("chap01", gss_grass_legal_clean, "gss_grass_legal_clean.rds")

(This R code chunk has no visible output)

R Code 1.56 : Show summary of cleaned dataset

Code

grass_legal <- 
    base::readRDS("data/chap01/gss_grass_legal_clean.rds")

## remove comma from big numbers
gtsummary::theme_gtsummary_language("en", big.mark = "")


grass_legal |> 
  gtsummary::tbl_summary(
    label = list(
        year ~ "Year",
        age ~ "Age",
        grass ~ "Should marijuana be legal?"
    ),
    type = list(age ~ "continuous2", year ~ "continuous2"),
    statistic = 
         list(
            age ~ c(
      "{mean} ({sd})",
      "{median} ({p25}, {p75})",
      "{min}, {max}"
         ), 
            year ~ c("{min}, {max}")
    )
  ) |> 
  gtsummary::italicize_levels() |> 
  gtsummary::bold_labels()

Table 1.4: Summary of GSS surveys about opinion about legalizing marijuana between 1972-2022

Characteristic	N = 38479¹
Year
Min, Max	1973, 2022
Age
Mean (SD)	46 (18)
Median (Q1, Q3)	43 (31, 59)
Min, Max	18, 89
Should marijuana be legal?
should be legal	12589 (33%)
should not be legal	25890 (67%)
¹ n (%)

WATCH OUT! Summary with labelled data

I used for the summary the labelled data from the {gssr} package. This has the big advantage that I didn’t recode the variable names and could use the variable and value metadata!

But {gtsummary} warned me to see that this is only “an intermediate data structure not meant for analysis”.

Column(s) age are class “haven_labelled”. This is an intermediate datastructure not meant for analysis. Convert columns with haven::as_factor(), labelled::to_factor(), labelled::unlabelled(), and unclass(). “haven_labelled” value labels are ignored when columns are not converted. Failure to convert may have unintended consequences or result in error.

https://haven.tidyverse.org/articles/semantics.html

https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled

R Code 1.57 : Prepare data for timeline graph

Code

grass_legal <- 
    base::readRDS("data/chap01/gss_grass_legal_clean.rds")


grass_legal2 <- grass_legal |> 
    # labelled::remove_labels() |> 
    dplyr::group_by(year) |> 
    dplyr::add_count(name = "N") |> 
    dplyr::add_count(grass) |> 
    dplyr::mutate(perc = n / N * 100) |> 
    dplyr::ungroup()

(This R code chunk has no visible output)

R Code 1.58 : Timeline of respondents opinion about legalizing marijuana (1973-2022)

Code

grass_legal2 |> 
    # dplyr::group_by(year) |> 
    ggplot2::ggplot(ggplot2::aes(
        x = year, 
        y = perc,
        color = grass)) +
    ggplot2::geom_point() + 
    ggplot2::theme_bw() +
    ggplot2::scale_color_manual(
        values = c("#79a778", '#7662aa')
        ) +
    ggplot2::scale_x_continuous(
        breaks = c(1973, seq(1978, 2022, 4), 2022)
        ) +
    ggplot2::theme(axis.text.x =
          ggplot2::element_text(angle = 45, vjust = 0.5)) +
    ggplot2::labs(
        x = "Should marijuana be legal?",
        y = "Percent of responses",
        color = "Marijuana") +
    ggplot2::theme(legend.position = c(.82, .85))

#> Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
#> 3.5.0.
#> ℹ Please use the `legend.position.inside` argument of `theme()` instead.

Graph 1.12: Timeline of respondents opinion about legalizing marijuana between 1972-2022 of the GSS data

Graph 1.12 shows that the proportion of people that support legalizing marijuana is (with the exception of 1978-1990) continuously rising. Starting from 19% support in 1973 approval hits 70% in 2022, the last date where data are available. The turning point where the proportion of affirmation was the first time greater than the disapproval was in 2014 (55/45).

Glossary

term	definition
ACSPRI	Australian Consortium for Social and Political Research Inc. oganizes conferences and delivers course. (<a href="https://www.acspri.org.au/">ACSPRI</a>)
CSV	Text files where the values are separated with commas (Comma Separated Values = CSV). These files have the file extension .csv
Data Tables	Data Tables are --- in contrast to Display Tables --- for analysis and not for presentation. Another purpose of Data Tables is to store, retrieve and share information digitally. In R they are tibbles, data.frames etc.
Display Tables	Display Tables are in constrast to Data Tables. You would find them in a web page, a journal article, or in a magazine. Such tables are for presentation, often to facilitate the construction of “Table 1”, i.e., baseline characteristics table commonly found in research papers. (<a href= "https://gt.rstudio.com/articles/gt.html">Introduction to Creating gt Tables</a>)
EDA	Explorative Data Analysis is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling and thereby contrasts traditional hypothesis testing. (<a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">Wikipedia</a>)
General Social Survey	A large survey of a sample of people in the United States conducted regularly since 1972; the General Social Survey is abbreviated GSS and is conducted by the National Opinion Research Center at the University of Chicago. (Harris, Glossary)
GSS	A large survey of a sample of people in the United States conducted regularly since 1972; the General Social Survey is abbreviated GSS and is conducted by the National Opinion Research Center at the University of Chicago. (Harris, Glossary)
Literate Programming	Literate programming is a methodology that combines a programming language with a documentation language, thereby making programs more robust, more portable, more easily maintained, and arguably more fun to write than programs that are written only in a high-level language. The main idea is to treat a program as a piece of literature, addressed to human beings rather than to a computer.(<a href="https://www-cs-faculty.stanford.edu/~knuth/lp.html">Donald Knuth</a>)
Microdata	Microdata are unit-level data obtained from sample surveys, censuses, and administrative systems. They provide information about characteristics of individual people or entities such as households, business enterprises, facilities, farms or even geographical areas such as villages or towns. They allow in-depth understanding of socio-economic issues by studying relationships and interactions among phenomena. Microdata are thus key to designing projects and formulating policies, targeting interventions and monitoring and measuring the impact and results of projects, interventions and policies. ([The World Bank]
NHANES	The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey is unique in that it combines interviews and physical examinations. (<a href="https://www.cdc.gov/nchs/nhanes/about_nhanes.htm">NHANES</a>)
Prolog	A set of comments at the top of a code file that provides information about what is in the file (SwR, Glossary)
Table 1	Descriptive statistics are often displayed in the first table in a published article or report and are therefore often called <i>Table 1 statistics</i> or the <i>descriptives</i>. (SwR)

Session Info

Code

sessioninfo::session_info()

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       macOS Sequoia 15.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Vienna
#>  date     2024-11-13
#>  pandoc   3.5 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  base64enc     0.1-3      2015-07-28 [2] CRAN (R 4.4.0)
#>  bit           4.5.0      2024-09-20 [2] CRAN (R 4.4.1)
#>  bit64         4.5.2      2024-09-22 [2] CRAN (R 4.4.1)
#>  cards         0.3.0      2024-10-03 [2] CRAN (R 4.4.1)
#>  cli           3.6.3      2024-06-21 [2] CRAN (R 4.4.1)
#>  codetools     0.2-20     2024-03-31 [2] CRAN (R 4.4.2)
#>  colorspace    2.1-1      2024-07-26 [2] CRAN (R 4.4.0)
#>  commonmark    1.9.2      2024-10-04 [2] CRAN (R 4.4.1)
#>  crayon        1.5.3      2024-06-20 [2] CRAN (R 4.4.0)
#>  curl          6.0.0      2024-11-05 [2] CRAN (R 4.4.1)
#>  digest        0.6.37     2024-08-19 [2] CRAN (R 4.4.1)
#>  dplyr         1.1.4      2023-11-17 [2] CRAN (R 4.4.0)
#>  evaluate      1.0.1      2024-10-10 [2] CRAN (R 4.4.1)
#>  fansi         1.0.6      2023-12-08 [2] CRAN (R 4.4.0)
#>  farver        2.1.2      2024-05-13 [2] CRAN (R 4.4.0)
#>  fastmap       1.2.0      2024-05-15 [2] CRAN (R 4.4.0)
#>  forcats       1.0.0      2023-01-29 [2] CRAN (R 4.4.0)
#>  generics      0.1.3      2022-07-05 [2] CRAN (R 4.4.0)
#>  ggplot2       3.5.1      2024-04-23 [2] CRAN (R 4.4.0)
#>  glossary    * 1.0.0.9003 2024-08-05 [2] Github (debruine/glossary@05e4a61)
#>  glue          1.8.0      2024-09-30 [2] CRAN (R 4.4.1)
#>  gt            0.11.1     2024-10-04 [2] CRAN (R 4.4.1)
#>  gtable        0.3.6      2024-10-25 [2] CRAN (R 4.4.1)
#>  gtsummary     2.0.3      2024-10-04 [2] CRAN (R 4.4.1)
#>  haven         2.5.4      2023-11-30 [2] CRAN (R 4.4.0)
#>  here          1.0.1      2020-12-13 [2] CRAN (R 4.4.0)
#>  hms           1.1.3      2023-03-21 [2] CRAN (R 4.4.0)
#>  htmltools     0.5.8.1    2024-04-04 [2] CRAN (R 4.4.0)
#>  htmlwidgets   1.6.4      2023-12-06 [2] CRAN (R 4.4.0)
#>  jsonlite      1.8.9      2024-09-20 [2] CRAN (R 4.4.1)
#>  kableExtra    1.4.0      2024-01-24 [2] CRAN (R 4.4.0)
#>  knitr         1.49       2024-11-08 [2] CRAN (R 4.4.1)
#>  labeling      0.4.3      2023-08-29 [2] CRAN (R 4.4.0)
#>  labelled      2.13.0     2024-04-23 [2] CRAN (R 4.4.0)
#>  lifecycle     1.0.4      2023-11-07 [2] CRAN (R 4.4.0)
#>  magrittr      2.0.3      2022-03-30 [2] CRAN (R 4.4.0)
#>  markdown      1.13       2024-06-04 [2] CRAN (R 4.4.0)
#>  munsell       0.5.1      2024-04-01 [2] CRAN (R 4.4.0)
#>  pillar        1.9.0      2023-03-22 [2] CRAN (R 4.4.0)
#>  pkgconfig     2.0.3      2019-09-22 [2] CRAN (R 4.4.0)
#>  purrr         1.0.2      2023-08-10 [2] CRAN (R 4.4.0)
#>  R6            2.5.1      2021-08-19 [2] CRAN (R 4.4.0)
#>  readr         2.1.5      2024-01-10 [2] CRAN (R 4.4.0)
#>  repr          1.1.7      2024-03-22 [2] CRAN (R 4.4.0)
#>  rlang         1.1.4      2024-06-04 [2] CRAN (R 4.4.0)
#>  rmarkdown     2.29       2024-11-04 [2] CRAN (R 4.4.1)
#>  rprojroot     2.0.4      2023-11-05 [2] CRAN (R 4.4.0)
#>  rstudioapi    0.17.1     2024-10-22 [2] CRAN (R 4.4.1)
#>  rversions     2.1.2      2022-08-31 [2] CRAN (R 4.4.0)
#>  sass          0.4.9      2024-03-15 [2] CRAN (R 4.4.0)
#>  scales        1.3.0      2023-11-28 [2] CRAN (R 4.4.0)
#>  sessioninfo   1.2.2      2021-12-06 [2] CRAN (R 4.4.0)
#>  skimr         2.1.5      2022-12-23 [2] CRAN (R 4.4.0)
#>  stringi       1.8.4      2024-05-06 [2] CRAN (R 4.4.0)
#>  stringr       1.5.1      2023-11-14 [2] CRAN (R 4.4.0)
#>  svglite       2.1.3      2023-12-08 [2] CRAN (R 4.4.0)
#>  systemfonts   1.1.0      2024-05-15 [2] CRAN (R 4.4.0)
#>  tibble        3.2.1      2023-03-20 [2] CRAN (R 4.4.0)
#>  tidyr         1.3.1      2024-01-24 [2] CRAN (R 4.4.0)
#>  tidyselect    1.2.1      2024-03-11 [2] CRAN (R 4.4.0)
#>  tzdb          0.4.0      2023-05-12 [2] CRAN (R 4.4.0)
#>  utf8          1.2.4      2023-10-22 [2] CRAN (R 4.4.0)
#>  vctrs         0.6.5      2023-12-01 [2] CRAN (R 4.4.0)
#>  viridisLite   0.4.2      2023-05-02 [2] CRAN (R 4.4.0)
#>  vroom         1.6.5      2023-12-05 [2] CRAN (R 4.4.0)
#>  withr         3.0.2      2024-10-28 [2] CRAN (R 4.4.1)
#>  xfun          0.49       2024-10-31 [2] CRAN (R 4.4.1)
#>  xml2          1.3.6      2023-12-04 [2] CRAN (R 4.4.0)
#>  yaml          2.3.10     2024-07-26 [2] CRAN (R 4.4.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-x86_64/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

To prevent conflict with the labelled class from the {Hmisc} package {haven} has changed with version 2.0.0 its label class names from labelled and labelled_spss to haven_labelled and haven_labelled_spss. (See GitHub Issue #329.)↩︎

# Preparing Data {#sec-chap01} ```{r} #| label: setup #| include: false base::source(file = "R/helper.R") ``` ## Achievements to unlock :::::: {#obj-chap01} ::::: my-objectives ::: my-objectives-header Objectives for chapter 01 ::: ::: my-objectives-container **SwR Achievements** - (~~Observations and variables~~) - Using reproducible research practices (@sec-chap01-reproducibility) - (~~Understanding and changing data types~~) - Entering or loading data into R (@sec-chap01-import-data) - Identifying and treating missing values (Data wrangling) (@sec-chap01-data-wrangling) - Building a basic bar chart (Replicate Figure 1.1 and 1.2) (@sec-chap01-repr-bar-charts) I will skip the crossed out learning objectives in parenthesis as I know already these procedures. However I will modify and elaborate some of these achievements as mentioned in the parentheses. I will add other objectives that resulted from questions that arose during reading the book. **Questions that resulted to additional objectives** - How to download data directly from the `r glossary("General Social Survey")` (GSS) web page? {@sec-chap01-get-gss-data} - How to work with labelled data? {@sec-chap01-labelled-data} - Reviewing different packages for creating tables (@sec-chap01-review-table-packages) - Introducing {**gtsummary**}, a packages for presentation-ready data summary and analytic result tables (see: @sec-gtsummary @sec-chap01-experiments-gtsummary) - Developing timeline graph (1973-2022) about opinions of respondents from `r glossary("GSS")` surveys about legalizing marijuana (@sec-chap01-grass-timeline) ::: ::::: Achievements for chapter 01 :::::: ## Using reproducible research practices {#sec-chap01-reproducibility} ### Script files SwR explains writing script files, but I am using `r glossary("Literate Programming")` with Quarto. This has the consequence that in addition to short comments inside code cells I have the possibility to write extensively in the other parts of the file about approach, code, results etc. A practical advice for scripts is to include a `r glossary("prolog")`. Possible prolog sections: - Project name - Project purpose - Name(s) of data set(s) used in the project - Location(s) of data set(s) used in the project - Code author name (you!) - Date code created - Date last time code was edited Most of these information are naturally occurring in the writing process of Quarto books. :::::: my-resource :::: my-resource-header ::: {#lem-chap01-literate-programming} : Literate Statistical Programming ::: :::: ::: my-resource-container - Literate Programming: ([Wikipedia](https://en.wikipedia.org/wiki/Literate_programming)) - Introduction to Literate Programming with Quarto ([Online Slides](https://gesiscss.github.io/quarto-workshop/material/slides/01_introduction.html#/title-slide)) - Reproducibility and literate programming in R ([bookdown course](https://exeter-data-analytics.github.io/LitProg/index.html)) - Introduction to Data Science in R for Biologists (Module on [Literate Statistical Programming and Quarto](https://mbutler808.github.io/rclass/posts/2023-01-26-intro-quarto/index.html)) - Let’s build a blog with Quarto [Literate programming in Quarto](https://ivelasq.quarto.pub/building-a-blog-with-quarto/workflow/write-docs/)) by Isabella Velásquez. The site has other material (for Quarto blogs) as well: [Migrate from R Markdown](https://ivelasq.quarto.pub/building-a-blog-with-quarto/learn-more/migrate-blog/), [Additional resources](https://ivelasq.quarto.pub/building-a-blog-with-quarto/learn-more/resources/) - Introduction to literate programming with Quarto and Markdown by Gesis ([Slides](https://gesiscss.github.io/quarto-workshop/material/slides/01_introduction.html#/title-slide)) ::: :::::: ### Naming objects I am used to apply the [tidyverse style guide](https://style.tidyverse.org/). It requires to use underlines ("snake_code") as separators in object names. (In contrast to "camelCase" code style). But reading the book I thought it might be a good idea to use special additional styles for certain specific objects. - **Naming constants**: Prefix name of constants with `k_`. - **Naming variables**: Standard snake code. - **Naming functions**: Prefix name of private functions with a dot `.`. I had already experienced that didn't know from which package a function was. Only to learn after looking around for minutes that it was a function I wrote myself! - **Naming data frames**: Prefix name with `df_` for data.frame and `dt_` for tibble. I might also use a suffix to refer to the status e.g., `_raw` (raw data), `_clean` (cleaned data), `_v2` (version number). - **Naming files**: It could be helpful to add at the start the chapter number e.g. `chap02_`. And maybe also --- as in naming data frames --- the status as suffix. ## Import data frames from outside resources {#sec-chap01-import-data} R has many possibilities to import data from other statistical packages. ### Some common file extensions - **.csv**: comma separated values - **.txt**: text file - **.xls or .xlsx**: Excel file - **.sav**: SPSS file - **.sasb7dat**: SAS file - **.xpt**: SAS transfer file - **.dta**: Stata file ### Some packages for import data sources - {**readr**}: Read Rectangular Text Data (see: @sec-readr), it is part of {**tidyverse**} - {**vroom**}: Read and Write Rectangular Text Data Quickly - {**haven**}: Import and Export 'SPSS', 'Stata' and 'SAS' Files (see: @sec-haven) - {**foreign**}: Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ... - {**readxl**}: Read Excel Files - {**openxslx**}: Read, Write and Edit xslx Files - {**readODS**}: Read and Write ODS Files (e.g. LibreOffice) - {**clipr**}: Read and Write from the System Clipboard I will not go into the import details of all the different packages here, because my focus is on the `r glossary("General Social Survey")` (GSS) data. ### Import data with a .csv file > “While the GSS data can be read into R directly from the GSS website, Kiara had experienced this and knew that it could be frustrating.” ([Harris, 2020](zotero://select/groups/5254842/items/9N29QMJB)) ([pdf](zotero://open-pdf/groups/5254842/items/3NDRGBBW?page=107&annotation=SFD9FHQD)) The author therefore offers for this chapter a `r glossary("CSV", ".csv")` file with the data. In later chapters learner can choose to use the provided files from the [SAGE webpage](https://edge.sagepub.com/harris1e/student-resources/datasets-and-r-code). Even if these data files are not yet cleaned, it is a kind of cheating, because it bypasses downloading data from the original source. The `pb_create_folder()` function in the following code chunk checks if the folder already exists. If this is the case it leaves it untouched, otherwise it creates the folder at the desired path. The code for my private functions are in the file `helper.R` inside the `R` folder at root level of this working directory. :::::::::::::::::: my-example :::: my-example-header ::: {#exm-chap01-read-csv-show-data} : Read data from a .csv file into R ::: :::: ::::::::::::::: my-example-container :::::::::::::: panel-tabset ###### summary() ::::::: my-r-code :::: my-r-code-header ::: {#cnj-code-name-a} : Read .csv file and show data summary ::: :::: :::: my-r-code-container ::: {#lst-chap01-read-csv-show-summary} ```{r} #| label: read-csv-show-summary #| cache: true pb_create_folder(base::paste0(here::here(), "/data/")) pb_create_folder(base::paste0(here::here(), "/data/chap01")) gss_2016_book <- readr::read_csv( file = "data/chap01/legal_weed_age_GSS2016_ch1.csv", show_col_types = FALSE) summary(gss_2016_book) ``` Read the provided dataset as a .csv file into R and show `base::summary()` ::: ------------------------------------------------------------------------ **Some comments** 1. In contrast to `base::read.csv()` in the book I used with `readr::read_csv()` a function from the {**tidyverse**} package collection. 2. I added the `show_col_types = FALSE` argument to prevent a message about the column specification. :::: ::::::: The output of `base::summary` is in this case not very informative. Looking around for appropriate reporting function I developed with `my_glance_data()` my private command. (See @lst-chap01-read-csv-glance-data in next tab.) ###### my_glance_data() ::::::: my-r-code :::: my-r-code-header ::: {#cnj-code-name-b} Look at data with my private function `my_glance_data()` ::: :::: :::: my-r-code-container ::: {#lst-chap01-read-csv-glance-data} ```{r} #| label: glance-data gss_2016_book |> dplyr::select(c(age, grass)) |> my_glance_data(N = 8, seed = 2016) ``` Display 8 randomly chosen rows in addition to the first and last row ::: :::: ::::::: I developed a private function `my_glance_data()` to provide a first impression about the data. The function prints the first and last row of the dataset, so you know implicitly how many rows the dataset has. Between the first and last row the function adds randomly `N` other rows (default is 8). Additionally it provides the row number of the data in a separate column. (The column header `obs` stands for "observation".) For reproducibility purposes you can also add a number for the `set.seed()` command. The idea of `my_glance_data()` is to get a first impression of the dataset. Other printing methods show only the first (or last) rows of the dataset. This could be misleading, giving a wrong impression about typical data. ###### my_glance_data Maybe you are interested to use `my_glance_data()` yourself? It isn't available through a package, but you can copy the source code from the next R chunk. I have saved the function in a `.qmd` file one level higher as this book (and all my other R projects). With `{{< include "../_common.qmd" >}}` I have the code integrated in this book. (For the `include` shortcode see the section [Includes](https://quarto.org/docs/authoring/includes.html) of the Quarto documentation.) ::: {#lst-chap01-show-function-glance-data} ```{r} #| label: show-function-glance-data #| code-fold: show #| eval: false ### function my_glance_data ############## my_glance_data <- function(df, N = 8, seed = 42){ df_temp <- first_and_last_row(df) set.seed(seed) df |> dplyr::mutate(obs = dplyr::row_number()) |> dplyr::relocate(obs) |> dplyr::slice_sample(n = N) |> dplyr::bind_rows(df_temp) |> dplyr::arrange(obs) } first_and_last_row <- function(df) { df |> dplyr::mutate(obs = dplyr::row_number()) |> dplyr::filter(dplyr::row_number() %in% base::c(1, dplyr::n())) } ``` Code of my private function `glance_date()` ::: :::::::::::::: ::::::::::::::: :::::::::::::::::: ------------------------------------------------------------------------ ## Data Wrangling {#sec-chap01-data-wrangling} ### ToDo List After we saved the data we need to do some data wrangling. To replicate the data structure for the book Figure 1.2 we need to: - filter the dataset to the year 2016 (in the case you work with the full GSS dataset 1972-2022, which we won't) - select only the variables `age` and `grass` to work with - drop all NA’s - convert `grass` into factor - recode `grass` labels - convert `age` from double to numeric - divide `age` into appropriate age intervals and label them accordingly ------------------------------------------------------------------------ ::: {#fig-chap01-bar-charts layout-ncol="2"} ![Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1)](img/chap01/fig-01-01-min.png){#fig-01-01} ![Support for marijuana legalization by age group among participants in the 2016 General Social Survey (Figure 1.2)](img/chap01/fig-01-02-min.png){#fig-01-02} We will replicate these two bar charts (Figure 1.1 and Figure 1.2 in the book) ::: ------------------------------------------------------------------------ ### Replicating data structure for bar charts :::::::::::::: my-example :::: my-example-header ::: {#exm-chap01-repl-data-bar-charts} : Replicate data structure for bar charts ::: :::: ::::::::::: my-example-container ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-repl-data-fig-book} : Replicate data structure for Figures 1.1 and 1.2 (Book) ::: :::: :::: my-r-code-container ::: {#lst-repl-data-fig-book} ```{r} #| label: replicate-data-fig-book #| results: hold gss_2016 <- readr::read_csv( file = "data/chap01/legal_weed_age_GSS2016_ch1.csv", show_col_types = FALSE) gss_2016_clean <- gss_2016 |> #### (A) rework grass ################## ## convert grass to factor dplyr::mutate(grass = forcats::as_factor(grass)) |> ## change DK and IAP to NA dplyr::mutate(grass = dplyr::na_if(x = grass, y = "DK")) |> dplyr::mutate(grass = dplyr::na_if(x = grass, y = "IAP")) |> ## drop unused levels "DK" and "IAP" dplyr::mutate(grass = forcats::fct_drop(grass)) |> ## convert to factor and recode it dplyr::mutate(grass = forcats::fct_recode( grass, "Yes" = "LEGAL", "No" = "NOT LEGAL")) |> #### (B) rework age ######################### ## change data types and recode ## turn character type of age into factor and recode "89 OR OLDER" to "89" # dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89")) |> dplyr::mutate(age = dplyr::case_match(age, "89 OR OLDER" ~ "89", .default = age)) |> ## convert data type of age from factor to numeric dplyr::mutate(age = base::as.numeric(age)) |> ## cut age into several defined age groups dplyr::mutate(age_cat = base::cut(age, breaks = c(-Inf, 29, 59, 74, Inf), labels = c("< 30", "30 - 59", "60 - 74", "75+" ))) |> ## drop NA's tidyr::drop_na() ## save cleaned data gss_2016 save_data_file("chap01", gss_2016_clean, "gss_2016_clean.rds") ## (C) check result ######### ## summarize print("************************ SUMMARY ****************************") base::summary(gss_2016_clean) ## glance at the data print("******************* GLANCE AT SOME DATA **********************") gss_2016_clean |> dplyr::select(c(age_cat, grass)) |> my_glance_data(N = 8, seed = 2016) ``` Replicate data structure for Figures 1.1 and 1.2 of the book) ::: ------------------------------------------------------------------------ The somewhat strange cut decision for age groups was motivated by the question if there is a difference between young and old voters about marijuana legalization. Compare the result of the recoding with the original data structure in @lst-chap01-read-csv-show-summary. :::: ::::::: **Some comments** 1. I have included all data wrangling changes for the Figure 1.1 even if they appeared in the book section where the graph is prepared. 2. I have used `|>` form the native R pipe instead of `%>%` exported into tidyverse from the {**magrittr**} package. 3. Otherwise whenever possible I have used {**tidyverse**} code. For `base::as.numeric()` and `base::cut()` I couldn't find {**tidyverse**} equivalents. ::::: my-watch-out ::: my-watch-out-header WATCH OUT! Difference between `forcats::fct_recode()` and `dplyr::recode()`. Use `dplyr::case_match()`! ::: ::: my-watch-out-container To recode `age` from "89 OR OLDER" to "89" I used at first `forcats::fct_recode()`. But this function changes the 72 *levels* of the factor `age` and resulted --- after I changed `age` to numeric in the next line --- in a data frame where `age` ranged from 1 to 72! After I noticed the problem and wanted to replace the {**forcats**}-line with `dplyr::mutate(age = dplyr::recode(age, "89 OR OLDER" = "89"))` I saw in the help file that > `recode()` is superseded in favor of `case_match()`, which handles the most important cases of `recode()` with a more elegant interface. ([Recode values](https://dplyr.tidyverse.org/reference/recode.html)) and > \[`dplyr::case_match()`\] allows you to vectorise multiple switch() statements. Each case is evaluated sequentially and the first match for each element determines the corresponding value in the output vector. If no cases match, the .default is used. ([A general verctorized switch()](https://dplyr.tidyverse.org/reference/case_match.html)) ::: ::::: @lst-repl-data-fig-book used some important packages: I used {**dplyr**} in the above code heavily. {**dplyr**} (@sec-dplyr) is part of the {**tidyverse**} core collection (@sec-tidyverse), together with {**forcats**} (@sec-forcats), {**ggplot2**} (@sec-ggplot2), {**readr**} (@sec-ggplot2), {**tibble**} (@sec-tibble), {**tidyr**} (@sec-tidyr), {**stringr**} (@sec-stringr) and {**purrr**} (@sec-purrr). ::::::::::: :::::::::::::: ## Reproducing bar charts {#sec-chap01-repr-bar-charts} ::::::::::::::: my-example :::: my-example-header ::: {#exm-chap01-repr-bar-charts} : Reproducing bar charts ::: :::: :::::::::::: my-example-container ::::::::::: panel-tabset ###### Figure 1.1 :::::: my-r-code :::: my-r-code-header ::: {#cnj-fig-chap01-repr-fig-1-1} a: Reproducing bar chart Figure 1.1 using `geom_bar()` ::: :::: ::: my-r-code-container ```{r} #| label: fig-chap01-repr-fig-1-1 #| fig-cap: Support for marijuana legalization among participants in the 2016 General Social Survey (Figure 1.1) #| fig-width: 4 #| fig-height: 4 gss_2016_clean <- base::readRDS("data/chap01/gss_2016_clean.rds") ## create figure 1.1 ####### gss_2016_clean |> ggplot2::ggplot(ggplot2::aes(x = grass, y = 100 * ggplot2::after_stat(count) / base::sum(ggplot2::after_stat(count)), fill = grass)) + ggplot2::geom_bar() + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c("#79a778", '#7662aa'), guide = "none") + ggplot2::labs(x = "Should marijuana be legal?", y = "Percent of responses") ``` ------------------------------------------------------------------------ I changed the code slightly because of two warnings. Newer versions of {**ggplot2**} have deprecated some older functions: - Warning: The dot-dot notation (`..count..`) was deprecated in {**ggplot2**} version 3.4.0. Please use `after_stat(count)` instead. - Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in {**ggplot2**} version 3.3.4. Please use "none" instead. (I have slightly edited both warnings) ::: :::::: Compare the reproduction with the original @fig-01-01. ###### Figure 1.2 :::::: my-r-code :::: my-r-code-header ::: {#cnj-fig-chap01-repr-fig-1-2} b: Reproducing bar chart Figure 1.2 using `geom_col()` ::: :::: ::: my-r-code-container ```{r} #| label: fig-chap01-repr-fig-1-2 #| fig-cap: Support for marijuana legalization by age group among participants in the 2016 General Social Survey (Figure 1.2) gss_2016_clean <- base::readRDS("data/chap01/gss_2016_clean.rds") ## create figure 1.2 ######## gss_2016_clean |> dplyr::group_by(grass, age_cat) |> dplyr::count() |> dplyr::group_by(age_cat) |> dplyr::mutate(perc_grass = 100 * n / base::sum(n)) |> ggplot2::ggplot(ggplot2::aes(x = age_cat, fill = grass, y = perc_grass)) + ggplot2::geom_col(position = 'dodge') + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c('#79a778', '#7662aa'), name = "Should marijuana\nbe legal?") + ggplot2::labs(x = "Age group (in years)", y = "Percent of responses in age group") ``` ------------------------------------------------------------------------ @fig-chap01-repr-fig-1-1 uses `ggplot2::geom_bar()` whereas this figure here calls the `ggplot2::geom_col()` function. > There are two types of bar charts: `geom_bar()` and `geom_col()`. `geom_bar()` makes the height of the bar proportional to the number of cases in each group … . If you want the heights of the bars to represent values in the data, use `geom_col()` instead. `geom_bar()` uses `stat_count()` by default: it counts the number of cases at each x position. `geom_col()` uses `stat_identity()`: it leaves the data as is. ([Bar charts](https://ggplot2.tidyverse.org/reference/geom_bar.html) with {**ggplot2**}) ::: :::::: Compare the reproduction with the original @fig-01-02. ::::::::::: :::::::::::: ::::::::::::::: ------------------------------------------------------------------------ ## Exercises ::::: my-objectives ::: my-objectives-header Exercises for the Coder & Hacker edition ::: ::: my-objectives-container **SwR objectives** 1. Visit the NHANES website and find the online codebook for the 2013–2014 data (@sec-visit-nhanes-website) 2. Open the 2013–2014 NHANES data file saved as `nhanes_2013_ch1.csv` with the book materials at [edge.sagepub.com/harris1e](https://edge.sagepub.com/harris1e) (Achievement 1) (@sec-chap01-import-clean-data in tab "read_csv()") 3. Examine the data types for DUQ200, RIDAGEYR, and RIAGENDR, and fix data types if needed based on the NHANES codebook (Achievements 2 and 3) (@sec-chap01-import-clean-data in tab "View data" and tab "Codebook") 4. Based on the online NHANES codebook, code missing values appropriately for DUQ200, RIDAGEYR, and RIAGENDR (Achievement 4) (@sec-chap01-import-clean-data in tab "Data cleaning") 5. Create a bar chart showing the percentage of NHANES participants answering yes and no to marijuana use (Achievement 5) (@sec-chap01-bar-chart-grass in tab "Solution") 6. Recode age into a new variable called `age_cat` with 4 categories: 18–29, 30–39, 40–49, 50–59 (Achievement 7) (@sec-chap01-grass-age-sex in tab "Numbers") 7. Create a bar chart of marijuana use by age group (Achievement 6) (@sec-chap01-grass-age-sex in tab "Age") 8. ~~Add a prolog and comments to your code (Achievement 1)~~ (I am using `r glossary("literate programming")` in writing this book.) 9. Create a bar chart of marijuana use by age group and sex with side-by-side bars (Achievement 5) (@sec-chap01-grass-age-sex in tab "Age and sex") 10. Following the R code in your code file, use comments to describe what you found. Given what you found and the information in the chapter, what do you predict will happen with marijuana legalization in the next 10 years? Discuss how the omission of older people from the marijuana use question for NHANES influenced your prediction. Write your prediction and discussion in comments at the end of your code file (@sec-chap01-data-interpretation). **My own additional exercises** 11. Download and work with SAS .xpt file data (Demo) {@sec-download-xpt-file} 12. Show table of frequencies and percentages (@sec-chap01-bar-chart-grass in Tab "Numbers") 13. Make bars thinner (@sec-chap01-bar-chart-grass in Tab "Thin bars") 14. Make bars thinner and adjust spacing accordingly (@sec-chap01-bar-chart-grass in Tab "Solution") ::: ::::: ### Visiting the NHANES website {#sec-visit-nhanes-website} Use the National Health and Nutrition Examination Survey (`r glossary("NHANES")`) data to examine marijuana use in the United States. Spend a few minutes looking through the NHANES website (<https://www.cdc.gov/nchs/nhanes/index.htm>) before you begin, including finding the online codebook for the 2013–2014 data. Complete the following tasks to explore whether age is related to marijuana use in the United States. The 2013–2014 codebook is on the page [NHANES 2013-2014](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2013). On this page are the links to different types of data, documentations or codebooks. You can choose from :::::: {#bul-NAHNES-data-sections} ::::: my-bullet-list ::: my-bullet-list-header Bullet List: NHANES has six different data sections ::: ::: my-bullet-list-container - Demographics Data - Dietary Data - Examination Data - Laboratory Data - Questionnaire Data - Limited Access Data ::: ::::: NHANES has six different data sections :::::: ------------------------------------------------------------------------ Clicking of one of these links opens a page where a table contains the following information: :::::: {#bul-NHANES-data-pages} ::::: my-bullet-list ::: my-bullet-list-header Bullet List: NHANES data pages has download links for documentation and data files ::: ::: my-bullet-list-container - Data File Name - Doc File - Data File - Data Published ::: ::::: NHANES data pages has download links for documentation and data files :::::: ------------------------------------------------------------------------ [Datasets and Documentation](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Datasets.aspx) and [Frequently Asked Questions (FAQs)](https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/faq.aspx) are important introductory pages. NHANES data are saved in a [SAS transport (.XPT) file](https://www.loc.gov/preservation/digital/formats/fdd/fdd000464.shtml). These file can be read with `foreign::read.xport()` and `haven::read_xpt()`. There is also a [page with example code](https://wwwn.cdc.gov/nchs/nhanes/tutorials/Datasets.aspx) to download/import NHANES data files. I will use {**haven**} instead of the recommended {**foreign**} package. ### Download NHANES SAS .xpt file data (Demo) {#sec-download-xpt-file} :::::::::::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-chap01-xpt-file-download} : Download and work with NHANES SAS .xpt file data (Demo) ::: :::: ::::::::::::::::::::::::::::: my-experiment-container :::::::::::::::::::::::::::: panel-tabset ###### Get .xpt file ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-save-xpt-file-demo} : Save a SAS transport (.xpt) file from the NHANES site for demo purposes and display the structure of the file ::: :::: :::: my-r-code-container ::: {#lst-chap01-downloadfile} ```{r} #| label: save-xpt-file-demo #| cache: true #| eval: false ## download only once (manually) utils::download.file( "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_BPXO.XPT", tf <- base::tempfile(), mode = "wb" ) blood_pressure_demo <- haven::read_xpt(tf) save_data_file("chap01", blood_pressure_demo, "blood_pressure_demo.rds") ``` Download a file and save it as R object ::: (*This code chunk has no visible output*) With the exception of using (**haven**) instead of {**foreign**}, I followed the instruction from the [NHANES recommendation](https://wwwn.cdc.gov/nchs/data/tutorials/file_download_import_R.R). We see labelled data as it was expected from the import with {**haven**}. In @lst-chap01-get-gss2016-automated I have already used the `utils::download.file()` function: There are some differences: - @lst-chap01-get-gss2016-automated downloaded a STATA `.zip` file, therefore I used `unz()` to unzip the file and to read `haven::read_dta()` instead `haven::read_xpt()`. - More important however was the `base::unlink(temp)` command to unlink the temporary file, which is not done here. - Another difference was using the argument `mode = "wb"` for writing binary files. As I understood this is only important for Windows user. On my Mac I could stick with the default value `mode = "w"`. But for reproducibility it would be better to use in the future the more general option with `mode = "wb"`. :::: ::::::: :::::: {#imp-chap01-nhanesA-pkg} ::::: my-important ::: my-important-header Package {**nhanesA**} facilitates downloading NHANES data ::: ::: my-important-container I just learned that there is a {**nhanesA**} package that facilitates downloading data to R. (2024-01-20) The latest package release is brand-new (2024-01-09). I am planning to use this package when I am working with again with NHANES data in @sec-chap06. Just for me to remember: - The packages is [on CRAN](https://cran.r-project.org/web/packages/nhanesA/index.html) - There is a [GitHub repository](https://github.com/cjendres1/nhanes). - A very informative [introductory vignette](https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html) does not only explain the package but also the structure of the NHANES data and codebook. ::: ::::: Package {**nhanesA**} facilitates downloading NHANES data :::::: ------------------------------------------------------------------------ ###### str() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-str-blood-pressure-demo} : Show data summary of three blood pressure variables with `utils::str()` (Demo) ::: :::: ::: my-r-code-container ```{r} #| label: str-blood-pressure-demo blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds") blood_pressure_demo |> dplyr::select(c(2, 4, 5)) |> utils::str() ``` ::: :::::: ###### summary() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-summary-blood-pressure-demo} : Show data summary of three blood pressure variables with `summary()` (Demo) ::: :::: ::: my-r-code-container ```{r} #| label: summary-blood-pressure-demo blood_pressure_demo |> dplyr::select(c(2, 4, 5)) |> base::summary() ``` ::: :::::: ###### skim() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-skim-blood-pressure-demo} : Show data summary of three blood pressure variables with `skim()` (Demo) ::: :::: ::: my-r-code-container ```{r} #| label: skim-blood-pressure-demo blood_pressure_demo |> dplyr::select(c(2, 4, 5)) |> skimr::skim() ``` ::: :::::: ###### my_glance_data() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-show-random-blood-pressure-demo} : Show some blood pressure data with my private function `my_glance_data()` ::: :::: ::: my-r-code-container ```{r} #| label: show-random-blood-pressure-demo blood_pressure_demo <- base::readRDS("data/chap01/blood_pressure_demo.rds") ## display some values blood_pressure_demo |> dplyr::select(c(2, 4, 5)) |> dplyr::rename(arm_selected = BPAOARM) |> dplyr::rename(systolic = BPXOSY1) |> dplyr::rename(diastolic = BPXODI1) |> my_glance_data() ``` ::: :::::: :::::::::::::::::::::::::::: ::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::: ### Import & clean data {#sec-chap01-import-clean-data} :::::::::::::::::::::::::: my-exercise :::: my-exercise-header ::: {#exr-chap01-read-nhanes-csv} : Coder exercises: Import & clean data ::: :::: ::::::::::::::::::::::: my-exercise-container :::::::::::::::::::::: panel-tabset ###### read_csv() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-read-nhanes-csv} : Open the 2013–2014 NHANES data file that comes with the book ::: :::: ::: my-r-code-container ```{r} #| label: read-nhanes-csv #| cache: true #| eval: false ## load and save it only once (manually) ######## nhanes_2013 <- readr::read_csv("data/chap01/nhanes_2013_ch1.csv") save_data_file("chap01", nhanes_2013, "nhanes_2013.rds") ``` (*This R Code chunk has no visible output.*) ::: :::::: ###### View data ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-NHANES-view-data} : Look at the data with `utils::str()` and `skimr::skim()` ::: :::: :::: my-r-code-container ::: {#lst-chap01-NHANES-view-data} ```{r} #| label: NHANES-view-data nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds") base::summary(nhanes_2013) utils::str(nhanes_2013) skimr::skim(nhanes_2013) ``` Look at the data with `utils::str()` and `skimr::skim()` ::: As one of the packages for displaying summaries I am using the {**skimr**} package (see @sec-skimr). :::: ::::::: ###### Codebook Unfortunately the .csv files does not include labelled data. Therefore I had to look for the variable names in all different NHANES sections as outlined in @bul-NAHNES-data-sections. Inspecting the accompanying NHANES download pages (see: @bul-NHANES-data-pages) I could look for the variable names in the download links for the documentation. - [DUQ200](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DUQ_H.htm#DUQ200): Ever used marijuana or hashish - [RIAGENDR](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#RIAGENDR): Gender - [RIDAGEYR](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#RIDAGEYR): Age in years at screening There is an additional column `...1` (see @lst-chap01-NHANES-view-data) providing row numbers. Maybe this column works as an ID, a kind sequence number, so that one could merge other data files identified by this number. (See [SEQN](https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.htm#SEQN): Respondent Sequence Number). Why this column has lost its name, I do not know. Perhaps it has to do with the data import done by the book author? To facilitate the work I made screenshots from these four variables: ![](img/chap01/codebook-NHANES-DUQ200-marijuana-min.png) ![](img/chap01/codebook-NHANES-RIAGENDR-RIDAGEYR-gender-age-min.png) ![](img/chap01/codebook-NHANES-SEQN-sequence-number-min.png) ###### Data cleaning :::::::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-NHANES-data-cleaning} : Look at the data after data cleaning ::: :::: ::::::::: my-r-code-container ```{r} #| label: NHANES-data-cleaning nhanes_2013 <- base::readRDS("data/chap01/nhanes_2013.rds") nhanes_2013_clean <- nhanes_2013 |> dplyr::rename(nr = ...1) |> dplyr::rename( sex = RIAGENDR, age = RIDAGEYR, grass = DUQ200 ) |> dplyr::mutate(grass = dplyr::na_if(x = grass, y = 7)) |> dplyr::mutate(grass = dplyr::na_if(x = grass, y = 9)) |> tidyr::drop_na() |> dplyr::mutate(sex = forcats::as_factor(sex)) |> dplyr::mutate(grass = forcats::as_factor(grass)) |> dplyr::mutate(grass = forcats::fct_recode(grass, "Yes" = "1", "No" = "2")) |> dplyr::mutate(sex = forcats::fct_recode(sex, "Male" = "1", "Female" = "2")) ## save cleaned data ############### save_data_file("chap01", nhanes_2013_clean, "nhanes_2013_clean.rds") base::summary(nhanes_2013_clean) skimr::skim(nhanes_2013_clean) ``` ::::: my-important ::: my-important-header Recoded `gender` to `sex` ::: ::: my-important-container In all R code chunks and text passages I have changed the `gender` category to `sex`. "Sex" refers only to biology but "gender" is a broader concepts: It refers to identity construction, that include biological, psycho-social, and cultural factors. > Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, behaviours and roles associated with being a woman, man, girl or boy, as well as relationships with each other. As a social construct, gender varies from society to society and can change over time. ... Gender interacts with but is different from sex, which refers to the different biological and physiological characteristics of females, males and intersex persons, such as chromosomes, hormones and reproductive organs. Gender and sex are related to but different from gender identity. Gender identity refers to a person’s deeply felt, internal and individual experience of gender, which may or may not correspond to the person’s physiology or designated sex at birth. ([WHO](https://www.who.int/health-topics/gender#tab=tab_1)) ::: ::::: ::::: my-remark ::: my-remark-header My learning from the above procedure of data cleaning ::: ::: my-remark-container 1. It is wrong to put the column name `...1` into accents. 2. There exists a shortcut for several `dplyr::rename()` but not for `dplyr::na_if()` and `forcats::as_factor()` because here we need to change the column values with the command. 3. Sequence of commands is important: - Start with the renaming of variables, This is not mandatory but it helps to address the correct column in the following commands. - Recode different missing values to NA’s with `dplyr::na_if()` - Then drop (all) NA’s with `tidyr::drop_na()`. 4. `forcats::as_factor()` needs to rewrite the column as factor with `dplyr::mutate()` ::: ::::: ::::::::: :::::::::::: :::::::::::::::::::::: ::::::::::::::::::::::: :::::::::::::::::::::::::: ### Bar chart for marijuana use {#sec-chap01-bar-chart-grass} :::::::::::::::::::::::::::: my-exercise :::: my-exercise-header ::: {#exr-chap01-bar-chart-grass-percentage} : Bar chart (with different bar width ans space) showing the percentage of NHANES participants answering yes and no to marijuana use ::: :::: ::::::::::::::::::::::::: my-exercise-container :::::::::::::::::::::::: panel-tabset ###### Numbers :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-compute-grass-percentage} : Show frequencies and percentages of NHANES participants answering yes and no to "Ever used marijuana or hashish" ::: :::: ::: my-r-code-container ```{r} #| label: compute-grass-percentage #| cache: true nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds") tibble::enframe(table(nhanes_2013_clean$grass)) |> dplyr::rename(`Ever used grass or hash` = name, Freq = value) |> dplyr::mutate(Perc = round(100 * Freq / sum(Freq), 2)) ``` ------------------------------------------------------------------------ Instead of using `name` and `value` I could have used the position of the column numbers `1` and `2` (all names without the \`-sign.) ::: :::::: This is a manual calculation using the {**tidyverse**} approach: I am sure there are some packages that may facilitate this computation (e.g., {**Janitor**} or {**DescTools**}), but I had difficulties to apply the appropriate functions. ###### Standard :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-grass-bar-chart-percentage} : Percentage of NHANES participants answering yes and no to marijuana use ::: :::: ::: my-r-code-container ```{r} #| label: fig-grass-bar-chart-percentage-standard #| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey" #| fig-width: 4 #| fig-height: 4 nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds") ggplot2::ggplot(nhanes_2013_clean, ggplot2::aes(grass, y = 100 * ggplot2::after_stat(count) / base::sum(ggplot2::after_stat(count)), fill = grass )) + ggplot2::geom_bar() + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'), guide = "none") + ggplot2::scale_y_continuous( breaks = base::seq(0, 50, by = 10), labels = scales::label_percent(scale = 1)) + ggplot2::labs(x = "Did you ever use marijuana or hashish?", y = "Percent of responses") ``` ::: :::::: The bars are very thick. I tried with `ggplot2::geom_bar(width = 0.5)` to make them thinner. But I failed. (See @lst-chap01-grass-bar-chart-width) ###### Thin bars ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-grass-bar-chart-thin-bars} : Percentage of NHANES participants answering yes and no to marijuana use ::: :::: :::: my-r-code-container ::: {#lst-chap01-grass-bar-chart-width} ```{r} #| label: fig-grass-bar-chart-thin-bars #| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey" #| fig-width: 4 #| fig-height: 4 nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds") ggplot2::ggplot(nhanes_2013_clean, ggplot2::aes(grass, y = 100 * ggplot2::after_stat(count) / base::sum(ggplot2::after_stat(count)), fill = grass )) + ## provide thinner bars, standard = 0.9 ggplot2::geom_bar(width = 0.5) + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'), guide = "none") + ggplot2::scale_y_continuous( breaks = base::seq(0, 50, by = 10), labels = scales::label_percent(scale = 1)) + ggplot2::labs(x = "Did you ever use marijuana or hashish?", y = "Percent of responses") ``` Reducing the thickness of the bars ::: :::: ::::::: The bars are now thin but the space between didn't change. I tried several options with the position argument to [adjust bar width *and* spacing](https://r-graphics.org/recipe-bar-graph-adjust-width): > Dodging preserves the vertical position of an geom while adjusting the horizontal position. position_dodge() requires the grouping variable to be be specified in the global or geom\_\* layer. Unlike position_dodge(), position_dodge2() works without a grouping variable in a layer. position_dodge2() works with bars and rectangles, but is particulary useful for arranging box plots, which can have variable widths. All my trials with `position_dodge()` and `position_dodge2()` failed. The only two strategies I could reduce the spacing was: - reducing the width of the graphic (and adapting the title of the graphic to fit into the smaller space) - using ggplot2::theme(aspect.ratio = 3/1) as shown in @lst-fig-chap01-grass-bar-chart-solution. ###### Solution :::::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-grass-bar-chart-solution} : Percentage of NHANES participants answering yes and no to marijuana use ::: :::: ::::::: my-r-code-container ::: {#lst-fig-chap01-grass-bar-chart-solution} ```{r} #| label: fig-grass-bar-chart-solution #| fig-cap: "Proportion of marijuana use of the participant in the NHANES 2013 survey" #| fig-width: 4 #| fig-height: 4 nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds") ggplot2::ggplot(nhanes_2013_clean, ggplot2::aes(grass, y = 100 * ggplot2::after_stat(count) / base::sum(ggplot2::after_stat(count)), fill = grass )) + ggplot2::geom_bar() + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c("darkgreen", 'darkred'), guide = "none") + ggplot2::scale_y_continuous( # n.breaks = 10, breaks = base::seq(0, 50, by = 5), labels = scales::label_percent(scale = 1)) + ggplot2::labs(x = "Did you ever use marijuana or hashish?", y = "Percent of responses") + ggplot2::theme(aspect.ratio = 3/1) ``` My solution for creating a bar chart showing the percentage of NHANES participants answering yes and no to marijuana use ::: ::::: my-remark ::: my-remark-header What I learned from creating the bar chart ::: ::: my-remark-container Even with this simple example of a bar chart I learned some new options: 1. Use `ggplot2::after_stat()` inside the aesthetics of the first layer to convert frequencies to percentages. 2. Adding breaks with `n.breaks = <number>` or `breaks = base::seq(from, to, by = <number>)` inside the appropriate scale. 3. Use the {**scales**} package to add the `%`-sign after the values. As I have already calculated the percentages I need to reduce the default value of `scale = 100` to `scale = 1`. 4. Use `ggplot2::theme(aspect.ratio = <number>/1)` creating thinner bars *and* reducing the spacing between the bars accordingly. In my case I changed the aspect ratio to `3/1`. A bigger front number compresses the graph, a smaller front number extends it. Aspect ratio of `1/1` is the standard (= no change). ::: ::::: ::::::: :::::::::: :::::::::::::::::::::::: ::::::::::::::::::::::::: :::::::::::::::::::::::::::: ### Marijuana use per age and sex {#sec-chap01-grass-age-sex} ::::::::::::::::::::::::::::::: my-exercise :::: my-exercise-header ::: {#exr-chap01-bar-chart-grass-age-group} : Exploring the percentage of NHANES participants answering yes or no to marijuana use by age category and sex ::: :::: :::::::::::::::::::::::::::: my-exercise-container ::::::::::::::::::::::::::: panel-tabset ###### Summary Table :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-compute-grass-age-group-percentage} : Show frequencies and percentages of NHANES participants answering yes and no to "Ever used marijuana or hashish" ::: :::: ::: my-r-code-container ```{r} #| label: compute-grass-age-group-percentage #| results: hold nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds") nhanes_2013_grouped <- nhanes_2013_clean |> ## cut age into several defined age groups dplyr::mutate(age_cat = base::cut(age, breaks = c(-Inf, 29, 39, 49, 59), labels = c("18 - 29", "30 - 39", "40 - 49", "50 - 59"))) |> dplyr::select(grass, sex, age_cat) ## older trial, without the knowledge of {gtsummary} # tibble::enframe(table(nhanes_2013_grouped$age_cat)) |> # dplyr::rename(`Ever used grass or hash` = 1, Freq = 2) |> # dplyr::mutate(Perc = round(100 * Freq / sum(Freq), 2)) nhanes_2013_grouped |> labelled::set_variable_labels( grass = "Ever used grass or hash?" ) |> gtsummary::tbl_summary( label = list( age_cat = "Age Category", sex = "Sex" ), type = grass ~ "categorical" ) |> gtsummary::bold_labels() save_data_file("chap01", nhanes_2013_grouped, "nhanes_2013_grouped.rds") ``` ------------------------------------------------------------------------ Instead of using `1` and `2` I could have used the position of the column numbers `name` and `value` (all names without the \`-sign.) ::: :::::: This is a manual calculation using the {**tidyverse**} approach: I am sure there are some packages that may facilitate this computation (e.g., {**Janitor**} or {**DescTools**}), but I had difficulties to apply the appropriate functions. ###### Age 1 :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-grass-age-group1} : Table: Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses ::: :::: ::: my-r-code-container ```{r} #| label: tbl-grass-age-group1 #| tbl-cap: "Proportion of marijuana use of the participant by age group in percent of total responses in the NHANES 2013 survey" nhanes_2013_grouped <- base::readRDS("data/chap01/nhanes_2013_grouped.rds") nhanes_2013_grouped |> labelled::set_variable_labels( grass = "**Ever used grass or hash?**" ) |> gtsummary::tbl_cross( label = list( age_cat = "Age Category", sex = "Sex" ), row = age_cat, col = grass, percent = "cell" ) |> gtsummary::modify_header( stat_1 = '**Yes**', stat_2 = '**No**', stat_0 = '**Total**' ) |> gtsummary::bold_labels() ``` ------------------------------------------------------------------------ I have required {**gtsummary**} to calculate the percentage with one decimal place. The reason was that the graphic did not match the table. For instance: Without decimal places the "No"-proportion of the age group 18-29 was with 13% equal to the "Yes"-proportion of the age group 30-39. But in the figure these two bars have a different height. The reason was a rounding error: Both values 13.3% and 12.6% were rounded to 13%. But in the figure one could see the difference of 0.7% in the length of the bars. ::: :::::: :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-fig-grass-age-group1} : Figure: Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses ::: :::: ::: my-r-code-container ```{r} #| label: fig-grass-age-group1 #| fig-cap: "Percentage of NHANES participants by age group answering yes and no to marijuana use in percent of total responses" ## age groups bar chart 1 ######## nhanes_2013_grouped |> ggplot2::ggplot(ggplot2::aes( x = age_cat, fill = grass, y = 100 * ggplot2::after_stat(count) / base::sum(ggplot2::after_stat(count)) ) ) + ggplot2::geom_bar(position = 'dodge') + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c('darkgreen', 'darkred'), name = "Ever used\ngrass or hash?") + ggplot2::labs(x = "Age group (in years)", y = "Percent of total responses") ``` ::: :::::: ###### Age 2 :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-grass-age-group2} : Table:: Percentage of NHANES 2013 survey participants by age group answering yes and no to marijuana use in percent of age group ::: :::: ::: my-r-code-container ```{r} #| label: tbl-grass-age-group2 #| tbl-cap: "Proportion of marijuana use of the participant by age group in the NHANES 2013 survey" nhanes_2013_grouped <- base::readRDS("data/chap01/nhanes_2013_grouped.rds") nhanes_2013_grouped |> labelled::set_variable_labels( grass = "**Ever used grass or hash?**" ) |> gtsummary::tbl_cross( label = list( age_cat = "Age Category", sex = "Sex" ), row = age_cat, col = grass, percent = "row", margin = "column", digits = c(0, 1) ) |> gtsummary::modify_header( stat_1 = '**Yes**', stat_2 = '**No**', stat_0 = '**Total**' ) |> gtsummary::bold_labels() ``` ::: :::::: :::::: my-r-code :::: my-r-code-header ::: {#cnj-fig-grass-age-group2} : Figure: Percentage of NHANES participants of the 2013 survey answering yes and no to marijuana use in percent of age group ::: :::: ::: my-r-code-container ```{r} #| label: fig-grass-age-group2 #| fig-cap: "Percentage of NHANES participants of the 2013 survey answering yes and no to marijuana use in percent of age group" ## age groups bar chart 2 ######## nhanes_2013_grouped |> dplyr::group_by(grass, age_cat) |> dplyr::count() |> dplyr::group_by(age_cat) |> dplyr::mutate(perc_grass = 100 * n / base::sum(n)) |> ggplot2::ggplot(ggplot2::aes(x = age_cat, fill = grass, y = perc_grass)) + ggplot2::geom_col(position = 'dodge') + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c('darkgreen', 'darkred'), name = "Ever used\ngrass or hash?") + ggplot2::labs(x = "Age group (in years)", y = "Percent of responses in age group") ``` ::: :::::: ###### Age & Sex :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-grass-age-sex} : Percentage of NHANES participants 2013 survey by age group and sex answering yes and no to marijuana use ::: :::: ::: my-r-code-container ```{r} #| label: fig-grass-age-sex #| fig-cap: "Proportion of marijuana use of the participant by age group and sex in the NHANES 2013 survey" nhanes_2013_grouped <- base::readRDS("data/chap01/nhanes_2013_grouped.rds") ## age and sex bar chart ######## nhanes_2013_grouped |> dplyr::group_by(grass, age_cat, sex) |> dplyr::count() |> dplyr::group_by(age_cat, sex) |> dplyr::mutate(perc_grass_2 = 100 * n / base::sum(n)) |> ggplot2::ggplot(ggplot2::aes( x = age_cat, y = perc_grass_2, fill = sex)) + ggplot2::geom_col(position = 'dodge') + ggplot2::theme_minimal() + ggplot2::scale_fill_manual(values = c('#f98e31', '#730b6d'), name = "sex") + ggplot2::labs( x = "Age group (in years)", y = "Percent of males/females that used marijuana or hashish" ) ``` ------------------------------------------------------------------------ With the exception of the oldest age group the graphic shows that the proportion of marijuana / hashish user increases with younger people. One possible explanation for the untypical high use in the oldest age group could be that people between 50-59 of the 2013 survey were born between 1954-1963 and were young adults during the time of the [Flower Power](https://en.wikipedia.org/wiki/Flower_power) generation part of the [counterculture of the 1960s](Counterculture%20of%20the%201960s)-1970s. ::: :::::: @fig-grass-age-sex is a little difficult to interpret. Both bars colors shows the proportion of "Yes"-respondents in the appropriate age group. For instance: About 62% of males and about 51% of women of the age group 18-29 said "Yes". Therefore the bars to not sum up to 100%. The figures shows only the "Yes" part of the dichotomous `grass` variable for men and women in a certain age group. The other "No" part is missing. To distinguish the sex colors in @fig-grass-age-sex from the Yes/No colors in @fig-grass-age-group1 and @fig-grass-age-group2 I chose at first some kind of blue for men and red for females. But I learned from [An alternative to pink & blue: Colors for gender data](https://blog.datawrapper.de/gendercolor/) that one should avoid this stereotypical colors. > In our western culture, these colors come with the whole gender stereotype baggage. Pink means weak, shy girls who play with dolls and don’t deserve as much as boys. Blue stands for boys who need to be strong & rough. **When we create a chart with pink & blue, we endorse gender stereotypes**. ::::::::::::::::::::::::::: :::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::: ### Data interpretation {#sec-chap01-data-interpretation} ::: {#rep-chap01-gss-nhanes .callout-tip} ###### Marijuana usage and opinions to legalize it (`r glossary("GSS")` and `r glossary("NHANES")` data) @fig-chap01-repr-fig-1-2 with `r glossary("GSS")` data 2016 shows that younger people support the legalizing of marijuana. Therefore we can expect that in the future the proportion of people that approve legalizing will increase. With a small project on my own I have backed up this thesis. @fig-gss-grass-timeline demonstrates that the percentage of supporters has increased very strongly in the last years and has now hit the 70% mark. This trend is very clear even I have not distinguished in the longitudinal study between different age groups. The graphs of the `r glossary("NHANES")` 2013 data in @fig-grass-age-group1 and fig-grass-age-group2 show that the highest proportion of people ever used marijuana or hashish lies in youngest age. The age trend is not very clear, because we have only figures for person equal or under 59 years. (Here would be necessary another small project with longitudinal analysis including newer data.) With the exception of the age group 40-49 the proportion of males ever used in their respective age group is higher than the percentage of women. (See @fig-grass-age-sex). But this is only a snapshot and difficult to interpret. Another possible thesis could be that there is a tiny upward trend from old top young in marijuana use. One possible explanation for the untypical higher use in the oldest age group could be that people between 50-59 of the 2013 survey were born between 1954-1963 and were young adults during the time of the [Flower Power](https://en.wikipedia.org/wiki/Flower_power) generation part of the [counterculture of the 1960s](Counterculture%20of%20the%201960s)-1970s. ::: ------------------------------------------------------------------------ ## Get the GSS data {#sec-chap01-get-gss-data} ### Working with the GSS I am very interested how to get `r glossary("GSS")` data directly from the GSS website, so that I could work on interesting research questions myself. I have found several resources helping to work with the `r glossary("GSS")`. The most important one is the {**gssr**} package (see: @sec-gssr) :::::: my-resource :::: my-resource-header ::: {#lem-chap01-asdf} : asdfree: Analyze Survey Data for Free ::: :::: ::: my-resource-container [Analyze Survey Data for Free](http://asdfree.com/) is a bookdown website by [Anthony Damico](https://www.youtube.com/@anthonyjosephdamico/playlists) with currently 64 locations to grab free survey data. As expected it features also a [description of the GSS](http://asdfree.com/general-social-survey-gss.html) including analysis examples with the {**survey**} package and --- especially important for my purpose here --- {**lodown**}, a [package on GitHub](https://github.com/ajdamico/lodown) that is not maintained anymore. ::: :::::: @def-chap01-get-gss-data features five different strategies to download GSS data: 1. Download extract by using the GSS Data Explorer --- Tab: "Explorer" 2. Download files manually --- Tab: "by hand" 3. Download files programmatically --- Tab: "automated" 4. Download via the {**lodown**} package --- Tab: "lodown" 5. Download via the {**gssr**} package --- Tab: "gssr" ### Five strategies to get GSS data ::::::::::::::::::::::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-chap01-get-gss-data} : How to get the General Social Survey (GSS) data ::: :::: :::::::::::::::::::::::::::::::::::::::: my-experiment-container ::::::::::::::::::::::::::::::::::::::: panel-tabset ###### Explore To use all the facilities of the GSS Data Explorer (tagging, tabulating, data extracting) you need to register for a free account. The good thing is: This is a onetime procedure. :::::: my-procedure :::: my-procedure-header ::: {#prp-chap01-get-gss-data-explorer} : Downloading data extracts with the GSS Data Explorer ::: :::: ::: my-procedure-container 1. Create a free account for the [GSS Data Explorer](https://gssdataexplorer.norc.org/), a tool that allows to browse the data that have been collected in the surveys. - Fill out the form - Wait for an email with the verification code - Confirm the registration with the verification code 2. Go for the tour to learn the interface (Link "Tour Guide") 3. Now you are ready to follow the advises in the slides. If you prefer you can view the slide show in a [standalone browser](https://petzi53.quarto.pub/gss-data-explorer/#/title-slide). ::: :::::: <iframe width="650" height="400" class="slide-deck" src="https://petzi53.quarto.pub/gss-data-explorer/#/title-slide"> </iframe> As one can see this is a somewhat cumbersome procedure to download the desired data. Following the proposed strategies in the other tabs are much easier for importing GSS data. But using the GSS Data Explorer is very helpful to *explore* the dataset. Apply the first three steps of the above list to find the correct variable names, to read the exact wording of the question asked and to inspect the different codes that are used for the variable. Otherwise you have to skim the more than 700 pages of the GSS codebook.😭 ###### by hand Another approach is to download the complete dataset (or all variables of those years you are interested in) and manage the data in such a way that it can be easily used for your research question. (See @sec-chap01-data-wrangling) :::::: my-procedure :::: my-procedure-header ::: {#prp-chap01-get-gss-data-manually} : Download GSS individual year data sets (cross-section only) ::: :::: ::: my-procedure-container 1. Visit <https://gss.norc.org/Get-The-Data> and choose under the section "Download the Data" the "STATA" format. I read elsewhere that this is the preferred format to convert the data into R with the {**haven**} package. 2. From the [STATA-page](https://gss.norc.org/get-the-data/stata) choose the appropriate link (`2016` in our case) under the section "Individual Year Data Sets (cross-section only)" and download the file `2016_stata.zip` (994 MB) into your preferred folder on your hard disk. After you extract the .zip file you will get the STAT file `GSS2016.dta` (4.4 MB). 3. You can now apply @cnj-chap01-import-stata-data. ::: :::::: :::::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-import-stata-data} : Import STATA GSS 2016 file into R using (**haven**) ::: :::: ::::::: my-r-code-container ::: {#lst-chap01-import-stata-2016-file} ```{r} #| label: import-stata-2016-file #| cache: true #| eval: false ## run only once (manually) ########### gss_2016_man <- haven::read_dta("data/chap01/GSS2016.dta") save_data_file("chap01", gss_2016_man, "gss_2016_man.rds") ``` (*This R code chunk has no visible output*) ```{r} #| label: glance-data-stata-file gss_2016_man <- base::readRDS("data/chap01/gss_2016_man.rds") gss_2016_man |> dplyr::select(c(age, grass)) |> my_glance_data(N = 8, seed = 2016) ``` Import STATA GSS 2016 file into R using (**haven**) and glance at the data ::: ::::: my-important ::: my-important-header {**haven**} imports data as labelled vectors ::: ::: my-important-container The data structure we have found here is very different from the Excel data file provided with the book. ::: ::::: Labelled vectors is a completely new feature for me. I learned that value labels and other metadata tags that are commonly seen when working with other statistical software like SAS, STATA or SPSS (cf. [Survey Research and Datasets in R](https://socialresearchcentre.github.io/r_survey_datasets/), here section 3 [Labelled Data](https://socialresearchcentre.github.io/r_survey_datasets/labelled-data.html)) > A labelled vector is a common data structure in other statistical environments, allowing you to assign text labels to specific values. This class makes it possible to import such labelled vectors in to R without loss of fidelity. ([Create a labelled vector](https://haven.tidyverse.org/reference/labelled.html)) I will go into more details in @sec-chap01-labelled-data. The important thing here is to notice that the variable `grass` has labelled values that explain the short code. Code `1` represents the respondent option that marijuana should be legalized and `2` the opposite. We also learn that there is with `NA i` a special kind of `NA` value: > .i: Inapplicable (IAP). Respondents who are not asked to answer a specific question are assigned to IAP. (See [Alert on the STATA download page](https://gss.norc.org/get-the-data/stata)) On the website we see under the "Alert" section that there other kind of NA’s as well. And the 2022 GSS Codebook describes still other, less common missing values. ::::::: :::::::::: **Additional comments** I chose for file saving the `base::saveRDS()` option (and not `base::save()`) because when later reading into R again with `base::readRDS()` it does not overwrite a variable with the same name respectively I can assign the file to another variable name. ###### automated :::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-get-gss2016-data} : Get year 2016 of GSS data set with `base::download.file()` ::: :::: ::::: my-r-code-container ::: {#lst-chap01-get-gss2016-automated} ```{r} #| label: get-gss2016-automated #| code-fold: show #| cache: true #| eval: false ## run only once (manually) ########### temp <- base::tempfile() utils::download.file("https://gss.norc.org/documents/stata/2016_stata.zip",temp) gss_2016_aut <- haven::read_dta(base::unz(temp, "GSS2016.dta")) base::unlink(temp) save_data_file("chap01", gss_2016_aut, "gss_2016_aut.rds") ``` Get year 2016 of GSS data set with `base::download.file()` ::: (*This R code chunk has no visible output*) This time we have the file downloaded programmatically which is much better in term of reproducibility. We don't need now to import the data {**haven**} but can call base::readRDS(). ::: {#lst-chap01-read-rds-data} ```{r} #| label: read-rds-data gss_2016_aut <- base::readRDS("data/chap01/gss_2016_aut.rds") gss_2016_aut |> dplyr::select(c(age, grass)) |> my_glance_data(N = 8, seed = 2016) ``` Read previously saved `.rds` file into R and glance at the data ::: ::::: :::::::: Data now have a more R like appearance, even if the variable classes with <hvn_lbll> "*`r class(gss_2016_aut$age)`*" are unfamiliar. But we have now lost some information, especially we have to consult the codebook to know what the codes `1` and `2` mean. ###### lodown The following strategy I have taken from the bookdown book [Analyze Survey Data for Free](http://asdfree.com/) (asdf.com) It gives step by step instructions to explore public `r glossary("Microdata")`. Here I refer to the `r glossary("General Social Survey")` (GSS) section of the book. :::::: my-procedure :::: my-procedure-header ::: {#prp-chap01-get-gss2016-lodown} : Get the GSS data with the {**lodown**} package and glance at the data ::: :::: ::: my-procedure-container Working with {**lodown**} is a three step procedure: 1. Retrieve a listing of all available extracts for the GSS data. 2. Choose what files you want to download. In our case data for the year 2016. 3. Download the specified dataset in the offered SPSS file format, but {**lodown**} produces with `.rds` a native R file format with the name `2016.rds`. ::: :::::: The second step has to be done manually but I have the result of my inspection already integrated in @lst-chap01-get-gss2016-lodown. As additional steps I renamed the downloaded file, so that I can distinguish it from similar files of the other approaches. Finally I glanced at the `grass` and `age` data. ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-get-gss2016-lodown} : Get GSS data via {**lodown**} package ::: :::: :::: my-r-code-container ::: {#lst-chap01-get-gss2016-lodown} ```{r} #| label: get-gss2016-lodown #| eval: false #| cache: true #| results: hold ## can't suppress messages, tried several options ## run only once (manually) ############ my_folder <- base::paste0(here::here(), "/data-raw") # (1) retrieve a listing of all available extracts for the GSS data gss_cat <- lodown::get_catalog(data_name = "gss", output_dir = my_folder, "GSS") |> ## (2) choose the catalog part to download dplyr::filter( output_filename == base::paste0(my_folder, "/2016.rds")) ## (3) download the GSS microdata as 2016.rds lodown::lodown("gss" , gss_cat) ## rename dataset to distinguish from other download approaches old_filename <- base::paste0(my_folder, "/2016.rds") new_filename <- base::paste0(my_folder, "/gss_2016_cat.rds") base::file.rename(from = old_filename, to = new_filename) ``` (*This R code chunk has no visible output beside the warning message.*) ```{r} #| label: glance-data-gss-cat ## load and glance at data gss_2016_cat <- base::readRDS("data/chap01/gss_2016_cat.rds") gss_2016_cat |> dplyr::select(c(age, grass)) |> my_glance_data(N = 8, seed = 2016) ``` Get GSS data for the year 2016 via the {**lodown**} package ::: :::: ::::::: The result is a pure `.rds` file where the columns are still of class "*`r class(gss_2016_cat$grass)`*" as in @lst-chap01-read-rds-data. ###### gssr Finally I will download the 2016 data with the help of the {**gssr**} package (see @sec-gssr). This takes some minutes. At first I missed the vignette, so I had to download the package again with the additional argument `build_vignettes = TRUE`. Whereas the vignette explains how to analyse data the GitHub is very helpful how to get the desired data. > You can quickly get the data for any single GSS year by using `gssr::gss_get_yr()` to download the data file from NORC and put it directly into a tibble. After downloaded the file we can --- as in the other tabs already done --- load the file and glance at the grass/age data. ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-get-gss2016-gssr} : Get GSS 2016 Year Data Set (cross-section only) and glance at the data ::: :::: :::: my-r-code-container ::: {#lst-chap01-get-gss2016-gssr-and-glance-at-data} ```{r} #| label: get-gss2016-gssr #| eval: false #| cache: true #| results: hold ## run only once (manually) #################### gss_2016_gssr <- gssr::gss_get_yr(year = 2016) gss_2016_gssr <- gss_2016_gssr |> dplyr::select(c(age, grass)) save_data_file("chap01", gss_2016_gssr, "gss_2016_gssr.rds") ``` ```{r} #| label: glance-data-gss2016 ## load data #################### gss_2016_gssr <- base::readRDS("data/chap01/gss_2016_gssr.rds") ## glance at data gss_2016_gssr |> my_glance_data(N = 8, seed = 2016) ``` Get GSS 2016 Year Data Set (cross-section only) and glance at the data ::: :::: ::::::: Downloading data with {**gssr**} results in exactly the same format as in listing @lst-chap01-import-stata-2016-file from the manual download. But it has now the advantages from the {**gssr**} package. For instance with the integrated help it is much easier to - find the variables - to read the question text of the variable - to see in which year the questions was asked - what the code - including the different types of NA’s mean ::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::::::: ::::: my-important ::: my-important-header Six different approaches to get the GSS data ::: ::: my-important-container Using the {**gssr**} packages seems to me by far the best approach. ::: ::::: ## Working with Labelled Data {#sec-chap01-labelled-data} ### How to work with labelled data? --- Resources During downloading GSS data I mentioned that {**haven**} imports data as labelled vectors. As a completely new feature for me I looked at the Internet for their advantages and how to work with them. I found four important resources: ::::::::: my-resource :::: my-resource-header ::: {#lem-chap01-survey-research-data} : How to work with labelled data? ::: :::: :::::: my-resource-container [Survey Research Datasets and R](https://socialresearchcentre.github.io/r_survey_datasets/) A book accompanying a workshop delivered for the 7th Biennial `r glossary("ACSPRI")` Social Science Methodology Conference. It provides a practical demonstration of several packages for accessing and working with survey data, associated metadata and official statistics in R. The short book demonstrates - Working with external data sources from common statistical packages (SPSS, SAS, Stata, Excel) and their quirks - Easily working with categorical data in R with the {**labelled**} R package - Accessing external databases in an R native way using {**DBI**} and {**dbplyr**} (The [DBI package](https://dbi.r-dbi.org/) helps connecting R to database management systems (DBMS). [dbplyr](https://dbplyr.tidyverse.org/) is a {**dplyr**} backend for data bases) - Accessing publicly available data in R scripts via the web - Resources for accessing official statistics data in R Here I am interested only in the first two demonstration. I will refer especially to the section on [Labelled data](https://socialresearchcentre.github.io/r_survey_datasets/labelled-data.html). ------------------------------------------------------------------------ ::::: my-watch-out ::: my-watch-out-header WATCH OUT! {**labelled**} and {**sjlabelled**} have similar purposes and functions ::: ::: my-watch-out-container {**labelled**} and {**sjlabelled**} are very similar. As far as I understand is the main difference that {**sjlabelled**} supports not only working with labelled data but also offers functions to benefit from these features. {**labelled**} in contrast manipulates labelled data only as an intermediate data structure that should be converted to common R classes, loosing these kind of meta-data. ::: ::::: :::::: ::::::::: ### Displaying data including labelled vectors {#sec-chap01-labelled-data-2} To get comfortable with labelled data @def-chap01-show-labelled-data I will show how labelled data appear in different viewing functions. To inspect the possibilities of the {**labelled**} package see @def-chap01-use-labelled-haven-package. :::::::::::::::::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-chap01-show-labelled-data} : Different types of data views with labelled data ::: :::: ::::::::::::::::::::::::::::::::::: my-experiment-container :::::::::::::::::::::::::::::::::: panel-tabset ###### str() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-str-labelled-data} : Display the internal structure of labelled data ::: :::: ::: my-r-code-container ```{r} #| label: str-labelled-data utils::str(gss_2016_gssr) ``` ::: :::::: ###### attributes() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-attributes-labelled-data} : Show attributes of labelled data ::: :::: ::: my-r-code-container ```{r} #| label: attributes-labelled-data base::attributes(gss_2016_gssr$grass) ``` This works only with vectors (variables), but not with data frames. You can display specific list values with numbers (e.g., `attributes(gss_2016_gssr$grass)[[4]]`) or names (e.g., `attributes(gss_2016_gssr$grass)[["labels"]]`). ::: :::::: ###### head() ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-head-labelled-data} : Show first six records of the labelled variable ::: :::: :::: my-r-code-container ::: {#lst-chap01-head-labelled-data} ```{r} #| label: head-labelled-data head(gss_2016_gssr$grass) ``` Use `head()` to show the first six records of the labelled variable. ::: ------------------------------------------------------------------------ This works only with labelled vectors (variables), but not with data frames. :::: ::::::: ###### tibble() > One immediate advantage of labelled vectors is that value labels are used in data frame printing when using [tibble](https://tibble.tidyverse.org/) (and by extension the wider tidyverse) and other packages using the [pillar](https://cran.r-project.org/web/packages/pillar/index.html) printing methods. ([Survey Research Datasets and R](https://socialresearchcentre.github.io/r_survey_datasets/what-is-labelled-data.html)) ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-tibble-labelled-data} : Using `tibble` resp. `pillar` printing methods ::: :::: :::: my-r-code-container ::: {#lst-chap01-tibble-labelled-data} ```{r} #| label: tibble-labelled-data gss_2016_gssr |> dplyr::count(grass) ``` Using `tibble` resp. `pillar` printing methods to display NA's ::: :::: ::::::: ###### glimpse() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-glimpse-labelled-data} : Get a glimpse of labelled data ::: :::: ::: my-r-code-container ```{r} #| label: glimpse-labelled-data dplyr::glimpse(gss_2016_gssr) ``` ::: :::::: ###### print() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-print-labelled-data} : Print values as `tibble()` with labelled data ::: :::: ::: my-r-code-container ```{r} #| label: print-labelled-data base::print(gss_2016_gssr) ``` ::: :::::: ###### my_glance_data() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-glance-labelled-data} : Glance randomly at some records with labelled data ::: :::: ::: my-r-code-container ```{r} #| label: glance-labelled-data gss_2016_gssr |> my_glance_data(N = 8, seed = 2016) ``` ::: :::::: ###### RStudio In RStudio you can see part of the labels by supplying additional information about NAs in R code chunks and variable labels in the RStudio viewer. ::: {#fig-rstudio-labelled-data layout-ncol="2"} ![Labelled data as a R code chunk result in RStudio](img/chap01/rstudio-r-chunk-labelled-data-min.png){#fig-rstudio-chunk} ![Labelled data in RStudio data viewer](img/chap01/rstudio-viewer-labelled-data-min.png){#fig-rstudio-viewer} RStudio display of labelled data ::: :::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::: `base::summary()` with data frames that contain labelled data generates an error when {**haven**} is not loaded: > ! `summary.haven_labelled()` not implemented. ::::: my-important ::: my-important-header Labelled meta data not visible in most base R viewing functions ::: ::: my-important-container With the exception of `base::str()` labelled meta data are not visible when viewing at the printed data. And even more important: You can't see the different types of NA's without specialized functions. "Tagged" (labelled) missing values behaves exactly like regular R missing values. For more information see @def-chap01-use-labelled-haven-package. ::: ::::: ### Working with labelled data {#sec-chap01-labelled-data-3} #### Two different approaches > The purpose of the labelled package is to provide functions to manipulate metadata as variable labels, value labels and defined missing values using the `haven_labelled` and `haven_labelled_spss` classes introduced in haven package. ([Introduction to labelled](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html)) Until now I have not worked in this book with `haven_labelled_spss` classes. This class comes from imports of SPSS data files via the {**haven**} package. > This class is only used when `user_na = TRUE` in `read_sav()`. It is similar to the `labelled()` class but it also models SPSS's user-defined missings, which can be up to three distinct values, or for numeric vectors a range. ([Labelled vector for SPSS](https://haven.tidyverse.org/reference/labelled_spss.html)) [^01-preparing-data-1] [^01-preparing-data-1]: To prevent conflict with the labelled class from the {**Hmisc**} package {**haven**} has changed with version 2.0.0 its label class names from `labelled` and `labelled_spss` to `haven_labelled` and `haven_labelled_spss`. (See [GitHub Issue #329](https://github.com/tidyverse/haven/issues/329).) As user-defined missing values are not important here I will stick with STATA imports. My main interest with @def-chap01-use-labelled-haven-package is to prepare and/or use the labelled data to work with R. There are principal two approaches: :::::: my-procedure :::: my-procedure-header ::: {#prp-chap01-labelled-data-in-r} : Two different procedures to work in R with labelled data ::: :::: ::: my-procedure-container **Approach A** > In approach A, `haven_labelled` vectors are converted into factors or into numeric/character vectors just after data import, using `unlabelled()`, `to_factor()` or `unclass()`. Then, data cleaning, recoding and analysis are performed using classic R vector types. 1. Data import 2. *Conversion to factor / numeric* 3. Data cleaning / data recoding 4. Data analysis **Approach B** > In approach B, `haven_labelled` vectors are kept for data cleaning and coding, allowing to preserved original recoding, in particular if data should be reexported after that step. Functions provided by labelled will be useful for managing value labels. However, as in approach A, `haven_labelled` vectors will have to be converted into classic factors or numeric vectors before data analysis (in particular modeling) as this is the way categorical and continuous variables should be coded for analysis functions. ([Introduction to labelled](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html)) 1. Data import 2. Data cleaning / data recoding (-\> Data export to SPSS / SAS / STATA ) 3. *Conversion to factor / numeric* 4. Data analysis [![Two different approaches to work in R with labelled data](img/chap01/labelled-approaches-min.png){#fig-chap01-two-approaches-with-labelled-data}](https://cran.r-project.org/web/packages/labelled/vignettes/intro_labelled.html) ::: :::::: #### Inspecting labelled data :::::::::::::::::::::::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-chap01-use-labelled-haven-package} : Inspecting labelled data ::: :::: ::::::::::::::::::::::::::::::::::::::::: my-experiment-container :::::::::::::::::::::::::::::::::::::::: panel-tabset ###### var_label() ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-labelled-var-label} : Get variable labels of a labelled data frame ::: :::: :::: my-r-code-container ::: {#lst-chap01-labelled-var-label} ```{r} #| label: labelled-var-label labelled::var_label(gss_2016_gssr) ``` Use `labelled::var_label()` to get variables labels of a data frame ::: The labels of the two variable specifies in more detail their content. For the `grass` variable we get the question asked in the survey. This is very helpful and saves consulting the codebook. ------------------------------------------------------------------------ :::: ::::::: I am using here the short form with `labelled::var_label()` but there exist also with `labelled::get_variable_labels()` a long form which has with it plural form a more consistent syntax. ###### val_labels() ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-labelled-val-labels} : Get value labels of data frame variables ::: :::: :::: my-r-code-container ::: {#lst-chap01-labelled-val-labels} ```{r} #| label: labelled-val-labels labelled::val_labels(gss_2016_gssr) ``` Use `labelled::val_labels()` to get value labels of data frame variables ::: :::: ::::::: Important is the start of the list of value labels. - **age**: It shows that the value `89` of variable `age` includes values of `89 and older`. This is important for the analysis and saves the work of recoding as done in 1. - **grass**: For the `grass` variable we learn that `1` stands for the opinion that Marijuana "should be legal" and two for the opposite. What follows is a comprehensive list of NA values used in the survey, even if many of these values are not used for the two variables considered here. We know from @cnj-chap01-tibble-labelled-data that the `grass` variable has 1024 NA's, but we do not know their composition of different NA values. See @lst-chap01-haven-na-tag how to get this information. ------------------------------------------------------------------------ ::::: my-watch-out ::: my-watch-out-header WATCH OUT! Inconsistency in the function names ::: ::: my-watch-out-container There is an inconsistency in the singular vs. plural form between `labelled::var_label()` and `labelled::val_labels()`. Both functions show a *list of values* (= plural) if there are more than one label available. For value labels exist also a singular version with `labelled::val_label()` to get a specific value label of a variable. ::: ::::: ###### head() ::::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-labelled-head} : Using `utils::head()` on a labelled variable after {**labelled**} is loaded ::: :::: :::::: my-r-code-container ```{r} #| label: labelled-head ## {**labelled**} is loaded through a call in the previous chunk utils::head(gss_2016_gssr$grass, 10) ``` ------------------------------------------------------------------------ Using `utils::head()` on a variable when {**labelled**} is loaded prints a nicely formatted summary of the attached metadata, excluding formats. It shows all NA value labels in the data frame used in the survey, even only some of them is used here. (See @lst-chap01-haven-na-tag to get type of NA's used for a variable.) We see in the first 10 records of the variable `grass` - three `1` ("should be legal") - three `2` ("should not be legal") and - four `NA(i)` ("iap") Until now it is not clear what "iap" for the `NA(i)` type means. If we want [download (STATA) files directly from the GSS website](https://gss.norc.org/get-the-data/stata) we see on the right side a note about the most important `NA` types. It explains the the abbreviation "iap" stands for "inapplicable". [![GSS Information about the most important NA types](img/chap01/na-gss-website-note-min.png){#fig-chap01-gss-na-note fig-alt="An \"Alerts\" explains that there are three main missing values in the data; \".i\": Inapplicable (IAP). Respondents whor are not asked to answer a specific question are assigned to IAP.\".d\": Don't know (DK). \".n\": No answer (NA)." fig-align="center"}](https://gss.norc.org/get-the-data/stata) ::::: my-note ::: my-note-header Two different meanings of "NA" ::: ::: my-note-container With the `NA(n)` type of missing value "NA" does not mean "Not Applicable" but "No answer". ::: ::::: Returning only the `NA(i)` type of a missing value does not mean that there aren't any other NA types in the data. (See @lst-chap01-labelled-print-unique-tagged to get the types of tagged NA's of a variable.) :::::: ::::::::: This formatted result of labelled data is better readable as with @lst-chap01-labelled-val-labels, because category and value are in the same line. Compare the result of this code ({**labelled**} is loaded) with @lst-chap01-head-labelled-data ({**labelled**} is not loaded). ::::: my-note ::: my-note-header haven::print_labels() as equivalent of head() with {**labelled**} loaded ::: ::: my-note-container With the exception of the specified rows in the `head()` function, you can get the same nicely formatted list of NA types with `haven::print_labels()`. (See next tab "Print labels".) ::: ::::: ###### Print labels :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-haven-print-labels} : Print labels ::: :::: ::: my-r-code-container ```{r} #| label: haven-print-labels haven::print_labels(gss_2016_gssr$grass) ``` ::: :::::: For a detailed explication see previous tab "head()". ###### Unique NA's ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-labelled-print-unique-tagged} : Print unique tagged NA's ::: :::: :::: my-r-code-container ::: {#lst-chap01-labelled-print-unique-tagged} ```{r} #| label: labelled-print-unique-tagged labelled::unique_tagged_na(gss_2016_gssr$grass) |> labelled::sort_tagged_na() |> haven::print_tagged_na() ``` Show the unique types of NA's of a variable ::: ------------------------------------------------------------------------ We see that variable `grass` has three types of NA's: `NA(d)`: Don't know. `NA(i)`: Inapplicable, e.g., these respondents were not asked the marijuana question. `NA(n)`: No answer Here I have used `haven::print_tagged_na()` to format tagged NA's as `NA(a)`, `NA(b)`, etc. (See ["Tagged" missing values](https://haven.tidyverse.org/reference/tagged_na.html)) :::: ::::::: It would be interesting to know the composition of these different NA types. One could reason that there is a big difference between "question not asked" and "no answer". (See @lst-chap01-haven-na-tag to get the composition of the NA types.) ###### NA composition ::::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-haven-na-tag} : Get the composition of the different types of tagged NA's ::: :::: :::: my-r-code-container ::: {#lst-chap01-haven-na-tag} ```{r} #| label: haven-na-tagged gss_2016_gssr |> dplyr::count( grass, haven::na_tag(grass) ) ``` Get the composition of the different types of tagged NA's ::: ------------------------------------------------------------------------ We already know from @lst-chap01-tibble-labelled-data that the variable `grass` has 1024 NA's. Now we know also the type composition of these NA's. :::: ::::::: I could not find a similar function in {**labelled**}, so I have used here `haven::na_tag()`. :::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::: #### Working with labelled data There are two ways to work with labelled data in R: > The goal of haven is not to provide a labelled vector that you can use everywhere in your analysis. The goal is to provide an intermediate datastructure that you can convert into a regular R data frame. You can do this by either converting to a factor or stripping the labels ({**Haven**} vignette [Conversion semantics](https://haven.tidyverse.org/articles/semantics.html)) ::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-conversion-labelled-data} : How to convert labelled data for data analysis in R ::: :::: :::::::::::::::::::: my-experiment-container The second variable `grass` is the labelled original variable. The third variable (last column to the right) is the modified `grass` variable. ::::::::::::::::::: panel-tabset ###### as_factor() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-haven-as-factor} : Convert labelled vectors to factors ::: :::: ::: my-r-code-container ```{r} #| label: haven-as-factor gss_2016_grass <- gss_2016_gssr |> dplyr::select(grass) gss_2016_grass |> dplyr::mutate(grass_factor = haven::as_factor(grass)) |> my_glance_data(N = 8, seed = 42) ``` Here I am using `haven::as_factor()`. This function also knows how to re-label missing values. ::: :::::: ###### zap_labels() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-haven-zap-labels} : Remove labels ::: :::: ::: my-r-code-container ```{r} #| label: haven-zap-labels gss_2016_grass |> dplyr::mutate(grass_unlabelled = haven::zap_labels(grass)) |> my_glance_data(N = 8, seed = 42) ``` ::: :::::: ###### zap_missing() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-haven-zap-missing} : Convert special missings to regular R missings ::: :::: ::: my-r-code-container ```{r} #| label: haven-zap-missing gss_2016_grass |> dplyr::mutate(grass_rgular_na = haven::zap_missing(grass)) |> my_glance_data(N = 8, seed = 42) ``` ::: :::::: ###### zap_label() :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-haven-zap-label} : Remove variable labels from a data frame ::: :::: ::: my-r-code-container ```{r} #| label: haven-zap-label #| results: hold print("******* Original data frame ********") labelled::var_label(gss_2016_gssr) print("******* Data frame after variable zapped *******") gss_2016_gssr |> haven::zap_label() |> labelled::var_label() ``` ::: :::::: ::::::::::::::::::: You can combine all three zap-commands. :::::::::::::::::::: ::::::::::::::::::::::: ## Table excursion ### Review several table packages {#sec-chap01-review-table-packages} Creating tables is an important result of preparing data --- the main subject of this section. In my research I came up with eight packages that seems to me the most important ones. What follows is a short introduction into these packages. ::::::::::::::::::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-chap01-review-table-packages} : Review table packages: {**DT**}, {**flextable**}, {**gt**}, {**gtsummary**}, {**htmlTable**}, {**kableExtra**}, {**tableone**}, {**xtable**} ::: :::: :::::::::::::::::::::::::::::::::::: my-experiment-container ::::::::::::::::::::::::::::::::::: panel-tabset ###### DT [{**DT**}](https://rstudio.github.io/DT/) DT is an abbreviation of 'DataTables.' Data objects in R can be rendered as HTML tables using the JavaScript library 'DataTables' (typically via R Markdown or Shiny). This R package provides an enhanced version of data.frames with more options for authors and users: **Authors**: DT supports data wrangling beside subsetting rows and selecting columns (e.g., computing on columns, sort, grouping). I am not sure if this is really an advantage because it requires the somewhat complex square brackets notation. Using {**dplyr**} with other {**tidyverse**} packages seems --- for me at least --- easier to use. **Users**: The really advantage of {**DT**} in my opinion is that the interface allows users to change values and display formats interactively. There are some option in the standard installations and much more with extensions. - **Standard**: Out of the box {**DT**} allows user to - search, - sort, - choose the number of rows shown per page, - filter columns - add/remove row names and even - edit cells individually. - **Extensions**: There are also some extension to enhance the interactive options: Extensions allow users to - edit groups of cells, - provide buttons for copying or downloading the table in different formats, - to re-order and/or hide columns, - to fix columns and/or header if the table is too wide and/or to long and you have to scroll, - moving around with arrow keys like in Excel tables, - making the table responsive to smaller screens by collapsing columns, - grouping values by a column, - reordering rows, - scrolling through the rows and - selecting parts of variables interactively. :::::: my-remark :::: my-remark-header ::: {#rem-chap01-dt} My conclusion to {**DT**} ::: :::: ::: my-remark-container {**DT**} is my first choice for interactive data exploring and very useful for Explorative Data Analysis (`r glossary("EDA")`). I have already some experiences with this package but haven't use it lastly. Compared with the other table packages {**DT**} is not in the same category: The following packages are developed as `r glossary("Display Tables")` (in contrast to `r glossary("Data Tables")`. Beside of the interactive use {**DT**}s are Data Tables. ::: :::::: ###### flextable [{**flextable**}](https://davidgohel.github.io/flextable/) This package helps you to create reporting table from a data frame easily. You can - merge cells, - add headers, - add footers, - change formatting, - set how data in cells is displayed - Table content can also contain mixed types of text and image content. - Tables can be embedded from R Markdown documents into HTML, PDF, Word, and PowerPoint documents and can be embedded using Package Officer for Microsoft Word or PowerPoint documents. - Tables can also be exported as R plots or graphic files, e.g., png, pdf, and jpeg. It is said that {**flextable**} --- together with the sparsely used [{**huxtable**}](https://hughjonesd.github.io/huxtable/) --- "supports the widest range of output formats ... \[both\] support HTML, {{< latex >}}, and Office formats, and contain most common table features (e.g., conditional formatting)." (Yihui: [R Markdown Coobook](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html)) :::::: my-remark :::: my-remark-header ::: {#rem-chap01-flextable} My conclusion to {**flextable**} ::: :::: ::: my-remark-container {**flextable**} seems pretty well suited to create sophisticated tables for publications. It is often especially mentioned for exporting into Word files. Maybe I will look into this package in the near future, but currently my priority is {**gtsummary**}. ::: :::::: ###### gt [{**gt**}](https://gt.rstudio.com/) {**gt**} stands for "grammar of tables". It aims to do for tables what {**ggplot2**} did for graphics. ([Hadley Wickham on X](https://twitter.com/hadleywickham/status/1247997744469295104?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1247997744469295104%7Ctwgr%5Edb446e88f95e2f0e4a5e791da84f82d626ceee1c%7Ctwcon%5Es1_&ref_url=https%3A%2F%2Fcdn.embedly.com%2Fwidgets%2Fmedia.html%3Ftype%3Dtext2Fhtmlkey%3Da19fcc184b9711e1b4764040d3dc5c07schema%3Dtwitterurl%3Dhttps3A%2F%2Ftwitter.com%2Fhadleywickham%2Fstatus%2F1247997744469295104image%3D)) {**gt**} offers a different and easy-to-use set of functions that helps to build display tables from tabular data. The {**gt**} philosophy states that a comprehensive collection of table parts can be used to create a broad range of functional tables. It allows you to compose a table by putting together different parts of the table, such as - the table header (title and subtitle), - the column labels, - the table body, - row group labels, - and the table footer. Some parts are optional. You can also format numbers and add background shading to cells. [![Part of a {gt} table](https://gt.rstudio.com/reference/figures/gt_parts_of_a_table.svg){#fig-gt-table-parts fig-alt="The table shows all the different parts of a gt table: The TABLE HEADER consists of TITLE and SUBTITLE. The next part is called STUB HEAD and consists of STUBHEAD LABEL on the left, above the row names and SPANNER COLUMN LABEL with COLUMN LABELs below. The third part is called STUB and consists of ROW GROUP LABEL, ROW LABELs and SUMMARY LABEL on the left. On the right side are the values of the table in cells respectively summary cells. The last part is called TABLE FOOTER and consists of FOOTNOTES and SOURCE NOTES."}](https://gt.rstudio.com/index.html). At the beginning of the development process {**gt**} supported only HTML, but now you can also use it for LATEX and RTF outputs. {**gt**} is developed by the RStudio/Posit team, that stands not only for high quality but also for {**tidyverse**} compatibility. It was a little surprise for me that the download statistics does not reflect the normally very high figures of RStudio/Posit products. After trying to work with {**gt**} I believe I know the reasons: - With {**gt**} you have big flexibility but you have to provide all the details. This could be cumbersome, especially when other table packages are providing not only theses details but incorporate also many other features as calculations, tests etc. {**gt**} is therefore better used in combination with other packages like {**gtsummary**}. - {**gt**} is still very early in the development process. The issue tracker show this in the relation of open / closed issue: 272 open / 633 closed (2024-01-18). :::::: my-remark :::: my-remark-header ::: {#rem-chap01-gt} My conclusion to {**gt**} ::: :::: ::: my-remark-container I tried to work with {**gt**} but it is cumbersome. I noticed a [bug even following the official online tutorial](https://github.com/rstudio/gt/issues/1535). Maybe {**gt**} (together with {**gtsummary**} is better suited for creating easily complex tables in many formats. ::: :::::: ###### gtsummary The {**gtsummary**} package provides an efficient way to create publication-ready analytical and summary tables using the R programming language. {**gtsummary**} summarizes data sets, regression models, and more, using sensible defaults with highly customizable capabilities. {**gtsummary**} "was written to be a companion to the {**gt**} package from RStudio. But not all output types are supported by the {**gt**} package. Therefore, \[the developers\] have made it possible to print {**gtsummary**} tables with various engines." ([gtsummary + Quarto/R Markdown](https://www.danieldsjoberg.com/gtsummary/articles/rmarkdown.html)) [![Summary of the various Quarto and R Markdown output types and the print engines supporting them](img/chap01/gtsummary-output-formats-min.png){#fig-chap01-gtsummary-outputs fig-alt="Table of summaries for output formats: The table header reads from left to the right: Print Engine, Function, HTML, Word, PDF, RTF. As print engines feature in this table: gt, flextable, huxtable, kableExtra, kable and tibble. All functions start with 'as_' followed by the name of the package, e.g., as_gt(), as_flex_table(), as_kable_extra() etc. The table states that gt is still working on RTF output, flextable and kableExtra does not support RTF, kableExtra also has no output format for Word. Kable has all four options in a very reduced form (missing indentation, footnotes and spanning headers) and tibble fails on all four formats." fig-align="center"}](https://www.danieldsjoberg.com/gtsummary/articles/rmarkdown.html) :::::: my-remark :::: my-remark-header ::: {#rem-chap01-gtsummary} My conclusion to {**gtsummary**} ::: :::: ::: my-remark-container I have tried out {**gtsummary**} and I am enthusiastic about it! I wrote an introduction with some examples (@sec-chap01-experiments-gtsummary) Now that I understand how it works I will definitely use it. (By the way: Besides the extensive documentation I found that the most effective introduction for me was watching the [Youtube video presentation](https://www.youtube.com/watch?v=tANo9E1SYJE) by the developer Daniel D. Sjoberg). {{< video https://www.youtube.com/watch?v=tANo9E1SYJE >}} ::: :::::: ###### htmlTable [{**htmlTable**}](https://cran.r-project.org/web/packages/htmlTable/vignettes/complex_tables.html) I found {**htmlTable**} per accident searching the Internet with "r creating complex table". This package is not mentioned in the literature I have read so far, but it seems not only able to produce complex tables but is also used quite often. Maybe it is the HTML limitation which reduces it application for Online journals because it requires a separate step to produce {{< latex >}} tables ready for printing. It is a pretty old package (version 1.0 appeared already in 2014) but still currently in active development. The [last change](https://cran.r-project.org/web/packages/htmlTable/index.html) today (2024-01-18) was version 2.4.2 from 2023-10-29. From the package documentation: **Advanced Tables for Markdown/HTML** > Tables with state-of-the-art layout elements such as row spanners, column spanners, table spanners, zebra striping, and more. While allowing advanced layout, the underlying css-structure is simple in order to maximize compatibility with common word processors. The package also contains a few text formatting functions that help outputting text compatible with HTML/LaTeX. Beside the extensive vignette documentation there are also some blog articles: - [Introducing the htmlTable-package](https://gforge.se/2015/04/introducing-the-htmltable-package/) - [Tables form R into Word](https://gforge.se/2013/02/tables-from-r-into-word/) - [Fast-track publishing using knitr: table mania (part IV)](https://gforge.se/2014/01/fast-track-publishing-using-knitr-part-iv/) :::::: my-remark :::: my-remark-header ::: {#rem-chap01-htmltable} My conclusion to {**htmlTable**} ::: :::: ::: my-remark-container {**htmlTable**} seems an interesting package not only for creating complex tables but also for exporting them into Word. But to learn this package is at the moment not on my task list. ::: :::::: ###### kableExtra [{**kableExtra**}](https://haozhu233.github.io/kableExtra/) {**kableExtra**} extends the basic functionality of `knitr::kable()` tables. Although `knitr::kable()` is simple by design, it has many features missing which are usually available in other packages. {**kableExtra**} has filled the gap nicely. One of the best thing about {**kableExtra**} is that most of its table capabilities work for both HTML and PDF formats. :::::: my-remark :::: my-remark-header ::: {#rem-chap01-kableextra} My conclusion to {\*\*kableExtra\*} ::: :::: ::: my-remark-container I have already some experiences with {**kableExtra**}. I used it quasi as an add-on to `knitr::kable()` during personal HTML publications (Blog, bookdown, Quarto). I believe that other "Table 1" packages like {**gtsummary**} or maybe also {**tableOne**) are better suited for display and publication. ::: :::::: ###### tableone [{**tableone**}](https://cran.r-project.org/web/packages/tableone/index.html) The first time I learned about the existence of {**tableone**} was in @sec-chap02 of Statistics with R. Consulting the online documentation it says: > Creates "Table 1", i.e., description of baseline patient characteristics, which is essential in every medical research. Supports both continuous and categorical variables, as well as p-values and standardized mean differences. Weighted data are supported via the {**survey**} package. … Most important functions are `CreateTableOne()` and `svyCreateTableOne()`. :::::: my-remark :::: my-remark-header ::: {#rem-chap01-tableone} My conclusion to {**tableone**} ::: :::: ::: my-remark-container The package is completely new for me. I have to check it out to see how and when to use. But my priority now is {**gtsummary**} because - {**gtsummary**} is much more used than {**tableone**}: over 2000 versus under 500 downloads per day - last release of {**gtsummary**} was version 1.7.2 from July 15, 2023 versus release 26 from July 26, 2020. ::: :::::: ###### xtable [{**xtable**}](https://cran.r-project.org/web/packages/xtable/vignettes/xtableGallery.pdf) {**xtable**} is the oldest package and still the table package most used! From the documentation it is clear that it is very useful for the creation of publication ready tables. But I am little hesitant to learn it, because it is not mentioned in current literature and the last change was in April 2019. :::::: my-remark :::: my-remark-header ::: {#rem-chap01-xtable} My conclusion to {**xtable**} ::: :::: ::: my-remark-container Learning {**xtable**} is not on my agenda: It is not clear for me if {**xtable**} is still in developments and if it is supporting newer approaches. Learning other table packages, especially {**gtsummary**} probabily together with {**gt**} has priority for me. ::: :::::: ::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::: There are other packages as well. But the above eight packages are a first starting point for everybody to learn creating and displaying sophisticated data tables in R. It is good to know about different packages but lastly one has to decide which one to learn and use in daily practice. At the moment my favorites is {**gtsummary**}. :::::: my-resource :::: my-resource-header ::: {#lem-chap01-table-packages} : Other interesting table packages, but currently not considered for my personal purposes ::: :::: ::: my-resource-container Just for a possible later reference I will add some other table packages I know of. - [{**reactable**}](https://glin.github.io/reactable/): `reactable()` creates a data table from tabular data with sorting and pagination by default. The data table is an HTML widget that can be used in R Markdown documents and Shiny applications or viewed from an R console. It is based on the React Table library and made with reactR. Features are: - It creates a data table with sorting, filtering, and pagination - It has built-in column formatting - It supports custom rendering via R or JavaScript - It works seamlessly within R Markdown documents and the Shiny app - [{**ractablefmtr**}](https://kcuilla.github.io/reactablefmtr/index.html): The package improves the appearance and formatting of tables created using the reactable R library. It includes many conditional formatters that are highly customizable and easy to use. The authors of R Markdown Cookbook (Yihui Xie, Christophe Dervieux, Emily Riederer) mention also several other table packages in the section [Other packages for creating tables](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html): - {**rhandsontable**}: [Interface to the Handsontable.js Library.](https://jrowen.github.io/rhandsontable/)). Similar to {**DT**}, and has an Excel feel (e.g., you can edit data directly in the table). Visit <https://jrowen.github.io/rhandsontable/> to learn more about it. - {**pixiedust**}: [Tables so Beautifully Fine-Tuned You Will Believe It’s Magic](https://github.com/nutterb/pixiedust): Features creating tables for models (such as linear models) converted through the {**broom**} package ([Broom: Convert Statistical Objects into Tidy Tibbles](https://broom.tidymodels.org/)). It supports Markdown, HTML, and {{< latex >}} output formats. - **stargazer**: ([beautiful {{< latex >}}, HTML and ASCII tables from R statistical output](https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf)): Features formatting regression models and summary statistics tables. - **xtable** ([Dahl et al. 2019](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-xtable)): Perhaps the oldest package for creating tables---the first release was made in 2000. It supports both {{< latex >}} and HTML formats. I'm not going to introduce the rest of packages, but will just list them here: - **tables**: [Formula-Driven Table Generation](https://dmurdoch.github.io/tables/)), - **pander** ([Daróczi and Tsegelskyi 2022](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-pander)), - **tangram** ([S. Garbett 2023](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-tangram)), - **ztable** ([Moon 2021](https://bookdown.org/yihui/rmarkdown-cookbook/table-other.html#ref-R-ztable)), and - **condformat** ([Introduction to condformat](https://zeehio.github.io/condformat/)). ::: :::::: ::::: my-important ::: my-important-header Summary of my table packages review ::: ::: my-important-container It does not make sense to learn every package. SO I will concentrate on five packages with different usage: - {**DT**}, see: @sec-DT: An important package for interactive HTML tables. It also manages quite good big datasets - {**gt**}, see @sec-gt: Although this package is still lacking many features in follows a systematic approach ("Grammar of Tables") that allows to make fine grained changes. It is there somewhat cumbersome to apply it in everyday use. But it is especially valuable when you have another package for the standard features and you need some specialized change. I plan to use {**gt**} together with {**gtsummary**}. - {**gtsummary**}, see: @sec-gtsummary: I just learned this package and I am enthusiastic about it! I will use it as replacement for {**tableone**} that is used by Harris for the book to prepare publishing ready `r glossary("Table 1")` tables. - {**kableExtra**}, see: @sec-kableExtra: It uses the `kable()` function from {**knitr**} (see: @sec-knitr), which is an essential part of the publishing infrastructure for `r glossary("literate programming")`. It is pretty flexible and could be used in tandem with {**gtsummary**}. For simple tables one can also use `knitr::kable()` or just printing the tibble. ::: ::::: As a consequence of my table packages review I experimented with the {**gtsummary**}. I wanted to create a table where I have information by the variables grass, age in four categories and sex with all their levels in absolute numbers and in percents. ### Experiments with the gtsummary package {#sec-chap01-experiments-gtsummary} ::::::::::::::::::::::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-chap01-gtsummary-intro} : Experiments with {**gtsummary**} ::: :::: :::::::::::::::::::::::::::::::::::::::: my-experiment-container 0. Display dataset used for the {**gtsummary**} introduction. 1. A very basic summary table without any adaptions or modifications. 2. Changing dichotomous to a categorical variable. 3. Ditto with a different method. 4. Changing variable labels with two different methods. 5. Output grouped by the `grass` variable. 6. Header modified. 7. two cross tables of summary statistics stacked. 8. Table stratified by the `grass` variable. ::::::::::::::::::::::::::::::::::::::: panel-tabset ###### 0 :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-data} : Data used in this {**gtsummary**} introduction ::: :::: ::: my-r-code-container ```{r} #| label: gtsummary-data #| cache: true nhanes_2013_clean <- base::readRDS("data/chap01/nhanes_2013_clean.rds") gtsummary_basic <- nhanes_2013_clean |> ## cut age into several defined age groups dplyr::mutate(age_cat = base::cut(age, breaks = c(-Inf, 29, 39, 49, 59), labels = c("18 - 29", "30 - 39", "40 - 49", "50 - 59"))) |> dplyr::select(-c(nr, age)) save_data_file("chap01", gtsummary_basic, "gtsummary_clean.rds") base::summary(gtsummary_basic) gtsummary_basic ``` ::: :::::: I am going to use this dataset to experiment with {**gtsummary**}. ###### 1 :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-basic} : A very basic summary without any adaptions or modifications ::: :::: ::: my-r-code-container ```{r} #| label: gts-basic #| cache: true gtsummary_basic <- base::readRDS("data/chap01/gtsummary_clean.rds") gtsummary_basic |> gtsummary::tbl_summary() ``` ::: :::::: This is a very basic summary without any adaptions or modifications. It is strange, that there are for `grass` not separated levels presented. From a [video of an introduction to {**gtsummary**}](https://www.youtube.com/watch?v=tANo9E1SYJE) by the package author Daniel D. Sjoberg I learned that a categorical variable with two levels is interpreted as dichotomous whenever the two levels are labelled `1` and `0`, `true` and `false` or `Yes` and `No`. And dichotomous variable are different as other categorical variables as they are reported by {**gtsummary**} only with one number (the `1`, `true` or `yes` value). ###### 2 Here I am going to change level names of the `grass` variable to prevent that it is interpreted as a dichotomous variable. :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-no-dichotomous1} : Numbered R Code Title ::: :::: ::: my-r-code-container ```{r} #| label: gts-not-dichotomous1 #| cache: true gtsummary_basic2 <- base::readRDS("data/chap01/gtsummary_clean.rds") |> dplyr::mutate(grass = forcats::fct_recode(grass, "used" = "Yes", "not used" = "No") ) gtsummary_basic2 |> gtsummary::tbl_summary() ``` ::: :::::: This time we can see both levels of `grass`. But we loose the nice Yes/No answer of the survey. ###### 3 Another strategy to prevent a dichotomous interpretation is to tell {**gtsummary**} expressively with the `type =` argument how a variable has to be interpreted. :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-no-dichotomous2} : Preventing a dichotomous interpretation ::: :::: ::: my-r-code-container ```{r} #| label: gts-not-dichotomous2 #| cache: true gtsummary_basic |> gtsummary::tbl_summary( type = grass ~ "categorical" ) ``` ::: :::::: The same output as in the previous tab, but now with the desired Yes/No answer categories. ###### 4 If there are labelled data {gtsummary} uses them as variable labels. This is a big advantages, especially if the data are already labelled and imported with {**haven**}. But you can use the {**labelled**} package to label the variables. If there are no labelled metadata then the variable name is taken, e.g., in most of the cases you have to provide a better name for the table display. I am using both methods in the following code chunk: I provided labelled metadata for the `grass` variable but used new variable names for `age_cat` and `sex`. :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-label} : Changing variable labels ::: :::: ::: my-r-code-container ```{r} #| label: gts-label #| cache: true ## variant 1 to label variables # labelled::var_label( # gtsummary_basic$grass) <- "Ever used grass or hash?" ## variant 2 to label variables gtsummary_basic |> labelled::set_variable_labels( grass = "Ever used grass or hash?" ) |> gtsummary::tbl_summary( # by = grass, type = grass ~ "categorical", label = list( age_cat = "Age Category", sex = "Sex" ) ) ``` ::: :::::: ###### 5 :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-group} : Summarize data by a grouped variable ::: :::: ::: my-r-code-container ```{r} #| label: gts-by-grass #| cache: true gtsummary_group <- gtsummary_basic |> labelled::set_variable_labels( grass = "Ever used grass or hash?" ) |> gtsummary::tbl_summary( # by = grass, type = grass ~ "categorical", label = list( age_cat = "Age Category", sex = "Sex" ) ) gtsummary_group ``` The variable `grass` is not grouped `dplyr::group_by()` but inside `gtsummary::tbl_summary()`. ::: :::::: ###### 6 :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-header} : Change table header ::: :::: ::: my-r-code-container ```{r} #| label: gts-header #| cache: true gtsummary_group |> gtsummary::modify_header(label ~ "**Variable**") |> gtsummary::modify_spanning_header( gtsummary::show_header_names(gtsummary_group) ~ "**Ever used grass or hash?**") ``` ::: :::::: ###### 7 :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gtsummary-cross-table} : Prepare cross tables and stack them ::: :::: ::: my-r-code-container ```{r} #| label: gts-cross-table #| cache: true gtsummary_basic <- base::readRDS("data/chap01/gtsummary_clean.rds") gtsummary_cross1 <- gtsummary_basic |> labelled::set_variable_labels( grass = "**Ever used grass or hash?**" ) |> gtsummary::tbl_cross( label = list( age_cat = "Age Category", sex = "Sex" ), row = age_cat, col = grass, percent = "cell" ) |> gtsummary::modify_header( stat_1 = '**Yes**', stat_2 = '**No**', stat_0 = '**Total**' ) gtsummary_cross2 <- gtsummary_basic |> labelled::set_variable_labels( grass = "**Ever used grass or hash?**" ) |> gtsummary::tbl_cross( label = list( age_cat = "Age Category", sex = "Sex" ), row = sex, col = grass, percent = "cell" ) |> gtsummary::modify_header( stat_1 = '**Yes**', stat_2 = '**No**', stat_0 = '**Total**' ) gtsummary::tbl_stack( list(gtsummary_cross1, gtsummary_cross2) ) |> gtsummary::bold_labels() ``` ::: :::::: ###### 8 :::::: my-r-code :::: my-r-code-header ::: {#cnj-ID-chap01-gtsummary-stratify} : Stratify tables ::: :::: ::: my-r-code-container ```{r} #| label: tbl-gts-stratify #| tbl-cap: "Did you ever use marijuana or hashish?" #| cache: true gtsummary_basic <- base::readRDS("data/chap01/gtsummary_clean.rds") gtsummary_basic |> gtsummary::tbl_strata( strata = grass, ~ .x |> gtsummary::tbl_summary( by = sex, label = age_cat ~ "Age Category" ) |> gtsummary::add_overall() |> gtsummary::modify_header( label = "**Variables**", gtsummary::all_stat_cols() ~ "**{level}**" ) ) |> gtsummary::bold_labels() ``` ::: :::::: ::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::: ::::::::::::::::::::::::::::::::::::::::::: ------------------------------------------------------------------------ ## Opinions about legalizing marijuana (1973-2022) {#sec-chap01-grass-timeline} :::::::::::::::::::::::::::::: my-experiment :::: my-experiment-header ::: {#def-chap01-timeline-grass} : Developing a graph showing the timeline of respondents opinion about legalizing marijuana (1973-2022) ::: :::: ::::::::::::::::::::::::::: my-experiment-container :::::::::::::::::::::::::: panel-tabset ###### Load data :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-load-gss-grass-legal} : Load the GSS data from 1972-2022 and select appropriate variables ::: :::: ::: my-r-code-container ```{r} #| label: load-gss-grass-legal #| cache: true #| eval: false ## run only once (manually) ######### library(gssr) ## load package data(gss_all) ## attach GSS data data(gss_dict) ## attach codebook gss_grass_legal <- gss_all |> dplyr::select(year, age, grass) save_data_file("chap01", gss_grass_legal, "gss_grass_legal.rds") base::unloadNamespace("gssr") ``` ::: :::::: (*This R code chunk has no visible output*) ###### Clean data :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-clean-gss-grass-legal} : Clean `grass-legal` dataset ::: :::: ::: my-r-code-container ```{r} #| label: clean-gss-grass-legal #| results: hold ## run only once (manually) ########## gss_grass_legal <- base::readRDS("data/chap01/gss_grass_legal.rds") gss_grass_legal_clean <- gss_grass_legal |> tidyr::drop_na() |> dplyr::mutate( grass = forcats::as_factor(grass), grass = forcats::fct_drop(grass), year = base::as.numeric(year), age = base::as.numeric(age) ) save_data_file("chap01", gss_grass_legal_clean, "gss_grass_legal_clean.rds") ``` ::: :::::: (*This R code chunk has no visible output*) ###### Show summary :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gss-grass-clean-data} : Show summary of cleaned dataset ::: :::: ::: my-r-code-container ```{r} #| label: tbl-gtsummary-gss-grass-legal #| tbl-cap: "Summary of GSS surveys about opinion about legalizing marijuana between 1972-2022" #| warning: false grass_legal <- base::readRDS("data/chap01/gss_grass_legal_clean.rds") ## remove comma from big numbers gtsummary::theme_gtsummary_language("en", big.mark = "") grass_legal |> gtsummary::tbl_summary( label = list( year ~ "Year", age ~ "Age", grass ~ "Should marijuana be legal?" ), type = list(age ~ "continuous2", year ~ "continuous2"), statistic = list( age ~ c( "{mean} ({sd})", "{median} ({p25}, {p75})", "{min}, {max}" ), year ~ c("{min}, {max}") ) ) |> gtsummary::italicize_levels() |> gtsummary::bold_labels() ``` ::: :::::: ::::: my-watch-out ::: my-watch-out-header WATCH OUT! Summary with labelled data ::: ::: my-watch-out-container I used for the summary the labelled data from the {**gssr**} package. This has the big advantage that I didn't recode the variable names and could use the variable and value metadata! But {**gtsummary**} warned me to see that this is only "an intermediate data structure not meant for analysis". > Column(s) age are class "haven_labelled". This is an intermediate datastructure not meant for analysis. Convert columns with `haven::as_factor()`, `labelled::to_factor()`, `labelled::unlabelled()`, and `unclass()`. "haven_labelled" value labels are ignored when columns are not converted. Failure to convert may have unintended consequences or result in error. > > - <https://haven.tidyverse.org/articles/semantics.html> > - [https://larmarange.github.io/labelled/articles/intro_labelled.html#unlabelled](https://haven.tidyverse.org/articles/semantics.html) ::: ::::: ###### Prepare data :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gss-grass-prepare-data} : Prepare data for timeline graph ::: :::: ::: my-r-code-container ```{r} #| label: gss-grass-prepare-data grass_legal <- base::readRDS("data/chap01/gss_grass_legal_clean.rds") grass_legal2 <- grass_legal |> # labelled::remove_labels() |> dplyr::group_by(year) |> dplyr::add_count(name = "N") |> dplyr::add_count(grass) |> dplyr::mutate(perc = n / N * 100) |> dplyr::ungroup() ``` ::: :::::: (*This R code chunk has no visible output*) ###### Timeline :::::: my-r-code :::: my-r-code-header ::: {#cnj-chap01-gss-grass-timeline} : Timeline of respondents opinion about legalizing marijuana (1973-2022) ::: :::: ::: my-r-code-container ```{r} #| label: fig-gss-grass-timeline #| fig-cap: "Timeline of respondents opinion about legalizing marijuana between 1972-2022 of the GSS data" grass_legal2 |> # dplyr::group_by(year) |> ggplot2::ggplot(ggplot2::aes( x = year, y = perc, color = grass)) + ggplot2::geom_point() + ggplot2::theme_bw() + ggplot2::scale_color_manual( values = c("#79a778", '#7662aa') ) + ggplot2::scale_x_continuous( breaks = c(1973, seq(1978, 2022, 4), 2022) ) + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, vjust = 0.5)) + ggplot2::labs( x = "Should marijuana be legal?", y = "Percent of responses", color = "Marijuana") + ggplot2::theme(legend.position = c(.82, .85)) ``` @fig-gss-grass-timeline shows that the proportion of people that support legalizing marijuana is (with the exception of 1978-1990) continuously rising. Starting from 19% support in 1973 approval hits 70% in 2022, the last date where data are available. The turning point where the proportion of affirmation was the first time greater than the disapproval was in 2014 (55/45). ::: :::::: :::::::::::::::::::::::::: ::::::::::::::::::::::::::: :::::::::::::::::::::::::::::: ## Glossary {.unnumbered} ```{r} #| label: glossary-table #| echo: false glossary_table() ``` ## Session Info {.unnumbered} ::::: my-r-code ::: my-r-code-header Session Info ::: ::: my-r-code-container ```{r} #| label: session-info sessioninfo::session_info() ``` ::: :::::