1  Getting started with R & RStudio

1.1 Table of contents for chapter 01

Chapter section list

1.2 Working with RStudio projects (empty)

1.3 R Basics

  • A novel feature of the R help system is the facility it provides to execute most examples in the help pages via the utils::example() command: example('log')

  • A quick way to determine the arguments of an R function is to use the args() function: args('log')

Warning 1.1

args() and help() functions may not be very helpful with generic functions.
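
For example, applied to a generic function, args() shows only the generic's skeleton, not the arguments of its methods (a small illustration of the warning above; see Section 1.8 for the details of method dispatch):

Code
args(summary)
#> function (object, ...) 
#> NULL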

  • For the full set of reserved symbols in R, see \[help('Reserved')\]

  • There are two types of logical operators:

    • vectorized: & and |
    • unvectorized (single operand): && and ||

The unvectorized versions of the and (&&) and or (||) operators are primarily useful for writing R programs and are not appropriate for indexing vectors.
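
A minimal sketch of the difference (object names are illustrative; note that since R 4.3.0, && and || signal an error when an operand has length greater than one):

Code
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)
x & y          # elementwise: suitable for indexing vectors
#> [1]  TRUE FALSE FALSE
x[1] && y[1]   # single TRUE/FALSE: suitable for if() conditions
#> [1] TRUE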

1.4 Fixing errors and getting help

Bullet List 1.1: Getting help

  • base::traceback() provides information about the sequence of function calls leading up to an error.
  • utils::apropos('<searchString>') searches for currently accessible objects whose names contain a particular character string. It returns a character vector with the names of the objects in the search list that match the string, interpreted as a regular expression.
  • utils::find('<searchString>') returns the package(s) or environment(s) in which objects of a given name can be found.
  • ?? or help.search() activates a broader search because it looks not only in the title fields but in other fields of the help pages as well.
  • utils::RSiteSearch() or the search engine at https://search.r-project.org/ requires an internet connection and starts an even broader search: it looks in all standard and CRAN packages, even those not installed on your system.
  • CRAN task views describe resources in R for applications in specific areas.
    • Views can be installed automatically via ctv::install.views("Bayesian") or updated via ctv::update.views("Bayesian", coreOnly = TRUE)
    • Query information about a particular task view on CRAN from within R for example with ctv::ctv("MachineLearning")
    • Query to obtain the list of all task views available with ctv::available.views()
  • help(package='<packageName>') opens the help index page of an installed package in the RStudio Help tab, with hyperlinks to the help topics documented in the package.
  • Vignettes:
    • utils::vignette() lists available vignettes in the packages installed on your system in the code window.
    • utils::browseVignettes() opens a local web page listing the vignettes in the packages installed on your system.
    • utils::vignette(package='<packageName>') displays the vignettes available in a particular installed package.
    • vignette('<vignetteName>') or vignette('<vignetteName>', package = '<packageName>') opens a specific vignette.
  • RStudio help:
  • Google search
  • StackOverflow and Cross Validated:
    • Stack Overflow is a question and answer site for professional and enthusiast programmers.
    • Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization.

1.5 Organizing your work and making it reproducible (empty)

1.6 An Extended Illustration

Warning 1.2: Illustration only, not yet understanding

I will follow all steps of the illustration with the Duncan data set, but be aware that I do not yet understand all of the procedures. Hopefully these gaps will be filled in the next chapters.

Whenever I find an explanation of one of the routines that follow, I plan to add a link cross-referencing illustration and explanation.

1.6.1 Getting, recoding and showing data

Example 1.1 : Duncan’s Occupational-Prestige Regression

R Code 1.1 : Show Duncan raw data from {carData} package

Listing / Output 1.1: Duncan raw data from {carData} package
Code
carData::Duncan
#>                    type income education prestige
#> accountant         prof     62        86       82
#> pilot              prof     72        76       83
#> architect          prof     75        92       90
#> author             prof     55        90       76
#> chemist            prof     64        86       90
#> minister           prof     21        84       87
#> professor          prof     64        93       93
#> dentist            prof     80       100       90
#> reporter             wc     67        87       52
#> engineer           prof     72        86       88
#> undertaker         prof     42        74       57
#> lawyer             prof     76        98       89
#> physician          prof     76        97       97
#> welfare.worker     prof     41        84       59
#> teacher            prof     48        91       73
#> conductor            wc     76        34       38
#> contractor         prof     53        45       76
#> factory.owner      prof     60        56       81
#> store.manager      prof     42        44       45
#> banker             prof     78        82       92
#> bookkeeper           wc     29        72       39
#> mail.carrier         wc     48        55       34
#> insurance.agent      wc     55        71       41
#> store.clerk          wc     29        50       16
#> carpenter            bc     21        23       33
#> electrician          bc     47        39       53
#> RR.engineer          bc     81        28       67
#> machinist            bc     36        32       57
#> auto.repairman       bc     22        22       26
#> plumber              bc     44        25       29
#> gas.stn.attendant    bc     15        29       10
#> coal.miner           bc      7         7       15
#> streetcar.motorman   bc     42        26       19
#> taxi.driver          bc      9        19       10
#> truck.driver         bc     21        15       13
#> machine.operator     bc     21        20       24
#> barber               bc     16        26       20
#> bartender            bc     16        28        7
#> shoe.shiner          bc      9        17        3
#> cook                 bc     14        22       16
#> soda.clerk           bc     12        30        6
#> watchman             bc     17        25       11
#> janitor              bc      7        20        8
#> policeman            bc     34        47       41
#> waiter               bc      8        32       10

The row names contain data and therefore have to be moved into a separate column.

R Code 1.2 : Recode Duncan data

Listing / Output 1.2: Duncan data recoded
Code
(d01.1 <- carData::Duncan |> 
    tibble::rownames_to_column("occupation"))

save_data_file("chap01", d01.1, "d01.1.rds")
#>            occupation type income education prestige
#> 1          accountant prof     62        86       82
#> 2               pilot prof     72        76       83
#> 3           architect prof     75        92       90
#> 4              author prof     55        90       76
#> 5             chemist prof     64        86       90
#> 6            minister prof     21        84       87
#> 7           professor prof     64        93       93
#> 8             dentist prof     80       100       90
#> 9            reporter   wc     67        87       52
#> 10           engineer prof     72        86       88
#> 11         undertaker prof     42        74       57
#> 12             lawyer prof     76        98       89
#> 13          physician prof     76        97       97
#> 14     welfare.worker prof     41        84       59
#> 15            teacher prof     48        91       73
#> 16          conductor   wc     76        34       38
#> 17         contractor prof     53        45       76
#> 18      factory.owner prof     60        56       81
#> 19      store.manager prof     42        44       45
#> 20             banker prof     78        82       92
#> 21         bookkeeper   wc     29        72       39
#> 22       mail.carrier   wc     48        55       34
#> 23    insurance.agent   wc     55        71       41
#> 24        store.clerk   wc     29        50       16
#> 25          carpenter   bc     21        23       33
#> 26        electrician   bc     47        39       53
#> 27        RR.engineer   bc     81        28       67
#> 28          machinist   bc     36        32       57
#> 29     auto.repairman   bc     22        22       26
#> 30            plumber   bc     44        25       29
#> 31  gas.stn.attendant   bc     15        29       10
#> 32         coal.miner   bc      7         7       15
#> 33 streetcar.motorman   bc     42        26       19
#> 34        taxi.driver   bc      9        19       10
#> 35       truck.driver   bc     21        15       13
#> 36   machine.operator   bc     21        20       24
#> 37             barber   bc     16        26       20
#> 38          bartender   bc     16        28        7
#> 39        shoe.shiner   bc      9        17        3
#> 40               cook   bc     14        22       16
#> 41         soda.clerk   bc     12        30        6
#> 42           watchman   bc     17        25       11
#> 43            janitor   bc      7        20        8
#> 44          policeman   bc     34        47       41
#> 45             waiter   bc      8        32       10

Duncan was interested in how prestige is related to income and education in combination.

R Code 1.3 : Skim Duncan data

Listing / Output 1.3: Look at the recoded Duncan data
Code
skimr::skim(d01.1)
Data summary
Name d01.1
Number of rows 45
Number of columns 5
_______________________
Column type frequency:
character 1
factor 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
occupation 0 1 4 18 0 45 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
type 0 1 FALSE 3 bc: 21, pro: 18, wc: 6

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
income 0 1 41.87 24.44 7 21 42 64 81 ▇▂▅▃▅
education 0 1 52.56 29.76 7 26 45 84 100 ▆▆▃▂▇
prestige 0 1 47.69 31.51 3 16 41 81 97 ▇▃▅▂▇

1.6.2 Explorative Data Analysis

Duncan used a linear least-squares regression of prestige on income and education to predict the prestige of occupations for which the income and educational scores were known from the U.S. Census but for which there were no direct prestige ratings. He did not use occupational type in his analysis.

A sensible place to start any data analysis, including a regression analysis, is to visualize the data using a variety of graphical displays. We need the following graphs:

  • Univariate distributions of the three variables
  • Pairwise or marginal relationships among them

Example 1.2 : Explorative Data Analysis (EDA) of Duncan data

R Code 1.4 : Histogram of prestige variable

Listing / Output 1.4: Histogram of prestige variable
Code
d01.1 |> 
    ggplot2::ggplot(
        ggplot2::aes(x = prestige)
    ) +
    ggplot2::geom_histogram(
        bins = 10,
        color = "white",
        fill = "grey40")


The distribution of prestige appears to be bimodal, with cases stacking up near the boundaries, as many occupations are either low prestige, near the lower boundary, or high prestige, near the upper boundary, with relatively fewer occupations in the middle bins of the histogram. Because prestige is a percentage, this behavior is not altogether unexpected. Variables such as this often need to be transformed, perhaps with a logit (log-odds) or similar transformation. But transforming prestige turns out to be unnecessary in this problem.
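
As an aside, such a logit (log-odds) transformation could be computed with car::logit(), which converts proportions or percentages to log-odds. A minimal sketch (not needed here, as noted above):

Code
# percents = TRUE tells logit() that prestige is on a 0-100 scale
head(car::logit(d01.1$prestige, percents = TRUE))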

Before fitting a regression model to the data, we should also examine the distributions of the predictors education and income, along with the relationship between prestige and each predictor, and the relationship between the two predictors.

R Code 1.5 : Histogram of education variable

Listing / Output 1.5: Histogram of education variable
Code
d01.1 |> 
    ggplot2::ggplot(
        ggplot2::aes(x = education)
    ) +
    ggplot2::geom_histogram(
        bins = 10,
        color = "white",
        fill = "grey40")


R Code 1.6 : Histogram of income variable

Listing / Output 1.6: Histogram of income variable
Code
d01.1 |> 
    ggplot2::ggplot(
        ggplot2::aes(x = income)
    ) +
    ggplot2::geom_histogram(
        bins = 10,
        color = "white",
        fill = "grey40")


R Code 1.7 : Scatterplot matrix of the Duncan data

Listing / Output 1.7: Scatterplot matrix for prestige, education, and income in Duncan’s data, identifying the three most unusual points in each panel. Nonparametric density estimates for the variables appear in the diagonal panels, with a rug-plot (one-dimensional scatterplot) at the bottom of each diagonal panel.
Code
car::scatterplotMatrix( ~ prestige + education + income, 
                        id = list(n = 3), data = d01.1)


The car::scatterplotMatrix() function uses a one-sided formula to specify the variables to appear in the graph, where we read the formula ~ prestige + education + income as “plot prestige and education and income.”

The argument id = list(n = 3) tells scatterplotMatrix() to identify the three most unusual points in each panel. This argument was added by the authors after examining a preliminary plot of the data.

Warning 1.3

In contrast to Figure 1.10, my graph labels the three most unusual points in each panel with a number rather than with the name of the occupation. I believe this difference results from my recoding (changing the row names into a column).

  • Nonparametric density estimates, computed with an adaptive-kernel estimator, appear by default in the diagonal panels, with a rug-plot ("one-dimensional scatterplot") at the bottom of each panel showing the location of the data values for the corresponding variable.
  • The solid line shows the marginal linear least-squares fit for the regression of the vertical-axis variable (y) on the horizontal-axis variable (x), ignoring the other variables.
  • The central broken line is a nonparametric regression smooth, which traces how the average value of y changes as x changes without making strong assumptions about the form of the relationship between the two variables.
  • The outer broken lines represent smooths of the conditional variation of the y values given x in each panel, like running quartiles.

Note 1.1: Is there a tidyverse equivalent for car::scatterplotMatrix()?

I believe that car::scatterplotMatrix() is a modified graphics::pairs() function. GGally::ggpairs() is a {ggplot2} generalized pairs plot matrix equivalent in the tidyverse tradition. I should try it out and see if I can reproduce Listing / Output 1.7 with {GGally}.
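
A first attempt might look like the following sketch (untested; assumes the {GGally} package is installed):

Code
# generalized pairs plot of the three numeric variables with {GGally}:
# points plus linear fits below the diagonal, density estimates on it
d01.1 |>
    GGally::ggpairs(
        columns = c("prestige", "education", "income"),
        lower = list(continuous = GGally::wrap("smooth", method = "lm")),
        diag = list(continuous = "densityDiag")
    )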

Like prestige, education appears to have a bimodal distribution. The distribution of income, in contrast, is perhaps best characterized as irregular. The pairwise relationships among the variables seem reasonably linear, which means that as we move from left to right across the plot, the average y values of the points more or less trace out a straight line. The scatter around the regression lines appears to have reasonably constant vertical variability and to be approximately symmetric.

In addition, two or three cases stand out from the others. In the scatterplot of income versus education, data point 6 (= minister) is unusual in combining relatively low income with a relatively high level of education, and data points 16 (= conductor) and 27 (= railroad engineer) are unusual in combining relatively high levels of income with relatively low education. Because education and income are predictors in Duncan's regression, these three occupations will have relatively high leverage on the regression coefficients. None of these cases, however, is an outlier in the univariate distributions of the three variables.

1.6.3 Regression Analysis

Following Duncan, we next fit a linear least-squares regression to the data to model the joint dependence of prestige on the two predictors, under the assumption that the relationship of prestige to education and income is additive and linear.

Example 1.3 : Compute OLS regression

R Code 1.8 : Fit linear model

Listing / Output 1.8: Regress prestige on education and income
Code
(
    lm01.1 <- stats::lm(
        formula = prestige ~  education + income,
        data = d01.1
    )
)
#> 
#> Call:
#> stats::lm(formula = prestige ~ education + income, data = d01.1)
#> 
#> Coefficients:
#> (Intercept)    education       income  
#>     -6.0647       0.5458       0.5987
Code
save_data_file("chap01", lm01.1, "lm01.1.rds")

R Code 1.9 : Summary of lm01.1

Listing / Output 1.9: Summary of linear model where prestige is regressed on education and income
Code
base::summary(lm01.1)
#> 
#> Call:
#> stats::lm(formula = prestige ~ education + income, data = d01.1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -29.538  -6.417   0.655   6.605  34.641 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.06466    4.27194  -1.420    0.163
#> education    0.54583    0.09825   5.555 1.73e-06
#> income       0.59873    0.11967   5.003 1.05e-05
#> 
#> Residual standard error: 13.37 on 42 degrees of freedom
#> Multiple R-squared:  0.8282, Adjusted R-squared:   0.82 
#> F-statistic: 101.2 on 2 and 42 DF,  p-value: < 2.2e-16

The “statistical-significance” asterisks were suppressed with options(show.signif.stars = FALSE).

Both education and income have large regression coefficients in the "Estimate" column of the coefficient table, with small two-sided p-values in the column labeled "Pr(>|t|)". For example, holding education constant, a 1% increase in higher income earners is associated on average with an increase of about 0.6% in high prestige ratings.
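
Because response and predictors are all percentages, this interpretation can be read directly off the fitted coefficients (a small sketch reusing lm01.1):

Code
# ~0.6 more points of prestige per additional point of income,
# holding education constant
stats::coef(lm01.1)["income"]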

1.6.4 Regression diagnostics

Example 1.4 : Regression diagnostics

R Code 1.10 : Nonparametric density estimate of the distribution of Studentized residuals

Listing / Output 1.10: Nonparametric density estimate for the distribution of the Studentized residuals from the regression of prestige on education and income.
Code
car::densityPlot(stats::rstudent(lm01.1))


If the errors in the regression are normally distributed with zero means and constant variance, then the Studentized residuals are each t-distributed with \(n − k − 2\) degrees of freedom, where k is the number of coefficients in the model, excluding the regression constant, and n is the number of cases.
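
A quick visual check of this claim (a sketch; it reuses lm01.1 and overlays the theoretical t-density on the default density plot):

Code
rs <- stats::rstudent(lm01.1)
n <- stats::nobs(lm01.1)               # 45 cases
k <- length(stats::coef(lm01.1)) - 1   # 2 predictors, excluding the intercept
car::densityPlot(rs)
graphics::curve(stats::dt(x, df = n - k - 2),
                add = TRUE, lty = 2, col = "red")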

R Code 1.11 : Quantile-comparison plot for the Studentized residuals

Listing / Output 1.11: Quantile-comparison plot for the Studentized residuals from the regression of prestige on education and income. The broken lines show a bootstrapped pointwise 95% confidence envelope for the points.
Code
car::qqPlot(lm01.1, id = list(n = 3))

#> [1]  6  9 17

The car::qqPlot() function extracts the Studentized residuals and plots them against the quantiles of the appropriate t-distribution. If the Studentized residuals are t-distributed, then the plotted points should lie close to a straight line. The solid comparison line on the plot is drawn by default by robust regression. The argument id = list(n = 3) identifies the three most extreme Studentized residuals, and qqPlot() returns the (names and) row numbers of these cases.

(In my case the function returns only the row numbers.)

  • 6 : minister
  • 9 : reporter
  • 17: contractor

By default, car::qqPlot() also produces a bootstrapped pointwise 95% confidence envelope for the Studentized residuals that takes account of the correlations among them (but, because the envelope is computed pointwise, does not adjust for simultaneous inference).

Warning 1.4

Be patient! The computation of the bootstrapped pointwise 95% confidence envelope for the points takes some time.

The bootstrap procedure used by car::qqPlot() generates random samples, and so the plot that you see when you duplicate this command will not be identical to the graph shown in the text.

The residuals stay nearly within the boundaries of the envelope at both ends of the distribution, with the exception of point 6, the occupation minister.

R Code 1.12 : Test based on the largest (absolute) Studentized residual

Listing / Output 1.12: Outlier test: Studentized residuals with Bonferroni
Code
car::outlierTest(lm01.1)
#> No Studentized residuals with Bonferroni p < 0.05
#> Largest |rstudent|:
#>   rstudent unadjusted p-value Bonferroni p
#> 6 3.134519          0.0031772      0.14297

The outlier test suggests that the residual for ministers is not terribly unusual, with a Bonferroni-corrected p-value of 0.14.
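
The Bonferroni-corrected p-value is simply the unadjusted two-sided p-value multiplied by the number of cases. A quick check of the arithmetic (df = n - k - 2 = 41):

Code
# ~0.143, matching the Bonferroni p reported above
2 * stats::pt(3.134519, df = 41, lower.tail = FALSE) * 45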

R Code 1.13 : Checking for high-leverage and influential cases

Listing / Output 1.13: Index plots of Cook’s distances and hat-values, from the regression of prestige on income and education.
Code
car::influenceIndexPlot(
    model = lm01.1, 
    vars = c("Cook", "hat"),
    id = list(n = 3)
    )

Because the cases in a regression can be jointly as well as individually influential, we also examine added-variable plots for the predictors, using car::avPlots().

R Code 1.14 : Added-variable plots

Listing / Output 1.14: Added-variable plots for education and income in Duncan’s occupational-prestige regression
Code
car::avPlots(
    model = lm01.1,
    id = list(
        cex = 0.75,
        n = 3,
        method = "mahal"
        )
)


The id argument, which has several components here, customizes identification of points in the graph:

  • cex = 0.75 (where cex is a standard R argument for “character expansion”) makes the labels smaller, so that they fit better into the plots
  • n = 3 identifies the three most unusual points in each plot
  • method = "mahal" indicates that unusualness is quantified by Mahalanobis distance from the center of the point-cloud.

Mahalanobis distances from the center of the data take account of the standard deviations of the variables and the correlation between them. Each added-variable plot displays the conditional, rather than the marginal, relationship between the response and one of the predictors. Points at the extreme left or right of the plot correspond to cases that have high leverage on the corresponding coefficient and consequently are potentially influential.
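
For intuition, Mahalanobis distances can be computed directly with stats::mahalanobis(). The following sketch ranks the (education, income) points by distance from their centroid (a simplification: avPlots() applies the criterion within each added-variable plot, not to the raw predictors):

Code
X <- d01.1[, c("education", "income")]
md <- stats::mahalanobis(X, center = colMeans(X), cov = stats::cov(X))
head(sort(md, decreasing = TRUE), 3)   # the three most unusual cases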

Listing / Output 1.14 confirms and strengthens our previous observations: We should be concerned about the occupations minister (point 6) and conductor (point 16), which work jointly to increase the education coefficient and decrease the income coefficient. Occupation RR.engineer (point 27) has relatively high leverage on these coefficients but is more in line with the rest of the data.

R Code 1.15 : Component-plus-residual plots

Listing / Output 1.15: Component-plus-residual plots for education and income in Duncan’s occupational-prestige regression. The solid line in each panel shows a loess nonparametric-regression smooth; the broken line in each panel is the least-squares line
Code
car::crPlots(lm01.1)


Each plot includes a least-squares line, representing the regression plane viewed edge-on in the direction of the corresponding predictor, and a loess nonparametric-regression smooth.

The purpose of Listing / Output 1.15 is to detect nonlinearity, of which there is only slight evidence here.

R Code 1.16 : Score tests for nonconstant variance

Listing / Output 1.16: Checking for an association of residual variability with the fitted values and with any linear combination of the predictors
Code
car::ncvTest(lm01.1)

glue::glue(" ")

car::ncvTest(
    model = lm01.1,
    var.formula = ~ income + education
    )
#> Non-constant Variance Score Test 
#> Variance formula: ~ fitted.values 
#> Chisquare = 0.3810967, Df = 1, p = 0.53702
#>  
#> Non-constant Variance Score Test 
#> Variance formula: ~ income + education 
#> Chisquare = 0.6976023, Df = 2, p = 0.70553

Both tests yield large p-values, indicating that the assumption of constant variance is tenable.

R Code 1.17 : Linear regression without influential data of 6 and 16

Listing / Output 1.17: Linear model version 2 without influential data points 6 and 16
Code
d01.2 <- d01.1 |> 
    dplyr::slice(-c(6, 16))
 
lm01.2 <- stats::update(lm01.1, subset = -c(6, 16))

base::summary(lm01.2)

save_data_file("chap01", d01.2, "d01.2.rds")
save_data_file("chap01", lm01.2, "lm01.2.rds")
#> 
#> Call:
#> stats::lm(formula = prestige ~ education + income, data = d01.1, 
#>     subset = -c(6, 16))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -28.612  -5.898   1.937   5.616  21.551 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.40899    3.65263  -1.755   0.0870
#> education    0.33224    0.09875   3.364   0.0017
#> income       0.86740    0.12198   7.111 1.31e-08
#> 
#> Residual standard error: 11.42 on 40 degrees of freedom
#> Multiple R-squared:  0.876,  Adjusted R-squared:  0.8698 
#> F-statistic: 141.3 on 2 and 40 DF,  p-value: < 2.2e-16

Rather than respecifying the regression model from scratch with stats::lm(), we refit it using the stats::update() function, removing the two potentially problematic cases via the subset argument to update().

I didn't need car::whichNames() to get the row numbers to remove, because I had been using those numbers throughout.
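
For reference, car::whichNames() looks up row numbers by row name; it works on the original data frame, which still carries the occupations as row names (a sketch):

Code
# returns the positions of the two occupations, i.e., 6 and 16
car::whichNames(c("minister", "conductor"), carData::Duncan)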

R Code 1.18 : Comparing the estimated coefficients of lm01.1 and lm01.2

Listing / Output 1.18: Comparing the estimated coefficients and their standard errors across the two regressions fit to the data
Code
car::compareCoefs(lm01.1, lm01.2)
#> Calls:
#> 1: stats::lm(formula = prestige ~ education + income, data = d01.1)
#> 2: stats::lm(formula = prestige ~ education + income, data = d01.1, subset =
#>    -c(6, 16))
#> 
#>             Model 1 Model 2
#> (Intercept)   -6.06   -6.41
#> SE             4.27    3.65
#>                            
#> education    0.5458  0.3322
#> SE           0.0983  0.0987
#>                            
#> income        0.599   0.867
#> SE            0.120   0.122
#> 

The coefficients of education and income changed substantially with the deletion of the occupations minister and conductor. The education coefficient is considerably smaller and the income coefficient considerably larger than before.

R Code 1.19 : Linear regression without influential data of 6, 9, 16, and 27

Listing / Output 1.19: Linear model version 3 without influential data points 6, 9, 16, and 27.
Code
d01.3 <- d01.1 |> 
    dplyr::slice(-c(6, 9, 16, 27))
 
lm01.3 <- stats::update(lm01.1, subset = -c(6, 9, 16, 27))

base::summary(lm01.3)

save_data_file("chap01", d01.3, "d01.3.rds")
save_data_file("chap01", lm01.3, "lm01.3.rds")
#> 
#> Call:
#> stats::lm(formula = prestige ~ education + income, data = d01.1, 
#>     subset = -c(6, 9, 16, 27))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -25.348  -5.376   1.775   4.498  20.411 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -7.1472     3.4075  -2.097   0.0427
#> education     0.3019     0.1121   2.692   0.0105
#> income        0.9465     0.1420   6.668 6.94e-08
#> 
#> Residual standard error: 10.6 on 38 degrees of freedom
#> Multiple R-squared:  0.8973, Adjusted R-squared:  0.8919 
#> F-statistic:   166 on 2 and 38 DF,  p-value: < 2.2e-16

R Code 1.20 : Comparing the estimated coefficients of lm01.2 and lm01.3

Listing / Output 1.20: Comparing the estimated coefficients and their standard errors across the two regressions fit to the data
Code
car::compareCoefs(lm01.2, lm01.3)
#> Calls:
#> 1: stats::lm(formula = prestige ~ education + income, data = d01.1, subset =
#>    -c(6, 16))
#> 2: stats::lm(formula = prestige ~ education + income, data = d01.1, subset =
#>    -c(6, 9, 16, 27))
#> 
#>             Model 1 Model 2
#> (Intercept)   -6.41   -7.15
#> SE             3.65    3.41
#>                            
#> education    0.3322  0.3019
#> SE           0.0987  0.1121
#>                            
#> income        0.867   0.947
#> SE            0.122   0.142
#> 

The coefficients of education and income did not change substantially with the additional deletion of the occupations reporter (9) and RR.engineer (27): the education coefficient in lm01.3 is only somewhat smaller and the income coefficient only a little larger than in lm01.2.

1.7 R Functions for Basic Statistics (empty)

1.8 Generic Functions and Their Methods

Note 1.2: Generic function and their methods: An important explanation for me

The following text is copied almost completely from the R Companion. It explains features of the R language that I had already met in different situations but did not completely understand. This has now changed thanks to the explanation in the R Companion book.

Many of the most commonly used functions in R, such as summary(), print(), and plot(), produce different results depending on the arguments passed to the function. Enabling the same generic function, such as summary(), to be used for many purposes is accomplished in R through an object-oriented programming technique called object dispatch.

The details of object dispatch are implemented differently in the S3 and S4 object systems, so named because they originated in Versions 3 and 4, respectively, of the original S language on which R is based. There is yet another implementation of object dispatch in R for the more recently introduced system of reference classes (sometimes colloquially termed “R5”). Almost everything created in R is an object, such as a numeric vector, a matrix, a data frame, a linear regression model, and so on. In the S3 object system, described in this section and used for most R object-oriented programs, each object is assigned a class, and it is the class of the object that determines how generic functions process the object. We won’t take up the S4 and reference-class object systems in this book, but they too are class based and implement (albeit more complex) versions of object dispatch.

Generic functions operate on their arguments indirectly by calling specialized functions, referred to as method functions or, more compactly, as methods. Which method is invoked typically depends on the class of the first argument to the generic function.

(In contrast, in the S4 object system, method dispatch can depend on the classes of more than one argument to a generic function.) For example, the generic summary() function has the following definition:

Code
summary
#> function (object, ...) 
#> UseMethod("summary")
#> <bytecode: 0x000000001d03ba78>
#> <environment: namespace:base>

The generic function summary() has one required argument, object, and the special argument ... (the ellipsis) for additional arguments that can vary from one summary() method to another.

When UseMethod("summary") is called by the summary() generic, and the first (object) argument to summary() is of class "lm", for example, R searches for a method function named summary.lm() and, if it is found, executes the command summary.lm(object, ...). It is, incidentally, perfectly possible to call summary.lm() directly.

Although the generic summary() function has only one explicit argument, the method function summary.lm() has additional arguments:

Code
args("summary.lm")
#> function (object, correlation = FALSE, symbolic.cor = FALSE, ...) 
#> NULL

Because the arguments correlation and symbolic.cor have default values (FALSE, in both cases), they need not be specified. Thus, for example, if we enter the command summary(lm01.1, correlation=TRUE), the argument correlation=TRUE is absorbed by ... in the call to the generic summary() function and then passed to the summary.lm() method, causing it to print a correlation matrix for the coefficient estimates.

R Code 1.21 : Example of the generic summary() function with additional argument

Listing / Output 1.21: Generic function summary() with additional argument
Code
summary.lm(lm01.1, correlation = TRUE)
#> 
#> Call:
#> stats::lm(formula = prestige ~ education + income, data = d01.1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -29.538  -6.417   0.655   6.605  34.641 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6.06466    4.27194  -1.420    0.163
#> education    0.54583    0.09825   5.555 1.73e-06
#> income       0.59873    0.11967   5.003 1.05e-05
#> 
#> Residual standard error: 13.37 on 42 degrees of freedom
#> Multiple R-squared:  0.8282, Adjusted R-squared:   0.82 
#> F-statistic: 101.2 on 2 and 42 DF,  p-value: < 2.2e-16
#> 
#> Correlation of Coefficients:
#>           (Intercept) education
#> education -0.36                
#> income    -0.30       -0.72

In this instance, we can call summary.lm() directly, but most method functions are hidden in (not “exported from”) the namespaces of the packages in which the methods are defined and thus cannot normally be used directly. In any event, it is good R form to use method functions only indirectly through their generics.

For example, the summary() method summary.boot(), for summarizing the results of bootstrap resampling is hidden in the namespace of the {car} package. To call this function directly to summarize an object of class “boot”, we could reference the function with the unintuitive package-qualified name car:::summary.boot(), but calling the unqualified method summary.boot() directly won’t work.
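
If you do need to inspect a hidden method, utils::getS3method() retrieves it by generic and class (a sketch; the {car} package must be loaded so that the method is registered):

Code
# retrieve the hidden summary() method for objects of class "boot"
utils::getS3method("summary", "boot")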

Suppose that we invoke the hypothetical generic function fun(), defined as:

fun <- function(x, ...) {
    UseMethod("fun")
}

with a real argument obj of class "cls": fun(obj). If there is no method function named fun.cls(), then R looks for a method named fun.default(). For example, objects belonging to classes without summary() methods are summarized by summary.default(). If, under these circumstances, there is no method named fun.default(), then R reports an error.
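
A toy illustration of this dispatch chain, reusing fun() from above (the class "cls" and both methods are hypothetical):

Code
fun.cls <- function(x, ...) cat("fun() method for class 'cls'\n")
fun.default <- function(x, ...) cat("default fun() method\n")

obj <- structure(list(), class = "cls")
fun(obj)   # dispatches to fun.cls()
fun(42)    # no fun.numeric(), so fun.default() is used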

We can get a listing of all currently accessible methods for the generic summary() function using the utils::methods() function, with hidden methods flagged by asterisks.

Example 1.5 : Generic functions and their methods

R Code 1.22 : Listing for all the methods for a generic function

Listing / Output 1.22: List all the methods for the generic function summary()
Code
utils::methods(summary)
#>  [1] summary.Anova.mlm*                  summary.aov                        
#>  [3] summary.aovlist*                    summary.aspell*                    
#>  [5] summary.bcnPowerTransform*          summary.bcnPowerTransformlmer*     
#>  [7] summary.boot*                       summary.check_packages_in_dir*     
#>  [9] summary.connection                  summary.data.frame                 
#> [11] summary.Date                        summary.default                    
#> [13] summary.ecdf*                       summary.factor                     
#> [15] summary.ggplot*                     summary.glm                        
#> [17] summary.hcl_palettes*               summary.infl*                      
#> [19] summary.lm                          summary.loess*                     
#> [21] summary.manova                      summary.matrix                     
#> [23] summary.mlm*                        summary.nls*                       
#> [25] summary.packageStatus*              summary.POSIXct                    
#> [27] summary.POSIXlt                     summary.powerTransform*            
#> [29] summary.ppr*                        summary.prcomp*                    
#> [31] summary.princomp*                   summary.proc_time                  
#> [33] summary.rlang_error*                summary.rlang_message*             
#> [35] summary.rlang_trace*                summary.rlang_warning*             
#> [37] summary.rlang:::list_of_conditions* summary.skim_df*                   
#> [39] summary.srcfile                     summary.srcref                     
#> [41] summary.stepfun                     summary.stl*                       
#> [43] summary.table                       summary.tukeysmooth*               
#> [45] summary.vctrs_sclr*                 summary.vctrs_vctr*                
#> [47] summary.warnings                   
#> see '?methods' for accessing help and source code

You can also determine what generic functions have currently available methods for objects of a particular class.

R Code 1.23 : List all available methods for a specific class

Listing / Output 1.23: List all available methods for a class = “lm”
Code
utils::methods(class = "lm")
#>  [1] add1           alias          anova          case.names     coerce        
#>  [6] confint        cooks.distance deviance       dfbeta         dfbetas       
#> [11] drop1          dummy.coef     effects        extractAIC     family        
#> [16] formula        hatvalues      influence      initialize     kappa         
#> [21] labels         logLik         model.frame    model.matrix   nobs          
#> [26] plot           predict        print          proj           qr            
#> [31] residuals      rstandard      rstudent       show           simulate      
#> [36] slotsFromS3    summary        variable.names vcov          
#> see '?methods' for accessing help and source code

In the above code chunk I have not loaded the {car} package, which provides many other methods for class "lm". (See the next tab.)

R Code 1.24 : List all available methods for a specific class with additional package loaded

Listing / Output 1.24: Example of using the methods() function with the {car} package loaded
Code
library(car)
#> Loading required package: carData
Code
utils::methods(class = "lm")
detach("package:car", unload = TRUE)
detach("package:carData", unload = TRUE)
#>  [1] add1                alias               anova              
#>  [4] Anova               avPlot              avPlot3d           
#>  [7] Boot                bootCase            boxCox             
#> [10] brief               case.names          ceresPlot          
#> [13] coerce              confidenceEllipse   confint            
#> [16] Confint             cooks.distance      crPlot             
#> [19] crPlot3d            deltaMethod         deviance           
#> [22] dfbeta              dfbetaPlots         dfbetas            
#> [25] dfbetasPlots        drop1               dummy.coef         
#> [28] durbinWatsonTest    effects             extractAIC         
#> [31] family              formula             hatvalues          
#> [34] hccm                infIndexPlot        influence          
#> [37] influencePlot       initialize          inverseResponsePlot
#> [40] kappa               labels              leveneTest         
#> [43] leveragePlot        linearHypothesis    logLik             
#> [46] mcPlot              mmp                 model.frame        
#> [49] model.matrix        ncvTest             nextBoot           
#> [52] nobs                outlierTest         plot               
#> [55] powerTransform      predict             Predict            
#> [58] print               proj                qqPlot             
#> [61] qr                  residualPlot        residualPlots      
#> [64] residuals           rstandard           rstudent           
#> [67] S                   show                sigmaHat           
#> [70] simulate            slotsFromS3         spreadLevelPlot    
#> [73] summary             symbox              variable.names     
#> [76] vcov                vif                
#> see '?methods' for accessing help and source code

Method selection is slightly more complicated for objects whose class is a vector of more than one element. Consider, for example, an object returned by the glm() function for fitting generalized linear models (anticipating a logistic-regression example).

R Code 1.25 : List all available methods for a specific vector class

Listing / Output 1.25: List all available methods for a vectorized class
Code
library(car)
#> Loading required package: carData
Code
mod.mroz <- stats::glm(
    lfp ~ ., 
    family = binomial, 
    data = Mroz)
class(mod.mroz)

detach("package:car", unload = TRUE)
detach("package:carData", unload = TRUE)
#> [1] "glm" "lm"

The . on the right-hand side of the model formula indicates that the response variable lfp is to be regressed on all of the other variables in the Mroz data set (which is accessible because it resides in the {carData} package). If we invoke a generic function with mod.mroz as its argument, say fun(mod.mroz), then the R interpreter will look first for a method named fun.glm(). If a function by this name does not exist, then R will search next for fun.lm() and finally for fun.default(). We say that the object mod.mroz is of primary class "glm" and inherits from class "lm".
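
A minimal sketch of how this inheritance can be checked (reusing mod.mroz from above):

Code
class(mod.mroz)            # "glm" "lm": the primary class comes first
inherits(mod.mroz, "lm")   # TRUE: methods for "lm" serve as fallbacks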