Data types and structures

ACTEX Learning - AFDP: R Session

Getting into Practice with R

Data manipulation is a fundamental skill in the field of data science and the actuarial profession. It involves the process of transforming, organizing, and cleaning raw data to make it suitable for tailored analysis and visualization.

In the actuarial context, tools like R are invaluable for tasks such as: calculating premium rates for mortality and morbidity products, evaluating the probability of financial loss or return, providing business risk consulting, and planning for pensions and retirement. In essence, R is a powerful tool for actuaries and data scientists to perform complex data analysis and modeling.

R Packages

R is equipped with a vast ecosystem of packages that extend its functionality.

R packages are collections of functions, data, and compiled code in a well-defined format, stored in a directory called the library. A fresh R installation includes a set of packages that are loaded automatically when R starts; these are referred to as the base packages. The base packages are always available and never need to be loaded explicitly.

Other packages are available for download from CRAN (the Comprehensive R Archive Network) or from other repositories such as GitHub, which typically hosts development versions of packages. To install a package from CRAN, use the install.packages() function with the package name as a quoted string. To load an installed package, use the library() function.
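As a short sketch of this workflow (dplyr is used here only as an example package name; the install step is commented out because it needs to run only once per machine):

```r
# Download and install a package from CRAN (run once per machine):
# install.packages("dplyr")

# Load a package for the current session. stats ships with base R,
# so this line always works without a prior install step:
library(stats)

exists("lm")  # TRUE: the package's functions are now on the search path
```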

Data Types

R has several data types, including numeric, character, logical, integer, and complex. The most common data types are:

  • Numeric - real numbers
  • Character - text
  • Logical - TRUE or FALSE
  • Integer - whole numbers
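Each of these types can be checked directly with the class() function:

```r
class(3.14)   # "numeric"   - real number
class("AFDP") # "character" - text
class(TRUE)   # "logical"
class(5L)     # "integer"   - the L suffix marks a whole number as integer
```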

R allows data to be stored in the form of:

  • Vectors of numbers, characters, or logical values
  • Matrices which are arrays with two dimensions
  • Arrays which are multi-dimensional generalizations of matrices
  • Data Frames which are matrices with columns of different types
  • Lists which are collections of objects
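Each of these structures can be created with a dedicated constructor function; the object names below are arbitrary:

```r
v  <- c(1, 2, 3)                                # vector of numbers
m  <- matrix(1:6, nrow = 2)                     # 2 x 3 matrix
a  <- array(1:8, dim = c(2, 2, 2))              # 3-dimensional array
df <- data.frame(id = 1:2, name = c("a", "b"))  # columns of different types
l  <- list(v, m, df)                            # a list can hold any objects
```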

Basic R Syntax

R is a case-sensitive language, meaning that it distinguishes between uppercase and lowercase letters. It uses the # symbol to add comments to the code. Comments are ignored by the R interpreter and are used to explain the code.

# Basic arithmetic and variable assignment
x <- 10
y <- 5
sum_xy <- x + y
sum_xy
[1] 15
# Simple interest formula
P <- 1000  # Principal
r <- 0.05  # Interest rate
t <- 2     # Time in years
A <- P * (1 + r * t)
A
[1] 1100

R Functions

R is a functional programming language, which means that it is based on functions. Functions are blocks of code that perform a specific task. They take input, process it, and return output.

There are two types of functions in R: base functions and user-defined functions. Base functions are built into R, while user-defined functions are created by the user.

Base Functions

For example, the mean() function calculates the average of a set of numbers: it takes a numeric vector as input and returns the average of its elements. To access a function's documentation in R, use the help() function or the ? operator.

?mean
help(mean)

The c() function is used to combine values into a vector or list.

mean(c(1, 2, 3, 4, 5))
[1] 3

Useful base R functions:

  • mean() - calculates the average of a set of numbers
  • sum() - calculates the sum of a set of numbers
  • sd() - calculates the standard deviation of a set of numbers
  • var() - calculates the variance of a set of numbers
  • min() - returns the minimum value in a set of numbers
  • max() - returns the maximum value in a set of numbers
  • length() - returns the length of a vector
  • str() - displays the structure of an R object
  • class() - returns the class of an object
  • typeof() - returns the type of an object
  • summary() - provides a summary of an object
  • plot() - creates a plot of data …
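Several of these functions applied to one small vector (the claim amounts below are hypothetical, chosen only for illustration):

```r
claims <- c(1200, 850, 430, 2100, 975)  # hypothetical claim amounts

mean(claims)    # average claim: 1111
sum(claims)     # total paid: 5555
min(claims)     # smallest claim: 430
max(claims)     # largest claim: 2100
length(claims)  # number of claims: 5
class(claims)   # "numeric"
```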

User-Defined Functions

name <- function(variables) {
  # Code block
}
x <- c(1, 2, 3, 4, 5)

avg <- function(x) {
  sum(x) / length(x)
}
avg(x)
[1] 3
mean(x)
[1] 3
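Extending the idea, a user-defined function can package the simple interest formula from earlier; the name simple_interest and the default t = 1 are illustrative choices:

```r
# Accumulated value under simple interest; t defaults to 1 year
simple_interest <- function(P, r, t = 1) {
  P * (1 + r * t)
}

simple_interest(1000, 0.05, 2)  # 1100, matching the earlier calculation
simple_interest(1000, 0.05)     # 1050, using the default t = 1
```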

Essential Data Preparation Techniques

Data wrangling, manipulation, and transformation are essential techniques for preparing and refining data for analysis. They involve a series of steps to ensure data quality: handling inconsistencies, filling gaps, correcting errors, removing duplicates, and merging datasets. Common data preparation tasks include:

  • Wrangling:

    • Handling missing values
    • Reshaping data (wide ↔ long)
    • Removing duplicates
    • Standardizing formats (e.g., dates, units)
    • Joining multiple datasets
  • Manipulation:

    • Selecting and filtering rows/columns (select(), filter())
    • Sorting (arrange())
    • Grouping and summarizing (group_by(), summarise())
  • Transformation:

    • Normalization or scaling (e.g., z-scores, min-max)
    • Encoding categorical variables
    • Creating new variables (mutate())
    • Applying functions (e.g., log, square root)
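The two scaling approaches named above can be written directly in base R (the vector x here is a toy example):

```r
x <- c(10, 20, 30)

# z-score normalization: subtract the mean, divide by the standard deviation
z <- (x - mean(x)) / sd(x)
z  # -1 0 1

# min-max scaling to the [0, 1] interval
mm <- (x - min(x)) / (max(x) - min(x))
mm # 0.0 0.5 1.0
```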

These processes, often performed using R packages like {dplyr}, {tidyr}, and {data.table}, are crucial steps in converting raw, disorganized data into meaningful insights. Together, they form the foundation for effective data analytics, which uses cleaned and well-structured data to perform descriptive, predictive, and inferential analyses.

As an example, consider a dataset with columns for ID, Age, and Salary. We can remove rows with missing values in the Age column and create a new column called IncomeGroup based on the Salary column.

# Create a data frame
data <- data.frame(
  ID = 1:5,
  Age = c(25, 30, NA, 45, 35),
  Salary = c(50000, 60000, 55000, NA, 70000)
)

data
  ID Age Salary
1  1  25  50000
2  2  30  60000
3  3  NA  55000
4  4  45     NA
5  5  35  70000

Handling missing values and creating a new variable can be done using the tidyverse set of packages. The dplyr package provides functions (or verbs) for data manipulation, such as filter(), mutate(), select(), and more. The magrittr package provides the pipe operator (%>%), used to chain functions together and make the code more readable. Furthermore, a pipe has recently been added to base R itself: the native pipe operator (|>), introduced in R 4.1.0.
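A minimal comparison of the two pipes (the magrittr line is commented out since it requires magrittr or a tidyverse package to be loaded; the native pipe needs R 4.1.0 or later):

```r
x <- c(1, 4, 9)

# magrittr pipe, after library(magrittr) or library(dplyr):
# x %>% sqrt() %>% sum()

# native pipe, available in base R since version 4.1.0:
x |> sqrt() |> sum()  # 1 + 2 + 3 = 6
```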

Load the tidyverse package:

library(tidyverse)

cleaned_data <- data %>%
  # Remove rows with missing Age values
  filter(!is.na(Age)) %>%
  # Create a new column based on Salary
  mutate(IncomeGroup = ifelse(Salary > 60000, "High", "Low"))

cleaned_data
  ID Age Salary IncomeGroup
1  1  25  50000         Low
2  2  30  60000         Low
3  4  45     NA        <NA>
4  5  35  70000        High

An interesting function to use is from the janitor package. The clean_names() function cleans the column names of a data frame by converting them to lowercase, replacing spaces with underscores, and removing special characters.

cleaned_data %>%
  # Clean column names (here the janitor package is called with ::)
  janitor::clean_names() 
  id age salary income_group
1  1  25  50000          Low
2  2  30  60000          Low
3  4  45     NA         <NA>
4  5  35  70000         High

R is a powerful tool for actuaries and data scientists: it offers a wide range of functions and packages for data analysis, modeling, and visualization. By mastering R programming, actuaries can enhance their analytical skills, improve decision-making, and drive innovation in the insurance industry.
