16 Numbers and factors

Reading numbers in a data file usually seems straightforward. However, misinterpretations of numeric information are among the most common errors made by inexperienced data analysts.

The two key topics (and corresponding R packages) of this chapter are:

  • numbers (and their representations)

  • factors (to handle categorical variables)

Please note: This topic is not covered in Spring 2024. Please proceed to the next Chapter 17: Text data.

16.1 Introduction

Many people struggle with numeric calculations. However, even people who excel in manipulating numbers tend to overlook their representational properties.

When thinking about the representation of numbers, even simple numbers become surprisingly complicated. This is mostly due to our intimate familiarity with particular forms of representation, which are quite arbitrary. Once we strip away our implicit assumptions, numbers turn out to be quite complex representational constructs.

Main distinctions: Numbers as numeric values vs. as ranks vs. as (categorical) labels.

Example: How tall is someone?

Distinguish between issues of judgment vs. measurement.

When providing a quantitative value, the same measurement can be made (mapped to, represented) on different scales:

  1. size in cm: comparisons between values are scaled (e.g., 20% larger, etc.)

  2. rank within group: comparisons between people are possible, but no longer scaled

  3. categories: can be ordered (small-medium-tall) or unordered (fits vs. does not fit)
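The three scales above can be sketched in base R. The names, heights, and cut points below are made-up assumptions for illustration, not data from this chapter:

```r
# Hypothetical heights (in cm) of five people
height_cm <- c(Ann = 158, Ben = 183, Cem = 171, Dia = 165, Eli = 192)

# 1. Numeric values: ratios and differences are meaningful
ratio <- height_cm[["Eli"]] / height_cm[["Ann"]]
round(ratio, 2)    # 1.22, i.e., Eli is about 22% taller than Ann

# 2. Ranks: the order is preserved, but distances are lost
height_rank <- rank(height_cm)
height_rank        # Ann 1, Ben 4, Cem 3, Dia 2, Eli 5

# 3. Categories: ordered size classes (cut points are arbitrary assumptions)
height_cat <- cut(height_cm,
                  breaks = c(0, 165, 180, Inf),
                  labels = c("small", "medium", "tall"),
                  ordered_result = TRUE)
as.character(height_cat)  # "small" "tall" "medium" "small" "tall"
```

Each step discards information: ranks no longer tell us by how much two people differ, and categories no longer distinguish people within the same class.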

Additionally, any height value can be expressed in different ways:

  • units: 180cm in ft?
  • accuracy: rounding
  • number system: “ten” vs. “zehn”, value \(10\) as “1010” in the binary system
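A minimal base R sketch of these three aspects, using the height of 180cm from the list above (the variable names are only illustrative):

```r
height_cm <- 180

# Units: the same height in feet (1 ft = 30.48 cm) and inches (1 in = 2.54 cm)
height_ft <- height_cm / 30.48
height_in <- height_cm / 2.54

# Accuracy: rounding changes the representation, not the person's height
round(height_ft, 1)   # 5.9
round(height_in)      # 71

# Number system: the digit string "1010" read as a binary number has the value 10
ten_from_binary <- strtoi("1010", base = 2)
ten_from_binary       # 10
```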

Why relevant?

Distinguish between the meaning and representation of objects:

  • Different uses of numbers allow different types of calculations and tests.
  • They should be assigned to different data types.

16.2 Essentials

Important aspects:

  • Different types of numbers: integers vs. doubles

  • Numbers as values vs. their name/description/representation (as strings of symbols/digits/numerals).

  • Some numbers are used to denote categories (identity and difference, but not magnitude/value).

16.3 Numbers

We typically think we know numbers. However, we typically deal with numbers that are represented in a specific numeral system. Our familiarity with specific numeral systems obscures our dependency on arbitrary conventions.

16.3.1 Types of numbers

Different types of numbers (integers vs. doubles):

  • integers
  • positive vs. negative
  • decimals, fractions
  • real numbers
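In R, the distinction between integers and doubles can be inspected with typeof(). The following sketch uses only base R; the variable names are arbitrary:

```r
x_int <- 2L      # the suffix L creates an integer
x_dbl <- 2       # numbers without a suffix are doubles by default

typeof(x_int)    # "integer"
typeof(x_dbl)    # "double"

# Both count as numeric and denote the same value ...
is.numeric(x_int) && is.numeric(x_dbl)   # TRUE
x_int == x_dbl                           # TRUE

# ... but they are not identical objects, as their types differ
identical(x_int, x_dbl)                  # FALSE

# Sequences created by : are integers; division always returns a double
typeof(1:3)          # "integer"
typeof(x_int / 2L)   # "double"
```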

Some oddities:

  • integers
  • errors due to floating point (im-)precision

Rules for rounding numbers.
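Some of these oddities and rounding rules can be demonstrated in base R. This is a small sketch rather than a complete treatment; note in particular that round() rounds halves to the nearest even digit, which may differ from the rule taught in school:

```r
# Integers have a limited range; exceeding it yields NA (with a warning)
.Machine$integer.max                      # 2147483647
int_overflow <- suppressWarnings(.Machine$integer.max + 1L)
is.na(int_overflow)                       # TRUE

# Floating point imprecision: most decimal fractions have no exact binary form
0.1 + 0.2 == 0.3                          # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))         # TRUE (comparison with tolerance)
sqrt(2)^2 == 2                            # FALSE

# Rounding to the nearest value (halves go to the even neighbor)
round(0.5)                   # 0
round(1.5)                   # 2
round(2.5)                   # 2
round(3.14159, digits = 2)   # 3.14
signif(123456, digits = 2)   # 120000

# Rounding in a fixed direction
floor(-1.5)     # -2
ceiling(-1.5)   # -1
trunc(-1.5)     # -1
```

When comparing doubles, all.equal() (or dplyr::near() in tidyverse code) is usually a safer choice than ==.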

16.3.2 Representing numbers

When reading a number (like \(123\)), we tend to overlook that this number is represented in a particular notational system. Technically, we need to distinguish between the numeric value \(123\) and the character string of numerical symbols (a.k.a. digits or numerals) “123”.
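In R, this distinction corresponds to two different data types. The following base R sketch contrasts the value with its digit string (object names are arbitrary):

```r
num <- 123      # a numeric value
chr <- "123"    # a character string of three digit symbols

is.numeric(num)     # TRUE
is.character(chr)   # TRUE

# Arithmetic works on values, but not on strings
num + 1             # 124
# chr + 1           # would fail: non-numeric argument to binary operator

# String functions work on the symbols
nchar(chr)          # 3

# Converting between both
as.numeric(chr) + 1   # 124
as.character(num)     # "123"

# Sorting values vs. sorting strings (radix sorting compares character by character)
sort(c(10, 9, 100))                            # 9 10 100
sort(c("10", "9", "100"), method = "radix")    # "10" "100" "9"
```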

Generally, numbers are represented according to notational conventions: Strings of dedicated symbols (e.g., Arabic numeral digits 0–9), to be interpreted according to rules (e.g., positional systems require expansion of polynomials).

Overall, our decimal system is a compromise (between more divisible and more unique bases) and a matter of definition. Importantly, our way of representing numbers is subject to two arbitrary conventions:

  1. a positional system with a base value of 10,

  2. representing unit values by the numeral symbols/digits 0–9.

Only changing 2. (by replacing familiar digits with arbitrary symbols) renders simple calculations much more difficult (see letter arithmetic problems).

In the following, we will preserve 2. (the digits and their meaning), but change the base value. This illustrates the difference between numeric values and their symbolic representations (as strings of numeric symbols/digits). (We only cover natural numbers, as they are complicated enough.)

Example

The digits “123” only denote the value of \(123\) in the Hindu-Arabic positional system with a decimal base (i.e., base 10 and numeric symbols 0–9).

Positional notation: Given \(n\) digits \(d_i\) and a base \(b\), a number’s numeric value \(v\) is given by expanding a polynomial sum:

\[v = \sum_{i=1}^{n} {d_i \cdot b^{i-1}}\]

with \(i\) representing each digit’s position (from right to left). Thus, the number \(123\) is a representational shortcut for

\[v = (3 \cdot 10^0) + (2 \cdot 10^1) + (1 \cdot 10^2) = 3 + 20 + 100\]
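The polynomial expansion can be traced step by step in base R. The object names below merely mirror the symbols of the formula:

```r
digits <- c(1, 2, 3)   # the digits of "123", written from left to right
b      <- 10           # base value
n      <- length(digits)

# Positions i count from right to left, so we reverse the digits
d_i <- rev(digits)     # 3 2 1
i   <- seq_len(n)      # 1 2 3

terms <- d_i * b^(i - 1)
terms                  # 3 20 100

v <- sum(terms)
v                      # 123
```

Changing b to another value (e.g., 5) while keeping the same digits yields a different numeric value (here \(3 + 10 + 25 = 38\)).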

Representing numbers in a base \(b\) positional system

Just like “ten” and “zehn” are two different ways of denoting the same value (\(10\)), we can write a given value in different notations. A simple way of showing this is to use positional number systems with different base values \(b\). As long as \(b \leq 10\), we do not need any new digit symbols (but note that the value of each digit must always be smaller than the base value: base \(b\) uses the digits from 0 to \(b - 1\)).

Note two consequences:

  1. The same symbol string represents different numeric values in different notations:
    The digit string “11” happens to represent a value of \(11\) in base-10 notation. However, the same digit string “11” represents a value of \(6\) in base-5 notation, and a value of \(3\) in base-2 notation.

  2. The same numeric value is represented differently in different notations:
    A given numeric value of \(11\) is written as “11” in decimal notation, but can alternatively be written as “1011” in base-2, “102” in base-3, and “12” in base-9 notation.
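Both consequences can be checked with the base R function strtoi(), which interprets a digit string in a given base (from 2 to 36). This is only a quick check and does not rely on the packages used in this chapter:

```r
# 1. The same digit string represents different values in different notations
strtoi("11", base = 10)   # 11
strtoi("11", base = 5)    # 6
strtoi("11", base = 2)    # 3

# 2. The same value is represented by different digit strings
strtoi("1011", base = 2)  # 11
strtoi("102",  base = 3)  # 11
strtoi("12",   base = 9)  # 11

# Digits must be smaller than the base: "2" is not a valid binary digit
strtoi("12", base = 2)    # NA
```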

Examples of alternative number systems

Alternatives to the base-10 positional system are not just an academic exercise. See

Examples in R

Viewing numbers as symbol strings (of digits): Formatting numbers (e.g., in text or tables)

ds4psy::num_as_char(1:10, n_pre_dec = 2, n_dec = 0)
#>  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"

Note the comma() function of the scales package.
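If neither package is at hand, similar formatting can be achieved in base R. The following sketch shows some common cases; in each case, the result is a character string, not a number:

```r
# Leading zeros (as in the num_as_char() example above)
padded <- sprintf("%02d", 1:10)
padded                               # "01" "02" ... "10"
formatC(1:10, width = 2, flag = "0") # same result

# A thousands separator
big <- format(1234567, big.mark = ",")
big                                  # "1,234,567"

# A fixed number of decimal places
dec <- sprintf("%.2f", 3.14159)
dec                                  # "3.14"

is.character(padded)                 # TRUE: formatted numbers are strings
```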

ds4psy::base2dec("11")
#> [1] 3
ds4psy::base2dec("111")
#> [1] 7
ds4psy::base2dec("100", base = 5)
#> [1] 25

Demonstration: Show a simulation that converts from decimal to another base and back:

Table 16.1: Simulation results of converting decimal numbers (base 10) into another base, and back.
n_org base n_base n_dec
409 15 1C4 409
798 48 GU 798
6340 4 1203010 6340
7179 18 142F 7179
7548 60 25m 7548
6252 6 44540 6252
3759 10 3759 3759
6676 58 1v6 6676
7605 8 16665 7605
3774 58 174 3774
4546 15 1531 4546
542 38 EA 542
9556 8 22524 9556
8973 50 3TN 8973
3390 2 110100111110 3390
5426 53 1nK 5426
4473 54 1Sj 4473
995 57 HQ 995
3116 24 59K 3116
1681 42 e1 1681

16.4 Factors

Factors are a special data type in R. As many users have a poor understanding of them and may even use them inadvertently, they have a bad reputation. However, they are often useful for representing variables that feature a small and fixed number of categories (e.g., S/M/L; female/male/other) or an ordered number of instances (e.g., days of the week, months). In statistics, factors allow testing for specific effects (e.g., comparing the results of several treatments with those of a control condition).

Categorical data could be represented as strings or as factors.

Example: Gender as “male” vs. “female”. But any survey with more than a few people will require additional categories, like “other” or “do not wish to respond”.
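The difference between both representations becomes visible when counting cases. The following sketch uses made-up survey answers; the level labels follow the example above:

```r
# Hypothetical survey answers
answers <- c("female", "male", "female", "other", "female")
gender_levels <- c("female", "male", "other", "do not wish to respond")

gender_chr <- answers                                  # as strings
gender_fct <- factor(answers, levels = gender_levels)  # as a factor

# Strings only know the values that were observed
names(table(gender_chr))   # "female" "male" "other"

# Factors know all possible categories, including those with zero cases
table(gender_fct)          # counts: 3 1 1 0

# Values that are not among the levels become NA (e.g., due to typos)
typo_fct <- factor(c("female", "mail"), levels = gender_levels)
is.na(typo_fct)            # FALSE TRUE
```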

Factors are useful and often indispensable (for graphing and statistical analysis), but their additional complexity increases the chance of unexpected behavior and pitfalls.

16.4.1 Example

Different factor values are internally represented as integers. This may occasionally seem confusing:

x <- 4:6
c(x)
#> [1] 4 5 6

y <- factor(x)
c(y)
#> [1] 4 5 6
#> Levels: 4 5 6
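The confusion becomes apparent when converting the factor back into numbers. A short base R sketch, re-using x and y from above:

```r
x <- 4:6
y <- factor(x)

# The internal integer codes are positions among the levels, not the values
as.integer(y)                 # 1 2 3
levels(y)                     # "4" "5" "6"

# To recover the original numbers, convert via the level labels
as.numeric(as.character(y))   # 4 5 6
as.numeric(levels(y))[y]      # 4 5 6 (equivalent)
```

Applying as.numeric() directly to a factor thus silently returns the codes, which is a common source of errors.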

16.4.2 Basics

Factors represent categorical variables in R. When to use factors?

  • for ordered categories (e.g., in graphs)
  • for defining experimental conditions

Important to recognize them and represent them properly.

Using numbers to categorize objects: Distinguish un-ordered vs. ordered categories.
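In R, this distinction corresponds to unordered vs. ordered factors. A minimal sketch with hypothetical clothing sizes:

```r
size <- c("M", "S", "L", "M")

# Same levels, but only the second factor treats them as ordered
size_unord <- factor(size, levels = c("S", "M", "L"))
size_ord   <- factor(size, levels = c("S", "M", "L"), ordered = TRUE)

is.ordered(size_unord)      # FALSE
is.ordered(size_ord)        # TRUE

# Ordered factors support comparisons of magnitude
size_ord < "L"              # TRUE TRUE FALSE TRUE
as.character(max(size_ord)) # "L"

# Unordered factors do not: comparisons yield NA (with a warning)
cmp_unord <- suppressWarnings(size_unord < "L")
all(is.na(cmp_unord))       # TRUE
```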

See examples of unexpected behavior of factors in

16.4.3 Tasks

This section is structured by factor-related tasks:

  1. Turn a variable into a factor: as.factor()

  2. Define a factor with levels: factor()

  3. Change the labels of factor levels

  4. Re-order factor levels

  5. Combine several levels into one (for both string-like and numeric labels)

  6. Make derived factor variables
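The six tasks can be sketched in base R as follows. The data and labels are made up for illustration, and this is only one of several possible ways of solving each task (the forcats package offers dedicated functions for most of them):

```r
# Hypothetical data: clothing sizes as strings
size_chr <- c("M", "S", "L", "M", "XL", "S")

# 1. Turn a variable into a factor (levels are sorted alphabetically)
f1 <- as.factor(size_chr)
levels(f1)                       # "L" "M" "S" "XL"

# 2. Define a factor with explicit levels
f2 <- factor(size_chr, levels = c("S", "M", "L", "XL"))
levels(f2)                       # "S" "M" "L" "XL"

# 3. Change the labels of factor levels
f3 <- f2
levels(f3) <- c("small", "medium", "large", "x-large")
as.character(f3)[1]              # "medium"

# 4. Re-order factor levels (the data values remain unchanged)
f4 <- factor(f2, levels = rev(levels(f2)))
levels(f4)                       # "XL" "L" "M" "S"

# 5. Combine several levels into one
f5 <- f2
levels(f5) <- list(small = c("S", "M"), large = c("L", "XL"))
table(f5)                        # small: 4, large: 2

# 6. Make a derived factor variable
f6 <- factor(ifelse(f2 %in% c("S", "M"), "regular", "oversize"),
             levels = c("regular", "oversize"))
table(f6)                        # regular: 4, oversize: 2
```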

Note that the last 4 tasks are from McNamara & Horton (2018).

Note that the forcats package (Wickham, 2023a) is used in Wickham & Grolemund (2017) and Wickham, Çetinkaya-Rundel, et al. (2023).

16.5 Conclusion

Numbers are trickier than we usually think. Not only are there different kinds of numbers, but any given number can be represented in many different ways. Choosing the right kind of number depends on what we want to express by it (i.e., the number’s function or our use of it).

16.5.1 Summary

Key question: Do we mean the number values or their representations?

  • If value: What do the number values denote (semantics)? Which types of numbers and which level of accuracy are needed?
  • If representations: In which context are number representations needed? (As numerals or words? Which system? Which accuracy?)

16.5.2 Resources

16.5.2.1 Numbers

16.5.2.2 Factors

Resources for using factors in base R:

Handling factors in tidyverse contexts using the forcats package (Wickham, 2023a):

A good overview of forcats is found on one of the Posit cheatsheets:

Figure 16.1: Handling factors with forcats from Posit cheatsheets.

16.5.3 Preview

What’s next?

16.6 Exercises

16.6.1 Converting decimal numbers into base N

  1. Create a conversion function dec2base(x, base) that converts a decimal number x into positional notation with a different base (with \(2 \leq\) base \(\leq 10\)). Thus, the dec2base() function provides a complement to base2dec() from the i2ds package:
library(i2ds)
base2dec(11, base = 2)
#> [1] 3
dec2base(3,  base = 2)
#> [1] "11"
  2. Use your dec2base() function to compute the following conversions:
dec2base(100, base =  2)
dec2base(100, base =  3)
dec2base(100, base =  5)
dec2base(100, base =  9)
dec2base(100, base = 10)
  3. Create a brief simulation that samples \(N = 20\) random decimal numbers \(x_i\) and base values \(b_i\) \((2 \leq b_i \leq 10)\) and shows that

base2dec(dec2base(\(x_i\), \(b_i\)), \(b_i\)) == \(x_i\)

(i.e., converting a numeric value from decimal notation into a number in base \(b_i\) notation, and back into decimal notation yields the original numeric value).

Solution

Table 16.2: Convert integer values from decimal to base notation, and back to decimal notation.
n_org base n_base n_dec same
6014 7 23351 6014 TRUE
364 8 554 364 TRUE
8030 3 102000102 8030 TRUE
8756 4 2020310 8756 TRUE
5858 7 23036 5858 TRUE
5170 8 12062 5170 TRUE
461 6 2045 461 TRUE
518 10 518 518 TRUE
9642 6 112350 9642 TRUE
7861 9 11704 7861 TRUE
4785 10 4785 4785 TRUE
5514 5 134024 5514 TRUE
4943 8 11517 4943 TRUE
2666 4 221222 2666 TRUE
305 9 368 305 TRUE
7750 3 101122001 7750 TRUE
5471 5 133341 5471 TRUE
5043 7 20463 5043 TRUE
2206 7 6301 2206 TRUE
3280 4 303100 3280 TRUE

16.6.2 Exercise