16 Numbers and factors
Seeing numbers in a data file usually seems straightforward. However, erroneous interpretations of numeric information are among the most common errors for inexperienced data analysts.
The two key topics (and corresponding R packages) of this chapter are:
numbers (and their representations)
factors (to handle categorical variables)
Please note: This topic is not covered in Spring 2024. Please proceed to the next Chapter 17: Text data.
16.1 Introduction
Many people struggle with numeric calculations. However, even people who excel in manipulating numbers tend to overlook their representational properties.
When thinking about the representation of numbers, even simple numbers become surprisingly complicated. This is mostly due to our intimate familiarity with particular forms of representation, which are quite arbitrary. When stripping away our implicit assumptions, numbers are quite complex representational constructs.
Main distinctions: Numbers as numeric values vs. as ranks vs. as (categorical) labels.
Example: How tall is someone?
Distinguish between issues of judgment vs. measurement.
When providing a quantitative value, the same measurement can be made (mapped to, represented) on different scales:
size in cm: comparisons between values are scaled (e.g., 20% larger, etc.)
rank within group: comparisons between people are possible, but no longer scaled
categories: can be ordered (small-medium-tall) or unordered (fits vs. does not fit)
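The three scales above can be illustrated with a small R sketch. The height values and the cut points for the categories are made-up assumptions, chosen only for illustration:

```r
# Hypothetical heights (in cm) of five people:
height_cm <- c(182, 165, 174, 191, 158)

# 1. Numeric values: ratios between values are meaningful
height_cm[1] / height_cm[2]   # about 1.10, i.e., roughly 10% taller

# 2. Ranks: the order is preserved, but distances are lost
height_rank <- rank(height_cm)
height_rank                   # 4 2 3 5 1

# 3. Ordered categories (the cut points 165 and 180 are arbitrary assumptions)
height_cat <- cut(height_cm,
                  breaks = c(-Inf, 165, 180, Inf),
                  labels = c("small", "medium", "tall"),
                  ordered_result = TRUE)
height_cat                    # tall small medium tall small
```

Each step discards information: the ranks no longer show how much taller someone is, and the categories no longer distinguish between two people in the same group.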
Additionally, any height value can be expressed in different ways:
 units: 180cm in ft?
 accuracy: rounding
 number system: “ten” vs. “zehn”, value \(10\) as “1010” in the binary system
Why relevant?
Distinguish between the meaning and representation of objects:
 Different uses of numbers allow different types of calculations and tests.
 They should be assigned to different data types.
16.2 Essentials
Important aspects:
Different types of numbers: integers vs. doubles
Numbers as values vs. their name/description/representation (as strings of symbols/digits/numerals).
Some numbers are used to denote categories (identity and difference, but not magnitude/value).
16.3 Numbers
We typically think we know numbers. However, we typically deal with numbers that are represented in a specific numeral system. Our familiarity with specific numeral systems obscures our dependency on arbitrary conventions.
16.3.1 Types of numbers
Different types of numbers (integers vs. doubles):
 integers
 positive vs. negative
 decimals, fractions
 real numbers
Some oddities:
 integers
 errors due to floating point (im)precision
Rules for rounding numbers.
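A few base R expressions may help to make these points concrete. This is only a brief sketch of well-known behavior, not an exhaustive treatment:

```r
# Integers vs. doubles:
typeof(1)     # "double": numbers are stored as doubles by default
typeof(1L)    # "integer": the suffix L requests an integer
1L == 1       # TRUE: same value, different type

# Integers have a limited range:
.Machine$integer.max                         # 2147483647
suppressWarnings(.Machine$integer.max + 1L)  # NA (integer overflow)

# Floating point imprecision:
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: comparison with a tolerance

# Rounding: R rounds halves to the nearest even digit
round(0.5)    # 0
round(1.5)    # 2
round(2.5)    # 2
```

When comparing computed doubles, a tolerance-based test such as all.equal() is usually safer than the == operator.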
16.3.2 Representing numbers
When reading a number (like \(123\)), we tend to overlook that this number is represented in a particular notational system. Technically, we need to distinguish between the numeric value \(123\) and the character string of numerical symbols (aka. digits or numerals) “123”.
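In R, the difference between a numeric value and a string of digits shows up as a difference in data type. The following sketch uses the hypothetical object names x_num and x_chr:

```r
x_num <- 123     # a numeric value
x_chr <- "123"   # a character string of three digits

is.numeric(x_num)       # TRUE
is.character(x_chr)     # TRUE
nchar(x_chr)            # 3: a string has a length in symbols

x_num + 1               # 124
as.numeric(x_chr) + 1   # 124: convert the string before calculating
# x_chr + 1 would fail: arithmetic is not defined for character strings

# Sorting treats values and strings differently:
sort(c(10, 9, 2))                           # 2 9 10
sort(c("10", "9", "2"), method = "radix")   # "10" "2" "9"
```

The last line sorts symbol by symbol (here in the C locale, as requested by method = "radix"), which is why "10" comes before "2".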
Generally, numbers are represented according to notational conventions: Strings of dedicated symbols (e.g., Arabic numeral digits 0–9), to be interpreted according to rules (e.g., positional systems require expansion of polynomials).
Overall, our decimal system is a compromise (between more divisible and more unique bases) and a matter of definition. Importantly, our way of representing numbers is subject to two arbitrary conventions:
1. a positional system with a base value of 10,
2. representing unit values by the numeral symbols/digits 0–9.
Only changing 2. (by replacing familiar digits with arbitrary symbols) renders simple calculations much more difficult (see letter arithmetic problems).
In the following, we will preserve 2. (the digits and their meaning), but change the base value. This illustrates the difference between numeric values and their symbolic representations (as strings of numeric symbols/digits). (We only cover natural numbers, as they are complicated enough.)
Example
The digits “123” only denote the value of \(123\) in the Hindu-Arabic positional system with a decimal base (i.e., base 10 and numeric symbols 0–9).
Positional notation: Given \(n\) digits \(d_i\) and a base \(b\), a number’s numeric value \(v\) is given by expanding a polynomial sum:
\[v = \sum_{i=1}^{n} {d_i \cdot b^{i-1}}\]
with \(i\) representing each digit’s position (from right to left). Thus, the number \(123\) is a representational shortcut for
\[v = (3 \cdot 10^0) + (2 \cdot 10^1) + (1 \cdot 10^2) = 3 + 20 + 100\]
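The same expansion can be computed step by step in R. This is a minimal sketch; the object names (digits, d, terms, v) are chosen only for illustration:

```r
digits <- c(1, 2, 3)   # the digits of "123", read from left to right
b <- 10                # base value
n <- length(digits)

# Position i counts from right to left, so we reverse the digits first:
d <- rev(digits)       # d[1] = 3, d[2] = 2, d[3] = 1
i <- 1:n

terms <- d * b^(i - 1) # 3 20 100
v <- sum(terms)        # 123
v
```

Changing b to another base value (and using only digits smaller than b) yields the value that the same digit string denotes in that base.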
Representing numbers in a base \(b\) positional system
Just like “ten” and “zehn” are two different ways of denoting the same value (\(10\)), we can write a given value in different notations. A simple way of showing this is to use positional number systems with different base values \(b\). As long as \(b \leq 10\), we do not need any new digit symbols (but note that the value of each digit must always be smaller than the base value).
Note two consequences:
The same symbol string represents different numeric values in different notations:
The digit string “11” happens to represent a value of \(11\) in base-10 notation. However, the same digit string “11” represents a value of \(6\) in base-5 notation, and a value of \(3\) in base-2 notation.
The same numeric value is represented differently in different notations:
A given numeric value of \(11\) is written as “11” in decimal notation, but can alternatively be written as “1011” in base-2, “102” in base-3, and “12” in base-9 notation.
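Both consequences can be checked with the base R function strtoi(), which interprets a digit string in a given base. This is offered as a quick check and is independent of the package functions used in this chapter:

```r
# The same digit string denotes different values in different bases:
strtoi("11", base = 10)   # 11
strtoi("11", base = 5)    # 6
strtoi("11", base = 2)    # 3

# Different digit strings can denote the same value:
strtoi("1011", base = 2)  # 11
strtoi("102",  base = 3)  # 11
strtoi("12",   base = 9)  # 11

# A digit that is not smaller than the base is invalid:
strtoi("12", base = 2)    # NA
```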
Examples of alternative number systems
Alternatives to the base-10 positional system are not just an academic exercise. See
 binary number systems (see Wikipedia: Binary number)
 hexadecimal numbers (see Wikipedia: Hexadecimal)
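Base R offers some support for both systems. The following sketch shows a few ways of moving between decimal, hexadecimal, and binary notation:

```r
# Hexadecimal notation (base 16, digits 0-9 and A-F):
0xFF                      # 255: R accepts hexadecimal literals
strtoi("FF", base = 16)   # 255
format(as.hexmode(255))   # "ff"

# Binary notation (base 2):
strtoi("1010", base = 2)  # 10

# intToBits() returns 32 bits, least significant bit first:
bits <- as.integer(intToBits(10))[1:8]
paste(rev(bits), collapse = "")   # "00001010"
```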
Examples in R
Viewing numbers as symbol strings (of digits): Formatting numbers (e.g., in text or tables)

The num_as_char() function from the ds4psy package:
ds4psy::num_as_char(1:10, n_pre_dec = 2, n_dec = 0)
#> [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
Note also the comma() function of the scales package, which formats numbers with thousands separators.
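Similar formatting tasks can be handled without additional packages. As a sketch, the base R functions formatC(), format(), and sprintf() turn numbers into character strings:

```r
# Pad integers with leading zeros:
formatC(1:10, width = 2, flag = "0")
# "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"

# Insert a thousands separator:
format(1234567, big.mark = ",")              # "1,234,567"

# Fix the number of decimal places:
formatC(3.14159, digits = 2, format = "f")   # "3.14"
sprintf("%.1f", 3.14159)                     # "3.1"
```

In all cases, the result is a character string (i.e., a representation), so further calculations require converting it back into a number.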

The base2dec() function from the ds4psy package:
ds4psy::base2dec("11")
#> [1] 3
ds4psy::base2dec("111")
#> [1] 7
ds4psy::base2dec("100", base = 5)
#> [1] 25
Demonstration: A simulation that converts numbers from decimal notation into another base and back:
n_org  base  n_base  n_dec 

409  15  1C4  409 
798  48  GU  798 
6340  4  1203010  6340 
7179  18  142F  7179 
7548  60  25m  7548 
6252  6  44540  6252 
3759  10  3759  3759 
6676  58  1v6  6676 
7605  8  16665  7605 
3774  58  174  3774 
4546  15  1531  4546 
542  38  EA  542 
9556  8  22524  9556 
8973  50  3TN  8973 
3390  2  110100111110  3390 
5426  53  1nK  5426 
4473  54  1Sj  4473 
995  57  HQ  995 
3116  24  59K  3116 
1681  42  e1  1681 
16.4 Factors
Factors are a special data type in R. As many users have a poor understanding of them and may even use them inadvertently, they have a bad reputation. However, they are often useful for representing variables that feature a small and fixed number of categories (e.g., S/M/L; female/male/other) or an ordered number of instances (e.g., days of the week, months). In statistics, factors allow testing for specific effects (e.g., comparing the results of several treatments with those of a control condition).
Categorical data could be represented as strings or as factors.
Example: Gender as “male” vs. “female”. But any survey with more than a few people will require additional categories, like “other” or “do not wish to respond”.
Factors are useful and often indispensable (for graphing and statistical analysis), but their additional complexity increases the chance of unexpected behavior and pitfalls.
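The difference between both representations becomes visible when counting cases. The following sketch uses made-up responses; the level "other" is defined, but happens not to occur in the data:

```r
# Hypothetical survey responses:
gender_chr <- c("female", "male", "female", "female")

# The same data as a factor with a fixed set of possible categories:
gender_fct <- factor(gender_chr, levels = c("female", "male", "other"))

table(gender_chr)   # counts only the values that occur (female: 3, male: 1)
table(gender_fct)   # also reports unused levels (other: 0)

levels(gender_fct)  # "female" "male" "other"
```

A character vector only knows the values that are present, whereas a factor also knows which values are possible.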
16.4.1 Example
Different factor values are internally represented as integers. This may occasionally seem confusing:
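A typical case is a factor whose labels look like numbers. The following minimal sketch (with made-up values) shows that converting such a factor into numbers returns the internal integer codes, rather than the values shown by the labels:

```r
f <- factor(c("10", "20", "10", "50"))

levels(f)       # "10" "20" "50"
typeof(f)       # "integer": factors are stored as integer codes
as.integer(f)   # 1 2 1 3: positions in levels(f), not the values

# To recover the numeric values, convert via character strings:
as.numeric(as.character(f))   # 10 20 10 50
```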
16.4.2 Basics
Factors represent categorical variables in R. When to use factors?
 for ordered categories (e.g., in graphs)
 for defining experimental conditions
Important to recognize them and represent them properly.
Using numbers to categorize objects: Distinguish unordered vs. ordered categories.
See examples of unexpected behavior of factors in
 McNamara, A., & Horton, N. J. (2017). Wrangling categorical data in R. PeerJ Preprints, 5:e3163v2 https://doi.org/10.7287/peerj.preprints.3163v2
16.4.3 Tasks
Structure section by factor-related tasks:
Turn a variable into a factor: as.factor()
Define a factor with levels: factor()
Changing the labels of factor levels
Reorder factor levels
Combining several levels into one (both string-like labels and numeric), and
Making derived factor variables.
Note that the last four tasks are from McNamara & Horton (2018).
Note that the forcats package (Wickham, 2023a) is used in Wickham & Grolemund (2017) and Wickham, Çetinkaya-Rundel, et al. (2023).
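The tasks listed above can be sketched in base R as follows. The data and object names (size, f1 to f5, is_large) are made up for illustration; the forcats package provides dedicated functions for the same tasks, which are not shown here:

```r
size <- c("M", "S", "L", "M", "XL", "S")   # hypothetical data

# Turn a variable into a factor (levels are sorted alphabetically):
f1 <- as.factor(size)
levels(f1)                                 # "L" "M" "S" "XL"

# Define a factor with levels (in a meaningful order):
f2 <- factor(size, levels = c("S", "M", "L", "XL"))

# Change the labels of factor levels:
f3 <- f2
levels(f3) <- c("small", "medium", "large", "x-large")

# Reorder factor levels (here: reverse the order):
f4 <- factor(f2, levels = rev(levels(f2)))

# Combine several levels into one:
f5 <- f2
levels(f5) <- list(small = "S", medium = "M", large = c("L", "XL"))

# Make a derived factor variable:
is_large <- factor(f5 == "large", levels = c(FALSE, TRUE),
                   labels = c("no", "yes"))
```

Note that relabeling (f3) and reordering (f4) are different operations: the former changes what the levels are called, the latter changes their order, but neither changes which observation belongs to which category.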
16.5 Conclusion
Numbers are trickier than we usually think. Not only are there different kinds of numbers, but any given number can be represented in many different ways. Choosing the right kind of number depends on what we want to express by it (i.e., the number’s function or our use of it).
16.5.1 Summary
Key question: Do we mean the number values or their representations?
 If value: What do the number values denote (semantics)? What types of numbers and which level of accuracy are needed?
 If representations: In which context are number representations needed? (As numerals or words? Which system? Which accuracy?)
16.5.2 Resources
16.5.2.2 Factors
Resources for using factors in base R:
McNamara, A., & Horton, N. J. (2018). Wrangling categorical data in R. The American Statistician, 72(1), 97–104. doi: 10.1080/00031305.2017.1356375
Also available at PeerJ.com: https://doi.org/10.7287/peerj.preprints.3163v2
Son Nguyen (2020). Efficient R programming. See Section 3.4 Factors.
Handling factors in tidyverse contexts using the forcats package (Wickham, 2023a):
Chapter 15 Factors of Wickham & Grolemund (2017)
Chapter 16 Factors of Wickham, Çetinkaya-Rundel, et al. (2023)
A good overview of forcats is found on one of the Posit cheatsheets:
16.6 Exercises
16.6.1 Converting decimal numbers into base N
 Create a conversion function dec2base(x, base) that converts a decimal number x into a positional number of a different base (with \(2 \leq\) base \(\leq 10\)). Thus, the dec2base() function provides a complement to base2dec() from the i2ds package.
 Use your dec2base() function to compute the following conversions:
dec2base(100, base = 2)
dec2base(100, base = 3)
dec2base(100, base = 5)
dec2base(100, base = 9)
dec2base(100, base = 10)
 Create a brief simulation that samples \(N = 20\) random decimal numbers \(x_i\) and base values \(b_i\) (with \(2 \leq b_i \leq 10\)) and shows that base2dec(dec2base(\(x_i,\ b_i\)), \(b_i\)) == \(x_i\) (i.e., converting a numeric value from decimal notation into a number in base \(b_i\) notation, and back into decimal notation yields the original numeric value).
Solution
n_org  base  n_base  n_dec  same 

6014  7  23351  6014  TRUE 
364  8  554  364  TRUE 
8030  3  102000102  8030  TRUE 
8756  4  2020310  8756  TRUE 
5858  7  23036  5858  TRUE 
5170  8  12062  5170  TRUE 
461  6  2045  461  TRUE 
518  10  518  518  TRUE 
9642  6  112350  9642  TRUE 
7861  9  11704  7861  TRUE 
4785  10  4785  4785  TRUE 
5514  5  134024  5514  TRUE 
4943  8  11517  4943  TRUE 
2666  4  221222  2666  TRUE 
305  9  368  305  TRUE 
7750  3  101122001  7750  TRUE 
5471  5  133341  5471  TRUE 
5043  7  20463  5043  TRUE 
2206  7  6301  2206  TRUE 
3280  4  303100  3280  TRUE 
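One possible way of producing such a table is sketched below. The functions dec2base_sketch() and base2dec_sketch() are hypothetical helpers written for this illustration (assuming non-negative integers and \(2 \leq\) base \(\leq 10\)); they are not the package functions, and the random numbers will differ from those shown above:

```r
# Convert a non-negative integer x into a digit string in a given base:
dec2base_sketch <- function(x, base = 2) {
  stopifnot(x >= 0, x == round(x), base >= 2, base <= 10)
  if (x == 0) return("0")
  digits <- integer(0)
  while (x > 0) {
    digits <- c(x %% base, digits)  # the remainder is the next digit (from the right)
    x <- x %/% base                 # continue with the integer quotient
  }
  paste(digits, collapse = "")
}

# Convert a digit string in a given base back into a decimal value:
base2dec_sketch <- function(s, base = 2) {
  d <- as.integer(strsplit(s, "")[[1]])
  stopifnot(all(d < base))          # each digit must be smaller than the base
  sum(rev(d) * base^(seq_along(d) - 1))
}

# Simulation with N = 20 random numbers and base values:
set.seed(1)  # arbitrary seed, for reproducibility
N <- 20
n_org  <- sample(1:9999, N)
base   <- sample(2:10, N, replace = TRUE)
n_base <- mapply(dec2base_sketch, n_org, base)
n_dec  <- mapply(base2dec_sketch, n_base, base)

result <- data.frame(n_org, base, n_base, n_dec,
                     same = (n_org == n_dec), row.names = NULL)
all(result$same)  # TRUE
```

The conversion in dec2base_sketch() relies on repeated integer division: each remainder yields one digit, starting with the rightmost position.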