5 How to deal with empty spaces

Warning: Empty spaces in column (variable) names or in variable values often cause trouble!

5.1 Empty spaces in variable names

After reading data into R, we should make a habit of using names() (or colnames()) to look at all the column names. If some column names contain empty spaces, we should take note of them and usually do something about it. Let me explain with an example.

Example:

rm(list=ls())

# load packages
library(ggplot2)

# Create a dataframe
set.seed(12345) # call set.seed(); assigning to it would not fix the random seed
fk_data <- data.frame(Year = rep(2011:2015, each = 4), 
                    AvgScore = round(rnorm(20, mean=50, sd=5), 0),
                    Gender = rep(c("Male", "Female"), 10))

# An extra variable
fk_data$"Subject " <- rep( c(rep("Maths", 2), rep("Stats", 2)), 5 )


# Make a plot
ggplot(fk_data, aes(x = Year, y = AvgScore, group = Gender, colour = Gender)) +
  geom_point() +
  geom_line() +
  facet_grid( . ~ Subject  )  
# The above plot does not work: the column is named "Subject " (with a trailing space), not Subject

# What about the following
ggplot(fk_data, aes(x = Year, y = AvgScore, group = Gender, colour = Gender)) +
  geom_point() +
  geom_line() +
  facet_grid( . ~ "Subject " )
# It does not work either

# The first fix
# We don't remove the space in "Subject "
ggplot(fk_data, aes(x = Year, y = AvgScore, group = Gender, colour = Gender)) +
  geom_point() +
  geom_line() +
  facet_grid( . ~ `Subject ` ) # backticks! It works fine here.
# Thanks to http://stackoverflow.com/questions/4551424/how-to-refer-to-a-variable-name-with-spaces


# The second fix.
# We remove the space in "Subject "
fk_data$Subject <- fk_data$"Subject " # create a new column/variable
fk_data$"Subject " <- NULL # remove the column
# Now make the plot
ggplot(fk_data, aes(x = Year, y = AvgScore, group = Gender, colour = Gender)) +
  geom_point() +
  geom_line() +
  facet_grid( . ~ Subject )
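
When several columns are affected, fixing them one at a time becomes tedious. The following is a minimal sketch, using base R only, of one way to find column names with leading or trailing spaces and clean them all in one step. The data frame df_demo and its column names are made up for illustration and are not part of the example above.

```r
# Hypothetical data frame with untidy column names (illustration only)
df_demo <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(df_demo) <- c("Year", "Subject ", " Avg Score")

# Flag the names that start or end with whitespace
bad_names <- grepl("^\\s|\\s$", names(df_demo))
names(df_demo)[bad_names]

# Trim leading/trailing whitespace from every column name
names(df_demo) <- trimws(names(df_demo))

# Optionally replace any remaining inner spaces with underscores
names(df_demo) <- gsub("\\s+", "_", names(df_demo))
names(df_demo)
```

Whether to replace inner spaces with underscores is a matter of taste; trimming alone is enough to make `Subject` usable without backticks.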

5.2 Empty spaces in variable values

Sometimes the values of a variable contain empty spaces at the beginning, at the end, or both. We should almost always remove them. Fortunately, this is easy to do with stringr::str_trim() or base R's trimws().

Example 1:

rm(list=ls())

# load packages
library(stringr)
library(dplyr)

# create a dataframe
fk_data <- data.frame(student_no = 1:4,
                      major = c("maths", " English", " maths ", "English "))

# find the number of students majoring in maths
(no_of_maths_majored_students <- sum(fk_data$major == "maths"))
# this gives 1 instead of 2, because " maths " is not equal to "maths"

# fix: remove the spaces 
no_of_students_by_major <- 
  fk_data %>% 
  mutate(major = str_trim(major, side = "both")) %>% # refer to the column directly inside mutate()
  group_by(major) %>% 
  summarise(count = n())
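
As a quick check, the count from the first attempt can be repeated after trimming. The sketch below uses base R's trimws() in place of str_trim() and rebuilds the same toy values so that it runs on its own; the names major_raw and major_clean are only for illustration.

```r
# Same values as the major column above, kept as a plain character vector
major_raw <- c("maths", " English", " maths ", "English ")

# Before trimming: only the exact match "maths" is counted
sum(major_raw == "maths")

# After trimming leading and trailing spaces, both maths students are counted
major_clean <- trimws(major_raw)
sum(major_clean == "maths")

# Counts per major, comparable to the group_by()/summarise() result
table(major_clean)
```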

Example 2: Removing leading/trailing and in-between spaces

(x <- "  a   big space      problem   ")
## [1] "  a   big space      problem   "
# firstly remove leading and trailing spaces
(y <- trimws(x))
## [1] "a   big space      problem"
# secondly remove 'extra' spaces in between
# thanks to https://stackoverflow.com/questions/19128327/how-to-remove-extra-white-space-between-words-inside-a-character-vector-using
(z <- gsub("\\s+"," ", y))
## [1] "a big space problem"
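
The two steps can also be combined into one small helper. The function name squish below is hypothetical, chosen for this sketch; it simply wraps the same trimws() and gsub() calls and works on a whole character vector.

```r
# Hypothetical helper: trim both ends, then collapse inner runs of whitespace
squish <- function(s) {
  gsub("\\s+", " ", trimws(s))
}

x <- c("  a   big space      problem   ", "no problem")
squish(x)
```

If stringr is already loaded, its str_squish() function does the same job, so a custom helper may not be needed.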