16 How to restore lost zeros
I once ran into the so-called “lost zeros” problem. It can be described as follows: a column called “ID” contains n-digit numbers (e.g. n = 7), but for whatever reason the leading zeros have been lost, so that “0926541”, “0024267” and “0003115” become “926541”, “24267” and “3115”, respectively. We must restore the lost zeros and convert “ID” to character type, because we need this variable to join the data set with another data set. How can we solve this problem?
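To make the problem concrete, here is a minimal sketch of how the zeros can get lost, assuming the IDs were converted to a numeric type somewhere along the way (for instance during import). The variable names id_chr, id_num and id_back are made up for this illustration.

```r
# IDs as they should look: 7-character strings with leading zeros
id_chr <- c("0926541", "0024267", "0003115")

# Converting to a numeric type silently drops the leading zeros
id_num <- as.numeric(id_chr)
print(id_num)

# Converting back to character does not bring the zeros back
id_back <- as.character(id_num)
print(id_back)

# The strings now have 6, 5 and 4 characters instead of 7
print(nchar(id_back))
```

A join on “ID” against a data set that still holds the 7-character strings would then fail to match these rows.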
The major steps of my solution are as follows:
1. Write a helper function, helper_func(), that turns each count in a vector into a string of that many zeros:
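helper_func <- function(x) {
  # x: vector of counts; element i of the result is a string of x[i] zeros
  m <- length(x)
  y <- rep("", m)
  # seq_len(m) is safe when x is empty, whereas 1:m would loop over c(1, 0)
  for (i in seq_len(m)) {
    if (x[i] > 0) y[i] <- paste(rep(0, x[i]), collapse = "")
  }
  return(y)
}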
2. Create a new variable, “no_zeros”, holding the number of zeros that must be added to each ID to give it n digits. A positive integer ID has floor(log10(ID)) + 1 digits, so the formula is: \[ n - \left(\hbox{floor}(\hbox{log10}(\hbox{ID})) + 1\right) \] Note that using ceiling(log10(ID)) as the digit count gives one zero too many whenever ID is an exact power of 10, e.g. 1000.
3. Use dplyr::mutate() and helper_func() to create a new column, “the_zeros”, which contains the corresponding string of zeros.
4. Use paste0() to concatenate “the_zeros” and “ID”, and assign the result to “ID”.
5. Use dplyr::select() to drop “no_zeros” and “the_zeros”.
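The steps above can be sketched on the three IDs from the problem statement using base R only. This is an illustration rather than the full solution: the names n_digits, ids, digits_present and restored are made up here, and the vectorised strrep() stands in for the helper function.

```r
n_digits <- 7                          # assumed target width
ids <- c(926541L, 24267L, 3115L)       # IDs whose leading zeros were lost

# number of digits currently present in each ID
digits_present <- floor(log10(ids)) + 1

# number of zeros to put back
no_zeros <- n_digits - digits_present

# strrep() repeats a string; it is vectorised over the `times` argument
the_zeros <- strrep("0", no_zeros)

# glue the zeros in front of the IDs; the result is a character vector
restored <- paste0(the_zeros, ids)
print(restored)
# [1] "0926541" "0024267" "0003115"
```

The IDs are written as integers (with the L suffix) on purpose: paste0() converts doubles with as.character(), which can switch to scientific notation for round values such as 100000.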
Example:
rm(list=ls())
# load packages
library(dplyr)
# create fake data
fk_data <- data.frame(ID = sample(1000:9999999, 1000, replace = FALSE),
d = rnorm(1000))
# a helper function
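helper_func <- function(x) {
  # x: vector of counts; element i of the result is a string of x[i] zeros
  m <- length(x)
  y <- rep("", m)
  # seq_len(m) is safe when x is empty, whereas 1:m would loop over c(1, 0)
  for (i in seq_len(m)) {
    if (x[i] > 0) y[i] <- paste(rep(0, x[i]), collapse = "")
  }
  return(y)
}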
# data manipulation
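fk_data <- fk_data %>%
  # digits present = floor(log10(ID)) + 1; ceiling(log10(ID)) would be
  # off by one for exact powers of 10 such as 1000
  mutate(no_zeros = 7 - (floor(log10(ID)) + 1)) %>%
  mutate(the_zeros = helper_func(no_zeros)) %>%
  mutate(ID = paste0(the_zeros, ID)) %>%
  select(-no_zeros, -the_zeros)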
Remark: After I had done the above, I learned from a colleague a much easier way to solve the problem: use sprintf(). Below is the R code that redoes the above example with sprintf().
rm(list=ls())
library(dplyr)
# create fake data
fk_data <- data.frame(ID = sample(1000:9999999, 1000, replace = FALSE),
d = rnorm(1000))
# data manipulation: "%07d" formats an integer zero-padded to a width of 7
fk_data <- fk_data %>%
  mutate(ID = sprintf("%07d", ID))
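As a small follow-up sketch, the sprintf() approach can be tried on a few hand-picked IDs, including the power-of-10 case 1000, and compared with formatC(), a base R alternative. The names ids_int, padded_sprintf and padded_formatC are made up for this illustration. Note that the "%d" format expects whole-number numeric input; an ID column stored as character would need as.integer() first.

```r
ids_int <- c(926541L, 24267L, 3115L, 1000L)

# sprintf() with "%07d": pad with zeros on the left to a width of 7
padded_sprintf <- sprintf("%07d", ids_int)
print(padded_sprintf)
# [1] "0926541" "0024267" "0003115" "0001000"

# formatC() with flag = "0" pads with zeros in the same way
padded_formatC <- formatC(ids_int, width = 7, flag = "0", format = "d")
print(padded_formatC)

# quick checks before using the column as a join key
stopifnot(is.character(padded_sprintf))
stopifnot(all(nchar(padded_sprintf) == 7))
```

Checks like the last two are cheap and catch IDs that are longer than the assumed width before the join silently drops rows.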