3 How to explore a “new” data set
By a “new” data set, I mean one that is new to us: we have never seen it before. To explore such a data set, we can follow these steps.
- Read the data into R.
- Find the dimensions of this data set by using dim().
- Understand the structure of the data by using str().
- See the first 6 rows of the data using head(); see the last 6 rows of the data using tail().
- Find out the names of all the (column) variables in the data set by using names(). Pay attention to any variable with “ID” (or “id”) as part of its name, since such a variable is usually the key we need when we want to join this data set with another one.
- Figure out which variables are of interest by reading the names. If an interesting variable is categorical, use unique() to find all the values that it takes in the data. If an interesting variable is continuous, use summary() to look at its five-number summary (minimum, first quartile, median, third quartile, maximum) together with the mean.
- Use View() to have a quick look at the whole data set.
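The first step, reading the data into R, depends on how the data are stored. As a minimal sketch, assuming the data sit in a CSV file, read.csv() can be used. The file and its contents below are hypothetical and are created on the spot so that the example runs on its own.

```r
# Hypothetical example: write a tiny CSV to a temporary file, then read it back.
tmp_file <- tempfile(fileext = ".csv")
write.csv(data.frame(ob_id = 1:3, score = c(2.5, 3.1, 4.8)),
          tmp_file, row.names = FALSE)

# read.csv() returns a data frame;
# stringsAsFactors = FALSE keeps text columns as character vectors
new_data <- read.csv(tmp_file, stringsAsFactors = FALSE)

# The usual first checks: dimensions and structure
dim(new_data)
str(new_data)
```

With a real file, replace tmp_file by the path to that file.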
rm(list = ls())

# Firstly, we must read data into R
# Here I will use fake data
fk_data <- data.frame(
  ob_id = 1:100,
  l_lower_case = sample(letters, 100, replace = TRUE),
  rand_number = rnorm(100),
  l_upper_case = sample(LETTERS, 100, replace = TRUE),
  true_or_false = sample(c("T", "F"), 100, replace = TRUE)
)

# Find the dimensions
d <- dim(fk_data)

# Find the structure
str(fk_data)

# See the first 6 rows
head(fk_data)

# See the last 6 rows
tail(fk_data)

# Find the column names
the_names <- names(fk_data)

# Find possible values of "l_lower_case"
possible_values <- unique(fk_data$l_lower_case)

# Find summary of "rand_number"
the_summary <- summary(fk_data$rand_number)

# View the data set
View(fk_data)
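The advice about “ID” variables can be illustrated with a small sketch. It assumes two hypothetical data frames, left_data and right_data, that share a key column named ob_id; grep() locates candidate ID columns by name, and base R's merge() performs the join. Note that a pattern such as "id" also matches unrelated names (for example "width"), so the result should be checked by eye.

```r
# Two hypothetical data sets sharing the key column "ob_id"
left_data  <- data.frame(ob_id = 1:5, l_lower_case = letters[1:5])
right_data <- data.frame(ob_id = c(2L, 4L, 6L), group = c("x", "y", "z"))

# Find column names that contain "id" (case-insensitive)
id_columns <- grep("id", names(left_data), ignore.case = TRUE, value = TRUE)

# Inner join: keep only the rows whose ob_id appears in both data sets
inner_joined <- merge(left_data, right_data, by = "ob_id")

# Left join: keep every row of left_data; unmatched rows get NA in "group"
left_joined <- merge(left_data, right_data, by = "ob_id", all.x = TRUE)
```

Comparing nrow(inner_joined) and nrow(left_joined) with nrow(left_data) is a quick way to see how many IDs failed to match.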