5.7 Variable Selection and Cleaning

Next, we extract the items that measure each latent variable from the data. You can verify the code assigned to each item against the technical report mentioned in the previous section.

  • ICT Interest (INTICT) was measured by 6 items: IC013Q01NA, IC013Q04NA, IC013Q05NA, IC013Q11NA, IC013Q12NA, IC013Q13NA
  • Perceived ICT Competence (COMPICT) was measured by 5 items: IC014Q03NA, IC014Q04NA, IC014Q06NA, IC014Q08NA, IC014Q09NA
  • Perceived Autonomy related to ICT Use (AUTICT) was measured by 5 items: IC015Q02NA, IC015Q03NA, IC015Q05NA, IC015Q07NA, IC015Q09NA
  • ICT as a topic in Social Interaction (SOIAICT) was measured by 5 items: IC016Q01NA, IC016Q02NA, IC016Q04NA, IC016Q05NA, IC016Q07NA

Note: All items for these latent variables were measured on a four-point Likert scale.
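
The mapping from constructs to items can also be written down as a named list, which keeps the grouping explicit and makes it easy to count items per construct. This is only a sketch: the names `items` and `item_codes` are introduced here for illustration and are not used elsewhere in this chapter.

```r
# Map each construct to its item codes (illustrative object names)
items <- list(
  INTICT  = c("IC013Q01NA", "IC013Q04NA", "IC013Q05NA",
              "IC013Q11NA", "IC013Q12NA", "IC013Q13NA"),
  COMPICT = c("IC014Q03NA", "IC014Q04NA", "IC014Q06NA",
              "IC014Q08NA", "IC014Q09NA"),
  AUTICT  = c("IC015Q02NA", "IC015Q03NA", "IC015Q05NA",
              "IC015Q07NA", "IC015Q09NA"),
  SOIAICT = c("IC016Q01NA", "IC016Q02NA", "IC016Q04NA",
              "IC016Q05NA", "IC016Q07NA")
)

# Number of items per construct
lengths(items)

# Flatten the list into one character vector of column codes
item_codes <- unlist(items, use.names = FALSE)
length(item_codes)
```

Flattening the list gives one vector of 21 codes, in the same order as the constructs are listed.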

In the code below, we keep only these variables and only the respondents in the Chile sample.

# Create a vector with the code of each variable as shown in the data set

var <- c("IC013Q01NA", "IC013Q04NA", "IC013Q05NA", "IC013Q11NA", "IC013Q12NA", "IC013Q13NA", # INTICT
         "IC014Q03NA", "IC014Q04NA", "IC014Q06NA", "IC014Q08NA", "IC014Q09NA", # COMPICT
         "IC015Q02NA", "IC015Q03NA", "IC015Q05NA", "IC015Q07NA", "IC015Q09NA", # AUTICT
         "IC016Q01NA", "IC016Q02NA", "IC016Q04NA", "IC016Q05NA", "IC016Q07NA" # SOIAICT
         )

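# Select only the Chile sample using `CNTRYID` (152 is the ISO 3166-1 numeric code for Chile)
# Note: despite its name, `data.us.only` holds the Chile sample
data.us.only <- data[data$CNTRYID == 152, var]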

# Preview the first six rows
head(data.us.only)
##       IC013Q01NA IC013Q04NA IC013Q05NA IC013Q11NA IC013Q12NA IC013Q13NA IC014Q03NA
## 91050         NA         NA         NA         NA         NA         NA         NA
## 91051          3          3          4          4          3          3         NA
## 91052          3          4          4          4          4          4          3
## 91053          2          3          3          3          2          3          3
## 91054          4          4          4          4          2          4          4
## 91055          3          2          3          3          4          4          4
##       IC014Q04NA IC014Q06NA IC014Q08NA IC014Q09NA IC015Q02NA IC015Q03NA IC015Q05NA
## 91050         NA         NA         NA         NA         NA         NA         NA
## 91051         NA         NA         NA         NA          4          3          3
## 91052          3          3          3          3          2          2          3
## 91053          3          3          3         NA          3          3          3
## 91054          4          4          4          4          4          4          4
## 91055          3          4          3          3          2          2          3
##       IC015Q07NA IC015Q09NA IC016Q01NA IC016Q02NA IC016Q04NA IC016Q05NA IC016Q07NA
## 91050         NA         NA         NA         NA         NA         NA         NA
## 91051          3          4          3          2          3          4          3
## 91052          2          2          3          3          2          2          2
## 91053          3          3          3          3          3          3          3
## 91054          4          4          4          4          4          4          4
## 91055          3          3          3          2          2          3          3
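
Two small checks can make this kind of subsetting safer. The sketch below uses a made-up data frame (`toy`, `wanted`, and all values are hypothetical, not PISA data) to show how to confirm that every requested column exists, and how `which()` avoids the all-NA rows that a bare logical index produces if the country variable itself had missing values.

```r
# Toy stand-in for the full data frame (hypothetical values)
toy <- data.frame(
  CNTRYID    = c(152, 152, NA, 840, 152),
  IC013Q01NA = c(3, NA, 2, 4, 1),
  IC013Q04NA = c(4, 3, 2, NA, 2)
)
wanted <- c("IC013Q01NA", "IC013Q04NA")

# Stop early if any requested column is absent from the data
missing_cols <- setdiff(wanted, names(toy))
stopifnot(length(missing_cols) == 0)

# A bare logical index turns an NA comparison into an all-NA row ...
nrow(toy[toy$CNTRYID == 152, wanted])

# ... while which() keeps only rows where the comparison is TRUE
toy_sub <- toy[which(toy$CNTRYID == 152), wanted]
nrow(toy_sub)
```

Here the first subset has one more row than the second, because the respondent with a missing `CNTRYID` comes through as a row of NAs.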

It is always a good idea to rename the columns so they are easy to identify. We can use the code below to do that.

# Create a vector with new names for each column. Make sure its order corresponds to the vector `var` defined above
cols <- c("INTICT1", "INTICT2", "INTICT3", "INTICT4", "INTICT5", "INTICT6", # INTICT
         "COMPICT1", "COMPICT2", "COMPICT3", "COMPICT4", "COMPICT5", # COMPICT
         "AUICT1","AUICT2","AUICT3" ,"AUICT4" ,"AUICT5", # AUICT
         "SOIAICT1","SOIAICT2","SOIAICT3","SOIAICT4","SOIAICT5" # SOIAICT
         )

# Rename the columns
colnames(data.us.only) <- cols

# Preview Data
head(data.us.only)
##       INTICT1 INTICT2 INTICT3 INTICT4 INTICT5 INTICT6 COMPICT1 COMPICT2 COMPICT3
## 91050      NA      NA      NA      NA      NA      NA       NA       NA       NA
## 91051       3       3       4       4       3       3       NA       NA       NA
## 91052       3       4       4       4       4       4        3        3        3
## 91053       2       3       3       3       2       3        3        3        3
## 91054       4       4       4       4       2       4        4        4        4
## 91055       3       2       3       3       4       4        4        3        4
##       COMPICT4 COMPICT5 AUICT1 AUICT2 AUICT3 AUICT4 AUICT5 SOIAICT1 SOIAICT2 SOIAICT3
## 91050       NA       NA     NA     NA     NA     NA     NA       NA       NA       NA
## 91051       NA       NA      4      3      3      3      4        3        2        3
## 91052        3        3      2      2      3      2      2        3        3        2
## 91053        3       NA      3      3      3      3      3        3        3        3
## 91054        4        4      4      4      4      4      4        4        4        4
## 91055        3        3      2      2      3      3      3        3        2        2
##       SOIAICT4 SOIAICT5
## 91050       NA       NA
## 91051        4        3
## 91052        2        2
## 91053        3        3
## 91054        4        4
## 91055        3        3
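
Assigning names by position works only while the two vectors stay in the same order. As an alternative sketch, the old and new names can be paired in a lookup vector so that each column is renamed by its code rather than by its position. The objects below (`toy`, `old_names`, `new_names`, `lookup`) are hypothetical and use just three columns for brevity.

```r
# Small illustrative data frame with original item codes as column names
toy <- data.frame(IC013Q01NA = c(3, 2),
                  IC013Q04NA = c(4, 3),
                  IC014Q03NA = c(NA, 3))

old_names <- c("IC013Q01NA", "IC013Q04NA", "IC014Q03NA")
new_names <- c("INTICT1", "INTICT2", "COMPICT1")

# Lookup vector: element names are the old codes, values are the new labels
lookup <- setNames(new_names, old_names)

# Guard against a length mismatch or an unmapped column before renaming
stopifnot(length(old_names) == length(new_names),
          all(names(toy) %in% names(lookup)))

# Rename each column by looking up its current name
names(toy) <- unname(lookup[names(toy)])
names(toy)
```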

# Check summary statistics
summary(data.us.only)
##     INTICT1         INTICT2        INTICT3         INTICT4         INTICT5     
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:3.00   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.00   Median :3.000   Median :3.000   Median :2.000  
##  Mean   :2.753   Mean   :3.28   Mean   :3.235   Mean   :2.854   Mean   :2.422  
##  3rd Qu.:3.000   3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.00   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##  NA's   :743     NA's   :783    NA's   :807     NA's   :820     NA's   :775    
##     INTICT6        COMPICT1        COMPICT2        COMPICT3        COMPICT4    
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :3.00   Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :3.25   Mean   :2.837   Mean   :2.906   Mean   :3.288   Mean   :2.975  
##  3rd Qu.:4.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :4.00   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##  NA's   :799    NA's   :834     NA's   :853     NA's   :880     NA's   :869    
##     COMPICT5         AUICT1          AUICT2          AUICT3        AUICT4     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.0   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.0   1st Qu.:2.000  
##  Median :3.000   Median :2.000   Median :3.000   Median :3.0   Median :3.000  
##  Mean   :2.966   Mean   :2.471   Mean   :2.501   Mean   :3.1   Mean   :2.855  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.0   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.0   Max.   :4.000  
##  NA's   :876     NA's   :853     NA's   :880     NA's   :890   NA's   :907    
##      AUICT5         SOIAICT1        SOIAICT2        SOIAICT3        SOIAICT4    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :3.143   Mean   :2.772   Mean   :2.506   Mean   :2.554   Mean   :2.529  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##  NA's   :915     NA's   :930     NA's   :967     NA's   :963     NA's   :977    
##     SOIAICT5    
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :2.612  
##  3rd Qu.:3.000  
##  Max.   :4.000  
##  NA's   :986

The summary statistics show that the data contain many missing values, reported as NA's. For example, INTICT1 and INTICT2 have 743 and 783 missing responses, respectively. Handling missing responses is a complex subject that we won’t go into here. Instead, we will remove every participant who has at least one missing response (listwise deletion).

# Remove participants with NAs 
data.us.only <- subset(data.us.only, complete.cases(data.us.only))

summary(data.us.only)
##     INTICT1         INTICT2         INTICT3         INTICT4         INTICT5     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000   Median :2.000  
##  Mean   :2.768   Mean   :3.304   Mean   :3.247   Mean   :2.872   Mean   :2.422  
##  3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##     INTICT6         COMPICT1        COMPICT2        COMPICT3        COMPICT4    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :3.266   Mean   :2.848   Mean   :2.922   Mean   :3.309   Mean   :2.987  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##     COMPICT5         AUICT1          AUICT2          AUICT3          AUICT4     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :2.975   Mean   :2.491   Mean   :2.516   Mean   :3.121   Mean   :2.871  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##      AUICT5         SOIAICT1        SOIAICT2        SOIAICT3        SOIAICT4    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :3.000   Median :3.000   Median :3.000  
##  Mean   :3.161   Mean   :2.786   Mean   :2.507   Mean   :2.555   Mean   :2.527  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
##     SOIAICT5    
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :2.615  
##  3rd Qu.:3.000  
##  Max.   :4.000
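
Because listwise deletion can discard a large share of the sample, it can be worth counting how many rows are lost. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical); the same calls could be applied to the real data before the rows are removed.

```r
# Toy responses with some missing values (hypothetical)
toy <- data.frame(
  INTICT1  = c(NA, 3, 3, 2),
  INTICT2  = c(NA, 3, 4, 3),
  COMPICT1 = c(NA, NA, 3, 3)
)

# Missing responses per item
na_per_item <- colSums(is.na(toy))
na_per_item

# Rows with no missing responses, and rows that listwise deletion drops
n_complete <- sum(complete.cases(toy))
n_dropped  <- nrow(toy) - n_complete
c(complete = n_complete, dropped = n_dropped)

# Keep only the complete rows
toy_complete <- toy[complete.cases(toy), ]
```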

Because the original data set is large and slow to load, we can export the cleaned Chile sample as a .csv file for further analysis.

# Export data for further analysis
write.csv(data.us.only, file = "PISA research data.csv", row.names = FALSE)
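
To confirm that an export round-trips cleanly, the file can be read back and compared with the object that was written. The sketch below does this with a small hypothetical data frame and a temporary file (`toy`, `path`, and `toy_back` are illustrative names), so it does not touch the exported PISA file.

```r
# Small illustrative data frame (hypothetical values)
toy <- data.frame(INTICT1 = c(3, 2, 4), INTICT2 = c(4, 3, 4))

# Write to a temporary location without row names
path <- file.path(tempdir(), "toy_export.csv")
write.csv(toy, file = path, row.names = FALSE)

# Read the file back and compare shape and column names with the original
toy_back <- read.csv(path)
identical(dim(toy_back), dim(toy))
identical(names(toy_back), names(toy))
```

Writing with `row.names = FALSE` matters here: otherwise the row names are saved as an extra first column and reappear as a variable when the file is read back.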