Chapter 7 Working with Datasets

7.1 Datasets Built into R

Let us look at a dataset called ChickWeight, which is built into R. To see a description of this dataset, go to the 4th panel and click “datasets”. Scroll down until you find ChickWeight and click it. You will be taken to the Help tab, which gives a description of the dataset and its variables.

Another way to get information on the dataset is to type ? before the dataset name, as in ?ChickWeight. The information will appear in the 4th panel under the Help tab.

To see the whole dataset, use the command, View(data_frame). In this case, it would be View(ChickWeight). A new tab in the Source panel called ChickWeight will appear. You should see 578 rows and 4 columns of data entries.
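
The commands described in this section might be typed as follows. This is a minimal sketch: the calls to dim( ) and names( ) are not part of the original text and are added only as a quick way to confirm the size and variables without opening the viewer.

```r
# Open the help page for the dataset.
?ChickWeight

# Confirm the size of the dataset: rows first, then columns.
dim(ChickWeight)    # 578 4
names(ChickWeight)  # the four variable names

# View() opens a spreadsheet-style tab, so it is only useful in an
# interactive session such as RStudio.
if (interactive()) View(ChickWeight)
```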

7.2 Viewing Part of the Dataset

If you want to see only a portion of the dataset, the function head( ) or tail( ) will do the job.

The function head(data_frame) will show the first 6 rows of the dataset.

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

The function tail(data_frame) will show the last 6 rows of the dataset.

##     weight Time Chick Diet
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4
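
The two outputs above would come from calls along these lines, with ChickWeight in place of data_frame:

```r
head(ChickWeight)  # first 6 rows (the default)
tail(ChickWeight)  # last 6 rows (the default)
```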

If you want to see a certain number of rows, give that number as the second argument, after the data frame. Adding an “L” after the number tells R to treat it as an integer. Do not worry if you forget the “L”. R will still treat the number as a whole number in this case.

Suppose you want to see the first 3 rows of the data frame, ChickWeight.

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1

To see the last 10 rows of the data frame, ChickWeight:

##     weight Time Chick Diet
## 569     67    4    50    4
## 570     84    6    50    4
## 571    105    8    50    4
## 572    122   10    50    4
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4
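
A sketch of the calls behind the two outputs above. The second line is added only to show that leaving off the “L” gives the same result.

```r
head(ChickWeight, 3L)   # first 3 rows, with the count written as an integer
head(ChickWeight, 3)    # same rows without the "L"
tail(ChickWeight, 10L)  # last 10 rows
```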

To see all but a certain number of rows, put a negative sign before the number. The function head( ) then leaves out that many rows from the end of the dataset. Suppose we want to see all but the last 570 rows of ChickWeight.

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
## 7    106   12     1    1
## 8    125   14     1    1

The same goes for the function tail( ). Suppose we want to see all but the first 570 rows of ChickWeight.

##     weight Time Chick Diet
## 571    105    8    50    4
## 572    122   10    50    4
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4
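
The negative-count calls for both functions might look like this. Since ChickWeight has 578 rows, leaving out 570 of them keeps 8 rows in each case.

```r
head(ChickWeight, -570)  # leaves out the last 570 rows, keeping rows 1 to 8
tail(ChickWeight, -570)  # leaves out the first 570 rows, keeping rows 571 to 578
```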

7.3 Viewing Entries Tied to a Variable

Suppose you want to work only with chicks that are newly hatched, i.e., Time = 0 days. We will filter the data and call the new data frame chick0.

##     weight Time Chick Diet
## 1       42    0     1    1
## 13      40    0     2    1
## 25      43    0     3    1
## 37      42    0     4    1
## 49      41    0     5    1
## 61      41    0     6    1
## 73      41    0     7    1
## 85      42    0     8    1
## 96      42    0     9    1
## 108     41    0    10    1
## 120     43    0    11    1
## 132     41    0    12    1
## 144     41    0    13    1
## 156     41    0    14    1
## 168     41    0    15    1
## 176     41    0    16    1
## 183     42    0    17    1
## 195     39    0    18    1
## 197     43    0    19    1
## 209     41    0    20    1
## 221     40    0    21    2
## 233     41    0    22    2
## 245     43    0    23    2
## 257     42    0    24    2
## 269     40    0    25    2
## 281     42    0    26    2
## 293     39    0    27    2
## 305     39    0    28    2
## 317     39    0    29    2
## 329     42    0    30    2
## 341     42    0    31    3
## 353     41    0    32    3
## 365     39    0    33    3
## 377     41    0    34    3
## 389     41    0    35    3
## 401     39    0    36    3
## 413     41    0    37    3
## 425     41    0    38    3
## 437     42    0    39    3
## 449     41    0    40    3
## 461     42    0    41    4
## 473     42    0    42    4
## 485     42    0    43    4
## 497     42    0    44    4
## 507     41    0    45    4
## 519     40    0    46    4
## 531     41    0    47    4
## 543     39    0    48    4
## 555     40    0    49    4
## 567     41    0    50    4

From 578 rows, we now have only 50 rows, one for each chick. Under the column Time, every entry is 0.
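
The text does not show which filtering command was used. One way to do it, assuming base R logical indexing, is sketched below; subset( ) or a package such as dplyr would be alternatives.

```r
# Keep only the rows where Time equals 0; the empty slot after the
# comma means "keep all columns".
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

chick0               # print the filtered data frame
nrow(chick0)         # 50
unique(chick0$Time)  # 0
```

The original row numbers (1, 13, 25, ...) are kept as row names, which is why they are not consecutive in the output.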

7.4 Ordering Data Frame by Variable

Use the function called order( ) to order the data frame in descending or ascending order. The syntax in this case is: data_frame[order(data_frame$variable), ]

Suppose we want to order the data frame, chick0, in ascending order by weight of the chicks. We will call the new data frame, chick0_ascend.

##     weight Time Chick Diet
## 195     39    0    18    1
## 293     39    0    27    2
## 305     39    0    28    2
## 317     39    0    29    2
## 365     39    0    33    3
## 401     39    0    36    3
## 543     39    0    48    4
## 13      40    0     2    1
## 221     40    0    21    2
## 269     40    0    25    2
## 519     40    0    46    4
## 555     40    0    49    4
## 49      41    0     5    1
## 61      41    0     6    1
## 73      41    0     7    1
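
Applying the syntax to chick0 might look like the sketch below. The first line rebuilds chick0 with base R indexing (an assumption about how it was created), and head( ) with 15 is used here only because the output above stops after 15 rows.

```r
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# order() returns the row positions that put weight in ascending order.
chick0_ascend <- chick0[order(chick0$weight), ]
head(chick0_ascend, 15)
```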

To order the data frame, chick0, in descending order by weight of the chicks, put a negative sign in front of the target vector. We will call the new data frame, chick0_descend.

##     weight Time Chick Diet
## 25      43    0     3    1
## 120     43    0    11    1
## 197     43    0    19    1
## 245     43    0    23    2
## 1       42    0     1    1
## 37      42    0     4    1
## 85      42    0     8    1
## 96      42    0     9    1
## 183     42    0    17    1
## 257     42    0    24    2
## 281     42    0    26    2
## 329     42    0    30    2
## 341     42    0    31    3
## 437     42    0    39    3
## 461     42    0    41    4
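
The descending version differs only by the negative sign inside order( ). As before, this is a sketch that rebuilds chick0 with base R indexing and shows the first 15 rows.

```r
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# Negating the numeric vector reverses the sort direction.
chick0_descend <- chick0[order(-chick0$weight), ]
head(chick0_descend, 15)
```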

Be careful about which variables you sort. Quantitative variables can be sorted. However, it does not make sense to sort categorical variables. Note that the variable Diet is categorical (R stores it as a factor) even though its entries look numeric. Let us see what happens if we try to order the data frame chick0 by Diet in descending order. We will call this new data frame chick0_diet.

## Warning in Ops.factor(chick0$Diet): '-' not meaningful for factors
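
The warning can be reproduced with the sketch below. The last two lines are an addition, not part of the original text: if a reversed ordering of the diet groups is ever wanted, order( ) can rank a factor by its levels through the decreasing argument, with no arithmetic involved. The name chick0_diet2 is made up for this illustration.

```r
chick0 <- ChickWeight[ChickWeight$Time == 0, ]
class(chick0$Diet)  # "factor"

# The minus sign is arithmetic, which is not defined for factors.
# R warns and the sort key is all NA, so nothing is sorted by diet.
chick0_diet <- chick0[order(-chick0$Diet), ]

# Hypothetical alternative: let order() reverse the level order itself.
chick0_diet2 <- chick0[order(chick0$Diet, decreasing = TRUE), ]
head(chick0_diet2)
```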

What if we want to order the entries of one variable within the groups of another? Suppose we want to sort the chick weight, within each diet, in ascending order. We will call this new data frame chick0_arr.

##     weight Time Chick Diet
## 195     39    0    18    1
## 13      40    0     2    1
## 49      41    0     5    1
## 61      41    0     6    1
## 73      41    0     7    1
## 108     41    0    10    1
## 132     41    0    12    1
## 144     41    0    13    1
## 156     41    0    14    1
## 168     41    0    15    1
## 176     41    0    16    1
## 209     41    0    20    1
## 1       42    0     1    1
## 37      42    0     4    1
## 85      42    0     8    1
## 96      42    0     9    1
## 183     42    0    17    1
## 25      43    0     3    1
## 120     43    0    11    1
## 197     43    0    19    1
## 293     39    0    27    2
## 305     39    0    28    2
## 317     39    0    29    2
## 221     40    0    21    2
## 269     40    0    25    2
## 233     41    0    22    2
## 257     42    0    24    2
## 281     42    0    26    2
## 329     42    0    30    2
## 245     43    0    23    2
## 365     39    0    33    3
## 401     39    0    36    3
## 353     41    0    32    3
## 377     41    0    34    3
## 389     41    0    35    3
## 413     41    0    37    3
## 425     41    0    38    3
## 449     41    0    40    3
## 341     42    0    31    3
## 437     42    0    39    3
## 543     39    0    48    4
## 519     40    0    46    4
## 555     40    0    49    4
## 507     41    0    45    4
## 531     41    0    47    4
## 567     41    0    50    4
## 461     42    0    41    4
## 473     42    0    42    4
## 485     42    0    43    4
## 497     42    0    44    4
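
Sorting within groups can be done by passing more than one vector to order( ): the first vector is the main sort key and the second breaks ties. A sketch, with chick0 rebuilt by base R indexing:

```r
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# Sort by Diet first, then by weight within each diet.
chick0_arr <- chick0[order(chick0$Diet, chick0$weight), ]
chick0_arr
```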

To sort the chick weight, within each diet, in descending order, put a negative sign on the target vector. We will call this new data frame, chick0_arr2.

##     weight Time Chick Diet
## 25      43    0     3    1
## 120     43    0    11    1
## 197     43    0    19    1
## 1       42    0     1    1
## 37      42    0     4    1
## 85      42    0     8    1
## 96      42    0     9    1
## 183     42    0    17    1
## 49      41    0     5    1
## 61      41    0     6    1
## 73      41    0     7    1
## 108     41    0    10    1
## 132     41    0    12    1
## 144     41    0    13    1
## 156     41    0    14    1
## 168     41    0    15    1
## 176     41    0    16    1
## 209     41    0    20    1
## 13      40    0     2    1
## 195     39    0    18    1
## 245     43    0    23    2
## 257     42    0    24    2
## 281     42    0    26    2
## 329     42    0    30    2
## 233     41    0    22    2
## 221     40    0    21    2
## 269     40    0    25    2
## 293     39    0    27    2
## 305     39    0    28    2
## 317     39    0    29    2
## 341     42    0    31    3
## 437     42    0    39    3
## 353     41    0    32    3
## 377     41    0    34    3
## 389     41    0    35    3
## 413     41    0    37    3
## 425     41    0    38    3
## 449     41    0    40    3
## 365     39    0    33    3
## 401     39    0    36    3
## 461     42    0    41    4
## 473     42    0    42    4
## 485     42    0    43    4
## 497     42    0    44    4
## 507     41    0    45    4
## 531     41    0    47    4
## 567     41    0    50    4
## 519     40    0    46    4
## 555     40    0    49    4
## 543     39    0    48    4
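
The descending version within each diet might be written as below. Only weight gets the negative sign, because weight is numeric and Diet is a factor.

```r
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# Diet stays in its usual order; weight is reversed within each diet.
chick0_arr2 <- chick0[order(chick0$Diet, -chick0$weight), ]
chick0_arr2
```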

7.5 Renaming Variables

There are a couple of ways to rename a variable. In each case, the function, names( ), is used.

One method is to refer to the variable by its column number and rename it. The syntax is:
names(data_frame)[column_number] <- "new_variable_name"

Let us rename the variable, Time to Days. Time is in column 2.

##   weight Days Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
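
A sketch of the rename by column number. Assigning to ChickWeight makes a modified copy in your workspace; the built-in dataset itself is not altered, and rm(ChickWeight) removes the copy.

```r
# Column 2 is Time; give it the new name Days.
names(ChickWeight)[2] <- "Days"
head(ChickWeight)
```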

Another way to rename a variable is to refer to it by its current name. The syntax is:
names(data_frame)[names(data_frame) == "old_variable_name"] <- "new_variable_name"

Let us rename the variable, Days to Time.

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
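
A sketch of the rename by name. The first line only recreates the state after the earlier rename so that the example runs on its own.

```r
names(ChickWeight)[2] <- "Days"  # recreate the earlier rename

# Find the column whose name is "Days" and rename it to "Time".
names(ChickWeight)[names(ChickWeight) == "Days"] <- "Time"
head(ChickWeight)
```

This form keeps working if the columns are later rearranged, since it does not depend on the column position.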

7.6 Changing Data Entry

There are several ways to change a particular data entry.

One way is to refer to the entry by its row number and column number and replace the existing value with the new value. The syntax is:
data_frame[row_number, column_number] = new_value

Let us take a look at the 7th row of the 1st column of ChickWeight. You should see the value 106.

##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1

We will now change the value 106 to 16.

##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7      16   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
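
The lookup and the change by row and column number might be typed as follows. This sketch uses <- for the assignment; the = shown in the syntax line works the same way here.

```r
ChickWeight[7, 1]         # 106, the weight in row 7

# Replace the value in row 7, column 1.
ChickWeight[7, 1] <- 16
head(ChickWeight, 10)
```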

Another way to change a data entry is to refer to the variable by name and give the row number. The syntax is:
data_frame$variable[row_number] = new_value

Let us change the weight, 16, in row 7 of ChickWeight back to 106.

##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
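
A sketch of the change by variable name. The first line only recreates the edited value so that the example runs on its own.

```r
ChickWeight[7, 1] <- 16  # recreate the edited state

# Pick the weight column by name, then the 7th entry in it.
ChickWeight$weight[7] <- 106
head(ChickWeight, 10)
```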

It is recommended that you use the latter method, that is, referring to the variable by name instead of by row and column number. One reason is that when rows or columns are deleted, the row and column numbers shift. With big datasets, it becomes difficult to keep track of all the changes.