Chapter 7 Working with Datasets
7.1 Datasets Built into R
Let us look at an R dataset called ChickWeight. This dataset is built into R. To see a description of this dataset, go to the 4th panel and click “datasets”. Scroll down until you find the dataset, ChickWeight, and click it. You will be taken to the Help tab, which gives a description of the dataset and all its variables.
Another way to get information on the dataset is to type ? before its name in the console, that is, ?ChickWeight. The information will appear in the 4th panel under the Help tab.
To see the whole dataset, use the command, View(data_frame). In this case, it would be View(ChickWeight). A new tab in the Source panel called ChickWeight will appear. You should see 578 rows and 4 columns of data entries.
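If you would rather stay in the console, the size and layout of the dataset can also be checked there. The following is a minimal sketch using base R functions only; it is an addition to the text, not part of the original walkthrough.

```r
# ChickWeight ships with R's built-in "datasets" package, so nothing needs to be loaded.
dim(ChickWeight)    # number of rows, then number of columns
names(ChickWeight)  # the names of the variables
str(ChickWeight)    # type and first few values of each variable
```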
7.2 Viewing Part of the Dataset
If you want to see only a portion of the dataset, the function head( ) or tail( ) will do the job.
The function head(data_frame) will show the first 6 rows of the dataset.
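The command itself is not shown with the output here. Assuming the built-in ChickWeight data frame, it would presumably be:

```r
# Reconstructed call; the original command is not shown in the text.
head(ChickWeight)  # first 6 rows by default
```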
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
The function tail(data_frame) will show the last 6 rows of the dataset.
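Again the command is missing from the text. A likely reconstruction, under the same assumption about ChickWeight, is:

```r
# Reconstructed call; the original command is not shown in the text.
tail(ChickWeight)  # last 6 rows by default
```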
## weight Time Chick Diet
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
If you want to see a certain number of rows, give that number as the second argument, after the data frame. Adding an “L” after the number tells R to treat the entry as an integer. Do not worry if you forget the “L”. In this case, head( ) and tail( ) will still return the same rows.
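The effect of the “L” suffix can be checked directly. This short sketch is an illustration added here, using only base R:

```r
class(3L)  # "integer"
class(3)   # "numeric"

# Both forms select the same rows of the built-in ChickWeight data frame.
identical(head(ChickWeight, 3L), head(ChickWeight, 3))  # TRUE
```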
Suppose you want to see the first 3 rows of the data frame, ChickWeight.
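The call is not shown in the text; based on the description above, it would presumably be:

```r
# Reconstructed call; the original command is not shown in the text.
head(ChickWeight, 3L)  # first 3 rows
```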
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
To see the last 10 rows of the data frame, ChickWeight:
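A likely form of the missing command, assuming the same pattern as for head( ):

```r
# Reconstructed call; the original command is not shown in the text.
tail(ChickWeight, 10L)  # last 10 rows
```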
## weight Time Chick Diet
## 569 67 4 50 4
## 570 84 6 50 4
## 571 105 8 50 4
## 572 122 10 50 4
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
To see all but a certain number of rows, put a negative sign before the number of rows. With head( ), a negative number drops that many rows from the end. Suppose we want to see all but the last 570 rows of ChickWeight.
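The command is missing from the text. Since 578 - 570 = 8 rows remain, a reconstruction consistent with the output would be:

```r
# Reconstructed call; the original command is not shown in the text.
head(ChickWeight, -570L)  # everything except the last 570 rows
```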
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 106 12 1 1
## 8 125 14 1 1
The same goes for the function tail( ). Suppose we want to see all but the first 570 rows of ChickWeight.
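By the same reasoning, the missing command here would presumably be:

```r
# Reconstructed call; the original command is not shown in the text.
tail(ChickWeight, -570L)  # everything except the first 570 rows
```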
## weight Time Chick Diet
## 571 105 8 50 4
## 572 122 10 50 4
## 573 155 12 50 4
## 574 175 14 50 4
## 575 205 16 50 4
## 576 234 18 50 4
## 577 264 20 50 4
## 578 264 21 50 4
7.3 Viewing Entries Tied to a Variable
Suppose you want to work only with chicks that are newly hatched, i.e., Time = 0 days. We are going to filter the data and call the new data frame chick0.
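The filtering command is not shown in the text. One way to do it in base R, offered here as a sketch rather than as the author's exact command, is logical row indexing:

```r
# Keep only the rows where Time equals 0; the empty slot after the comma keeps all columns.
# This is an assumed reconstruction; the original may have used subset() or another approach.
chick0 <- ChickWeight[ChickWeight$Time == 0, ]
chick0
```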
## weight Time Chick Diet
## 1 42 0 1 1
## 13 40 0 2 1
## 25 43 0 3 1
## 37 42 0 4 1
## 49 41 0 5 1
## 61 41 0 6 1
## 73 41 0 7 1
## 85 42 0 8 1
## 96 42 0 9 1
## 108 41 0 10 1
## 120 43 0 11 1
## 132 41 0 12 1
## 144 41 0 13 1
## 156 41 0 14 1
## 168 41 0 15 1
## 176 41 0 16 1
## 183 42 0 17 1
## 195 39 0 18 1
## 197 43 0 19 1
## 209 41 0 20 1
## 221 40 0 21 2
## 233 41 0 22 2
## 245 43 0 23 2
## 257 42 0 24 2
## 269 40 0 25 2
## 281 42 0 26 2
## 293 39 0 27 2
## 305 39 0 28 2
## 317 39 0 29 2
## 329 42 0 30 2
## 341 42 0 31 3
## 353 41 0 32 3
## 365 39 0 33 3
## 377 41 0 34 3
## 389 41 0 35 3
## 401 39 0 36 3
## 413 41 0 37 3
## 425 41 0 38 3
## 437 42 0 39 3
## 449 41 0 40 3
## 461 42 0 41 4
## 473 42 0 42 4
## 485 42 0 43 4
## 497 42 0 44 4
## 507 41 0 45 4
## 519 40 0 46 4
## 531 41 0 47 4
## 543 39 0 48 4
## 555 40 0 49 4
## 567 41 0 50 4
From 578 rows, we now have only 50 rows, one for each chick. Under the column, Time, all the results show 0.
7.4 Ordering Data Frame by Variable
Use the function called order( ) to order the data frame in descending or ascending order. The syntax in this case is: data_frame[order(data_frame$variable), ]
Suppose we want to order the data frame, chick0, in ascending order by weight of the chicks. We will call the new data frame, chick0_ascend.
chick0_ascend <- chick0[order(chick0$weight), ]
head(chick0_ascend, 15) # Look at first 15 rows of data frame, chick0_ascend
## weight Time Chick Diet
## 195 39 0 18 1
## 293 39 0 27 2
## 305 39 0 28 2
## 317 39 0 29 2
## 365 39 0 33 3
## 401 39 0 36 3
## 543 39 0 48 4
## 13 40 0 2 1
## 221 40 0 21 2
## 269 40 0 25 2
## 519 40 0 46 4
## 555 40 0 49 4
## 49 41 0 5 1
## 61 41 0 6 1
## 73 41 0 7 1
To order the data frame, chick0, in descending order by weight of the chicks, put a negative sign in front of the target vector. We will call the new data frame, chick0_descend.
chick0_descend <- chick0[order(-chick0$weight), ]
head(chick0_descend, 15) # Look at first 15 rows of data frame, chick0_descend
## weight Time Chick Diet
## 25 43 0 3 1
## 120 43 0 11 1
## 197 43 0 19 1
## 245 43 0 23 2
## 1 42 0 1 1
## 37 42 0 4 1
## 85 42 0 8 1
## 96 42 0 9 1
## 183 42 0 17 1
## 257 42 0 24 2
## 281 42 0 26 2
## 329 42 0 30 2
## 341 42 0 31 3
## 437 42 0 39 3
## 461 42 0 41 4
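As an aside, order( ) also accepts a decreasing argument, which avoids negating the vector. The sketch below is an addition to the text; the name chick0_descend2 is made up for illustration, and chick0 is rebuilt so that the block runs on its own.

```r
# Rebuild chick0 (newly hatched chicks) so this example is self-contained.
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# chick0_descend2 is a hypothetical name used only in this example.
chick0_descend2 <- chick0[order(chick0$weight, decreasing = TRUE), ]
head(chick0_descend2, 15)
```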
Be careful which variables you sort. Quantitative variables can be sorted. However, it does not make sense to sort categorical variables by magnitude. Note that the variable, Diet, is categorical (R stores it as a factor) even though its data entries look numeric. Let us see what happens if we try to sort the data frame, chick0, by Diet in descending order using the negative sign. Let us call this new data frame, chick0_diet.
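The command that triggers the warning is not shown in the text. Following the pattern used for weight, it was presumably something like the following; chick0 is rebuilt first so that the block runs on its own.

```r
# Rebuild chick0 (newly hatched chicks) so this example is self-contained.
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# Assumed reconstruction: negating a factor is not defined, so R issues a warning.
chick0_diet <- chick0[order(-chick0$Diet), ]
```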
## Warning in Ops.factor(chick0$Diet): '-' not meaningful for factors
What if we want to order the rows by one variable within the groups of another? Suppose we want to sort the chick weight, within each diet, in ascending order. To do so, give order( ) both variables, with the grouping variable first. Let us call this new data frame, chick0_arr.
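The command is missing from the text. A reconstruction consistent with the output, assuming base R's order( ) with two sort keys, would be:

```r
# Rebuild chick0 (newly hatched chicks) so this example is self-contained.
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# Assumed reconstruction: sort by Diet first, then by weight within each diet.
chick0_arr <- chick0[order(chick0$Diet, chick0$weight), ]
chick0_arr
```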
## weight Time Chick Diet
## 195 39 0 18 1
## 13 40 0 2 1
## 49 41 0 5 1
## 61 41 0 6 1
## 73 41 0 7 1
## 108 41 0 10 1
## 132 41 0 12 1
## 144 41 0 13 1
## 156 41 0 14 1
## 168 41 0 15 1
## 176 41 0 16 1
## 209 41 0 20 1
## 1 42 0 1 1
## 37 42 0 4 1
## 85 42 0 8 1
## 96 42 0 9 1
## 183 42 0 17 1
## 25 43 0 3 1
## 120 43 0 11 1
## 197 43 0 19 1
## 293 39 0 27 2
## 305 39 0 28 2
## 317 39 0 29 2
## 221 40 0 21 2
## 269 40 0 25 2
## 233 41 0 22 2
## 257 42 0 24 2
## 281 42 0 26 2
## 329 42 0 30 2
## 245 43 0 23 2
## 365 39 0 33 3
## 401 39 0 36 3
## 353 41 0 32 3
## 377 41 0 34 3
## 389 41 0 35 3
## 413 41 0 37 3
## 425 41 0 38 3
## 449 41 0 40 3
## 341 42 0 31 3
## 437 42 0 39 3
## 543 39 0 48 4
## 519 40 0 46 4
## 555 40 0 49 4
## 507 41 0 45 4
## 531 41 0 47 4
## 567 41 0 50 4
## 461 42 0 41 4
## 473 42 0 42 4
## 485 42 0 43 4
## 497 42 0 44 4
To sort the chick weight, within each diet, in descending order, put a negative sign on the target vector. We will call this new data frame, chick0_arr2.
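Here too the command is not shown. Assuming the same two-key call with the weight negated, it would presumably be:

```r
# Rebuild chick0 (newly hatched chicks) so this example is self-contained.
chick0 <- ChickWeight[ChickWeight$Time == 0, ]

# Assumed reconstruction: Diet ascending, weight descending within each diet.
chick0_arr2 <- chick0[order(chick0$Diet, -chick0$weight), ]
chick0_arr2
```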
## weight Time Chick Diet
## 25 43 0 3 1
## 120 43 0 11 1
## 197 43 0 19 1
## 1 42 0 1 1
## 37 42 0 4 1
## 85 42 0 8 1
## 96 42 0 9 1
## 183 42 0 17 1
## 49 41 0 5 1
## 61 41 0 6 1
## 73 41 0 7 1
## 108 41 0 10 1
## 132 41 0 12 1
## 144 41 0 13 1
## 156 41 0 14 1
## 168 41 0 15 1
## 176 41 0 16 1
## 209 41 0 20 1
## 13 40 0 2 1
## 195 39 0 18 1
## 245 43 0 23 2
## 257 42 0 24 2
## 281 42 0 26 2
## 329 42 0 30 2
## 233 41 0 22 2
## 221 40 0 21 2
## 269 40 0 25 2
## 293 39 0 27 2
## 305 39 0 28 2
## 317 39 0 29 2
## 341 42 0 31 3
## 437 42 0 39 3
## 353 41 0 32 3
## 377 41 0 34 3
## 389 41 0 35 3
## 413 41 0 37 3
## 425 41 0 38 3
## 449 41 0 40 3
## 365 39 0 33 3
## 401 39 0 36 3
## 461 42 0 41 4
## 473 42 0 42 4
## 485 42 0 43 4
## 497 42 0 44 4
## 507 41 0 45 4
## 531 41 0 47 4
## 567 41 0 50 4
## 519 40 0 46 4
## 555 40 0 49 4
## 543 39 0 48 4
7.5 Renaming Variables
There are a couple of ways to rename a variable. In each case, the function, names( ), is used.
One method is to call the variable by its column number and rename it. The syntax is: names(data_frame)[column_number] <- "new_name"
Let us rename the variable, Time, to Days. Time is in column 2.
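The command is not shown in the text. A minimal reconstruction, assuming the standard names( ) replacement form, is given below. Note that assigning to ChickWeight creates a modified copy in your workspace; the built-in dataset itself is not altered.

```r
# Assumed reconstruction: rename column 2 (Time) to Days.
names(ChickWeight)[2] <- "Days"
head(ChickWeight)
```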
## weight Days Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Another way to rename a variable is to call it by its name. The syntax is: names(data_frame)[names(data_frame) == "old_name"] <- "new_name"
Let us rename the variable, Days, back to Time.
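The command is again missing from the text. A sketch of the name-based form follows; the first line only repeats the earlier rename so that the block runs on its own.

```r
# Setup: make sure a column called Days exists, as after the previous rename.
names(ChickWeight)[2] <- "Days"

# Assumed reconstruction: find the column named Days and rename it to Time.
names(ChickWeight)[names(ChickWeight) == "Days"] <- "Time"
head(ChickWeight)
```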
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
7.6 Changing Data Entry
There are several ways to change a particular data entry.
One way is to call out the row number and column number and replace the existing value with the new value. The syntax is: data_frame[row_number, column_number] <- new_value
Let us take a look at the 7th row of the 1st column of ChickWeight. You should see the value 106.
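The commands used to inspect the value are not shown in the text. One plausible way to do it, assuming the built-in ChickWeight data frame, is:

```r
# Assumed reconstruction of the inspection step.
ChickWeight[7, 1]      # the single entry in row 7, column 1
head(ChickWeight, 10)  # the first 10 rows, for context
```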
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 106 12 1 1
## 8 125 14 1 1
## 9 149 16 1 1
## 10 171 18 1 1
We will now change the value 106 to 16.
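The replacement command is not shown; with row-and-column indexing it would presumably be:

```r
# Assumed reconstruction: overwrite the entry in row 7, column 1.
ChickWeight[7, 1] <- 16
head(ChickWeight, 10)
```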
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 16 12 1 1
## 8 125 14 1 1
## 9 149 16 1 1
## 10 171 18 1 1
Another way to change a data entry is to call out the variable and row number. The syntax is: data_frame$variable[row_number] <- new_value
Let us change the weight, 16, in row 7 of ChickWeight back to 106.
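A sketch of the name-based replacement follows. The command is not shown in the text, so this is a reconstruction; the first line only repeats the earlier change so that the block runs on its own.

```r
# Setup: repeat the earlier change so that row 7 holds 16.
ChickWeight[7, 1] <- 16

# Assumed reconstruction: restore the value by variable name and row number.
ChickWeight$weight[7] <- 106
head(ChickWeight, 10)
```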
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
## 7 106 12 1 1
## 8 125 14 1 1
## 9 149 16 1 1
## 10 171 18 1 1
It is recommended that you use the latter method, that is, calling out the variable name instead of calling out the row and column number. One reason is that when rows or columns are deleted, the remaining row and column numbers shift. With big datasets, it becomes difficult to keep track of all the changes.
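The shifting of column numbers can be seen in a small experiment. This sketch is an illustration added here; cw is a made-up name for a scratch copy, so ChickWeight itself is left alone.

```r
# cw is a hypothetical scratch copy used only in this example.
cw <- as.data.frame(ChickWeight)

cw$Time <- NULL  # delete the Time column
names(cw)        # Chick has moved from column 3 to column 2

cw[7, 2]         # position 2 now returns a Chick entry, not a Time entry
cw$weight[7]     # the variable name still finds the weight in row 7
```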