7 Features vs. Targets

This is a short additional lesson that covers some of the issues people new to R or coding might encounter at this point in the class. The first part deals with translating the mathematical expressions you see in the book to code. It briefly touches on the common problem of not assigning operations back to an object. It then covers the difference between features and targets, which is something you need to be very clear on. Finally, it goes into some of the dataframe manipulation around features and targets.

We’ll be using our cereal dataset again just to illustrate these topics.

library(tidyverse) # provides read_csv(), glimpse(), %>%, select() and mutate()
cereal <- read_csv("https://docs.google.com/spreadsheets/d/1sD1uWYNRfbPRNFNgJl7ufWqe0TPLuVZUKuNGgLEQ5Qo/gviz/tq?tqx=out:csv")

7.1 From math to code - how to deal with this.

One thing that might be new to a lot of you is translating mathematical expressions to code. Seeing big equations is super intimidating at first, but once you see that you can break them down into bite-sized pieces it becomes a lot easier. Let’s start with a review of some commonly seen symbols.

7.1.1 Common math symbols

$n$ - $n$ refers to your sample size. This is normally the number of rows you have in your data.

$p$ - $p$ refers to the number of features in a model. So if you’re predicting the number of rides using average temperature and average rainfall, then $p = 2$

$x_i$ - You probably know that $x$ refers to your x-axis or a feature. But what does the subscript $i$ mean? $i$ is a generic way to refer to a position in a vector. Let’s say you had x <- c(13,22,35,43). Then $x_2$ would have the value 22, as you’re saying ‘give me the 2nd position of x’. Using $i$ is a way to say ‘for the position specified’ in a generic way. $y_i$ works the same way.

$\overline{x}$ - ‘X Bar’ is just the mean of $x$ . So if you wanted the mean of a feature sugars you would just use mean(cereal$sugars). That’s all that X Bar is.

$\hat{y}$ - ‘Y Hat’ is a way to indicate that a value is estimated. Thus, whenever you see a hat it’s telling you to use the predicted value in that equation.

${\sum_{i = 1}^n}$ - This says to take the sum of something, using an index $i$ that starts at one position and stops at another. The bottom part, $i = 1$, says to start at the first value in your data. The upper part, $n$, says to keep going until the end of your data, since $n$ is the number of observations.

$\sum_{i = 1}^n (x_i - \overline{x})$ - Let’s put this together now. Looking in the parentheses we see our X Bar, which is just the mean. Our $x_i$ then is saying that we’ll take the first value of x and subtract the mean value of x from it. Then we’ll take the second value of x and subtract the mean. Then the 3rd, then the 4th, all the way until our $i$ reaches the end of our data set. In code, it would look like this if we wanted to apply that math operation where x is the sugars column:

sum(cereal$sugars - mean(cereal$sugars))

R will automatically go and take each value of sugars and subtract the mean. It’ll then sum those differences. Boom, math to code.
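To make the symbols concrete, here is a small sketch that reuses the toy vector from the $x_i$ entry rather than the cereal data. The names n, x_2, x_bar, total and total_vectorised are just illustrative. It writes the sum once as an explicit loop over $i$ and once the vectorised way, so you can see they are the same thing.

```r
# Toy vector from the x_i example; not taken from the cereal data.
x <- c(13, 22, 35, 43)

n <- length(x)    # n: the number of observations
x_2 <- x[2]       # x_i with i = 2, i.e. the 2nd position of x
x_bar <- mean(x)  # x bar: the mean of x

# Literal translation of the sum: let i run from 1 to n.
total <- 0
for (i in 1:n) {
  total <- total + (x[i] - x_bar)
}

# Vectorised version: R subtracts x_bar from every element at once.
total_vectorised <- sum(x - x_bar)

total
total_vectorised
```

Both versions give the same number. Deviations from the mean always sum to zero, which is why this particular sum usually shows up squared in real formulas.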

7.2 Objects and overwriting objects

R relies on objects for its functionality. An object is a named item that contains data in some form. It can be as simple as below where we create an object x that contains the value 2

x <- 2

The major thing that people forget when doing data science in this class is that you can (and often should) overwrite objects. Doing the following will overwrite the value of 2 in x and replace it with the value 3.

x <- 3

This is simple and most people understand it. So where does this issue cause problems in this class? Two main places:

People either do an action to a column in a data frame but forget to assign the result, or they do the wrong action and assign it by accident.
People overwrite their main data because they didn’t update the object name.

Let’s dig into these

7.2.1 Forgetting to do the right thing

OK, let’s say we want to make a feature in our cereal dataset of the number of calories per unit weight. Currently in the dataset there are calorie counts, but they are not all for the same mass of cereal. This isn’t helpful as some cereals might have higher calorie counts just because the count is taken for more cereal. So let’s create a new feature where we divide the number of calories by the weight. We’ll call it calories_adjusted

calories_adjusted <- cereal$calories/cereal$weight

This is great and all, but we need to make sure we write this back to our data frame. We do this by calling our data frame name and then our new column name on the left side of the arrow operator.

cereal$calories_adjusted <- cereal$calories/cereal$weight
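Here is a minimal sketch of the difference between computing something and actually storing it. It uses a tiny made-up data frame called toy (hypothetical values, not the real cereal data) so it runs without downloading anything.

```r
# Made-up values for illustration only; not the real cereal data.
toy <- data.frame(calories = c(70, 120, 110), weight = c(1, 1.33, 1))

# Computed but never assigned: R prints the result and then forgets it.
toy$calories / toy$weight
cols_before <- ncol(toy)  # the data frame still has 2 columns

# Assigned back to a column: now the result lives in the data frame.
toy$calories_adjusted <- toy$calories / toy$weight
cols_after <- ncol(toy)   # the data frame now has 3 columns

names(toy)
```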

We can also use the tidyverse mutate() to do this. Note how we overwrite the whole data frame and not just a new column like the base R way.

cereal <- cereal %>%
  mutate(calories_adjusted = calories/weight)

If you wanted to fully replace the calories column with the adjusted version while keeping the same name, you would do the same as above but assign to the existing column name. Overwriting a column or even your whole data frame is really common after you scale something or fix something in your data!

cereal$calories <- cereal$calories/cereal$weight

7.2.2 Writing the wrong thing

The other place object assignment causes problems is when people add the wrong thing to their column. This pops up most frequently when they get a bunch of NA values from doing the wrong math operation and then overwrite their data without checking it.

Let’s say you want to make a new feature that measures if a cereal is above average or below average in terms of its calorie content. To do this you would calculate the average number of calories in the dataset, and then subtract that from each cereal’s calorie content. If it’s a positive number, then a cereal has an above average calorie content. Negative means it’s below average.

Let’s do that below and overwrite our current calories column

cereal$calories <- cereal$calories - mean(cereal$calories)

Being the good data scientists that we are we always check our data. But darn, it’s all NA values!

glimpse(cereal)

## Rows: 77
## Columns: 17
## $ name              <chr> "100% Bran", "100% Natural Bran", "All-Bran", "Al...
## $ mfr               <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P",...
## $ type              <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",...
## $ calories          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ protein           <dbl> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2...
## $ fat               <dbl> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0...
## $ sodium            <dbl> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, ...
## $ fiber             <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5....
## $ carbo             <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0,...
## $ sugars            <dbl> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, ...
## $ potass            <dbl> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35...
## $ vitamins          <dbl> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25...
## $ shelf             <dbl> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1...
## $ weight            <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1...
## $ cups              <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0...
## $ rating            <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484,...
## $ calories_adjusted <dbl> 70.00000, 120.00000, 70.00000, 50.00000, 110.0000...

There must have been an NA value somewhere in our calorie content that we missed. Since we didn’t specify na.rm = TRUE in our mean() function, it returned NA, and subtracting NA from any value gives NA. Thus, we overwrote our data with all NA values.

How do we fix this? Unfortunately we need to go and reimport our data and start over.

cereal <- read_csv("https://docs.google.com/spreadsheets/d/1sD1uWYNRfbPRNFNgJl7ufWqe0TPLuVZUKuNGgLEQ5Qo/gviz/tq?tqx=out:csv")

So are there NA values in our calories column? Yep!

summary(cereal$calories)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    50.0   100.0   110.0   107.7   110.0   160.0       4

Without dealing with our NA values in the mean() it returns only NA.

mean(cereal$calories)

## [1] NA

And subtracting that from the calories indeed returns just a vector of NA values.

cereal$calories - mean(cereal$calories)

##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA

Let’s do this the right way by making sure we account for those NA values. We can see that we now have real values in there! The only NA values left are for the cereals whose calories were missing to begin with.

cereal$calories <- cereal$calories - mean(cereal$calories, na.rm = TRUE)
glimpse(cereal)

## Rows: 77
## Columns: 16
## $ name     <chr> "100% Bran", "100% Natural Bran", "All-Bran", "All-Bran wi...
## $ mfr      <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q", "G"...
## $ type     <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"...
## $ calories <dbl> -37.671233, 12.328767, -37.671233, NA, 2.328767, 2.328767,...
## $ protein  <dbl> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2, 1, 1, 3...
## $ fat      <dbl> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0, 0, 1, 3...
## $ sodium   <dbl> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, 220, 290,...
## $ fiber    <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5.0, 0.0, 2...
## $ carbo    <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0, 13.0, 12...
## $ sugars   <dbl> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2, 12,...
## $ potass   <dbl> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35, 105, 45...
## $ vitamins <dbl> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,...
## $ shelf    <dbl> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1, 2, 2, 3...
## $ weight   <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1.00, 1.00...
## $ cups     <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0.67, 0.67...
## $ rating   <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29.50954...

This may seem really simple, but it really can trip a lot of people up. A good way to make sure you don’t overwrite your data is to perform your operation but assign it to a test object. For example, if I assigned this operation to xxx and checked that, I would have spotted the error and corrected it before overwriting my data.

xxx <- cereal$calories - mean(cereal$calories)
glimpse(xxx)

##  num [1:77] NA NA NA NA NA NA NA NA NA NA ...
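If you want something more systematic than eyeballing a glimpse(), one option is to count the missing values going in and coming out. The sketch below uses a made-up vector toy_calories and the scratch names check and fixed, all hypothetical. The idea is simply that an operation like this should not create more NA values than you started with.

```r
# Made-up values for illustration only.
toy_calories <- c(70, 120, NA, 110)

# Step 1: assign the result to a scratch object, not the real column.
check <- toy_calories - mean(toy_calories)

# Step 2: compare the number of missing values before and after.
na_before <- sum(is.na(toy_calories))
na_after <- sum(is.na(check))
na_after > na_before  # TRUE here is the warning sign

# Step 3: fix the operation and check again before overwriting anything.
fixed <- toy_calories - mean(toy_calories, na.rm = TRUE)
na_fixed <- sum(is.na(fixed))
na_fixed == na_before
```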

7.3 Targets and Features

Just to mention this again - your target is what you’re trying to predict or understand. Your feature(s) are what you’re using to predict or understand your target. You can have multiple features, but generally only have one target.

FYI, just because something is a target in one question doesn’t mean it can’t be used as a feature in another question. For example, let’s say you wanted to predict the target of an AirBnB rental price. You might rightfully use a measure of the property manager’s responsiveness to messages as a feature… more responsive managers might be able to charge more. But, that measure of responsiveness could easily become a target if you wanted to predict the factors that influence how quickly they respond. Features such as how many properties they have, how long they’ve been a property manager, and many others might explain that. So, it’s the question that dictates what is your feature and target.
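In R’s model formulas the target goes on the left of the ~ and the features go on the right, so the same column can sit on either side depending on the question. The sketch below only builds formulas and never fits a model. The column names price, responsiveness, n_properties and years_managing are hypothetical stand-ins for the AirBnB example, and no real data is loaded.

```r
# Hypothetical column names; no real AirBnB data is used here.
price_question <- price ~ responsiveness
response_question <- responsiveness ~ n_properties + years_managing

# all.vars() lists the variables in a formula, left-hand side first.
target_1 <- all.vars(price_question)[1]
features_1 <- all.vars(price_question)[-1]

target_2 <- all.vars(response_question)[1]
features_2 <- all.vars(response_question)[-1]

target_1    # the target of the first question
features_1  # responsiveness is a feature here...
target_2    # ...and the target here
features_2
```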

7.4 Breakup makeup

Soon you’ll be doing a lot of splitting apart your whole data set into targets and features. This is because many R functions actually need these to be inputted separately. But, you’ll frequently need to then bring many columns back together into a single data frame for analysis. None of this is super hard once you get the hang of it, but it is intimidating to start and always super tedious.

The process of doing this will always be in the main lessons, but I figure I’ll cover them again here because they’re so important.

7.4.1 Splitting Targets and Features

cereal <- read_csv("https://docs.google.com/spreadsheets/d/1sD1uWYNRfbPRNFNgJl7ufWqe0TPLuVZUKuNGgLEQ5Qo/gviz/tq?tqx=out:csv")

If we want to split our target from our features we need to first make an object from our main data that contains just the target, and then need to delete the target from our features.

Let’s make our target first. From our cereal data this is the calories column. Both the base R and tidyverse ways give the same result

cereal_target <- cereal %>% select(calories) # tidyverse way
cereal_target <- cereal[, 'calories'] # base R way

Of course, we should check our data. Doing so shows us a critical thing: extracting a single column this way keeps it as a data frame! (read_csv() gives us a tibble, and a tibble stays a data frame when you select a single column.) So even though it’s only one column from cereal, it’s contained within the data frame cereal_target.

glimpse(cereal_target)

## Rows: 77
## Columns: 1
## $ calories <dbl> 70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110, 120...

This matters as you need to make sure you refer to the column in the data frame, not the data frame itself. For example, this doesn’t work:

mean(cereal_target, na.rm = TRUE)

## Warning in mean.default(cereal_target, na.rm = TRUE): argument is not numeric or
## logical: returning NA

## [1] NA

But calling our data within cereal_target works just fine

mean(cereal_target$calories, na.rm = TRUE)

## [1] 106.8831
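It can help to see the two shapes side by side. The sketch below uses a small made-up base R data frame called toy (hypothetical values). One caveat: a plain data.frame behaves differently from the tibble that read_csv() returns, because toy[, "calories"] on a plain data.frame drops down to a vector unless you add drop = FALSE. Using [[ ]] (or $) gives you the vector in either case.

```r
# Made-up values for illustration only; a plain data.frame, not a tibble.
toy <- data.frame(calories = c(70, 120, NA, 110), fat = c(1, 5, 1, 2))

# One column kept as a data frame (what the tibble version gives you).
as_frame <- toy[, "calories", drop = FALSE]

# One column pulled out as a plain numeric vector.
as_vector <- toy[["calories"]]

is.data.frame(as_frame)
is.numeric(as_vector)

# mean() wants the vector, not the data frame wrapped around it.
toy_mean <- mean(as_vector, na.rm = TRUE)
toy_mean
```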

Splitting features is easy as that’s just removing a column. You can do this in tidyverse in a simple way. A glimpse shows that calories is now gone.

cereal_features <- cereal %>% select(-calories) # tidyverse way
glimpse(cereal_features)

## Rows: 77
## Columns: 15
## $ name     <chr> "100% Bran", "100% Natural Bran", "All-Bran", "All-Bran wi...
## $ mfr      <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q", "G"...
## $ type     <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"...
## $ protein  <dbl> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2, 1, 1, 3...
## $ fat      <dbl> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0, 0, 1, 3...
## $ sodium   <dbl> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, 220, 290,...
## $ fiber    <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5.0, 0.0, 2...
## $ carbo    <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0, 13.0, 12...
## $ sugars   <dbl> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2, 12,...
## $ potass   <dbl> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35, 105, 45...
## $ vitamins <dbl> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,...
## $ shelf    <dbl> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1, 2, 2, 3...
## $ weight   <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1.00, 1.00...
## $ cups     <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0.67, 0.67...
## $ rating   <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29.50954...

Removing a column by name is a bit more complicated in base R. The ! is R’s way of saying ‘not’. So we’re asking R to keep all columns whose names are NOT in the list containing ‘calories’.

cereal_features <- cereal[, !(names(cereal) %in% c('calories'))]
glimpse(cereal_features)

## Rows: 77
## Columns: 15
## $ name     <chr> "100% Bran", "100% Natural Bran", "All-Bran", "All-Bran wi...
## $ mfr      <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q", "G"...
## $ type     <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"...
## $ protein  <dbl> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2, 1, 1, 3...
## $ fat      <dbl> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0, 0, 1, 3...
## $ sodium   <dbl> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, 220, 290,...
## $ fiber    <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5.0, 0.0, 2...
## $ carbo    <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0, 13.0, 12...
## $ sugars   <dbl> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2, 12,...
## $ potass   <dbl> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35, 105, 45...
## $ vitamins <dbl> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,...
## $ shelf    <dbl> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1, 2, 2, 3...
## $ weight   <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1.00, 1.00...
## $ cups     <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0.67, 0.67...
## $ rating   <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29.50954...

If you want to see why this works I encourage you to break out the functions one-by-one. For example, names(cereal) gives the following list of column names:

names(cereal)

##  [1] "name"     "mfr"      "type"     "calories" "protein"  "fat"     
##  [7] "sodium"   "fiber"    "carbo"    "sugars"   "potass"   "vitamins"
## [13] "shelf"    "weight"   "cups"     "rating"

And let’s see what that does when mixed with the %in% c('calories'):

!(names(cereal) %in% c('calories'))

##  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE

So when we use that in our square brackets it’s only returning the columns where that statement evaluates to TRUE. That is every column whose name is not ‘calories’. Thus it keeps everything but our target.
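The same steps can be tried on a tiny made-up data frame, which makes the TRUE/FALSE vector easier to read. Everything here (toy, keep, toy_features, toy_features_2) is hypothetical. The setdiff() line is an alternative way to get the same columns, not something the lesson itself uses.

```r
# Made-up values for illustration only.
toy <- data.frame(name = c("A", "B"), calories = c(70, 120), fat = c(1, 5))

# Step 1: which column names are NOT 'calories'?
keep <- !(names(toy) %in% c("calories"))
keep

# Step 2: keep only the columns where that is TRUE.
toy_features <- toy[, keep]
names(toy_features)

# Alternative: build the list of names to keep with setdiff().
toy_features_2 <- toy[, setdiff(names(toy), "calories")]
identical(toy_features, toy_features_2)
```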

7.4.2 Merging columns back in

You will very frequently get a vector of predictions that you’ll want to add back to your data frame. Luckily this is an easier thing to deal with. You can just use your $ operator to add it back to your data.

For example, I made a model to predict calories based on fat, protein, and sugars. I’ll fit it below.

cal_model <- lm(calories ~ protein + fat + sugars, data = cereal)

I’m going to use predict() to get my prediction for the number of calories (this is $\hat{y}$). This will be a vector. Note how the data I use to predict contains only the features in the model…

predicted_calories <- predict(cal_model, newdata = cereal[, c('protein', 'fat', 'sugars')])
predicted_calories

##         1         2         3         4         5         6         7         8 
## 107.76066 136.18803 105.52636  87.84571 114.58277 119.05138 114.97030 116.66059 
##         9        10        11        12        13        14        15        16 
## 103.60503  96.93940 121.44216 107.25394 121.24840 114.42629 117.16732  90.39297 
##        17        18        19        20        21        22        23        24 
##  88.15867 108.42388 117.16732 120.93544  85.76789  90.39297 112.54223  94.86158 
##        25        26        27        28        29        30        31        32 
## 119.24514 106.18957 101.40800 121.12920 112.57952 114.93302 117.20460 108.23011 
##        33        34        35        36        37        38        39        40 
## 103.44854  92.47079 114.23253 119.20786 114.62005 106.18957 103.60503 112.38575 
##        41        42        43        44        45        46        47        48 
##  96.90212 114.26981 117.01084 101.05776 131.95046 131.95046 127.83211 103.60503 
##        49        50        51        52        53        54        55        56 
## 110.30793 114.42629  90.23649 121.12920 123.55726  92.47079  81.61225  83.69007 
##        57        58        59        60        61        62        63        64 
## 107.76066 100.70751 119.08866 116.66059  97.09588  86.08085  90.39297  83.69007 
##        65        66        67        68        69        70        71        72 
##  85.76789  85.76789 123.71375  98.70425  94.86158  96.90212 123.55726  98.97994 
##        73        74        75        76        77 
##  96.90212 114.93302  98.97994  98.97994 108.07363

And add that back! Now you can use this to calculate your error between the real value and the predicted one.

cereal$predicted_calories <- predicted_calories
glimpse(cereal)

## Rows: 77
## Columns: 17
## $ name               <chr> "100% Bran", "100% Natural Bran", "All-Bran", "A...
## $ mfr                <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"...
## $ type               <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"...
## $ calories           <dbl> 70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120...
## $ protein            <dbl> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, ...
## $ fat                <dbl> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, ...
## $ sodium             <dbl> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210,...
## $ fiber              <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5...
## $ carbo              <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0...
## $ sugars             <dbl> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13,...
## $ potass             <dbl> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 3...
## $ vitamins           <dbl> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 2...
## $ shelf              <dbl> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, ...
## $ weight             <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, ...
## $ cups               <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, ...
## $ rating             <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484...
## $ predicted_calories <dbl> 107.76066, 136.18803, 105.52636, 87.84571, 114.5...
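As a sketch of that last step, here is the whole fit, predict, add back and compare cycle on a tiny made-up data frame. The values and the names toy, toy_model and error are hypothetical and are not the cereal results. The error is defined here as the real value minus the predicted one. Note that lm() by default drops rows with a missing target when fitting, while predict() with newdata still returns one prediction per row.

```r
# Made-up values for illustration only; not the real cereal data.
toy <- data.frame(
  calories = c(70, 120, NA, 110, 90, 130),
  fat      = c(1, 5, 1, 2, 0, 3),
  sugars   = c(6, 8, 5, 10, 5, 12)
)

# Fit on the rows that have a target (the NA row is dropped by default).
toy_model <- lm(calories ~ fat + sugars, data = toy)

# Predict for every row, using only the feature columns.
toy$predicted_calories <- predict(toy_model, newdata = toy[, c("fat", "sugars")])

# Error: real value minus predicted value (NA where the real value is missing).
toy$error <- toy$calories - toy$predicted_calories

toy
```

The row with the missing target still gets a prediction, but its error is NA because there is no real value to compare against.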