# 4 Machine Learning Basic Knowledge

Product data science interviews tend to allocate 0* minutes to an hour for basic Machine Learning knowledge. You will need to be familiar with the following:

- Linear regression
- Logistic regression
- k-means, PCA
- Basic machine learning training and evaluation frameworks and metrics

*Indeed, some PDS roles do not include an ML knowledge session in their interview loop; if these are the types of roles you are looking for, you can skip this chapter. Note however that very often ML fundamentals knowledge is being tested during the stats/probability session, so it might be hard to know in advance whether or not to prepare for some basic ML questions.*

## 4.1 Encoding categorical features

**Question:** What are some ways that you can encode categorical variables?

**Answer:** Some of the things we could do include:

One-hot-encoding (and then potentially apply some sort of dimensionality reduction technique such as PCA)

We can use embeddings (either use a pre-trained network or learn a new network to map entities into vectors)

Custom: customize them into groups (e.g., based on frequency of appearance).

Use hashing: collision occurs at random.

Target encoding

A few notes on hashing:

A study by Booking.com showed that even for 50% colliding features, the performance loss was less than 0.5% (p.131, Huyen (2022)).

There are ways that you can choose locality-sensitive hashing functions where similar categories (such as websites with similar names) are hashed into values close to each other (p.131, Huyen (2022)).

## 4.2 Imbalanced dataset

**Question:**: Assume we have a situation where we want to build a predictive classification model on a dataset that has 99% negative instances and 1% positive ones.

**A.** Is this a problem? Why?

**B.** What are some things we could do to address the issue?

**C.** What are some metrics that you would use to evaluate the offline predictive performance of your model?

**Answer:** This question focuses on class imbalance. Please see the respective answers below.

**A:**. There are several issues of why class imbalance might be problematic when trying to build predictive models. Some of the reasons include:

There might be insufficient signal for your model to learn to detect minority classes. In case there is a small number of instances in the minority class, the problem becomes a few-shot learning problem where your model only gets to see the minority class a few times before having to make a decision on it. In the case where there are no instances in the training sample of your rare class, the model might assume that the class doesn’t exist.

Class imbalance makes it easier for your model to get stuck in non-optimal solutions by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data.

The cost of misclassification is asymmetric, as the cost of misclassifying a minority class instance tends to be much higher than the cost of misclassifying a majority class instance (e.g., labeling a fraudulent transaction as non-fraudulent has a much higher cost than labeling a legitimate transaction as fraudulent).

As we will discuss in

**C**, evaluating the performance of the model becomes non-trivial as several metrics might be overly optimistic or completely misleading.

Despite these, problems with class imbalance tend to be the more interesting, challenging, and rewarding to solve (e.g., predict the likelihood of ad conversions, predict credit card fraud, etc.).

**B**: Unfortunately, there is no perfect solution to class imbalance. Some mitigation techniques we could employ include:

SMOTE (see here: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)

Use XGBoost (boosting in general) as it tends to work relatively well in imbalanced datasets

Get more data if possible

Easy ensemble (https://machinelearningmastery.com/bagging-and-random-forest-for-imbalanced-classification/)

Use class weight in the loss function (it penalizes mistakes in samples of

`class[i]`

with`class\_weight[i]`

instead of 1)Use a Cost-sensitive loss (p.111 Huyen (2022)):

\[ L = \sum_i C_{ij} P(j | x;\theta) \]

where \(L\) is the loss of a single instance, and \(i\) is the true class, and \(j\) is the predicted class

Ensemble models with different ratios in classes

Bagging with random undersampling for imbalanced classification. (

`BalancedBaggingClassifier`

, https://imbalanced-learn.org/stable/references/generated/imblearn.ensemble.BalancedBaggingClassifier.html)Two-phase learning: we first train a model on resampled balanced data. Then, we finetune our model on the original data.

Focal loss (adjust the loss so that if a sample has a lower probability of being right, it will have a higher weight).

Some additional points on class imbalance worthy of noting:

Class imbalance in binary classification is a much easier problem than class imbalance in multiclass classification (p.105 Huyen (2022))

Ding et al. (2017) showed that very deep neural networks – with very deep meaning over 10 layers back in 2017 – performed much better on imbalanced data than shallower neural nets.

**C**: In terms of metrics, accuracy will be misleading since a majority classifier in our example that blindly predicts “negative” will have an accuracy of 99%. Checking the confusion matrix will be a good start, but such a matrix (along with precision and recall and F-scores) is threshold-based.

A better approach would be to evaluate the predictive performance of our model in a probabilistic approach:

The area under the ROC curve is a good start, but it tends to be overly optimistic in highly imbalanced datasets as it includes the False Positive Rate in its calculation (i.e., it includes the total number of negatives in the denominator of False Positive Rate).

The Area under the Precision-Recall curve

**AUPR**is a better metric as it directly focuses on how the model is doing on the minority class through precision and recall.Finally, besides AUPR, accuracy can still be a good metric if we use it

**for each class individually**. Similarly, F-score, precision, and recall metrics that measure model performance wrt the minority class can also be good metrics.

## 4.3 Range of R2 when combining regressions

**Question:** Assume that you are given the \(R^2\) for the following regressions:

\[ \begin{align} y &= a x_1 + \varepsilon_1 \text{ with } R_1^2\\ y &= bx_2 + \varepsilon_2 \text{ with } R_2^2 \end{align} \] What is the possible range of \(R^2\) for \(y = c x_1 + d x_2 + \varepsilon\)?

**Answer:** Consider the case of a multiple linear regression model with two independent variables \(x_1\) and \(x_2\). The \(R^2\) value for this model measures the proportion of variance in the dependent variable that can be explained by both \(x_1\) and \(x_2\) together.

Since adding more independent variables to a regression model can only increase the proportion of variance that can be explained by the model, the range of \(R^2\) for \(y \sim x_1 + x_2\) will always be between the maximum R^2 value for \(y = a x_1 + \varepsilon_1\) and \(y = bx_2 + \varepsilon_2\), and 1.

That is:

\[ \max(R_1^2, R_2^2) \leq R^2 \leq 1 \]

For an example of 1, consider the following:

\(x_1 \sim y\) + noise

\(x_2 \sim y\) - noise

\(y \sim x_1 + x_2\) will result in \(R^2 = 1\).

## 4.4 k-means

**Question:** k-means is a very popular clustering algorithm:

A. How does it work? B. How do you choose k? C. How do you prepare your data for k-means? D. How do you choose starting points? E. Can you think of scenarios where k-means will fail?

**Answer:** Below we discuss the answers to the above questions:

**A.** The k-means works as follows:

First, we initialize the \(k\)centroids (assume vector \(\mu\) stores the centroids)

Then, we assign data points to the centroid that they are closest to (Euclidean distance between the data point and the centroid).

Finally, we update the centroids by recalculating the mean of all the data points assigned to a centroid.

We repeat these steps until convergence

**B.** The K-means algorithm seeks to minimize the sum of squared distances between each data point and its assigned centroid, which is also known as the within-cluster sum of squares (WCSS). The algorithm aims to find the optimal K number of clusters that minimizes the WCSS. The optimal value of K can be determined using techniques such as the **elbow** method, which involves plotting the WCSS against the number of clusters and selecting the value of K where the curve starts to level off.

An alternative approach to select \(K\) is through the **silhouette** score (see here: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html )

In general, we use two metrics to evaluate the quality of our k-means clusters:

**Distortion:**It is calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.**Inertia:**It is the sum of squared distances of samples to their closest cluster center.

When choosing k, the goal is to minimize distortion (and/or inertia): https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/

Note that sometimes we might choose a convenient k based on the problem we are solving.

**C.** K-Means uses distance-based measurements (e.g., Euclidean Distance) to calculate how similar each data point is to centroids using values from all the features. These features usually take values in incomparable units (e.g., currency in dollars, weight in kg, temperature in Fahrenheit). In order to produce a fair result, it is recommended to standardize the raw data. We can transform the raw data to have a mean of zero and a standard deviation of one so that all the features have equal weights.

**D.** We can use something like **K-Means++** that allows overcoming getting stuck in a poor local optimum, hence improving the quality of the clustering. The intuition is simple. We will pick up initial centroids that are far away from each other so that it is more likely to pick the points from different clusters. K-Means++ can be implemented in the following steps.

Step 1: We randomly pick a data point as the first centroid.

Step 2: We compute the distance between the remaining points with the nearest centroid.

Step 3: We pick the next centroid such that the probability of picking a given point as the centroid is proportional to the distance between this given point and its nearest chosen centroid. In other words, the farther a point is away from the chosen centroids, the more likely it will be picked as the next centroid.

Repeat steps 2–3 until K centroids are picked.

**E.** In general, k-means will fail when clusters are defined by functions that are not symmetric around a centroid.

There are some nice visualizations in the following links:

## 4.5 Additional Questions

The following machine learning questions are included in the complete book that defines the bar for product data science:

Question | Topics |
---|---|

Bootstrap | Bootstrap |

MSE vs. MAE | MSE, MAE |

Outliers | Outliers, Cook’s distance, Regularization |

Missing data | Missing data |

Interpretability | ML interpretability |

Cross validation | Cross validation, Offline evaluation |

Random vs. stratified sampling | Sampling, Stratified sampling |

Linear regression assumptions | Linear regression |

Intercept | Linear regression, Intercept |

Normalization vs. Standardization | Linear regression, Standardization, Normalization |

Logistic regression assumptions | Logistic regression |

Principal Component Analysis (PCA) | PCA |

Multicollinearity | Multicollinearity, Linear regression |

### References

*2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)*, 1368–72. IEEE.

*Designing Machine Learning Systems*. " O’Reilly Media, Inc.".