4 Machine Learning Basic Knowledge

Product data science interviews tend to allocate 30 minutes to an hour for basic Machine Learning knowledge. You will need to be familiar with the following:

  • Linear regression
  • Logistic regression
  • k-means, PCA
  • Basic machine learning training and evaluation frameworks and metrics



Indeed, some PDS roles do not include an ML knowledge session in their interview loop; if these are the types of roles you are looking for, you can skip this chapter. Note, however, that ML fundamentals are often tested during the stats/probability session, so it might be hard to know in advance whether or not to prepare for some basic ML questions.


4.1 Encoding categorical features

Easy · Seen in a real interview · Topics: Categorical features, Embeddings, One-hot encoding, Hashing


Question: What are some ways that you can encode categorical variables?

Answer: Some of the things we could do include:

  • One-hot encoding (potentially followed by a dimensionality reduction technique such as PCA)

  • Embeddings (either from a pre-trained network or a newly learned one) that map entities into dense vectors

  • Custom grouping: bucket the categories into groups (e.g., based on frequency of appearance)

  • Hashing: categories are hashed into a fixed number of buckets, with collisions occurring at random (see the sketch below)

  • Target encoding: replace each category with a statistic of the target (e.g., its mean)


A few notes on hashing:

  • A study by Booking.com showed that even when 50% of the features collided, the performance loss was less than 0.5% (p.131, Huyen (2022)).

  • There are ways that you can choose locality-sensitive hashing functions where similar categories (such as websites with similar names) are hashed into values close to each other (p.131, Huyen (2022)).
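
To make this concrete, here is a minimal Python sketch of three of these encodings (the tiny dataset and the choice of 8 hash buckets are purely illustrative):

```python
# Minimal sketch of one-hot, hashing, and target encoding (illustrative data).
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({"site": ["a.com", "b.com", "a.com", "c.com"],
                   "y": [1, 0, 1, 0]})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["site"], prefix="site")

# Hashing: map categories into a fixed number of buckets; collisions are random.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[s] for s in df["site"]]).toarray()

# Target encoding: replace each category with the mean of the target.
# (In practice, compute the means on held-out folds to avoid target leakage.)
target_enc = df["site"].map(df.groupby("site")["y"].mean())
```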



4.2 Imbalanced dataset

Medium · Seen in a real interview · Topics: Class imbalance


Question: Assume we have a situation where we want to build a predictive classification model on a dataset that has 99% negative instances and 1% positive ones.

A. Is this a problem? Why?

B. What are some things we could do to address the issue?

C. What are some metrics that you would use to evaluate the offline predictive performance of your model?

Answer: This question focuses on class imbalance. Please see the respective answers below.


This is a very standard question for all data science and machine learning interviews. The response below might be more detailed than what you need in a PDS interview. However, being able to talk about this issue will certainly help.


A. There are several reasons why class imbalance might be problematic when trying to build predictive models. Some of them include:

  • There might be insufficient signal for your model to learn to detect the minority class. If there is only a small number of minority-class instances, the problem becomes a few-shot learning problem where your model gets to see the minority class only a few times before having to make a decision on it. If the rare class has no instances at all in the training sample, the model might assume that the class doesn’t exist.

  • Class imbalance makes it easier for your model to get stuck in non-optimal solutions by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data.

  • The cost of misclassification is asymmetric, as the cost of misclassifying a minority class instance tends to be much higher than the cost of misclassifying a majority class instance (e.g., labeling a fraudulent transaction as non-fraudulent has a much higher cost than labeling a legitimate transaction as fraudulent).

  • As we will discuss in C, evaluating the performance of the model becomes non-trivial as several metrics might be overly optimistic or completely misleading.

Despite these difficulties, problems with class imbalance tend to be among the most interesting, challenging, and rewarding to solve (e.g., predicting the likelihood of ad conversions, predicting credit card fraud, etc.).


B: Unfortunately, there is no perfect solution to class imbalance. Some mitigation techniques we could employ include:

  • Data-level methods: resample the training data, e.g., oversample the minority class (naively or with a technique such as SMOTE) or undersample the majority class.

  • Algorithm-level methods: make the loss function cost-sensitive so that misclassifying the minority class costs more. If \(C_{ij}\) is the cost of classifying an instance of true class \(i\) as class \(j\), the loss of a single instance \(x\) becomes

\[ L(x; \theta) = \sum_j C_{ij} \, P(j \mid x; \theta) \]

       where \(i\) is the true class of \(x\) and the sum runs over the possible predicted classes \(j\).
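
As a rough illustration, here is a minimal scikit-learn sketch of both families of techniques (the simulated dataset and all parameters are assumptions for the example):

```python
# Minimal sketch: class weights and naive oversampling (illustrative data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

# Algorithm-level: reweight the loss so minority-class errors cost more.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Data-level: oversample the minority class with replacement
# (SMOTE, from the imbalanced-learn package, is a popular alternative).
X_over, y_over = resample(X[y == 1], y[y == 1],
                          n_samples=int((y == 0).sum()), random_state=0)
X_bal = np.vstack([X[y == 0], X_over])
y_bal = np.concatenate([y[y == 0], y_over])
```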

Some additional points on class imbalance worth noting:

  • Class imbalance in binary classification is a much easier problem than class imbalance in multiclass classification (p.105, Huyen (2022)).

  • Ding et al. (2017) showed that very deep neural networks – with very deep meaning over 10 layers back in 2017 – performed much better on imbalanced data than shallower neural nets.


C: In terms of metrics, accuracy will be misleading, since a majority-class classifier in our example that blindly predicts “negative” will have an accuracy of 99%. Checking the confusion matrix is a good start, but such a matrix (along with precision, recall, and F-scores) is threshold-based.

A better approach is to evaluate the predictive performance of our model with threshold-free metrics computed on the predicted scores:

  • The area under the ROC curve (AUC) is a good start, but it tends to be overly optimistic on highly imbalanced datasets because the False Positive Rate has the total number of negatives in its denominator, so even a large number of false positives barely moves it.

  • The area under the Precision-Recall curve (AUPR) is a better metric, as it directly focuses on how the model is doing on the minority class through precision and recall.

  • Finally, besides AUPR, accuracy can still be a good metric if we compute it for each class individually. Similarly, F-score, precision, and recall computed with respect to the minority class can also be good metrics.

Note that F1, precision, and recall are asymmetric metrics, which means their values change depending on which class is considered the positive class.
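
A minimal sketch of how these threshold-free metrics are computed with scikit-learn (simulated data; the exact numbers will vary):

```python
# Minimal sketch: threshold-free metrics on an imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Score the positive class with predicted probabilities (no threshold needed).
scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("ROC AUC:", roc_auc_score(y_te, scores))            # often optimistic
print("AUPR   :", average_precision_score(y_te, scores))  # minority-focused
```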

4.3 Range of \(R^2\) when combining regressions

Medium · Seen in a real interview · Topics: Linear regression, Goodness of fit, R-squared, Correlation


Question: Assume that you are given the \(R^2\) for the following regressions:

\[ \begin{align} y &= a x_1 + \varepsilon_1 \text{ with } R_1^2\\ y &= bx_2 + \varepsilon_2 \text{ with } R_2^2 \end{align} \] What is the possible range of \(R^2\) for \(y = c x_1 + d x_2 + \varepsilon\)?

Answer: Consider the case of a multiple linear regression model with two independent variables \(x_1\) and \(x_2\). The \(R^2\) value for this model measures the proportion of variance in the dependent variable that can be explained by both \(x_1\) and \(x_2\) together.

Adding independent variables to a regression model can never decrease the proportion of variance explained: the two-variable model nests each single-variable model (set \(c = 0\) or \(d = 0\)), so its fit is at least as good as the better of the two. The \(R^2\) for \(y \sim x_1 + x_2\) therefore lies between \(\max(R_1^2, R_2^2)\) and 1.

That is:

\[ \max(R_1^2, R_2^2) \leq R^2 \leq 1 \]


For an example where the upper bound \(R^2 = 1\) is attained, consider the following:

  • \(x_1 = y + \varepsilon\)

  • \(x_2 = y - \varepsilon\) (the same noise term \(\varepsilon\))

  • Then \(y = (x_1 + x_2)/2\) exactly, so \(y \sim x_1 + x_2\) will result in \(R^2 = 1\).
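
A quick numerical check of both the bound and the \(R^2 = 1\) construction (a simulation sketch; the data is made up):

```python
# Simulation sketch: max(R1^2, R2^2) <= R^2 <= 1, with the equality-at-1 example.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y = rng.normal(size=1_000)
noise = rng.normal(size=1_000)
x1, x2 = y + noise, y - noise          # the construction from the example

r2_1 = LinearRegression().fit(x1[:, None], y).score(x1[:, None], y)
r2_2 = LinearRegression().fit(x2[:, None], y).score(x2[:, None], y)
X = np.column_stack([x1, x2])
r2_12 = LinearRegression().fit(X, y).score(X, y)   # ~1.0: y = (x1 + x2) / 2

assert max(r2_1, r2_2) <= r2_12 <= 1 + 1e-12
```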


4.4 k-means

Medium · Seen in a real interview · Topics: k-means


Question: k-means is a very popular clustering algorithm:

A. How does it work?

B. How do you choose \(k\)?

C. How do you prepare your data for k-means?

D. How do you choose starting points?

E. Can you think of scenarios where k-means will fail?

Answer: Below we discuss the answers to the above questions:

A. k-means works as follows:

  • First, we initialize the \(k\) centroids (assume a vector \(\mu\) stores the centroids).

  • Then, we assign each data point to the centroid it is closest to (by Euclidean distance between the data point and the centroid).

  • Finally, we update each centroid by recalculating the mean of all the data points assigned to it.

  • We repeat the assignment and update steps until convergence (e.g., until the assignments stop changing); see the sketch after this list.
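
A minimal NumPy sketch of these steps (illustrative and unoptimized; it assumes no cluster ever becomes empty):

```python
# Minimal NumPy sketch of the k-means (Lloyd's) loop.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # initialize k centroids
    for _ in range(n_iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):                     # converged
            break
        mu = new_mu
    return labels, mu
```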

B. For a fixed \(k\), the k-means algorithm minimizes the sum of squared distances between each data point and its assigned centroid, also known as the within-cluster sum of squares (WCSS). Since WCSS can only decrease as \(k\) grows, we cannot simply pick the \(k\) with the smallest WCSS. Instead, we can use techniques such as the elbow method, which involves plotting the WCSS against the number of clusters and selecting the value of \(k\) where the curve starts to level off.

An alternative approach to select \(K\) is through the silhouette score (see here: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html )


In general, we use two metrics to evaluate the quality of our k-means clusters:

  • Distortion: the average of the squared distances from each point to the center of its assigned cluster. Typically, the Euclidean distance metric is used.

  • Inertia: the sum of squared distances of samples to their closest cluster center.

When choosing \(k\), the goal is to minimize distortion (and/or inertia); see the elbow-method discussion at https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/ and the sketch below.

Note that sometimes we might choose a convenient \(k\) based on the problem we are solving.
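
A minimal scikit-learn sketch of the elbow method and silhouette score (the blob data and the range of \(k\) values are illustrative):

```python
# Minimal sketch: inertia (WCSS) and silhouette score for a range of k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Look for the "elbow" where inertia stops dropping sharply,
    # or pick the k with the highest silhouette score.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```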


C. k-means uses distance-based measurements (e.g., Euclidean distance) to calculate how similar each data point is to each centroid, using values from all the features. These features usually take values in incomparable units (e.g., currency in dollars, weight in kg, temperature in Fahrenheit). To produce a fair result, it is recommended to standardize the raw data: transform each feature to have a mean of zero and a standard deviation of one, so that all features carry equal weight (see the sketch below).
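
A minimal sketch of this preprocessing step with scikit-learn (the pipeline and \(k = 3\) are illustrative):

```python
# Minimal sketch: standardize features before clustering.
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each feature is shifted to mean 0 and scaled to standard deviation 1,
# so that no single unit (dollars, kg, degrees) dominates the distances.
model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10))
# labels = model.fit_predict(X)   # X: the raw feature matrix
```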

D. We can use something like k-means++, which reduces the chance of getting stuck in a poor local optimum and hence improves the quality of the clustering. The intuition is simple: we pick initial centroids that are far away from each other, so that they are more likely to come from different clusters. k-means++ can be implemented in the following steps (a sketch follows the list).

  • Step 1: We randomly pick a data point as the first centroid.

  • Step 2: For each remaining point, we compute the distance to its nearest chosen centroid.

  • Step 3: We pick the next centroid such that the probability of picking a given point is proportional to its squared distance to the nearest chosen centroid. In other words, the farther a point is from the already chosen centroids, the more likely it is to be picked as the next centroid.

  • Repeat steps 2–3 until \(k\) centroids are picked.
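
A minimal NumPy sketch of this initialization (illustrative; the function name is made up):

```python
# Minimal NumPy sketch of k-means++ initialization.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # step 1: random first centroid
    for _ in range(k - 1):
        # Step 2: squared distance of each point to its nearest chosen centroid.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid proportionally to that distance.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```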

E. In general, k-means will fail when clusters are not roughly spherical blobs around a centroid: non-convex shapes (e.g., concentric rings or half-moons), clusters of very different sizes or densities, and heavy outliers all violate the centroid assumption. A toy failure case is sketched below.
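
A toy sketch of such a failure on concentric rings (illustrative parameters):

```python
# Toy failure case: two concentric rings share the same centroid.
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Near-zero agreement: k-means cuts the rings in half instead of separating them.
print(adjusted_rand_score(y, labels))
```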



4.5 Additional Questions

The following machine learning questions are included in the complete book that defines the bar for product data science:

| Question | Topics |
|---|---|
| Bootstrap | Bootstrap |
| MSE vs. MAE | MSE, MAE |
| Outliers | Outliers, Cook’s distance, Regularization |
| Missing data | Missing data |
| Interpretability | ML interpretability |
| Cross validation | Cross validation, Offline evaluation |
| Random vs. stratified sampling | Sampling, Stratified sampling |
| Linear regression assumptions | Linear regression |
| Intercept | Linear regression, Intercept |
| Normalization vs. Standardization | Linear regression, Standardization, Normalization |
| Logistic regression assumptions | Logistic regression |
| Principal Component Analysis (PCA) | PCA |
| Multicollinearity | Multicollinearity, Linear regression |

References

Ding, Wan, Dong-Yan Huang, Zhuo Chen, Xinguo Yu, and Weisi Lin. 2017. “Facial Action Recognition Using Very Deep Networks for Highly Imbalanced Class Distribution.” In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 1368–72. IEEE.
Huyen, Chip. 2022. Designing Machine Learning Systems. O’Reilly Media.