Chapter 12 The basics of machine learning

Figure: The machine learning flow. Source: mlr3 manual.

Data: (y_i,\bf{x}_i)_{i=1,\dots,N} where \bf{x}\in\mathbb{R}^D.

Want to predict y from x.

Model for prediction: f(\bf{x})=\bf{\theta}^T\bf{x}+\theta_0
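A minimal numerical sketch of this prediction rule (the parameter values and feature vector below are made up purely for illustration):

    import numpy as np

    # Linear prediction rule f(x) = theta^T x + theta_0 with made-up parameters.
    theta = np.array([0.5, -1.2, 2.0])   # hypothetical slope coefficients
    theta0 = 0.3                         # hypothetical intercept
    x = np.array([1.0, 0.0, 2.5])        # one feature vector, D = 3

    y_hat = theta @ x + theta0           # predicted value of y
    print(y_hat)                         # 0.5*1.0 - 1.2*0.0 + 2.0*2.5 + 0.3 = 5.8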

12.1 Training vs prediction

12.1.1 Training vs testing (predicting) sets

  • The entire sample \mathcal{S}=(y_i,\bf{x}_i)_{i=1,\dots,N} is randomly separated into two disjoint subsets, say \mathcal{S}_t and \mathcal{S}_p, so that

\mathcal{S}_t\cup\mathcal{S}_p=\mathcal{S}, \quad \mathcal{S}_t\cap\mathcal{S}_p=\emptyset

  • Model training phase: when you are estimating parameters using \mathcal{S}_t.

  • Model testing phase: when you are evaluating your model's performance using \mathcal{S}_p. (A code sketch of such a split follows this list.)
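A sketch of such a random split using scikit-learn's train_test_split (the data are simulated purely for illustration, and the variable names X_t, X_p, y_t, y_p are assumptions of this sketch):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Simulated sample: N = 100 observations, D = 3 features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([0.5, -1.2, 2.0]) + rng.normal(size=100)

    # Randomly split S into a training set S_t and a testing set S_p.
    X_t, X_p, y_t, y_p = train_test_split(X, y, test_size=0.3, random_state=0)
    print(X_t.shape, X_p.shape)   # (70, 3) (30, 3)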

12.2 How to train a model

  • some measure of quality:

    • empirical risk minimization, such as the sum of squared errors (SSE); a numerical sketch follows this list:

\min_{ \bf{\theta}}\ \sum_{(y_i,\bf{x}_i)\in\mathcal{S}_t}(y_i-f(\bf{x_i}| \bf{\theta}))^2

  • Bayesian inference
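A minimal sketch of empirical risk minimization with the SSE loss for the linear model above; with simulated data, the minimizer is simply the ordinary least-squares fit:

    import numpy as np

    # Simulated training data for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([0.5, -1.2, 2.0]) + 0.3 + rng.normal(size=100)

    # Minimize the empirical risk sum_i (y_i - theta^T x_i - theta_0)^2.
    X1 = np.column_stack([np.ones(len(X)), X])          # prepend a column of 1s for theta_0
    theta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares minimizer
    sse = np.sum((y - X1 @ theta_hat) ** 2)             # minimized empirical risk (SSE)
    print(theta_hat, sse)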

12.3 Training while avoiding over/under-fitting

  • Training without properly reining in the model will lead to a trained model that performs well in \mathcal{S}_t but poorly in \mathcal{S}_p.

You want to predict heads or tails for a coin toss. You collect a sample of (H,T,H,H,H) and randomly separate it into \mathcal{S}_t=(H,H,H) and \mathcal{S}_p=(H,T). (One possible set of answers is sketched after the questions below.)

  • What is your model?

  • What is your measure of quality?

  • What is your parameter estimate?
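One possible set of answers, sketched in code: assume a Bernoulli model P(H)=\theta, use the relative-frequency (maximum-likelihood) estimate, and measure quality by the misclassification rate. The sketch illustrates how a model fit to \mathcal{S}_t can look perfect in training yet miss on \mathcal{S}_p:

    # Training and testing samples from the coin-tossing example.
    S_t = ["H", "H", "H"]
    S_p = ["H", "T"]

    theta_hat = S_t.count("H") / len(S_t)     # MLE of P(H) = 1.0: "always predict H"
    pred = "H" if theta_hat >= 0.5 else "T"

    train_error = sum(pred != obs for obs in S_t) / len(S_t)  # 0.0 on S_t
    test_error = sum(pred != obs for obs in S_p) / len(S_p)   # 0.5 on S_p
    print(theta_hat, pred, train_error, test_error)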

12.3.1 Reining in the model while training

  • Regularization: add a penalty term (called a regularizer), such as \left\Vert \bf{\theta}\right\Vert^2, to our loss function, weighted by a tuning parameter \lambda. For example,

\min_{ \bf{\theta}}\ \sum_{(y_i,\bf{x}_i)\in\mathcal{S}_t}(y_i-f(\bf{x_i}| \bf{\theta}))^2/N + \lambda\left\Vert \bf{\theta}\right\Vert^2, or equivalently \min_{ \bf{\theta}}\ \left\Vert \bf{y}-f(\bf{x}| \bf{\theta})\right\Vert^2/N + \lambda\left\Vert \bf{\theta}\right\Vert^2, where \lambda>0 is called a tuning parameter that needs to be chosen in the training phase as well. (A numerical sketch of this penalized objective follows this list.)

  • Adding a prior
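A minimal numpy sketch of the penalized objective above, with the intercept omitted and data simulated for illustration; setting the gradient to zero gives the closed-form minimizer (X'X + \lambda N I)^{-1} X'y:

    import numpy as np

    # Simulated training data for illustration.
    rng = np.random.default_rng(1)
    N, D = 100, 3
    X = rng.normal(size=(N, D))
    y = X @ np.array([0.5, -1.2, 2.0]) + rng.normal(size=N)

    lam = 0.5                                                  # tuning parameter lambda
    theta_ols = np.linalg.solve(X.T @ X, X.T @ y)              # unregularized fit
    theta_ridge = np.linalg.solve(X.T @ X + lam * N * np.eye(D), X.T @ y)
    print(theta_ols, theta_ridge)    # the penalty shrinks the estimates toward zero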

12.3.2 Cross-validation to choose the tuning parameter
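The idea can be sketched with scikit-learn's K-fold utilities: for each candidate \lambda, fit on part of \mathcal{S}_t, evaluate on the held-out folds, and keep the value with the best average score. The data and the grid of candidate values below are made up for illustration:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_score

    # Simulated training data for illustration.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([0.5, -1.2, 2.0]) + rng.normal(size=100)

    candidate_lambdas = [0.01, 0.1, 1.0, 10.0]       # made-up grid of tuning values
    cv = KFold(n_splits=5, shuffle=True, random_state=0)

    scores = {lam: cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                                   scoring="neg_mean_squared_error").mean()
              for lam in candidate_lambdas}
    best_lambda = max(scores, key=scores.get)        # largest (least negative) score wins
    print(scores, best_lambda)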

12.4 Python example

Install the basic Python machine learning package, scikit-learn (for example, with pip install scikit-learn):

  • Regularization of Linear Regression Model:
    • Ridge regression
      \min_{w} || X w - y||_2^2 + \alpha ||w||_2^2
    linear_model.Ridge(alpha=0.5) # instantiate a Ridge estimator
    linear_model.RidgeCV()
    • Lasso regression
      \min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}
    linear_model.Lasso() # instantiate a Lasso estimator
  • Stochastic Gradient Descent:
    • SGDRegressor:
      a stochastic gradient descent algorithm with a batch size of 1.
    SGDRegressor(
      loss="squared_loss", # squared-error (OLS) loss function
      penalty="l2", # penalize the squared norm of the parameters
      alpha=0.0001 # regularization strength
      )

    Other parameters include max_iter, the upper limit on the number of passes over the entire data set (i.e., the maximum number of epochs).

    Sample\ size\ (1000)\ /\ Batch\ size\ (250) = Iterations\ per\ epoch\ (4). One epoch is one complete pass of the entire sample through the algorithm, which here takes four iterations.
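A sketch of this epoch/batch/iteration bookkeeping using SGDRegressor's partial_fit (data simulated for illustration; the loss name squared_loss follows the older scikit-learn release used in these notes):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    # Simulated data: 1000 observations, batch size 250, so 4 iterations per epoch.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 3))
    y = X @ np.array([0.5, -1.2, 2.0]) + rng.normal(size=1000)

    sgd = SGDRegressor(loss="squared_loss", penalty="l2", alpha=0.0001)

    batch_size, n_epochs = 250, 5
    for epoch in range(n_epochs):                    # one epoch = one full pass over the data
        for start in range(0, len(X), batch_size):   # 1000 / 250 = 4 iterations per epoch
            batch = slice(start, start + batch_size)
            sgd.partial_fit(X[batch], y[batch])      # update the model on this mini-batch

    print(sgd.coef_, sgd.intercept_)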

12.4.2 Training
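A sketch of training code that would produce output like the printouts below; the simulated data and variable names are assumptions of this sketch, while the alpha grid matches the one shown in the RidgeCV and LassoCV printouts:

    import numpy as np
    from sklearn.linear_model import (LinearRegression, Ridge, RidgeCV,
                                      Lasso, LassoCV, SGDRegressor)
    from sklearn.model_selection import train_test_split

    # Simulated data standing in for the data set used in these notes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([0.5, -1.2, 2.0]) + rng.normal(size=200)
    X_t, X_p, y_t, y_p = train_test_split(X, y, test_size=0.3, random_state=0)

    alphas = np.linspace(0.1, 3, 10)   # the alpha grid shown in the RidgeCV/LassoCV output

    models = [
        LinearRegression(),
        Ridge(alpha=0.5),
        RidgeCV(alphas=alphas, cv=5),
        Lasso(alpha=0.5),
        LassoCV(alphas=alphas, cv=5),
        SGDRegressor(loss="squared_loss", penalty="l2", alpha=0.0001),
    ]

    for model in models:
        print(model.fit(X_t, y_t))     # fit each estimator on the training set and echo it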

## LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
## Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
##       normalize=False, random_state=None, solver='auto', tol=0.001)
## RidgeCV(alphas=array([0.1       , 0.42222222, 0.74444444, 1.06666667, 1.38888889,
##        1.71111111, 2.03333333, 2.35555556, 2.67777778, 3.        ]),
##         cv=5, fit_intercept=True, gcv_mode=None, normalize=False, scoring=None,
##         store_cv_values=False)
## 
## Lasso(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=1000,
##       normalize=False, positive=False, precompute=False, random_state=None,
##       selection='cyclic', tol=0.0001, warm_start=False)
## LassoCV(alphas=array([0.1       , 0.42222222, 0.74444444, 1.06666667, 1.38888889,
##        1.71111111, 2.03333333, 2.35555556, 2.67777778, 3.        ]),
##         copy_X=True, cv=5, eps=0.001, fit_intercept=True, max_iter=1000,
##         n_alphas=100, n_jobs=None, normalize=False, positive=False,
##         precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
##         verbose=False)
## SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
##              eta0=0.01, fit_intercept=True, l1_ratio=0.15,
##              learning_rate='invscaling', loss='squared_loss', max_iter=1000,
##              n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
##              shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
##              warm_start=False)