Chapter 11 Vector Calculus

  • MML, chapter 5

11.1 Theory

\[ {\bf f(\theta)=}\begin{bmatrix}f_{1}({\bf \theta})\\ \vdots\\ f_{m}({\bf \theta}) \end{bmatrix}\in\mathbb{R}^{m\times1} \]

  • \(m\) functions

\[ {\bf f(\theta)}=\begin{bmatrix} a_1\theta_1+a_2\theta_2+a_3\theta_3\\ b_1\theta_1^2+b_2\theta_2^3 \end{bmatrix} \]

numpy provides a family of polynomial routines.

Extend the function above so that the input x can have arbitrary dimension, and the function itself can also have arbitrary output dimension, with a different polynomial in each component.
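A minimal numpy sketch of the two-component example above; the concrete coefficient values and the function name `f` are illustrative assumptions, not from a specific library:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def f(theta, a, b):
    """Vector-valued f(theta) in R^2 for theta in R^3.

    First component:  a1*theta1 + a2*theta2 + a3*theta3 (linear).
    Second component: b1*theta1^2 + b2*theta2^3 (polynomial).
    """
    f1 = a @ theta                                   # a1*t1 + a2*t2 + a3*t3
    f2 = b[0] * theta[0]**2 + b[1] * theta[1]**3
    return np.array([f1, f2])

theta = np.array([1.0, 2.0, 3.0])
a = np.array([1.0, -2.0, 0.5])
b = np.array([0.3, 1.5])
print(f(theta, a, b))                                # shape (2,)

# The second component can also be evaluated with numpy's polynomial routines:
f2_alt = P.polyval(theta[0], [0.0, 0.0, b[0]]) + P.polyval(theta[1], [0.0, 0.0, 0.0, b[1]])
```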

\[ \frac{\partial{\bf f(\theta)}}{\partial \theta_{i}}{\bf =}\begin{bmatrix}\frac{\partial f_{1}({\bf \theta})}{\partial \theta_{i}}\\ \vdots\\ \frac{\partial f_{m}({\bf \theta})}{\partial \theta_{i}} \end{bmatrix}\in\mathbb{R}^{m\times1} \]

Figure 11.1: Dimensions of a vector-valued function
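As a quick numerical check of the column vector \(\partial {\bf f(\theta)}/\partial \theta_i\) above, a central-difference sketch; the fixed coefficients inside `f` are illustrative assumptions:

```python
import numpy as np

def f(theta):
    """Same two-component example as above, with fixed illustrative coefficients."""
    a = np.array([1.0, -2.0, 0.5])
    b = np.array([0.3, 1.5])
    return np.array([a @ theta, b[0] * theta[0]**2 + b[1] * theta[1]**3])

def partial_f(theta, i, eps=1e-6):
    """Central-difference estimate of the column vector d f(theta) / d theta_i."""
    e_i = np.zeros_like(theta)
    e_i[i] = eps
    return (f(theta + e_i) - f(theta - e_i)) / (2.0 * eps)

theta = np.array([1.0, 2.0, 3.0])
print(partial_f(theta, 0))   # partials w.r.t. theta_1, shape (2,)
```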

11.1.1 Jacobian

The collection of all first-order partial derivatives of a vector-valued function is called the Jacobian.
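Stacking the columns \(\partial {\bf f(\theta)}/\partial \theta_i\) for \(i=1,\dots,n\) gives the \(m\times n\) Jacobian matrix:

\[ J=\frac{\partial {\bf f(\theta)}}{\partial {\bf \theta}}=\begin{bmatrix} \frac{\partial f_{1}({\bf \theta})}{\partial \theta_{1}} & \cdots & \frac{\partial f_{1}({\bf \theta})}{\partial \theta_{n}}\\ \vdots & & \vdots\\ \frac{\partial f_{m}({\bf \theta})}{\partial \theta_{1}} & \cdots & \frac{\partial f_{m}({\bf \theta})}{\partial \theta_{n}} \end{bmatrix}\in\mathbb{R}^{m\times n} \]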

11.2 Gradient Descent

\[\min_{\theta} f(\theta)\]

  • \(\nabla_{\theta}f(\theta)\) gives the direction of steepest increase of \(f\). To decrease the value, we move in the \(-\nabla_{\theta}f(\theta)\) direction.

  • The gradient only gives a direction; how far should we step? Scale it by a step-size \(\gamma\): \(-\gamma \nabla_{\theta}f(\theta)\)

11.2.1 Vanilla gradient descent

\[\theta_{i+1}=\theta_{i}-\gamma \nabla_{\theta}f(\theta_i)\]
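A minimal numpy sketch of this update rule, using the toy objective \(f(\theta)=\tfrac{1}{2}\|\theta\|^2\); the objective, step-size, and iteration count are illustrative assumptions:

```python
import numpy as np

def grad_f(theta):
    # Gradient of the toy objective f(theta) = 0.5 * ||theta||^2.
    return theta

def vanilla_gradient_descent(theta0, gamma=0.1, n_iters=100):
    """theta_{i+1} = theta_i - gamma * grad f(theta_i)."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - gamma * grad_f(theta)
    return theta

print(vanilla_gradient_descent(np.array([5.0, -3.0])))   # approaches the minimizer [0, 0]
```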

11.2.2 Adaptive gradient descent

\[\theta_{i+1}=\theta_{i}-\gamma_i \nabla_{\theta}f(\theta_i)\]

  • \(\gamma_i\) is called the learning rate.
    • If the function value increases after a gradient step, the step-size was too large: undo the step and decrease the step-size.
    • If the function value decreases, the step could have been larger: accept it and try to increase the step-size (a minimal sketch of this heuristic follows the list).
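A sketch of the undo/adjust heuristic above on the same toy objective; the shrink and growth factors (0.5 and 1.1) are illustrative assumptions:

```python
import numpy as np

def f(theta):
    return 0.5 * float(theta @ theta)   # toy objective

def grad_f(theta):
    return theta

def adaptive_gradient_descent(theta0, gamma=1.0, n_iters=100):
    """Adjust the step-size gamma_i depending on whether the last step helped."""
    theta = theta0.copy()
    for _ in range(n_iters):
        proposal = theta - gamma * grad_f(theta)
        if f(proposal) > f(theta):
            # Step increased the objective: undo it and shrink the step-size.
            gamma *= 0.5
        else:
            # Step decreased the objective: accept it and grow the step-size.
            theta = proposal
            gamma *= 1.1
    return theta, gamma

theta_star, final_gamma = adaptive_gradient_descent(np.array([5.0, -3.0]))
print(theta_star, final_gamma)
```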

11.2.3 Gradient Descent With Momentum

As illustrated in Figure 7.3 of MML, the convergence of gradient descent may be very slow if the curvature of the optimization surface is such that there are regions that are poorly scaled.

\[\theta_{i+1}=\theta_{i}-\gamma_i \nabla_{\theta}f(\theta_i)+ \alpha \Delta \theta_i\]

where \(\Delta \theta_i = \theta_i - \theta_{i-1}\) is the previous update and \(\alpha \in [0,1]\) controls how much of it is carried over.
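A minimal sketch of the momentum update on the same toy objective; the values of \(\gamma\), \(\alpha\), and the iteration count are illustrative assumptions:

```python
import numpy as np

def grad_f(theta):
    return theta   # gradient of the toy objective 0.5 * ||theta||^2

def gradient_descent_momentum(theta0, gamma=0.1, alpha=0.9, n_iters=100):
    """theta_{i+1} = theta_i - gamma * grad f(theta_i) + alpha * delta_theta_i,
    where delta_theta_i = theta_i - theta_{i-1} is the previous step."""
    theta = theta0.copy()
    delta = np.zeros_like(theta)        # no previous step at i = 0
    for _ in range(n_iters):
        step = -gamma * grad_f(theta) + alpha * delta
        theta = theta + step
        delta = step                    # remember the step just taken
    return theta

print(gradient_descent_momentum(np.array([5.0, -3.0])))   # oscillates toward [0, 0]
```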

11.2.4 Stochastic Gradient Descent

\[ f(\theta)=\sum_{n=1}^N f(\theta | \bf x_n)\]

Vanilla gradient descent:

\[ \begin{eqnarray} \theta_{i+1} &=& \theta_{i}-\gamma \nabla_{\theta}f(\theta_i)\\ &=& \theta_{i}-\gamma \sum_{n=1}^N \nabla_{\theta}f(\theta_i|\bf x_n) \end{eqnarray}\]

  • N can be huge.

  • Randomly draw a mini-batch \(B_i\) from the training data and consider the following loss function instead:

\[f_{B_i}(\theta) =\sum_{\bf x_n\in B_{i}}f(\theta|\bf x_n)\]

Mini-batch Stochastic Gradient Descent:

\[\theta_{i+1}=\theta_{i}-\gamma \sum_{\bf x_n\in B_{i}} \nabla_{\theta}f(\theta_i|\bf x_n)\]

  • Batch Gradient Descent. Batch Size = Size of Training Set

  • Stochastic Gradient Descent. Batch Size = 1

  • Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

We divide the training set into several mini-batches and use them over successive iterations, as sketched below.

  • Epoch: one full pass through all the mini-batches, i.e., through the entire training set.
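A minimal sketch combining mini-batches and epochs on a toy least-squares problem; the data, model, batch size, and learning rate are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y_n = x_n @ theta_true + noise (illustrative).
N, D = 1000, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ theta_true + 0.1 * rng.normal(size=N)

def grad_batch(theta, Xb, yb):
    """Sum over the mini-batch of grad_theta 0.5 * (x_n @ theta - y_n)**2."""
    return Xb.T @ (Xb @ theta - yb)

def minibatch_sgd(theta0, gamma=0.001, batch_size=32, n_epochs=20):
    theta = theta0.copy()
    for _ in range(n_epochs):                  # one epoch = one pass over the data
        perm = rng.permutation(N)              # reshuffle each epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            theta = theta - gamma * grad_batch(theta, X[idx], y[idx])
    return theta

print(minibatch_sgd(np.zeros(D)))   # should end up close to theta_true
```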

11.2.5 Chain Rule

11.2.6 Flow of Data in Graphs

11.2.6.1 Forward propagation

11.2.6.2 Backward propagation

11.3 Applications