Chapter 4 Multi-Layer NN Notes

Similar to the chapter on single-layer NNs, this chapter lays out the notation & methodology for a multi-layer neural network.


source: https://arxiv.org/abs/1801.05894

“Deep Learning: An Introduction for Applied Mathematicians” by Catherine F. Higham and Desmond J. Higham, published in 2018


4.1 Notation Setup

4.1.1 Scalars

Layers: \(1\) through \(L\), indexed by \(l\)

Number of Neurons in layer \(l\): \(n_l\)

Neuron Activations: \(a^{(\text{layer num})}_{\text{neuron num}} = a^{(l)}_j\). Vector of activations for a layer is \(a^{(l)}\)

Activation Function: \(g(\cdot)\) is our generic activation function


4.1.2 X

We have our input matrix \(X \in \mathbb{R}^{\text{vars} \times \text{obs}} = \mathbb{R}^{n_0 \times m}\):

\[ X = \ ^{n_0 \text{ inputs}} \overbrace{ \begin{cases} \begin{bmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, m} \\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n_0, 1} & x_{n_0, 2} & \cdots & x_{n_0, m} \\ \end{bmatrix} \end{cases} }^{m \text{ obs}} \]

The \(i\)th observation of \(X\) is the \(i\)th column of \(X\), referenced as \(x_i\).
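To make the columns-as-observations convention concrete, here is a tiny NumPy sketch (the sizes are arbitrary and just for illustration):

```python
import numpy as np

n0, m = 4, 10                    # n_0 input variables, m observations (illustrative sizes)
X = np.random.randn(n0, m)       # X is (vars x obs), i.e. R^{n_0 x m}

i = 2                            # note: 0-based in code, 1-based in the math
x_i = X[:, [i]]                  # the i-th observation is the i-th column, kept as (n_0 x 1)
print(x_i.shape)                 # (4, 1)
```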


4.1.3 W

Our weight matrices \(W^{(l)} \in \mathbb{R}^{\text{out} \times \text{in}} = \mathbb{R}^{n_l \times n_{l - 1}}\):

\[ W^{(l)} = \ ^{n_l\text{ outputs}} \overbrace{ \begin{cases} \begin{bmatrix} w^{(l)}_{1, 1} & w^{(l)}_{1, 2} & \cdots & w^{(l)}_{1, n_{l-1}} \\ w^{(l)}_{2, 1} & w^{(l)}_{2, 2} & \cdots & w^{(l)}_{2, n_{l-1}} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(l)}_{n_l, 1} & w^{(l)}_{n_l, 2} & \cdots & w^{(l)}_{n_l, n_{l-1}} \end{bmatrix} \end{cases} }^{n_{l - 1} \text{ inputs}} \]

\(W^{(l)}\) is the weight matrix for the \(l\)th layer


4.1.4 b

Our bias matrices \(b^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}\):

\[ b^{(l)} = \ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} b^{(l)}_{1} \\ b^{(l)}_{2} \\ \vdots \\ b^{(l)}_{n_l} \end{bmatrix} \end{cases} \]

\(b^{(l)}\) is the bias matrix for the \(l\)th layer


4.1.5 Y

Our target matrix \(Y \in \mathbb{R}^{\text{cats} \times \text{obs}} = \mathbb{R}^{n_L \times m}\):

\[ Y = \ ^{n_L \text{ categories}} \overbrace{ \begin{cases} \begin{bmatrix} y_{1, 1} & y_{1, 2} & \cdots & y_{1, m} \\ y_{2, 1} & y_{2, 2} & \cdots & y_{2, m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n_L, 1} & y_{n_L, 2} & \cdots & y_{n_L, m} \end{bmatrix} \end{cases} }^{m \text{ obs}} \]

Similar to \(X\), the \(i\)th observation of \(Y\) is the \(i\)th column of \(Y\), referenced as \(y_i\).


4.1.6 z

Our neuron layer’s activation-function input \(z^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}\):

\[ z^{(l)} = \ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} z^{(l)}_{1} \\ z^{(l)}_{2} \\ \vdots \\ z^{(l)}_{n_l} \end{bmatrix} \end{cases} \]

\(z^{(l)}\) is the neuron ‘weighted input’ matrix for the \(l\)th layer

We have that:

\[ \begin{aligned} z^{(l)} &= W^{(l)} * a^{(l - 1)} + b^{(l)} \\ \\ &= \ ^{n_l\text{ outputs}} \overbrace{ \begin{cases} \begin{bmatrix} w^{(l)}_{1, 1} & w^{(l)}_{1, 2} & \cdots & w^{(l)}_{1, n_{l-1}} \\ w^{(l)}_{2, 1} & w^{(l)}_{2, 2} & \cdots & w^{(l)}_{2, n_{l-1}} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(l)}_{n_l, 1} & w^{(l)}_{n_l, 2} & \cdots & w^{(l)}_{n_l, n_{l-1}} \end{bmatrix} \end{cases} }^{n_{l - 1} \text{ inputs}} * \ ^{n_{l - 1} \text{ inputs}} \begin{cases} \begin{bmatrix} a^{(l-1)}_{1} \\ a^{(l-1)}_{2} \\ \vdots \\ a^{(l-1)}_{n_{l - 1}} \end{bmatrix} \end{cases} + \ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} b^{(l)}_{1} \\ b^{(l)}_{2} \\ \vdots \\ b^{(l)}_{n_l} \end{bmatrix} \end{cases} \\ \\ &= \ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} z^{(l)}_{1} \\ z^{(l)}_{2} \\ \vdots \\ z^{(l)}_{n_l} \end{bmatrix} \end{cases} \end{aligned} \]


4.1.7 a

Our neuron activation \(a^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}\):

\[ a^{(l)} = \ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} a^{(l)}_{1} \\ a^{(l)}_{2} \\ \vdots \\ a^{(l)}_{n_l} \end{bmatrix} \end{cases} \]

\(a^{(l)}\) is the activation matrix for the \(l\)th layer

We have that:

\[ \begin{aligned} a^{(l)} &= g\left(z^{(l)}\right) \\ \\ &= g\left(W^{(l)} * a^{(l - 1)} + b^{(l)}\right) \\ \\ &= g\left(\ ^{n_l\text{ outputs}} \overbrace{ \begin{cases} \begin{bmatrix} w^{(l)}_{1, 1} & w^{(l)}_{1, 2} & \cdots & w^{(l)}_{1, n_{l-1}} \\ w^{(l)}_{2, 1} & w^{(l)}_{2, 2} & \cdots & w^{(l)}_{2, n_{l-1}} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(l)}_{n_l, 1} & w^{(l)}_{n_l, 2} & \cdots & w^{(l)}_{n_l, n_{l-1}} \end{bmatrix} \end{cases} }^{n_{l - 1} \text{ inputs}} * \ ^{n_{l - 1} \text{ inputs}} \begin{cases} \begin{bmatrix} a^{(l-1)}_{1} \\ a^{(l-1)}_{2} \\ \vdots \\ a^{(l-1)}_{n_{l - 1}} \end{bmatrix} \end{cases} + \ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} b^{(l)}_{1} \\ b^{(l)}_{2} \\ \vdots \\ b^{(l)}_{n_l} \end{bmatrix} \end{cases}\right) \\ \\ &= g\left(\ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} z^{(l)}_{1} \\ z^{(l)}_{2} \\ \vdots \\ z^{(l)}_{n_l} \end{bmatrix} \end{cases}\right) \\ \\ &= \ ^{n_l\text{ outputs}} \begin{cases} \begin{bmatrix} a^{(l)}_{1} \\ a^{(l)}_{2} \\ \vdots \\ a^{(l)}_{n_l} \end{bmatrix} \end{cases} \end{aligned} \]
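As a quick check on these shapes, here is a minimal NumPy sketch of one layer’s \(z^{(l)}\) and \(a^{(l)}\); the sigmoid is just an assumed choice of \(g\), and the sizes are illustrative:

```python
import numpy as np

def g(z):                              # assumed activation: sigmoid
    return 1.0 / (1.0 + np.exp(-z))

n_prev, n_l = 3, 5                     # n_{l-1} inputs, n_l outputs (illustrative sizes)
W = np.random.randn(n_l, n_prev)       # W^{(l)}:   (n_l x n_{l-1})
b = np.random.randn(n_l, 1)            # b^{(l)}:   (n_l x 1)
a_prev = np.random.randn(n_prev, 1)    # a^{(l-1)}: (n_{l-1} x 1)

z = W @ a_prev + b                     # z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
a = g(z)                               # a^{(l)} = g(z^{(l)})
print(z.shape, a.shape)                # (5, 1) (5, 1)
```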

4.2 Forward Propagation

4.2.1 Setup

A single neuron’s activation is a weighted sum of all the activations of the previous layer, plus a constant (the bias), all fed into the activation function. Formally, this is:

\[a^{(l)}_j = g\left(\sum_{i = 1}^{n_{l - 1}} w^{(l)}_{j, i} * a^{(l - 1)}_{i} + b^{(l)}_{j}\right)\]

We can put this in matrix form. An entire layer of neurons can be represented by:

\[a^{(l)} = g\left(z^{(l)}\right) = g\left(W^{(l)} * a^{(l - 1)} + b^{(l)}\right)\]

as was shown above. We can repeatedly apply this formula to get from \(X\) to our predicted \(\hat Y = a^{(L)}\), starting with the initial layer (layer 0) set equal to \(x_i\).

Note that we will be forward (& backward) propagating one observation of \(X\) at a time by operating on each column separately. However, if desired, forward (& backward) propagation can be done on all observations simultaneously. The notation change would involve stretching out \(a^{(l)}\), \(z^{(l)}\), and \(b^{(l)}\) so that they are each \(m\) columns wide:

\[ \begin{aligned} a^{(l)} &= g\left(z^{(l)}\right) \\ \\ &= g\left(W^{(l)} * a^{(l - 1)} + b^{(l)}\right) \\ \\ &= g(\ ^{n_l\text{ outputs}} \overbrace{ \begin{cases} \begin{bmatrix} w^{(l)}_{1, 1} & w^{(l)}_{1, 2} & \cdots & w^{(l)}_{1, n_{l-1}} \\ w^{(l)}_{2, 1} & w^{(l)}_{2, 2} & \cdots & w^{(l)}_{2, n_{l-1}} \\ \vdots & \vdots & \ddots & \vdots \\ w^{(l)}_{n_l, 1} & w^{(l)}_{n_l, 2} & \cdots & w^{(l)}_{n_l, n_{l-1}} \end{bmatrix} \end{cases} }^{n_{l - 1} \text{ inputs}} * \ ^{n_{l - 1} \text{ inputs}} \overbrace{ \begin{cases} \begin{bmatrix} a^{(l - 1)}_{1, 1} & a^{(l - 1)}_{1, 2} & \cdots & a^{(l - 1)}_{1, m} \\ a^{(l - 1)}_{2, 1} & a^{(l - 1)}_{2, 2} & \cdots & a^{(l - 1)}_{2, m} \\ \vdots & \vdots & \ddots & \vdots \\ a^{(l - 1)}_{n_{l - 1}, 1} & a^{(l - 1)}_{n_{l - 1}, 2} & \cdots & a^{(l - 1)}_{n_{l - 1}, m} \\ \end{bmatrix} \end{cases} }^{m \text{ obs}} \\ &+ \ ^{n_l \text{ outputs}} \overbrace{ \begin{cases} \begin{bmatrix} - & b^{(l)}_{1} & - \\ - & b^{(l)}_{2} & - \\ \vdots & \vdots & \vdots \\ - & b^{(l)}_{n_l} & - \end{bmatrix} \end{cases} }^{m \text{ obs}}) \\ \\ &= g\left(\ ^{n_l \text{ outputs}} \overbrace{ \begin{cases} \begin{bmatrix} z^{(l)}_{1, 1} & z^{(l)}_{1, 2} & \cdots & z^{(l)}_{1, m} \\ z^{(l)}_{2, 1} & z^{(l)}_{2, 2} & \cdots & z^{(l)}_{2, m} \\ \vdots & \vdots & \ddots & \vdots \\ z^{(l)}_{n_l, 1} & z^{(l)}_{n_l, 2} & \cdots & z^{(l)}_{n_l, m} \\ \end{bmatrix} \end{cases} }^{m \text{ obs}}\right) \\ \\ &= \ ^{n_l \text{ outputs}} \overbrace{ \begin{cases} \begin{bmatrix} a^{(l)}_{1, 1} & a^{(l)}_{1, 2} & \cdots & a^{(l)}_{1, m} \\ a^{(l)}_{2, 1} & a^{(l)}_{2, 2} & \cdots & a^{(l)}_{2, m} \\ \vdots & \vdots & \ddots & \vdots \\ a^{(l)}_{n_l, 1} & a^{(l)}_{n_l, 2} & \cdots & a^{(l)}_{n_l, m} \\ \end{bmatrix} \end{cases} }^{m \text{ obs}} \end{aligned} \]

Each column of \(a^{(l)}\) and \(z^{(l)}\) represents an observation and can hold unique values, while \(b^{(l)}\) is merely repeated to be \(m\) wide; each row repeats the same bias value for its neuron.
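A minimal NumPy sketch of this batched variant, where broadcasting handles the “stretching” of \(b^{(l)}\) across the \(m\) columns (again, the sigmoid and the sizes are just assumptions):

```python
import numpy as np

def g(z):                                  # assumed activation: sigmoid
    return 1.0 / (1.0 + np.exp(-z))

n_prev, n_l, m = 3, 5, 10                  # illustrative sizes
W = np.random.randn(n_l, n_prev)           # W^{(l)}:   (n_l x n_{l-1})
b = np.random.randn(n_l, 1)                # b^{(l)}:   (n_l x 1)
A_prev = np.random.randn(n_prev, m)        # a^{(l-1)} for all m observations: (n_{l-1} x m)

Z = W @ A_prev + b                         # broadcasting repeats b across the m columns
A = g(Z)                                   # column i is a^{(l)} for observation i
print(Z.shape, A.shape)                    # (5, 10) (5, 10)
```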

We are sticking with one observation at a time for its simplicity, and because it makes the back-propagation linear algebra easier and cleaner.

4.2.2 Algorithm

For a given observation \(x_i\):

  1. set \(a^{(0)} = x_i\)
  2. For each \(l\) from 1 up to \(L\):
    • \(z^{(l)} = W^{(l)} a^{(l - 1)} + b^{(l)}\)
    • \(a^{(l)} = g\left(z^{(l)}\right)\)
    • \(D^{(l)} = \text{diag} \left[g'\left(z^{(l)}\right)\right]\)
      • this term will be needed later

If \(Y\) happens to be categorical, we may choose to apply the softmax function (\(\frac{e^{z_i}}{\sum_j e^{z_j}}\)) to \(a^{(L)}\). Otherwise, we are done! We have our estimated result \(a^{(L)}\).
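Here is a minimal sketch of the forward-pass algorithm above in NumPy, for one observation \(x_i\). The sigmoid activation, the layer sizes, the `forward` helper name, and the optional softmax step are all assumptions for illustration:

```python
import numpy as np

def g(z):                                    # assumed activation: sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):                              # derivative of the assumed sigmoid
    s = g(z)
    return s * (1.0 - s)

def softmax(z):                              # optional transform for categorical Y
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

def forward(x_i, Ws, bs, use_softmax=False):
    """Ws[l-1], bs[l-1] hold W^{(l)}, b^{(l)} for layers l = 1, ..., L (0-based lists)."""
    a = x_i                                  # step 1: a^{(0)} = x_i
    As, Zs, Ds = [a], [], []
    for W, b in zip(Ws, bs):                 # step 2: l = 1, ..., L
        z = W @ a + b                        #   z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
        a = g(z)                             #   a^{(l)} = g(z^{(l)})
        D = np.diag(g_prime(z).ravel())      #   D^{(l)} = diag[g'(z^{(l)})], kept for later
        Zs.append(z); As.append(a); Ds.append(D)
    if use_softmax:                          # optional: softmax applied to a^{(L)}
        As[-1] = softmax(As[-1])
    return As, Zs, Ds

# illustrative 2-3-2 network, one observation x_i
Ws = [np.random.randn(3, 2), np.random.randn(2, 3)]
bs = [np.random.randn(3, 1), np.random.randn(2, 1)]
As, Zs, Ds = forward(np.random.randn(2, 1), Ws, bs)
print(As[-1].shape)                          # (2, 1): our estimate a^{(L)}
```

The \(D^{(l)}\) matrices are stored only so that the back-propagation sketch at the end of 4.3.3 can reuse them.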

4.3 Backward Propagation

Recall that we are trying to minimize a cost function via gradient descent by iteratively updating our parameter vector \(\theta\): \(\theta^{t + 1} \leftarrow \theta^{t} - \rho * \nabla\mathcal{C}\left(\theta^{t}\right)\). We will now implement this.

To do so, there is one more useful variable we need to define: \(\delta^{(l)}\)

4.3.1 Delta

We define \(\delta^{(l)}_j := \frac{\partial \mathcal{C}}{\partial z^{(l)}_j}\) for a particular neuron, and its vector form \(\delta^{(l)}\) represents the whole layer.

\(\delta^{(l)}\) allows us to back-propagate one layer at a time by expressing the gradients of earlier layers in terms of those of later layers. In particular:

\[\delta^{(l)} = \text{diag} \left[g'\left(z^{(l)}\right)\right] * \left(W^{(l + 1)}\right)^T * \delta^{(l + 1)}\]

The derivation is in the linked paper, so I won’t go over it in full here.


In short, \(z^{(l + 1)} = W^{(l + 1)} * g\left(z^{(l)}\right) + b^{(l + 1)}\); so, \(\delta^{(l)}\) is related to \(\delta^{(l + 1)}\) via the chain rule:

\[\delta^{(l)} = \frac{\partial \mathcal{C}}{\partial z^{(l)}} = \underbrace{\frac{\partial g}{\partial z^{(l)}}}_{g'\left(z^{(l)}\right)} * \underbrace{\left(\frac{\partial z^{(l + 1)}}{\partial g}\right)^T}_{\left(W^{(l + 1)}\right)^T} * \underbrace{\frac{\partial \mathcal{C}}{\partial z^{(l + 1)}}}_{\delta^{(l + 1)}}\]

[eventually, add in a write-up on why the transpose of \(W\) is taken. In short, each row of \(\left(W^{(l + 1)}\right)^T\) holds the weights by which one neuron’s output is distributed across the next layer’s neurons, and that row is dotted with the next layer’s \(\delta^{(l + 1)}\)]


Note that we scale \(\left(W^{(l + 1)}\right)^T * \delta^{(l + 1)}\) by \(g'\left(z^{(l)}\right)\), which we do by multiplying on the left by:

\[\text{diag} \left[g'\left(z^{(l)}\right)\right] = \begin{bmatrix} g'\left(z^{(l)}_1\right) & & & \\ & g'\left(z^{(l)}_2\right) & & \\ & & \ddots & \\ & & & g'\left(z^{(l)}_{n_l}\right) \end{bmatrix}\]

This has the same effect as element-wise multiplication.

For shorthand, we define \(D^{(l)} = \text{diag} \left[g'\left(z^{(l)}\right)\right]\)
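A quick NumPy check that left-multiplying by \(D^{(l)}\) matches element-wise multiplication by \(g'\left(z^{(l)}\right)\) (sigmoid assumed for \(g\), arbitrary illustrative vectors):

```python
import numpy as np

def g_prime(z):                         # derivative of an assumed sigmoid activation
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.random.randn(4, 1)               # stands in for z^{(l)}
v = np.random.randn(4, 1)               # e.g. (W^{(l+1)})^T * delta^{(l+1)}

D = np.diag(g_prime(z).ravel())         # D^{(l)} = diag[g'(z^{(l)})]
print(np.allclose(D @ v, g_prime(z) * v))   # True: same as element-wise multiplication
```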

4.3.2 Gradients

Given \(\delta^{(l)}\), it becomes simple to write down our gradients:

\[ \begin{aligned} \delta^{(L)} &= D^{(L)} * \frac{\partial \mathcal{C}}{\partial a^{(L)}} & \text{(a)} \\ \\ \delta^{(l)} &= D^{(l)} * \left(W^{(l + 1)}\right)^T * \delta^{(l + 1)} & \text{(b)} \\ \\ \frac{\partial \mathcal{C}}{\partial b^{(l)}} &= \delta^{(l)} & \text{(c)} \\ \\ \frac{\partial \mathcal{C}}{\partial W^{(l)}} &= \delta^{(l)} * \left(a^{(l - 1)}\right)^T & \text{(d)} \end{aligned} \]

The proofs of these are in the linked paper. (I could add a bit of intuitive explanation here; I want to get better visualizations of the chain rule first, though, because I bet we could get something neat with neuron & derivative visualizations.)

(we can also write these out in the expanded matrix view, as above)

For the squared-error loss function \(\mathcal{C}(\theta) = \frac{1}{2} \left\lVert y - a^{(L)} \right\rVert^2\), we would have \(\frac{\partial \mathcal{C}}{\partial a^{(L)}} = a^{(L)} - y\). [find out what this is for log-loss :) softmax too?]
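As a small sanity check of the squared-error case, here is a finite-difference sketch comparing the numerical gradient against \(a^{(L)} - y\) (the vectors are arbitrary stand-ins):

```python
import numpy as np

def cost(a, y):                           # C = 1/2 * ||y - a||^2
    return 0.5 * np.sum((y - a) ** 2)

a = np.random.randn(3, 1)                 # stands in for a^{(L)}
y = np.random.randn(3, 1)

# numerical gradient of C with respect to each entry of a (central differences)
eps = 1e-6
num_grad = np.zeros_like(a)
for j in range(a.shape[0]):
    e = np.zeros_like(a); e[j] = eps
    num_grad[j] = (cost(a + e, y) - cost(a - e, y)) / (2 * eps)

print(np.allclose(num_grad, a - y))       # True: matches a^{(L)} - y
```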

4.3.3 Algorithm

For a given observation \(x_i\):

  1. set \(\delta^{(L)} = D^{(L)} * \frac{\partial \mathcal{C}}{\partial a^{(L)}}\)
  2. For each \(l\) from \((L - 1)\) down to 1:
    • \(\delta^{(l)} = D^{(l)} * \left(W^{(l + 1)}\right)^T * \delta^{(l + 1)}\)
  3. For each \(l\) from \(L\) down to 1:
    • \(W^{(l)} \leftarrow W^{(l)} - \rho * \delta^{(l)} * \left(a^{(l - 1)}\right)^T\)
    • \(b^{(l)} \leftarrow b^{(l)} - \rho * \delta^{(l)}\)
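Putting 4.2.2 and 4.3.3 together, here is a minimal sketch of one full forward/backward pass and gradient-descent update for a single observation. The sigmoid activation, the squared-error cost, the layer sizes, and the helper names (`forward`, `backward_step`) are all assumptions for illustration, not the paper’s implementation:

```python
import numpy as np

def g(z):                                    # assumed activation: sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):                              # derivative of the assumed sigmoid
    s = g(z)
    return s * (1.0 - s)

def forward(x_i, Ws, bs):
    """Forward pass (4.2.2): returns the activations a^{(l)} and the D^{(l)} matrices."""
    a = x_i                                  # a^{(0)} = x_i
    As, Ds = [a], []
    for W, b in zip(Ws, bs):                 # l = 1, ..., L
        z = W @ a + b                        # z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
        a = g(z)                             # a^{(l)} = g(z^{(l)})
        Ds.append(np.diag(g_prime(z).ravel()))   # D^{(l)} = diag[g'(z^{(l)})]
        As.append(a)
    return As, Ds

def backward_step(x_i, y_i, Ws, bs, rho):
    """One back-propagation + gradient-descent update (4.3.3), squared-error cost assumed."""
    As, Ds = forward(x_i, Ws, bs)
    L = len(Ws)

    # step 1: delta^{(L)} = D^{(L)} * dC/da^{(L)}, with dC/da^{(L)} = a^{(L)} - y
    deltas = [None] * L
    deltas[L - 1] = Ds[L - 1] @ (As[L] - y_i)

    # step 2: delta^{(l)} = D^{(l)} (W^{(l+1)})^T delta^{(l+1)}, for l = L-1 down to 1
    for l in range(L - 2, -1, -1):
        deltas[l] = Ds[l] @ Ws[l + 1].T @ deltas[l + 1]

    # step 3: gradient-descent updates for every layer, using gradients (c) and (d)
    for l in range(L):
        Ws[l] -= rho * deltas[l] @ As[l].T   # W^{(l)} <- W^{(l)} - rho * delta^{(l)} (a^{(l-1)})^T
        bs[l] -= rho * deltas[l]             # b^{(l)} <- b^{(l)} - rho * delta^{(l)}
    return Ws, bs

# illustrative 2-3-2 network and one observation
Ws = [np.random.randn(3, 2), np.random.randn(2, 3)]
bs = [np.random.randn(3, 1), np.random.randn(2, 1)]
x_i = np.random.randn(2, 1)
y_i = np.random.randn(2, 1)
Ws, bs = backward_step(x_i, y_i, Ws, bs, rho=0.1)
```

In practice this update would be looped over the observations (and over many epochs); the \(D^{(l)}\) matrices come straight from the forward pass, as noted in 4.2.2.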