Chapter 4 Multi-Layer NN Notes
Similar to the chapter on single-layer NNs, this chapter lays out notation & methodology for a multi-layer neural network.
source: https://arxiv.org/abs/1801.05894
“Deep Learning: An Introduction for Applied Mathematicians” by Catherine F. Higham and Desmond J. Higham, published in 2018
4.1 Notation Setup
4.1.1 Scalars
Layers: $1$ through $L$, indexed by $l$
Number of neurons in layer $l$: $n_l$
Neuron activations: $a^{(\text{layer num})}_{\text{neuron num}} = a^{(l)}_j$. The vector of activations for a layer is $a^{(l)}$
Activation function: $g(\cdot)$ is our generic activation function
4.1.2 X
We have our input matrix $X \in \mathbb{R}^{\text{vars} \times \text{obs}} = \mathbb{R}^{n_0 \times m}$:

$$
X = \overbrace{\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n_0,1} & x_{n_0,2} & \cdots & x_{n_0,m}
\end{bmatrix}}^{m \text{ obs}}
\;\Bigg\}\; n_0 \text{ inputs}
$$
The $i$th observation of $X$ is the $i$th column of $X$, referenced as $x_i$.
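As a quick sanity check on the layout, here is a minimal NumPy sketch (the sizes $n_0 = 3$, $m = 5$ and the random data are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: n_0 = 3 input variables, m = 5 observations.
n0, m = 3, 5
rng = np.random.default_rng(0)

# X is (n_0 x m): each COLUMN is one observation.
X = rng.standard_normal((n0, m))

# The i-th observation x_i is the i-th column of X.
x_0 = X[:, 0]        # first observation, shape (n_0,)
x_0_col = X[:, [0]]  # same observation kept as an (n_0, 1) column vector
```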
4.1.3 W
Our weight matrices $W^{(l)} \in \mathbb{R}^{\text{out} \times \text{in}} = \mathbb{R}^{n_l \times n_{l-1}}$:

$$
W^{(l)} = \overbrace{\begin{bmatrix}
w^{(l)}_{1,1} & w^{(l)}_{1,2} & \cdots & w^{(l)}_{1,n_{l-1}} \\
w^{(l)}_{2,1} & w^{(l)}_{2,2} & \cdots & w^{(l)}_{2,n_{l-1}} \\
\vdots & \vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & w^{(l)}_{n_l,2} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}}^{n_{l-1} \text{ inputs}}
\;\Bigg\}\; n_l \text{ outputs}
$$
$W^{(l)}$ is the weight matrix for the $l$th layer.
4.1.4 b
Our bias vectors $b^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
b^{(l)} = \begin{bmatrix}
b^{(l)}_1 \\ b^{(l)}_2 \\ \vdots \\ b^{(l)}_{n_l}
\end{bmatrix}
\;\Bigg\}\; n_l \text{ outputs}
$$

$b^{(l)}$ is the bias vector for the $l$th layer.
4.1.5 Y
Our target matrix $Y \in \mathbb{R}^{\text{cats} \times \text{obs}} = \mathbb{R}^{n_L \times m}$:

$$
Y = \overbrace{\begin{bmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,m} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n_L,1} & y_{n_L,2} & \cdots & y_{n_L,m}
\end{bmatrix}}^{m \text{ obs}}
\;\Bigg\}\; n_L \text{ categories}
$$

Similar to $X$, the $i$th observation of $Y$ is the $i$th column of $Y$, referenced as $y_i$.
4.1.6 z
Our neuron layer’s activation-function input $z^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
z^{(l)} = \begin{bmatrix}
z^{(l)}_1 \\ z^{(l)}_2 \\ \vdots \\ z^{(l)}_{n_l}
\end{bmatrix}
\;\Bigg\}\; n_l \text{ outputs}
$$

$z^{(l)}$ is the neuron ‘weighted input’ vector for the $l$th layer.

We have that:

$$
z^{(l)} = W^{(l)} \ast a^{(l-1)} + b^{(l)} =
\begin{bmatrix}
w^{(l)}_{1,1} & \cdots & w^{(l)}_{1,n_{l-1}} \\
\vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\ast
\begin{bmatrix}
a^{(l-1)}_1 \\ \vdots \\ a^{(l-1)}_{n_{l-1}}
\end{bmatrix}
+
\begin{bmatrix}
b^{(l)}_1 \\ \vdots \\ b^{(l)}_{n_l}
\end{bmatrix}
=
\begin{bmatrix}
z^{(l)}_1 \\ \vdots \\ z^{(l)}_{n_l}
\end{bmatrix}
$$
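In NumPy this is a single matrix-vector product plus the bias; a minimal sketch with made-up sizes ($n_{l-1} = 4$ inputs feeding $n_l = 3$ neurons):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 4, 3

W = rng.standard_normal((n_out, n_in))   # W^(l), shape (n_l, n_{l-1})
a_prev = rng.standard_normal((n_in, 1))  # a^(l-1), one observation as a column
b = rng.standard_normal((n_out, 1))      # b^(l)

z = W @ a_prev + b                       # z^(l), shape (n_l, 1)
```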
4.1.7 a
Our neuron activations $a^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
a^{(l)} = \begin{bmatrix}
a^{(l)}_1 \\ a^{(l)}_2 \\ \vdots \\ a^{(l)}_{n_l}
\end{bmatrix}
\;\Bigg\}\; n_l \text{ outputs}
$$

$a^{(l)}$ is the activation vector for the $l$th layer.

We have that:

$$
a^{(l)} = g\!\left(z^{(l)}\right) = g\!\left(W^{(l)} \ast a^{(l-1)} + b^{(l)}\right)
= g\!\left(
\begin{bmatrix}
w^{(l)}_{1,1} & \cdots & w^{(l)}_{1,n_{l-1}} \\
\vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\ast
\begin{bmatrix}
a^{(l-1)}_1 \\ \vdots \\ a^{(l-1)}_{n_{l-1}}
\end{bmatrix}
+
\begin{bmatrix}
b^{(l)}_1 \\ \vdots \\ b^{(l)}_{n_l}
\end{bmatrix}
\right)
= g\!\left(
\begin{bmatrix}
z^{(l)}_1 \\ \vdots \\ z^{(l)}_{n_l}
\end{bmatrix}
\right)
=
\begin{bmatrix}
a^{(l)}_1 \\ \vdots \\ a^{(l)}_{n_l}
\end{bmatrix}
$$
4.2 Forward Propagation
4.2.1 Setup
For a single neuron, its activation is a weighted sum of all the activations of the previous layer, plus a constant, all fed into the activation function. Formally, this is:

$$
a^{(l)}_j = g\!\left(\sum_{i=1}^{n_{l-1}} w^{(l)}_{j,i} \ast a^{(l-1)}_i + b^{(l)}_j\right)
$$
We can put this in matrix form. An entire layer of neurons can be represented by:

$$
a^{(l)} = g\!\left(z^{(l)}\right) = g\!\left(W^{(l)} \ast a^{(l-1)} + b^{(l)}\right)
$$

as was shown above. We can repeatedly apply this formula to get from $X$ to our predicted $\hat{Y} = a^{(L)}$. We start with the initial layer (layer 0) set equal to the observation: $a^{(0)} = x_i$.
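To convince ourselves the per-neuron sum and the matrix form agree, here is a small check (sigmoid as a stand-in $g$ and the random sizes are assumptions for illustration):

```python
import numpy as np

def g(z):
    # Sigmoid as a stand-in generic activation function (an assumed choice).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_out = 5, 3
W = rng.standard_normal((n_out, n_in))
a_prev = rng.standard_normal(n_in)
b = rng.standard_normal(n_out)

# Per-neuron form: a_j = g( sum_i w_{j,i} * a_i + b_j )
a_per_neuron = np.array([
    g(sum(W[j, i] * a_prev[i] for i in range(n_in)) + b[j])
    for j in range(n_out)
])

# Matrix form: a = g(W a_prev + b)
a_matrix = g(W @ a_prev + b)
```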
Note that we will be forward (& backward) propagating one observation of $X$ at a time by operating on each column separately. However, if desired, forward (& backward) propagation can be done on all observations simultaneously. The notation change would involve stretching out $a^{(l)}$, $z^{(l)}$, and $b^{(l)}$ so that they are each $m$ wide:

$$
a^{(l)} = g\!\left(z^{(l)}\right)
= g\!\left(
W^{(l)}
\ast
\overbrace{\begin{bmatrix}
a^{(l-1)}_{1,1} & \cdots & a^{(l-1)}_{1,m} \\
\vdots & \ddots & \vdots \\
a^{(l-1)}_{n_{l-1},1} & \cdots & a^{(l-1)}_{n_{l-1},m}
\end{bmatrix}}^{m \text{ obs}}
+
\overbrace{\begin{bmatrix}
b^{(l)}_1 & \cdots & b^{(l)}_1 \\
\vdots & \ddots & \vdots \\
b^{(l)}_{n_l} & \cdots & b^{(l)}_{n_l}
\end{bmatrix}}^{m \text{ obs}}
\right)
= g\!\left(
\overbrace{\begin{bmatrix}
z^{(l)}_{1,1} & \cdots & z^{(l)}_{1,m} \\
\vdots & \ddots & \vdots \\
z^{(l)}_{n_l,1} & \cdots & z^{(l)}_{n_l,m}
\end{bmatrix}}^{m \text{ obs}}
\right)
=
\overbrace{\begin{bmatrix}
a^{(l)}_{1,1} & \cdots & a^{(l)}_{1,m} \\
\vdots & \ddots & \vdots \\
a^{(l)}_{n_l,1} & \cdots & a^{(l)}_{n_l,m}
\end{bmatrix}}^{m \text{ obs}}
$$

Each column of $a^{(l)}$ and $z^{(l)}$ represents an observation and can hold unique values, while $b^{(l)}$ is merely repeated to be $m$ wide; each row holds the same bias value for its neuron.
We are sticking with one observation at a time for its simplicity; it also makes the back-propagation linear algebra easier and cleaner.
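NumPy broadcasting gives the "stretched out" bias for free: adding an $(n_l, 1)$ column to an $(n_l, m)$ matrix repeats the column across all $m$ observations. A sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out, m = 4, 3, 6

W = rng.standard_normal((n_out, n_in))
A_prev = rng.standard_normal((n_in, m))  # each column is one observation
b = rng.standard_normal((n_out, 1))

# b broadcasts from (n_out, 1) to (n_out, m), i.e. repeated m wide.
Z = W @ A_prev + b
```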
4.2.2 Algorithm
For a given observation $x_i$:
- Set $a^{(0)} = x_i$
- For each $l$ from $1$ up to $L$:
  - $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
  - $a^{(l)} = g\!\left(z^{(l)}\right)$
  - $D^{(l)} = \operatorname{diag}\!\left[g'(z^{(l)})\right]$
    - this term will be needed later
If $Y$ happens to be categorical, we may choose to apply the softmax function $\left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right)$ to $a^{(L)}$. Otherwise, we are done! We have our estimated result $a^{(L)}$.
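The loop above can be sketched in NumPy; the sigmoid activation, the layer sizes, and the random initialization are all assumptions for illustration:

```python
import numpy as np

def g(z):
    # Sigmoid as a stand-in generic activation (an assumed choice).
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)

def forward(x_i, weights, biases):
    """weights[l-1], biases[l-1] hold W^(l), b^(l) for l = 1..L."""
    a = x_i.reshape(-1, 1)                  # a^(0) = x_i
    activations, zs, Ds = [a], [], []
    for W, b in zip(weights, biases):
        z = W @ a + b                       # z^(l) = W^(l) a^(l-1) + b^(l)
        a = g(z)                            # a^(l) = g(z^(l))
        zs.append(z)
        Ds.append(np.diagflat(g_prime(z)))  # D^(l) = diag[g'(z^(l))], kept for later
        activations.append(a)
    return activations, zs, Ds

rng = np.random.default_rng(4)
sizes = [3, 4, 2]  # made-up architecture: n_0, n_1, n_2
weights = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
biases = [rng.standard_normal((sizes[l + 1], 1)) for l in range(2)]

acts, zs, Ds = forward(rng.standard_normal(3), weights, biases)
```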
4.3 Backward Propagation
Recall that we are trying to minimize a cost function via gradient descent by iterating over our parameter vector $\theta$: $\theta_{t+1} \leftarrow \theta_t - \rho \ast \nabla C(\theta_t)$. We will now implement this.

To do so, there is one more useful variable we need to define: $\delta^{(l)}$
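The update rule itself is one line of code; here it is sketched on a toy quadratic cost $C(\theta) = \frac{1}{2}\lVert\theta\rVert^2$ (whose gradient is just $\theta$), where the cost, step size, and starting point are all made up for illustration:

```python
import numpy as np

rho = 0.1                           # step size (an assumed value)
theta = np.array([1.0, -2.0, 3.0])  # arbitrary starting parameters

for _ in range(100):
    grad = theta                    # grad C(theta) for C = 0.5 * ||theta||^2
    theta = theta - rho * grad      # theta_{t+1} <- theta_t - rho * grad C
```

Each step shrinks $\theta$ by a factor $(1 - \rho)$, so it converges toward the minimizer $\theta = 0$.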
4.3.1 Delta
We define $\delta^{(l)}_j := \frac{\partial C}{\partial z^{(l)}_j}$ for a particular neuron, and its vector form $\delta^{(l)}$ represents the whole layer.

$\delta^{(l)}$ allows us to back-propagate one layer at a time by defining the gradients of the earlier layers from those of the later layers. In particular:

$$
\delta^{(l)} = \operatorname{diag}\!\left[g'(z^{(l)})\right] \ast (W^{(l+1)})^T \ast \delta^{(l+1)}
$$
The derivation is in the linked paper, so I won’t go over it in full here.

In short, $z^{(l+1)} = W^{(l+1)} \ast g(z^{(l)}) + b^{(l+1)}$; so, $\delta^{(l)}$ is related to $\delta^{(l+1)}$ via the chain rule:

$$
\delta^{(l)} = \frac{\partial C}{\partial z^{(l)}}
= \underbrace{\operatorname{diag}\!\left[g'(z^{(l)})\right]}_{\partial g / \partial z^{(l)}}
\ast \underbrace{(W^{(l+1)})^T}_{\partial z^{(l+1)} / \partial g}
\ast \underbrace{\delta^{(l+1)}}_{\partial C / \partial z^{(l+1)}}
$$
[Eventually, add in a write-up on why the transpose of $W$ is taken. In short, it takes the dot product of each neuron’s output across the next layer’s neurons (in $(W^{(l+1)})^T$, each row is an input neuron being distributed across the next layer) with the next layer’s $\delta^{(l+1)}$.]
Note that we scale $\delta^{(l)}$ by $g'(z^{(l)})$, which we do by multiplying on the left by:

$$
\operatorname{diag}\!\left[g'(z^{(l)})\right] = \begin{bmatrix}
g'(z^{(l)}_1) & & & \\
& g'(z^{(l)}_2) & & \\
& & \ddots & \\
& & & g'(z^{(l)}_{n_l})
\end{bmatrix}
$$

This has the same effect as element-wise multiplication.

For shorthand, we define $D^{(l)} = \operatorname{diag}\!\left[g'(z^{(l)})\right]$
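The diag-equals-element-wise claim is easy to verify numerically (the vectors here are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(5)
gprime_z = rng.standard_normal(4)  # stand-in for g'(z^(l))
v = rng.standard_normal(4)         # stand-in for (W^(l+1))^T delta^(l+1)

via_diag = np.diagflat(gprime_z) @ v  # multiply on the left by diag[g'(z)]
via_elementwise = gprime_z * v        # element-wise (Hadamard) product
```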
4.3.2 Gradients
Given $\delta^{(l)}$, it becomes simple to write down our gradients:

$$
\begin{aligned}
\delta^{(L)} &= D^{(L)} \ast \frac{\partial C}{\partial a^{(L)}} && (a) \\
\delta^{(l)} &= D^{(l)} \ast (W^{(l+1)})^T \ast \delta^{(l+1)} && (b) \\
\frac{\partial C}{\partial b^{(l)}} &= \delta^{(l)} && (c) \\
\frac{\partial C}{\partial W^{(l)}} &= \delta^{(l)} \ast (a^{(l-1)})^T && (d)
\end{aligned}
$$
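Equations (a)–(d) can be sketched on a tiny two-layer network with the squared-error cost; the sigmoid activation and the random sizes are assumptions for illustration, and the result can be checked against a finite-difference gradient:

```python
import numpy as np

def g(z):
    # Sigmoid as a stand-in activation (an assumed choice).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((2, 4)); b2 = rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1)); y = rng.standard_normal((2, 1))

def cost(W1):
    # Squared-error cost C = 0.5 * ||y - a^(2)||^2 as a function of W^(1).
    a1 = g(W1 @ x + b1)
    a2 = g(W2 @ a1 + b2)
    return 0.5 * float(np.sum((y - a2) ** 2))

# Forward pass, caching z^(l) and D^(l).
z1 = W1 @ x + b1; a1 = g(z1); D1 = np.diagflat(g(z1) * (1 - g(z1)))
z2 = W2 @ a1 + b2; a2 = g(z2); D2 = np.diagflat(g(z2) * (1 - g(z2)))

delta2 = D2 @ (a2 - y)             # (a): delta^(L) = D^(L) * dC/da^(L)
delta1 = D1 @ W2.T @ delta2        # (b): delta^(l) = D^(l) (W^(l+1))^T delta^(l+1)
grad_b1, grad_b2 = delta1, delta2  # (c): dC/db^(l) = delta^(l)
grad_W1 = delta1 @ x.T             # (d): dC/dW^(l) = delta^(l) (a^(l-1))^T
grad_W2 = delta2 @ a1.T
```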
The proofs of these are in the linked paper. (Could add in a bit with an intuitive explanation; eventually I want to get a better visualization of the chain rule beforehand, though, because I bet we could get something neat with neuron & derivative visualizations.)
(We can also do this with the expanded matrix view, as above.)
For the squared-error loss function $C(\theta) = \frac{1}{2}\lVert y - a^{(L)} \rVert^2$, we would have $\frac{\partial C}{\partial a^{(L)}} = a^{(L)} - y$. [find out what this is for log-loss :) softmax too?]
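The squared-error derivative is quick to confirm numerically (the particular $y$ and $a$ values are arbitrary):

```python
import numpy as np

y = np.array([1.0, 0.0])
a = np.array([0.8, 0.3])  # stand-in for a^(L)

def C(a):
    # C = 0.5 * ||y - a||^2
    return 0.5 * np.sum((y - a) ** 2)

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd_grad = np.array([
    (C(a + eps * np.eye(2)[k]) - C(a - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])

analytic = a - y  # dC/da^(L) = a^(L) - y
```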