Chapter 4 Multi-Layer NN Notes
Similar to the chapter on single-layer NNs, this chapter lays out notation & methodology for a multi-layer neural network.
source: https://arxiv.org/abs/1801.05894
“Deep Learning: An Introduction for Applied Mathematicians” by Catherine F. Higham and Desmond J. Higham, published in 2018
4.1 Notation Setup
4.1.1 Scalars
Layers: $1$ through $L$, indexed by $l$
Number of neurons in layer $l$: $n_l$
Neuron activations: $a^{(\text{layer num})}_{\text{neuron num}} = a^{(l)}_j$. The vector of activations for a layer is $a^{(l)}$
Activation function: $g(\cdot)$ is our generic activation function
4.1.2 X
We have our input matrix $X \in \mathbb{R}^{\text{vars} \times \text{obs}} = \mathbb{R}^{n_0 \times m}$:

$$
X = \overbrace{\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n_0,1} & x_{n_0,2} & \cdots & x_{n_0,m}
\end{bmatrix}}^{m \text{ obs}}
\;\Bigg\}\; n_0 \text{ inputs}
$$
The $i$th observation of $X$ is the $i$th column of $X$, referenced as $x_i$.
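As a quick sanity check on the layout, here is a minimal NumPy sketch (the sizes $n_0 = 3$, $m = 5$ and the random data are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: n_0 = 3 input variables, m = 5 observations.
n0, m = 3, 5
rng = np.random.default_rng(0)

# X is (n_0 x m): each COLUMN is one observation.
X = rng.standard_normal((n0, m))

# The i-th observation x_i is the i-th column of X.
x_0 = X[:, 0]        # first observation, shape (n_0,)
x_0_col = X[:, [0]]  # same observation kept as an (n_0, 1) column vector
```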
4.1.3 W
Our weight matrices $W^{(l)} \in \mathbb{R}^{\text{out} \times \text{in}} = \mathbb{R}^{n_l \times n_{l-1}}$:

$$
W^{(l)} = \overbrace{\begin{bmatrix}
w^{(l)}_{1,1} & w^{(l)}_{1,2} & \cdots & w^{(l)}_{1,n_{l-1}} \\
w^{(l)}_{2,1} & w^{(l)}_{2,2} & \cdots & w^{(l)}_{2,n_{l-1}} \\
\vdots & \vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & w^{(l)}_{n_l,2} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}}^{n_{l-1} \text{ inputs}}
\;\Bigg\}\; n_l \text{ outputs}
$$
$W^{(l)}$ is the weight matrix for the $l$th layer.
4.1.4 b
Our bias vectors $b^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
b^{(l)} = \begin{bmatrix}
b^{(l)}_1 \\ b^{(l)}_2 \\ \vdots \\ b^{(l)}_{n_l}
\end{bmatrix}
\;\Bigg\}\; n_l \text{ outputs}
$$

$b^{(l)}$ is the bias vector for the $l$th layer.
4.1.5 Y
Our target matrix $Y \in \mathbb{R}^{\text{cats} \times \text{obs}} = \mathbb{R}^{n_L \times m}$:

$$
Y = \overbrace{\begin{bmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,m} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n_L,1} & y_{n_L,2} & \cdots & y_{n_L,m}
\end{bmatrix}}^{m \text{ obs}}
\;\Bigg\}\; n_L \text{ categories}
$$

Similar to $X$, the $i$th observation of $Y$ is the $i$th column of $Y$, referenced as $y_i$.
4.1.6 z
Our neuron layer’s activation-function input $z^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
z^{(l)} = \begin{bmatrix}
z^{(l)}_1 \\ z^{(l)}_2 \\ \vdots \\ z^{(l)}_{n_l}
\end{bmatrix}
\;\Bigg\}\; n_l \text{ outputs}
$$

$z^{(l)}$ is the neuron ‘weighted input’ vector for the $l$th layer.

We have that:

$$
z^{(l)} = W^{(l)} \ast a^{(l-1)} + b^{(l)} =
\begin{bmatrix}
w^{(l)}_{1,1} & \cdots & w^{(l)}_{1,n_{l-1}} \\
\vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\ast
\begin{bmatrix}
a^{(l-1)}_1 \\ \vdots \\ a^{(l-1)}_{n_{l-1}}
\end{bmatrix}
+
\begin{bmatrix}
b^{(l)}_1 \\ \vdots \\ b^{(l)}_{n_l}
\end{bmatrix}
=
\begin{bmatrix}
z^{(l)}_1 \\ \vdots \\ z^{(l)}_{n_l}
\end{bmatrix}
$$
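In NumPy this is a single matrix-vector product plus the bias; a minimal sketch with made-up sizes ($n_{l-1} = 4$ inputs feeding $n_l = 3$ neurons):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 4, 3

W = rng.standard_normal((n_out, n_in))   # W^(l), shape (n_l, n_{l-1})
a_prev = rng.standard_normal((n_in, 1))  # a^(l-1), one observation as a column
b = rng.standard_normal((n_out, 1))      # b^(l)

z = W @ a_prev + b                       # z^(l), shape (n_l, 1)
```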
4.1.7 a
Our neuron activations $a^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
a^{(l)} = \begin{bmatrix}
a^{(l)}_1 \\ a^{(l)}_2 \\ \vdots \\ a^{(l)}_{n_l}
\end{bmatrix}
\;\Bigg\}\; n_l \text{ outputs}
$$

$a^{(l)}$ is the activation vector for the $l$th layer.

We have that:

$$
a^{(l)} = g\!\left(z^{(l)}\right) = g\!\left(W^{(l)} \ast a^{(l-1)} + b^{(l)}\right)
= g\!\left(
\begin{bmatrix}
w^{(l)}_{1,1} & \cdots & w^{(l)}_{1,n_{l-1}} \\
\vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\ast
\begin{bmatrix}
a^{(l-1)}_1 \\ \vdots \\ a^{(l-1)}_{n_{l-1}}
\end{bmatrix}
+
\begin{bmatrix}
b^{(l)}_1 \\ \vdots \\ b^{(l)}_{n_l}
\end{bmatrix}
\right)
= g\!\left(
\begin{bmatrix}
z^{(l)}_1 \\ \vdots \\ z^{(l)}_{n_l}
\end{bmatrix}
\right)
=
\begin{bmatrix}
a^{(l)}_1 \\ \vdots \\ a^{(l)}_{n_l}
\end{bmatrix}
$$
4.2 Forward Propagation
4.2.1 Setup
For a single neuron, its activation is a weighted sum of all the activations of the previous layer, plus a constant, all fed into the activation function. Formally, this is:

$$
a^{(l)}_j = g\!\left(\sum_{i=1}^{n_{l-1}} w^{(l)}_{j,i} \ast a^{(l-1)}_i + b^{(l)}_j\right)
$$
We can put this in matrix form. An entire layer of neurons can be represented by:

$$
a^{(l)} = g\!\left(z^{(l)}\right) = g\!\left(W^{(l)} \ast a^{(l-1)} + b^{(l)}\right)
$$

as was shown above. We can repeatedly apply this formula to get from $X$ to our predicted $\hat{Y} = a^{(L)}$. We start with the initial layer (layer 0) set equal to the observation: $a^{(0)} = x_i$.
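To convince ourselves the per-neuron sum and the matrix form agree, here is a small check (sigmoid as a stand-in $g$ and the random sizes are assumptions for illustration):

```python
import numpy as np

def g(z):
    # Sigmoid as a stand-in generic activation function (an assumed choice).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_out = 5, 3
W = rng.standard_normal((n_out, n_in))
a_prev = rng.standard_normal(n_in)
b = rng.standard_normal(n_out)

# Per-neuron form: a_j = g( sum_i w_{j,i} * a_i + b_j )
a_per_neuron = np.array([
    g(sum(W[j, i] * a_prev[i] for i in range(n_in)) + b[j])
    for j in range(n_out)
])

# Matrix form: a = g(W a_prev + b)
a_matrix = g(W @ a_prev + b)
```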
Note that we will be forward (& backward) propagating one observation of $X$ at a time by operating on each column separately. However, if desired, forward (& backward) propagation can be done on all observations simultaneously. The notation change would involve stretching out $a^{(l)}$, $z^{(l)}$, and $b^{(l)}$ so that they are each $m$ wide:

$$
a^{(l)} = g\!\left(z^{(l)}\right)
= g\!\left(
W^{(l)}
\ast
\overbrace{\begin{bmatrix}
a^{(l-1)}_{1,1} & \cdots & a^{(l-1)}_{1,m} \\
\vdots & \ddots & \vdots \\
a^{(l-1)}_{n_{l-1},1} & \cdots & a^{(l-1)}_{n_{l-1},m}
\end{bmatrix}}^{m \text{ obs}}
+
\overbrace{\begin{bmatrix}
b^{(l)}_1 & \cdots & b^{(l)}_1 \\
\vdots & \ddots & \vdots \\
b^{(l)}_{n_l} & \cdots & b^{(l)}_{n_l}
\end{bmatrix}}^{m \text{ obs}}
\right)
= g\!\left(
\overbrace{\begin{bmatrix}
z^{(l)}_{1,1} & \cdots & z^{(l)}_{1,m} \\
\vdots & \ddots & \vdots \\
z^{(l)}_{n_l,1} & \cdots & z^{(l)}_{n_l,m}
\end{bmatrix}}^{m \text{ obs}}
\right)
=
\overbrace{\begin{bmatrix}
a^{(l)}_{1,1} & \cdots & a^{(l)}_{1,m} \\
\vdots & \ddots & \vdots \\
a^{(l)}_{n_l,1} & \cdots & a^{(l)}_{n_l,m}
\end{bmatrix}}^{m \text{ obs}}
$$

Each column of $a^{(l)}$ and $z^{(l)}$ represents an observation and can hold unique values, while $b^{(l)}$ is merely repeated to be $m$ wide; each row holds the same bias value for its neuron.
We are sticking with one observation at a time for its simplicity; it also makes the back-propagation linear algebra easier and cleaner.
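NumPy broadcasting gives the "stretched out" bias for free: adding an $(n_l, 1)$ column to an $(n_l, m)$ matrix repeats the column across all $m$ observations. A sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out, m = 4, 3, 6

W = rng.standard_normal((n_out, n_in))
A_prev = rng.standard_normal((n_in, m))  # each column is one observation
b = rng.standard_normal((n_out, 1))

# b broadcasts from (n_out, 1) to (n_out, m), i.e. repeated m wide.
Z = W @ A_prev + b
```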
4.2.2 Algorithm
For a given observation $x_i$:
- Set $a^{(0)} = x_i$
- For each $l$ from $1$ up to $L$:
  - $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
  - $a^{(l)} = g\!\left(z^{(l)}\right)$
  - $D^{(l)} = \operatorname{diag}\!\left[g'(z^{(l)})\right]$
    - this term will be needed later
If $Y$ happens to be categorical, we may choose to apply the softmax function $\left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right)$ to $a^{(L)}$. Otherwise, we are done! We have our estimated result $a^{(L)}$.
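The loop above can be sketched in NumPy; the sigmoid activation, the layer sizes, and the random initialization are all assumptions for illustration:

```python
import numpy as np

def g(z):
    # Sigmoid as a stand-in generic activation (an assumed choice).
    return 1.0 / (1.0 + np.exp(-z))

def g_prime(z):
    s = g(z)
    return s * (1.0 - s)

def forward(x_i, weights, biases):
    """weights[l-1], biases[l-1] hold W^(l), b^(l) for l = 1..L."""
    a = x_i.reshape(-1, 1)                  # a^(0) = x_i
    activations, zs, Ds = [a], [], []
    for W, b in zip(weights, biases):
        z = W @ a + b                       # z^(l) = W^(l) a^(l-1) + b^(l)
        a = g(z)                            # a^(l) = g(z^(l))
        zs.append(z)
        Ds.append(np.diagflat(g_prime(z)))  # D^(l) = diag[g'(z^(l))], kept for later
        activations.append(a)
    return activations, zs, Ds

rng = np.random.default_rng(4)
sizes = [3, 4, 2]  # made-up architecture: n_0, n_1, n_2
weights = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
biases = [rng.standard_normal((sizes[l + 1], 1)) for l in range(2)]

acts, zs, Ds = forward(rng.standard_normal(3), weights, biases)
```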
4.3 Backward Propagation
Recall that we are trying to minimize a cost function via gradient descent by iterating over our parameter vector $\theta$: $\theta_{t+1} \leftarrow \theta_t - \rho \ast \nabla C(\theta_t)$. We will now implement this.

To do so, there is one more useful variable we need to define: $\delta^{(l)}$
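The update rule itself is one line of code; here it is sketched on a toy quadratic cost $C(\theta) = \frac{1}{2}\lVert\theta\rVert^2$ (whose gradient is just $\theta$), where the cost, step size, and starting point are all made up for illustration:

```python
import numpy as np

rho = 0.1                           # step size (an assumed value)
theta = np.array([1.0, -2.0, 3.0])  # arbitrary starting parameters

for _ in range(100):
    grad = theta                    # grad C(theta) for C = 0.5 * ||theta||^2
    theta = theta - rho * grad      # theta_{t+1} <- theta_t - rho * grad C
```

Each step shrinks $\theta$ by a factor $(1 - \rho)$, so it converges toward the minimizer $\theta = 0$.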
4.3.1 Delta
We define $\delta^{(l)}_j := \frac{\partial C}{\partial z^{(l)}_j}$ for a particular neuron, and its vector form $\delta^{(l)}$ represents the whole layer.

$\delta^{(l)}$ allows us to back-propagate one layer at a time by defining the gradients of the earlier layers from those of the later layers. In particular:

$$
\delta^{(l)} = \operatorname{diag}\!\left[g'(z^{(l)})\right] \ast (W^{(l+1)})^T \ast \delta^{(l+1)}
$$
The derivation is in the linked paper, so I won’t go over it in full here.

In short, $z^{(l+1)} = W^{(l+1)} \ast g(z^{(l)}) + b^{(l+1)}$; so, $\delta^{(l)}$ is related to $\delta^{(l+1)}$ via the chain rule:

$$
\delta^{(l)} = \frac{\partial C}{\partial z^{(l)}}
= \underbrace{\operatorname{diag}\!\left[g'(z^{(l)})\right]}_{\partial g / \partial z^{(l)}}
\ast \underbrace{(W^{(l+1)})^T}_{\partial z^{(l+1)} / \partial g}
\ast \underbrace{\delta^{(l+1)}}_{\partial C / \partial z^{(l+1)}}
$$
[Eventually, add in a write-up on why the transpose of $W$ is taken. In short, it takes the dot product of each neuron’s output across the next layer’s neurons (in $(W^{(l+1)})^T$, each row is an input neuron being distributed across the next layer) with the next layer’s $\delta^{(l+1)}$.]
Note that we scale $\delta^{(l)}$ by $g'(z^{(l)})$, which we do by multiplying on the left by:

$$
\operatorname{diag}\!\left[g'(z^{(l)})\right] = \begin{bmatrix}
g'(z^{(l)}_1) & & & \\
& g'(z^{(l)}_2) & & \\
& & \ddots & \\
& & & g'(z^{(l)}_{n_l})
\end{bmatrix}
$$

This has the same effect as element-wise multiplication.

For shorthand, we define $D^{(l)} = \operatorname{diag}\!\left[g'(z^{(l)})\right]$
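The diag-equals-element-wise claim is easy to verify numerically (the vectors here are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(5)
gprime_z = rng.standard_normal(4)  # stand-in for g'(z^(l))
v = rng.standard_normal(4)         # stand-in for (W^(l+1))^T delta^(l+1)

via_diag = np.diagflat(gprime_z) @ v  # multiply on the left by diag[g'(z)]
via_elementwise = gprime_z * v        # element-wise (Hadamard) product
```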
4.3.2 Gradients
Given $\delta^{(l)}$, it becomes simple to write down our gradients:

$$
\begin{aligned}
\delta^{(L)} &= D^{(L)} \ast \frac{\partial C}{\partial a^{(L)}} && (a) \\
\delta^{(l)} &= D^{(l)} \ast (W^{(l+1)})^T \ast \delta^{(l+1)} && (b) \\
\frac{\partial C}{\partial b^{(l)}} &= \delta^{(l)} && (c) \\
\frac{\partial C}{\partial W^{(l)}} &= \delta^{(l)} \ast (a^{(l-1)})^T && (d)
\end{aligned}
$$
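Equations (a)–(d) can be sketched on a tiny two-layer network with the squared-error cost; the sigmoid activation and the random sizes are assumptions for illustration, and the result can be checked against a finite-difference gradient:

```python
import numpy as np

def g(z):
    # Sigmoid as a stand-in activation (an assumed choice).
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))
W2 = rng.standard_normal((2, 4)); b2 = rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1)); y = rng.standard_normal((2, 1))

def cost(W1):
    # Squared-error cost C = 0.5 * ||y - a^(2)||^2 as a function of W^(1).
    a1 = g(W1 @ x + b1)
    a2 = g(W2 @ a1 + b2)
    return 0.5 * float(np.sum((y - a2) ** 2))

# Forward pass, caching z^(l) and D^(l).
z1 = W1 @ x + b1; a1 = g(z1); D1 = np.diagflat(g(z1) * (1 - g(z1)))
z2 = W2 @ a1 + b2; a2 = g(z2); D2 = np.diagflat(g(z2) * (1 - g(z2)))

delta2 = D2 @ (a2 - y)             # (a): delta^(L) = D^(L) * dC/da^(L)
delta1 = D1 @ W2.T @ delta2        # (b): delta^(l) = D^(l) (W^(l+1))^T delta^(l+1)
grad_b1, grad_b2 = delta1, delta2  # (c): dC/db^(l) = delta^(l)
grad_W1 = delta1 @ x.T             # (d): dC/dW^(l) = delta^(l) (a^(l-1))^T
grad_W2 = delta2 @ a1.T
```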
The proofs of these are in the linked paper. (Could add in a bit with an intuitive explanation; eventually I want to get a better visualization of the chain rule beforehand, though, because I bet we could get something neat with neuron & derivative visualizations.)
(We can also do this with the expanded matrix view, as above.)
For the squared-error loss function $C(\theta) = \frac{1}{2}\lVert y - a^{(L)} \rVert^2$, we would have $\frac{\partial C}{\partial a^{(L)}} = a^{(L)} - y$. [find out what this is for log-loss :) softmax too?]
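The squared-error derivative is quick to confirm numerically (the particular $y$ and $a$ values are arbitrary):

```python
import numpy as np

y = np.array([1.0, 0.0])
a = np.array([0.8, 0.3])  # stand-in for a^(L)

def C(a):
    # C = 0.5 * ||y - a||^2
    return 0.5 * np.sum((y - a) ** 2)

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd_grad = np.array([
    (C(a + eps * np.eye(2)[k]) - C(a - eps * np.eye(2)[k])) / (2 * eps)
    for k in range(2)
])

analytic = a - y  # dC/da^(L) = a^(L) - y
```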