Chapter 4 Multi-Layer NN Notes

Similar to the chapter on single-layer NNs, this chapter lays out notation and methodology for a multi-layer neural network.


source: https://arxiv.org/abs/1801.05894

“Deep Learning: An Introduction for Applied Mathematicians” by Catherine F. Higham and Desmond J. Higham, published in 2018


4.1 Notation Setup

4.1.1 Scalars

Layers: $1$ through $L$, indexed by $l$

Number of neurons in layer $l$: $n_l$

Neuron activations: $a^{(\text{layer})}_{\text{neuron}} = a^{(l)}_j$. The vector of activations for layer $l$ is $a^{(l)}$

Activation function: $g(\cdot)$ is our generic activation function


4.1.2 X

We have our input matrix $X \in \mathbb{R}^{\text{vars} \times \text{obs}} = \mathbb{R}^{n_0 \times m}$:

$$
X =
\begin{bmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n_0,1} & x_{n_0,2} & \cdots & x_{n_0,m}
\end{bmatrix}
\qquad (n_0 \text{ inputs} \times m \text{ obs})
$$

The $i$th observation of $X$ is the $i$th column of $X$, referenced as $x_i$.


4.1.3 W

Our weight matrices $W^{(l)} \in \mathbb{R}^{\text{out} \times \text{in}} = \mathbb{R}^{n_l \times n_{l-1}}$:

$$
W^{(l)} =
\begin{bmatrix}
w^{(l)}_{1,1} & w^{(l)}_{1,2} & \cdots & w^{(l)}_{1,n_{l-1}} \\
w^{(l)}_{2,1} & w^{(l)}_{2,2} & \cdots & w^{(l)}_{2,n_{l-1}} \\
\vdots & \vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & w^{(l)}_{n_l,2} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\qquad (n_l \text{ outputs} \times n_{l-1} \text{ inputs})
$$

$W^{(l)}$ is the weight matrix for the $l$th layer


4.1.4 b

Our bias vectors $b^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
b^{(l)} =
\begin{bmatrix} b^{(l)}_1 \\ b^{(l)}_2 \\ \vdots \\ b^{(l)}_{n_l} \end{bmatrix}
\qquad (n_l \text{ outputs})
$$

$b^{(l)}$ is the bias vector for the $l$th layer


4.1.5 Y

Our target matrix $Y \in \mathbb{R}^{\text{cats} \times \text{obs}} = \mathbb{R}^{n_L \times m}$:

$$
Y =
\begin{bmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,m} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n_L,1} & y_{n_L,2} & \cdots & y_{n_L,m}
\end{bmatrix}
\qquad (n_L \text{ categories} \times m \text{ obs})
$$

Similar to $X$, the $i$th observation of $Y$ is the $i$th column of $Y$, referenced as $y_i$.


4.1.6 z

Our neuron layer's weighted inputs $z^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
z^{(l)} =
\begin{bmatrix} z^{(l)}_1 \\ z^{(l)}_2 \\ \vdots \\ z^{(l)}_{n_l} \end{bmatrix}
\qquad (n_l \text{ outputs})
$$

$z^{(l)}$ is the neuron 'weighted input' vector for the $l$th layer

We have that:

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} =
\begin{bmatrix}
w^{(l)}_{1,1} & w^{(l)}_{1,2} & \cdots & w^{(l)}_{1,n_{l-1}} \\
w^{(l)}_{2,1} & w^{(l)}_{2,2} & \cdots & w^{(l)}_{2,n_{l-1}} \\
\vdots & \vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & w^{(l)}_{n_l,2} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\begin{bmatrix} a^{(l-1)}_1 \\ a^{(l-1)}_2 \\ \vdots \\ a^{(l-1)}_{n_{l-1}} \end{bmatrix}
+
\begin{bmatrix} b^{(l)}_1 \\ b^{(l)}_2 \\ \vdots \\ b^{(l)}_{n_l} \end{bmatrix}
=
\begin{bmatrix} z^{(l)}_1 \\ z^{(l)}_2 \\ \vdots \\ z^{(l)}_{n_l} \end{bmatrix}
$$
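As a shape sanity check, the multiply-then-add can be sketched in NumPy (the layer sizes here are arbitrary, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example sizes: n_{l-1} = 4 inputs, n_l = 3 outputs
n_prev, n_out = 4, 3
W = rng.standard_normal((n_out, n_prev))    # W^(l): n_l x n_{l-1}
a_prev = rng.standard_normal((n_prev, 1))   # a^(l-1): n_{l-1} x 1
b = rng.standard_normal((n_out, 1))         # b^(l): n_l x 1

z = W @ a_prev + b                          # z^(l) = W^(l) a^(l-1) + b^(l)
print(z.shape)                              # (3, 1)
```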


4.1.7 a

Our neuron activation vectors $a^{(l)} \in \mathbb{R}^{\text{out} \times 1} = \mathbb{R}^{n_l \times 1}$:

$$
a^{(l)} =
\begin{bmatrix} a^{(l)}_1 \\ a^{(l)}_2 \\ \vdots \\ a^{(l)}_{n_l} \end{bmatrix}
\qquad (n_l \text{ outputs})
$$

$a^{(l)}$ is the activation vector for the $l$th layer

We have that (with $g$ applied element-wise):

$$
a^{(l)} = g(z^{(l)}) = g\left(W^{(l)} a^{(l-1)} + b^{(l)}\right) =
g\left(
\begin{bmatrix}
w^{(l)}_{1,1} & \cdots & w^{(l)}_{1,n_{l-1}} \\
\vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\begin{bmatrix} a^{(l-1)}_1 \\ \vdots \\ a^{(l-1)}_{n_{l-1}} \end{bmatrix}
+
\begin{bmatrix} b^{(l)}_1 \\ \vdots \\ b^{(l)}_{n_l} \end{bmatrix}
\right)
= g\left(\begin{bmatrix} z^{(l)}_1 \\ \vdots \\ z^{(l)}_{n_l} \end{bmatrix}\right)
= \begin{bmatrix} a^{(l)}_1 \\ \vdots \\ a^{(l)}_{n_l} \end{bmatrix}
$$

4.2 Forward Propagation

4.2.1 Setup

For a single neuron, its activation is a weighted sum of all the activations of the previous layer, plus a constant, all fed into the activation function. Formally, this is:

$$
a^{(l)}_j = g\left( \sum_{i=1}^{n_{l-1}} w^{(l)}_{j,i}\, a^{(l-1)}_i + b^{(l)}_j \right)
$$

We can put this in matrix form. An entire layer of neurons can be represented by:

$$
a^{(l)} = g(z^{(l)}) = g\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)
$$

as was shown above. We can repeatedly apply this formula to get from $X$ to our predicted $\hat{Y} = a^{(L)}$. We start with the initial layer (layer $0$) being set equal to $x_i$.

Note that we will be forward- (and backward-) propagating one observation of $X$ at a time by operating on each column separately. However, if desired, forward (and backward) propagation can be done on all observations simultaneously. The notation change would involve stretching out $a^{(l)}$, $z^{(l)}$, and $b^{(l)}$ so that they are each $m$ wide:

$$
a^{(l)} = g(z^{(l)}) = g\left(W^{(l)} a^{(l-1)} + b^{(l)}\right) =
g\left(
\begin{bmatrix}
w^{(l)}_{1,1} & \cdots & w^{(l)}_{1,n_{l-1}} \\
\vdots & \ddots & \vdots \\
w^{(l)}_{n_l,1} & \cdots & w^{(l)}_{n_l,n_{l-1}}
\end{bmatrix}
\begin{bmatrix}
a^{(l-1)}_{1,1} & \cdots & a^{(l-1)}_{1,m} \\
\vdots & \ddots & \vdots \\
a^{(l-1)}_{n_{l-1},1} & \cdots & a^{(l-1)}_{n_{l-1},m}
\end{bmatrix}
+
\begin{bmatrix}
b^{(l)}_1 & \cdots & b^{(l)}_1 \\
\vdots & \ddots & \vdots \\
b^{(l)}_{n_l} & \cdots & b^{(l)}_{n_l}
\end{bmatrix}
\right)
= g\left(
\begin{bmatrix}
z^{(l)}_{1,1} & \cdots & z^{(l)}_{1,m} \\
\vdots & \ddots & \vdots \\
z^{(l)}_{n_l,1} & \cdots & z^{(l)}_{n_l,m}
\end{bmatrix}
\right)
=
\begin{bmatrix}
a^{(l)}_{1,1} & \cdots & a^{(l)}_{1,m} \\
\vdots & \ddots & \vdots \\
a^{(l)}_{n_l,1} & \cdots & a^{(l)}_{n_l,m}
\end{bmatrix}
$$

Each column of $a^{(l)}$ and $z^{(l)}$ represents an observation and can hold unique values, while $b^{(l)}$ is merely repeated to be $m$ wide; each row repeats the same bias value for its neuron.
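In NumPy, that $m$-wide repetition of $b^{(l)}$ happens automatically via broadcasting, so the batched layer looks identical to the single-observation one. A minimal sketch (the sizes and the sigmoid choice of $g$ are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    # one possible concrete choice for the activation function g
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_prev, n_out, m = 4, 3, 5                  # arbitrary sizes, m observations
W = rng.standard_normal((n_out, n_prev))
A_prev = rng.standard_normal((n_prev, m))   # one column per observation
b = rng.standard_normal((n_out, 1))

# b (n_l x 1) broadcasts across the m columns: effectively repeated m-wide
A = sigmoid(W @ A_prev + b)
print(A.shape)                              # (3, 5)
```

Each column of `A` matches what single-observation propagation of that column would give.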

We are sticking with one observation at a time for its simplicity, and because it makes the back-propagation linear algebra easier/cleaner.

4.2.2 Algorithm

For a given observation $x_i$:

  1. Set $a^{(0)} = x_i$
  2. For each $l$ from $1$ up to $L$:
    • $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
    • $a^{(l)} = g(z^{(l)})$
    • $D^{(l)} = \mathrm{diag}\left[g'(z^{(l)})\right]$
      • this term will be needed later

If $Y$ happens to be categorical, we may choose to apply the softmax function ($e^{z_i} / \sum_j e^{z_j}$) to $a^{(L)}$. Otherwise, we are done! We have our estimated result $a^{(L)}$.
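The steps above can be sketched as a small function. Sigmoid is used as a stand-in for $g$ (an assumption, not fixed by the notes), and each $D^{(l)}$ is cached as the vector $g'(z^{(l)})$ rather than a full diagonal matrix:

```python
import numpy as np

def sigmoid(z):
    # a common concrete choice for the activation function g
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # g'(z) = g(z) * (1 - g(z)) for the sigmoid
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, Ws, bs):
    """Forward-propagate one observation x (an n_0 x 1 column).

    Ws[l] and bs[l] hold W^(l+1) and b^(l+1) in 0-based Python indexing.
    Returns all activations a^(0), ..., a^(L) and the g'(z^(l)) vectors
    (the diagonals of the D^(l) matrices, needed later for back-propagation).
    """
    a = x
    activations = [a]
    D = []
    for W, b in zip(Ws, bs):
        z = W @ a + b              # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigmoid(z)             # a^(l) = g(z^(l))
        activations.append(a)
        D.append(sigmoid_prime(z)) # diagonal of D^(l) = diag[g'(z^(l))]
    return activations, D
```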

4.3 Backward Propagation

Recall that we are trying to minimize a cost function via gradient descent by iterating on our parameter vector $\theta$: $\theta_{t+1} \leftarrow \theta_t - \rho \nabla C(\theta_t)$. We will now implement this.

To do so, there is one more useful variable we need to define: $\delta^{(l)}$

4.3.1 Delta

We define $\delta^{(l)}_j := \frac{\partial C}{\partial z^{(l)}_j}$ for a particular neuron, and its vector form $\delta^{(l)}$ represents the whole layer.

$\delta^{(l)}$ allows us to back-propagate one layer at a time by defining the gradients of the earlier layers in terms of those of the later layers. In particular:

$$
\delta^{(l)} = \mathrm{diag}\left[g'(z^{(l)})\right] \left(W^{(l+1)}\right)^T \delta^{(l+1)}
$$

The derivation is in the linked paper, so I won’t go over it in full here.


In short, $z^{(l+1)} = W^{(l+1)} g(z^{(l)}) + b^{(l+1)}$; so, $\delta^{(l)}$ is related to $\delta^{(l+1)}$ via the chain rule:

$$
\delta^{(l)} = \frac{\partial C}{\partial z^{(l)}}
= \left(\frac{\partial z^{(l+1)}}{\partial z^{(l)}}\right)^{T} \frac{\partial C}{\partial z^{(l+1)}}
= \mathrm{diag}\left[g'(z^{(l)})\right] \left(W^{(l+1)}\right)^T \delta^{(l+1)}
$$

[Eventually, add in a write-up on why the transpose of $W$ is taken. In short, each row of $(W^{(l+1)})^T$ holds the weights by which one neuron's output is distributed across the next layer's neurons, so $(W^{(l+1)})^T \delta^{(l+1)}$ takes the dot product of each neuron's outgoing weights with the next layer's $\delta^{(l+1)}$.]


Note that we scale $\delta^{(l)}$ by $g'(z^{(l)})$, which we do by multiplying on the left by:

$$
\mathrm{diag}\left[g'(z^{(l)})\right] =
\begin{bmatrix}
g'(z^{(l)}_1) & & & \\
& g'(z^{(l)}_2) & & \\
& & \ddots & \\
& & & g'(z^{(l)}_{n_l})
\end{bmatrix}
$$

This has the same effect as element-wise multiplication.
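A quick numerical check of that equivalence (the values are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
gprime_z = rng.standard_normal(4)   # stand-in for g'(z^(l))
v = rng.standard_normal(4)          # stand-in for (W^(l+1))^T delta^(l+1)

lhs = np.diag(gprime_z) @ v         # multiply by diag[g'(z^(l))] on the left
rhs = gprime_z * v                  # element-wise product
print(np.allclose(lhs, rhs))        # True
```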

For shorthand, we define $D^{(l)} = \mathrm{diag}\left[g'(z^{(l)})\right]$

4.3.2 Gradients

Given δ(l), it becomes simple to write down our gradients:

$$
\begin{aligned}
\delta^{(L)} &= D^{(L)} \frac{\partial C}{\partial a^{(L)}} && (a) \\
\delta^{(l)} &= D^{(l)} \left(W^{(l+1)}\right)^T \delta^{(l+1)} && (b) \\
\frac{\partial C}{\partial b^{(l)}} &= \delta^{(l)} && (c) \\
\frac{\partial C}{\partial W^{(l)}} &= \delta^{(l)} \left(a^{(l-1)}\right)^T && (d)
\end{aligned}
$$

The proofs of these are in the linked paper. (Could add in a bit with an intuitive explanation; eventually I want to get better visualizations of the chain rule beforehand, because I bet we could get something neat with neuron & derivative visualizations.)

(We can also write this out with the expanded matrix view, as above.)

For the squared-error cost function $C(\theta) = \frac{1}{2}\|y - a^{(L)}\|^2$, we would have $\frac{\partial C}{\partial a^{(L)}} = (a^{(L)} - y)$. [Find out what this is for log-loss :) softmax too?]
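That squared-error gradient is easy to double-check against central finite differences (values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(3)
aL = rng.standard_normal(3)   # stand-in for a^(L)

# C(a) = 0.5 * ||y - a||^2
C = lambda a: 0.5 * np.sum((y - a) ** 2)

# analytic gradient: dC/da^(L) = a^(L) - y
grad = aL - y

# central finite differences as an independent check
eps = 1e-6
fd = np.array([
    (C(aL + eps * e) - C(aL - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad, fd))   # True
```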

4.3.3 Algorithm

For a given observation $x_i$:

  1. Set $\delta^{(L)} = D^{(L)} \frac{\partial C}{\partial a^{(L)}}$
  2. For each $l$ from $(L-1)$ down to $1$:
    • $\delta^{(l)} = D^{(l)} \left(W^{(l+1)}\right)^T \delta^{(l+1)}$
  3. For each $l$ from $L$ down to $1$:
    • $W^{(l)} \leftarrow W^{(l)} - \rho\, \delta^{(l)} \left(a^{(l-1)}\right)^T$
    • $b^{(l)} \leftarrow b^{(l)} - \rho\, \delta^{(l)}$
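Putting the forward and backward algorithms together, here is a minimal sketch of one gradient-descent step on a single observation, assuming $g$ = sigmoid and the squared-error cost (both illustrative choices, not fixed by the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, Ws, bs, rho):
    """One gradient-descent step on a single observation (x, y).

    Assumes g = sigmoid and the squared-error cost C = 0.5 * ||y - a^(L)||^2.
    Ws[l], bs[l] hold W^(l+1), b^(l+1) in 0-based indexing; updated in place.
    """
    # Forward pass (4.2.2), caching a^(l) and the diagonals of D^(l)
    a = x
    activations = [a]
    Dvecs = []
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
        activations.append(a)
        Dvecs.append(a * (1.0 - a))          # sigmoid'(z) = g(z)(1 - g(z))

    L = len(Ws)
    deltas = [None] * L
    # Step 1: delta^(L) = D^(L) dC/da^(L), with dC/da^(L) = a^(L) - y
    deltas[-1] = Dvecs[-1] * (activations[-1] - y)
    # Step 2: delta^(l) = D^(l) (W^(l+1))^T delta^(l+1), l = L-1, ..., 1
    for l in range(L - 2, -1, -1):
        deltas[l] = Dvecs[l] * (Ws[l + 1].T @ deltas[l + 1])
    # Step 3: gradient-descent updates
    for l in range(L):
        Ws[l] -= rho * deltas[l] @ activations[l].T  # dC/dW^(l) = delta (a^(l-1))^T
        bs[l] -= rho * deltas[l]                     # dC/db^(l) = delta
```

Repeatedly calling `backprop_step` on an observation should drive the cost for that observation down.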