# The goal

This post will share some basic knowledge of an artificial neural
network and how to create one from scratch using only `numpy`

. As a
concrete example, we will build a classification network and use it to
classify hand-written digits from the **MNIST** dataset.

We will start by introducing the general idea and structure of the
network, and the mathematical notations. The computations involved
inside the network will be introduced by first showing the math
equations, followed with the `numpy`

implementations in the 2nd part
of the post. If a
particular step can be optimized, for instance, by vectoring the
computations, these will be covered as well. In the end, the network is
trained on the training dataset of MNIST, and its performance
evaluated using MNIST’s test data. The final, complete code is given at
this simple-nn Github repo.

# Computations in the network and notations

## Layers in the network

The network we will be building consists of **5 layers**:

**1. input layer (layer-0)**: this is the layer that feeds the input data into the network.

The number of neurons (or units) in the input layer should be the same as the size of the input data. In this particular case, the input data are images of MNIST hand-written digits. Each image is **28 pixels x 28 pixels** in size, and we **flatten** the image into a single array of length **28 x 28 = 784**. Therefore, the input layer has **784** neurons.

We denote the size of layer \(l\) as \(s_l\). Therefore, \(s_0=784\).

Note that we are indexing the layer number starting from index 0, i.e. the 1st layer has **index 0**, 2nd has index 1, and so on. This is mostly arbitrary. In some places people label the input layer as layer-1. Using 0-based indexing makes it only slightly convenient to work with **Python** whose array indices also start from 0.

**2. 3 hidden layers (layer-1, layer-2 and layer-3)**: these layers have sizes:

- layer-1: 200 neurons, \(s_1=200\).
- layer-2: 100 neurons, \(s_2=100\).
- layer-3: 50 neurons, \(s_3=50\).

NOTE that these are also arbitrary choices. We are not going to compare different design choices in this post. It is totally possible that a simpler or different design would work just as well or even better.

**3. output layer (layer-4)**: this is the layer that outputs the final result of the network. The final result is often termed as the prediction of the model. It is conventional to label the **prediction** as \(\hat{y}\) (or \(\mathbf{\hat{y}}\)).

For the task of recognizing hand-written digits from images, it is a **multi-class classification** problem. There are a total of **10 classes** in the problem, corresponding to digits from **0 to 9**. For such kind of tasks, it is common to formulate the prediction using the so-called **one-hot encoding**. For instance, the label for digit **0** is defined as

\[

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

\]

that for digit **1** as

\[

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

\]

and similarly, digit **5** is encoded as:

\[

[0, 0, 0, 0, 0, 5, 0, 0, 0, 0]

\]

More generally, the class **i** in a **K-class** classification task is represented by a K-length vector whose **i-th** element is **1**, and all other places are **0**s.

Therefore, the output layer has **10** neurons: \(s_{4}=10\).

**To summarize this section:**

- The network has multiple layers, 1 input layer, 1 output layer, and all other layers in-between are called hidden layers.
- The number of layers in the network is denoted \(L\). The number of neurons/units in layer \(l\) is denoted as \(s_l\). In this case, \(L=5\), \(s_0 = 784\), \(s_1 = 200\), \(s_2 = 100\), \(s_3 = 50\), and \(s_4 = 10\). The input and output layer sizes are determined by the specific problem to be solved, and hidden layer sizes are design choices.
- one-hot encoding is used for the digit classification task.

## Computations happening inside the network – the forward propagation

Information gets passed from the input layer (layer-0) to the 1st hidden layer (layer-1), then to all subsequent layers, one layer at a time.

From layer-0 to layer-1, each neuron \(j\) in layer-0 is connected to each neuron \(i\) in layer-1. Therefore, there are a total of \(s_1 \times s_0 = 200 \times 784 = 157200\) connections. That is quite a number of connections, that is why such models are also called **dense networks**.

Similarly, from layer-1 to layer-2, there are \(s_2 \times s_1 = 100 \times 200 = 20000\) connections.

More generally, from layer \(l\) to layer \(l+1\), the number of connections is \(s_{l+1} \times s_{l}\).

What happens along these connections is a **linear operation**. More specifically, suppose that the neuron \(i\) at layer \(l\) is sending a value to neuron \(j\) in the next layer. The value sent is termed \(a_{i}^{(l)}\), where \(a\) stands for **activation**. The linear operation is

\[

z_{j}^{(l+1)} = a_{i}^{(l)} \theta_{j,i}^{(l+1)} + b_{j}^{(l+1)}

\]

where:

- \(\theta_{j,i}^{(l+1)}\) is a
**weight**that is associated with the connection between these 2 neurons. The weight is multiplied with the input activation value \(a_{i}^{(l)}\). - \(b_{j}^{(l+1)}\) is a
**bias**term added onto the product. - \(z_{j}^{(l+1)}\) is the result of the linear operation. Intuitively, this is what a next layer neuron receives from its upstream input.

After getting \(z_{j}^{(l+1)}\), the neuron gets “activated”, and sends out an “activation” to its next layer. This “activation” is mathematically achieved by an **activation function**:

\[

a_{j}^{(l+1)} = g(z_{j}^{(l+1)})

\]

where \(g()\) is the activation function, which is typically a non-linear function, for instance, a **logistic function** (a.k.a **sigmoid function**):

\[

g(z) = \frac{1}{1 + e^{-z}}

\]

Therefore, the processes happening in from layer \(l\) to \(l+1\) are:

\[

\left\{\begin{matrix}

z_{j}^{(l+1)} =& a_{i}^{(l)} \theta_{j,i}^{(l+1)} + b_{j}^{(l+1)} \\

a_{j}^{(l+1)} =& g(z_{j}^{(l+1)})

\end{matrix}\right.

\]

Then the similar process is repeated from layer \(l+1\) to \(l+2\), and so on, till the very last layer in the network. This process is called **forward propagation**. See figure below for an illustration.

NOTE that for the 1st layer, \(a^{(0)}\) is just the input data in layer-0. For the last layer, \(a^{(L-1)}\) is the model prediction.

Recall that there are \(s_{l+1} \times s_{l}\) number of connections between layer \(l\) and \(l+1\), meaning \(s_{l+1} \times s_{l}\) number of linear operations. All these could be neatly represented using **matrix-vector notations** as below:

Or more compactly:

\[

\mathbf{\Theta}^{(l+1)} \cdot \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} = \mathbf{z}^{(l+1)}

\]

where:

- \(\mathbf{\Theta}^{(l+1)}\) is the
**weight matrix**that maps data from**layer l to layer l+1**. It has a dimension of \((s_{l+1} \times s_l)\). The element, \(\theta_{j,i}^{(l+1)}\) is the weight associated with the link from neuron \(j\) in layer \(l\), to neuron \(i\) in layer \(l+1\). - \(\mathbf{a}^{(l)}\) is the vector of activations from layer \(l\), with dimension \((s_l \times 1)\).
- \(\mathbf{b}^{(l+1)}\) is the vector of biases added to layer \(l+1\), with dimension \((s_{l+1} \times 1)\).
- \(\mathbf{z}^{(l+1)}\) is the
**weighted sum**in layer \(l+1\), with dimension \((s_{l+1} \times 1)\).

The activation process is simply:

\[

\mathbf{a}^{(l+1)} = g(\mathbf{z}^{(l+1)})

\]

where \(\mathbf{a}^{(l+1)}\) is the vector of **activations** in layer \(l+1\), with dimension \((s_{l+1} \times 1)\).

**To summarize this section:**

- Information passes from the input layer through hidden layer(s) to the output layer, a process called
**forward propagation**. - Between 2 consecutive layers \(l\) and \(l+1\), data get passed in a
**linear operation**: \(\mathbf{\Theta}^{(l+1)} \cdot \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} = \mathbf{z}^{(l+1)}\). - A
**non-linear activation function**is used to transform the**weighted sum**of a layer to its output: \(\mathbf{a}^{(l+1)} = g(\mathbf{z}^{(l+1)})\). - \(\mathbf{a}^{(0)}\) is the input data, \(\mathbf{a}^{(L-1)}\) is the model prediction.

NOTE that some materials may use a slightly different formula for the linear operation:

\[

\mathbf{\Theta}^{(l)} \cdot \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} =

\mathbf{z}^{(l+1)}

\]

This is because \(\mathbf{\Theta}^{(l)}\) is used to denote the mapping from layer \(l\) to \(l+1\), while we used \(\mathbf{\Theta}^{(l+1)}\) for this. So it is a purely a notational difference, and the underlying processes are identical. This notation is chosen because I found it slightly more intuitive to associate a layer with the weights that pass data to the layer. This minor advantage may become more obvious when dealing with **convolution neural networks** where different types of layers are concatenated.

ALSO NOTE that for the multi-class classification problem, the output layer needs to use the **softmax** activation function, which is the generalization of **logistic** to multi-dimensional data. The softmax function for a vector \(x = \{x_0, x_1, \cdots, x_{n-1}\}\) is defined as:

\[

\sigma(x)_j = \frac{e^{x_j}}{\sum_i e^{x_i}}

\]

## Computations happening inside the network – the cost function

The **cost function** is a function that computes the cost *J*, which is
a scalar value that measures the goodness-of-fit of the model. A
smaller cost means a better fit of the model prediction. There are
multiple ways to define a cost, for different types of problems. For
the multi-class classification problem, a typical choice is the
**cross-entropy** cost function. The cost contributed by **single record**
is:

\[

J_i = J(\mathbf{\hat{y}}_i, \mathbf{y}_i) = – \mathbf{y}_i \cdot log(\mathbf{\hat{y}}_i)

\]

The cost of all the training samples is just the average over all \(m\) samples:

\[

J = \frac{1}{m} \sum_i^m J_i = – \frac{1}{m} \sum_i^m [\mathbf{y}_i \cdot log(\mathbf{\hat{y}}_i)]

\]

For good measure, let’s also add the **regularization term**:

\[

J = – \frac{1}{m} \sum_i^m [\mathbf{y}_i

\cdot log(\mathbf{\hat{y}}_i)] + \frac{\lambda}{m}

\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} {\theta_{j,i}^{(l)}}^2

\]

where \(\lambda\) is the **regularization parameter**. Note that the nested summations in the above equation are just summing up all (squared) weights in all layers, and biases are not included in the regularization term.

## Computations happening inside the network – the backward propagation

Training of a neural network classifier is a **supervised learning** process, where
the model makes a prediction, which is compared with the true value
(a.k.a **label**), and a correction to the model is made accordingly, if
needed. The prediction is achieved using the **forward propagation**,
and the correction is done in a process called **back-propagation**.
This forward-backward propagation process is repeated multiple times
until a desired skill level of the model is achieved.

Intuitively, the back-propagation finds the error made by the model by comparing the final prediction \(\mathbf{\hat{y}}\) with the label \(\mathbf{y}\): \(\mathbf{\delta} \equiv \mathbf{\hat{y}} – \mathbf{y}\). This error is then propagated backwards from the output layer to hidden layer(s).

This backward error propagation is done because we need to compute the **gradients of the weights and biases**. Recall that training a model is the same process of adjusting its parameters such that the predictions fit the expected values better. For a neural network, its parameters are the weight matrices and bias vectors associated with the layers.

The **gradients of weights** are formally defined as \(\frac{\partial J}{\partial \mathbf{\Theta}}\), and **gradients of biases** are defined as \(\frac{\partial J}{\partial \mathbf{b}}\).

Next we give (without proof) the 2 formulae describing the back-propagation process. More details of the derivation will be covered in a separate post.

Starting from the output layer. The error of the output layer is simply:

\[

\mathbf{\delta}^{(L-1)} = \mathbf{\hat{y}} – \mathbf{y}

\]

**(1)**

From layer \(L-1\) backwards, the error that gets propagated from layer \(l\) to \(l-1\) is:

\[

\begin{equation}

\mathbf{\delta}^{(l-1)} = {\mathbf{\Theta}^{(l)}}^T \cdot

\mathbf{\delta}^{(l)} \odot g'(\mathbf{z}^{(l-1)})

\end{equation}

\]

**(2)**

where:

- \(\mathbf{\delta}^{(l)}\) is the error of layer \(l\). More formally, this is defined as \(\mathbf{\delta}^{(l)} \equiv \frac{\partial J}{\partial \mathbf{z}^{(l)}}\).
- \(\odot\) is the
**Hadamard product**, a.k.a**element-wise product**. - \(g'()\) is the derivative of the activation function.

One can give a quick check of the dimensions in each term:

- \(\mathbf{\delta}^{(l-1)}\): same as \(\mathbf{a}^{(l-1)}\): \((s_{l-1} \times 1)\).
- \({\mathbf{\Theta}^{(l)}}^T\): \((s_{l-1} \times s_l)\).
- \(\mathbf{\delta}^{(l)}\): same as \(\mathbf{a}^{(l)}\): \((s_l \times 1)\).
- \(g'(\mathbf{z}^{(l-1)})\): same as \(\mathbf{z}^{(l-1)}\): \((s_{l-1} \times 1)\).

It is trivial to see that dimensions on both side match.

The 2nd formula involves the computation of gradients in each layer:

\[

\left\{\begin{matrix}

\frac{\partial J}{\partial \mathbf{\Theta}^{(l)}} = & \mathbf{\delta}^{(l)} \cdot {\mathbf{a}^{(l-1)}}^{T} \\

\frac{\partial J}{\partial \mathbf{b}^{(l)}} = & \mathbf{\delta}^{(l)}

\end{matrix}\right.

\]

**(3)**

where:

- \(\frac{\partial J}{\partial \mathbf{\Theta}^{(l)}}\) is the gradients of cost function wrt the weights associated with layer \(l\).
- \(\frac{\partial J}{\partial \mathbf{b}^{(l)}}\) is the gradients of cost function wrt the biases associated with layer \(l\).

These gradients can be used to update the parameters using, for instance, a **gradient descent** scheme.

Notice that \(\mathbf{\Theta}^{(l)}\) is the weights mapping from layer \(l-1\) to \(l\). Therefore, a mnemonic for the above equation is:

gradients of the weights (\(\partial J/ \partial \Theta\)) is the product of the inputs into the weights (\(a^{(l-1)}\)) and the error out from the weights (\(\delta^{(l)}\)).

## Gradient descent update

We will be using the basic gradient descent update scheme for this one. There are various derivations aimed at improving the convergence process, and some of them will be introduced in a later post.

The formula for updating the layer weights is:

\[

\mathbf{\Theta} := \mathbf{\Theta} – \frac{\alpha}{m} \sum_i^m \frac{\partial J_i}{\partial

\mathbf{\Theta}} – \frac{\alpha \lambda}{m} \mathbf{\Theta}

\]

The formula for updating the layer biases is:

\[

\mathbf{b} := \mathbf{b} – \frac{\alpha}{m} \sum_i^m \frac{\partial J_i}{\partial \mathbf{b}}

\]

where \(\alpha\) is the **learning rate**, \(\lambda\) the **regularization parameter**.

## Recap on the notations

It might be worthwhile giving a recap on the notations introduced so far:

- \(L\): number of layers in a neural network.
- \(l\): index of a layer. \(l = 0, 1, \cdots, L-1\).
- \(s_l\): number of neurons/units in layer \(l\).
- \(\theta_{j,i}^{(l)}\): weight that maps from neuron \(i\) in layer \(l-1\) to neuron \(j\) in layer \(l\).
- \(b_{j}^{(l)}\): bias added onto neuron \(j\) in layer \(l\).
- \(z_{j}^{(l)}\): weighted sum of neuron \(j\) in layer \(l\).
- \(a_{j}^{(l)}\): activation of neuron \(j\) in layer \(l\).
- \( \mathbf{\Theta}^{(l)} \) : weight matrix that maps from layer \(l-1\) to layer \(l\).
- \(\mathbf{b}^{(l)}\): bias vector added onto layer \(l\).
- \(\mathbf{z}^{(l)}\): weighted sum vector of layer \(l\).
- \(\mathbf{a}^{(l)}\): activation vector of layer \(l\).
- \(g()\): activation function.
- \(\mathbf{\hat{y}}\): vector of model prediction.
- \(\mathbf{y}\): vector of label.
- \(\mathbf{\delta}^{(l)}\): vector of error in layer \(l\).
- \(J\): cost function value.
- \(\odot\): Hadamard product, a.k.a element-wise product.

# numpy implementation

This section introduces the Python implementation of the above computations. The results will be used as building blocks of the final neural network.

## Softmax function

The **softmax** function can be easily implemented in `numpy`

as:

def softmax(x): return np.exp(x) / np.sum(np.exp(x))

Alternatively, there is also a function in `scipy`

: `scipy.special.softmax`

.

## Logistic function and its derivative

def logistic(x): return 1 / (1 + np.exp(-x))

Alternatively, there is also a function in `scipy`

: `scipy.special.expit`

.

The derivative of logistic is:

def dlogistic(x): '''Derivative of the logistic function''' return np.exp(-x)/(1 + np.exp(-x))**2

## Cross-entropy function

def crossEntropy(yhat, y): '''Cross entropy cost function ''' eps = 1e-10 yhat = np.clip(yhat, eps, 1-eps) return - np.nansum(y*np.log(yhat))

This is basically a literal translation of the mathematical equation
into Python language, with the only exception that a small value
(`eps`

) is added to the `yhat`

term to avoid infinities from the
logarithm function when `yhat = 0`

, and the same `eps`

is subtracted from 1 just to
make it symmetrical.

## Cost function

The cost consists of 2 parts:

- errors of the model contributed by training samples. This is
computed in a
`sampleCost()`

function. - penalties by the regularization term. This is computed in a
`regularizationCost()`

function.

def samplecost(self, yhat, y): '''cost of a single training sample/batch args: yhat (ndarray): prediction in shape (n, m). m is the number of final output units, n is the number of records. y (ndarray): label in shape (n, m). returns: cost (float): summed cost. ''' return self.cost_func(yhat, y) def regulizationcost(self): '''cost from the regularization term defined as the summed squared weights in all layers, not including biases. ''' j = 0 for ll in range(1, self.n_layers): j = j+np.sum(self.thetas[ll]**2) return j

Here, `self.cost_func`

is just the `crossEntropy()`

function
defined above. The latter is passed in to the constructor of the class
such that it is more flexible to switch to a different cost function
definition should one decides to.

## Forward propagation

The forward propagation consists of 2 equations, each corresponding to basically 1 line of code:

z = np.dot(a, thetas.T) + b a = activation(z)

This should be fairly self-explanatory. However, in practice, some extra house-keeping and structuring are needed. The full function can be implemented as follows:

def feedForward(self, x): '''Forward pass Args: x (ndarray): input data with shape (n, f). f is the number of input units, n is the number of records. Returns: weight_sums (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: weighted sums of each layer: z^{(l+1)} = a^{(l)} \cdot \theta^{(l+1)}^{T} + b^{(l+1)} where: z^{(l+1)}: weighted sum in layer l+1. a^{(l)}: activation in layer l. \theta^{(l+1)}: weights that map from layer l to l+1. b^{(l+1)}: biases added to layer l+1. The value for key=0 is the same as input <x>. activations (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: activations in each layer. See above. ''' x = self.force2D(x) activations = {0: x} weight_sums = {0: x} a1 = x for ii in range(1, self.n_layers): bii = self.biases[ii] zii = np.dot(a1, self.thetas[ii].T)+bii if ii == self.n_layers-1: aii = self.af_last(zii) else: aii = self.af(zii) activations[ii] = aii weight_sums[ii] = zii a1 = aii return weight_sums, activations

Firstly, we are going to assume (and therefore require) that the input
data records are arranged in rows. That is to say, the input variable `x`

is
an `ndarray`

of shape `(m, f)`

, where `m`

being the number of records,
and `f`

the number of features (in the case of MNIST data, `f = 784`

).

To enforce this rule on a single data record with a shape of `(784,)`

,
we need to reshape it into 2d, like this:

```
x = np.atleast_2d(x)
```

Then `x.shape`

will be `(1, 784)`

.

For inputs with more than 2 records, e.g. `(4, 784)`

, `atleast_2d(x)`

will have no effects.

It is noticed that the activations and weighted sums of the layers are
used during the back-propagation process (see Equation
(2),(3)). Therefore, it is useful to keep a record of these variables
during the forward propagation process. These can be stored in a
Python **dictionary** using the layer indices as keys. E.g.

```
activations = {0: x}
weight_sums = {0: x}
```

Here, the weighted sum and activation of the 1st layer (indexed 0) are
both the input array `x`

itself.

The `self`

keyword implies that this is a **class method**, as
eventually we are making it into a **NeuralNetwork** class. The `self.biases`

**class attribute** is a dictionary storing the layer biases, similarly
for the `self.thetas`

which stores the layer weights. E.g. `self.biases[1]`

is the bias array for
layer-1, `self.thetas[4]`

is the weight matrix for layer-4.

It is also noticed that hidden layers use logistic as the activation
function, and the output layer uses softmax. A simple `if`

condition
is used to make this distinction.

## Back-propagation

Back-propagation consists 2 equations as well. Inputs for this process are the label data \(\mathbf{y}\), and weighted sums \(\mathbf{z}\) and activations \(\mathbf{a}\) from forward propagation. Outputs from back-propagation are the gradients for weights and biases.

The full function can be implemented as follows:

def feedBackward(self, weight_sums, activations, y): '''Backward propogation Args: weight_sums (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: weighted sums of each layer. activations (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: activations in each layer. y (ndarray): label in shape (m, n). m is the number of final output units, n is the number of records. Returns: grads (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: summed gradients for the weight matrix in each layer. grads_bias (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: summed gradients for the bias in each layer. ''' grads = {} # gradients for weight matrices grads_bias = {} # gradients for bias y = self.force2D(y) delta = activations[self.n_layers-1] - y for jj in range(self.n_layers-1, 0, -1): grads[jj] = np.einsum('ij,ik->jk', delta, activations[jj-1]) grads_bias[jj] = np.sum(delta, axis=0, keepdims=True) if jj != 1: delta = np.dot(delta, self.thetas[jj])*self.daf(weight_sums[jj-1]) return grads, grads_bias

As before, we enforce 2d shape for the `y`

data using `force2D()`

.

`grads`

and `grads_bias`

are 2 dictionaries, storing the resultant
gradient arrays, again using layer indices as keys.

The error of the model \(\delta^{(L-1)}\) is simply:

```
delta = activations[self.n_layers-1] - y
```

Then we move backwards from `self.n_layers-1`

to `1`

in a `for`

loop.

The 1st line inside the `for`

loop computes the gradients of
weights. I used `np.einsum()`

for this to **vectorize the computation**.

NOTE that the `delta`

variable corresponds to the \(\mathbf{\delta}^{(l)}\) term in Equation (3), and is of shape \((m \times s_{l})\), where \(m\) is the number of records. `activations[jj-1]`

is the \(\mathbf{a}^{(l-1)}\) term, with shape \((m \times s_{l-1})\). Equation (3) describes the computation for **a single record**, but here we are dealing with **m records** simultaneously. This **vectorization** can help enhance the computational efficiency by a considerable amount.

The `einsum()`

function is **Einstein summation** function, using a convention
also known as **indicial notation** sometimes seen in mathematical
derivations. In this case, it gives the same results as
`np.dot(delta.T, activations[jj-1])`

. I have noticed that even when
the results are identical, `einsum()`

can be faster by 3-4 times than
the `dot()`

function. Therefore, it is preferable to replace all
`np.dot()`

and `np.sum()`

calls with `einsum()`

to achieve better
performance.

The 2nd line computes the **summed** gradients of biases. The summation
is done across the `m`

records. We will be dividing this by `m`

later, and the
`grads[jj]`

variable as well, to make them into averages.

Lastly, the error is propagated backwards using

delta = np.dot(delta, self.thetas[jj])*self.daf(weight_sums[jj-1])

where `self.daf`

is the derivative of activation function, which in
this case is the derivative of logistic.

## Gradient descent update

The gradient descent update is done separately for the weights and biases in each layer. The full function can be implemented as follows:

def gradientDescent(self, grads, grads_bias, n): '''Perform gradient descent parameter update Args: grads (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: summed gradients for the weight matrix in each layer. grads_bias (dict): keys: layer indices starting from 0 for the input layer to N-1 for the last layer. Values: summed gradients for the bias in each layer. n (int): number of records contributing to the gradient computation. Update rule: \theta_i = \theta_i - \alpha (g_i + \lambda \theta_i) where: \theta_i: weight i. \alpha: learning rate. \g_i: gradient of weight i. \lambda: regularization parameter. ''' n = float(n) for jj in range(1, self.n_layers): theta_jj = self.thetas[jj] theta_jj = theta_jj - self.lr * \ grads[jj]/n - theta_jj*self.lr*self.lam/n self.thetas[jj] = theta_jj bias_jj = self.biases[jj] bias_jj = bias_jj - self.lr * grads_bias[jj]/n self.biases[jj] = bias_jj return

Note that `n`

is the number of records contributing to the **summed**
gradients in `grads`

and `grads_bias`

.
The averages are computed here, rather than during the
gradient computation process inside the `feedBackward()`

function.

This decision is primarily because that training of a network can be
performed in different modes. In **stochastic training**, in each
training iteration the gradients
are computed from **a single** training record and parameters updated
immediately. For stochastic training, `n = 1`

, so the summation and
the average are the same thing.

In **batch training**, the gradients are computed from **all available**
training records, and `n = m`

.

In **mini-batch training**, the gradients are computed from **a subset**
of all available training records, and `n = m`

(these 2 `m`

are not
necessarily the same value).

Therefore, for batch and mini-batch training modes, it is easier to
accumulate the gradients during the back-propagation process and do
the averaging over `n`

later.

These 3 training modes are implemented in the following sections.

## Stochastic training

During each **epoch** of a stochastic training process, these steps are
done:

- randomly select
**1 record**from the training dataset: \((\mathbf{x}_i, \mathbf{y}_i)\). Note that this random selection is**random sampling without replacement**, meaning that already sampled records will not be sampled again. - make a model prediction using the sampled record by a forward propagation, giving \(\mathbf{\hat{y}}_i\).
- compute the error \(\mathbf{\delta} = \mathbf{\hat{y}}_i – \mathbf{y}_i\), and perform back-propagation to get the gradients \(\partial J/\partial \mathbf{\Theta}\) and \(\partial J/\partial

\mathbf{b}\) for all the layers. - perform gradient descent using the gradients.
- back to step
`1`

, until all training samples have been iterated through.

The full function can be implemented as follows:

def stochasticTrain(self, x, y, epochs): '''Stochastic training Args: x (ndarray): input with shape (n, f). f is the number of input units, n is the number of records. y (ndarray): input with shape (n, m). m is the number of output units, n is the number of records. epochs (int): number of epochs to train. Returns: self.costs (ndarray): overall cost at each epoch. ''' self.costs = [] m = len(x) for ee in range(epochs): idxs = np.random.permutation(m) for i, ii in enumerate(idxs): xii = x[[ii]] yii = y[[ii]] weight_sums, activations = self.feedForward(xii) grads, grads_bias = self.feedBackward( weight_sums, activations, yii) self.gradientDescent(grads, grads_bias, 1) yhat = self.predict(x) j = self.sampleCost(yhat, y) j2 = self.regulizationCost() j = j/m + j2*self.lam/m print('# <stochasticTrain>: Epoch = %d, Cost = %f' % (ee, j)) # annealing if len(self.costs) > 1 and j > self.costs[-1]: self.lr *= 0.9 print('# <stochasticTrain>: Annealing learning rate, current lr =', self.lr) self.costs.append(j) self.costs = np.array(self.costs) return self.costs

NOTE that after each epoch, we compute the cost function and
store all the costs into an array. This `costs`

array is useful to
monitor the training process, by for instance plotting the curve of
costs against epochs.

## Batch and mini-batch training

During each **epoch** of a batch or mini-batch training process, these steps are
done:

- randomly select
`m`

**records**(\(m > 1\)) from the training dataset: \(\{(\mathbf{x}_i, \mathbf{y}_i), \; i = 1, 2, \cdots, m\}\). Note that this**random selection is random sampling without replacement**, meaning that already sampled records will not be sampled again. - make a model prediction using each sampled record by a forward propagation, giving \(\mathbf{\hat{y}}_i\).
- compute the error \(\mathbf{\delta}_i = \mathbf{\hat{y}}_i – \mathbf{y}_i\), and perform back-propagation to get the gradients \(\partial J/\partial \mathbf{\Theta}\) and \(\partial J/\partial \mathbf{b}\) for all the layers.
- compute the averaged gradients stemmed from all \(m\) samples, and use that to perform gradient descent using the gradients.
- back to step
`1`

, until all training samples have been iterated through.

Notice that the only difference between a batch training and a mini-batch training is that in **batch training**, \(m = M\), the number of all available samples, while in **mini-batch training**, \(m = k\), a term called batch size.

In view of such a similarity, a separate function is created and used as the core of both batch and mini-batch training modes. This function can be implemented as follows:

def _batchTrain(self, x, y): '''Training using a single sample or a sample batch Args: x (ndarray): input with shape (n, f). f is the number of input units, n is the number of records. y (ndarray): input with shape (n, m). m is the number of output units, n is the number of records. Returns: j (float): summed cost over samples in <x> <y>. Slow version, use the sample one by one ''' m = len(x) grads = dict([(ll, np.zeros_like(self.thetas[ll])) for ll in range(1, self.n_layers)]) grads_bias = dict([(ll, np.zeros_like(self.biases[ll])) for ll in range(1, self.n_layers)]) j = 0 # accumulates cost idx = np.random.permutation(m) for ii in idx: xii = x[ii] yii = y[ii] weight_sums, activations = self.feedForward(xii) j = j+self.sampleCost(activations[self.n_layers-1], yii) gradii, grads_biasii = self.feedBackward(weight_sums, activations, yii) for ll in range(1, self.n_layers): grads[ll] = grads[ll]+gradii[ll] grads_bias[ll] = grads_bias[ll]+grads_biasii[ll] self.gradientDescent(grads, grads_bias, m) return j

Notice that for a given batch size `m`

, we are iterating one-by-one of
the `m`

samples, each time performing a forward pass and a backward
pass. The gradients are accumulated across the `m`

samples.

This is a faithful implementation of the algorithm, but not
necessarily an efficient one. Recall that when introducing the forward
propagation, we already mentioned the benefits of **vectorized
computations**. Similarly, the batch training function above can be
vectorized, as follows:

def _batchTrain2(self, x, y): '''Training using a single sample or a sample batch Args: x (ndarray): input with shape (n, f). f is the number of input units, n is the number of records. y (ndarray): input with shape (n, m). m is the number of output units, n is the number of records. Returns: j (float): summed cost over samples in <x> <y>. Vectorized version. About 3x faster than _batchTrain() ''' m = len(x) weight_sums, activations = self.feedForward(x) grads, grads_bias = self.feedBackward(weight_sums, activations, y) self.gradientDescent(grads, grads_bias, m) j = self.sampleCost(activations[self.n_layers-1], y) return j

NOTE that in the vectorized version, we are no longer making a `for`

loop through the selected sample batch. The code looks cleaner, and
actually runs faster. My preliminary tests suggest that
`_batchTrain2()`

is about 3 times faster than `_batchTrain()`

.

With the core of batch and mini-batch update ready, let’s create a wrapper to get the function for batch training and min-batch training.

The **batch training** function can be implemented as:

def batchTrain(self, x, y, epochs): '''Training using all samples Args: x (ndarray): input with shape (n, f). f is the number of input units, n is the number of records. y (ndarray): input with shape (n, m). m is the number of output units, n is the number of records. epochs (int): number of epochs to train. Returns: self.costs (ndarray): overall cost at each epoch. ''' self.costs = [] m = len(x) for ee in range(epochs): j = self._batchTrain2(x, y) j2 = self.regulizationCost() j = j/m + j2*self.lam/m print('# <batchTrain>: Epoch = %d, Cost = %f' % (ee, j)) # annealing if len(self.costs) > 1 and j > self.costs[-1]: self.lr *= 0.9 print('# <batchTrain>: Annealing learning rate, current lr =', self.lr) self.costs.append(j) self.costs = np.array(self.costs) return self.costs

Again, after each epoch we compute the cost and store its value.

Additionally, if we notice that the cost function `j`

increases
after an epoch, we shrink the learning rate by a factor of `0.9`

. This
trick is sometimes called **annealing**. An increasing cost could be the
result of the learning rate being too large. There are other ways to
dynamically adjust the learning rate during training, and they are not
covered in this post.

The **mini-batch** training function can be implemented as:

def miniBatchTrain(self, x, y, epochs, batch_size): '''Training using mini batches Args: x (ndarray): input with shape (n, f). f is the number of input units, n is the number of records. y (ndarray): input with shape (n, m). m is the number of output units, n is the number of records. epochs (int): number of epochs to train. batch_size (int): mini-batch size. Returns: self.costs (ndarray): overall cost at each epoch. ''' self.costs = [] m = len(x) for ee in range(epochs): j = 0 idx = np.random.permutation(m) id1 = 0 while True: id2 = id1+batch_size id2 = min(id2, m) idxii = idx[id1:id2] xii = x[idxii] yii = y[idxii] j = j+self._batchTrain2(xii, yii) id1 += batch_size if id1 > m-1: break j2 = self.regulizationCost() j = j/m + j2*self.lam/m print('# <miniBatchTrain>: Epoch = %d, Cost = %f' % (ee, j)) # annealing if len(self.costs) > 1 and j > self.costs[-1]: self.lr *= 0.9 print('# <miniBatchTrain>: Annealing learning rate, current lr =', self.lr) self.costs.append(j) self.costs = np.array(self.costs) return self.costs

Different from the **batch training**, we are not passing all
the samples to the `_batchTrain2()`

function in one go, but in a
number of mini-batches, each time with a size of `batch_size`

.

Again, costs after epochs are recorded and annealing of the learning rate is performed.

## Save and load the parameters

Complex model can be difficult to train, and it might be desirable to split the entire training process into multiple sessions. At the end of a training session, the current model parameters are saved to disk. Before the next session starts, the previous parameters are loaded and training starts from there.

These saving and loading functions can be implemented as:

def saveWeights(self, outfilename): '''Save model parameters to file Args: outfilename (str): absolute path to file to save model parameters. Parameters are saved using numpy.savez(), loaded using numpy.load(). ''' print('\n# <saveWeights>: Save network weights to file', outfilename) dump = {'lr': self.lr, 'n_nodes': self.n_nodes, 'lam': self.lam} for ll in range(1, self.n_layers): dump['theta_%d' % ll] = self.thetas[ll] dump['bias_%d' % ll] = self.biases[ll] np.savez(outfilename, **dump) return def loadWeights(self, abpathin): '''Load model parameters from file Args: abpathin (str): absolute path to file to load model parameters. Parameters are saved using numpy.savez(), loaded using numpy.load(). ''' print('\n# <saveWeights>: Load network weights from file', abpathin) with np.load(abpathin) as npzfile: for ll in range(1, self.n_layers): thetall = npzfile['theta_%d' % ll] biasll = npzfile['bias_%d' % ll] self.thetas[ll] = thetall self.biases[ll] = biasll self.lam = npzfile['lam'] self.lr = npzfile['lr'] return

## Others

The above has covered all the key elements of our `NeuralNetwork`

class. There are a few miscellaneous points.

### 1. parameter initialization

It is recommended that for logistic activation functions, the weights of a neural network are initialized using **xavier initialization** or **normalized** **xavier initialization**. Specifically, the weights \(\mathbf{\Theta}\) are initialized from random variables of a **uniform distribution** within \([ – \sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}]\), where \(n_{in}\) is the number of inputs to the weights, and \(n_{out}\) the number of outputs. For instance, the weights for the **1st hidden layer** in our MNIST classification model is initialized using

\[

\Theta^{(1)} \sim U[- \frac{\sqrt{6}}{\sqrt{ 784 + 200}}, \frac{\sqrt{6}}{\sqrt{ 784 + 200}}]

\]

Such an initialization can be implemented as:

def init(self): self.thetas = {} # theta_l is mapping from layer l-1 to l self.biases = {} # bias_l is added to layer l when mapping from layer l-1 to l for ii in range(1, self.n_layers): if self.init_func == 'xavier': stdii = np.sqrt(6/(self.n_nodes[ii-1]+self.n_nodes[ii])) thetaii = (np.random.rand(self.n_nodes[ii], self.n_nodes[ii-1]) - 0.5) * stdii elif self.init_func == 'he': stdii = np.sqrt(2/self.n_nodes[ii-1]) thetaii = np.random.normal(0, stdii, size=(self.n_nodes[ii], self.n_nodes[ii-1])) self.thetas[ii] = thetaii self.biases[ii] = np.zeros([1, self.n_nodes[ii]])

NOTE that also included is the implementation of the **HE
initialization**, which is recommended for models using the **ReLU** activation function.

### 2. Visualize the network

We give without further explanations a function to plot out the
neurons in a network, using the **networkx** module.

def plotNN(self): '''Plot structure of network''' import networkx as nx self.graph = nx.DiGraph() show_nodes_half = 3 layered_nodes = [] for ll in range(self.n_layers-1): # layer l nodesll = list(zip([ll, ] * self.n_nodes[ll], range(self.n_nodes[ll]))) # layer l+1 nodesll2 = list(zip([ll+1, ]*self.n_nodes[ll+1], range(self.n_nodes[ll+1]))) # omit if too many if len(nodesll) > show_nodes_half*3: nodesll = nodesll[:show_nodes_half] + nodesll[-show_nodes_half:] if len(nodesll2) > show_nodes_half*3: nodesll2 = nodesll2[:show_nodes_half] + nodesll2[-show_nodes_half:] # build network edges = [(a, b) for a in nodesll for b in nodesll2] self.graph.add_edges_from(edges) layered_nodes.append(nodesll) if ll == self.n_layers-2: layered_nodes.append(nodesll2) # -----------------Adjust positions----------------- pos = {} for ll, nodesll in enumerate(layered_nodes): posll = np.array(nodesll).astype('float') yll = posll[:, 1] if self.n_nodes[ll] > show_nodes_half*3: bottom = yll[:show_nodes_half] top = yll[-show_nodes_half:] bottom = bottom/np.ptp(bottom)/3. top = (top-np.min(top))/np.ptp(top)/3.+2/3. yll = np.r_[bottom, top] else: yll = yll/np.ptp(yll) posll[:, 1] = yll pos.update(dict(zip(nodesll, posll))) #-------------------Draw figure------------------- fig = plt.figure() ax = fig.add_subplot(111) nx.draw(self.graph, pos=pos, ax=ax, with_labels=True) plt.show(block=False) return

The figure below shows the visualization of our MNIST classifier network. Note that when a layer has too many neurons, only a small subset is drawn.

## Complete class definition

The complete code for our `NeuralNetwork`

class can be found in the
`classifier/multiclassnn.py`

file in this Github repo.

# Application on MNIST data

The MNIST data in csv format are obtained from:
http://pjreddie.com/projects/mnist-in-csv/. The training set consists of **60,000** records, and the test set **10,000**. The function below reads
in these data and format into `ndarray`

.

def readData(path): '''Read csv formatted MNIST data Args: path (str): absolute path to input csv data file. Returns: x_data (ndarray): mnist input array in shape (n, 784), n is the number of records, 784 is the number of (flattened) pixels (28 x 28) in each record. y_data (ndarray): mnist label array in shape (n, 10), n is the number of records. Each label is a 10-element binary array where the only 1 in the array denotes the label of the record. ''' data_file = pd.read_csv(path, header=None) x_data = [] y_data = [] for ii in range(len(data_file)): lineii = data_file.iloc[ii, :] imgii = np.array(lineii.iloc[1:])/255. labelii = np.zeros(10) labelii[int(lineii.iloc[0])] = 1 y_data.append(labelii) x_data.append(imgii) x_data = np.array(x_data) y_data = np.array(y_data) return x_data, y_data

NOTE that `pandas`

is used for csv file reading, but it is not
essential, the plain `csv`

can manage just as well.

The script below reads in the training data, build the neural network,
train the model using mini-batch mode for **50 epochs**, with a **batch-size of 64**. Then the cost
function curve is plotted against epochs, and the model is tested
against the test dataset.

Finally, prediction accuracy for the training and test datasets are printed out.

Code here:

N_INPUTS = 784 # input number of units N_HIDDEN = [200, 100, 50] # hidden layer sizes N_OUTPUTS = 10 # output number of units LEARNING_RATE = 0.1 # initial learning rate LAMBDA = 0.01 # regularization parameter EPOCHS = 50 # training epochs # -----------------Open MNIST data----------------- x_data, y_data = readData(TRAIN_DATA_FILE) print('x_data.shape:', x_data.shape) print('y_data.shape:', y_data.shape) # ------------------Create network------------------ nn = neuralNetwork(N_INPUTS, N_HIDDEN, N_OUTPUTS, LEARNING_RATE, af_last=softmax_a, lam=LAMBDA) # -----------------Mini-batch train----------------- costs = nn.miniBatchTrain(x_data, y_data, EPOCHS, 64) # -----------Plot a graph of the network----------- fig, ax = plt.subplots() ax.plot(costs, 'b-o') ax.set_xlabel('epochs') ax.set_ylabel('Cost') fig.show() nn.saveWeights('./weights.npz') # ---------------Validate on train set--------------- yhat_train = np.argmax(nn.predict(x_data), axis=1) y_true_train = np.argmax(y_data, axis=1) n_correct_train = np.sum(yhat_train == y_true_train) print('Training samples:') n_test_train = 20 for ii in np.random.randint(0, len(y_data)-1, n_test_train): yhatii = yhat_train[ii] ytrueii = y_true_train[ii] print('yhat = %d, yii = %d' % (yhatii, ytrueii)) # ------------------Open test data------------------ x_data_test, y_data_test = readData(TEST_DATA_FILE) # ---------------test on test set--------------- yhat_test = np.argmax(nn.predict(x_data_test), axis=1) y_true_test = np.argmax(y_data_test, axis=1) n_correct_test = np.sum(yhat_test == y_true_test) print('Test samples:') n_test = 20 for ii in np.random.randint(0, len(y_data_test)-1, n_test): yhatii = yhat_test[ii] ytrueii = y_true_test[ii] print('yhat = %d, yii = %d' % (yhatii, ytrueii)) print('Accuracy in training set:', n_correct_train/float(len(x_data))*100.) print('Accuracy in test set:', n_correct_test/float(len(x_data_test))*100.)

The figure below shows the changes of cost with epochs:

After 50 epochs of training, the accuracy on training set is: **98.41 %**.
Accuracy on test set is: **96.71 %**. Not too shabby.

Let’s also create a few plots showing the prediction results:

nrows = 3 ncols = 4 n_plots = nrows*ncols fig = plt.figure(figsize=(12, 10)) for i in range(n_plots): axi = fig.add_subplot(nrows, ncols, i+1) axi.axis('off') xi = x_data_test[i] yi = y_data_test[i] yhati = yhat_test[i] plotResult(xi, yi, yhati, ax=axi) fig.show()

It seems that at least for the first 12 test casts, our model gets them all correct. See figure below.

Lastly, the training process cost me **345** seconds, about **6 minutes**. I was
using a fairly beefy **8 core AMD Ryzen 3700X** desktop CPU. On a laptop
it would take a bit longer. But the performance is pretty good
considering that the entire thing uses nothing but `numpy`

.

# Wrap up

In this post, we introduced the multi-class classification neural
network. The model is trained using forward and backward
propagation. A softmax function is used as the output layer
activation, and hidden layers use logistic function as
activation. Vectorized implementation allows our simple `numpy`

implementation to achieve a fairly good performance.

Many of the concepts and notation are common among different types of
networks. In a future post, I will share some knowledge and
experiences in building a **convolutional neural network** using
`numpy`

.

# Support me

If you like the contents, please consider sending a small tip. Below this my monero wallet address:

41tR6ku6uiDA41awW1UgVD9DoJ1PxYQsMa1iQjM68ekHGTee5wgEfgLDXh7QZ3cZZnRnVuC5nGeY1Na28ZJygrHx3JeKBV7