The goal
This post shares some basic knowledge of artificial neural networks and shows how to create one from scratch using only numpy. As a concrete example, we will build a classification network and use it to classify hand-written digits from the MNIST dataset.
We will start by introducing the general idea and structure of the network, along with the mathematical notation. The computations inside the network are introduced by first showing the math equations, followed by their numpy implementations in the second part of the post. If a particular step can be optimized, for instance by vectorizing the computations, this is covered as well. In the end, the network is trained on the MNIST training dataset, and its performance is evaluated using MNIST's test data. The final, complete code is given at this simple-nn Github repo.
Computations in the network and notations
Layers in the network
The network we will be building consists of 5 layers:
1. input layer (layer-0): this is the layer that feeds the input data into the network.
The number of neurons (or units) in the input layer should be the same as the size of the input data. In this particular case, the input data are images of MNIST hand-written digits. Each image is 28 pixels x 28 pixels in size, and we flatten the image into a single array of length 28 x 28 = 784. Therefore, the input layer has 784 neurons.
We denote the size of layer \(l\) as \(s_l\). Therefore, \(s_0=784\).
Note that we are indexing the layer number starting from 0, i.e. the 1st layer has index 0, the 2nd has index 1, and so on. This is mostly arbitrary; in some places people label the input layer as layer-1. Using 0-based indexing is just slightly more convenient when working with Python, whose array indices also start from 0.
2. 3 hidden layers (layer-1, layer-2 and layer-3): these layers have sizes:
- layer-1: 200 neurons, \(s_1=200\).
- layer-2: 100 neurons, \(s_2=100\).
- layer-3: 50 neurons, \(s_3=50\).
NOTE that these are also arbitrary choices. We are not going to compare different design choices in this post. It is totally possible that a simpler or different design would work just as well or even better.
3. output layer (layer-4): this is the layer that outputs the final result of the network. The final result is often termed the prediction of the model. It is conventional to label the prediction as \(\hat{y}\) (or \(\mathbf{\hat{y}}\)).
The task of recognizing hand-written digits from images is a multi-class classification problem: there are a total of 10 classes, corresponding to the digits 0 to 9. For such tasks, it is common to formulate the prediction using the so-called one-hot encoding. For instance, the label for digit 0 is defined as
\[
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
\]
that for digit 1 as
\[
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
\]
and similarly, digit 5 is encoded as:
\[
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
\]
More generally, class \(i\) in a K-class classification task is represented by a length-K vector whose i-th element is 1 and all other elements are 0.
Therefore, the output layer has 10 neurons: \(s_{4}=10\).
To summarize this section:
- The network has multiple layers: 1 input layer, 1 output layer, and all layers in-between, which are called hidden layers.
- The number of layers in the network is denoted \(L\). The number of neurons/units in layer \(l\) is denoted as \(s_l\). In this case, \(L=5\), \(s_0 = 784\), \(s_1 = 200\), \(s_2 = 100\), \(s_3 = 50\), and \(s_4 = 10\). The input and output layer sizes are determined by the specific problem to be solved, and hidden layer sizes are design choices.
- one-hot encoding is used for the digit classification task.
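As a quick illustration (not part of the final network code), one-hot encoding can be produced with a couple of lines of numpy; a minimal sketch:

import numpy as np

def one_hot(label, n_classes=10):
    '''Return a length-<n_classes> vector with a 1 at position <label>'''
    vec = np.zeros(n_classes)
    vec[label] = 1
    return vec

print(one_hot(5))    # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]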
Computations happening inside the network – the forward propagation
Information gets passed from the input layer (layer-0) to the 1st hidden layer (layer-1), then to all subsequent layers, one layer at a time.
From layer-0 to layer-1, each neuron \(j\) in layer-0 is connected to each neuron \(i\) in layer-1. Therefore, there are a total of \(s_1 \times s_0 = 200 \times 784 = 156800\) connections. That is quite a lot of connections, which is why such fully-connected models are also called dense networks.
Similarly, from layer-1 to layer-2, there are \(s_2 \times s_1 = 100 \times 200 = 20000\) connections.
More generally, from layer \(l\) to layer \(l+1\), the number of connections is \(s_{l+1} \times s_{l}\).
What happens along these connections is a linear operation. More specifically, each neuron \(i\) in layer \(l\) sends its value, termed \(a_{i}^{(l)}\) (where \(a\) stands for activation), to each neuron \(j\) in the next layer. Neuron \(j\) collects these contributions into a weighted sum:
\[
z_{j}^{(l+1)} = \sum_{i=1}^{s_l} a_{i}^{(l)} \theta_{j,i}^{(l+1)} + b_{j}^{(l+1)}
\]
where:
- \(\theta_{j,i}^{(l+1)}\) is a weight that is associated with the connection between these 2 neurons. The weight is multiplied with the input activation value \(a_{i}^{(l)}\).
- \(b_{j}^{(l+1)}\) is a bias term added onto the product.
- \(z_{j}^{(l+1)}\) is the result of the linear operation. Intuitively, this is what a next-layer neuron receives from its upstream inputs.
After getting \(z_{j}^{(l+1)}\), the neuron gets “activated”, and sends out an “activation” to its next layer. This “activation” is mathematically achieved by an activation function:
\[
a_{j}^{(l+1)} = g(z_{j}^{(l+1)})
\]
where \(g()\) is the activation function, which is typically a non-linear function, for instance, a logistic function (a.k.a sigmoid function):
\[
g(z) = \frac{1}{1 + e^{-z}}
\]
Therefore, the processes happening from layer \(l\) to layer \(l+1\) are:
\[
\left\{\begin{matrix}
z_{j}^{(l+1)} = & \sum_{i=1}^{s_l} a_{i}^{(l)} \theta_{j,i}^{(l+1)} + b_{j}^{(l+1)} \\
a_{j}^{(l+1)} = & g(z_{j}^{(l+1)})
\end{matrix}\right.
\]
A similar process is then repeated from layer \(l+1\) to \(l+2\), and so on, till the very last layer in the network. This process is called forward propagation. See the figure below for an illustration.
NOTE that for the 1st layer, \(a^{(0)}\) is just the input data in layer-0. For the last layer, \(a^{(L-1)}\) is the model prediction.
Recall that there are \(s_{l+1} \times s_{l}\) connections between layer \(l\) and \(l+1\), meaning \(s_{l+1} \times s_{l}\) linear operations. All of these can be neatly represented using matrix-vector notation as below:
\[
\begin{bmatrix}
\theta_{1,1}^{(l+1)} & \theta_{1,2}^{(l+1)} & \cdots & \theta_{1,s_l}^{(l+1)} \\
\theta_{2,1}^{(l+1)} & \theta_{2,2}^{(l+1)} & \cdots & \theta_{2,s_l}^{(l+1)} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{s_{l+1},1}^{(l+1)} & \theta_{s_{l+1},2}^{(l+1)} & \cdots & \theta_{s_{l+1},s_l}^{(l+1)}
\end{bmatrix}
\begin{bmatrix}
a_{1}^{(l)} \\ a_{2}^{(l)} \\ \vdots \\ a_{s_l}^{(l)}
\end{bmatrix}
+
\begin{bmatrix}
b_{1}^{(l+1)} \\ b_{2}^{(l+1)} \\ \vdots \\ b_{s_{l+1}}^{(l+1)}
\end{bmatrix}
=
\begin{bmatrix}
z_{1}^{(l+1)} \\ z_{2}^{(l+1)} \\ \vdots \\ z_{s_{l+1}}^{(l+1)}
\end{bmatrix}
\]
Or more compactly:
\[
\mathbf{\Theta}^{(l+1)} \cdot \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} = \mathbf{z}^{(l+1)}
\]
where:
- \(\mathbf{\Theta}^{(l+1)}\) is the weight matrix that maps data from layer \(l\) to layer \(l+1\). It has a dimension of \((s_{l+1} \times s_l)\). Its element \(\theta_{j,i}^{(l+1)}\) is the weight associated with the link from neuron \(i\) in layer \(l\) to neuron \(j\) in layer \(l+1\).
- \(\mathbf{a}^{(l)}\) is the vector of activations from layer \(l\), with dimension \((s_l \times 1)\).
- \(\mathbf{b}^{(l+1)}\) is the vector of biases added to layer \(l+1\), with dimension \((s_{l+1} \times 1)\).
- \(\mathbf{z}^{(l+1)}\) is the weighted sum in layer \(l+1\), with dimension \((s_{l+1} \times 1)\).
The activation process is simply:
\[
\mathbf{a}^{(l+1)} = g(\mathbf{z}^{(l+1)})
\]
where \(\mathbf{a}^{(l+1)}\) is the vector of activations in layer \(l+1\), with dimension \((s_{l+1} \times 1)\).
To summarize this section:
- Information passes from the input layer through hidden layer(s) to the output layer, a process called forward propagation.
- Between 2 consecutive layers \(l\) and \(l+1\), data get passed in a linear operation: \(\mathbf{\Theta}^{(l+1)} \cdot \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} = \mathbf{z}^{(l+1)}\).
- A non-linear activation function is used to transform the weighted sum of a layer to its output: \(\mathbf{a}^{(l+1)} = g(\mathbf{z}^{(l+1)})\).
- \(\mathbf{a}^{(0)}\) is the input data, \(\mathbf{a}^{(L-1)}\) is the model prediction.
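To make these shapes concrete, here is a tiny standalone sketch of one forward step from layer-0 (784 units) to layer-1 (200 units), using randomly filled arrays. This is for illustration only; the actual implementation follows in the numpy part of the post:

import numpy as np

def g(z):
    '''Logistic activation'''
    return 1 / (1 + np.exp(-z))

a0 = np.random.rand(784, 1)                     # activations of layer-0, shape (s0, 1)
theta1 = (np.random.rand(200, 784) - 0.5)*0.01  # small random weights, shape (s1, s0)
b1 = np.zeros((200, 1))                         # biases of layer-1, shape (s1, 1)

z1 = np.dot(theta1, a0) + b1                    # weighted sums of layer-1, shape (s1, 1)
a1 = g(z1)                                      # activations of layer-1, shape (s1, 1)
print(z1.shape, a1.shape)                       # (200, 1) (200, 1)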
NOTE that some materials may use a slightly different formula for the linear operation:
\[
\mathbf{\Theta}^{(l)} \cdot \mathbf{a}^{(l)} + \mathbf{b}^{(l+1)} =
\mathbf{z}^{(l+1)}
\]
This is because \(\mathbf{\Theta}^{(l)}\) is there used to denote the mapping from layer \(l\) to \(l+1\), while we use \(\mathbf{\Theta}^{(l+1)}\) for this. So it is purely a notational difference, and the underlying processes are identical. This notation is chosen because I find it slightly more intuitive to associate a layer with the weights that pass data into that layer. This minor advantage may become more obvious when dealing with convolutional neural networks, where different types of layers are concatenated.
ALSO NOTE that for the multi-class classification problem, the output layer needs to use the softmax activation function, which generalizes the logistic function to multiple classes. The softmax of a vector \(x = \{x_0, x_1, \cdots, x_{n-1}\}\) is defined as:
\[
\sigma(x)_j = \frac{e^{x_j}}{\sum_i e^{x_i}}
\]
Computations happening inside the network – the cost function
The cost function computes the cost \(J\), a scalar value that measures the goodness-of-fit of the model: a smaller cost means the model predictions fit the labels better. There are multiple ways to define a cost for different types of problems. For the multi-class classification problem, a typical choice is the cross-entropy cost function. The cost contributed by a single record is:
\[
J_i = J(\mathbf{\hat{y}}_i, \mathbf{y}_i) = -\mathbf{y}_i \cdot \log(\mathbf{\hat{y}}_i)
\]
The cost of all the training samples is just the average over all \(m\) samples:
\[
J = \frac{1}{m} \sum_i^m J_i = -\frac{1}{m} \sum_i^m [\mathbf{y}_i \cdot \log(\mathbf{\hat{y}}_i)]
\]
For good measure, let’s also add the regularization term:
\[
J = -\frac{1}{m} \sum_i^m [\mathbf{y}_i \cdot \log(\mathbf{\hat{y}}_i)] + \frac{\lambda}{m}
\sum_{l=1}^{L-1} \sum_{j=1}^{s_l} \sum_{i=1}^{s_{l-1}} \left( \theta_{j,i}^{(l)} \right)^2
\]
where \(\lambda\) is the regularization parameter. Note that the nested summations in the above equation are just summing up all (squared) weights in all layers, and biases are not included in the regularization term.
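Because \(\mathbf{y}_i\) is one-hot, the dot product in \(J_i\) picks out a single term: the cost of one record reduces to \(-\log\) of the predicted probability of the true class. A quick numerical check (illustration only; the numbers are made up):

import numpy as np

y = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0])     # label: digit 5
yhat = np.array([0.01, 0.01, 0.02, 0.02, 0.04,
                 0.80, 0.03, 0.03, 0.02, 0.02])  # a hypothetical prediction

j = -np.dot(y, np.log(yhat))
print(j, -np.log(0.8))                           # both are ~0.223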
Computations happening inside the network – the backward propagation
Training of a neural network classifier is a supervised learning process, where the model makes a prediction, which is compared with the true value (a.k.a label), and a correction to the model is made accordingly, if needed. The prediction is achieved using the forward propagation, and the correction is done in a process called back-propagation. This forward-backward propagation process is repeated multiple times until a desired skill level of the model is achieved.
Intuitively, the back-propagation finds the error made by the model by comparing the final prediction \(\mathbf{\hat{y}}\) with the label \(\mathbf{y}\): \(\mathbf{\delta} \equiv \mathbf{\hat{y}} - \mathbf{y}\). This error is then propagated backwards from the output layer to the hidden layer(s).
This backward error propagation is done because we need to compute the gradients of the weights and biases. Recall that training a model is the process of adjusting its parameters so that the predictions fit the expected values better. For a neural network, the parameters are the weight matrices and bias vectors associated with the layers.
The gradients of weights are formally defined as \(\frac{\partial J}{\partial \mathbf{\Theta}}\), and gradients of biases are defined as \(\frac{\partial J}{\partial \mathbf{b}}\).
Next we give (without proof) the 2 formulae describing the back-propagation process. More details of the derivation will be covered in a separate post.
Starting from the output layer. The error of the output layer is simply:
\[
\mathbf{\delta}^{(L-1)} = \mathbf{\hat{y}} - \mathbf{y}
\]
(1)
From layer \(L-1\) backwards, the error that gets propagated from layer \(l\) to \(l-1\) is:
\[
\begin{equation}
\mathbf{\delta}^{(l-1)} = \left( {\mathbf{\Theta}^{(l)}}^T \cdot
\mathbf{\delta}^{(l)} \right) \odot g'(\mathbf{z}^{(l-1)})
\end{equation}
\]
(2)
where:
- \(\mathbf{\delta}^{(l)}\) is the error of layer \(l\). More formally, this is defined as \(\mathbf{\delta}^{(l)} \equiv \frac{\partial J}{\partial \mathbf{z}^{(l)}}\).
- \(\odot\) is the Hadamard product, a.k.a element-wise product.
- \(g'()\) is the derivative of the activation function.
One can give a quick check of the dimensions in each term:
- \(\mathbf{\delta}^{(l-1)}\): same as \(\mathbf{a}^{(l-1)}\): \((s_{l-1} \times 1)\).
- \({\mathbf{\Theta}^{(l)}}^T\): \((s_{l-1} \times s_l)\).
- \(\mathbf{\delta}^{(l)}\): same as \(\mathbf{a}^{(l)}\): \((s_l \times 1)\).
- \(g'(\mathbf{z}^{(l-1)})\): same as \(\mathbf{z}^{(l-1)}\): \((s_{l-1} \times 1)\).
It is trivial to see that the dimensions on both sides match.
The 2nd formula involves the computation of gradients in each layer:
\[
\left\{\begin{matrix}
\frac{\partial J}{\partial \mathbf{\Theta}^{(l)}} = & \mathbf{\delta}^{(l)} \cdot {\mathbf{a}^{(l-1)}}^{T} \\
\frac{\partial J}{\partial \mathbf{b}^{(l)}} = & \mathbf{\delta}^{(l)}
\end{matrix}\right.
\]
(3)
where:
- \(\frac{\partial J}{\partial \mathbf{\Theta}^{(l)}}\) is the gradients of cost function wrt the weights associated with layer \(l\).
- \(\frac{\partial J}{\partial \mathbf{b}^{(l)}}\) is the gradients of cost function wrt the biases associated with layer \(l\).
These gradients can be used to update the parameters using, for instance, a gradient descent scheme.
Notice that \(\mathbf{\Theta}^{(l)}\) is the weight matrix mapping from layer \(l-1\) to \(l\). Therefore, a mnemonic for the above equation is:
the gradient of the weights (\(\partial J/ \partial \Theta\)) is the product of the error coming out of the weights (\(\delta^{(l)}\)) and the input going into the weights (\(a^{(l-1)}\)).
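As a quick sanity check of Equation (3) and the mnemonic, the shapes work out as expected. A standalone sketch using the layer sizes of our network (illustration only):

import numpy as np

delta2 = np.random.rand(100, 1)     # delta^(2), shape (s2 x 1) = (100 x 1)
a1 = np.random.rand(200, 1)         # a^(1), shape (s1 x 1) = (200 x 1)

grad_theta2 = np.dot(delta2, a1.T)  # dJ/dTheta^(2), shape (100 x 200), same as Theta^(2)
grad_b2 = delta2                    # dJ/db^(2), shape (100 x 1), same as b^(2)
print(grad_theta2.shape, grad_b2.shape)   # (100, 200) (100, 1)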
Gradient descent update
We will be using the basic gradient descent update scheme here. There are various variants aimed at improving the convergence process, and some of them will be introduced in a later post.
The formula for updating the layer weights is:
\[
\mathbf{\Theta} := \mathbf{\Theta} - \frac{\alpha}{m} \sum_i^m \frac{\partial J_i}{\partial
\mathbf{\Theta}} - \frac{\alpha \lambda}{m} \mathbf{\Theta}
\]
The formula for updating the layer biases is:
\[
\mathbf{b} := \mathbf{b} - \frac{\alpha}{m} \sum_i^m \frac{\partial J_i}{\partial \mathbf{b}}
\]
where \(\alpha\) is the learning rate, \(\lambda\) the regularization parameter.
Recap on the notations
It might be worthwhile giving a recap on the notations introduced so far:
- \(L\): number of layers in a neural network.
- \(l\): index of a layer. \(l = 0, 1, \cdots, L-1\).
- \(s_l\): number of neurons/units in layer \(l\).
- \(\theta_{j,i}^{(l)}\): weight that maps from neuron \(i\) in layer \(l-1\) to neuron \(j\) in layer \(l\).
- \(b_{j}^{(l)}\): bias added onto neuron \(j\) in layer \(l\).
- \(z_{j}^{(l)}\): weighted sum of neuron \(j\) in layer \(l\).
- \(a_{j}^{(l)}\): activation of neuron \(j\) in layer \(l\).
- \( \mathbf{\Theta}^{(l)} \) : weight matrix that maps from layer \(l-1\) to layer \(l\).
- \(\mathbf{b}^{(l)}\): bias vector added onto layer \(l\).
- \(\mathbf{z}^{(l)}\): weighted sum vector of layer \(l\).
- \(\mathbf{a}^{(l)}\): activation vector of layer \(l\).
- \(g()\): activation function.
- \(\mathbf{\hat{y}}\): vector of model prediction.
- \(\mathbf{y}\): vector of label.
- \(\mathbf{\delta}^{(l)}\): vector of error in layer \(l\).
- \(J\): cost function value.
- \(\odot\): Hadamard product, a.k.a element-wise product.
numpy implementation
This section introduces the Python implementation of the above computations. The results will be used as building blocks of the final neural network.
Softmax function
The softmax function can be easily implemented in numpy as:

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))
Alternatively, there is also a function in scipy: scipy.special.softmax.
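Note that the one-liner above operates on a single 1-d vector. Later on, the forward pass works on 2-d arrays of shape (n, 10) with one record per row, so a row-wise (and numerically more stable) variant is needed; the final script uses a function named softmax_a, presumably such a row-wise version. A minimal sketch of what it could look like:

import numpy as np

def softmax_a(x, axis=-1):
    '''Row-wise softmax, stabilized by subtracting the per-row maximum'''
    x = x - np.max(x, axis=axis, keepdims=True)
    expx = np.exp(x)
    return expx / np.sum(expx, axis=axis, keepdims=True)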
Logistic function and its derivative
def logistic(x):
    return 1 / (1 + np.exp(-x))
Alternatively, there is also a function in scipy: scipy.special.expit.
The derivative of logistic is:
def dlogistic(x):
    '''Derivative of the logistic function'''
    return np.exp(-x) / (1 + np.exp(-x))**2
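An equivalent (and slightly cheaper) formulation uses the identity \(g'(z) = g(z)(1 - g(z))\); a sketch, reusing the logistic() defined above:

def dlogistic2(x):
    '''Derivative of the logistic via g'(z) = g(z)*(1 - g(z))'''
    s = logistic(x)
    return s*(1 - s)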
Cross-entropy function
def crossEntropy(yhat, y):
    '''Cross entropy cost function'''
    eps = 1e-10
    yhat = np.clip(yhat, eps, 1-eps)
    return - np.nansum(y*np.log(yhat))
This is basically a literal translation of the mathematical equation into Python, with the only exception that yhat is clipped to the range [eps, 1-eps], where eps is a small value. The lower bound avoids infinities from the logarithm when yhat = 0, and the upper bound is there just to make the clipping symmetrical.
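A quick usage check of crossEntropy() on a single one-hot label (illustration only; the numbers are made up):

import numpy as np

y = np.array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])      # true digit: 1
yhat = np.array([[0.05, 0.70, 0.05, 0.02, 0.02,
                  0.04, 0.03, 0.03, 0.03, 0.03]])   # a hypothetical prediction

print(crossEntropy(yhat, y))    # ~0.357, i.e. -log(0.7)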
Cost function
The cost consists of 2 parts:
- errors of the model contributed by the training samples. This is computed in a sampleCost() function.
- penalties from the regularization term. This is computed in a regulizationCost() function.
def sampleCost(self, yhat, y):
    '''Cost of a single training sample/batch

    Args:
        yhat (ndarray): prediction in shape (n, m). m is the number of
            final output units, n is the number of records.
        y (ndarray): label in shape (n, m).
    Returns:
        cost (float): summed cost.
    '''
    return self.cost_func(yhat, y)

def regulizationCost(self):
    '''Cost from the regularization term

    Defined as the summed squared weights in all layers, not including
    biases.
    '''
    j = 0
    for ll in range(1, self.n_layers):
        j = j + np.sum(self.thetas[ll]**2)

    return j
Here, self.cost_func is just the crossEntropy() function defined above. The latter is passed in to the constructor of the class, which makes it easy to switch to a different cost function definition should one decide to.
Forward propagation
The forward propagation consists of 2 equations, each corresponding to basically 1 line of code:
z = np.dot(a, thetas.T) + b
a = activation(z)
This should be fairly self-explanatory. However, in practice, some extra house-keeping and structuring are needed. The full function can be implemented as follows:
def feedForward(self, x):
    '''Forward pass

    Args:
        x (ndarray): input data with shape (n, f). f is the number of
            input units, n is the number of records.
    Returns:
        weight_sums (dict): keys: layer indices starting from 0 for the
            input layer to N-1 for the last layer. Values: weighted sums of
            each layer:

                z^{(l+1)} = a^{(l)} \cdot \theta^{(l+1)}^{T} + b^{(l+1)}

            where:
                z^{(l+1)}: weighted sum in layer l+1.
                a^{(l)}: activation in layer l.
                \theta^{(l+1)}: weights that map from layer l to l+1.
                b^{(l+1)}: biases added to layer l+1.

            The value for key=0 is the same as input <x>.
        activations (dict): keys: layer indices starting from 0 for the
            input layer to N-1 for the last layer. Values: activations in
            each layer. See above.
    '''
    x = self.force2D(x)
    activations = {0: x}
    weight_sums = {0: x}
    a1 = x
    for ii in range(1, self.n_layers):
        bii = self.biases[ii]
        zii = np.dot(a1, self.thetas[ii].T) + bii
        if ii == self.n_layers-1:
            # output layer uses a different activation (softmax)
            aii = self.af_last(zii)
        else:
            aii = self.af(zii)
        activations[ii] = aii
        weight_sums[ii] = zii
        a1 = aii

    return weight_sums, activations
Firstly, we are going to assume (and therefore require) that the input data records are arranged in rows. That is to say, the input variable x is an ndarray of shape (m, f), where m is the number of records and f the number of features (in the case of MNIST data, f = 784).
To enforce this rule on a single data record with a shape of (784,), we need to reshape it into 2d, like this:

x = np.atleast_2d(x)

Then x.shape will be (1, 784).
For inputs that are already 2d, e.g. with shape (4, 784), atleast_2d(x) has no effect.
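The force2D() helper called at the top of feedForward() is not listed separately in this post; a minimal sketch, assuming it is simply a thin wrapper around this reshaping step:

def force2D(self, x):
    '''Ensure the input is a 2d array with records arranged in rows'''
    return np.atleast_2d(x)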
Notice that the activations and weighted sums of the layers are used during the back-propagation process (see Equations (2) and (3)). Therefore, it is useful to keep a record of these variables during the forward propagation. They can be stored in a Python dictionary using the layer indices as keys. E.g.
activations = {0: x}
weight_sums = {0: x}
Here, the weighted sum and activation of the 1st layer (indexed 0) are both the input array x itself.
The self argument implies that this is a method of a class, as eventually we are making it into a NeuralNetwork class. The self.biases class attribute is a dictionary storing the layer biases; similarly, self.thetas stores the layer weights. E.g. self.biases[1] is the bias array for layer-1, and self.thetas[4] is the weight matrix for layer-4.
Also notice that the hidden layers use the logistic activation function while the output layer uses softmax. A simple if condition is used to make this distinction.
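The predict() method used later (in stochasticTrain() and in the final script) is also not listed separately; a minimal sketch, assuming it simply runs a forward pass and returns the activations of the output layer:

def predict(self, x):
    '''Model prediction: output layer activations for input <x>'''
    weight_sums, activations = self.feedForward(x)
    return activations[self.n_layers-1]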
Back-propagation
Back-propagation consists of 2 equations as well. Inputs for this process are the label data \(\mathbf{y}\), and the weighted sums \(\mathbf{z}\) and activations \(\mathbf{a}\) from forward propagation. Outputs from back-propagation are the gradients for the weights and biases.
The full function can be implemented as follows:
def feedBackward(self, weight_sums, activations, y):
    '''Backward propagation

    Args:
        weight_sums (dict): keys: layer indices starting from 0 for the
            input layer to N-1 for the last layer. Values: weighted sums of
            each layer.
        activations (dict): keys: layer indices starting from 0 for the
            input layer to N-1 for the last layer. Values: activations in
            each layer.
        y (ndarray): label in shape (n, m). m is the number of final
            output units, n is the number of records.
    Returns:
        grads (dict): keys: layer indices starting from 0 for the input
            layer to N-1 for the last layer. Values: summed gradients for
            the weight matrix in each layer.
        grads_bias (dict): keys: layer indices starting from 0 for the
            input layer to N-1 for the last layer. Values: summed gradients
            for the bias in each layer.
    '''
    grads = {}       # gradients for weight matrices
    grads_bias = {}  # gradients for bias
    y = self.force2D(y)
    delta = activations[self.n_layers-1] - y
    for jj in range(self.n_layers-1, 0, -1):
        grads[jj] = np.einsum('ij,ik->jk', delta, activations[jj-1])
        grads_bias[jj] = np.sum(delta, axis=0, keepdims=True)
        if jj != 1:
            delta = np.dot(delta, self.thetas[jj])*self.daf(weight_sums[jj-1])

    return grads, grads_bias
As before, we enforce a 2d shape for the y data using force2D().
grads and grads_bias are 2 dictionaries storing the resultant gradient arrays, again using layer indices as keys.
The error of the model \(\delta^{(L-1)}\) is simply:
delta = activations[self.n_layers-1] - y
Then we move backwards from self.n_layers-1 to 1 in a for loop.
The 1st line inside the for loop computes the gradients of the weights. I used np.einsum() for this to vectorize the computation.
NOTE that the delta variable corresponds to the \(\mathbf{\delta}^{(l)}\) term in Equation (3), and is of shape \((m \times s_{l})\), where \(m\) is the number of records. activations[jj-1] is the \(\mathbf{a}^{(l-1)}\) term, with shape \((m \times s_{l-1})\). Equation (3) describes the computation for a single record, but here we are dealing with m records simultaneously. This vectorization can enhance the computational efficiency by a considerable amount.
The einsum() function is numpy's Einstein summation function, which uses a convention also known as indicial notation, sometimes seen in mathematical derivations. In this case, it gives the same results as np.dot(delta.T, activations[jj-1]). In my tests, even when the results are identical, einsum() can be 3-4 times faster than the dot() call here. Where such a speedup holds, it can be worthwhile replacing np.dot() and np.sum() calls with einsum().
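A small standalone check of the claimed equivalence between the einsum() call and the dot product form (illustration only):

import numpy as np

m, sl, slm1 = 64, 100, 200                  # records, s_l, s_{l-1}
delta = np.random.rand(m, sl)               # delta^(l), shape (m, s_l)
a_prev = np.random.rand(m, slm1)            # a^(l-1), shape (m, s_{l-1})

g1 = np.einsum('ij,ik->jk', delta, a_prev)  # summed gradients, shape (s_l, s_{l-1})
g2 = np.dot(delta.T, a_prev)
print(np.allclose(g1, g2))                  # True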
The 2nd line computes the summed gradients of the biases, where the summation is done across the m records. We will later divide this, as well as the grads[jj] variable, by m to turn the sums into averages.
Lastly, the error is propagated backwards using

delta = np.dot(delta, self.thetas[jj])*self.daf(weight_sums[jj-1])

where self.daf is the derivative of the activation function, which in this case is the derivative of the logistic.
Gradient descent update
The gradient descent update is done separately for the weights and biases in each layer. The full function can be implemented as follows:
def gradientDescent(self, grads, grads_bias, n):
    '''Perform gradient descent parameter update

    Args:
        grads (dict): keys: layer indices starting from 0 for the input
            layer to N-1 for the last layer. Values: summed gradients for
            the weight matrix in each layer.
        grads_bias (dict): keys: layer indices starting from 0 for the
            input layer to N-1 for the last layer. Values: summed gradients
            for the bias in each layer.
        n (int): number of records contributing to the gradient computation.

    Update rule:

        \theta_i = \theta_i - \alpha (g_i + \lambda \theta_i)

    where:
        \theta_i: weight i.
        \alpha: learning rate.
        g_i: gradient of weight i.
        \lambda: regularization parameter.
    '''
    n = float(n)
    for jj in range(1, self.n_layers):
        theta_jj = self.thetas[jj]
        theta_jj = theta_jj - self.lr*grads[jj]/n - theta_jj*self.lr*self.lam/n
        self.thetas[jj] = theta_jj

        bias_jj = self.biases[jj]
        bias_jj = bias_jj - self.lr*grads_bias[jj]/n
        self.biases[jj] = bias_jj

    return
Note that n is the number of records contributing to the summed gradients in grads and grads_bias.
The averages are computed here, rather than during the gradient computation inside the feedBackward() function. This is primarily because training of a network can be performed in different modes. In stochastic training, in each training iteration the gradients are computed from a single training record and the parameters are updated immediately. For stochastic training, n = 1, so the sum and the average are the same thing.
In batch training, the gradients are computed from all available training records, and n = m.
In mini-batch training, the gradients are computed from a subset of all available training records, and n equals the size of that subset (the batch size), which is generally smaller than m.
Therefore, for the batch and mini-batch training modes, it is easier to accumulate the gradients during the back-propagation process and do the averaging over n later.
These 3 training modes are implemented in the following sections.
Stochastic training
During each epoch of a stochastic training process, these steps are done:
- randomly select 1 record from the training dataset: \((\mathbf{x}_i, \mathbf{y}_i)\). Note that this is random sampling without replacement, meaning that already sampled records will not be sampled again within the epoch.
- make a model prediction using the sampled record by a forward propagation, giving \(\mathbf{\hat{y}}_i\).
- compute the error \(\mathbf{\delta} = \mathbf{\hat{y}}_i - \mathbf{y}_i\), and perform back-propagation to get the gradients \(\partial J/\partial \mathbf{\Theta}\) and \(\partial J/\partial \mathbf{b}\) for all the layers.
- perform gradient descent using the gradients.
- back to step 1, until all training samples have been iterated through.
The full function can be implemented as follows:
def stochasticTrain(self, x, y, epochs):
    '''Stochastic training

    Args:
        x (ndarray): input with shape (n, f). f is the number of input
            units, n is the number of records.
        y (ndarray): input with shape (n, m). m is the number of output
            units, n is the number of records.
        epochs (int): number of epochs to train.
    Returns:
        self.costs (ndarray): overall cost at each epoch.
    '''
    self.costs = []
    m = len(x)
    for ee in range(epochs):
        idxs = np.random.permutation(m)
        for i, ii in enumerate(idxs):
            xii = x[[ii]]
            yii = y[[ii]]
            weight_sums, activations = self.feedForward(xii)
            grads, grads_bias = self.feedBackward(
                weight_sums, activations, yii)
            self.gradientDescent(grads, grads_bias, 1)

        yhat = self.predict(x)
        j = self.sampleCost(yhat, y)
        j2 = self.regulizationCost()
        j = j/m + j2*self.lam/m
        print('# <stochasticTrain>: Epoch = %d, Cost = %f' % (ee, j))

        # annealing
        if len(self.costs) > 1 and j > self.costs[-1]:
            self.lr *= 0.9
            print('# <stochasticTrain>: Annealing learning rate, current lr =', self.lr)

        self.costs.append(j)

    self.costs = np.array(self.costs)

    return self.costs
NOTE that after each epoch, we compute the cost and store it in an array. This costs array is useful for monitoring the training process, for instance by plotting the curve of cost against epochs.
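For instance, assuming x_data and y_data have been loaded as shown later in the post and nn is an already constructed network object, a training run and cost curve plot could look like this (a sketch):

import matplotlib.pyplot as plt

costs = nn.stochasticTrain(x_data, y_data, epochs=20)

fig, ax = plt.subplots()
ax.plot(costs, 'b-o')
ax.set_xlabel('Epochs')
ax.set_ylabel('Cost')
fig.show()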
Batch and mini-batch training
During each epoch of a batch or mini-batch training process, these steps are done:
- randomly select \(m\) records (\(m > 1\)) from the training dataset: \(\{(\mathbf{x}_i, \mathbf{y}_i), \; i = 1, 2, \cdots, m\}\). Note that this is random sampling without replacement, meaning that already sampled records will not be sampled again within the epoch.
- make a model prediction on the sampled records by a forward propagation, giving \(\mathbf{\hat{y}}_i\).
- compute the errors \(\mathbf{\delta}_i = \mathbf{\hat{y}}_i - \mathbf{y}_i\), and perform back-propagation to get the gradients \(\partial J/\partial \mathbf{\Theta}\) and \(\partial J/\partial \mathbf{b}\) for all the layers.
- compute the averaged gradients over all \(m\) samples, and use them to perform gradient descent.
- back to step 1, until all training samples have been iterated through.
Notice that the only difference between a batch training and a mini-batch training is that in batch training, \(m = M\), the number of all available samples, while in mini-batch training, \(m = k\), a term called batch size.
In view of such a similarity, a separate function is created and used as the core of both batch and mini-batch training modes. This function can be implemented as follows:
def _batchTrain(self, x, y):
    '''Training using a single sample or a sample batch

    Args:
        x (ndarray): input with shape (n, f). f is the number of input
            units, n is the number of records.
        y (ndarray): input with shape (n, m). m is the number of output
            units, n is the number of records.
    Returns:
        j (float): summed cost over samples in <x> <y>.

    Slow version, uses the samples one by one.
    '''
    m = len(x)
    grads = dict([(ll, np.zeros_like(self.thetas[ll]))
                  for ll in range(1, self.n_layers)])
    grads_bias = dict([(ll, np.zeros_like(self.biases[ll]))
                       for ll in range(1, self.n_layers)])
    j = 0   # accumulates cost
    idx = np.random.permutation(m)
    for ii in idx:
        xii = x[ii]
        yii = y[ii]
        weight_sums, activations = self.feedForward(xii)
        j = j + self.sampleCost(activations[self.n_layers-1], yii)
        gradii, grads_biasii = self.feedBackward(weight_sums, activations, yii)
        for ll in range(1, self.n_layers):
            grads[ll] = grads[ll] + gradii[ll]
            grads_bias[ll] = grads_bias[ll] + grads_biasii[ll]

    self.gradientDescent(grads, grads_bias, m)

    return j
Notice that for a given batch size m, we iterate one-by-one through the m samples, each time performing a forward pass and a backward pass. The gradients are accumulated across the m samples.
This is a faithful implementation of the algorithm, but not necessarily an efficient one. Recall that when introducing the forward propagation, we already mentioned the benefits of vectorized computations. Similarly, the batch training function above can be vectorized, as follows:
def _batchTrain2(self, x, y):
    '''Training using a single sample or a sample batch

    Args:
        x (ndarray): input with shape (n, f). f is the number of input
            units, n is the number of records.
        y (ndarray): input with shape (n, m). m is the number of output
            units, n is the number of records.
    Returns:
        j (float): summed cost over samples in <x> <y>.

    Vectorized version. About 3x faster than _batchTrain().
    '''
    m = len(x)
    weight_sums, activations = self.feedForward(x)
    grads, grads_bias = self.feedBackward(weight_sums, activations, y)
    self.gradientDescent(grads, grads_bias, m)
    j = self.sampleCost(activations[self.n_layers-1], y)

    return j
NOTE that in the vectorized version, we no longer loop through the selected sample batch. The code looks cleaner, and actually runs faster: my preliminary tests suggest that _batchTrain2() is about 3 times faster than _batchTrain().
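Such a timing comparison can be reproduced along these lines (a rough sketch; the exact speedup will depend on the machine and the batch size, and note that both calls also update the network parameters as a side effect):

import time

x_batch = x_data[:256]
y_batch = y_data[:256]

t0 = time.perf_counter()
nn._batchTrain(x_batch, y_batch)
t1 = time.perf_counter()
nn._batchTrain2(x_batch, y_batch)
t2 = time.perf_counter()

print('loop version:       %.3f s' % (t1 - t0))
print('vectorized version: %.3f s' % (t2 - t1))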
With the core of the batch and mini-batch update ready, let's create wrappers to get the functions for batch training and mini-batch training.
The batch training function can be implemented as:
def batchTrain(self, x, y, epochs):
    '''Training using all samples

    Args:
        x (ndarray): input with shape (n, f). f is the number of input
            units, n is the number of records.
        y (ndarray): input with shape (n, m). m is the number of output
            units, n is the number of records.
        epochs (int): number of epochs to train.
    Returns:
        self.costs (ndarray): overall cost at each epoch.
    '''
    self.costs = []
    m = len(x)
    for ee in range(epochs):
        j = self._batchTrain2(x, y)
        j2 = self.regulizationCost()
        j = j/m + j2*self.lam/m
        print('# <batchTrain>: Epoch = %d, Cost = %f' % (ee, j))

        # annealing
        if len(self.costs) > 1 and j > self.costs[-1]:
            self.lr *= 0.9
            print('# <batchTrain>: Annealing learning rate, current lr =', self.lr)

        self.costs.append(j)

    self.costs = np.array(self.costs)

    return self.costs
Again, after each epoch we compute the cost and store its value. Additionally, if we notice that the cost j increases after an epoch, we shrink the learning rate by a factor of 0.9. This trick is sometimes called annealing: an increasing cost could be the result of the learning rate being too large. There are other ways to dynamically adjust the learning rate during training, but they are not covered in this post.
The mini-batch training function can be implemented as:
def miniBatchTrain(self, x, y, epochs, batch_size):
    '''Training using mini batches

    Args:
        x (ndarray): input with shape (n, f). f is the number of input
            units, n is the number of records.
        y (ndarray): input with shape (n, m). m is the number of output
            units, n is the number of records.
        epochs (int): number of epochs to train.
        batch_size (int): mini-batch size.
    Returns:
        self.costs (ndarray): overall cost at each epoch.
    '''
    self.costs = []
    m = len(x)
    for ee in range(epochs):
        j = 0
        idx = np.random.permutation(m)
        id1 = 0
        while True:
            id2 = id1 + batch_size
            id2 = min(id2, m)
            idxii = idx[id1:id2]
            xii = x[idxii]
            yii = y[idxii]
            j = j + self._batchTrain2(xii, yii)

            id1 += batch_size
            if id1 > m-1:
                break

        j2 = self.regulizationCost()
        j = j/m + j2*self.lam/m
        print('# <miniBatchTrain>: Epoch = %d, Cost = %f' % (ee, j))

        # annealing
        if len(self.costs) > 1 and j > self.costs[-1]:
            self.lr *= 0.9
            print('# <miniBatchTrain>: Annealing learning rate, current lr =', self.lr)

        self.costs.append(j)

    self.costs = np.array(self.costs)

    return self.costs
Different from batch training, we do not pass all the samples to the _batchTrain2() function in one go, but in a number of mini-batches, each with a size of batch_size.
Again, costs after epochs are recorded and annealing of the learning rate is performed.
Save and load the parameters
Complex models can be difficult to train, and it might be desirable to split the entire training process into multiple sessions. At the end of a training session, the current model parameters are saved to disk. Before the next session starts, the previous parameters are loaded and training continues from there.
These saving and loading functions can be implemented as:
def saveWeights(self, outfilename):
    '''Save model parameters to file

    Args:
        outfilename (str): absolute path to file to save model parameters.

    Parameters are saved using numpy.savez(), loaded using numpy.load().
    '''
    print('\n# <saveWeights>: Save network weights to file', outfilename)
    dump = {'lr': self.lr, 'n_nodes': self.n_nodes, 'lam': self.lam}
    for ll in range(1, self.n_layers):
        dump['theta_%d' % ll] = self.thetas[ll]
        dump['bias_%d' % ll] = self.biases[ll]

    np.savez(outfilename, **dump)

    return

def loadWeights(self, abpathin):
    '''Load model parameters from file

    Args:
        abpathin (str): absolute path to file to load model parameters.

    Parameters are saved using numpy.savez(), loaded using numpy.load().
    '''
    print('\n# <loadWeights>: Load network weights from file', abpathin)
    with np.load(abpathin) as npzfile:
        for ll in range(1, self.n_layers):
            thetall = npzfile['theta_%d' % ll]
            biasll = npzfile['bias_%d' % ll]
            self.thetas[ll] = thetall
            self.biases[ll] = biasll
        self.lam = npzfile['lam']
        self.lr = npzfile['lr']

    return
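Usage is straightforward (a sketch):

nn.saveWeights('./weights.npz')    # at the end of a training session

# later, in a new session, after constructing a network with the same architecture:
nn.loadWeights('./weights.npz')    # resume training or run predictions from here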
Others
The above has covered all the key elements of our NeuralNetwork class. There are a few miscellaneous points left.
1. Parameter initialization
It is recommended that for logistic activation functions, the weights of a neural network are initialized using Xavier initialization or normalized Xavier initialization. Specifically, the weights \(\mathbf{\Theta}\) are initialized from a uniform distribution within \([ - \sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}]\), where \(n_{in}\) is the number of inputs to the weights, and \(n_{out}\) the number of outputs. For instance, the weights for the 1st hidden layer in our MNIST classification model are initialized using
\[
\Theta^{(1)} \sim U[- \frac{\sqrt{6}}{\sqrt{ 784 + 200}}, \frac{\sqrt{6}}{\sqrt{ 784 + 200}}]
\]
Such an initialization can be implemented as:
def init(self):
    self.thetas = {}  # theta_l maps from layer l-1 to l
    self.biases = {}  # bias_l is added to layer l when mapping from layer l-1 to l
    for ii in range(1, self.n_layers):
        if self.init_func == 'xavier':
            # uniform in [-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]
            stdii = np.sqrt(6/(self.n_nodes[ii-1]+self.n_nodes[ii]))
            thetaii = (np.random.rand(self.n_nodes[ii], self.n_nodes[ii-1]) - 0.5)*2*stdii
        elif self.init_func == 'he':
            stdii = np.sqrt(2/self.n_nodes[ii-1])
            thetaii = np.random.normal(0, stdii,
                                       size=(self.n_nodes[ii], self.n_nodes[ii-1]))
        self.thetas[ii] = thetaii
        self.biases[ii] = np.zeros([1, self.n_nodes[ii]])
NOTE that also included is an implementation of the He initialization, which is recommended for models using the ReLU activation function.
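The constructor of the class is not listed explicitly in this post. Based on the attributes used throughout (self.n_nodes, self.n_layers, self.lr, self.lam, self.af, self.af_last, self.daf, self.cost_func, self.init_func), a minimal sketch of what it might look like is given below; the exact signature in the repo may differ:

class neuralNetwork(object):
    def __init__(self, n_inputs, n_hidden, n_outputs, learning_rate,
                 af=logistic, daf=dlogistic, af_last=softmax,
                 cost_func=crossEntropy, lam=0.01, init_func='xavier'):
        '''Store hyper-parameters and initialize the layer parameters'''
        self.n_nodes = [n_inputs, ] + list(n_hidden) + [n_outputs, ]
        self.n_layers = len(self.n_nodes)    # L = 5 for our MNIST network
        self.lr = learning_rate              # learning rate alpha
        self.lam = lam                       # regularization parameter lambda
        self.af = af                         # hidden layer activation
        self.daf = daf                       # derivative of hidden layer activation
        self.af_last = af_last               # output layer activation (softmax)
        self.cost_func = cost_func           # e.g. the crossEntropy() defined earlier
        self.init_func = init_func           # 'xavier' or 'he'
        self.init()                          # create self.thetas and self.biases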
2. Visualize the network
We give without further explanations a function to plot out the neurons in a network, using the networkx module.
def plotNN(self):
    '''Plot structure of network'''
    import networkx as nx

    self.graph = nx.DiGraph()
    show_nodes_half = 3
    layered_nodes = []
    for ll in range(self.n_layers-1):
        # layer l
        nodesll = list(zip([ll, ]*self.n_nodes[ll], range(self.n_nodes[ll])))
        # layer l+1
        nodesll2 = list(zip([ll+1, ]*self.n_nodes[ll+1], range(self.n_nodes[ll+1])))
        # omit if too many
        if len(nodesll) > show_nodes_half*3:
            nodesll = nodesll[:show_nodes_half] + nodesll[-show_nodes_half:]
        if len(nodesll2) > show_nodes_half*3:
            nodesll2 = nodesll2[:show_nodes_half] + nodesll2[-show_nodes_half:]
        # build network
        edges = [(a, b) for a in nodesll for b in nodesll2]
        self.graph.add_edges_from(edges)
        layered_nodes.append(nodesll)
        if ll == self.n_layers-2:
            layered_nodes.append(nodesll2)

    # -----------------Adjust positions-----------------
    pos = {}
    for ll, nodesll in enumerate(layered_nodes):
        posll = np.array(nodesll).astype('float')
        yll = posll[:, 1]
        if self.n_nodes[ll] > show_nodes_half*3:
            bottom = yll[:show_nodes_half]
            top = yll[-show_nodes_half:]
            bottom = bottom/np.ptp(bottom)/3.
            top = (top-np.min(top))/np.ptp(top)/3. + 2/3.
            yll = np.r_[bottom, top]
        else:
            yll = yll/np.ptp(yll)
        posll[:, 1] = yll
        pos.update(dict(zip(nodesll, posll)))

    # -------------------Draw figure-------------------
    fig = plt.figure()
    ax = fig.add_subplot(111)
    nx.draw(self.graph, pos=pos, ax=ax, with_labels=True)
    plt.show(block=False)

    return
The figure below shows the visualization of our MNIST classifier network. Note that when a layer has too many neurons, only a small subset is drawn.
Complete class definition
The complete code for our NeuralNetwork class can be found in the classifier/multiclassnn.py file in this Github repo.
Application on MNIST data
The MNIST data in csv format are obtained from: http://pjreddie.com/projects/mnist-in-csv/. The training set consists of 60,000 records, and the test set of 10,000. The function below reads in these data and formats them into ndarrays.
def readData(path):
    '''Read csv formatted MNIST data

    Args:
        path (str): absolute path to input csv data file.
    Returns:
        x_data (ndarray): mnist input array in shape (n, 784), n is the
            number of records, 784 is the number of (flattened) pixels
            (28 x 28) in each record.
        y_data (ndarray): mnist label array in shape (n, 10), n is the
            number of records. Each label is a 10-element binary array
            where the only 1 in the array denotes the label of the record.
    '''
    data_file = pd.read_csv(path, header=None)
    x_data = []
    y_data = []
    for ii in range(len(data_file)):
        lineii = data_file.iloc[ii, :]
        imgii = np.array(lineii.iloc[1:])/255.
        labelii = np.zeros(10)
        labelii[int(lineii.iloc[0])] = 1
        y_data.append(labelii)
        x_data.append(imgii)

    x_data = np.array(x_data)
    y_data = np.array(y_data)

    return x_data, y_data
NOTE that pandas is used for the csv file reading, but it is not essential; the plain csv module (or numpy itself) can manage just as well.
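If you prefer to drop the pandas dependency, the same file can be read with numpy alone; a sketch, assuming the same csv layout of one label column followed by 784 pixel columns:

import numpy as np

def readData2(path):
    '''Read csv formatted MNIST data without pandas'''
    raw = np.loadtxt(path, delimiter=',')        # shape (n, 785)
    x_data = raw[:, 1:]/255.                     # scale pixel values to [0, 1]
    labels = raw[:, 0].astype(int)
    y_data = np.zeros((len(labels), 10))
    y_data[np.arange(len(labels)), labels] = 1   # one-hot encode the labels
    return x_data, y_data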
The script below reads in the training data, builds the neural network, and trains the model in mini-batch mode for 50 epochs with a batch size of 64. Then the cost curve is plotted against epochs, and the model is tested against the test dataset.
Finally, the prediction accuracies for the training and test datasets are printed out.
Code here:
N_INPUTS = 784              # input number of units
N_HIDDEN = [200, 100, 50]   # hidden layer sizes
N_OUTPUTS = 10              # output number of units
LEARNING_RATE = 0.1         # initial learning rate
LAMBDA = 0.01               # regularization parameter
EPOCHS = 50                 # training epochs

# -----------------Open MNIST data-----------------
x_data, y_data = readData(TRAIN_DATA_FILE)
print('x_data.shape:', x_data.shape)
print('y_data.shape:', y_data.shape)

# ------------------Create network------------------
nn = neuralNetwork(N_INPUTS, N_HIDDEN, N_OUTPUTS, LEARNING_RATE,
                   af_last=softmax_a, lam=LAMBDA)

# -----------------Mini-batch train-----------------
costs = nn.miniBatchTrain(x_data, y_data, EPOCHS, 64)

# ---------------Plot the cost curve---------------
fig, ax = plt.subplots()
ax.plot(costs, 'b-o')
ax.set_xlabel('epochs')
ax.set_ylabel('Cost')
fig.show()

nn.saveWeights('./weights.npz')

# ---------------Validate on train set---------------
yhat_train = np.argmax(nn.predict(x_data), axis=1)
y_true_train = np.argmax(y_data, axis=1)
n_correct_train = np.sum(yhat_train == y_true_train)

print('Training samples:')
n_test_train = 20
for ii in np.random.randint(0, len(y_data)-1, n_test_train):
    yhatii = yhat_train[ii]
    ytrueii = y_true_train[ii]
    print('yhat = %d, yii = %d' % (yhatii, ytrueii))

# ------------------Open test data------------------
x_data_test, y_data_test = readData(TEST_DATA_FILE)

# ------------------Test on test set------------------
yhat_test = np.argmax(nn.predict(x_data_test), axis=1)
y_true_test = np.argmax(y_data_test, axis=1)
n_correct_test = np.sum(yhat_test == y_true_test)

print('Test samples:')
n_test = 20
for ii in np.random.randint(0, len(y_data_test)-1, n_test):
    yhatii = yhat_test[ii]
    ytrueii = y_true_test[ii]
    print('yhat = %d, yii = %d' % (yhatii, ytrueii))

print('Accuracy in training set:', n_correct_train/float(len(x_data))*100.)
print('Accuracy in test set:', n_correct_test/float(len(x_data_test))*100.)
The figure below shows the changes of cost with epochs:
After 50 epochs of training, the accuracy on training set is: 98.41 %. Accuracy on test set is: 96.71 %. Not too shabby.
Let’s also create a few plots showing the prediction results:
nrows = 3
ncols = 4
n_plots = nrows*ncols
fig = plt.figure(figsize=(12, 10))

for i in range(n_plots):
    axi = fig.add_subplot(nrows, ncols, i+1)
    axi.axis('off')
    xi = x_data_test[i]
    yi = y_data_test[i]
    yhati = yhat_test[i]
    plotResult(xi, yi, yhati, ax=axi)

fig.show()
It seems that at least for the first 12 test cases, our model gets them all correct. See the figure below.
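The plotResult() helper used above is not listed in this post; a minimal sketch of what such a function could do (reshape the flattened pixels back into a 28 x 28 image and put the true and predicted digits in the title):

import numpy as np
import matplotlib.pyplot as plt

def plotResult(x, y, yhat, ax=None):
    '''Show one test image with its true and predicted labels'''
    if ax is None:
        ax = plt.gca()
    ax.imshow(x.reshape(28, 28), cmap='gray_r')
    ax.set_title('true = %d, predicted = %d' % (int(np.argmax(y)), int(yhat)))
    return ax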
Lastly, the training process cost me 345 seconds, about 6 minutes. I was using a fairly beefy 8-core AMD Ryzen 3700X desktop CPU. On a laptop it would take a bit longer. But the performance is pretty good considering that the entire thing uses nothing but numpy.
Wrap up
In this post, we introduced a multi-class classification neural network built from scratch. The model is trained using forward and backward propagation. A softmax function is used as the output layer activation, and the hidden layers use the logistic function. Vectorization allows our simple numpy implementation to achieve fairly good performance.
Many of the concepts and notations are common among different types of networks. In a future post, I will share some knowledge and experience in building a convolutional neural network using numpy.
Support me
If you like the contents, please consider sending a small tip. Below is my monero wallet address:
41tR6ku6uiDA41awW1UgVD9DoJ1PxYQsMa1iQjM68ekHGTee5wgEfgLDXh7QZ3cZZnRnVuC5nGeY1Na28ZJygrHx3JeKBV7