This is the beginning of my neural networks learning. I have been reading Michael Nielsen's book for a long time, and I think it is now time to work through the examples in the first chapter of his book.
Introduction to some functions
I first want to show the use of some functions in his example program. These usages really surprised me.
numpy.random
The first functions come from the random module of the numpy package. They are used to initialize the weight matrices.
random.randn
Return a sample (or samples) from the "standard normal" distribution.
If positive, int-like or int-convertible arguments are provided, randn generates an array of shape (d0, d1, …, dn), filled with random floats sampled from a univariate "normal" (Gaussian) distribution of mean 0 and variance 1 (if any of the d_i are floats, they are first converted to integers by truncation). A single float randomly sampled from the distribution is returned if no argument is provided.
import numpy as np
print(np.random.randn(1))
[0.34894831]
print(np.random.randn(3))
[0.34444932 0.12172097 1.14900238]
print(np.random.randn(2,3))
[[ 0.49635216 0.22762119 0.68270641]
 [2.13526944 0.82040908 0.79356388]]
random.shuffle
Modify a sequence in-place by shuffling its contents.
This function only shuffles the array along the first axis of a multi-dimensional array. The order of sub-arrays is changed but their contents remain the same.
A = np.random.randn(4,4)
print(A)
[[ 0.89532715 2.34406351 0.47233016 0.1943856 ]
 [ 0.57509425 0.84810353 1.11576561 1.33146725]
 [ 0.81883264 2.25208295 1.52527099 1.30444846]
 [ 1.94464225 0.29825984 0.16625868 0.35876162]]
np.random.shuffle(A)
print(A)
[[ 1.94464225 0.29825984 0.16625868 0.35876162]
 [ 0.57509425 0.84810353 1.11576561 1.33146725]
 [ 0.81883264 2.25208295 1.52527099 1.30444846]
 [ 0.89532715 2.34406351 0.47233016 0.1943856 ]]
zip()
Python's zip() function creates an iterator that aggregates elements from two or more iterables. The resulting iterator can be used to quickly and consistently solve common programming problems, like creating dictionaries.
A = ['1','2','3']
B = ['A','B','C']
C = [1,2,3]
ABC = zip(A,B,C)
print(ABC)
<zip object at 0x000001D6AB168A08>
type(ABC)
zip
list(ABC)
[('1', 'A', 1), ('2', 'B', 2), ('3', 'C', 3)]
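As a small aside on the dictionary use case mentioned above, zip() pairs keys with values so that dict() can consume the pairs directly (a minimal sketch of my own, with made-up names):

keys = ['a', 'b', 'c']
values = [1, 2, 3]
mapping = dict(zip(keys, values))  # each key is paired with its value
print(mapping)
{'a': 1, 'b': 2, 'c': 3}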
Matrix use
A = np.array([1,2,3,4,5,6,7])
for l in A[1:]:
    print(l)
2
3
4
5
6
7
for l in A[:-1]:
    print(l)
1
2
3
4
5
6
sizes = [2,3,4]
W = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
print(W)
[array([[0.53071848, 0.26905161],
[0.75696575, 0.57292324],
[1.47093334, 0.060232 ]]), array([[ 1.03193319, 0.58177683, 0.78046451],
[ 0.14132843, 0.90416154, 0.12645047],
[ 1.90204955, 0.55866015, 0.39481778],
[0.11897701, 1.1277029 , 0.7584795 ]])]
My understanding of neural networks
How we map the input to the output
Previously, I learned the use of some special functions; now it is time to give a summary of my understanding. I will not list everything, since Michael Nielsen gives wonderful descriptions.
In my view, the problem is that we have an input, which is usually a one-dimensional array, and the output is also an array. What we need to do is to map the input to the output correctly.
In real life we can describe and measure the world in different ways: the color, the sound, the taste, etc. However,
Anything is a number.
The properties of the real world can all be mapped into a number space, and what happens in the real world can be described by numbers and operations on numbers. For example, we use the coordinates (x, y, z) to describe the position of an object.
We create a neural network with many layers. From the mathematical point of view, the input data $V_{in}$ is processed through the different layers with a matrix multiplication followed by a sigmoid transformation:

$$V^{i+1} = \sigma\left(\boldsymbol{W}^{i} V^{i} + \boldsymbol{b}^{i}\right)$$

where $V^{i+1}$ is the value vector in layer $i+1$, $\boldsymbol{W}^{i}$ is the weight matrix between layer $i$ and layer $i+1$, and $\boldsymbol{b}^{i}$ is the bias vector. The value vectors in different layers are linked by this transformation, and this is how we obtain the output from the input.
In summary, having a neural network means having a series of weight matrices and bias vectors. Different weights, biases, and numbers of layers give different neural networks. Our neural network is actually just a series of matrices and vectors.
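To make this concrete, here is a minimal numpy sketch of a single layer transformation (the shapes and variable names are my own illustration, not Nielsen's code):

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

W = np.random.randn(3, 2)           # weight matrix from a 2-neuron layer to a 3-neuron layer
b = np.random.randn(3, 1)           # bias vector of the 3-neuron layer
v = np.random.randn(2, 1)           # value vector of the current layer
v_next = sigmoid(np.dot(W, v) + b)  # value vector of the next layer
print(v_next.shape)                 # (3, 1)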
How to measure the quality of the mapping?
To measure the quality of the mapping, we should compare the output data from our neural network with the actual data. For example, we can define the following cost function:

$$C(w,b) \equiv \frac{1}{2n}\sum_{x}\|y(x)-a\|^{2}$$

where $w$ and $b$ denote all the weights and biases, $n$ is the number of training inputs, $y(x)$ is the desired output, and $a$ is the output of the network when $x$ is the input. If the difference between the output and the actual value is smaller, it means our neural network works better.
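As an illustrative sketch of this cost (my own code, assuming the outputs and desired values are stored as numpy column vectors), the quadratic cost averaged over the training inputs could be computed like this:

import numpy as np

def quadratic_cost(outputs, desired):
    """0.5*||y(x)-a||^2 averaged over all training inputs."""
    n = len(outputs)
    return sum(0.5*np.linalg.norm(y - a)**2 for a, y in zip(outputs, desired))/n

outputs = [np.array([[0.2], [0.9]]), np.array([[0.8], [0.1]])]  # network outputs a
desired = [np.array([[0.0], [1.0]]), np.array([[1.0], [0.0]])]  # actual values y(x)
print(quadratic_cost(outputs, desired))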
How to train our neural networks?
A very important step of deep learning is to train our neural networks. Training means changing the weights and bias vectors so that the output gets closer to the actual result.
To realize this, we make small modifications after each learning step. In machine learning, gradient descent (going downhill) is used to optimize the parameters, and what we do here is just the same. The difference is that the effect of the weights and biases on the output is more complex: we need to choose a direction in an abstract parameter space that makes the cost function decrease, just like walking downhill.
So the partial derivatives will be calculated, and some tricks are used so that the changes of the weights and biases always make the cost function decrease.
But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

$$\Delta v = -\eta \nabla C$$

where $\eta$ is a small, positive parameter (known as the learning rate). Then $\Delta C \approx \nabla C \cdot \Delta v = -\eta \|\nabla C\|^{2} < 0$,

and the vector should be updated like this:

$$v \rightarrow v' = v - \eta \nabla C$$
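A tiny sketch of this update rule (my own toy example, not from the book): repeatedly stepping against the gradient of the simple cost $C(v)=v_1^{2}+v_2^{2}$ drives $v$ toward the minimum.

import numpy as np

def grad_C(v):
    return 2*v                 # gradient of C(v) = v1^2 + v2^2

v = np.array([3.0, -4.0])      # starting point
eta = 0.1                      # learning rate
for _ in range(100):
    v = v - eta*grad_C(v)      # v -> v' = v - eta*grad C
print(v)                       # close to the minimum at (0, 0)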
However, for a neural network, further derivation must be done to calculate the derivatives with respect to the weights and biases in each layer. Nielsen gives a detailed explanation and proof.
Back Propagation Method
Here is a simple summary:
The backpropagation equations provide us with a way of computing the gradient of the cost function. Let’s explicitly write this out in the form of an algorithm:
1. Input $x$: Set the corresponding activation $a^{1}$ for the input layer.
2. Feedforward: For each $l=2,3,\ldots,L$ compute $z^{l}=w^{l}a^{l-1}+b^{l}$ and $a^{l}=\sigma(z^{l})$.
3. Output error $\delta^{L}$: Compute the vector $\delta^{L}=\nabla_{a}C\odot \sigma^{\prime}(z^{L})$.
4. Backpropagate the error: For each $l=L-1,L-2,\ldots,2$ compute $\delta^{l}=((w^{l+1})^{T}\delta^{l+1})\odot \sigma^{\prime}(z^{l})$.
5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial b_{j}^{l}}=\delta_{j}^{l}$ and $\frac{\partial C}{\partial w_{jk}^{l}}=a_{k}^{l-1}\delta_{j}^{l}$.
Explanation of the program
Now I will focus on the program and give my own understanding of the functions, following the order of the program. The first step is to import the necessary packages.
1  """ 
2  network.py 
3  ~~~~~~~~~~ 
4 

5  A module to implement the stochastic gradient descent learning 
6  algorithm for a feedforward neural network. Gradients are calculated 
7  using backpropagation. Note that I have focused on making the code 
8  simple, easily readable, and easily modifiable. It is not optimized, 
9  and omits many desirable features. 
10  """ 
11  
12  #### Libraries 
13  # Standard library 
14  import random 
15  
16  # Thirdparty libraries 
17  import numpy as np 
Then the sigmoid function and the derivative of the sigmoid function are defined.
#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
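As a quick sanity check of sigmoid_prime (my own snippet, not part of network.py), the analytic derivative can be compared against a central finite difference:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z)*(1-sigmoid(z))

z, h = 0.5, 1e-6
numeric = (sigmoid(z+h)-sigmoid(z-h))/(2*h)  # central difference approximation
print(numeric, sigmoid_prime(z))             # the two values agree closely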
Then a class named Network is defined. In this class, the __init__ function is as follows:
def __init__(self, sizes):
    """The list ``sizes`` contains the number of neurons in the
    respective layers of the network. For example, if the list
    was [2, 3, 1] then it would be a three-layer network, with the
    first layer containing 2 neurons, the second layer 3 neurons,
    and the third layer 1 neuron. The biases and weights for the
    network are initialized randomly, using a Gaussian
    distribution with mean 0, and variance 1. Note that the first
    layer is assumed to be an input layer, and by convention we
    won't set any biases for those neurons, since biases are only
    ever used in computing the outputs from later layers."""
    self.num_layers = len(sizes)
    self.sizes = sizes
    self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
    self.weights = [np.random.randn(y, x)
                    for x, y in zip(sizes[:-1], sizes[1:])]
Be careful with the use of sizes[:-1] and sizes[1:], which mean the list with the last element removed and the list with the first element removed, respectively. The use of zip is also new to me, and self.weights is constructed as a series of matrices with different dimensions.
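For example (a small check of my own), for sizes = [2, 3, 1] the construction above produces these shapes:

import numpy as np

sizes = [2, 3, 1]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
print([b.shape for b in biases])   # [(3, 1), (1, 1)]
print([w.shape for w in weights])  # [(3, 2), (1, 3)]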
The feedforward function updates $a$ from layer $l$ to layer $l+1$:
def feedforward(self, a):
    """Return the output of the network if ``a`` is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a
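A minimal usage sketch (my own, assuming network.py from the repository is importable): feed a column vector through a freshly initialized network.

import numpy as np
import network

net = network.Network([2, 3, 1])  # 2 inputs, 3 hidden neurons, 1 output
a_in = np.random.randn(2, 1)      # the input must be a (2, 1) column vector
a_out = net.feedforward(a_in)
print(a_out.shape)                # (1, 1)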
Here is the main SGD function:
def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Train the neural network using mini-batch stochastic
    gradient descent. The ``training_data`` is a list of tuples
    ``(x, y)`` representing the training inputs and the desired
    outputs. The other non-optional parameters are
    self-explanatory. If ``test_data`` is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out. This is useful for
    tracking progress, but slows things down substantially."""
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)
This is the main training function. Test data will be used if provided. We give it the training data, which is loaded using a predefined loader function:
import mnist_loader
training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()
The SGD function divides the training into epochs and mini-batches, and shows the progress and quality of the neural network if we give it the test data. The function update_mini_batch updates the weights and biases for a given mini-batch of training data. It is defined as follows:
def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]
The accumulators for the partial derivatives with respect to the weights and biases are defined first. Then the partial derivatives are calculated using the function backprop, and the new weights and biases are updated. The most important part is the function backprop:
def backprop(self, x, y):
    """Return a tuple ``(nabla_b, nabla_w)`` representing the
    gradient for the cost function C_x. ``nabla_b`` and
    ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
    to ``self.biases`` and ``self.weights``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # feedforward
    activation = x
    activations = [x] # list to store all the activations, layer by layer
    zs = [] # list to store all the z vectors, layer by layer
    for b, w in zip(self.biases, self.weights):
        z = np.dot(w, activation)+b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # backward pass
    delta = self.cost_derivative(activations[-1], y) * \
        sigmoid_prime(zs[-1])
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    # Note that the variable l in the loop below is used a little
    # differently to the notation in Chapter 2 of the book. Here,
    # l = 1 means the last layer of neurons, l = 2 is the
    # second-last layer, and so on. It's a renumbering of the
    # scheme in the book, used here to take advantage of the fact
    # that Python can use negative indices in lists.
    for l in xrange(2, self.num_layers):
        z = zs[-l]
        sp = sigmoid_prime(z)
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return (nabla_b, nabla_w)
This function is a realization of the back propagation method explained above. And finally, there are the two other functions,
def evaluate(self, test_data):
    """Return the number of test inputs for which the neural
    network outputs the correct result. Note that the neural
    network's output is assumed to be the index of whichever
    neuron in the final layer has the highest activation."""
    test_results = [(np.argmax(self.feedforward(x)), y)
                    for (x, y) in test_data]
    return sum(int(x == y) for (x, y) in test_results)

def cost_derivative(self, output_activations, y):
    """Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""
    return (output_activations-y)
which are very easy to understand.
How to use?
This is really a good example of deep learning with neural networks. To use it, you can directly download Michael Nielsen's example. However, he wrote it in Python 2; to use Python 3, you can use the port by MichalDanielDobrzanski. After downloading the repository, the file network.py is just like what we have shown above. The following shows the use of the program:
C:\Users\xiail\Documents\Dropbox\Code\Python\Study\NeuralNetworks\Study1\DeepLearningPython35 (master > origin)
λ python
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
Epoch 0 : 8254 / 10000
Epoch 1 : 8367 / 10000
Epoch 2 : 8449 / 10000
Epoch 3 : 8483 / 10000
Epoch 4 : 8517 / 10000
Epoch 5 : 8533 / 10000
Epoch 6 : 8538 / 10000
Epoch 7 : 8541 / 10000
Epoch 8 : 9448 / 10000
Epoch 9 : 9450 / 10000
Epoch 10 : 9446 / 10000
Epoch 11 : 9475 / 10000
Epoch 12 : 9456 / 10000
Epoch 13 : 9473 / 10000
Epoch 14 : 9447 / 10000
Epoch 15 : 9483 / 10000
Epoch 16 : 9501 / 10000
Epoch 17 : 9501 / 10000
Epoch 18 : 9502 / 10000
Epoch 19 : 9501 / 10000
Epoch 20 : 9485 / 10000
Epoch 21 : 9491 / 10000
Epoch 22 : 9519 / 10000
Epoch 23 : 9499 / 10000
Epoch 24 : 9530 / 10000
Epoch 25 : 9504 / 10000
Epoch 26 : 9502 / 10000
Epoch 27 : 9521 / 10000
Epoch 28 : 9506 / 10000
Epoch 29 : 9498 / 10000
>>>