Gradient descent and backpropagation
Gradient descent combines two concepts:
- Gradient: the derivative, which measures the slope (whether it goes up or down, and how steeply)
- Descent: reducing the error between the present result, which depends on the parameters (weights and biases), and the target values in the training dataset
There are several ways to measure whether you are going up or down a slope. Derivatives are the most commonly used mathematical tool for this. Let us say you have 15 steps to go from one floor of a building down to the next floor. At each step, you feel you are going down.
The slope or derivative is the function describing you going down those stairs:
- S = slope of you going down the stairs
- dy = the change in your position once you have made a step (up, down, or staying on a step)
- dx = the size of your steps (one at a time, two at a time, and so on)
- f(x) = the function describing you going downstairs, for example from step 4 to step 3, step by step.
- The derivative or slope is thus:
$$S = \frac{dy}{dx}$$
f(x) is the function describing your movement (going down, going up, or stopping on a step). For example, when you take one step, you go from step 4 to step 3 (down). Thus f(x) = x - b, in which b = 1 in this example: each pace subtracts b from your current position, so your position decreases as you walk. h is the number of steps you cover at each pace, for example, one step down if you are not in a hurry and two steps down if you are. In this example, you are going down one step at a time from step 4 to step 3; thus h = 1.
We obtain the following formula:
$$S = \frac{dy}{dx} = \frac{f(4) - 4}{h} = \frac{3 - 4}{1} = -1$$
This means that we are at step 3 after taking one step. We started at step 4, so our position changed by -1 step. The minus sign means you are going downstairs.
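The staircase arithmetic can be sketched in a few lines of Python. The function name `slope` and its parameter names are illustrative assumptions, not something defined in the text; the sketch simply divides the change in position (dy) by the size of the pace (dx, or h).

```python
def slope(position_before, position_after, step_size):
    """Return S = dy / dx for one pace on the stairs (illustrative helper)."""
    dy = position_after - position_before  # change in position; negative when going down
    dx = step_size                         # size of the pace, h in the text
    return dy / dx


# Going from step 4 to step 3, one step at a time (h = 1).
S = slope(position_before=4, position_after=3, step_size=1)
print(S)  # -1.0
```

A negative result means the walker is going down; a positive result would mean going back up.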
The gradient gives the direction the function is taking. In this case, it is -1 step. If you go down two steps at a time, each pace changes your position by -2. A straightforward approach is to take the derivative and read its sign. If it is 0, we are not moving. If it is negative, we are going down (the cost is decreasing, so less training remains).
If the slope is positive, we are in trouble: we are going up and increasing the cost of the function (more training remains).
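Reading the sign of the slope can be illustrated on a short sequence of cost measurements. The cost values and the helper name `describe_slope` are hypothetical, chosen only to show the three cases (negative, zero, positive); here the slope is taken as the change in cost from one measurement to the next.

```python
def describe_slope(previous_cost, current_cost):
    """Classify the change in cost between two successive measurements."""
    slope = current_cost - previous_cost
    if slope < 0:
        return "down"  # cost is decreasing: training is going well
    if slope > 0:
        return "up"    # cost is increasing: we are in trouble
    return "flat"      # no change: we are not moving


# Hypothetical cost values measured after successive iterations.
costs = [0.30, 0.25, 0.25, 0.28]
trend = [describe_slope(a, b) for a, b in zip(costs, costs[1:])]
print(trend)  # ['down', 'flat', 'up']
```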
The goal when training an FNN is for the error to converge to 0. As long as the parameters (weights and biases) are not optimized, the output is far from the expected target. In this case, for example, there are four input patterns to train (1-1, 0-0, 1-0, 0-1). If only two results are correct, the present situation is negative: 2 - 4 = -2. When three results are correct, the value is 3 - 4 = -1. Descent means that 0 is the target, reached when all four outputs out of four are correct. In practice, this count is replaced by a cost computed from the outputs, and gradient descent works with the derivative of that cost.
This arithmetic works for an example without complicated calculations. TensorFlow, however, provides functions for all situations to calculate the cost (cost function), see whether the training session is going well (down), and optimize the weights so that the cost diminishes (gradient descent). The following GradientDescentOptimizer will optimize the training of the weights.
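# Cost: mean of the squared differences between the targets (y_) and the network output
cost = tf.reduce_mean(tf.square(y_ - Output))
# One training step of plain gradient descent with a learning rate of 0.10
train_step = tf.train.GradientDescentOptimizer(0.10).minimize(cost)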
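To connect the counting arithmetic (2 - 4 = -2, 3 - 4 = -1) with the squared-error cost used in the snippet, here is a small sketch in plain Python. The target and output values, the 0.5 threshold, and the function names are all assumptions made for illustration; they are not taken from the model in the text.

```python
def counting_score(targets, outputs, threshold=0.5):
    """Number of correct results minus the number of patterns (0 is the goal)."""
    correct = sum(
        1 for t, o in zip(targets, outputs)
        if (o >= threshold) == (t >= threshold)
    )
    return correct - len(targets)


def mean_squared_error(targets, outputs):
    """Mean of the squared differences, the same quantity as the cost line."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Hypothetical targets and outputs for four training patterns.
targets = [0.0, 1.0, 1.0, 0.0]
outputs = [0.1, 0.8, 0.4, 0.6]  # the first two are on the right side of 0.5

print(counting_score(targets, outputs))  # -2
print(mean_squared_error(targets, outputs))
```

Both measures reach 0 when every output matches its target, but the squared error changes smoothly with the outputs, which is what makes a derivative available to the optimizer.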
A common learning rate for the gradient descent optimizer is 0.01. However, in this model, it can be raised to 0.10 to speed up training.
The GradientDescentOptimizer adjusts the weights of the network in the direction that decreases the cost, so that the cost keeps following a decreasing slope.
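The update rule behind plain gradient descent can be sketched without TensorFlow. This is not the library's implementation, only a minimal illustration under stated assumptions: a hypothetical one-weight model (output = w * x), made-up data, a hand-derived gradient of the mean squared error, and the learning rate of 0.10 mentioned above.

```python
def mse_cost(w, xs, ys):
    """Mean squared error of the one-weight model output = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.10, iterations=20, w=0.0):
    """Repeatedly move w against the gradient; record the cost before each step."""
    costs = []
    for _ in range(iterations):
        costs.append(mse_cost(w, xs, ys))
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w, costs


# Hypothetical data generated by y = 2 * x, so the best weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, costs = gradient_descent(xs, ys)
print(round(w, 4), costs[0], costs[1])
```

Each step subtracts the learning rate times the gradient, so a negative gradient pushes the weight up and a positive one pushes it down; in both cases the cost moves downhill as long as the learning rate is small enough for the problem.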
In this TensorFlow example, the output values (squashed with the logistic function) are compared with the targets: the differences are squared and then averaged to give the cost. The inbuilt gradient descent optimizer uses this information to minimize the cost with a small training step (0.10 here). This means that the weights are changed slightly at each iteration. Each iteration involves backpropagation: the error measured at the output is propagated back through the network to work out how each weight should change. By running the FNN, measuring, going back, and running it again, we try many combinations of weights, hopefully moving in the right direction (down the slope), to optimize the network.
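The run, measure, go back, and adjust loop can be sketched on a deliberately tiny network. Everything here is an assumption made for illustration: a chain of two scalar weights (w1, w2) with logistic activations, a single training pair, a squared-error cost, and hand-written chain-rule derivatives. It is not the network from the text.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Run the tiny network forward: input -> hidden -> output."""
    h = sigmoid(w1 * x)
    out = sigmoid(w2 * h)
    return h, out


def backward(x, target, w1, w2):
    """Propagate the error back with the chain rule; return both gradients."""
    h, out = forward(x, w1, w2)
    d_out = -2.0 * (target - out)     # d cost / d out, with cost = (target - out)^2
    d_z2 = d_out * out * (1.0 - out)  # back through the output sigmoid
    grad_w2 = d_z2 * h
    d_h = d_z2 * w2                   # error passed back to the hidden unit
    d_z1 = d_h * h * (1.0 - h)        # back through the hidden sigmoid
    grad_w1 = d_z1 * x
    return grad_w1, grad_w2


def train(x, target, w1, w2, learning_rate=0.10, iterations=200):
    """Alternate forward runs, cost measurement, and backward weight updates."""
    costs = []
    for _ in range(iterations):
        _, out = forward(x, w1, w2)
        costs.append((target - out) ** 2)
        g1, g2 = backward(x, target, w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2, costs


# Hypothetical training pair and starting weights.
w1, w2, costs = train(x=1.0, target=1.0, w1=0.5, w2=0.5)
print(costs[0], costs[-1])
```

The first printed cost should be larger than the last one, which is the "down the slope" behavior described above.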
Stochastic gradient descent (SGD) consists of computing each gradient descent update on a random (stochastic) sample of the data instead of using the whole dataset every time.
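A minimal sketch of the idea, assuming a hypothetical one-weight model (output = w * x), made-up data, a batch size of 2, and a fixed random seed for repeatability: each update uses only a small random sample rather than all ten data points.

```python
import random


def sgd_fit(data, learning_rate=0.10, batch_size=2, iterations=2000, seed=0):
    """Fit w in output = w * x, using a random mini-batch for every update."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)  # random (stochastic) sample
        # Gradient of the mean squared error, computed on the sample only.
        grad = sum(-2.0 * x * (y - w * x) for x, y in batch) / batch_size
        w -= learning_rate * grad
    return w


# Hypothetical dataset generated by y = 2 * x, with x from 0.1 to 1.0.
data = [(x, 2.0 * x) for x in (i / 10.0 for i in range(1, 11))]
w = sgd_fit(data)
print(round(w, 3))  # close to 2.0
```

Each update is cheaper than a pass over the whole dataset, at the price of a noisier path toward the minimum when the data itself is noisy.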