The Cost function
A Cost function measures the average error that our entire network makes over a batch, and is often defined by this equation:
The inputs are the network's weights, and the output is the average cost we encountered over the processed batch. Think of this cost as the errors averaged over the batch. Our goal is to minimize this function, that is, to drive the cost of errors to the lowest value possible. In the previous couple of examples, we have seen a technique called gradient descent being used to minimize this cost function. Gradient descent works by differentiating the Cost function and determining the gradient with respect to each weight. Then, for each weight, or dimension if you will, the algorithm nudges the weight a small step in the direction opposite to its gradient, which is the direction that reduces the Cost function.
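To make the update rule concrete, here is a minimal sketch in plain Python. It assumes a mean squared error cost and a tiny linear model with two weights, neither of which is prescribed by the text; the batch values and the names cost, gradient, and gradient_descent are hypothetical and chosen only for illustration.

```python
def cost(weights, batch):
    """Mean squared error of a linear model over a batch of (inputs, target) pairs."""
    total = 0.0
    for inputs, target in batch:
        prediction = sum(w * x for w, x in zip(weights, inputs))
        total += (prediction - target) ** 2
    return total / len(batch)


def gradient(weights, batch):
    """Partial derivative of the cost with respect to each weight."""
    grads = [0.0] * len(weights)
    for inputs, target in batch:
        error = sum(w * x for w, x in zip(weights, inputs)) - target
        for i, x in enumerate(inputs):
            # d/dw_i of (prediction - target)^2, averaged over the batch
            grads[i] += 2.0 * error * x / len(batch)
    return grads


def gradient_descent(weights, batch, learning_rate=0.1, steps=200):
    """Repeatedly step each weight against its gradient."""
    for _ in range(steps):
        grads = gradient(weights, batch)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical batch generated from target = 2 * x1 - 1 * x2
batch = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
         ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
start = [0.0, 0.0]
trained = gradient_descent(start, batch)
print(cost(start, batch), cost(trained, batch), trained)
```

Each pass computes one gradient value per weight and moves that weight a small step against it, so the cost printed for the trained weights should be far lower than the cost at the starting point.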
Before we get into the heavy math that explains the differentiation, let's see how gradient descent works in two dimensions, with the following diagram:
In simpler terms, all the algorithm is doing is trying to find the minimum in slow, gradual steps. We use small steps to avoid overshooting the minimum, which, as you have seen earlier, can happen (remember the wobble). That is where the term learning rate comes in: it scales the size of each step and so determines how fast we train. A lower learning rate takes smaller, more reliable steps toward the minimum, but usually at a cost of time. The alternative is to train quicker, using a higher learning rate, but, as you can see now, it then becomes easy to overshoot the minimum.
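The effect of the learning rate can be seen on a toy problem. The sketch below assumes a one-weight cost, cost(w) = w * w, whose minimum sits at w = 0 and whose gradient is 2 * w; the function name descend and the three learning rates are hypothetical, picked only to show the different behaviors.

```python
def descend(learning_rate, start=1.0, steps=10):
    """Run gradient descent on cost(w) = w * w and record every weight visited."""
    w = start
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w                  # derivative of w * w
        w = w - learning_rate * grad    # step against the gradient
        path.append(w)
    return path


small = descend(0.1)    # small steps: approaches 0 from one side
wobble = descend(0.9)   # large steps: jumps across the minimum on every step
diverge = descend(1.1)  # too large: each step lands farther from the minimum

for name, path in (("small", small), ("wobble", wobble), ("diverge", diverge)):
    print(name, [round(w, 3) for w in path[:4]], "... final:", round(path[-1], 3))
```

With the small rate the weight shrinks steadily toward the minimum. With the larger rate the weight flips sign on every step, which is the wobble mentioned above, and with the largest rate the overshoot grows until the weight moves away from the minimum altogether.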
Gradient descent is the simplest optimization algorithm we will talk about, but keep in mind that there are several more advanced variations that we will explore. In the TF example, for instance, we used AdamOptimizer to minimize the Cost function. For now, though, we will focus on how to calculate the gradient of the Cost function and, in the next section, on the basics of backpropagation with gradient descent.