上QQ阅读APP看书，第一时间看更新

Weight optimization

Before the training starts, the network parameters are set randomly. Then to optimize the network weights, an iterative algorithm called Gradient Descent (GD) is used. Using GD optimization, our network computes the cost gradient based on the training set. Then, through an iterative process, the gradient G of the error function E is computed.

In following graph, gradient G of error function E provides the direction in which the error function with current values has the steeper slope. Since the ultimate target is to reduce the network error, GD makes small steps in the opposite direction -G. This iterative process is executed a number of times, so the error E would move down towards the global minima. This way, the ultimate target is to reach a point where G = 0, where no further optimization is possible:

Searching for the minimum for the error function E; we move in the direction in which the gradient G of E is minimal

The downside is that it takes too long to converge, which makes it impossible to meet the demand of handling large-scale training data. Therefore, a faster GD called Stochastic Gradient Descent (SDG) is proposed, which is also a widely used optimizer in DNN training. In SGD, we use only one training sample per iteration from the training set to update the network parameters.

I'm not saying SGD is the only available optimization algorithm, but there are so many advanced optimizers available nowadays, for example, Adam, RMSProp, ADAGrad, Momentum, and so on. More or less, most of them are either direct or indirect optimized versions of SGD.

By the way, the term stochastic comes from the fact that the gradient based on a single training sample per iteration is a stochastic approximation of the true cost gradient.