Numerical – controlled convergence

This approach saves time when the target result is known beforehand, because training can stop as soon as that result is reached. Training the reinforcement program in this manner also validates the process.

In the following source code, an intuitive cross-entropy function is introduced (see Chapter 9, Getting Your Neurons to Work, for more on cross-entropy).

In this context, cross-entropy can be thought of as energy. The main concepts are as follows:

Energy represents the difference between one distribution and another
It is what makes the system continue to train
When a lot of training needs to be done, there is a high level of energy
When the training reaches the end of its cycles, the level of energy is low
In the following code, the cross-entropy value (CEV) measures the difference between a target matrix and the episode matrix
Cross-entropy is often measured in more complex forms when necessary (see Chapter 9, Getting Your Neurons to Work, and Chapter 10, Applying Biomimicking to Artificial Intelligence)
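
The measurement described in the points above can be sketched as a small standalone function. This is a minimal illustration rather than the chapter's exact code: the name cross_entropy_value and the sample totals 100 and 2000 are assumptions chosen for the example, while 3992 is the target total used in this chapter's model.

```python
import math


def cross_entropy_value(q_sum, target_sum):
    """Log-scale gap between the current Q total and the known target total."""
    if q_sum <= 0 or target_sum <= 0:
        raise ValueError("both totals must be positive to take a logarithm")
    # Algebraically the same as -(log(q_sum) - log(target_sum))
    return math.log(target_sum) - math.log(q_sum)


target = 3992.0  # target total used in this chapter's model
for q_sum in (100.0, 2000.0, 3992.0):  # hypothetical totals during training
    print(q_sum, round(cross_entropy_value(q_sum, target), 4))
```

The value shrinks toward zero as the Q total approaches the target, which matches the picture of high energy early in training and low energy at the end.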

In the following code, a basic function provides sufficient results.

import math

# ql is assumed to be numpy (import numpy as ql), defined earlier in the program
ceg = 3992  # target value of Q.sum(), known in advance for this model

for i in range(50000):
    # Pick a random state, then a random action among those allowed from it
    current_state = ql.random.randint(0, int(Q.shape[0]))
    PossibleAction = possible_actions(current_state)
    action = ActionChoice(PossibleAction)
    reward(current_state, action, gamma)  # expected to update Q
    if Q.sum() > 0:
        # CEV: log-scale gap between the current Q total and the target
        CEV = -(math.log(Q.sum()) - math.log(ceg))
        print("convergent episode:", i, "numerical convergent value:", CEV)
        if Q.sum() - ceg == 0:
            print("Final convergent episode:", i, "numerical convergent value:", ceg - Q.sum())
            break  # reached on average (the process is random) before 50000

The previous program stops before 50,000 episodes. This is because, in the model described in this chapter (see the previous code excerpt), the loop breaks as soon as the sum of Q reaches the known target, at which point the CEV has fallen to zero.

convergent episode: 1573 numerical convergent value: -0.0
convergent episode: 1574 numerical convergent value: -0.0
convergent episode: 1575 numerical convergent value: -0.0
convergent episode: 1576 numerical convergent value: -0.0
convergent episode: 1577 numerical convergent value: -0.0
Final convergent episode: 1577 numerical convergent value: 0.0
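
The output prints -0.0 for the CEV but 0.0 for the final value. A likely explanation is floating-point negative zero: negating a difference that evaluates to 0.0 yields -0.0, whereas the final line prints a plain subtraction. The short check below illustrates this under the assumption that the two logarithms are equal; the variable names are illustrative.

```python
import math

total = 3992.0  # assumed: the Q total has reached the target

# Same shape as the CEV expression: negate a difference of two equal logs
cev = -(math.log(total) - math.log(total))
final_value = 3992 - total  # same shape as the final print

print(cev)                      # prints -0.0
print(final_value)              # prints 0.0
print(cev == final_value)       # the two zeros still compare equal
print(math.copysign(1.0, cev))  # exposes the sign bit of the zero
```

Both values are numerically zero, so the sign in the printed output does not affect the convergence test.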

The program stopped at episode 1577. Since the decision process is random, the same number is unlikely to be obtained twice in a row. Furthermore, the constant 3992 was known in advance. This is possible in closed environments where a goal has been set beforehand. That is not often the case, but it serves here to illustrate the concept of convergence. The following chapters will explore better ways to reach convergence, such as gradient descent.
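
Because the totals in Q are floating-point numbers, an exact equality test against the known constant can be fragile. One possible alternative, shown here as a sketch and not as the book's method, is to stop once the total is within a small tolerance of the target. The helper name has_converged and the tolerance value are assumptions made for this example.

```python
import math


def has_converged(q_sum, target_sum, rel_tol=1e-9):
    """Return True when q_sum is within a relative tolerance of target_sum."""
    return math.isclose(q_sum, target_sum, rel_tol=rel_tol)


target = 3992.0  # constant known in advance for this chapter's model

print(has_converged(3992.0, target))         # exact match
print(has_converged(3992.0 - 1e-9, target))  # off by rounding-sized noise
print(has_converged(3990.0, target))         # still training
```

In the training loop, such a helper could replace the exact comparison, at the cost of choosing a tolerance that suits the scale of the Q values.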

The Python program is available at:

https://github.com/PacktPublishing/Artificial-Intelligence-By-Example/blob/master/Chapter03/Q_learning_convergence.py