## Reinforcement Learning with OpenAI Gym

Reinforcement learning is currently one of the most promising research fields for data scientists: it makes it feasible to outperform the processes we have had so far, and it brings the so-called artificial general intelligence (AGI) within imagination. In our previous blog post we introduced the theory of reinforcement learning. This post builds on that knowledge and walks through the basics of reinforcement learning with a practical example.

Before diving into the practical side, we have to mention OpenAI, one of the leading research groups in the field of reinforcement learning and artificial general intelligence. They developed a toolkit called Gym, a free and easy-to-use tool for the artificial intelligence community. Programmers and developers can use it to compare different reinforcement learning algorithms. We can program agents into games ranging from walking all the way to pinball and beyond.

```python
import gym

env = gym.make('CartPole-v0')
```

The code above creates the CartPole environment, which is basically a balancing game. The goal is to balance an unstable pole on a small moving cart. The cart moves along one dimension on a frictionless track. The game starts when the cart starts to move and the pole starts to fall. Our agent earns a +1 reward at every successful timestep, where successful means the pole is still balanced and standing. The game is over if the pole tilts more than 15 degrees from the vertical axis, or the cart moves more than 2.4 units to the left or right of the center.
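The termination rule described above can be sketched in a few lines. This is an illustrative reconstruction of the rule, not Gym's actual implementation; the constant and function names are our own:

```python
# Hypothetical sketch of CartPole's termination rule (not Gym's source code).
POLE_LIMIT_DEG = 15.0   # maximum pole angle from vertical, in degrees
CART_LIMIT = 2.4        # maximum cart distance from the center, in units

def episode_over(pole_angle_deg, cart_position):
    """Return True when the balancing episode should terminate."""
    return abs(pole_angle_deg) > POLE_LIMIT_DEG or abs(cart_position) > CART_LIMIT
```

As long as neither limit is crossed, the episode continues and the agent keeps collecting +1 rewards.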

Which action to choose in a given state can be defined in a variety of ways, with different policies. One possibility is to define a measure of goodness (Q) for each state/action pair: the maximum expected future reward for taking that action in that state. In Q-learning these values are stored in a table, and once training is complete, the agent chooses the action in the given state that promises the maximum future reward.

Value-based reinforcement learning – Q-learning

Initialize the Q values: create a Q table in which the number of columns equals the number of possible actions and the number of rows equals the number of states. At the beginning we set every Q value to 0 (this value is the expected reward for the given state/action pair).
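The Q table is keyed by zero-padded strings like '042', which implies that the continuous CartPole observation is first discretized into bins. The post does not show this step; here is one plausible sketch, binning three observation components into one digit each (the choice of components and the bin edges are our own assumptions, not taken from the original code):

```python
import numpy as np

# Hypothetical discretizer: map each chosen observation component to one
# digit (0-9) and concatenate the digits into a state key like '042'.
# Which components to keep and the bin edges below are illustrative only.
BINS = [
    np.linspace(-2.4, 2.4, 9),    # cart position
    np.linspace(-0.21, 0.21, 9),  # pole angle (radians, about 12 degrees)
    np.linspace(-2.0, 2.0, 9),    # pole angular velocity
]

def get_state(observation):
    """Discretize a continuous observation into a 3-digit state string."""
    digits = [np.digitize(value, edges) for value, edges in zip(observation, BINS)]
    return ''.join(str(d) for d in digits)
```

With 10 possible digits in each of the three positions, this yields keys '000' through '999', matching the `zfill(3)` state keys used below.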

```python
# Quality init
Qualities = {}
all_states = []

# Define all the possible states of the game as zero-padded strings
for i in range(MAXSTATES):
    all_states.append(str(i).zfill(3))

for state in all_states:
    Qualities[state] = {}
    for action in range(env.action_space.n):
        Qualities[state][action] = 0
```

One learning cycle.

Choose an action from the current state based on the Q table. If the environment is explored in only one direction, it is easy to get stuck in a locally optimal policy. To overcome this, we extend the decision strategy with an exploration mechanism: our agent does not always pick the best action it has already experienced (the maximum of the given row of the Q table), but sometimes selects an action at random. For this selection mechanism we define a parameter, epsilon, which determines the ratio in which the agent explores versus exploits the learned table. This ratio changes during learning.

```python
eps = 1.0 / np.sqrt(game_idx + 1)
```

At the beginning of learning our Q table is not yet reliable, because its values are still at their initial settings. Therefore, when choosing an action early on, we favor the exploration strategy, since at this point we barely know anything about the environment; the epsilon is near 1. Approaching the end of learning we rarely use the exploration strategy anymore, and rely instead on the exploitation strategy: this is expressed by the epsilon parameter, which decays toward 0 as we exploit our now reliable Q table.
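As a sanity check, this schedule can be evaluated directly; its values match the epsilon column of the results table at the end of the post:

```python
import numpy as np

def epsilon(game_idx):
    # Decay schedule from the snippet above: starts at 1, shrinks slowly.
    return 1.0 / np.sqrt(game_idx + 1)

# epsilon(0) == 1.0, epsilon(3) == 0.5, epsilon(10000) is about 0.01
```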

```python
if np.random.uniform() < eps:
    act = env.action_space.sample()      # explore: random action (epsilon-greedy)
else:
    act, _ = max_dict(Qualities[state])  # exploit: best known action
```
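The snippets above (and the update step later) call a helper, `max_dict`, which the post never defines. A minimal sketch consistent with how it is used, returning the best action together with its Q value:

```python
def max_dict(d):
    """Return the (key, value) pair with the largest value in a dict."""
    best_key = max(d, key=d.get)
    return best_key, d[best_key]
```

With this signature, the greedy branch unpacks the pair, e.g. `act, _ = max_dict(Qualities[state])`, keeping the action and discarding its value.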

Overwrite and feedback.

After the agent has taken the action, it finds itself in a new state and receives a reward, which can be positive or negative depending on the previous state and the action taken. Knowing the previous state, the new state, the action taken and the reward received, the agent can update and overwrite its Q table using the Bellman equation:

NewQ(s,a) = Q(s,a) + α · [R(s,a) + γ · maxQ′(s′,a′) − Q(s,a)]

where:

• NewQ(s,a): the new Q value for taking that action from that state.
• Q(s,a): the current Q value for the state/action pair.
• R(s,a): the reward for taking this action from this state.
• maxQ′(s′,a′): the biggest possible future reward, considering the new state and all possible actions from it.
• α: learning rate
• γ: discount rate
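Plugging illustrative numbers into the update makes it concrete (all values below are made up for the example; α and γ are not taken from the original code):

```python
ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.9   # discount rate (illustrative value)

q_sa = 0.5          # current Q(s, a)
reward = 1.0        # R(s, a): +1 for a still-standing pole
max_q_next = 2.0    # maxQ'(s', a') over all actions in the new state

# Bellman update: move Q(s, a) one step toward the bootstrapped target.
new_q = q_sa + ALPHA * (reward + GAMMA * max_q_next - q_sa)
# target = 1.0 + 0.9 * 2.0 = 2.8;  new_q = 0.5 + 0.1 * (2.8 - 0.5) = 0.73
```

The Q value moves only a fraction (α) of the way toward the target, which keeps learning stable when rewards are noisy.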

```python
a1, max_q_s1a1 = max_dict(Qualities[state_new])  # best action and value in the new state
Qualities[state][act] += ALPHA * (reward + GAMMA * max_q_s1a1 - Qualities[state][act])
state, act = state_new, a1
```

The process described above makes up one learning cycle. In order for the agent to be able to map the entire environment, more than one game has to be played. This amount can be set with a parameter.

```python
for game_idx in range(howmanyGames):
    ...  # one learning cycle: choose an action, step, Bellman update
```
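Putting the pieces together, the sketch below runs the full loop on a tiny stand-in environment: a 5-cell corridor, our own toy substitute for CartPole, so the example is self-contained and runnable without Gym. The structure — epsilon decay, epsilon-greedy action choice, Bellman update — mirrors the snippets above:

```python
import numpy as np

np.random.seed(0)

# Toy stand-in environment (not CartPole): a corridor of 5 cells.
# Action 1 moves right, action 0 moves left; walls clamp the position.
# Reaching the rightmost cell ends the episode with a +1 reward.
N_CELLS, N_ACTIONS = 5, 2

def step(state, action):
    new_state = min(max(state + (1 if action == 1 else -1), 0), N_CELLS - 1)
    done = new_state == N_CELLS - 1
    reward = 1.0 if done else 0.0
    return new_state, reward, done

def max_dict(d):
    best = max(d, key=d.get)
    return best, d[best]

ALPHA, GAMMA = 0.1, 0.9   # illustrative values
Qualities = {s: {a: 0.0 for a in range(N_ACTIONS)} for s in range(N_CELLS)}

for game_idx in range(2000):
    eps = 1.0 / np.sqrt(game_idx + 1)            # exploration decays over games
    state = 0
    for _ in range(50):                          # cap the episode length
        if np.random.uniform() < eps:
            act = np.random.randint(N_ACTIONS)   # explore: random action
        else:
            act, _ = max_dict(Qualities[state])  # exploit: best known action
        state_new, reward, done = step(state, act)
        _, max_q_s1a1 = max_dict(Qualities[state_new])
        Qualities[state][act] += ALPHA * (reward + GAMMA * max_q_s1a1
                                          - Qualities[state][act])
        state = state_new
        if done:
            break
```

After training, moving right ends up looking better than moving left in the starting cell, exactly what the corridor's reward structure demands.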

Results

| Game index | Epsilon | Reward |
|-----------:|--------:|-------:|
| 0          | 1.0000  | 14.0   |
| 10000      | 0.0100  | 200.0  |
| 20000      | 0.0071  | 200.0  |
| 30000      | 0.0058  | 200.0  |
| 40000      | 0.0050  | 200.0  |
| 50000      | 0.0045  | 200.0  |

Now you are familiar with the reinforcement learning loop and the Bellman update equation, and you have seen value-based reinforcement learning, or Q-learning, in action. Notice that this reinforcement learning algorithm is largely independent of the environment: it is not specialized to the system around it. In our next blog post on reinforcement learning we will introduce the so-called Deep Q-learning, a step toward artificial general intelligence that can be effective in a variety of environments.

We are confident that inside your company there are many tasks that can be automated with AI. If you would like to enjoy the advantages of artificial intelligence, sign up for a free consultation through one of our contact channels.