Teaching a Neural Network to play a game using Q-learning

In this blog post we will walk through how to build an AI that can play a computer game with a Neural Network and Q-Learning. We will expand our game from the Teaching an AI to play a simple game using Q-learning blog post to be more complex by introducing an extra dimension. To get the most out of this blog post I recommend reading the previous post first.

The full source for this example is available on Github here. Note the Neural Network version of the reinforcement learning algorithm is in the neuralnetwork branch.

The game

Our game is a simple “catch the cheese” console game where the player P must move to catch the cheese C without falling into the pit O. The player gets one point of each cheese he finds and minus one point for every time he falls into the pit. The game ends if the user gets either 5 points or -5 points.

As mentioned above we are expanding our original game with a new dimension so the player can now move up, down, left and right. This gif shows a human player playing the new game.

This gif shows a human player playing the game.

Reinforcement Learning with a Neural Network

In a previous post we build an AI using the q-learning algorithm with a q-table. That algorithm used the q-table to lookup the optimal next action based on the current state of the game (for a refresher on how the q-learning algorithm works go here). While this works fine for simple games, as the complexity of the game grows so does the q-table. This is because the q-table must contain q-values for each possible action A at each possible state S of the game.

One alternative approach is to replace the table lookup with a Neural Network. Our Neural Network would take a state S and an action A as input and output the q-value, that is the possible reward, for taking the action A at the state S.

With this implementation, to determine which action A to take at the state S, our AI would run the network once for every action, and select the action that gave the highest output of the network to maximize the reward for the AI.

To train our network we would use a similar approach as the original Q-Learning algorithm, but customize it for our Neural Network as follows:

Step 1: Initialize Neural Network with random values
Step 2: While playing the game execute the following loop
Step 2.a: Generate random number between 0 and 1

If number is larger than the threshold e select random action, otherwise select action with the highest possible reward by on running the neural network with the current state and each possible action

Step 2.b: Take action from step 2.a
Step 2.c: Observe reward r after taking action
Step 2.d: Train the Neural Network using the reward r using the formula

\[ \displaystyle Q(s_{t},a_{t})\leftarrow \overbrace {\underbrace {r_{t}} _{\rm {reward}}+\underbrace {\gamma } _{\rm {discount~factor}}\cdot \underbrace {\max _{a}Q(s_{t+1},a)} _{\rm {estimate~of~optimal~future~value}}} ^{\rm {learned~value}} \]

Implementing this will gives us an AI that trains the Neural Network in a online training fashion, that is training the network immediately as data becomes available.

Catastrophic Interference and Experience Replay

Online training algorithms, as the one explained above, are unfortunately prone to catastrophic interference. Catastrophic interference is when a Neural Network abruptly forgets what is has previously learned when learning new information.

As an example, in our game the network will experience that sometimes going left will result in finding the cheese, but other times going left will cause you to fall into the pit. Catastrophic interference will cause the network to forget what is has previously learning about going left when falling into the pit. This causes the network to struggle finding a good general solution to the game.

To counter catastrophic interference we can use a method called experience replay. We introduce a replay memory of size R to our AI. On each iteration we train the network with a random batch of size B of states and actions from the replay memory. Using this approach we continuously train the network using a batch of samples instead of just a single example, thereby countering catastrophic interference.

Our q-learning algorithm now works as follows:

Step 1: Initialize Neural Network with random values
Step 2: While playing the game execute the following loop
Step 2.a: Generate random number between 0 and 1

Step 2.b: Take action from step 2.a
Step 2.c: Observe reward r after taking action
Step 2.d: Add state, action, reward and new state to replay memory (if memory is full overwrite oldest entry)
Step 2.e: If replay memory is full – extract batch of size B

For each example in the batch calculate the target q value using:
\[ \displaystyle Q(s_{t},a_{t})\leftarrow \overbrace {\underbrace {r_{t}} _{\rm {reward}}+\underbrace {\gamma } _{\rm {discount~factor}}\cdot \underbrace {\max _{a}Q(s_{t+1},a)} _{\rm {estimate~of~optimal~future~value}}} ^{\rm {learned~value}} \]

Train the Neural Network using the batch target q values and input states

Implementing the Neural Network AI

With our algorithm defined we can start implementing our AI player. The game takes an instance of a player class as the player object. The player class must implement the get_input function. The get_input function is called once on every iteration of the game loop, and returns the player direction.

An example of the player class for a human player is given below:

require 'io/console'

class Player
  attr_accessor :y,:x

  def initialize
    @x = 0
    @y = 0
  end

  def get_input
    input = STDIN.getch
    if input == 'a'
      return :left
    elsif input == 'd'
      return :right
    elsif input == 'w'
      return :up
    elsif input == 's'
      return :down
    elsif input == 'q'
      exit
    end

    return :nothing
  end
end

For our Neural Network AI player we must implement a new player class that uses the algorithm outline above to determine the action in the get_input function.

We start by requiring the ruby-fann gem. The ruby-fann is a gem that contains Ruby bindings for FANN (Fast Artificial Neural Network) a C implementation of a Neural Network.

Next we define our constructor and in our constructor setup the player attributes and the parameters we need for our algorithm. For our example we use a replay memory size of 500 and a training batch size of 400.

require 'ruby-fann'

class QLearningPlayer
  attr_accessor :y, :x, :game

  def initialize
    @x = 0
    @y = 0
    @actions = [:left, :right, :up, :down]
    @first_run = true

    @discount = 0.9
    @epsilon = 0.1
    @max_epsilon = 0.9
    @epsilon_increase_factor = 800.0

    @replay_memory_size = 500
    @replay_memory_pointer = 0
    @replay_memory = []
    @replay_batch_size = 400

    @runs = 0

    @r = Random.new
  end

Note how we also setup parameters to support a dynamic epsilon value. Epsilon is the probability used in step 2.a of the algorithm to select a action. If we have a low epsilon there is a higher probability of choosing a random action, instead of the action with the highest reward. Our epsilon implementation will be dynamic and start with a very low value, to encourage exploration, and grow for every iteration until it hits the max_epsilon value.

Next we setup a function to initialize our Neural Network. We setup our network to have the input size of the map x * map y + number of possible actions. We have one hidden layer with the same amount of neurons as a input layer and one output node (our q value). Additionally we setup the learning rate to 0.2 and change the activation functions to sigmoid symmetric to support negative values.

def initialize_q_neural_network
  # Setup model
  # Input is the size of the map + number of actions
  # Output size is one 
  @q_nn_model = RubyFann::Standard.new(
                num_inputs: @game.map_size_x*@game.map_size_y + @actions.length,
                hidden_neurons: [ (@game.map_size_x*@game.map_size_y+@actions.length) ],
                num_outputs: 1 )

  @q_nn_model.set_learning_rate(0.2)

  @q_nn_model.set_activation_function_hidden(:sigmoid_symmetric)
  @q_nn_model.set_activation_function_output(:sigmoid_symmetric)

end

Now its time to implement the get_input function. We start by pausing for a few milliseconds to help us follow along with the AI player and increment the attribute keeping track of the number of runs.

Then we check if this is the first run and if so we initialize the neural network (step 1).

def get_input
  # Pause to make sure humans can follow along
  # Increase pause with the number of runs
  sleep 0.05 + 0.01*(@runs/400.0)
  @runs += 1

  if @first_run
    # If this is first run initialize the Q-neural network
    initialize_q_neural_network
    @first_run = false
  else

If this is not the first run we evaluate what happened on the last move to calculate the reward (step 2.c). If the game score increased we set the reward to 1, if the game score decreased we give a negative reward of -1, and if nothing happened we set the reward to -0.1. Giving a negative reward when nothing happens will encourage the algorithm to go straight for the cheese.

# If this is not the first 
# Evaluate what happened on last action and calculate reward
r = 0 # default is 0
if !@game.new_game and @old_score < @game.score
  r = 1 # reward is 1 if our score increased
elsif !@game.new_game and @old_score > @game.score
  r = -1 # reward is -1 if our score decreased
elsif !@game.new_game
  r = -0.1
end

Next we capture the current state of the game and add it to the replay memory along with the reward and old state. We capture the state as an input vector to the neural network. We code the current position of the player in the input vector by setting the vector to 1 at the player position (step 2.d).

# Capture current state
# Set input to network map_size_x * map_size_y + actions length vector with a 1 on the player position
input_state = Array.new(@game.map_size_x*@game.map_size_y + @actions.length, 0)
input_state[@x + (@game.map_size_x*@y)] = 1

# Add reward, old_state and input state to memory
@replay_memory[@replay_memory_pointer] = {reward: r, old_input_state: @old_input_state, input_state: input_state}
# Increment memory pointer
@replay_memory_pointer = (@replay_memory_pointer<@replay_memory_size) ? @replay_memory_pointer+1 : 0

Then we check if the memory is full. If it is full we extract a random batch, calculate the update q values and train the network (step 2.e).

  # If replay memory is full train network on a batch of states from the memory
  if @replay_memory.length > @replay_memory_size
    # Randomly sample a batch of actions from the memory and train network with these actions
    @batch = @replay_memory.sample(@replay_batch_size)
    training_x_data = []
    training_y_data = []

    # For each batch calculate new q_value based on current network and reward
    @batch.each do |m|
      # To get entire q table row of the current state run the network once for every posible action
      q_table_row = []
      @actions.length.times do |a|
        # Create neural network input vector for this action
        input_state_action = m[:input_state].clone
        # Set a 1 in the action location of the input vector
        input_state_action[(@game.map_size_x*@game.map_size_y) + a] = 1
        # Run the network for this action and get q table row entry
        q_table_row[a] = @q_nn_model.run(input_state_action).first
      end

      # Update the q value
      updated_q_value = m[:reward] + @discount * q_table_row.max

      # Add to training set
      training_x_data.push(m[:old_input_state])
      training_y_data.push([updated_q_value])
    end

    # Train network with batch
    train = RubyFann::TrainData.new( :inputs=> training_x_data, :desired_outputs=>training_y_data );
    @q_nn_model.train_on_data(train, 1, 1, 0.01)
  end
end

With the network updated we start to think about which action to do next. We first capture the current state of the game in the a network input vector, then we calculate epsilon based on the current run of the algorithm. The more runs the higher the epsilon meaning the higher probability of choosing the action with the highest reward instead of a random action.

Next we either pick a random action or run the network for the current state S and each action A to determine which action to take based on the network output (step 2.a).

# Capture current state and score
# Set input to network map_size_x * map_size_y vector with a 1 on the player position
input_state = Array.new(@game.map_size_x*@game.map_size_y + @actions.length, 0)
input_state[@x + (@game.map_size_x*@y)] = 1
# Chose action based on Q value estimates for state
# If a random number is higher than epsilon we take a random action
# We will slowly increase @epsilon based on runs to a maximum of @max_epsilon - this encourages early exploration
epsilon_run_factor = (@runs/@epsilon_increase_factor) > (@max_epsilon-@epsilon) ? (@max_epsilon-@epsilon) : (@runs/@epsilon_increase_factor)
if @r.rand > (@epsilon + epsilon_run_factor)
  # Select random action
  @action_taken_index = @r.rand(@actions.length)
else
  # To get the entire q table row of the current state run the network once for every posible action
  q_table_row = []
  @actions.length.times do |a|
    # Create neural network input vector for this action
    input_state_action = input_state.clone
    # Set a 1 in the action location of the input vector
    input_state_action[(@game.map_size_x*@game.map_size_y) + a] = 1
    # Run the network for this action and get q table row entry
    q_table_row[a] = @q_nn_model.run(input_state_action).first
  end
  # Select action with highest posible reward
  @action_taken_index = q_table_row.each_with_index.max[1]
end

Lastly we store the current score in the old score variable as well as the current state in the old state variable and return the action to take so the game can execute the action (step 2.b).

  # Save current state, score and q table row
  @old_score = @game.score

  # Set action taken in input state before storing it
  input_state[(@game.map_size_x*@game.map_size_y) + @action_taken_index] = 1
  @old_input_state = input_state

  # Take action
  return @actions[@action_taken_index]
end

This is quite a lot of code you can find the complete combined code here: https://github.com/daugaard/q-learning-simple-game/blob/55748d5e821b34a531dba4d9c4b2683038db6b3d/q_learning_player.rb.

Letting the AI play

With our AI programmed we can run the code and see how it does.

We see that the AI is all over the place in the beginning. This is because of the dynamic epsilon value and because we don’t start training the network until the replay memory is full, meaning the actions taken are truly random in the beginning. However at the end of run 1 and beginning of run 2 you see the AI has learned to avoid the pit and go straight for the cheese.

Generalizing the approach

This post showed how we can teach a Neural Network with symmetric sigmoid activators to play a simple game by coding the game state and actions as the input vector and getting a measure of reward as output of the network. This scheme requires knowledge of the game to model the network which can be limiting when wanting to build a more general AI.

A more general approach could be to use the RBG values of the game rendering as input instead of a coded game state. This approach was explored in the paper Playing Atari with Deep Reinforcement Learning by researchers at DeepMind Technologies. In this paper they successfully trained Q-learning with a neural network Q table to play Space Invaders, Pong, Q*bert and other Atari 2600 games.

If you enjoyed this article signup for our newsletter to get more blog posts and updates straight to your inbox.

6 comments

Boris says:

February 5, 2018 at 2:32 pm

Would adding another C to the map with value of 2 require any code changes to the QLearningPlayer, besides awarding more points in the first prt of get_input? I was wondering whether this AI could differentiate between “cheeses” of higher value.

Thanks

- Soren D says:
  
  February 5, 2018 at 7:12 pm
  
  Adding another cheese would require you to change the reward function so you can give a higher reward for the “bigger” cheese. Other than that you can use the same approach, however you might want to try a couple of different network sizes to figure out which one gives the best result.
  
  Adding complexity to the system often means you’ll need to increase the number of hidden neurons in the neural network.
  
Archit Wagle says:

March 9, 2018 at 4:22 am

Excellent explanation!Thanks this helped me a lot!

Henry Ni says:

March 20, 2018 at 12:43 am

How come Q-Learning works on randomly generated terrain?
Doesn’t the table it generate only apply for that certain game level?

- Soren D says:
  
  March 20, 2018 at 7:19 am
  
  The algorithm could work on randomly generated terrain if you include the terrain attributes in the q-table, and train for long enough.
  
  If you don’t you’d have to retrain for each level.
  
Finn says:

May 16, 2018 at 11:28 am

This is by far the best explanation I’ve seen so far.
Still I got a few questions because I am not working with Ruby or with any neural networks library. I’ve coded my own neural networks library.

1) Do you mean with sigmoid symmetric the TanH function? I am using that one.
2) “extract a batch”. Is this batch random? Is the content shuffled?
3) How is your network initialized? What is your init. bias and your initial weights?
4) What method do you use for training? Stochastic gradient descent?
5) I tried to have 4 output neurons, the error is only at the neuron where we just made a decision. This should still work, right?

I would be very happy if you could quickly help me out with this one because my code is only producing mess. 🙂