Why does Q Learning diverge?
My Q-learning algorithm's state values keep diverging to infinity, which means my weights are diverging too. I use a neural network for my value mapping.
I've tried:
- Clipping the target "reward + discount * maximum value of action" to a max/min of 50/-50 (see the sketch below)
- Setting a low learning rate (0.00001, using classic backpropagation to update the weights)
- Decreasing the values of the rewards
- Increasing the exploration rate
- Normalizing the inputs to the range 1~100 (previously it was 0~1)
- Changing the discount rate
- Decreasing the number of layers of the neural network (just for validation)
I've heard that Q-learning is known to diverge with non-linear function approximators, but is there anything else I can try to stop the weights from diverging?
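For reference, here is roughly how that clipping is applied to the bootstrapped target before each update (a simplified sketch with made-up names, not the exact code from my project):

import numpy as np

def clipped_td_target(reward, next_q_values, discount=0.75, clip=50.0):
    # Q-learning target: immediate reward plus discounted best next-action value
    target = reward + discount * np.max(next_q_values)
    # clip the target into [-clip, clip] before it is used as the training signal
    return float(np.clip(target, -clip, clip))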
Update #1 on August 14th, 2017:
In response to a request, I've decided to add some specific details about what I'm doing right now.
I'm currently trying to make an agent learn how to fight in a top-down shooting game. The opponent is a simple bot which moves stochastically.
Each character has 9 actions to choose from on each turn:
- move up
- move down
- move left
- move right
- shoot a bullet upwards
- shoot a bullet downwards
- shoot a bullet to the left
- shoot a bullet to the right
- do nothing
The rewards are:
- if agent hits the bot with a bullet, +100 (I've tried many different values)
- if agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)
- if the agent tries to fire a bullet while bullets can't be fired (e.g. when the agent has just fired one), -25 (not necessary, but I wanted the agent to be more efficient)
- if the bot tries to go out of the arena, -20 (also not necessary, but I wanted the agent to be more efficient)
The inputs for the neural network are:
- Distance between the agent and the bot on the X axis, normalized to 0~100
- Distance between the agent and the bot on the Y axis, normalized to 0~100
- Agent's x and y positions
- Bot's x and y positions
- Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.
I've also fiddled with the inputs; I tried adding new features like the x value of the agent's position (not the distance but the actual position) and the position of the bot's bullet. None of them worked.
Here's the code:
from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm
#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1
#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)
#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]
#Setup character dimentions
character_size = 50
character_move_speed = 25
#Initialize character stats
character_init_health = 100
#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100
#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))
#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)
initialize = tf.global_variables_initializer()
jList = []
rList = []
init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)
#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()
def param_init():
"""Initializes parameters"""
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y
agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
agent_hp = bot_hp = character_init_health
agent_beam_fire = bot_beam_fire = False
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0
def screen_blit():
global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width
disp.fill(aqua)
draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
disp_y / 2 - arena_y / 2, arena_x, arena_y))
if bot_beam_fire == True:
draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
bot_beam_fire = False
if agent_beam_fire == True:
draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
agent_beam_fire = False
draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))
draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
border + 1, float(bot_hp) / float(character_init_health) * 100, 14))
def bot_take_action():
return random.randint(1, 9)
def beam_hit_detector(player):
global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size
if player == "bot":
if bot_current_action == 1:
if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 2:
if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
elif bot_current_action == 3:
if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
else:
if agent_current_action == 1:
if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif agent_current_action == 2:
if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
elif agent_current_action == 3:
if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
def mapping(maximum, number):
return number#int(number * maximum)
def action(agent_action, bot_action):
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size
agent_current_action = agent_action; bot_current_action = bot_action
reward = 0; cont = True; successful = False; winner = ""
if 1 <= bot_action <= 4:
bot_beam_fire = True
if bot_action == 1:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
elif bot_action == 2:
bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
elif bot_action == 3:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
elif bot_action == 4:
bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width
elif 5 <= bot_action <= 8:
if bot_action == 5:
bot_y -= character_move_speed
if bot_y <= disp_y/2 - arena_y/2:
bot_y = disp_y/2 - arena_y/2
elif agent_y <= bot_y <= agent_y + character_size:
bot_y = agent_y + character_size
elif bot_action == 6:
bot_x += character_move_speed
if bot_x >= disp_x/2 + arena_x/2 - character_size:
bot_x = disp_x/2 + arena_x/2 - character_size
elif agent_x <= bot_x + character_size <= agent_x + character_size:
bot_x = agent_x - character_size
elif bot_action == 7:
bot_y += character_move_speed
if bot_y + character_size >= disp_y/2 + arena_y/2:
bot_y = disp_y/2 + arena_y/2 - character_size
elif agent_y <= bot_y + character_size <= agent_y + character_size:
bot_y = agent_y - character_size
elif bot_action == 8:
bot_x -= character_move_speed
if bot_x <= disp_x/2 - arena_x/2:
bot_x = disp_x/2 - arena_x/2
elif agent_x <= bot_x <= agent_x + character_size:
bot_x = agent_x + character_size
if bot_beam_fire == True:
if beam_hit_detector("bot"):
#print "Agent Got Hit!"
agent_hp -= beam_damage
reward += -50
bot_beam_size_x = bot_beam_size_y = 0
bot_beam_x = bot_beam_y = beam_ob
if agent_hp <= 0:
cont = False
winner = "Bot"
if 1 <= agent_action <= 4:
agent_beam_fire = True
if agent_action == 1:
if agent_y > disp_y/2 - arena_y/2:
agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
else:
reward += -25
elif agent_action == 2:
if agent_x + character_size < disp_x/2 + arena_x/2:
agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
else:
reward += -25
elif agent_action == 3:
if agent_y + character_size < disp_y/2 + arena_y/2:
agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
else:
reward += -25
elif agent_action == 4:
if agent_x > disp_x/2 - arena_x/2:
agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
else:
reward += -25
elif 5 <= agent_action <= 8:
if agent_action == 5:
agent_y -= character_move_speed
if agent_y <= disp_y/2 - arena_y/2:
agent_y = disp_y/2 - arena_y/2
reward += -5
elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y + character_size
reward += -2
elif agent_action == 6:
agent_x += character_move_speed
if agent_x + character_size >= disp_x/2 + arena_x/2:
agent_x = disp_x/2 + arena_x/2 - character_size
reward += -5
elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x - character_size
reward += -2
elif agent_action == 7:
agent_y += character_move_speed
if agent_y + character_size >= disp_y/2 + arena_y/2:
agent_y = disp_y/2 + arena_y/2 - character_size
reward += -5
elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y - character_size
reward += -2
elif agent_action == 8:
agent_x -= character_move_speed
if agent_x <= disp_x/2 - arena_x/2:
agent_x = disp_x/2 - arena_x/2
reward += -5
elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x + character_size
reward += -2
if agent_beam_fire == True:
if beam_hit_detector("agent"):
#print "Bot Got Hit!"
bot_hp -= beam_damage
reward += 50
agent_beam_size_x = agent_beam_size_y = 0
agent_beam_x = agent_beam_y = beam_ob
if bot_hp <= 0:
successful = True
cont = False
winner = "Agent"
return reward, cont, successful, winner
def bot_beam_dir_detector():
global bot_current_action
if bot_current_action == 1:
bot_beam_dir = 2
elif bot_current_action == 2:
bot_beam_dir = 4
elif bot_current_action == 3:
bot_beam_dir = 3
elif bot_current_action == 4:
bot_beam_dir = 1
else:
bot_beam_dir = 0
return bot_beam_dir
#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
sess.run(initialize)
success = 0
for i in tqdm(range(1, num_episodes)):
#print "Episode #", i
rAll = 0; d = False; c = True; j = 0
param_init()
samples = []
while c == True:
j += 1
current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
b = bot_take_action()
if np.random.rand(1) < e or i <= 5:
a = random.randint(0, 8)
else:
a, _ = sess.run([predict, Q], feed_dict={input_layer: current_state})
r, c, d, winner = action(a + 1, b)
bot_beam_dir = bot_beam_dir_detector()
next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
samples.append([current_state, a, r, next_state])
if len(samples) > 10:
for count in xrange(batch_size):
[batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
batch_allQ = sess.run(Q, feed_dict={input_layer: batch_current_state})
batch_Q1 = sess.run(Q, feed_dict={input_layer: batch_next_state})
batch_maxQ1 = np.max(batch_Q1)
batch_targetQ = batch_allQ
batch_targetQ[0][a] = reward + y * batch_maxQ1
sess.run([updateModel], feed_dict={input_layer: batch_current_state, next_Q: batch_targetQ})
rAll += r
screen_blit()
if d == True:
e = 1. / ((i / 50) + 10)
success += 1
break
#print agent_hp, bot_hp
display.update()
jList.append(j)
rList.append(rAll)
print winner
I'm pretty sure that if you have pygame, TensorFlow and matplotlib installed in a Python environment, you should be able to run this and see the animations of the bot and the agent "fighting".
I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original, more general one.
Thanks!
Update #2 on August 18, 2017:
Based on the advice of @NeilSlater, I've implemented experience replay in my model. The algorithm has improved, but I'm going to keep looking for improvements that lead to convergence.
Update #3 on August 22, 2017:
I've noticed that if the agent hits the bot with a bullet on a turn where the action taken on that turn was not "fire a bullet", then the wrong action would be given credit. Thus, I've turned the bullets into beams, so the bot/agent takes damage on the same turn the beam is fired.
machine-learning python reinforcement-learning q-learning
Are you using experience replay and bootstrapping values from a "frozen" copy of the recent network? These are approaches used in DQN; they are not guaranteed to help, but they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme are like? Single-step Q-learning will do poorly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.
– Neil Slater, Aug 11 '17 at 7:27
OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail and take a look at your project code to show an example, but it might take a day or two to get back to you with that level of detail.
– Neil Slater, Aug 11 '17 at 15:41
I have got the code running and, if I understand it correctly, the bullets can be "steered" by the agent selecting from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this because it only fires when aligned on the grid with the agent, and always picks the same direction when it does so.
– Neil Slater, Aug 11 '17 at 20:05
Almost right, but you don't store the bootstrapped value; instead you re-calculate it when the step is sampled later. For each action taken, you store four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more, e.g. 10, is typical) from this list and, for Q-learning, calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target).
– Neil Slater, Aug 12 '17 at 16:42
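A minimal sketch of the procedure described in this comment, storing (state, action, reward, next state) tuples and re-calculating the bootstrapped target only when a transition is sampled; `q_values` and `train_step` are hypothetical helpers standing in for the network's forward pass and one gradient update (not code from the question):

import random
import numpy as np

def replay_update(replay_buffer, q_values, train_step, discount=0.75, batch_size=10):
    # replay_buffer holds [state, action, reward, next_state] entries
    for _ in range(batch_size):
        state, action, reward, next_state = random.choice(replay_buffer)
        target_q = q_values(state)                # current estimates for every action
        best_next = np.max(q_values(next_state))  # bootstrap value, computed at sampling time
        target_q[action] = reward + discount * best_next
        train_step(state, target_q)               # one supervised step toward the TD target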
That should be "frozen copy of the approximator (i.e. the neural network)" (if the quote is from one of my comments or answers, please point me at it and I will correct it). It's very simple: just keep two copies of the weight params $\mathbf{w}$, the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S', a', \mathbf{w})$, use the "old" copy to calculate $\hat{q}$, but then train the "live" one with those values.
– Neil Slater, Aug 15 '17 at 7:13
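A minimal sketch of the "two copies of the weights" idea from this comment, using a plain numpy linear Q-function like the one in the question; the class name and copy interval are made up for illustration:

import numpy as np

class FrozenTargetQ(object):
    # Keep a "live" weight matrix that is trained, and a "frozen" copy used
    # only inside the TD target; refresh the copy every `copy_every` updates.
    def __init__(self, n_inputs=7, n_actions=9, copy_every=500):
        self.live_w = np.random.uniform(0, 0.1, (n_inputs, n_actions))
        self.frozen_w = self.live_w.copy()
        self.copy_every = copy_every
        self.updates = 0

    def q_live(self, state):
        # used for action selection and as the network being trained
        return state.dot(self.live_w)

    def td_target(self, reward, next_state, discount=0.75):
        # bootstrap from the frozen copy, not the live weights
        return reward + discount * np.max(next_state.dot(self.frozen_w))

    def after_update(self):
        self.updates += 1
        if self.updates % self.copy_every == 0:
            self.frozen_w = self.live_w.copy()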
|
show 11 more comments
$begingroup$
My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.
I've tried:
- Clipping the "reward + discount * maximum value of action" (max/min set to 50/-50)
- Setting a low learning rate (0.00001 and I use the classic Backpropagation for updating the weights)
- Decreasing the values of the rewards
- Increasing the exploration rate
- Normalizing the inputs to between 1~100 (previously it was 0~1)
- Change the discount rate
- Decrease the layers of the neural network (just for validation)
I've heard that Q Learning is known to diverge on non-linear input, but are there anything else that I can try to stop the divergence of the weights?
Update #1 on August 14th, 2017:
I've decided to add some specific details on what I'm doing right now due to a request to.
I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.
Each character has 9 actions to choose from on each turn:
- move up
- move down
- move left
- move right
- shoot a bullet upwards
- shoot a bullet downwards
- shoot a bullet to the left
- shoot a bullet to the right
- do nothing
The rewards are:
- if agent hits the bot with a bullet, +100 (I've tried many different values)
- if agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)
if the agent tries to fire a bullet while bullets can't be fired(ex. when the agent just fired a bullet, etc. ), -25(Not necessary but I wanted the agent to be more efficient)
if the bot tries to go out of the arena, -20(Not necessary too but I wanted the agent to be more efficient)
The inputs for the neural network are:
Distance between the agent and the bot on the X axis normalized to 0~100
Distance between the agent and the bot on the Y axis normalized to 0~100
Agent's x and y positions
Bot's x and y positions
Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.
I've also fiddled with the inputs too; I tried adding new features like the x value of the agent's position(not the distance but the actual position)and the position of the bot's bullet. None of them worked.
Here's the code:
from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm
#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1
#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)
#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]
#Setup character dimentions
character_size = 50
character_move_speed = 25
#Initialize character stats
character_init_health = 100
#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100
#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))
#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)
initialize = tf.global_variables_initializer()
jList = []
rList = []
init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)
#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()
def param_init():
"""Initializes parameters"""
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y
agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
agent_hp = bot_hp = character_init_health
agent_beam_fire = bot_beam_fire = False
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0
def screen_blit():
global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x,
agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue,
agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width
disp.fill(aqua)
draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
disp_y / 2 - arena_y / 2, arena_x, arena_y))
if bot_beam_fire == True:
draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
bot_beam_fire = False
if agent_beam_fire == True:
draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
agent_beam_fire = False
draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))
draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
border + 1, float(bot_hp) / float(character_init_health) * 100, 14))
def bot_take_action():
return random.randint(1, 9)
def beam_hit_detector(player):
global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x,
bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y,
bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size
if player == "bot":
if bot_current_action == 1:
if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 2:
if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
elif bot_current_action == 3:
if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
else:
if agent_current_action == 1:
if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif agent_current_action == 2:
if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
elif agent_current_action == 3:
if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
def mapping(maximum, number):
return number#int(number * maximum)
def action(agent_action, bot_action):
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire,
bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x,
agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size
agent_current_action = agent_action; bot_current_action = bot_action
reward = 0; cont = True; successful = False; winner = ""
if 1 <= bot_action <= 4:
bot_beam_fire = True
if bot_action == 1:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
elif bot_action == 2:
bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
elif bot_action == 3:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
elif bot_action == 4:
bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width
elif 5 <= bot_action <= 8:
if bot_action == 5:
bot_y -= character_move_speed
if bot_y <= disp_y/2 - arena_y/2:
bot_y = disp_y/2 - arena_y/2
elif agent_y <= bot_y <= agent_y + character_size:
bot_y = agent_y + character_size
elif bot_action == 6:
bot_x += character_move_speed
if bot_x >= disp_x/2 + arena_x/2 - character_size:
bot_x = disp_x/2 + arena_x/2 - character_size
elif agent_x <= bot_x + character_size <= agent_x + character_size:
bot_x = agent_x - character_size
elif bot_action == 7:
bot_y += character_move_speed
if bot_y + character_size >= disp_y/2 + arena_y/2:
bot_y = disp_y/2 + arena_y/2 - character_size
elif agent_y <= bot_y + character_size <= agent_y + character_size:
bot_y = agent_y - character_size
elif bot_action == 8:
bot_x -= character_move_speed
if bot_x <= disp_x/2 - arena_x/2:
bot_x = disp_x/2 - arena_x/2
elif agent_x <= bot_x <= agent_x + character_size:
bot_x = agent_x + character_size
if bot_beam_fire == True:
if beam_hit_detector("bot"):
#print "Agent Got Hit!"
agent_hp -= beam_damage
reward += -50
bot_beam_size_x = bot_beam_size_y = 0
bot_beam_x = bot_beam_y = beam_ob
if agent_hp <= 0:
cont = False
winner = "Bot"
if 1 <= agent_action <= 4:
agent_beam_fire = True
if agent_action == 1:
if agent_y > disp_y/2 - arena_y/2:
agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
else:
reward += -25
elif agent_action == 2:
if agent_x + character_size < disp_x/2 + arena_x/2:
agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
else:
reward += -25
elif agent_action == 3:
if agent_y + character_size < disp_y/2 + arena_y/2:
agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
else:
reward += -25
elif agent_action == 4:
if agent_x > disp_x/2 - arena_x/2:
agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
else:
reward += -25
elif 5 <= agent_action <= 8:
if agent_action == 5:
agent_y -= character_move_speed
if agent_y <= disp_y/2 - arena_y/2:
agent_y = disp_y/2 - arena_y/2
reward += -5
elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y + character_size
reward += -2
elif agent_action == 6:
agent_x += character_move_speed
if agent_x + character_size >= disp_x/2 + arena_x/2:
agent_x = disp_x/2 + arena_x/2 - character_size
reward += -5
elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x - character_size
reward += -2
elif agent_action == 7:
agent_y += character_move_speed
if agent_y + character_size >= disp_y/2 + arena_y/2:
agent_y = disp_y/2 + arena_y/2 - character_size
reward += -5
elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y - character_size
reward += -2
elif agent_action == 8:
agent_x -= character_move_speed
if agent_x <= disp_x/2 - arena_x/2:
agent_x = disp_x/2 - arena_x/2
reward += -5
elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x + character_size
reward += -2
if agent_beam_fire == True:
if beam_hit_detector("agent"):
#print "Bot Got Hit!"
bot_hp -= beam_damage
reward += 50
agent_beam_size_x = agent_beam_size_y = 0
agent_beam_x = agent_beam_y = beam_ob
if bot_hp <= 0:
successful = True
cont = False
winner = "Agent"
return reward, cont, successful, winner
def bot_beam_dir_detector():
global bot_current_action
if bot_current_action == 1:
bot_beam_dir = 2
elif bot_current_action == 2:
bot_beam_dir = 4
elif bot_current_action == 3:
bot_beam_dir = 3
elif bot_current_action == 4:
bot_beam_dir = 1
else:
bot_beam_dir = 0
return bot_beam_dir
#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
sess.run(initialize)
success = 0
for i in tqdm(range(1, num_episodes)):
#print "Episode #", i
rAll = 0; d = False; c = True; j = 0
param_init()
samples = []
while c == True:
j += 1
current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
b = bot_take_action()
if np.random.rand(1) < e or i <= 5:
a = random.randint(0, 8)
else:
a, _ = sess.run([predict, Q],feed_dict=input_layer : current_state)
r, c, d, winner = action(a + 1, b)
bot_beam_dir = bot_beam_dir_detector()
next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
samples.append([current_state, a, r, next_state])
if len(samples) > 10:
for count in xrange(batch_size):
[batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
batch_allQ = sess.run(Q, feed_dict=input_layer : batch_current_state)
batch_Q1 = sess.run(Q, feed_dict = input_layer : batch_next_state)
batch_maxQ1 = np.max(batch_Q1)
batch_targetQ = batch_allQ
batch_targetQ[0][a] = reward + y * batch_maxQ1
sess.run([updateModel], feed_dict=input_layer : batch_current_state, next_Q : batch_targetQ)
rAll += r
screen_blit()
if d == True:
e = 1. / ((i / 50) + 10)
success += 1
break
#print agent_hp, bot_hp
display.update()
jList.append(j)
rList.append(rAll)
print winner
I'm pretty sure that if you have pygame and Tensorflow and matplotlib installed in a python environment you should be able to see the animations of the bot and the agent "fighting".
I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.
Thanks!
Update #2 on August 18, 2017:
Based on the advice of @NeilSlater, I've implemented experience replay into my model. The algorithm has improved, but I'm going to look for more better improvement options that offer convergence.
Update #3 on August 22, 2017:
I've noticed that if the agent hits the bot with a bullet on a turn and the action the bot taken on that turn was not "fire a bullet", then the wrong actions would be given credit. Thus, I've turned the bullets into beams so the bot/agent takes damage on the turn the beam's fired.
machine-learning python reinforcement-learning q-learning
asked Aug 11 '17 at 1:11 by IronEdward, edited Aug 22 '17 at 11:44
Are you using experience replay, and bootstrapping values from a "frozen" copy of a recent network? These are the approaches used in DQN; they are not guaranteed to work, but they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme are like? Single-step Q-learning will do poorly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.
– Neil Slater
Aug 11 '17 at 7:27
OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail and take a look at your project code to show an example, but it might take a day or two to get back to you with that level of detail.
– Neil Slater
Aug 11 '17 at 15:41
I have got the code running, and if I understand it correctly, the bullets can be "steered" by the agent selecting from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this, because it only fires when aligned on the grid with the agent, and always picks the same direction when it does so.
– Neil Slater
Aug 11 '17 at 20:05
Almost right, but you don't store the bootstrapped value; instead you re-calculate it when the step is sampled later. For each action taken, you store four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more, e.g. 10, is typical) from this list, and for Q-learning you calculate the new max action and its value to create the supervised-learning mini-batch (also called the TD target).
– Neil Slater
Aug 12 '17 at 16:42
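A minimal sketch of the replay step described in the comment above, in the same Python 2 / TensorFlow 1.x style as the question's code; the helper names (replay_buffer, store_transition, sample_and_train) are illustrative, and the sketch assumes the question's Q, input_layer, next_Q and updateModel graph nodes:
import random
import numpy as np

replay_buffer = []            # list of [state, action, reward, next_state]
max_buffer_size = 10000

def store_transition(s, a, r, s_next):
    # drop the oldest transition once the buffer is full
    if len(replay_buffer) >= max_buffer_size:
        replay_buffer.pop(0)
    replay_buffer.append([s, a, r, s_next])

def sample_and_train(sess, batch_size=10, gamma=0.75):
    # the bootstrapped value is re-calculated here, at sampling time, not at storage time
    for _ in range(batch_size):
        s, a, r, s_next = random.choice(replay_buffer)
        target_q = sess.run(Q, feed_dict={input_layer: s})                # current estimates
        max_q_next = np.max(sess.run(Q, feed_dict={input_layer: s_next}))
        target_q[0][a] = r + gamma * max_q_next                           # one-step TD target
        sess.run(updateModel, feed_dict={input_layer: s, next_Q: target_q})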
That should be "frozen copy of the approximator (i.e. the neural network)" (if the quote is from one of my comments or answers, please point me at it and I will correct it). It's very simple: just keep two copies of the weight params $\mathbf{w}$, the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S', a', \mathbf{w})$, use the "old" copy to calculate $\hat{q}$, but then train the "live" one with those values.
– Neil Slater
Aug 15 '17 at 7:13
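A minimal sketch of that frozen-copy ("target network") idea for the question's single-layer network; target_weight_1 and update_target are illustrative names, not part of the original code:
import tensorflow as tf

# a second, "frozen" copy of the question's weight matrix
target_weight_1 = tf.Variable(tf.random_uniform([7, 9], 0, 0.1), trainable=False)
target_Q = tf.matmul(input_layer, target_weight_1)

# op that copies the live weights into the frozen copy; run it every few hundred updates
update_target = tf.assign(target_weight_1, weight_1)

# when building the TD target, bootstrap from target_Q instead of Q, e.g.:
#     max_q_next = np.max(sess.run(target_Q, feed_dict={input_layer: s_next}))
#     target_q[0][a] = r + gamma * max_q_next
# ...and then train the live network (Q / weight_1) on that target as before.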
2 Answers
If your weights are diverging, then your optimizer or your gradients aren't behaving well. A common reason for diverging weights is exploding gradients, which can result from:
- too many layers, or
- too many recurrent cycles if you're using an RNN.
You can check whether you have exploding gradients as follows:
# overall L2 norm of the gradients; weights_list is the list of trainable
# variables (in the question's code that would be [weight_1])
grad_magnitude = tf.reduce_sum([tf.reduce_sum(g ** 2)
                                for g in tf.gradients(loss, weights_list)]) ** 0.5
Some approaches to solving the problem of exploding gradients are:
- Use ReLU or ELU activations.
- Use Xavier initialization.
- Use a deep residual architecture; this keeps the gradients from being squished by subsequent layers.
answered Mar 30 '18 at 20:17 by Default picture, edited Mar 30 '18 at 20:40 by Stephen Rauch♦
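As a rough illustration of the last two suggestions above, here is a sketch in the same TensorFlow 1.x style as the question's code, combining Xavier initialization with global-norm gradient clipping; the clip value of 5.0 is arbitrary, and the sketch assumes the question's loss is rebuilt on top of this weight variable:
import tensorflow as tf

# Xavier/Glorot initialization instead of tf.random_uniform
weight_1 = tf.get_variable("weight_1", shape=[7, 9],
                           initializer=tf.contrib.layers.xavier_initializer())

# gradient clipping: compute gradients, clip them by global norm, then apply
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
grads_and_vars = trainer.compute_gradients(loss)
grads, variables = zip(*grads_and_vars)
clipped_grads, global_norm = tf.clip_by_global_norm(grads, 5.0)
updateModel = trainer.apply_gradients(list(zip(clipped_grads, variables)))
# fetch global_norm each training step to watch for exploding gradients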
If you are using a fixed-point iteration to solve the Bellman equation, it might not only be degenerate but might also have attractors at infinity or orbits. Dig into the problem you are solving and understand it deeply. Have a look at control theory; RL folks tend not to write about this as much.
answered Apr 10 at 11:33 by mathtick
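A toy numerical illustration of that failure mode (not from the question's code): a fixed-point iteration only settles on a finite value when the update map is a contraction; otherwise the iterates run off to infinity, which is qualitatively what a diverging bootstrapped value estimate looks like:
# Iterating x <- a*x + b converges to b/(1-a) only when |a| < 1.
# With |a| > 1 the only "attractor" is at infinity.
def iterate(a, b, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = a * x + b
    return x

print(iterate(0.9, 1.0))   # heads towards the fixed point 10.0
print(iterate(1.1, 1.0))   # blows up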