
Why does Q Learning diverge?




My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.



I've tried:



  • Clipping the target "reward + discount * maximum value of action" (max/min set to 50/-50; see the formula written out below)

  • Setting a low learning rate (0.00001; I use classic backpropagation to update the weights)

  • Decreasing the magnitude of the rewards

  • Increasing the exploration rate

  • Normalizing the inputs to the range 1~100 (previously it was 0~1)

  • Changing the discount rate

  • Decreasing the number of layers in the neural network (just for validation)

I've heard that Q-learning is known to diverge when used with non-linear function approximation, but is there anything else I can try to stop the weights from diverging?
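
To make that first bullet explicit, this is the quantity I'm clipping - the standard one-step Q-learning target (with $r$ the reward, $\gamma$ the discount factor and $\mathbf{w}$ the network weights):

$$y = r + \gamma \max_{a'} Q(s', a'; \mathbf{w}), \qquad y \leftarrow \operatorname{clip}(y, -50, 50)$$

and the network is then trained to move $Q(s, a; \mathbf{w})$ towards the clipped $y$.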



Update #1 on August 14th, 2017:



I've decided to add some specific details on what I'm doing right now, in response to a request in the comments.



I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.



Each character has 9 actions to choose from on each turn:



  • move up

  • move down

  • move left

  • move right

  • shoot a bullet upwards

  • shoot a bullet downwards

  • shoot a bullet to the left

  • shoot a bullet to the right

  • do nothing

The rewards are:



  • if the agent hits the bot with a bullet, +100 (I've tried many different values)

  • if the agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)

  • if the agent tries to fire a bullet while bullets can't be fired (e.g. right after it has just fired one), -25 (not strictly necessary, but I wanted the agent to be more efficient)

  • if the bot tries to go out of the arena, -20 (also not strictly necessary, but I wanted the agent to be more efficient)


The inputs for the neural network are:



  • Distance between the agent and the bot on the X axis normalized to 0~100


  • Distance between the agent and the bot on the Y axis normalized to 0~100


  • Agent's x and y positions


  • Bot's x and y positions


  • Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.


I've also fiddled with the inputs; for example, I tried adding new features such as the agent's actual x position (not the distance, but the position itself) and the position of the bot's bullet. None of them worked.



Here's the code:



from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Set up character dimensions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
    """Initializes parameters"""
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

    agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
    bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
    agent_hp = bot_hp = character_init_health
    agent_beam_fire = bot_beam_fire = False
    agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
    agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
    global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, \
        agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, \
        agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

    disp.fill(aqua)
    draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y / 2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
    draw.rect(disp, green, (disp_x / 2 - arena_x / 2, disp_y / 2 - arena_y / 2, arena_x, arena_y))
    if bot_beam_fire == True:
        draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
        bot_beam_fire = False
    if agent_beam_fire == True:
        draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
        agent_beam_fire = False

    draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
    draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

    draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 + border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
    draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 + border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
    return random.randint(1, 9)


def beam_hit_detector(player):
    global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, \
        bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, \
        bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

    if player == "bot":
        if bot_current_action == 1:
            if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 2:
            if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
        elif bot_current_action == 3:
            if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
    else:
        if agent_current_action == 1:
            if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif agent_current_action == 2:
            if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False
        elif agent_current_action == 3:
            if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False


def mapping(maximum, number):
    return number  # int(number * maximum)

def action(agent_action, bot_action):
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \
        bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, \
        agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

    agent_current_action = agent_action; bot_current_action = bot_action
    reward = 0; cont = True; successful = False; winner = ""
    if 1 <= bot_action <= 4:
        bot_beam_fire = True
        if bot_action == 1:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
            bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
        elif bot_action == 2:
            bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
        elif bot_action == 3:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
            bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
        elif bot_action == 4:
            bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

    elif 5 <= bot_action <= 8:
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2:
                bot_y = disp_y/2 - arena_y/2
            elif agent_y <= bot_y <= agent_y + character_size:
                bot_y = agent_y + character_size
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_size:
                bot_x = disp_x/2 + arena_x/2 - character_size
            elif agent_x <= bot_x + character_size <= agent_x + character_size:
                bot_x = agent_x - character_size
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y + character_size >= disp_y/2 + arena_y/2:
                bot_y = disp_y/2 + arena_y/2 - character_size
            elif agent_y <= bot_y + character_size <= agent_y + character_size:
                bot_y = agent_y - character_size
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2:
                bot_x = disp_x/2 - arena_x/2
            elif agent_x <= bot_x <= agent_x + character_size:
                bot_x = agent_x + character_size

    if bot_beam_fire == True:
        if beam_hit_detector("bot"):
            #print "Agent Got Hit!"
            agent_hp -= beam_damage
            reward += -50
            bot_beam_size_x = bot_beam_size_y = 0
            bot_beam_x = bot_beam_y = beam_ob
            if agent_hp <= 0:
                cont = False
                winner = "Bot"

    if 1 <= agent_action <= 4:
        agent_beam_fire = True
        if agent_action == 1:
            if agent_y > disp_y/2 - arena_y/2:
                agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
                agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
            else:
                reward += -25
        elif agent_action == 2:
            if agent_x + character_size < disp_x/2 + arena_x/2:
                agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
            else:
                reward += -25
        elif agent_action == 3:
            if agent_y + character_size < disp_y/2 + arena_y/2:
                agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
                agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
            else:
                reward += -25
        elif agent_action == 4:
            if agent_x > disp_x/2 - arena_x/2:
                agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
            else:
                reward += -25

    elif 5 <= agent_action <= 8:
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y <= disp_y/2 - arena_y/2:
                agent_y = disp_y/2 - arena_y/2
                reward += -5
            elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y + character_size
                reward += -2
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x + character_size >= disp_x/2 + arena_x/2:
                agent_x = disp_x/2 + arena_x/2 - character_size
                reward += -5
            elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x - character_size
                reward += -2
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y + character_size >= disp_y/2 + arena_y/2:
                agent_y = disp_y/2 + arena_y/2 - character_size
                reward += -5
            elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y - character_size
                reward += -2
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x <= disp_x/2 - arena_x/2:
                agent_x = disp_x/2 - arena_x/2
                reward += -5
            elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x + character_size
                reward += -2
    if agent_beam_fire == True:
        if beam_hit_detector("agent"):
            #print "Bot Got Hit!"
            bot_hp -= beam_damage
            reward += 50
            agent_beam_size_x = agent_beam_size_y = 0
            agent_beam_x = agent_beam_y = beam_ob
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"
    return reward, cont, successful, winner

def bot_beam_dir_detector():
    global bot_current_action
    if bot_current_action == 1:
        bot_beam_dir = 2
    elif bot_current_action == 2:
        bot_beam_dir = 4
    elif bot_current_action == 3:
        bot_beam_dir = 3
    elif bot_current_action == 4:
        bot_beam_dir = 1
    else:
        bot_beam_dir = 0
    return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in tqdm(range(1, num_episodes)):
        #print "Episode #", i
        rAll = 0; d = False; c = True; j = 0
        param_init()
        samples = []
        while c == True:
            j += 1
            current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                       mapping(complexity, float(agent_y) / float(arena_y)),
                                       mapping(complexity, float(bot_x) / float(arena_x)),
                                       mapping(complexity, float(bot_y) / float(arena_y)),
                                       #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                       #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                       mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                       mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                       bot_beam_dir
                                       ]])
            b = bot_take_action()
            if np.random.rand(1) < e or i <= 5:
                a = random.randint(0, 8)
            else:
                a, _ = sess.run([predict, Q], feed_dict={input_layer: current_state})
            r, c, d, winner = action(a + 1, b)
            bot_beam_dir = bot_beam_dir_detector()
            next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                    mapping(complexity, float(agent_y) / float(arena_y)),
                                    mapping(complexity, float(bot_x) / float(arena_x)),
                                    mapping(complexity, float(bot_y) / float(arena_y)),
                                    #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                    #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                    mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                    mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                    bot_beam_dir
                                    ]])
            samples.append([current_state, a, r, next_state])
            if len(samples) > 10:
                for count in xrange(batch_size):
                    [batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
                    batch_allQ = sess.run(Q, feed_dict={input_layer: batch_current_state})
                    batch_Q1 = sess.run(Q, feed_dict={input_layer: batch_next_state})
                    batch_maxQ1 = np.max(batch_Q1)
                    batch_targetQ = batch_allQ
                    batch_targetQ[0][a] = reward + y * batch_maxQ1
                    sess.run([updateModel], feed_dict={input_layer: batch_current_state, next_Q: batch_targetQ})
            rAll += r
            screen_blit()
            if d == True:
                e = 1. / ((i / 50) + 10)
                success += 1
                break
            #print agent_hp, bot_hp
            display.update()

        jList.append(j)
        rList.append(rAll)
        print winner


I'm pretty sure that if you have pygame, TensorFlow, and matplotlib installed in a Python environment, you should be able to see the animations of the bot and the agent "fighting".



I digressed a bit in the update, but it would be awesome if somebody could also address my specific problem along with the original, more general one.



Thanks!



Update #2 on August 18, 2017:



Based on the advice of @NeilSlater, I've implemented experience replay in my model. The algorithm has improved, but I'm still looking for further improvements that actually give convergence.
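
For reference, a minimal sketch of the kind of replay memory described in the comments: store $(s, a, r, s')$ transitions and train on randomly sampled mini-batches. The class and variable names here are illustrative, not the exact code from this project.

import random
from collections import deque

import numpy as np


class ReplayMemory(object):
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=10000):
        # deque drops the oldest transitions automatically once full
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.transitions.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling (with replacement, as in the code above)
        # breaks the correlation between consecutive steps of an episode,
        # which is the main stabilising effect of experience replay.
        picks = [self.transitions[random.randint(0, len(self.transitions) - 1)]
                 for _ in range(batch_size)]
        states, actions, rewards, next_states = zip(*picks)
        return (np.vstack(states), np.array(actions),
                np.array(rewards), np.vstack(next_states))


# Illustrative usage inside the training loop:
# memory.add(current_state, a, r, next_state)
# if len(memory.transitions) > batch_size:
#     s_b, a_b, r_b, s2_b = memory.sample(batch_size)
#     # ...compute TD targets from s2_b and train on (s_b, a_b, targets)...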



Update #3 on August 22, 2017:



I've noticed that if the agent hits the bot with a bullet on a turn, and the action the bot took on that turn was not "fire a bullet", then the wrong actions would be given credit. So I've turned the bullets into beams, so that the bot/agent takes damage on the same turn the beam is fired.





















  • Are you using experience replay and bootstrapping values from a "frozen" copy of a recent network? These are approaches used in DQN - they are not guaranteed to work, though they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme are like? Single-step Q-learning will do poorly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.
    – Neil Slater
    Aug 11 '17 at 7:27







  • OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail, and take a look at your project code to show an example, but it might take a day or two to get back to you with that level of detail.
    – Neil Slater
    Aug 11 '17 at 15:41






  • I have got the code running, and if I understand it correctly, the bullets can be "steered" by the agent selecting from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this, because it only fires when aligned on the grid with the agent, and always picks the same direction when it does so.
    – Neil Slater
    Aug 11 '17 at 20:05






  • Almost right, but you don't store the bootstrapped value; instead, you re-calculate it when the step is sampled later. For each action taken, you store four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more, e.g. 10, is typical) from this list and, for Q-learning, calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target).
    – Neil Slater
    Aug 12 '17 at 16:42







  • That should be "frozen copy of the approximator (i.e. the neural network)" - if the quote is from one of my comments or answers, please point me at it and I will correct it. It's very simple - just keep two copies of the weight params $\mathbf{w}$: the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S', a', \mathbf{w})$, then use the "old" copy to calculate $\hat{q}$, but train the "live" one with those values.
    – Neil Slater
    Aug 15 '17 at 7:13
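
A minimal TF 1.x sketch of the "two copies of $\mathbf{w}$" idea described in the comment above, using the same single-layer network shape as the question. The variable names and the 300-step sync interval are illustrative assumptions, not part of the original code.

import numpy as np
import tensorflow as tf

# Same linear Q-network shape as in the question: 7 inputs -> 9 action values.
state_in = tf.placeholder(shape=[None, 7], dtype=tf.float32)

live_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1))                      # updated every step
frozen_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1), trainable=False)   # the "recent old" copy

live_Q = tf.matmul(state_in, live_w)      # used for action selection and for the loss
frozen_Q = tf.matmul(state_in, frozen_w)  # used only to bootstrap the TD target

sync_frozen = tf.assign(frozen_w, live_w)  # copy live -> frozen every few hundred updates

target_Q = tf.placeholder(shape=[None, 9], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(target_Q - live_Q))
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

# Sketch of the update, with gamma the discount factor:
# if update_count % 300 == 0:
#     sess.run(sync_frozen)
# q_next = sess.run(frozen_Q, feed_dict={state_in: batch_next_state})
# td_target = batch_reward + gamma * np.max(q_next, axis=1)  # R + gamma * max_a' q_hat(S', a', w_old)
# ...write td_target into the chosen action's slot and run train_step on the live network...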
















8












$begingroup$


My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.



I've tried:



  • Clipping the "reward + discount * maximum value of action" (max/min set to 50/-50)

  • Setting a low learning rate (0.00001 and I use the classic Backpropagation for updating the weights)

  • Decreasing the values of the rewards

  • Increasing the exploration rate

  • Normalizing the inputs to between 1~100 (previously it was 0~1)

  • Change the discount rate

  • Decrease the layers of the neural network (just for validation)

I've heard that Q Learning is known to diverge on non-linear input, but are there anything else that I can try to stop the divergence of the weights?



Update #1 on August 14th, 2017:



I've decided to add some specific details on what I'm doing right now due to a request to.



I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.



Each character has 9 actions to choose from on each turn:



  • move up

  • move down

  • move left

  • move right

  • shoot a bullet upwards

  • shoot a bullet downwards

  • shoot a bullet to the left

  • shoot a bullet to the right

  • do nothing

The rewards are:



  • if agent hits the bot with a bullet, +100 (I've tried many different values)

  • if agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)

  • if the agent tries to fire a bullet while bullets can't be fired(ex. when the agent just fired a bullet, etc. ), -25(Not necessary but I wanted the agent to be more efficient)


  • if the bot tries to go out of the arena, -20(Not necessary too but I wanted the agent to be more efficient)


The inputs for the neural network are:



  • Distance between the agent and the bot on the X axis normalized to 0~100


  • Distance between the agent and the bot on the Y axis normalized to 0~100


  • Agent's x and y positions


  • Bot's x and y positions


  • Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.


I've also fiddled with the inputs too; I tried adding new features like the x value of the agent's position(not the distance but the actual position)and the position of the bot's bullet. None of them worked.



Here's the code:



from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimentions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
"""Initializes parameters"""
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
agent_hp = bot_hp = character_init_health
agent_beam_fire = bot_beam_fire = False
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x,
agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue,
agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

disp.fill(aqua)
draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
disp_y / 2 - arena_y / 2, arena_x, arena_y))
if bot_beam_fire == True:
draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
bot_beam_fire = False
if agent_beam_fire == True:
draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
agent_beam_fire = False

draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
return random.randint(1, 9)

def beam_hit_detector(player):
global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x,
bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y,
bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

if player == "bot":
if bot_current_action == 1:
if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 2:
if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
elif bot_current_action == 3:
if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
else:
if agent_current_action == 1:
if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif agent_current_action == 2:
if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
elif agent_current_action == 3:
if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False


def mapping(maximum, number):
return number#int(number * maximum)

def action(agent_action, bot_action):
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire,
bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x,
agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

agent_current_action = agent_action; bot_current_action = bot_action
reward = 0; cont = True; successful = False; winner = ""
if 1 <= bot_action <= 4:
bot_beam_fire = True
if bot_action == 1:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
elif bot_action == 2:
bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
elif bot_action == 3:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
elif bot_action == 4:
bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

elif 5 <= bot_action <= 8:
if bot_action == 5:
bot_y -= character_move_speed
if bot_y <= disp_y/2 - arena_y/2:
bot_y = disp_y/2 - arena_y/2
elif agent_y <= bot_y <= agent_y + character_size:
bot_y = agent_y + character_size
elif bot_action == 6:
bot_x += character_move_speed
if bot_x >= disp_x/2 + arena_x/2 - character_size:
bot_x = disp_x/2 + arena_x/2 - character_size
elif agent_x <= bot_x + character_size <= agent_x + character_size:
bot_x = agent_x - character_size
elif bot_action == 7:
bot_y += character_move_speed
if bot_y + character_size >= disp_y/2 + arena_y/2:
bot_y = disp_y/2 + arena_y/2 - character_size
elif agent_y <= bot_y + character_size <= agent_y + character_size:
bot_y = agent_y - character_size
elif bot_action == 8:
bot_x -= character_move_speed
if bot_x <= disp_x/2 - arena_x/2:
bot_x = disp_x/2 - arena_x/2
elif agent_x <= bot_x <= agent_x + character_size:
bot_x = agent_x + character_size

if bot_beam_fire == True:
if beam_hit_detector("bot"):
#print "Agent Got Hit!"
agent_hp -= beam_damage
reward += -50
bot_beam_size_x = bot_beam_size_y = 0
bot_beam_x = bot_beam_y = beam_ob
if agent_hp <= 0:
cont = False
winner = "Bot"

if 1 <= agent_action <= 4:
agent_beam_fire = True
if agent_action == 1:
if agent_y > disp_y/2 - arena_y/2:
agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
else:
reward += -25
elif agent_action == 2:
if agent_x + character_size < disp_x/2 + arena_x/2:
agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
else:
reward += -25
elif agent_action == 3:
if agent_y + character_size < disp_y/2 + arena_y/2:
agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
else:
reward += -25
elif agent_action == 4:
if agent_x > disp_x/2 - arena_x/2:
agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
else:
reward += -25

elif 5 <= agent_action <= 8:
if agent_action == 5:
agent_y -= character_move_speed
if agent_y <= disp_y/2 - arena_y/2:
agent_y = disp_y/2 - arena_y/2
reward += -5
elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y + character_size
reward += -2
elif agent_action == 6:
agent_x += character_move_speed
if agent_x + character_size >= disp_x/2 + arena_x/2:
agent_x = disp_x/2 + arena_x/2 - character_size
reward += -5
elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x - character_size
reward += -2
elif agent_action == 7:
agent_y += character_move_speed
if agent_y + character_size >= disp_y/2 + arena_y/2:
agent_y = disp_y/2 + arena_y/2 - character_size
reward += -5
elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y - character_size
reward += -2
elif agent_action == 8:
agent_x -= character_move_speed
if agent_x <= disp_x/2 - arena_x/2:
agent_x = disp_x/2 - arena_x/2
reward += -5
elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x + character_size
reward += -2
if agent_beam_fire == True:
if beam_hit_detector("agent"):
#print "Bot Got Hit!"
bot_hp -= beam_damage
reward += 50
agent_beam_size_x = agent_beam_size_y = 0
agent_beam_x = agent_beam_y = beam_ob
if bot_hp <= 0:
successful = True
cont = False
winner = "Agent"
return reward, cont, successful, winner

def bot_beam_dir_detector():
global bot_current_action
if bot_current_action == 1:
bot_beam_dir = 2
elif bot_current_action == 2:
bot_beam_dir = 4
elif bot_current_action == 3:
bot_beam_dir = 3
elif bot_current_action == 4:
bot_beam_dir = 1
else:
bot_beam_dir = 0
return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
sess.run(initialize)
success = 0
for i in tqdm(range(1, num_episodes)):
#print "Episode #", i
rAll = 0; d = False; c = True; j = 0
param_init()
samples = []
while c == True:
j += 1
current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
b = bot_take_action()
if np.random.rand(1) < e or i <= 5:
a = random.randint(0, 8)
else:
a, _ = sess.run([predict, Q],feed_dict=input_layer : current_state)
r, c, d, winner = action(a + 1, b)
bot_beam_dir = bot_beam_dir_detector()
next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
samples.append([current_state, a, r, next_state])
if len(samples) > 10:
for count in xrange(batch_size):
[batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
batch_allQ = sess.run(Q, feed_dict=input_layer : batch_current_state)
batch_Q1 = sess.run(Q, feed_dict = input_layer : batch_next_state)
batch_maxQ1 = np.max(batch_Q1)
batch_targetQ = batch_allQ
batch_targetQ[0][a] = reward + y * batch_maxQ1
sess.run([updateModel], feed_dict=input_layer : batch_current_state, next_Q : batch_targetQ)
rAll += r
screen_blit()
if d == True:
e = 1. / ((i / 50) + 10)
success += 1
break
#print agent_hp, bot_hp
display.update()

jList.append(j)
rList.append(rAll)
print winner


I'm pretty sure that if you have pygame and Tensorflow and matplotlib installed in a python environment you should be able to see the animations of the bot and the agent "fighting".



I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.



Thanks!



Update #2 on August 18, 2017:



Based on the advice of @NeilSlater, I've implemented experience replay into my model. The algorithm has improved, but I'm going to look for more better improvement options that offer convergence.



Update #3 on August 22, 2017:



I've noticed that if the agent hits the bot with a bullet on a turn and the action the bot taken on that turn was not "fire a bullet", then the wrong actions would be given credit. Thus, I've turned the bullets into beams so the bot/agent takes damage on the turn the beam's fired.










share|improve this question











$endgroup$











  • $begingroup$
    Are you using experience replay and bootstrapping values from a "frozen" copy of recent network? These are approaches used in DQN - they are not guaranteed though they may be necessary for stability. Are you using a Q($lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme is like? Single-step Q-learning will do poorly when rewards are sparse e.g. final +1 or -1 reward at end of long episode.
    $endgroup$
    – Neil Slater
    Aug 11 '17 at 7:27







  • 1




    $begingroup$
    OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail, and take a look at your project code to show an example, but might take a day or two to get back to you with that level of detail,.
    $endgroup$
    – Neil Slater
    Aug 11 '17 at 15:41






  • 1




    $begingroup$
    I have got the code running and if I am correct in understanding it, the bullets can be "steered" by the agent selection from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this because it only fires when aligned on the grid to the agent, and always picks the same direction if it does so.
    $endgroup$
    – Neil Slater
    Aug 11 '17 at 20:05






  • 1




    $begingroup$
    Almost right, but you don't store the bootstrapped value, instead re-calculate it when the step is sampled later. For each action taken, you store the four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more e.g. 10 is typical) from this list and for Q-learning calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target).
    $endgroup$
    – Neil Slater
    Aug 12 '17 at 16:42







  • 1




    $begingroup$
    That should be "frozen copy of the approximator (i.e. the neural network" (if the quote is from one of my comments or answers, please point me at it and I will correct it. It's very simple - just keep two copies of the weight params $mathbfw$, the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target e.g. $R + gamma textmax_a' hatq(S',a',mathbfw)$ then use the "old" copy to calculate $hatq$, but then train the "live" one with those values.
    $endgroup$
    – Neil Slater
    Aug 15 '17 at 7:13














8












8








8


2



$begingroup$


My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.



I've tried:



  • Clipping the "reward + discount * maximum value of action" (max/min set to 50/-50)

  • Setting a low learning rate (0.00001 and I use the classic Backpropagation for updating the weights)

  • Decreasing the values of the rewards

  • Increasing the exploration rate

  • Normalizing the inputs to between 1~100 (previously it was 0~1)

  • Change the discount rate

  • Decrease the layers of the neural network (just for validation)

I've heard that Q Learning is known to diverge on non-linear input, but are there anything else that I can try to stop the divergence of the weights?



Update #1 on August 14th, 2017:



I've decided to add some specific details on what I'm doing right now due to a request to.



I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.



Each character has 9 actions to choose from on each turn:



  • move up

  • move down

  • move left

  • move right

  • shoot a bullet upwards

  • shoot a bullet downwards

  • shoot a bullet to the left

  • shoot a bullet to the right

  • do nothing

The rewards are:



  • if agent hits the bot with a bullet, +100 (I've tried many different values)

  • if agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)

  • if the agent tries to fire a bullet while bullets can't be fired(ex. when the agent just fired a bullet, etc. ), -25(Not necessary but I wanted the agent to be more efficient)


  • if the bot tries to go out of the arena, -20(Not necessary too but I wanted the agent to be more efficient)


The inputs for the neural network are:



  • Distance between the agent and the bot on the X axis normalized to 0~100


  • Distance between the agent and the bot on the Y axis normalized to 0~100


  • Agent's x and y positions


  • Bot's x and y positions


  • Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.


I've also fiddled with the inputs too; I tried adding new features like the x value of the agent's position(not the distance but the actual position)and the position of the bot's bullet. None of them worked.



Here's the code:



from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimentions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
"""Initializes parameters"""
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
agent_hp = bot_hp = character_init_health
agent_beam_fire = bot_beam_fire = False
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x,
agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue,
agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

disp.fill(aqua)
draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
disp_y / 2 - arena_y / 2, arena_x, arena_y))
if bot_beam_fire == True:
draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
bot_beam_fire = False
if agent_beam_fire == True:
draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
agent_beam_fire = False

draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
return random.randint(1, 9)

def beam_hit_detector(player):
global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x,
bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y,
bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

if player == "bot":
if bot_current_action == 1:
if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 2:
if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
elif bot_current_action == 3:
if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
else:
if agent_current_action == 1:
if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif agent_current_action == 2:
if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
elif agent_current_action == 3:
if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False


def mapping(maximum, number):
return number#int(number * maximum)

def action(agent_action, bot_action):
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire,
bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x,
agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

agent_current_action = agent_action; bot_current_action = bot_action
reward = 0; cont = True; successful = False; winner = ""
if 1 <= bot_action <= 4:
bot_beam_fire = True
if bot_action == 1:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
elif bot_action == 2:
bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
elif bot_action == 3:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
elif bot_action == 4:
bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

elif 5 <= bot_action <= 8:
if bot_action == 5:
bot_y -= character_move_speed
if bot_y <= disp_y/2 - arena_y/2:
bot_y = disp_y/2 - arena_y/2
elif agent_y <= bot_y <= agent_y + character_size:
bot_y = agent_y + character_size
elif bot_action == 6:
bot_x += character_move_speed
if bot_x >= disp_x/2 + arena_x/2 - character_size:
bot_x = disp_x/2 + arena_x/2 - character_size
elif agent_x <= bot_x + character_size <= agent_x + character_size:
bot_x = agent_x - character_size
elif bot_action == 7:
bot_y += character_move_speed
if bot_y + character_size >= disp_y/2 + arena_y/2:
bot_y = disp_y/2 + arena_y/2 - character_size
elif agent_y <= bot_y + character_size <= agent_y + character_size:
bot_y = agent_y - character_size
elif bot_action == 8:
bot_x -= character_move_speed
if bot_x <= disp_x/2 - arena_x/2:
bot_x = disp_x/2 - arena_x/2
elif agent_x <= bot_x <= agent_x + character_size:
bot_x = agent_x + character_size

if bot_beam_fire == True:
if beam_hit_detector("bot"):
#print "Agent Got Hit!"
agent_hp -= beam_damage
reward += -50
bot_beam_size_x = bot_beam_size_y = 0
bot_beam_x = bot_beam_y = beam_ob
if agent_hp <= 0:
cont = False
winner = "Bot"

if 1 <= agent_action <= 4:
agent_beam_fire = True
if agent_action == 1:
if agent_y > disp_y/2 - arena_y/2:
agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
else:
reward += -25
elif agent_action == 2:
if agent_x + character_size < disp_x/2 + arena_x/2:
agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
else:
reward += -25
elif agent_action == 3:
if agent_y + character_size < disp_y/2 + arena_y/2:
agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
else:
reward += -25
elif agent_action == 4:
if agent_x > disp_x/2 - arena_x/2:
agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
else:
reward += -25

elif 5 <= agent_action <= 8:
if agent_action == 5:
agent_y -= character_move_speed
if agent_y <= disp_y/2 - arena_y/2:
agent_y = disp_y/2 - arena_y/2
reward += -5
elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y + character_size
reward += -2
elif agent_action == 6:
agent_x += character_move_speed
if agent_x + character_size >= disp_x/2 + arena_x/2:
agent_x = disp_x/2 + arena_x/2 - character_size
reward += -5
elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x - character_size
reward += -2
elif agent_action == 7:
agent_y += character_move_speed
if agent_y + character_size >= disp_y/2 + arena_y/2:
agent_y = disp_y/2 + arena_y/2 - character_size
reward += -5
elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y - character_size
reward += -2
elif agent_action == 8:
agent_x -= character_move_speed
if agent_x <= disp_x/2 - arena_x/2:
agent_x = disp_x/2 - arena_x/2
reward += -5
elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x + character_size
reward += -2
if agent_beam_fire == True:
if beam_hit_detector("agent"):
#print "Bot Got Hit!"
bot_hp -= beam_damage
reward += 50
agent_beam_size_x = agent_beam_size_y = 0
agent_beam_x = agent_beam_y = beam_ob
if bot_hp <= 0:
successful = True
cont = False
winner = "Agent"
return reward, cont, successful, winner

def bot_beam_dir_detector():
global bot_current_action
if bot_current_action == 1:
bot_beam_dir = 2
elif bot_current_action == 2:
bot_beam_dir = 4
elif bot_current_action == 3:
bot_beam_dir = 3
elif bot_current_action == 4:
bot_beam_dir = 1
else:
bot_beam_dir = 0
return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
sess.run(initialize)
success = 0
for i in tqdm(range(1, num_episodes)):
#print "Episode #", i
rAll = 0; d = False; c = True; j = 0
param_init()
samples = []
while c == True:
j += 1
current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
b = bot_take_action()
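# Epsilon-greedy action selection: explore with probability e (and always
# during the first 5 episodes), otherwise act greedily with respect to the
# current Q estimates.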
if np.random.rand(1) < e or i <= 5:
a = random.randint(0, 8)
else:
a, _ = sess.run([predict, Q], feed_dict={input_layer: current_state})
r, c, d, winner = action(a + 1, b)
bot_beam_dir = bot_beam_dir_detector()
next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
samples.append([current_state, a, r, next_state])
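# Experience replay (see Update #2): once more than 10 transitions are stored,
# sample batch_size of them at random and train towards the Q-learning target
# r + y * max_a' Q(s', a') for each sampled transition.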
if len(samples) > 10:
for count in xrange(batch_size):
[batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
batch_allQ = sess.run(Q, feed_dict={input_layer: batch_current_state})
batch_Q1 = sess.run(Q, feed_dict={input_layer: batch_next_state})
batch_maxQ1 = np.max(batch_Q1)
batch_targetQ = batch_allQ
# Update the Q-value of the action from the sampled transition
batch_targetQ[0][action_taken] = reward + y * batch_maxQ1
sess.run([updateModel], feed_dict={input_layer: batch_current_state, next_Q: batch_targetQ})
rAll += r
screen_blit()
if d == True:
e = 1. / ((i / 50) + 10)
success += 1
break
#print agent_hp, bot_hp
display.update()

jList.append(j)
rList.append(rAll)
print winner


I'm pretty sure that if you have pygame, TensorFlow and matplotlib installed in a Python environment, you should be able to see the animations of the bot and the agent "fighting".



I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.



Thanks!



Update #2 on August 18, 2017:



Based on the advice of @NeilSlater, I've implemented experience replay in my model. The algorithm has improved, but I'm going to keep looking for further improvements that actually lead to convergence.
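
For reference, a minimal sketch of the kind of bounded replay buffer described in the comments, storing (state, action, reward, next_state) tuples and sampling small mini-batches; the capacity and function names here are illustrative, not taken from the code above:

import random
from collections import deque

# Bounded buffer: the oldest transitions are discarded once capacity is reached.
replay_buffer = deque(maxlen=10000)

def remember(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

def sample_batch(batch_size=10):
    # Sample without replacement so one mini-batch never repeats a transition.
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))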



Update #3 on August 22, 2017:



I've noticed that if the agent hits the bot with a bullet on a turn where the action the bot took was not "fire a bullet", then the wrong action would be given credit. Thus, I've turned the bullets into beams, so the bot/agent takes damage on the same turn the beam is fired.










machine-learning python reinforcement-learning q-learning






asked Aug 11 '17 at 1:11 by IronEdward, edited Aug 22 '17 at 11:44

  • Are you using experience replay and bootstrapping values from a "frozen" copy of a recent network? These are approaches used in DQN - they are not guaranteed, though they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme is like? Single-step Q-learning will do poorly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.
    – Neil Slater, Aug 11 '17 at 7:27

  • OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail, and take a look at your project code to show an example, but it might take a day or two to get back to you with that level of detail.
    – Neil Slater, Aug 11 '17 at 15:41

  • I have got the code running and, if I am correct in understanding it, the bullets can be "steered" by the agent selecting from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this because it only fires when aligned on the grid with the agent, and always picks the same direction if it does so.
    – Neil Slater, Aug 11 '17 at 20:05

  • Almost right, but you don't store the bootstrapped value; instead you re-calculate it when the step is sampled later. For each action taken, you store four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more, e.g. 10, is typical) from this list and, for Q-learning, calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target).
    – Neil Slater, Aug 12 '17 at 16:42

  • That should be "frozen copy of the approximator (i.e. the neural network)" (if the quote is from one of my comments or answers, please point me at it and I will correct it). It's very simple - just keep two copies of the weight params $\mathbf{w}$: the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S', a', \mathbf{w})$, use the "old" copy to calculate $\hat{q}$, but then train the "live" one with those values.
    – Neil Slater, Aug 15 '17 at 7:13
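
A minimal sketch of the "frozen copy" (target network) idea from the last comment, written in the same TF1 style as the question's code; the variable names are illustrative:

import tensorflow as tf

input_layer = tf.placeholder(shape=[1, 7], dtype=tf.float32)

# "Live" weights: updated on every training step.
live_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1), name="live_w")
live_Q = tf.matmul(input_layer, live_w)

# "Recent old" weights: only refreshed occasionally and used to compute the
# TD target R + gamma * max_a' q_hat(S', a', w_old).
target_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1), trainable=False, name="target_w")
target_Q = tf.matmul(input_layer, target_w)

# Run this op every few hundred training steps to copy live -> frozen.
copy_to_target = tf.assign(target_w, live_w)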


















2 Answers
If your weights are diverging then your optimizer or your gradients aren't behaving well. A common reason for diverging weights is exploding gradients, which can result from:



  1. too many layers, or

  2. too many recurrent cycles if you're using an RNN.

You can verify if you have exploding gradients as follows:



grad_magnitude = tf.reduce_sum([tf.reduce_sum(g**2)
for g in tf.gradients(loss, weights_list)])**0.5


Some approaches to solving the problem of exploding gradients are:



  • Use RELU or ELU activations

  • Use Xavier initialization

  • Use a Deep Residual architecture. This will keep the gradients from being squished by subsequent layers.
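
As a hedged illustration of the first two suggestions, here is what Xavier initialization and ELU activations could look like in the TF1 style of the question's code; the hidden-layer size of 32 is arbitrary:

import tensorflow as tf

input_layer = tf.placeholder(shape=[1, 7], dtype=tf.float32)

# Xavier/Glorot initialization keeps activation and gradient variance roughly
# constant across layers, which helps keep gradients from exploding.
w1 = tf.get_variable("w1", shape=[7, 32],
                     initializer=tf.contrib.layers.xavier_initializer())
w2 = tf.get_variable("w2", shape=[32, 9],
                     initializer=tf.contrib.layers.xavier_initializer())

# ELU is non-saturating for positive inputs and smooth for negative ones.
hidden = tf.nn.elu(tf.matmul(input_layer, w1))
Q = tf.matmul(hidden, w2)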





answered Mar 30 '18 at 20:17, edited Mar 30 '18 at 20:40 by Stephen Rauch

If you are using a fixed-point iteration to solve the Bellman equation, it might not only be degenerate but might also have attractors at infinity or orbits. Dig into the problem you are solving and understand it deeply. Have a look at control theory; RL folks tend not to write about this as much.
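
As a toy illustration of attractors at infinity (not reinforcement-learning specific): a fixed-point iteration x <- f(x) only converges when f is a contraction near the fixed point, and bootstrapped value updates with function approximation can behave like the expanding case:

def iterate(f, x0, steps=8):
    x = x0
    for _ in range(steps):
        x = f(x)
        print(x)

# Contraction (|slope| < 1): converges to the fixed point x = 2.
iterate(lambda x: 0.5 * x + 1.0, x0=0.0)

# Expansion (|slope| > 1): the fixed point x = -2 repels and the iterates
# run off towards infinity - an attractor at infinity.
iterate(lambda x: 1.5 * x + 1.0, x0=0.0)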






– mathtick, answered Apr 10 at 11:33












