
Why does Q Learning diverge?




My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.



I've tried:



  • Clipping the target "reward + discount * maximum value of action" (max/min set to 50/-50; see the formula written out below)

  • Setting a low learning rate (0.00001; I use classic backpropagation to update the weights)

  • Decreasing the magnitude of the rewards

  • Increasing the exploration rate

  • Normalizing the inputs to the range 1~100 (previously it was 0~1)

  • Changing the discount rate

  • Decreasing the number of layers in the neural network (just for validation)

I've heard that Q-learning is known to diverge when used with non-linear function approximation, but is there anything else I can try to stop the weights from diverging?
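
To make that first bullet explicit, this is the quantity I'm clipping - the standard one-step Q-learning target (with $r$ the reward, $\gamma$ the discount factor and $\mathbf{w}$ the network weights):

$$y = r + \gamma \max_{a'} Q(s', a'; \mathbf{w}), \qquad y \leftarrow \operatorname{clip}(y, -50, 50)$$

and the network is then trained to move $Q(s, a; \mathbf{w})$ towards the clipped $y$.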



Update #1 on August 14th, 2017:



I've decided to add some specific details on what I'm doing right now, in response to a request in the comments.



I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.



Each character has 9 actions to choose from on each turn:



  • move up

  • move down

  • move left

  • move right

  • shoot a bullet upwards

  • shoot a bullet downwards

  • shoot a bullet to the left

  • shoot a bullet to the right

  • do nothing

The rewards are:



  • if the agent hits the bot with a bullet, +100 (I've tried many different values)

  • if the agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)

  • if the agent tries to fire a bullet while bullets can't be fired (e.g. right after it has just fired one), -25 (not strictly necessary, but I wanted the agent to be more efficient)

  • if the bot tries to go out of the arena, -20 (also not strictly necessary, but I wanted the agent to be more efficient)


The inputs for the neural network are:



  • Distance between the agent and the bot on the X axis normalized to 0~100


  • Distance between the agent and the bot on the Y axis normalized to 0~100


  • Agent's x and y positions


  • Bot's x and y positions


  • Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.


I've also fiddled with the inputs; for example, I tried adding new features such as the agent's actual x position (not the distance, but the position itself) and the position of the bot's bullet. None of them worked.



Here's the code:



from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Set up character dimensions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
    """Initializes parameters"""
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

    agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
    bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
    agent_hp = bot_hp = character_init_health
    agent_beam_fire = bot_beam_fire = False
    agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
    agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
    global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, \
        agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, \
        agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

    disp.fill(aqua)
    draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y / 2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
    draw.rect(disp, green, (disp_x / 2 - arena_x / 2, disp_y / 2 - arena_y / 2, arena_x, arena_y))
    if bot_beam_fire == True:
        draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
        bot_beam_fire = False
    if agent_beam_fire == True:
        draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
        agent_beam_fire = False

    draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
    draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

    draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 + border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
    draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 + border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
    return random.randint(1, 9)


def beam_hit_detector(player):
    global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, \
        bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, \
        bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

    if player == "bot":
        if bot_current_action == 1:
            if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 2:
            if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
        elif bot_current_action == 3:
            if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
    else:
        if agent_current_action == 1:
            if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif agent_current_action == 2:
            if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False
        elif agent_current_action == 3:
            if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False


def mapping(maximum, number):
    return number  # int(number * maximum)

def action(agent_action, bot_action):
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \
        bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, \
        agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

    agent_current_action = agent_action; bot_current_action = bot_action
    reward = 0; cont = True; successful = False; winner = ""
    if 1 <= bot_action <= 4:
        bot_beam_fire = True
        if bot_action == 1:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
            bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
        elif bot_action == 2:
            bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
        elif bot_action == 3:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
            bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
        elif bot_action == 4:
            bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

    elif 5 <= bot_action <= 8:
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2:
                bot_y = disp_y/2 - arena_y/2
            elif agent_y <= bot_y <= agent_y + character_size:
                bot_y = agent_y + character_size
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_size:
                bot_x = disp_x/2 + arena_x/2 - character_size
            elif agent_x <= bot_x + character_size <= agent_x + character_size:
                bot_x = agent_x - character_size
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y + character_size >= disp_y/2 + arena_y/2:
                bot_y = disp_y/2 + arena_y/2 - character_size
            elif agent_y <= bot_y + character_size <= agent_y + character_size:
                bot_y = agent_y - character_size
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2:
                bot_x = disp_x/2 - arena_x/2
            elif agent_x <= bot_x <= agent_x + character_size:
                bot_x = agent_x + character_size

    if bot_beam_fire == True:
        if beam_hit_detector("bot"):
            #print "Agent Got Hit!"
            agent_hp -= beam_damage
            reward += -50
            bot_beam_size_x = bot_beam_size_y = 0
            bot_beam_x = bot_beam_y = beam_ob
            if agent_hp <= 0:
                cont = False
                winner = "Bot"

    if 1 <= agent_action <= 4:
        agent_beam_fire = True
        if agent_action == 1:
            if agent_y > disp_y/2 - arena_y/2:
                agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
                agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
            else:
                reward += -25
        elif agent_action == 2:
            if agent_x + character_size < disp_x/2 + arena_x/2:
                agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
            else:
                reward += -25
        elif agent_action == 3:
            if agent_y + character_size < disp_y/2 + arena_y/2:
                agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
                agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
            else:
                reward += -25
        elif agent_action == 4:
            if agent_x > disp_x/2 - arena_x/2:
                agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
            else:
                reward += -25

    elif 5 <= agent_action <= 8:
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y <= disp_y/2 - arena_y/2:
                agent_y = disp_y/2 - arena_y/2
                reward += -5
            elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y + character_size
                reward += -2
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x + character_size >= disp_x/2 + arena_x/2:
                agent_x = disp_x/2 + arena_x/2 - character_size
                reward += -5
            elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x - character_size
                reward += -2
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y + character_size >= disp_y/2 + arena_y/2:
                agent_y = disp_y/2 + arena_y/2 - character_size
                reward += -5
            elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y - character_size
                reward += -2
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x <= disp_x/2 - arena_x/2:
                agent_x = disp_x/2 - arena_x/2
                reward += -5
            elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x + character_size
                reward += -2
    if agent_beam_fire == True:
        if beam_hit_detector("agent"):
            #print "Bot Got Hit!"
            bot_hp -= beam_damage
            reward += 50
            agent_beam_size_x = agent_beam_size_y = 0
            agent_beam_x = agent_beam_y = beam_ob
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"
    return reward, cont, successful, winner

def bot_beam_dir_detector():
    global bot_current_action
    if bot_current_action == 1:
        bot_beam_dir = 2
    elif bot_current_action == 2:
        bot_beam_dir = 4
    elif bot_current_action == 3:
        bot_beam_dir = 3
    elif bot_current_action == 4:
        bot_beam_dir = 1
    else:
        bot_beam_dir = 0
    return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in tqdm(range(1, num_episodes)):
        #print "Episode #", i
        rAll = 0; d = False; c = True; j = 0
        param_init()
        samples = []
        while c == True:
            j += 1
            current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                       mapping(complexity, float(agent_y) / float(arena_y)),
                                       mapping(complexity, float(bot_x) / float(arena_x)),
                                       mapping(complexity, float(bot_y) / float(arena_y)),
                                       #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                       #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                       mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                       mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                       bot_beam_dir
                                       ]])
            b = bot_take_action()
            if np.random.rand(1) < e or i <= 5:
                a = random.randint(0, 8)
            else:
                a, _ = sess.run([predict, Q], feed_dict={input_layer: current_state})
            r, c, d, winner = action(a + 1, b)
            bot_beam_dir = bot_beam_dir_detector()
            next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                    mapping(complexity, float(agent_y) / float(arena_y)),
                                    mapping(complexity, float(bot_x) / float(arena_x)),
                                    mapping(complexity, float(bot_y) / float(arena_y)),
                                    #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                    #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                    mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                    mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                    bot_beam_dir
                                    ]])
            samples.append([current_state, a, r, next_state])
            if len(samples) > 10:
                for count in xrange(batch_size):
                    [batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
                    batch_allQ = sess.run(Q, feed_dict={input_layer: batch_current_state})
                    batch_Q1 = sess.run(Q, feed_dict={input_layer: batch_next_state})
                    batch_maxQ1 = np.max(batch_Q1)
                    batch_targetQ = batch_allQ
                    batch_targetQ[0][a] = reward + y * batch_maxQ1
                    sess.run([updateModel], feed_dict={input_layer: batch_current_state, next_Q: batch_targetQ})
            rAll += r
            screen_blit()
            if d == True:
                e = 1. / ((i / 50) + 10)
                success += 1
                break
            #print agent_hp, bot_hp
            display.update()

        jList.append(j)
        rList.append(rAll)
        print winner


I'm pretty sure that if you have pygame, TensorFlow, and matplotlib installed in a Python environment, you should be able to see the animations of the bot and the agent "fighting".



I digressed a bit in the update, but it would be awesome if somebody could also address my specific problem along with the original, more general one.



Thanks!



Update #2 on August 18, 2017:



Based on the advice of @NeilSlater, I've implemented experience replay in my model. The algorithm has improved, but I'm still looking for further improvements that actually give convergence.
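
For reference, a minimal sketch of the kind of replay memory described in the comments: store $(s, a, r, s')$ transitions and train on randomly sampled mini-batches. The class and variable names here are illustrative, not the exact code from this project.

import random
from collections import deque

import numpy as np


class ReplayMemory(object):
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=10000):
        # deque drops the oldest transitions automatically once full
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.transitions.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling (with replacement, as in the code above)
        # breaks the correlation between consecutive steps of an episode,
        # which is the main stabilising effect of experience replay.
        picks = [self.transitions[random.randint(0, len(self.transitions) - 1)]
                 for _ in range(batch_size)]
        states, actions, rewards, next_states = zip(*picks)
        return (np.vstack(states), np.array(actions),
                np.array(rewards), np.vstack(next_states))


# Illustrative usage inside the training loop:
# memory.add(current_state, a, r, next_state)
# if len(memory.transitions) > batch_size:
#     s_b, a_b, r_b, s2_b = memory.sample(batch_size)
#     # ...compute TD targets from s2_b and train on (s_b, a_b, targets)...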



Update #3 on August 22, 2017:



I've noticed that if the agent hits the bot with a bullet on a turn, and the action the bot took on that turn was not "fire a bullet", then the wrong actions would be given credit. So I've turned the bullets into beams, so that the bot/agent takes damage on the same turn the beam is fired.





















  • Are you using experience replay and bootstrapping values from a "frozen" copy of a recent network? These are approaches used in DQN - they are not guaranteed to work, though they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme are like? Single-step Q-learning will do poorly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.
    – Neil Slater
    Aug 11 '17 at 7:27







  • OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail, and take a look at your project code to show an example, but it might take a day or two to get back to you with that level of detail.
    – Neil Slater
    Aug 11 '17 at 15:41






  • I have got the code running, and if I understand it correctly, the bullets can be "steered" by the agent selecting from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this, because it only fires when aligned on the grid with the agent, and always picks the same direction when it does so.
    – Neil Slater
    Aug 11 '17 at 20:05






  • Almost right, but you don't store the bootstrapped value; instead, you re-calculate it when the step is sampled later. For each action taken, you store four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more, e.g. 10, is typical) from this list and, for Q-learning, calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target).
    – Neil Slater
    Aug 12 '17 at 16:42







  • That should be "frozen copy of the approximator (i.e. the neural network)" - if the quote is from one of my comments or answers, please point me at it and I will correct it. It's very simple - just keep two copies of the weight params $\mathbf{w}$: the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S', a', \mathbf{w})$, then use the "old" copy to calculate $\hat{q}$, but train the "live" one with those values.
    – Neil Slater
    Aug 15 '17 at 7:13
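
A minimal TF 1.x sketch of the "two copies of $\mathbf{w}$" idea described in the comment above, using the same single-layer network shape as the question. The variable names and the 300-step sync interval are illustrative assumptions, not part of the original code.

import numpy as np
import tensorflow as tf

# Same linear Q-network shape as in the question: 7 inputs -> 9 action values.
state_in = tf.placeholder(shape=[None, 7], dtype=tf.float32)

live_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1))                      # updated every step
frozen_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1), trainable=False)   # the "recent old" copy

live_Q = tf.matmul(state_in, live_w)      # used for action selection and for the loss
frozen_Q = tf.matmul(state_in, frozen_w)  # used only to bootstrap the TD target

sync_frozen = tf.assign(frozen_w, live_w)  # copy live -> frozen every few hundred updates

target_Q = tf.placeholder(shape=[None, 9], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(target_Q - live_Q))
train_step = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

# Sketch of the update, with gamma the discount factor:
# if update_count % 300 == 0:
#     sess.run(sync_frozen)
# q_next = sess.run(frozen_Q, feed_dict={state_in: batch_next_state})
# td_target = batch_reward + gamma * np.max(q_next, axis=1)  # R + gamma * max_a' q_hat(S', a', w_old)
# ...write td_target into the chosen action's slot and run train_step on the live network...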
















8












$begingroup$


My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.



I've tried:



  • Clipping the "reward + discount * maximum value of action" (max/min set to 50/-50)

  • Setting a low learning rate (0.00001 and I use the classic Backpropagation for updating the weights)

  • Decreasing the values of the rewards

  • Increasing the exploration rate

  • Normalizing the inputs to between 1~100 (previously it was 0~1)

  • Change the discount rate

  • Decrease the layers of the neural network (just for validation)

I've heard that Q Learning is known to diverge on non-linear input, but are there anything else that I can try to stop the divergence of the weights?



Update #1 on August 14th, 2017:



I've decided to add some specific details on what I'm doing right now due to a request to.



I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.



Each character has 9 actions to choose from on each turn:



  • move up

  • move down

  • move left

  • move right

  • shoot a bullet upwards

  • shoot a bullet downwards

  • shoot a bullet to the left

  • shoot a bullet to the right

  • do nothing

The rewards are:



  • if agent hits the bot with a bullet, +100 (I've tried many different values)

  • if agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)

  • if the agent tries to fire a bullet while bullets can't be fired(ex. when the agent just fired a bullet, etc. ), -25(Not necessary but I wanted the agent to be more efficient)


  • if the bot tries to go out of the arena, -20(Not necessary too but I wanted the agent to be more efficient)


The inputs for the neural network are:



  • Distance between the agent and the bot on the X axis normalized to 0~100


  • Distance between the agent and the bot on the Y axis normalized to 0~100


  • Agent's x and y positions


  • Bot's x and y positions


  • Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.


I've also fiddled with the inputs too; I tried adding new features like the x value of the agent's position(not the distance but the actual position)and the position of the bot's bullet. None of them worked.



Here's the code:



from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimentions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
"""Initializes parameters"""
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
agent_hp = bot_hp = character_init_health
agent_beam_fire = bot_beam_fire = False
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x,
agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue,
agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

disp.fill(aqua)
draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
disp_y / 2 - arena_y / 2, arena_x, arena_y))
if bot_beam_fire == True:
draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
bot_beam_fire = False
if agent_beam_fire == True:
draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
agent_beam_fire = False

draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
return random.randint(1, 9)

def beam_hit_detector(player):
global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x,
bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y,
bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

if player == "bot":
if bot_current_action == 1:
if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 2:
if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
elif bot_current_action == 3:
if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
else:
if agent_current_action == 1:
if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif agent_current_action == 2:
if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
elif agent_current_action == 3:
if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False


def mapping(maximum, number):
return number#int(number * maximum)

def action(agent_action, bot_action):
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire,
bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x,
agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

agent_current_action = agent_action; bot_current_action = bot_action
reward = 0; cont = True; successful = False; winner = ""
if 1 <= bot_action <= 4:
bot_beam_fire = True
if bot_action == 1:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
elif bot_action == 2:
bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
elif bot_action == 3:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
elif bot_action == 4:
bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

elif 5 <= bot_action <= 8:
if bot_action == 5:
bot_y -= character_move_speed
if bot_y <= disp_y/2 - arena_y/2:
bot_y = disp_y/2 - arena_y/2
elif agent_y <= bot_y <= agent_y + character_size:
bot_y = agent_y + character_size
elif bot_action == 6:
bot_x += character_move_speed
if bot_x >= disp_x/2 + arena_x/2 - character_size:
bot_x = disp_x/2 + arena_x/2 - character_size
elif agent_x <= bot_x + character_size <= agent_x + character_size:
bot_x = agent_x - character_size
elif bot_action == 7:
bot_y += character_move_speed
if bot_y + character_size >= disp_y/2 + arena_y/2:
bot_y = disp_y/2 + arena_y/2 - character_size
elif agent_y <= bot_y + character_size <= agent_y + character_size:
bot_y = agent_y - character_size
elif bot_action == 8:
bot_x -= character_move_speed
if bot_x <= disp_x/2 - arena_x/2:
bot_x = disp_x/2 - arena_x/2
elif agent_x <= bot_x <= agent_x + character_size:
bot_x = agent_x + character_size

if bot_beam_fire == True:
if beam_hit_detector("bot"):
#print "Agent Got Hit!"
agent_hp -= beam_damage
reward += -50
bot_beam_size_x = bot_beam_size_y = 0
bot_beam_x = bot_beam_y = beam_ob
if agent_hp <= 0:
cont = False
winner = "Bot"

if 1 <= agent_action <= 4:
agent_beam_fire = True
if agent_action == 1:
if agent_y > disp_y/2 - arena_y/2:
agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
else:
reward += -25
elif agent_action == 2:
if agent_x + character_size < disp_x/2 + arena_x/2:
agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
else:
reward += -25
elif agent_action == 3:
if agent_y + character_size < disp_y/2 + arena_y/2:
agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
else:
reward += -25
elif agent_action == 4:
if agent_x > disp_x/2 - arena_x/2:
agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
else:
reward += -25

elif 5 <= agent_action <= 8:
if agent_action == 5:
agent_y -= character_move_speed
if agent_y <= disp_y/2 - arena_y/2:
agent_y = disp_y/2 - arena_y/2
reward += -5
elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y + character_size
reward += -2
elif agent_action == 6:
agent_x += character_move_speed
if agent_x + character_size >= disp_x/2 + arena_x/2:
agent_x = disp_x/2 + arena_x/2 - character_size
reward += -5
elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x - character_size
reward += -2
elif agent_action == 7:
agent_y += character_move_speed
if agent_y + character_size >= disp_y/2 + arena_y/2:
agent_y = disp_y/2 + arena_y/2 - character_size
reward += -5
elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y - character_size
reward += -2
elif agent_action == 8:
agent_x -= character_move_speed
if agent_x <= disp_x/2 - arena_x/2:
agent_x = disp_x/2 - arena_x/2
reward += -5
elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x + character_size
reward += -2
if agent_beam_fire == True:
if beam_hit_detector("agent"):
#print "Bot Got Hit!"
bot_hp -= beam_damage
reward += 50
agent_beam_size_x = agent_beam_size_y = 0
agent_beam_x = agent_beam_y = beam_ob
if bot_hp <= 0:
successful = True
cont = False
winner = "Agent"
return reward, cont, successful, winner

def bot_beam_dir_detector():
global bot_current_action
if bot_current_action == 1:
bot_beam_dir = 2
elif bot_current_action == 2:
bot_beam_dir = 4
elif bot_current_action == 3:
bot_beam_dir = 3
elif bot_current_action == 4:
bot_beam_dir = 1
else:
bot_beam_dir = 0
return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
sess.run(initialize)
success = 0
for i in tqdm(range(1, num_episodes)):
#print "Episode #", i
rAll = 0; d = False; c = True; j = 0
param_init()
samples = []
while c == True:
j += 1
current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
b = bot_take_action()
if np.random.rand(1) < e or i <= 5:
a = random.randint(0, 8)
else:
a, _ = sess.run([predict, Q],feed_dict=input_layer : current_state)
r, c, d, winner = action(a + 1, b)
bot_beam_dir = bot_beam_dir_detector()
next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
samples.append([current_state, a, r, next_state])
if len(samples) > 10:
for count in xrange(batch_size):
[batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
batch_allQ = sess.run(Q, feed_dict=input_layer : batch_current_state)
batch_Q1 = sess.run(Q, feed_dict = input_layer : batch_next_state)
batch_maxQ1 = np.max(batch_Q1)
batch_targetQ = batch_allQ
batch_targetQ[0][a] = reward + y * batch_maxQ1
sess.run([updateModel], feed_dict=input_layer : batch_current_state, next_Q : batch_targetQ)
rAll += r
screen_blit()
if d == True:
e = 1. / ((i / 50) + 10)
success += 1
break
#print agent_hp, bot_hp
display.update()

jList.append(j)
rList.append(rAll)
print winner


I'm pretty sure that if you have pygame and Tensorflow and matplotlib installed in a python environment you should be able to see the animations of the bot and the agent "fighting".



I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.



Thanks!



Update #2 on August 18, 2017:



Based on the advice of @NeilSlater, I've implemented experience replay into my model. The algorithm has improved, but I'm going to look for more better improvement options that offer convergence.



Update #3 on August 22, 2017:



I've noticed that if the agent hits the bot with a bullet on a turn and the action the bot taken on that turn was not "fire a bullet", then the wrong actions would be given credit. Thus, I've turned the bullets into beams so the bot/agent takes damage on the turn the beam's fired.










share|improve this question











$endgroup$











  • $begingroup$
    Are you using experience replay and bootstrapping values from a "frozen" copy of recent network? These are approaches used in DQN - they are not guaranteed though they may be necessary for stability. Are you using a Q($lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme is like? Single-step Q-learning will do poorly when rewards are sparse e.g. final +1 or -1 reward at end of long episode.
    $endgroup$
    – Neil Slater
    Aug 11 '17 at 7:27







  • 1




    $begingroup$
    OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail, and take a look at your project code to show an example, but might take a day or two to get back to you with that level of detail,.
    $endgroup$
    – Neil Slater
    Aug 11 '17 at 15:41






  • 1




    $begingroup$
    I have got the code running and if I am correct in understanding it, the bullets can be "steered" by the agent selection from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this because it only fires when aligned on the grid to the agent, and always picks the same direction if it does so.
    $endgroup$
    – Neil Slater
    Aug 11 '17 at 20:05






  • 1




    $begingroup$
    Almost right, but you don't store the bootstrapped value, instead re-calculate it when the step is sampled later. For each action taken, you store the four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more e.g. 10 is typical) from this list and for Q-learning calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target).
    $endgroup$
    – Neil Slater
    Aug 12 '17 at 16:42







  • 1




    $begingroup$
    That should be "frozen copy of the approximator (i.e. the neural network" (if the quote is from one of my comments or answers, please point me at it and I will correct it. It's very simple - just keep two copies of the weight params $mathbfw$, the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target e.g. $R + gamma textmax_a' hatq(S',a',mathbfw)$ then use the "old" copy to calculate $hatq$, but then train the "live" one with those values.
    $endgroup$
    – Neil Slater
    Aug 15 '17 at 7:13














8












8








8


2



$begingroup$


My Q-Learning algorithm's state values keep on diverging to infinity, which means my weights are diverging too. I use a neural network for my value-mapping.



I've tried:



  • Clipping the "reward + discount * maximum value of action" (max/min set to 50/-50)

  • Setting a low learning rate (0.00001 and I use the classic Backpropagation for updating the weights)

  • Decreasing the values of the rewards

  • Increasing the exploration rate

  • Normalizing the inputs to between 1~100 (previously it was 0~1)

  • Change the discount rate

  • Decrease the layers of the neural network (just for validation)

I've heard that Q Learning is known to diverge on non-linear input, but are there anything else that I can try to stop the divergence of the weights?



Update #1 on August 14th, 2017:



I've decided to add some specific details on what I'm doing right now due to a request to.



I'm currently trying to make an agent learn how to fight in a top-down view of a shooting game. The opponent is a simple bot which moves stochastically.



Each character has 9 actions to choose from on each turn:



  • move up

  • move down

  • move left

  • move right

  • shoot a bullet upwards

  • shoot a bullet downwards

  • shoot a bullet to the left

  • shoot a bullet to the right

  • do nothing

The rewards are:



  • if agent hits the bot with a bullet, +100 (I've tried many different values)

  • if agent gets hit by a bullet shot by the bot, -50 (again, I've tried many different values)

  • if the agent tries to fire a bullet while bullets can't be fired(ex. when the agent just fired a bullet, etc. ), -25(Not necessary but I wanted the agent to be more efficient)


  • if the bot tries to go out of the arena, -20(Not necessary too but I wanted the agent to be more efficient)


The inputs for the neural network are:



  • Distance between the agent and the bot on the X axis normalized to 0~100


  • Distance between the agent and the bot on the Y axis normalized to 0~100


  • Agent's x and y positions


  • Bot's x and y positions


  • Bot's bullet position. If the bot didn't fire a bullet, the parameters are set to the x and y positions of the bot.


I've also fiddled with the inputs too; I tried adding new features like the x value of the agent's position(not the distance but the actual position)and the position of the bot's bullet. None of them worked.



Here's the code:



from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimentions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
"""Initializes parameters"""
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
agent_hp = bot_hp = character_init_health
agent_beam_fire = bot_beam_fire = False
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x,
agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue,
agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

disp.fill(aqua)
draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
disp_y / 2 - arena_y / 2, arena_x, arena_y))
if bot_beam_fire == True:
draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
bot_beam_fire = False
if agent_beam_fire == True:
draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
agent_beam_fire = False

draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
return random.randint(1, 9)

def beam_hit_detector(player):
global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x,
bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y,
bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

if player == "bot":
if bot_current_action == 1:
if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 2:
if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
elif bot_current_action == 3:
if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
else:
if agent_current_action == 1:
if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif agent_current_action == 2:
if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
elif agent_current_action == 3:
if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False


def mapping(maximum, number):
return number#int(number * maximum)

def action(agent_action, bot_action):
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire,
bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x,
agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

agent_current_action = agent_action; bot_current_action = bot_action
reward = 0; cont = True; successful = False; winner = ""
if 1 <= bot_action <= 4:
bot_beam_fire = True
if bot_action == 1:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
elif bot_action == 2:
bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
elif bot_action == 3:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
elif bot_action == 4:
bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

elif 5 <= bot_action <= 8:
if bot_action == 5:
bot_y -= character_move_speed
if bot_y <= disp_y/2 - arena_y/2:
bot_y = disp_y/2 - arena_y/2
elif agent_y <= bot_y <= agent_y + character_size:
bot_y = agent_y + character_size
elif bot_action == 6:
bot_x += character_move_speed
if bot_x >= disp_x/2 + arena_x/2 - character_size:
bot_x = disp_x/2 + arena_x/2 - character_size
elif agent_x <= bot_x + character_size <= agent_x + character_size:
bot_x = agent_x - character_size
elif bot_action == 7:
bot_y += character_move_speed
if bot_y + character_size >= disp_y/2 + arena_y/2:
bot_y = disp_y/2 + arena_y/2 - character_size
elif agent_y <= bot_y + character_size <= agent_y + character_size:
bot_y = agent_y - character_size
elif bot_action == 8:
bot_x -= character_move_speed
if bot_x <= disp_x/2 - arena_x/2:
bot_x = disp_x/2 - arena_x/2
elif agent_x <= bot_x <= agent_x + character_size:
bot_x = agent_x + character_size

if bot_beam_fire == True:
if beam_hit_detector("bot"):
#print "Agent Got Hit!"
agent_hp -= beam_damage
reward += -50
bot_beam_size_x = bot_beam_size_y = 0
bot_beam_x = bot_beam_y = beam_ob
if agent_hp <= 0:
cont = False
winner = "Bot"

if 1 <= agent_action <= 4:
agent_beam_fire = True
if agent_action == 1:
if agent_y > disp_y/2 - arena_y/2:
agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
else:
reward += -25
elif agent_action == 2:
if agent_x + character_size < disp_x/2 + arena_x/2:
agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
else:
reward += -25
elif agent_action == 3:
if agent_y + character_size < disp_y/2 + arena_y/2:
agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
else:
reward += -25
elif agent_action == 4:
if agent_x > disp_x/2 - arena_x/2:
agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
else:
reward += -25

elif 5 <= agent_action <= 8:
if agent_action == 5:
agent_y -= character_move_speed
if agent_y <= disp_y/2 - arena_y/2:
agent_y = disp_y/2 - arena_y/2
reward += -5
elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y + character_size
reward += -2
elif agent_action == 6:
agent_x += character_move_speed
if agent_x + character_size >= disp_x/2 + arena_x/2:
agent_x = disp_x/2 + arena_x/2 - character_size
reward += -5
elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x - character_size
reward += -2
elif agent_action == 7:
agent_y += character_move_speed
if agent_y + character_size >= disp_y/2 + arena_y/2:
agent_y = disp_y/2 + arena_y/2 - character_size
reward += -5
elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y - character_size
reward += -2
elif agent_action == 8:
agent_x -= character_move_speed
if agent_x <= disp_x/2 - arena_x/2:
agent_x = disp_x/2 - arena_x/2
reward += -5
elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x + character_size
reward += -2
if agent_beam_fire == True:
if beam_hit_detector("agent"):
#print "Bot Got Hit!"
bot_hp -= beam_damage
reward += 50
agent_beam_size_x = agent_beam_size_y = 0
agent_beam_x = agent_beam_y = beam_ob
if bot_hp <= 0:
successful = True
cont = False
winner = "Agent"
return reward, cont, successful, winner

def bot_beam_dir_detector():
global bot_current_action
if bot_current_action == 1:
bot_beam_dir = 2
elif bot_current_action == 2:
bot_beam_dir = 4
elif bot_current_action == 3:
bot_beam_dir = 3
elif bot_current_action == 4:
bot_beam_dir = 1
else:
bot_beam_dir = 0
return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
sess.run(initialize)
success = 0
for i in tqdm(range(1, num_episodes)):
#print "Episode #", i
rAll = 0; d = False; c = True; j = 0
param_init()
samples = []
while c == True:
j += 1
current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
b = bot_take_action()
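# Epsilon-greedy action selection: explore with probability e (and always
# during the first 5 episodes), otherwise act greedily with respect to the
# current Q estimates.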
if np.random.rand(1) < e or i <= 5:
a = random.randint(0, 8)
else:
a, _ = sess.run([predict, Q], feed_dict={input_layer: current_state})
r, c, d, winner = action(a + 1, b)
bot_beam_dir = bot_beam_dir_detector()
next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
samples.append([current_state, a, r, next_state])
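# Experience replay (see Update #2): once more than 10 transitions are stored,
# sample batch_size of them at random and train towards the Q-learning target
# r + y * max_a' Q(s', a') for each sampled transition.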
if len(samples) > 10:
for count in xrange(batch_size):
[batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
batch_allQ = sess.run(Q, feed_dict={input_layer: batch_current_state})
batch_Q1 = sess.run(Q, feed_dict={input_layer: batch_next_state})
batch_maxQ1 = np.max(batch_Q1)
batch_targetQ = batch_allQ
# Update the Q-value of the action from the sampled transition
batch_targetQ[0][action_taken] = reward + y * batch_maxQ1
sess.run([updateModel], feed_dict={input_layer: batch_current_state, next_Q: batch_targetQ})
rAll += r
screen_blit()
if d == True:
e = 1. / ((i / 50) + 10)
success += 1
break
#print agent_hp, bot_hp
display.update()

jList.append(j)
rList.append(rAll)
print winner


I'm pretty sure that if you have pygame, TensorFlow and matplotlib installed in a Python environment, you should be able to see the animations of the bot and the agent "fighting".



I digressed in the update, but it would be awesome if somebody could also address my specific problem along with the original general problem.



Thanks!



Update #2 on August 18, 2017:



Based on the advice of @NeilSlater, I've implemented experience replay in my model. The algorithm has improved, but I'm going to keep looking for further improvements that actually lead to convergence.
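
For reference, a minimal sketch of the kind of bounded replay buffer described in the comments, storing (state, action, reward, next_state) tuples and sampling small mini-batches; the capacity and function names here are illustrative, not taken from the code above:

import random
from collections import deque

# Bounded buffer: the oldest transitions are discarded once capacity is reached.
replay_buffer = deque(maxlen=10000)

def remember(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

def sample_batch(batch_size=10):
    # Sample without replacement so one mini-batch never repeats a transition.
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))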



Update #3 on August 22, 2017:



I've noticed that if the agent hits the bot with a bullet on a turn where the action the bot took was not "fire a bullet", then the wrong action would be given credit. Thus, I've turned the bullets into beams, so the bot/agent takes damage on the same turn the beam is fired.










machine-learning python reinforcement-learning q-learning






asked Aug 11 '17 at 1:11 by IronEdward, edited Aug 22 '17 at 11:44

  • Are you using experience replay and bootstrapping values from a "frozen" copy of a recent network? These are approaches used in DQN - they are not guaranteed, though they may be necessary for stability. Are you using a Q($\lambda$) algorithm, or just single-step Q-learning? Can you give some indication of what your environment and reward scheme is like? Single-step Q-learning will do poorly when rewards are sparse, e.g. a final +1 or -1 reward at the end of a long episode.
    – Neil Slater, Aug 11 '17 at 7:27

  • OK, from your update, I immediately suggest you need experience replay and probably also alternating networks for bootstrapping, because these are stabilising influences on reinforcement learning with non-linear approximators. I'm happy to talk through that in detail, and take a look at your project code to show an example, but it might take a day or two to get back to you with that level of detail.
    – Neil Slater, Aug 11 '17 at 15:41

  • I have got the code running and, if I am correct in understanding it, the bullets can be "steered" by the agent selecting from actions 1-4 each turn, i.e. the bullet can be moved around in any direction whilst the agent stays still. Is that intentional? The bot doesn't do this because it only fires when aligned on the grid with the agent, and always picks the same direction if it does so.
    – Neil Slater, Aug 11 '17 at 20:05

  • Almost right, but you don't store the bootstrapped value; instead you re-calculate it when the step is sampled later. For each action taken, you store four things: State, Action, Next State, Reward. Then you take a small mini-batch (1 per step is fine, but more, e.g. 10, is typical) from this list and, for Q-learning, calculate the new max action and its value to create the supervised learning mini-batch (also called the TD target).
    – Neil Slater, Aug 12 '17 at 16:42

  • That should be "frozen copy of the approximator (i.e. the neural network)" (if the quote is from one of my comments or answers, please point me at it and I will correct it). It's very simple - just keep two copies of the weight params $\mathbf{w}$: the "live" one that you update, and a "recent old" one that you copy from the "live" one every few hundred updates. When you calculate the TD target, e.g. $R + \gamma \max_{a'} \hat{q}(S', a', \mathbf{w})$, use the "old" copy to calculate $\hat{q}$, but then train the "live" one with those values.
    – Neil Slater, Aug 15 '17 at 7:13
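
A minimal sketch of the "frozen copy" (target network) idea from the last comment, written in the same TF1 style as the question's code; the variable names are illustrative:

import tensorflow as tf

input_layer = tf.placeholder(shape=[1, 7], dtype=tf.float32)

# "Live" weights: updated on every training step.
live_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1), name="live_w")
live_Q = tf.matmul(input_layer, live_w)

# "Recent old" weights: only refreshed occasionally and used to compute the
# TD target R + gamma * max_a' q_hat(S', a', w_old).
target_w = tf.Variable(tf.random_uniform([7, 9], 0, 0.1), trainable=False, name="target_w")
target_Q = tf.matmul(input_layer, target_w)

# Run this op every few hundred training steps to copy live -> frozen.
copy_to_target = tf.assign(target_w, live_w)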


















2 Answers
If your weights are diverging then your optimizer or your gradients aren't behaving well. A common reason for diverging weights is exploding gradients, which can result from:



  1. too many layers, or

  2. too many recurrent cycles if you're using an RNN.

You can verify if you have exploding gradients as follows:



grad_magnitude = tf.reduce_sum([tf.reduce_sum(g**2)
for g in tf.gradients(loss, weights_list)])**0.5


Some approaches to solving the problem of exploding gradients are:



  • Use RELU or ELU activations

  • Use Xavier initialization

  • Use a Deep Residual architecture. This will keep the gradients from being squished by subsequent layers.
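
As a hedged illustration of the first two suggestions, here is what Xavier initialization and ELU activations could look like in the TF1 style of the question's code; the hidden-layer size of 32 is arbitrary:

import tensorflow as tf

input_layer = tf.placeholder(shape=[1, 7], dtype=tf.float32)

# Xavier/Glorot initialization keeps activation and gradient variance roughly
# constant across layers, which helps keep gradients from exploding.
w1 = tf.get_variable("w1", shape=[7, 32],
                     initializer=tf.contrib.layers.xavier_initializer())
w2 = tf.get_variable("w2", shape=[32, 9],
                     initializer=tf.contrib.layers.xavier_initializer())

# ELU is non-saturating for positive inputs and smooth for negative ones.
hidden = tf.nn.elu(tf.matmul(input_layer, w1))
Q = tf.matmul(hidden, w2)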





answered Mar 30 '18 at 20:17, edited Mar 30 '18 at 20:40 by Stephen Rauch

If you are using a fixed-point iteration to solve the Bellman equation, it might not only be degenerate but might also have attractors at infinity or orbits. Dig into the problem you are solving and understand it deeply. Have a look at control theory; RL folks tend not to write about this as much.
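
As a toy illustration of attractors at infinity (not reinforcement-learning specific): a fixed-point iteration x <- f(x) only converges when f is a contraction near the fixed point, and bootstrapped value updates with function approximation can behave like the expanding case:

def iterate(f, x0, steps=8):
    x = x0
    for _ in range(steps):
        x = f(x)
        print(x)

# Contraction (|slope| < 1): converges to the fixed point x = 2.
iterate(lambda x: 0.5 * x + 1.0, x0=0.0)

# Expansion (|slope| > 1): the fixed point x = -2 repels and the iterates
# run off towards infinity - an attractor at infinity.
iterate(lambda x: 1.5 * x + 1.0, x0=0.0)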






– mathtick, answered Apr 10 at 11:33












