

DQN fails to find optimal policy



Based on a DeepMind publication, I've recreated the environment and I am trying to make a DQN find and converge to an optimal policy. The agent's task is to learn how to sustainably collect apples (objects), with the regrowth of the apples depending on their spatial configuration (the more apples around, the higher the regrowth). In short: the agent has to collect as many apples as it can (it gets a reward of +1 for each collected apple) while allowing them to regrow, which maximizes its reward (if it depletes the resource too quickly, it loses future reward). The grid game is shown in the picture below, where the player is a red square, its facing direction grey, and the apples green:
[image: the grid environment]
As described in the publication, I've built a DQN to solve the game. However, regardless of playing with the learning rate, loss, exploration rate and its decay, batch size, optimizer, and replay buffer, or increasing the network size, the DQN does not find the optimal policy, pictured below:
[image: the optimal policy]
I wonder if there is some mistake in my DQN code (with a similar implementation I've managed to solve the OpenAI Gym CartPole task). My code is below:



# (imports assumed from the original module; RLDebugger is defined elsewhere in the project)
import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop


class DDQNAgent(RLDebugger):
    def __init__(self, observation_space, action_space):
        RLDebugger.__init__(self)
        # get size of state and action
        self.state_size = observation_space[0]
        self.action_size = action_space
        # hyper parameters
        self.learning_rate = .00025
        self.model = self.build_model()
        self.target_model = self.model
        self.gamma = 0.999
        self.epsilon_max = 1.
        self.epsilon = 1.
        self.t = 0
        self.epsilon_min = 0.1
        self.n_first_exploration_steps = 1500
        self.epsilon_decay_len = 1000000
        self.batch_size = 32
        self.train_start = 64
        # create replay memory using deque
        self.memory = deque(maxlen=1000000)
        self.target_model = self.build_model(trainable=False)

    # approximate the Q function with a neural network:
    # the state is the input and the Q-value of each action is the output
    def build_model(self, trainable=True):
        model = Sequential()
        # This is a simple model with two small hidden layers; it should be enough here,
        # and it is easy to try different architectures (stack layers, change activations)
        model.add(Dense(32, input_dim=self.state_size, activation='relu', trainable=trainable))
        model.add(Dense(32, activation='relu', trainable=trainable))
        model.add(Dense(self.action_size, activation='linear', trainable=trainable))
        model.compile(loss='mse', optimizer=RMSprop(lr=self.learning_rate))
        model.summary()
        # 1/ You can try different losses; e.g. logcosh is a twice-differentiable approximation of the Huber loss
        # 2/ From a theoretical perspective the learning rate should decay with time to guarantee convergence
        return model

    # get an action from the model using an epsilon-greedy policy
    def get_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        q_value = self.model.predict(state)
        return np.argmax(q_value[0])

    # decay epsilon linearly from epsilon_max to epsilon_min over epsilon_decay_len steps,
    # after an initial n_first_exploration_steps of pure exploration
    def update_epsilon(self):
        self.t += 1
        self.epsilon = self.epsilon_min + max(
            0.,
            (self.epsilon_max - self.epsilon_min)
            * (self.epsilon_decay_len - max(0., self.t - self.n_first_exploration_steps))
            / self.epsilon_decay_len)

    # store the transition and train the online network on a random minibatch
    def train_model(self, action, state, next_state, reward, done):

        # save sample <s, a, r, s', done> to the replay memory
        self.memory.append((state, action, reward, next_state, done))

        if len(self.memory) >= self.train_start:
            states, actions, rewards, next_states, dones = self.create_minibatch()

            targets = self.model.predict(states)
            target_values = self.target_model.predict(next_states)

            for i in range(self.batch_size):
                # approximate Q-learning target
                if dones[i]:
                    targets[i][actions[i]] = rewards[i]
                else:
                    targets[i][actions[i]] = rewards[i] + self.gamma * np.amax(target_values[i])

            # and do the model fit!
            loss = self.model.fit(states, targets, verbose=0).history['loss'][0]

            for i in range(self.batch_size):
                self.record(actions[i], states[i], targets[i], target_values[i], loss / self.batch_size, rewards[i])

    def create_minibatch(self):
        # pick samples randomly from replay memory (using batch_size)
        batch_size = min(self.batch_size, len(self.memory))
        samples = random.sample(self.memory, batch_size)

        states = np.array([_[0][0] for _ in samples])
        actions = np.array([_[1] for _ in samples])
        rewards = np.array([_[2] for _ in samples])
        next_states = np.array([_[3][0] for _ in samples])
        dones = np.array([_[4] for _ in samples])

        return (states, actions, rewards, next_states, dones)

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())
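
Note that, despite the DDQN class name, the update above uses the standard single-estimator DQN target (a plain max over the target network's Q-values). Purely as an illustrative sketch (not part of the original code; it reuses the same minibatch variables and the model / target_model attributes from train_model), a Double-DQN target would instead select the greedy action with the online network and evaluate it with the target network:

# Illustrative Double-DQN target sketch (assumes the same states/next_states/
# actions/rewards/dones/targets variables as in train_model above)
next_q_online = self.model.predict(next_states)          # used to pick the action
next_q_target = self.target_model.predict(next_states)   # used to evaluate it
for i in range(self.batch_size):
    if dones[i]:
        targets[i][actions[i]] = rewards[i]
    else:
        best_a = np.argmax(next_q_online[i])
        targets[i][actions[i]] = rewards[i] + self.gamma * next_q_target[i][best_a]

Whether the double estimator actually matters for this environment is an open question; the sketch is only meant to show the difference from the max-based target.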


And this is the code which I use to train the model:



from dqn_agent import *
from environment import *

env = GameEnv()
observation_space = env.reset()

agent = DDQNAgent(observation_space.shape, 7)

state_size = observation_space.shape[0]
last_rewards = []
episode = 0
max_episode_len = 1000
while episode < 2100:
    episode += 1
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    # if episode % 100 == 0:
    #     env.render_env()
    total_reward = 0

    step = 0
    gameover = False
    while not gameover:
        step += 1
        # if episode % 100 == 0:
        #     env.render_env()
        action = agent.get_action(state)
        reward, next_state, done = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        total_reward += reward
        agent.train_model(action, state, next_state, reward, done)
        agent.update_epsilon()
        state = next_state
        terminal = (step >= max_episode_len)
        if done or terminal:
            last_rewards.append(total_reward)
            agent.update_target_model()
            gameover = True

    print('episode:', episode, 'cumulative reward: ', total_reward, 'epsilon:', agent.epsilon, 'step', step)


The target model is updated after each episode (one episode = 1000 steps).



Looking at the logs, the agent sometimes achieves very high results several times in a row, but it always fails to stabilize, and the results have an extremely high variance from episode to episode (even after increasing epsilon and running for a few thousand episodes). Looking at my code and the game, do you have any ideas about what might help the algorithm stabilize the results and converge? I've been playing a lot with the hyperparameters, but nothing gives a significant improvement.
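
Since the per-episode rewards are so noisy, the trend is easier to judge on a smoothed version of the reward log. A minimal sketch (not part of the original code; it assumes numpy and the last_rewards list from the training script above):

import numpy as np

def moving_average(rewards, window=50):
    # sliding-window mean over the per-episode rewards
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) < window:
        return rewards
    return np.convolve(rewards, np.ones(window) / window, mode='valid')

smoothed = moving_average(last_rewards)
print(smoothed[-10:])  # trend over the most recent episodes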



Some parameters of the game and training:
Reward: +1 for collecting each apple (green square).
Episode: 1000 steps; after 1000 steps, or if the player completely depletes the resource, the game automatically resets.
Target model update: after each game termination.
The remaining hyperparameters can be found in the code above.



Let me know if you have any ideas; I'm happy to share the GitHub repo. Feel free to email me at macwiatrak@gmail.com.



P.S. I know this is a similar problem to the one linked below, but I have tried what was suggested there without success, hence the new question.
DQN cannot learn or converge



EDIT: Added the reward graph (below).



[image: per-episode reward graph during training]

  • This looks like a normal reward graph for a DQN training process. The agent learns from its mistakes, and needs to make mistakes in order to do so. Have you tried assessing the agent using a purely greedy policy? Typically you would stop training every so many episodes and assess the agent without a training loop and with epsilon set to zero. If the environment has any randomness, you should assess multiple times to get a mean result. Could you do that and show the graph? If this solves your problem, I could write an answer explaining why you need to do this.
    – Neil Slater, Apr 1 at 18:50
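
A minimal sketch of that kind of periodic greedy evaluation (not part of the original code; it assumes the env, agent and state_size objects from the question, and temporarily overrides the agent's epsilon to force greedy actions):

import numpy as np

def evaluate_greedy(agent, env, state_size, n_episodes=5, max_steps=1000):
    # run the current policy with epsilon = 0 (no exploration, no training)
    # and return the mean episode reward over several evaluation episodes
    saved_epsilon = agent.epsilon
    agent.epsilon = 0.0
    returns = []
    for _ in range(n_episodes):
        state = np.reshape(env.reset(), [1, state_size])
        total, done, step = 0, False, 0
        while not done and step < max_steps:
            action = agent.get_action(state)
            reward, next_state, done = env.step(action)
            state = np.reshape(next_state, [1, state_size])
            total += reward
            step += 1
        returns.append(total)
    agent.epsilon = saved_epsilon
    return np.mean(returns)

# e.g. inside the training loop: if episode % 50 == 0: print('eval:', evaluate_greedy(agent, env, state_size))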










  • I have not tried it yet. I did not do that because I did not see that much point in testing it before I saw the algorithm converge. Is the point of stopping training and assessing the agent without a training loop solely to test the performance of an agent at points in training? I am currently running it for 10000 episodes with a larger network to see if it helps it converge. But tomorrow I am gonna try what you said.
    – macwiatrak, Apr 1 at 22:33










  • "Is the point of stopping training and assessing the agent without a training loop solely to test the performance of an agent at points in training?" Yes. The point is that the score during training is not a reliable measure because you have deliberately made the agent act non-optimally and randomly. You should absolutely expect large variations in reward for most environments when using trial-and-error exploration. You may also get significant variation of optimal behaviour depending on how precise optimal behaviour needs to be, but ideally you will get convergence at the end.
    – Neil Slater, Apr 2 at 6:20











  • Thanks! I am doing that currently, together with fitting the hyperparameters in a slightly less complicated environment (that takes less time to train and evaluate the performance). So far the results are promising (I managed to converge and stabilize it). If you are interested, I am happy to share the hyperparams once I am sure it's working (a few days probably).
    – macwiatrak, Apr 3 at 11:02

Tags: reinforcement-learning, q-learning, dqn, convergence, deepmind






asked Apr 1 at 1:23 by macwiatrak (edited Apr 1 at 17:54)

1 Answer

You say that the agent can achieve high results, so it appears to be learning something, but it fails to do this consistently. Is there a significant drop-off in performance at some points?

Is there no improvement in the results if you use Adam instead of RMSprop? I'd be surprised if this were a "difficult" environment that needed a lot of tweaking. I'm curious what your reward graph looks like compared to purely "random" behaviour, which I think would also survive in this environment.
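
A minimal sketch of such a random-behaviour baseline (not from the answer or the question; it assumes the GameEnv interface shown in the question, with env.step returning (reward, next_state, done) and 7 actions):

import random
import numpy as np
from environment import GameEnv  # environment module/class as used in the question

def random_baseline(env, n_episodes=20, max_steps=1000, n_actions=7):
    # act uniformly at random and record per-episode rewards,
    # to give a reference point for the DQN's reward graph
    returns = []
    for _ in range(n_episodes):
        env.reset()
        total, done, step = 0, False, 0
        while not done and step < max_steps:
            reward, next_state, done = env.step(random.randrange(n_actions))
            total += reward
            step += 1
        returns.append(total)
    return np.mean(returns), np.std(returns)

print(random_baseline(GameEnv()))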






answered Apr 1 at 13:23 by tryingtolearn (edited Apr 1 at 13:35)

  • The variance of the rewards is very high, and what's interesting is that the extraordinarily high/low episode rewards tend to come in streaks (i.e. very low results often happen around 3-5 times in a row, and the same goes for very high results). Not really, there is surely not a significant difference between Adam and RMSProp. I've added the reward graph to the question description. Let me know what you think of the graph! For now, I've increased epsilon and increased the net, and I will be looking at the results after 10k episodes. (Will take a few days probably.)
    – macwiatrak, Apr 1 at 18:32











  • Actually, it looks like Adam outperforms RMSProp in the long term, but I am gonna double check it.
    – macwiatrak, Apr 2 at 2:48










  • @macwiatrak In my experience, oscillations in training while improving are a symptom of the learning rate (and batch size). The other thing I'd want to compare the reward graph to is the exploration rate. If there's a significant correlation in their drop-offs, it can tell you something.
    – tryingtolearn, Apr 2 at 3:25
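
A minimal sketch of that comparison (not from the thread; it assumes the training loop also appends agent.epsilon to an epsilons list next to last_rewards.append(total_reward), which the original script does not do):

import matplotlib.pyplot as plt

# last_rewards and epsilons are assumed to be collected per episode during training
fig, ax1 = plt.subplots()
ax1.plot(last_rewards, color='tab:blue')
ax1.set_xlabel('episode')
ax1.set_ylabel('episode reward', color='tab:blue')
ax2 = ax1.twinx()
ax2.plot(epsilons, color='tab:orange')
ax2.set_ylabel('epsilon', color='tab:orange')
plt.title('Per-episode reward vs. exploration rate')
plt.show()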










  • @macwiatrak If this solved your problem can you mark this as the answer? Thanks
    – tryingtolearn, 2 days ago










