

DQN fails to find optimal policy



Based on a DeepMind publication, I've recreated the environment and I am trying to make a DQN find and converge to an optimal policy. The agent's task is to learn how to sustainably collect apples (objects), with the regrowth of the apples depending on their spatial configuration (the more apples around, the higher the regrowth). In short: the agent has to collect as many apples as it can (it gets a reward of +1 for each collected apple) while allowing them to regrow, which maximizes its reward (if it depletes the resource too quickly, it loses future reward). The grid game is shown in the picture below, where the player is a red square, its facing direction grey, and the apples green:
[image: the grid environment]
As described in the publication, I've built a DQN to solve the game. However, regardless of playing with the learning rate, loss, exploration rate and its decay, batch size, optimizer, and replay buffer, or increasing the network size, the DQN does not find the optimal policy, pictured below:
[image: the optimal policy]
I wonder if there is some mistake in my DQN code (with a similar implementation I've managed to solve the OpenAI Gym CartPole task). My code is below:



# (imports assumed from the original module; RLDebugger is defined elsewhere in the project)
import random
from collections import deque

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop


class DDQNAgent(RLDebugger):
    def __init__(self, observation_space, action_space):
        RLDebugger.__init__(self)
        # get size of state and action
        self.state_size = observation_space[0]
        self.action_size = action_space
        # hyper parameters
        self.learning_rate = .00025
        self.model = self.build_model()
        self.target_model = self.model
        self.gamma = 0.999
        self.epsilon_max = 1.
        self.epsilon = 1.
        self.t = 0
        self.epsilon_min = 0.1
        self.n_first_exploration_steps = 1500
        self.epsilon_decay_len = 1000000
        self.batch_size = 32
        self.train_start = 64
        # create replay memory using deque
        self.memory = deque(maxlen=1000000)
        self.target_model = self.build_model(trainable=False)

    # approximate the Q function with a neural network:
    # the state is the input and the Q-value of each action is the output
    def build_model(self, trainable=True):
        model = Sequential()
        # This is a simple model with two small hidden layers; it should be enough here,
        # and it is easy to try different architectures (stack layers, change activations)
        model.add(Dense(32, input_dim=self.state_size, activation='relu', trainable=trainable))
        model.add(Dense(32, activation='relu', trainable=trainable))
        model.add(Dense(self.action_size, activation='linear', trainable=trainable))
        model.compile(loss='mse', optimizer=RMSprop(lr=self.learning_rate))
        model.summary()
        # 1/ You can try different losses; e.g. logcosh is a twice-differentiable approximation of the Huber loss
        # 2/ From a theoretical perspective the learning rate should decay with time to guarantee convergence
        return model

    # get an action from the model using an epsilon-greedy policy
    def get_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        q_value = self.model.predict(state)
        return np.argmax(q_value[0])

    # decay epsilon linearly from epsilon_max to epsilon_min over epsilon_decay_len steps,
    # after an initial n_first_exploration_steps of pure exploration
    def update_epsilon(self):
        self.t += 1
        self.epsilon = self.epsilon_min + max(
            0.,
            (self.epsilon_max - self.epsilon_min)
            * (self.epsilon_decay_len - max(0., self.t - self.n_first_exploration_steps))
            / self.epsilon_decay_len)

    # store the transition and train the online network on a random minibatch
    def train_model(self, action, state, next_state, reward, done):

        # save sample <s, a, r, s', done> to the replay memory
        self.memory.append((state, action, reward, next_state, done))

        if len(self.memory) >= self.train_start:
            states, actions, rewards, next_states, dones = self.create_minibatch()

            targets = self.model.predict(states)
            target_values = self.target_model.predict(next_states)

            for i in range(self.batch_size):
                # approximate Q-learning target
                if dones[i]:
                    targets[i][actions[i]] = rewards[i]
                else:
                    targets[i][actions[i]] = rewards[i] + self.gamma * np.amax(target_values[i])

            # and do the model fit!
            loss = self.model.fit(states, targets, verbose=0).history['loss'][0]

            for i in range(self.batch_size):
                self.record(actions[i], states[i], targets[i], target_values[i], loss / self.batch_size, rewards[i])

    def create_minibatch(self):
        # pick samples randomly from replay memory (using batch_size)
        batch_size = min(self.batch_size, len(self.memory))
        samples = random.sample(self.memory, batch_size)

        states = np.array([_[0][0] for _ in samples])
        actions = np.array([_[1] for _ in samples])
        rewards = np.array([_[2] for _ in samples])
        next_states = np.array([_[3][0] for _ in samples])
        dones = np.array([_[4] for _ in samples])

        return (states, actions, rewards, next_states, dones)

    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())
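
Note that, despite the DDQN class name, the update above uses the standard single-estimator DQN target (a plain max over the target network's Q-values). Purely as an illustrative sketch (not part of the original code; it reuses the same minibatch variables and the model / target_model attributes from train_model), a Double-DQN target would instead select the greedy action with the online network and evaluate it with the target network:

# Illustrative Double-DQN target sketch (assumes the same states/next_states/
# actions/rewards/dones/targets variables as in train_model above)
next_q_online = self.model.predict(next_states)          # used to pick the action
next_q_target = self.target_model.predict(next_states)   # used to evaluate it
for i in range(self.batch_size):
    if dones[i]:
        targets[i][actions[i]] = rewards[i]
    else:
        best_a = np.argmax(next_q_online[i])
        targets[i][actions[i]] = rewards[i] + self.gamma * next_q_target[i][best_a]

Whether the double estimator actually matters for this environment is an open question; the sketch is only meant to show the difference from the max-based target.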


And this is the code which I use to train the model:



from dqn_agent import *
from environment import *

env = GameEnv()
observation_space = env.reset()

agent = DDQNAgent(observation_space.shape, 7)

state_size = observation_space.shape[0]
last_rewards = []
episode = 0
max_episode_len = 1000
while episode < 2100:
    episode += 1
    state = env.reset()
    state = np.reshape(state, [1, state_size])
    # if episode % 100 == 0:
    #     env.render_env()
    total_reward = 0

    step = 0
    gameover = False
    while not gameover:
        step += 1
        # if episode % 100 == 0:
        #     env.render_env()
        action = agent.get_action(state)
        reward, next_state, done = env.step(action)
        next_state = np.reshape(next_state, [1, state_size])
        total_reward += reward
        agent.train_model(action, state, next_state, reward, done)
        agent.update_epsilon()
        state = next_state
        terminal = (step >= max_episode_len)
        if done or terminal:
            last_rewards.append(total_reward)
            agent.update_target_model()
            gameover = True

    print('episode:', episode, 'cumulative reward: ', total_reward, 'epsilon:', agent.epsilon, 'step', step)


The target model is updated after each episode (one episode = 1000 steps).



Looking at the logs, the agent sometimes achieves very high results several times in a row, but it always fails to stabilize, and the results have an extremely high variance from episode to episode (even after increasing epsilon and running for a few thousand episodes). Looking at my code and the game, do you have any ideas about what might help the algorithm stabilize the results and converge? I've been playing a lot with the hyperparameters, but nothing gives a significant improvement.
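
Since the per-episode rewards are so noisy, the trend is easier to judge on a smoothed version of the reward log. A minimal sketch (not part of the original code; it assumes numpy and the last_rewards list from the training script above):

import numpy as np

def moving_average(rewards, window=50):
    # sliding-window mean over the per-episode rewards
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) < window:
        return rewards
    return np.convolve(rewards, np.ones(window) / window, mode='valid')

smoothed = moving_average(last_rewards)
print(smoothed[-10:])  # trend over the most recent episodes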



Some parameters of the game and training:
Reward: +1 for collecting each apple (green square).
Episode: 1000 steps; after 1000 steps, or if the player completely depletes the resource, the game automatically resets.
Target model update: after each game termination.
The remaining hyperparameters can be found in the code above.



Let me know if you have any ideas; I'm happy to share the GitHub repo. Feel free to email me at macwiatrak@gmail.com.



P.S. I know this is a similar problem to the one linked below, but I have tried what was suggested there without success, hence the new question.
DQN cannot learn or converge



EDIT: Added the reward graph (below).



[image: per-episode reward graph during training]

  • This looks like a normal reward graph for a DQN training process. The agent learns from its mistakes, and needs to make mistakes in order to do so. Have you tried assessing the agent using a purely greedy policy? Typically you would stop training every so many episodes and assess the agent without a training loop and with epsilon set to zero. If the environment has any randomness, you should assess multiple times to get a mean result. Could you do that and show the graph? If this solves your problem, I could write an answer explaining why you need to do this.
    – Neil Slater, Apr 1 at 18:50
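
A minimal sketch of that kind of periodic greedy evaluation (not part of the original code; it assumes the env, agent and state_size objects from the question, and temporarily overrides the agent's epsilon to force greedy actions):

import numpy as np

def evaluate_greedy(agent, env, state_size, n_episodes=5, max_steps=1000):
    # run the current policy with epsilon = 0 (no exploration, no training)
    # and return the mean episode reward over several evaluation episodes
    saved_epsilon = agent.epsilon
    agent.epsilon = 0.0
    returns = []
    for _ in range(n_episodes):
        state = np.reshape(env.reset(), [1, state_size])
        total, done, step = 0, False, 0
        while not done and step < max_steps:
            action = agent.get_action(state)
            reward, next_state, done = env.step(action)
            state = np.reshape(next_state, [1, state_size])
            total += reward
            step += 1
        returns.append(total)
    agent.epsilon = saved_epsilon
    return np.mean(returns)

# e.g. inside the training loop: if episode % 50 == 0: print('eval:', evaluate_greedy(agent, env, state_size))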










  • I have not tried it yet. I did not do that because I did not see that much point in testing it before I saw the algorithm converge. Is the point of stopping training and assessing the agent without a training loop solely to test the performance of an agent at points in training? I am currently running it for 10000 episodes with a larger network to see if it helps it converge. But tomorrow I am gonna try what you said.
    – macwiatrak, Apr 1 at 22:33










  • "Is the point of stopping training and assessing the agent without a training loop solely to test the performance of an agent at points in training?" Yes. The point is that the score during training is not a reliable measure because you have deliberately made the agent act non-optimally and randomly. You should absolutely expect large variations in reward for most environments when using trial-and-error exploration. You may also get significant variation of optimal behaviour depending on how precise optimal behaviour needs to be, but ideally you will get convergence at the end.
    – Neil Slater, Apr 2 at 6:20











  • Thanks! I am doing that currently, together with fitting the hyperparameters in a slightly less complicated environment (that takes less time to train and evaluate the performance). So far the results are promising (I managed to converge and stabilize it). If you are interested, I am happy to share the hyperparams once I am sure it's working (a few days probably).
    – macwiatrak, Apr 3 at 11:02

Tags: reinforcement-learning, q-learning, dqn, convergence, deepmind






asked Apr 1 at 1:23 by macwiatrak (edited Apr 1 at 17:54)

1 Answer

You say that the agent can achieve high results, so it appears to be learning something, but it fails to do this consistently. Is there a significant drop-off in performance at some points?

Is there no improvement in the results if you use Adam instead of RMSprop? I'd be surprised if this were a "difficult" environment that needed a lot of tweaking. I'm curious what your reward graph looks like compared to purely "random" behaviour, which I think would also survive in this environment.
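
A minimal sketch of such a random-behaviour baseline (not from the answer or the question; it assumes the GameEnv interface shown in the question, with env.step returning (reward, next_state, done) and 7 actions):

import random
import numpy as np
from environment import GameEnv  # environment module/class as used in the question

def random_baseline(env, n_episodes=20, max_steps=1000, n_actions=7):
    # act uniformly at random and record per-episode rewards,
    # to give a reference point for the DQN's reward graph
    returns = []
    for _ in range(n_episodes):
        env.reset()
        total, done, step = 0, False, 0
        while not done and step < max_steps:
            reward, next_state, done = env.step(random.randrange(n_actions))
            total += reward
            step += 1
        returns.append(total)
    return np.mean(returns), np.std(returns)

print(random_baseline(GameEnv()))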






answered Apr 1 at 13:23 by tryingtolearn (edited Apr 1 at 13:35)

  • The variance of the rewards is very high, and what's interesting is that the extraordinarily high/low episode rewards tend to come in streaks (i.e. very low results often happen around 3-5 times in a row, and the same goes for very high results). Not really, there is surely not a significant difference between Adam and RMSProp. I've added the reward graph to the question description. Let me know what you think of the graph! For now, I've increased epsilon and increased the net, and I will be looking at the results after 10k episodes. (Will take a few days probably.)
    – macwiatrak, Apr 1 at 18:32











  • Actually, it looks like Adam outperforms RMSProp in the long term, but I am gonna double check it.
    – macwiatrak, Apr 2 at 2:48










  • @macwiatrak In my experience, oscillations in training while improving are a symptom of the learning rate (and batch size). The other thing I'd want to compare the reward graph to is the exploration rate. If there's a significant correlation in their drop-offs, it can tell you something.
    – tryingtolearn, Apr 2 at 3:25
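
A minimal sketch of that comparison (not from the thread; it assumes the training loop also appends agent.epsilon to an epsilons list next to last_rewards.append(total_reward), which the original script does not do):

import matplotlib.pyplot as plt

# last_rewards and epsilons are assumed to be collected per episode during training
fig, ax1 = plt.subplots()
ax1.plot(last_rewards, color='tab:blue')
ax1.set_xlabel('episode')
ax1.set_ylabel('episode reward', color='tab:blue')
ax2 = ax1.twinx()
ax2.plot(epsilons, color='tab:orange')
ax2.set_ylabel('epsilon', color='tab:orange')
plt.title('Per-episode reward vs. exploration rate')
plt.show()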










  • @macwiatrak If this solved your problem can you mark this as the answer? Thanks
    – tryingtolearn, 2 days ago










