Is Adam's optimization susceptible to local minima?
# Neural Network Architecture: one hidden layer with 3 units, one output unit
import numpy as np
from numpy.linalg import norm   # assumed import: `norm` is used below but was not defined in the snippet
# X (m x (n+1) design matrix with a leading bias column), y (labels),
# n (number of features) and m (number of training examples) are assumed
# to be loaded from the training data beforehand.
no_hid_layers = 1
hid = 3
no_out = 1
# Xavier initialization of weights w
w1 = np.random.randn(hid, n+1)*np.sqrt(2/(hid+n+1))
w2 = np.random.randn(no_out, hid+1)*np.sqrt(2/(no_out+hid+1))
# Sigmoid activation function
def g(x):
    return 1/(1 + np.exp(-x))
# Forward propagation through the single hidden layer
def frwrd_prop(X, w1, w2):
    z2 = w1 @ X.T
    z2 = z2/norm(z2, axis=0)   # assumed intent: column-wise normalization (the original line, `z2 = norm(z2, axis=0)`, collapses the layer's activations)
    a2 = np.insert(g(z2), 0, 1, axis=0)   # prepend the bias unit
    h = g(w2 @ a2)
    return (h, a2)
# Calculating cost and gradient
def Cost(X, y, w1, w2, lmbda=0):
    # Gradients dw, filled slice-by-slice below
    dw1 = np.zeros(w1.shape)
    dw2 = np.zeros(w2.shape)
    # Forward propagation to calculate the output
    h, a2 = frwrd_prop(X, w1, w2)
    # Regularized cross-entropy cost J (bias columns are not regularized)
    J = -(np.sum(y.T*np.log(h) + (1-y).T*np.log(1-h))
          - lmbda/2*(np.sum(w1[:,1:]**2) + np.sum(w2[:,1:]**2)))/m
    # Backpropagation for the gradients dw
    D3 = h - y
    D2 = (w2.T @ D3)*a2*(1-a2)
    dw1[:,0] = (D2[1:] @ X)[:,0]/m
    dw2[:,0] = (D3 @ a2.T)[:,0]/m
    dw1[:,1:] = ((D2[1:] @ X)[:,1:] + lmbda*w1[:,1:])/m
    dw2[:,1:] = ((D3 @ a2.T)[:,1:] + lmbda*w2[:,1:])/m
    # Gradient clipping to a maximum norm of 4.5
    if np.linalg.norm(dw1) > 4.5:
        dw1 = dw1*4.5/np.linalg.norm(dw1)
    if np.linalg.norm(dw2) > 4.5:
        dw2 = dw2*4.5/np.linalg.norm(dw2)
    return (J, dw1, dw2)
# Adam optimization for training w
def Train(w1, w2, maxIter=50):
    # Step size, first/second moment decay rates, and epsilon
    a, b1, b2, e = 0.001, 0.9, 0.999, 10**(-8)
    V1 = np.zeros(w1.shape)
    V2 = np.zeros(w2.shape)
    S1 = np.zeros(w1.shape)
    S2 = np.zeros(w2.shape)
    for i in range(maxIter):
        J, dw1, dw2 = Cost(X, y, w1, w2)
        V1 = b1*V1 + (1-b1)*dw1
        S1 = b2*S1 + (1-b2)*(dw1**2)
        V2 = b1*V2 + (1-b1)*dw2
        S2 = b2*S2 + (1-b2)*(dw2**2)
        # Bias correction of the moment estimates
        if i != 0:
            V1 = V1/(1-b1**i)
            S1 = S1/(1-b2**i)
            V2 = V2/(1-b1**i)
            S2 = S2/(1-b2**i)
        w1 = w1 - a*V1/(np.sqrt(S1)+e)*dw1
        w2 = w2 - a*V2/(np.sqrt(S2)+e)*dw2
        print("\t\t\tIteration :", i+1, "\tCost :", J)
    return (w1, w2)
# Training the neural network
w1, w2 = Train(w1, w2)
I'm using Adam optimization to drive gradient descent toward a global minimum, but the cost stagnates (stops changing) after around 15 iterations (the exact number varies between runs). The initial cost from the random weight initialization changes only minutely before becoming constant, and the resulting training accuracy ranges from 45% to 70% across runs of the exact same code. Can you help me find the reason for this?
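For comparison, here is a minimal sketch of the standard Adam step as I understand it from Kingma & Ba (2015), using the same hyperparameter names as above; it is a generic reference, not the code of the network itself:

import numpy as np

def adam_step(w, dw, V, S, t, a=0.001, b1=0.9, b2=0.999, e=1e-8):
    # One textbook Adam update; t is the 1-based iteration counter.
    V = b1*V + (1 - b1)*dw               # first-moment (momentum) estimate
    S = b2*S + (1 - b2)*dw**2            # second-moment (squared-gradient) estimate
    V_hat = V/(1 - b1**t)                # bias-corrected copies: V and S themselves
    S_hat = S/(1 - b2**t)                # keep accumulating across iterations
    w = w - a*V_hat/(np.sqrt(S_hat) + e) # the step uses V_hat directly
    return w, V, S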
Tags: optimization, gradient-descent, loss-function
asked Apr 7 at 16:09 by Arka Patra
Welcome to SE.DataScience! Adam and similar optimizers (Nesterov, Nadam, etc.) all converge to a local minimum; no global optimum is guaranteed. The high variability could be due to (1) too many parameters, (2) too few training samples, (3) bugs in the implementation, etc. As you can see, there are many possible causes for this symptom. You had better provide executable code, with all the imports, for a fast assessment.
– Esmailian
Apr 7 at 16:33
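One quick way to probe cause (3), bugs in the implementation, is to compare the analytic gradients from Cost against numerical finite differences. A minimal sketch, assuming the Cost, X, and y defined in the question (note that the gradient clipping inside Cost can distort the comparison whenever it triggers):

import numpy as np

def grad_check(X, y, w1, w2, eps=1e-5, checks=5):
    # Spot-check a few entries of the analytic gradient dw1 against
    # central finite differences of the cost J.
    _, dw1, _ = Cost(X, y, w1, w2)
    for flat in np.random.choice(w1.size, checks, replace=False):
        ix = np.unravel_index(flat, w1.shape)
        w_plus, w_minus = w1.copy(), w1.copy()
        w_plus[ix] += eps
        w_minus[ix] -= eps
        J_plus, _, _ = Cost(X, y, w_plus, w2)
        J_minus, _, _ = Cost(X, y, w_minus, w2)
        print(ix, "analytic:", dw1[ix], "numeric:", (J_plus - J_minus)/(2*eps))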
@Esmailian Hello, and thank you. Is there any way to prevent the gradient from falling into local minima? I think Geoffrey Hinton published a paper on that, but I'm not sure which one. And if that's not possible, how do I resolve the issue? Also, too few training examples or too many features is a problem for overfitting, but low training accuracy looks like underfitting; and shouldn't the training accuracy be higher with fewer features, since the weights adjust more accurately when there are fewer training examples? P.S. I'm writing this in Python and have only imported pandas and NumPy.
– Arka Patra
Apr 7 at 17:52
Is there any way to prevent the gradient from falling into local minima? No. One optimizer may perform better than another, but all of them fall into local minima. The high instability of the accuracy cannot yet be attributed to over- or under-fitting with any certainty. Please post code that can be executed with no modification.
– Esmailian
Apr 7 at 17:58