Newton method and Vanishing Gradient
I read an article on the vanishing gradient problem which states that the problem can be rectified by using a ReLU-based activation function.
What I cannot understand is this: if a ReLU-based activation function solves the problem, why do so many research papers suggest Newton's-method-based optimization algorithms for deep learning instead of gradient descent?
While reading those papers I had the strong impression that the vanishing gradient problem was the core reason for this suggestion, but now I am confused about whether Newton's method is really needed if gradient descent can be modified to rectify all the problems faced during machine learning.
machine-learning neural-network optimization
asked Mar 20 at 14:41 by Aman (new contributor); edited Mar 20 at 20:50 by Esmailian
1 Answer
Why are there so many research papers suggesting the use of Newton's-method-based optimization algorithms for deep learning instead of gradient descent?
Newton's method has a faster convergence rate than gradient descent, and this is the main reason it may be suggested as a replacement for gradient descent.
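Below is a minimal 1-D sketch of that convergence-rate claim, my own toy illustration rather than anything from the original answer: the objective f(w) = w^2 + exp(w), the starting point, the learning rate, and the tolerance are all arbitrary assumptions chosen for demonstration.

```python
# Toy comparison (illustrative assumptions, not from the answer):
# minimize f(w) = w^2 + exp(w), which is strictly convex with f''(w) > 0.
import math

grad = lambda w: 2 * w + math.exp(w)   # f'(w)
hess = lambda w: 2 + math.exp(w)       # f''(w)

def gradient_descent(w=5.0, lr=0.1, tol=1e-10, max_iter=10_000):
    for k in range(max_iter):
        g = grad(w)
        if abs(g) < tol:
            return w, k
        w -= lr * g                    # first-order step, fixed learning rate
    return w, max_iter

def newton(w=5.0, tol=1e-10, max_iter=100):
    for k in range(max_iter):
        g = grad(w)
        if abs(g) < tol:
            return w, k
        w -= g / hess(w)               # curvature-scaled (second-order) step
    return w, max_iter

w_gd, it_gd = gradient_descent()
w_nt, it_nt = newton()
print(f"gradient descent: w* = {w_gd:.6f} after {it_gd} iterations")
print(f"Newton's method : w* = {w_nt:.6f} after {it_nt} iterations")
```

On this toy problem Newton's method reaches the tolerance in a handful of iterations while gradient descent needs dozens; the trade-off, of course, is that each Newton step requires second-derivative (Hessian) information.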
Is Newton's method really needed if gradient descent can be modified to rectify all the problems faced during machine learning?
Whether the vanishing gradient problem appears depends on the choice of activation function and on the depth of the network. Newton's method and gradient descent would both face this problem with an activation such as the sigmoid, since in the flat extremes of the sigmoid both the first- and second-order derivatives are small and vanish exponentially with depth. In other words, for both methods the problem is solved by the choice of activation function.
As a side note, the first- and second-order derivatives of the sigmoid go to zero at the same rate; plot the sigmoid and its derivatives and zoom into the extremes to see this.
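To make the side note concrete, here is a small numerical check, my own illustration rather than part of the original answer, that evaluates the sigmoid's first and second derivatives away from zero and shows how the per-layer factor sigma'(x) <= 0.25 compounds with depth.

```python
# Illustrative check (not from the answer): both sigma'(x) and sigma''(x)
# vanish in the flat extremes of the sigmoid, and backpropagation multiplies
# one sigma' factor per layer, so the signal decays exponentially with depth.
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
d1 = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # sigma'(x)
d2 = lambda x: d1(x) * (1.0 - 2.0 * sigmoid(x))  # sigma''(x)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:>4}: sigma' = {d1(x):.2e}, sigma'' = {d2(x):.2e}")

# Even the best case sigma'(0) = 0.25 shrinks exponentially when multiplied
# once per layer during backpropagation.
for depth in (5, 10, 20):
    print(f"depth {depth:>2}: (0.25)^depth = {0.25 ** depth:.2e}")
```

Running this shows sigma' and sigma'' dropping from about 2.5e-01 and 0 at x = 0 to about 4.5e-05 in magnitude at x = 10, and the depth product falling below 1e-12 by 20 layers, which is the "exponentially vanishing by depth" behaviour described above.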
Historical note: Newton's method predates the vanishing gradient problem (which was encountered after the introduction of backpropagation in the 1960s) by centuries.
answered Mar 20 at 15:21 by Esmailian; edited Mar 20 at 16:17
In this paper, section 2.1 says that scale invariance is important because it eliminates the need to tune the learning rate; why, then, is the scalability of gradient descent preferred over the scale invariance of Newton's method? – Aman, Mar 20 at 15:35
@Aman removed the controversial parts – Esmailian, Mar 20 at 15:51
Which is worse in terms of efficiency: a vanishing gradient in Newton's method, or a vanishing gradient in gradient descent? I'm new to machine learning, so I would appreciate it if you could clarify or at least point me in the right direction. Thank you. – Aman, Mar 20 at 15:53