Understanding minimizing cost correctly
I cannot wrap my head around this simple concept.
Suppose we have a linear regression, and there is a single parameter theta to be optimized (for simplicity purposes):
$h(x) = \theta \cdot x$
The error cost function could be defined as $J(\theta) = \frac{1}{m} \cdot \sum (h(x) - y(x))^2$, for each $x$.
Then, theta would be updated as:
$\theta = \theta - \alpha \cdot \frac{1}{m} \cdot \sum (h(x) - y(x)) \cdot x$, for each $x$.
From my understanding, the multiplier after the alpha term is the derivative of the error cost function $J$. This term tells us which direction to head in to arrive at the minimum, taking one small step at a time. I think I understand the concept of "hill climbing" correctly.
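The update rule above can be sketched in a few lines of Python. This is only a minimal illustration: the toy data, the learning rate `alpha`, and the step count are assumptions, not values from the question. Note also that the exact derivative of $J$ as written carries an extra factor of 2, which is usually absorbed into $\alpha$ (or removed by defining $J$ with $\frac{1}{2m}$).

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, steps=200, theta=0.0):
    """Fit h(x) = theta * x using the update rule from the question."""
    m = len(x)
    for _ in range(steps):
        residual = theta * x - y                   # h(x) - y(x) for every sample
        step = (1.0 / m) * np.sum(residual * x)    # the multiplier after alpha
        theta = theta - alpha * step               # move against the slope
    return theta

# Hypothetical toy data generated from y = 3x, so the minimum is at theta = 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

theta_hat = gradient_descent(x, y)
print(theta_hat)
```

With a learning rate that is too large for the data the iteration diverges instead of settling, which is one practical difference from a direct solve.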
Here is what I cannot wrap my head around:
If the form of the error function is known (as in our case: we could plot the function by plugging enough values of theta into the model), why can't we take the first derivative and set it to zero (the partial derivatives, if the function has multiple thetas)? This way we would have all the minima of the function. Then, with the second derivative, we could determine whether each one is a min or a max.
I've seen this done in calculus for simple functions like $y = x^2 + 5x + 2$ (many years ago, maybe I am wrong), so what is stopping us from doing the same thing here?
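For the single-parameter model in the question, the procedure described here can be carried out by hand. The following is a sketch, with the samples written as $(x_i, y_i)$ (an indexing assumed here for clarity) and assuming at least one $x_i \neq 0$:

$$
\begin{aligned}
J(\theta) &= \frac{1}{m} \sum_i (\theta x_i - y_i)^2 \\
\frac{dJ}{d\theta} &= \frac{2}{m} \sum_i (\theta x_i - y_i)\, x_i = 0
\quad\Longrightarrow\quad \theta^* = \frac{\sum_i x_i y_i}{\sum_i x_i^2} \\
\frac{d^2 J}{d\theta^2} &= \frac{2}{m} \sum_i x_i^2 > 0
\end{aligned}
$$

The second derivative is positive everywhere, so the single stationary point $\theta^*$ is a minimum.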
Sorry for asking such a silly question.
Thank you.
linear-regression cost-function
edited 2 days ago
Siong Thye Goh
asked 2 days ago
zafirzarya
1 Answer
Consider differentiating this: $$\nabla_\theta \|X\theta - y\|^2 = 2X^T(X\theta - y) = 0$$
Hence, solving this would give us $$X^T X\theta = X^T y$$
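As a rough illustration of the equation above, here is a NumPy sketch that solves the normal equation directly and compares the result with `np.linalg.lstsq`, which works on $X$ itself via an SVD-based routine. The data, the noise level, and `true_theta` are made up for the example.

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features, small additive noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=100)

# Solve X^T X theta = X^T y as a linear system (no explicit inverse).
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that never forms X^T X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)
print(theta_lstsq)
```

On a small, well-conditioned problem like this one the two results agree closely; the concerns about stability and cost only show up for ill-conditioned or very large $X$.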
Solving this would, in theory, give us the optimal solution. However, numerical stability is an issue, and so is computational complexity: solving the linear system directly takes time cubic in the number of parameters.
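One way to see the stability issue: for a full-column-rank $X$, the 2-norm condition number of $X^TX$ is the square of that of $X$, so forming $X^TX$ amplifies any near-collinearity. A small sketch, using a hypothetical design matrix with two almost identical columns:

```python
import numpy as np

# Hypothetical design matrix with two nearly collinear columns.
eps = 1e-4
X = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps],
              [1.0, 1.0 - eps]])

cond_X = np.linalg.cond(X)           # ratio of largest to smallest singular value
cond_XtX = np.linalg.cond(X.T @ X)   # condition number after forming X^T X

print(cond_X, cond_XtX, cond_X ** 2)
```

The larger the condition number, the more rounding errors in the data are magnified in the solution, which is why solvers that work on $X$ directly are generally preferred over forming $X^TX$.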
Also, sometimes we do not even have a closed form, in which case a gradient-based approach is more applicable.
answered 2 days ago
Siong Thye Goh
Thank you for replying. However, I am not mathematically literate enough to understand your answer. Is there a simpler answer?
– zafirzarya
2 days ago
I found an answer on Math Stack Exchange (MSE) that illustrates why computing $X^TX$ is bad. Most approaches that solve the normal equation directly are more expensive than a gradient-based approach. Gradient-based approaches have also been adapted into a sampling-based variant, known as stochastic gradient descent, that can handle very large data sets.
– Siong Thye Goh
2 days ago
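To make the stochastic gradient descent remark concrete, here is a minimal sketch for the single-parameter model $h(x) = \theta \cdot x$ from the question. Each update uses one sample instead of the full sum. The data, learning rate, and epoch count are assumptions chosen so that the toy run converges; the data are noise-free for simplicity.

```python
import numpy as np

def sgd(x, y, alpha=0.01, epochs=50, seed=0):
    """Stochastic gradient descent for h(x) = theta * x, one sample per step."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order each epoch.
        for i in rng.permutation(len(x)):
            grad_i = (theta * x[i] - y[i]) * x[i]   # gradient from sample i only
            theta -= alpha * grad_i
    return theta

# Hypothetical noise-free data generated from y = 3x.
x = np.linspace(0.0, 5.0, 200)
y = 3.0 * x

theta_sgd = sgd(x, y)
print(theta_sgd)
```

Because each step touches a single sample, the cost per update does not grow with the size of the data set, which is what makes this family of methods practical for very large data.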