Normalizing the data set
I have two questions:
- Why doesn't normalization have any effect on the performance of a linear regressor? (A mathematical approach is appreciated.)
- When we normalize the training set, we ought to normalize the target set too. Won't that affect performance? I mean, won't the data set change completely because the model sees different ranges of features compared to the ranges in the target set?

I tried googling these questions but could not come to a conclusion. Any help would be appreciated.

Thanks!

linear-regression normalization

– Apoorv Jain, asked 2 days ago (edited yesterday by I_Play_With_Data)
Can you share the code in which you implement this linear regressor? – Alireza Zolanvari, yesterday

What do you mean by "performance"? Computational performance, score performance, residuals? – gented, yesterday

I mean score performance. – Apoorv Jain, yesterday

A linear regressor will certainly be affected by the scaling, so make sure that you did it correctly. Since it assigns weights to the columns, it will just pick the ones that help it reach the target. – Aditya, yesterday
1 Answer
- Why doesn't normalization have any effect on linear regressor performance (a mathematical approach is appreciated)?

Theoretically, normalization does not influence the performance of the model. To see this, consider a standard linear regression:

$$y_i = \boldsymbol{w}^T\boldsymbol{x}_i + b + \varepsilon_i$$

Introduce the scaled independent variable $\boldsymbol{z}_i=\dfrac{1}{\sigma}\left[\boldsymbol{x}_i - \bar{\boldsymbol{x}}\right]$, so that $\boldsymbol{x}_i=\bar{\boldsymbol{x}}+\sigma\boldsymbol{z}_i$. This results in

$$y_i = \boldsymbol{w}^T\bar{\boldsymbol{x}}+\sigma\boldsymbol{w}^T\boldsymbol{z}_i+b+\varepsilon_i.$$

If we introduce $\tilde{b}=b+\boldsymbol{w}^T\bar{\boldsymbol{x}}$ and $\tilde{\boldsymbol{w}}^T=\sigma\boldsymbol{w}^T$, we can rewrite the equation as

$$y_i=\tilde{\boldsymbol{w}}^T\boldsymbol{z}_i+\tilde{b}+\varepsilon_i.$$

Hence, transforming the independent variables merely shifts the bias (if a translation is included) and rescales the weights by $\sigma$; the fitted model and its predictions stay the same. The significance of the parameters does not change, only their numerical values.
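To make this concrete, here is a minimal sketch (my addition, not part of the original answer) using scikit-learn on synthetic data. It fits ordinary least squares on the raw features and on standardized features and checks that the predictions and R² agree; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.05, 300.0]) + 5.0 + rng.normal(scale=0.1, size=200)

# Fit on raw features
raw = LinearRegression().fit(X, y)

# Fit on standardized features (zero mean, unit variance)
scaler = StandardScaler().fit(X)
scaled = LinearRegression().fit(scaler.transform(X), y)

# Predictions and R^2 agree up to floating-point error, even though the
# coefficients themselves are rescaled column by column.
print(np.allclose(raw.predict(X), scaled.predict(scaler.transform(X))))  # True
print(raw.score(X, y), scaled.score(scaler.transform(X), y))             # identical R^2
```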
But if normalization does not actually improve the model, why do we still do it? The reasons are mostly computational. If the inputs take very large values, the weights must be correspondingly small to map the regression output into a reasonable range, and features on very different scales make the problem poorly conditioned, which slows down or destabilizes iterative solvers such as gradient descent. It is therefore better to normalize the inputs so that the parameters live on comparable scales.
- When we normalize the training set, we ought to normalize the target set too. Won't it affect performance? I mean, won't the data set change completely because the model sees different ranges of features compared to the ranges in the target set?

Normalizing the output is not necessary, but it can also improve numerical behaviour. Apply the same kind of linear transformation to the dependent variable (the output) and you will see that the model can again be rewritten as a standard linear regression in the new output. Just remember to transform your inputs and to back-transform your outputs if you want to work in the original units.
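As a small illustration of the back-transformation step (again my addition, with made-up data), training entirely in the normalized space and inverting the target scaling afterwards reproduces the predictions of a model trained on the original units:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 10.0 + rng.normal(scale=0.1, size=200)

x_scaler = StandardScaler().fit(X)
y_scaler = StandardScaler().fit(y.reshape(-1, 1))

# Train entirely in the normalized space
model = LinearRegression().fit(x_scaler.transform(X),
                               y_scaler.transform(y.reshape(-1, 1)))

# Predict in normalized space, then back-transform to the original units
y_pred = y_scaler.inverse_transform(model.predict(x_scaler.transform(X)))

# Compare with a model trained directly on the original units
plain = LinearRegression().fit(X, y)
print(np.allclose(y_pred.ravel(), plain.predict(X)))  # True, up to floating-point error
```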
Does the significance of the parameters change?

To show that scaling the inputs by a constant factor $\sigma$ does not influence the significance of the parameters, we calculate the $t$-value for a given coefficient. If the $t$-value stays invariant, the $p$-value stays invariant as well.

Consider the linear regression $y_i = \boldsymbol{w}^T\boldsymbol{x}_i+\varepsilon_i$ (bias absorbed into the weight vector). The regression coefficients are given by

$$\hat{\boldsymbol{w}}=\left[\boldsymbol{X}^T\boldsymbol{X}\right]^{-1}\boldsymbol{X}^T\boldsymbol{y}, \quad \text{in which} \quad \boldsymbol{X}=\begin{bmatrix}\boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_N^T\end{bmatrix}$$

is the data matrix with an added $1$-column for the bias.

Additionally, we need the matrix

$$\boldsymbol{C}=\left[\boldsymbol{X}^T\boldsymbol{X}\right]^{-1} \quad \text{and} \quad s_e = \sqrt{\dfrac{(\boldsymbol{y}-\hat{\boldsymbol{y}})^T(\boldsymbol{y}-\hat{\boldsymbol{y}})}{N-p-1}},$$

in which $N$ is the number of observations, $p$ is the number of predictors, $\boldsymbol{y}$ is the vector of outputs and $\hat{\boldsymbol{y}}$ is the vector of predicted outputs. We saw above that the predicted values stay invariant under scaling, hence $s_e$ is invariant under scaling.

The $t$-value for a regression weight is given by

$$t_i= \dfrac{\hat{w}_i-\mathbb{E}\left[\hat{w}_i\right]}{s_e\sqrt{c_{ii}}},$$

where the $c_{ii}$ are the corresponding diagonal entries of the matrix $\boldsymbol{C}$.

If we scale our inputs by $\sigma$ (ignoring the $1$-column of the data matrix, which is only relevant for the bias), then we observe

$$\boldsymbol{X}'=\sigma\boldsymbol{X}, \quad \hat{\boldsymbol{w}}' = \dfrac{1}{\sigma}\hat{\boldsymbol{w}}, \quad \mathbb{E}\left[\hat{\boldsymbol{w}}'\right]=\dfrac{1}{\sigma}\mathbb{E}\left[\hat{\boldsymbol{w}}\right] \quad \text{and} \quad \boldsymbol{C}' = \dfrac{1}{\sigma^2}\boldsymbol{C}.$$

The last relation implies

$$c_{ii}' = \dfrac{1}{\sigma^2}c_{ii} \implies \sqrt{c_{ii}'} = \dfrac{1}{\sigma}\sqrt{c_{ii}}.$$

By these observations, the $t$-value stays invariant:

$$t' = \dfrac{\hat{w}_i'-\mathbb{E}\left[\hat{w}_i'\right]}{s_e\sqrt{c_{ii}'}}=\dfrac{\hat{w}_i-\mathbb{E}\left[\hat{w}_i\right]}{s_e\sqrt{c_{ii}}}=t.$$

Hence, the significance of the regression coefficients does not change either. The same should hold when the variables are scaled individually, but the algebra becomes more involved.
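A quick empirical check of this invariance (my addition, not from the original answer), using statsmodels on synthetic data: the t-statistics reported for the raw and the uniformly scaled predictors coincide, while the coefficients differ by the factor $1/\sigma$.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(size=100)

sigma = 10.0  # uniform scaling factor applied to all predictors

# OLS on raw predictors (with intercept) and on uniformly scaled predictors
fit_raw = sm.OLS(y, sm.add_constant(X)).fit()
fit_scaled = sm.OLS(y, sm.add_constant(sigma * X)).fit()

# Coefficients are rescaled by 1/sigma, but the t-values match
print(fit_raw.tvalues)
print(fit_scaled.tvalues)
print(np.allclose(fit_raw.tvalues, fit_scaled.tvalues))  # True
```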
– MachineLearner (new contributor), answered yesterday, edited 17 hours ago
This is slightly incorrect. The $\sigma$ aren't necessarily the same for all components, and in that case I'm not sure you can eventually factor them out. – gented, yesterday

Yes, you can, but the coefficients you end up with are different from the original ones (not just a re-scaling if the $\sigma$ differ). Essentially you just showed that a composition of two affine maps is an affine map, which is obvious, but that has nothing to do with invariance of the performance. – gented, 19 hours ago

The intercept changes and the coefficients are re-scaled individually: this means that the significance tests may or may not result in different values. Basically my point is that an answer to the question must prove that such tests don't change, which you haven't. The same goes for the residuals (they may or may not change, but it must be proven). A sketch of the answer is provided here: stats.stackexchange.com/questions/162399/… – gented, 19 hours ago

@gented I added the claimed result for uniform scaling. Nonuniform scaling should work as well, but the algebra is more involved. – MachineLearner, 17 hours ago

Thank you, it's now a thorough answer, +1 :) – gented, 17 hours ago