Normalizing the data set


I have two questions:

  1. Why doesn't normalization have any effect on linear regressor performance (a mathematical approach is appreciated)?

  2. When we normalize the training set, we ought to normalize the target set too. Won't that affect the performance? I mean, won't the data set change completely, because the model had different ranges of features compared to the ranges of features in the target set?

I tried googling these questions but was not able to come to a conclusion. Any help would be appreciated.

Thanks!

Tags: linear-regression, normalization

– asked 2 days ago by Apoorv Jain










  • Can you give the code in which you implement this linear regressor? – Alireza Zolanvari, yesterday

  • What do you mean by "performance"? Computational performance, score performance, residuals? – gented, yesterday

  • I mean score performance. – Apoorv Jain, yesterday

  • A linear regressor will be affected by the scaling for sure, so make sure that you did it correctly. Otherwise, since it assigns weights to the columns, it will just pick the ones which help it reach the target. – Aditya, yesterday
















1 Answer
Answered yesterday by MachineLearner:

  1. Why doesn't normalization have any effect on linear regressor performance (a mathematical approach is appreciated)?

Theoretically, normalization does not influence the performance of the model. To see this, let us look at a standard linear regression:
$$y_i = \boldsymbol{w}^T\boldsymbol{x}_i + b + \varepsilon_i.$$

If we introduce the scaled independent variable $\boldsymbol{z}_i=\dfrac{1}{\sigma}\left[\boldsymbol{x}_i - \bar{\boldsymbol{x}}\right] \implies \boldsymbol{x}_i=\bar{\boldsymbol{x}}+\sigma\boldsymbol{z}_i$, this results in
$$y_i = \boldsymbol{w}^T\bar{\boldsymbol{x}}+\sigma\boldsymbol{w}^T\boldsymbol{z}_i+b+\varepsilon_i.$$

Defining $\tilde{b}=b+\boldsymbol{w}^T\bar{\boldsymbol{x}}$ and $\tilde{\boldsymbol{w}}^T=\sigma\boldsymbol{w}^T$, we can rewrite the equation as

$$y_i=\tilde{\boldsymbol{w}}^T\boldsymbol{z}_i+\tilde{b}+\varepsilon_i.$$

Hence, transforming the independent variables merely changes the bias (if a translation is included) and scales the weights by $\sigma$. The significance of the parameters does not change, only their fitted values do.

But if normalization does not really enhance the model, why do we still do it? The reason is mainly computational. If the inputs had very large values, the weights would need to be very small to bring the regression output into a reasonable range, and such a large spread of numerical magnitudes makes the computation less well-behaved. Hence, it is better to normalize the inputs so that the parameters do not have to scale the inputs down to fit the output.
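As a quick numerical illustration of the point above, here is a minimal sketch in Python (the data is synthetic, and scikit-learn's `LinearRegression`/`StandardScaler` are just one convenient choice, not something the question prescribes): standardizing the inputs leaves the fitted values and the $R^2$ score unchanged; only the coefficients and the intercept are rescaled.

```python
# Minimal sketch: OLS predictions and R^2 are invariant under standardizing the inputs.
# Synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=[100.0, 0.1, 5.0], size=(200, 3))   # features on very different scales
y = X @ np.array([0.3, -40.0, 2.0]) + 7.0 + rng.normal(scale=1.0, size=200)

raw = LinearRegression().fit(X, y)
Xz = StandardScaler().fit_transform(X)                # z = (x - mean) / std, per feature
std = LinearRegression().fit(Xz, y)

print(raw.score(X, y), std.score(Xz, y))              # identical R^2
print(np.allclose(raw.predict(X), std.predict(Xz)))   # True: identical predictions
```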




  2. When we normalize the training set, we ought to normalize the target set too. Won't it affect the performance? I mean, won't the data set change completely, because the model had different ranges of features compared to the ranges of features in the target set?

Normalizing the output is not necessary, but it can also improve numerical efficiency. You can apply the same kind of linear transformation to your dependent variable (the output), and you will see that the result can again be rewritten as a standard linear regression in the new output. Just remember to transform your inputs and back-transform your outputs if you want to work with the original variables.
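In the same hedged spirit, here is a second minimal sketch (again synthetic data, with scikit-learn as an assumed convenience): if the target is standardized as well, back-transforming the predictions recovers exactly what a fit on the raw data would have produced.

```python
# Minimal sketch: normalizing the target changes nothing once the predictions
# are transformed back to the original units. Synthetic data for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=1000.0, scale=200.0, size=(150, 2))
y = X @ np.array([0.05, -0.02]) + 3.0 + rng.normal(scale=0.5, size=150)

raw_pred = LinearRegression().fit(X, y).predict(X)

Xz = StandardScaler().fit_transform(X)
y_mean, y_std = y.mean(), y.std()
yz = (y - y_mean) / y_std                              # normalize the target too

scaled_pred = LinearRegression().fit(Xz, yz).predict(Xz)
back = scaled_pred * y_std + y_mean                    # undo the target normalization

print(np.allclose(raw_pred, back))                     # True
```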



Does the significance of the parameters change?

In order to show that scaling the inputs by a constant factor $\sigma$ does not influence the significance of the parameters, we calculate the $t$-value for a given coefficient. If the $t$-value stays invariant, the $p$-value stays invariant as well.

For the linear regression $y_i = \boldsymbol{w}^T\boldsymbol{x}_i+\varepsilon_i$ (bias absorbed into the weight vector), the regression coefficients are given by

$$\hat{\boldsymbol{w}}=\left[\boldsymbol{X}^T\boldsymbol{X}\right]^{-1}\boldsymbol{X}^T\boldsymbol{y}, \quad \text{in which} \quad
\boldsymbol{X}=\begin{bmatrix}\boldsymbol{x}_1^T \\ \vdots \\ \boldsymbol{x}_N^T\end{bmatrix}$$

is the data matrix with an added $1$-column for the bias.

Additionally, we need the matrix
$$\boldsymbol{C}=\left[\boldsymbol{X}^T\boldsymbol{X}\right]^{-1} \quad \text{and} \quad s_e = \sqrt{\dfrac{(\boldsymbol{y}-\hat{\boldsymbol{y}})^T(\boldsymbol{y}-\hat{\boldsymbol{y}})}{N-p-1}},$$

in which $N$ is the number of observations, $p$ is the number of predictors, $\boldsymbol{y}$ is the vector of outputs and $\hat{\boldsymbol{y}}$ is the vector of predicted outputs. We saw above that the predicted values stay invariant under scaling; hence $s_e$ is invariant under scaling.

The $t$-value for a regression weight is given by

$$t_i= \dfrac{\hat{w}_i-\mathbb{E}\left[\hat{w}_i\right]}{s_e\sqrt{c_{ii}}}.$$

The $c_{ii}$ are the corresponding diagonal entries of the $\boldsymbol{C}$ matrix.

If we scale our inputs by $\sigma$ (ignoring the $1$-column of the data matrix, which only matters for the bias), then we observe

$$\boldsymbol{X}'=\sigma\boldsymbol{X}, \quad \hat{\boldsymbol{w}}' = \dfrac{1}{\sigma}\hat{\boldsymbol{w}}, \quad \mathbb{E}\left[\hat{\boldsymbol{w}}'\right]=\dfrac{1}{\sigma}\mathbb{E}\left[\hat{\boldsymbol{w}}\right] \quad \text{and} \quad \boldsymbol{C}' = \dfrac{1}{\sigma^2}\boldsymbol{C}.$$

The last relation implies
$$c_{ii}' = \dfrac{1}{\sigma^2}c_{ii} \implies \sqrt{c_{ii}'} = \dfrac{1}{\sigma}\sqrt{c_{ii}}.$$

With these observations, we see that the $t$-value stays invariant:

$$t_i' = \dfrac{\hat{w}_i'-\mathbb{E}\left[\hat{w}_i'\right]}{s_e\sqrt{c_{ii}'}}=\dfrac{\hat{w}_i-\mathbb{E}\left[\hat{w}_i\right]}{s_e\sqrt{c_{ii}}}=t_i.$$

Hence, the significance of the regression coefficients does not change either. In principle, the same applies when scaling the variables individually, but the algebra becomes more involved.
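To make the derivation concrete, here is a small, hedged numerical check in plain NumPy (the `t_values` helper below is an illustrative implementation of the formula above under $H_0: w_i = 0$, i.e. $\mathbb{E}[\hat{w}_i]=0$, not a reference routine): scaling the inputs by a uniform factor $\sigma$ leaves every $t$-value unchanged.

```python
# Minimal sketch: t-statistics of OLS coefficients are invariant under
# uniform input scaling. Synthetic data for illustration only.
import numpy as np

def t_values(X, y):
    """t-statistics for OLS coefficients (including the bias), testing H0: w_i = 0."""
    Xb = np.column_stack([np.ones(len(X)), X])     # add the 1-column for the bias
    C = np.linalg.inv(Xb.T @ Xb)                   # C = [X^T X]^{-1}
    w_hat = C @ Xb.T @ y                           # OLS coefficients
    resid = y - Xb @ w_hat
    N, p = X.shape
    s_e = np.sqrt(resid @ resid / (N - p - 1))
    return w_hat / (s_e * np.sqrt(np.diag(C)))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

sigma = 25.0
print(np.allclose(t_values(X, y), t_values(sigma * X, y)))   # True: t-values are invariant
```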






  • This is slightly incorrect. The $\sigma$ aren't necessarily the same for all components and in that case I'm not sure that eventually you can factor them out. – gented, yesterday

  • Yes, you can, but the coefficients you end up with are different than the original ones (not just a re-scaling if the $\sigma$ are different). Essentially you just showed that a composition of two affine maps is an affine map, which is obvious - but that has nothing to do with invariance of the performance. – gented, 19 hours ago

  • The intercept changes and the coefficients are re-scaled individually: this means that the significance tests may result in different values (they may or they may not). Basically my point is that an answer to the question must prove that such tests don't change - which you haven't. Same goes for the residuals (they may or may not change, but it must be proven). A sketch of the answer is provided here: stats.stackexchange.com/questions/162399/… – gented, 19 hours ago

  • @gented I added the claimed result for uniform scaling. Nonuniform scaling should work as well but the algebra is more involved (see the numerical spot-check after this list). – MachineLearner, 17 hours ago

  • Thank you, it's now a thorough answer, +1 :) – gented, 17 hours ago
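Following up on that last exchange, here is a minimal numerical spot-check (not a proof) of the non-uniform case, with each feature scaled by its own factor; the illustrative `t_values` helper is repeated so the snippet runs on its own.

```python
# Minimal sketch: per-feature (non-uniform) scaling also leaves the
# t-statistics unchanged in this numerical example. Synthetic data only.
import numpy as np

def t_values(X, y):
    """t-statistics for OLS coefficients (including the bias), testing H0: w_i = 0."""
    Xb = np.column_stack([np.ones(len(X)), X])
    C = np.linalg.inv(Xb.T @ Xb)
    w_hat = C @ Xb.T @ y
    resid = y - Xb @ w_hat
    N, p = X.shape
    s_e = np.sqrt(resid @ resid / (N - p - 1))
    return w_hat / (s_e * np.sqrt(np.diag(C)))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

scales = np.array([10.0, 0.01, 500.0])                        # a different factor per feature
print(np.allclose(t_values(X, y), t_values(X * scales, y)))   # True
```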









