Test RMSE of polynomial regression drops when using more variables?

I am testing polynomial regression on a data set with 50 variables and a sample size of 5000. I ordered the coefficients of a linear model from largest to smallest and then built a series of models using the p most explanatory variables. The RMSE values of these models are shown in the figures below. For polynomial regression of degree 2 everything looks normal, but for degree 3 something strange happens. (The dotted line marks the location of the minimum RMSE.)



[Figure 1: comparison of RMSE in polynomial regression of degree 2]
[Figure 2: comparison of RMSE in polynomial regression of degree 3]
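Since the original code isn't shown, here is a minimal sketch of the setup as I read it, assuming scikit-learn. The ranking criterion (absolute coefficients of a linear fit on standardized inputs), the train/test split, and all names are my assumptions, not taken from the post:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Placeholder data standing in for the real 5000 x 50 data set.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5000, 50)), rng.normal(size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Rank the variables by the absolute coefficients of a linear fit on
# standardized inputs (one way to read "most explanatory variables").
scaler = StandardScaler().fit(X_train)
lin = LinearRegression().fit(scaler.transform(X_train), y_train)
order = np.argsort(-np.abs(lin.coef_))

def rmse_for(p, degree):
    """Test RMSE of a degree-`degree` polynomial on the top-p variables."""
    cols = order[:p]
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train[:, cols]), y_train)
    pred = model.predict(poly.transform(X_test[:, cols]))
    return np.sqrt(mean_squared_error(y_test, pred))

# Going all the way to p = 50 gets expensive at degree 3, so cap the sweep.
curve = [rmse_for(p, degree=3) for p in range(1, 31)]
```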



I cannot understand why the test RMSE drops again when using more than 27 variables. The model seems to start underfitting rather than overfitting as more variables are added. Interestingly, this happens at the same point at which the training RMSE falls to around 1e-14. Meanwhile, the adjusted R-squared shows the following strange behaviour at that point:



[Figure 3: comparison of adjusted R² in polynomial regression of degree 3]



Am I simply using too many dimensions for this polynomial regression, or is there another reason this is happening? I would love to understand what is going on. I will soon move on to better algorithms; I just wanted to establish a benchmark for my ML project. Thanks in advance!
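One arithmetic check worth doing here (my own calculation; the 75/25 split ratio is an assumption, not stated in the post): with all interaction terms included, a degree-3 polynomial in p variables has C(p+3, 3) coefficients, and that count crosses a 3750-point training set right around p = 27:

```python
from math import comb

n_train = 3750  # assuming a 75/25 split of the 5000 samples (my assumption)
for p in [25, 26, 27, 28, 30]:
    k = comb(p + 3, 3)  # monomials of degree <= 3 in p variables, incl. intercept
    print(f"p={p:2d}: {k:5d} features{' >= n_train' if k >= n_train else ''}")
# p=26 gives 3654 features; p=27 gives 4060. Once the feature count exceeds
# the number of training points, ordinary least squares can interpolate the
# training set, which would be consistent with a training RMSE of ~1e-14,
# and the adjusted R-squared denominator (n - k - 1) goes to zero or negative.
```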










machine-learning regression supervised-learning model-selection

asked Apr 3 at 5:19 by Salmon

  • The cubics tend to explode, especially on data that has not been seen in the training set. Also, multicollinearity between variables is more likely if all variables are cubed in your model. – Nutle, Apr 3 at 10:14










  • If the cubics explode, wouldn't the learning algorithm just assign 0 to their coefficients? I still do not get why the test error decreases when more features are learned. – Salmon, Apr 4 at 22:49










  • The learning algorithm learns only on your training set, and you can see that the RMSE is fine in training but explodes in testing. So if the results look fine in training, the algorithm sees no need to assign 0 to the coefficients. The high test errors you are seeing indicate a model that was poorly selected in training (in other words, a possible overfit - though my guess would still be an exploded third power). Did you try plotting fitted vs. actual values for both the good models and the one with large RMSE? (See the sketch after these comments.) – Nutle, Apr 4 at 23:57











  • Thanks for your reply! I understand that high test errors are caused by an overfit model, but the part I cannot wrap my head around is something else: as the number of variables in the model increases (see figure 2 in the original post), the testing error first blows up and then becomes normal again. Why doesn't it keep blowing up? – Salmon, Apr 5 at 6:35










  • I don't think it becomes normal - from what I can tell from the graph, the RMSE is still higher than when using 1 variable - but I get your point. Did you try analyzing or plotting the specific models? Do you get the same RMSE results when bootstrapping your sample or shuffling between different train/test splits? (The sketch below covers this too.) – Nutle, Apr 5 at 9:09
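As the question's code isn't shown, here is a minimal sketch of both diagnostics Nutle suggests, continuing the placeholder names from the sketch above (`X`, `y`, `order`, and the p = 30 suspect model are my assumptions):

```python
import matplotlib.pyplot as plt

cols = order[:30]                 # one suspect model: degree 3, top 30 variables
poly = PolynomialFeatures(degree=3)

# 1) Fitted vs. actual on the test set for the suspect model.
model = LinearRegression().fit(poly.fit_transform(X_train[:, cols]), y_train)
pred = model.predict(poly.transform(X_test[:, cols]))
plt.scatter(y_test, pred, s=5)
plt.xlabel("actual")
plt.ylabel("fitted")
plt.show()

# 2) Stability check: refit the same model over several random train/test splits.
for seed in range(5):
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=seed)
    m = LinearRegression().fit(poly.fit_transform(Xtr[:, cols]), ytr)
    rmse = np.sqrt(mean_squared_error(yte, m.predict(poly.transform(Xte[:, cols]))))
    print(f"seed {seed}: test RMSE = {rmse:.3f}")
```

If the large test RMSE appears only for some seeds, the blow-up is likely driven by a few extreme test points hitting an exploding cubic term rather than by the model family itself.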















