Manual feature engineering based on the output


So, I'm working on an ML model whose potential predictors are: age, a code for the person's city, social status (married, single and so on), and number of children. The output, signed, is binary (0 or 1). That's the initial dataset I have.

The goal is to use those features to predict the value of signed for each person.

I already generated predictions on unseen data. Validating the predicted results against the real data, I get 25% accuracy, while cross-validation gives me 65% accuracy. So I thought: overfitting.

Here is my question. I went back to the early stages of the whole process and started creating new features. For example, instead of the raw city code, which makes no sense as input to an ML model, I created classes based on the percentage of signed: the city with the higher percentage of 'signed' (the output) gets assigned a higher value of class_city. This considerably improved the signed/class_city relationship in my correlation matrix, which makes sense. Is what I'm doing correct, or shouldn't I create features based on the output?

Here is my correlation matrix (image omitted).

After re-modelling with 3 features only (department_class, age and situation), I tested my model on unseen data made of 148 rows, compared to 60k rows in the training file.

The first model, with the old feature (the ID of the department), gave 25% accuracy, while the second model, with the new feature class_department, gave 71% (again on unseen data).

Note: the first model (25%) also has some other ID features; they might be contributing to such weak accuracy along with department_ID.
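
In code, the encoding described above amounts to roughly the following simplified pandas sketch (the column names and the exact class assignment are only illustrative):

```python
import pandas as pd

# Toy data: one row per person, a raw city code and the binary target 'signed'.
df = pd.DataFrame({
    "city_code": [101, 101, 102, 102, 103, 103],
    "signed":    [1,   1,   0,   1,   0,   0],
})

# Percentage of 'signed' per city, computed on the same data the model is trained on.
signed_rate = df.groupby("city_code")["signed"].mean()

# Cities with a higher signed rate get a higher class value.
df["class_city"] = df["city_code"].map(signed_rate.rank(method="dense"))
print(df)
```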










Tags: machine-learning, feature-selection, feature-engineering, correlation, data-leakage

asked Mar 19 at 14:08 by Blenzus, a new contributor (edited Mar 20 at 10:44)

– Ben Reiniger (Mar 19 at 18:24): This is called "target encoding." As you and the answers note, the encoding should not involve your test set(s). Also possibly problematic are rare levels in the variable.

– Blenzus (Mar 20 at 8:21): Thank you for your answer. My test set is untouched and unused during the whole process, if that's what you mean by not involving my test set.

2 Answers

You can create features based on output values, but you should be careful in doing this.

When you use the value of class_city (based on the percentage of signed for that city) for a given data point, note that this calculation cannot include the current data point, since you will not have the value of signed during prediction.

One way to handle this is to split the total data you have into three parts: estimation, train, and test. The estimation set is used only to estimate the class_city values for each city. These values can then be used in the train and test data. This way, you have the label-based values without your model doing anything 'unfair'. For testing, you can in fact use the data from the estimation and train sets together to estimate the class_city values for use in the test set. The same holds true for any unseen data: you can use the class_city values estimated from all the previous data points.

In the context of time series data, for example, the class_city value for any data point can potentially use information from all previous data points, but should not use any information from future data points!
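
A minimal sketch of this estimation/train/test idea, assuming pandas and scikit-learn (the column names, split sizes and synthetic data are only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: one row per person, a raw 'city' code and the binary target 'signed'.
rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.integers(0, 50, 2000),
                   "signed": rng.integers(0, 2, 2000)})

# 1) Carve out an estimation set that is used ONLY to compute the encoding.
estimation, rest = train_test_split(df, train_size=0.3, random_state=0)
train, test = train_test_split(rest, test_size=0.3, random_state=0)

# 2) Per-city signed rate, computed on the estimation set alone.
city_rate = estimation.groupby("city")["signed"].mean()
fallback = estimation["signed"].mean()   # for cities absent from the estimation set

# 3) Apply the encoding to train and test; neither split's own labels are involved.
train = train.assign(class_city=train["city"].map(city_rate).fillna(fallback))
test = test.assign(class_city=test["city"].map(city_rate).fillna(fallback))
```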






answered Mar 19 at 17:07 by raghu (edited Mar 20 at 11:48)

– Blenzus (Mar 20 at 8:29): Thank you for your answer. I'm not quite familiar with the concept of data points, but this makes sense!

– Dan Carter (Mar 20 at 12:28): This last part about future data is important. It's also important not to use data from the current date, as this would also be a data leak.

– Blenzus (yesterday): Yes, this is exactly what I was looking for!

No, you should not do this; it causes a data leak. Data leaks happen when the data you use to train a machine learning algorithm happens to contain the information you are trying to predict.

It gives your model information about your test data during training. That makes your test scores overly optimistic and makes the model worse at generalizing to totally unseen data.

Good resource on data leakage
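
A small sketch of the effect, assuming pandas and scikit-learn (the column names mirror the question but the data is made up): here the target is pure noise, so the only way to beat 50% accuracy on the test set is leakage, and the encoding computed on all rows (test included) typically does exactly that, while the one computed on the training rows alone stays near chance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 'city' is a high-cardinality ID with no real relationship to the target.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"city": rng.integers(0, 2000, n),
                   "signed": rng.integers(0, 2, n)})

train, test = train_test_split(df, test_size=0.2, random_state=0)
baseline = train["signed"].mean()

# Leaky encoding: per-city signed rate computed on ALL rows, test rows included.
leaky_rate = df.groupby("city")["signed"].mean()
# Clean encoding: per-city signed rate computed on the training rows only.
clean_rate = train.groupby("city")["signed"].mean()

for name, rate in [("leaky", leaky_rate), ("clean", clean_rate)]:
    X_train = train["city"].map(rate).fillna(baseline).to_frame()
    X_test = test["city"].map(rate).fillna(baseline).to_frame()
    model = LogisticRegression().fit(X_train, train["signed"])
    print(name, round(accuracy_score(test["signed"], model.predict(X_test)), 3))
```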






answered Mar 19 at 15:04 by Simon Larsson

– Blenzus (Mar 19 at 15:34): Thanks for the answer. What would you recommend in the case of IDs being given as input in a dataset (as with department_code)? What's a common approach for that kind of data to improve my model?

– Simon Larsson (Mar 19 at 16:17): Glad it helped! That is really a new question not directly related to this one, and not really for the comments. If you feel that the question you asked here has been answered, mark this as correct. If you have new things you want answered, you should open a new question.

– Blenzus (Mar 19 at 16:27): Your comment makes sense, but my results contradict your answer. I'm waiting for maybe a better answer; if not, I'll mark your answer as correct (again, it makes sense that I shouldn't give away information about the values I want to predict in the features). Note: I'm posting my results with the old feature and the new feature on unseen data.

– Simon Larsson (Mar 19 at 16:38): How does the result contradict my answer? You got a higher score on both the train and test sets, which is what you expect from data leakage. Or is there something I am missing? I am only addressing "shouldn't I create features based on the output?"

– Simon Larsson (Mar 19 at 16:53): Yes, that is what I am saying. Data leakage gives you an overly optimistic test score.










