Manual feature engineering based on the output
I'm working on an ML model whose potential predictors are: age, a code for the person's city, social status (married/single, and so on), and the number of children. The output, signed, is binary (0 or 1). That's the initial dataset I have.

The goal is to use those features to predict the value of signed for each person.

I already generated predictions on unseen data. Validating the predicted results against the real data gives me 25% accuracy, while cross-validation gives 65%. So I thought: overfitting.

Here is my question. I went back to the early stages of the whole process and started creating new features. For example, instead of the city code, which makes no sense as input to an ML model, I created classes based on the percentage of signed: the city with the higher percentage of 'signed' (the output) gets assigned a higher value of class_city. This greatly improved the signed-class_city relationship in my correlation matrix, which makes sense. Is what I'm doing correct, or shouldn't I create features based on the output?

Here is my correlation matrix: [image]

After re-modelling with only 3 features (department_class, age, and situation), I tested my model on unseen data of 148 rows, compared to 60k rows in the training file.

The first model, with the old feature (the department ID), gave 25% accuracy, while the second model, with the new feature class_department, gave 71% (again on unseen data).

Note: the first model (25% accuracy) has some other ID features; they might be contributing to such weak accuracy alongside department_ID.
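For concreteness, the construction described above is a target (mean) encoding. A minimal pandas sketch of it on toy data, with assumed column names (city_code, signed):

```python
import pandas as pd

# Toy stand-in for the dataset described above (column names are assumptions).
df = pd.DataFrame({
    "city_code": [101, 101, 102, 102, 103, 103],
    "signed":    [1, 1, 0, 1, 0, 0],
})

# The encoding from the question: each city is replaced by the share of
# 'signed' observed for that city. Computed over ALL rows, as the question
# describes -- which is exactly where the leakage discussed below comes from.
city_rate = df.groupby("city_code")["signed"].mean()
df["class_city"] = df["city_code"].map(city_rate)
print(df)
```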
Tags: machine-learning, feature-selection, feature-engineering, correlation, data-leakage

asked Mar 19 at 14:08 by Blenzus (new contributor) · edited Mar 20 at 10:44
Comments:

– Ben Reiniger (Mar 19 at 18:24): This is called "target encoding." As you and the answers note, the encoding should not involve your test set(s). Rare levels in the variable can also be problematic.

– Blenzus (Mar 20 at 8:21): Thank you for your answer. My test set is untouched and unused during the whole process, if that's what you mean by not involving my test set.
2 Answers
You can create features based on output values, but you should be careful in doing this.

When you use the value of class_city (based on the percentage of signed for that city) for a given data point, note that this calculation cannot include the current data point, since you will not have the value of 'signed' at prediction time.

One way to handle this is to split your data into three parts: estimation, train, and test. The estimation set is used only to estimate the class_city value for each city. These values can then be used in the train and test data. This way, you have the label-derived values without your model doing anything 'unfair'. For testing, you can in fact use the data from the estimation and train sets together to estimate the class_city values for use in the test set. The same holds for any unseen data: you can use the class_city values estimated from all previous data points.

In the context of time-series data, for example, the class_city value for any data point can potentially use information from all previous data points, but should not use any information from future data points!

answered Mar 19 at 17:07, edited Mar 20 at 11:48 – raghu
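A minimal sketch of the estimation/train/test scheme this answer describes, reusing the assumed toy column names from earlier (city_code, signed); the split sizes are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the asker's data (assumed column names).
df = pd.DataFrame({
    "city_code": [101, 101, 102, 102, 103, 103, 101, 102],
    "signed":    [1, 1, 0, 1, 0, 0, 1, 0],
})

# Split into estimation / train / test, as the answer suggests.
estimation, rest = train_test_split(df, test_size=0.5, random_state=0)
train, test = train_test_split(rest, test_size=0.5, random_state=0)

# class_city is estimated on the estimation set ONLY.
global_rate = estimation["signed"].mean()
city_rate = estimation.groupby("city_code")["signed"].mean()

# Train and test merely look the values up, so their labels never enter
# the encoding; cities unseen in the estimation set fall back to the
# global rate.
train = train.assign(class_city=train["city_code"].map(city_rate).fillna(global_rate))
test = test.assign(class_city=test["city_code"].map(city_rate).fillna(global_rate))
```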
Comments:

– Blenzus (Mar 20 at 8:29): Thank you for your answer. I'm not quite familiar with the concept of data points, but this makes sense!

– Dan Carter (Mar 20 at 12:28): This last part about future data is important. It's also important not to use data from the current date, as this would also be a data leak.

– Blenzus (yesterday): Yes, this is exactly what I was looking for!
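The time-series caveat in this answer and in Dan Carter's comment can be made concrete: each row's encoding may only use strictly earlier rows. A minimal pandas sketch on toy data (column names are illustrative):

```python
import pandas as pd

# Hypothetical time-ordered data (assumed column names).
df = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=6, freq="D"),
    "city_code": [101, 102, 101, 101, 102, 101],
    "signed":    [1, 0, 1, 0, 1, 1],
}).sort_values("date")

# Expanding mean of 'signed' per city, shifted by one row so the current
# row never contributes to its own encoding. (Rows sharing the same date
# would need extra care; the first row of each city has no history, giving
# NaN, which a global prior could fill.)
df["class_city"] = (
    df.groupby("city_code")["signed"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df)
```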
No, you should not do this; it causes a data leak. Data leaks happen when the data you use to train a machine learning algorithm contains the very information you are trying to predict.

It gives your model information about your test data during training, makes your test scores overly optimistic, and makes the model worse at generalizing to totally unseen data.

Good resource on data leakage

answered Mar 19 at 15:04 – Simon Larsson
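One way to see the optimism this answer warns about: encode a pure-noise ID with the target over the full dataset, then cross-validate. The leaky version scores well above chance even though the feature carries no real signal; recomputing the encoding inside each training fold brings the score back to roughly 50%. A sketch on toy synthetic data (names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 2000
# A high-cardinality ID with NO real relationship to the binary target.
df = pd.DataFrame({"dept_id": rng.integers(0, 500, n),
                   "signed": rng.integers(0, 2, n)})

# Leaky: encode with the target over the FULL dataset, then cross-validate.
# Each row's own label sneaks into its feature value.
df["leaky_enc"] = df.groupby("dept_id")["signed"].transform("mean")
leaky = cross_val_score(LogisticRegression(), df[["leaky_enc"]], df["signed"], cv=5)

# Honest: recompute the encoding inside each training fold only.
honest = []
for tr, te in KFold(5, shuffle=True, random_state=0).split(df):
    rates = df.iloc[tr].groupby("dept_id")["signed"].mean()
    prior = df.iloc[tr]["signed"].mean()
    X_tr = df.iloc[tr]["dept_id"].map(rates).to_frame("enc")
    X_te = df.iloc[te]["dept_id"].map(rates).fillna(prior).to_frame("enc")
    model = LogisticRegression().fit(X_tr, df.iloc[tr]["signed"])
    honest.append(model.score(X_te, df.iloc[te]["signed"]))

print("leaky CV accuracy :", leaky.mean())      # optimistic, well above 0.5
print("honest CV accuracy:", np.mean(honest))   # close to 0.5: no real signal
```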
Comments:

– Blenzus (Mar 19 at 15:34): Thanks for the answer. What would you recommend when IDs are given as input in a dataset (as in the case of department_code)? What's a common approach for that kind of data to improve my model?

– Simon Larsson (Mar 19 at 16:17): Glad it helped! That is really a new question, not directly related to this one, and not really for the comments. If you feel that the question you asked here has been answered, mark this as correct. If you have new things you want answered, you should open a new question.

– Blenzus (Mar 19 at 16:27): Your comment makes sense, but my results contradict your answer. I'm waiting in case a better answer comes; if not, I'll mark yours as correct (again, it makes sense that I shouldn't give away information about the values I want to predict in features). Note: I'm posting my results with the old feature and the new feature on unseen data!

– Simon Larsson (Mar 19 at 16:38): How does the result contradict my answer? You got a higher score on both the train and test sets, which is exactly what you'd expect from data leakage. Or is there something I am missing? I am only addressing "shouldn't I create features based on the output?".

– Simon Larsson (Mar 19 at 16:53): Yes, that is what I am saying. Data leakage gives you an overly optimistic test score.