Manual feature engineering based on the output


So, I'm working on an ML model whose potential predictors are: age, a code for the person's city, social status (married, single and so on), and number of children. The output, signed, is binary (0 or 1). That's the initial dataset I have.

The goal is to use those features to predict the value of signed for each person.

I already generated predictions on unseen data. Validating the predicted results against the real data, I get 25% accuracy, while cross-validation gives me 65% accuracy. So I thought: overfitting.

Here is my question. I went back to the early stages of the whole process and started creating new features. For example, instead of the raw city code, which makes no sense as input to an ML model, I created classes based on the percentage of signed: the city with the higher percentage of 'signed' (the output) gets assigned a higher value of class_city. This considerably improved the signed/class_city relationship in my correlation matrix, which makes sense. Is what I'm doing correct, or shouldn't I create features based on the output?

Here is my correlation matrix (image omitted).

After re-modelling with 3 features only (department_class, age and situation), I tested my model on unseen data made of 148 rows, compared to 60k rows in the training file.

The first model, with the old feature (the ID of the department), gave 25% accuracy, while the second model, with the new feature class_department, gave 71% (again on unseen data).

Note: the first model (25%) also has some other ID features; they might be contributing to such weak accuracy along with department_ID.
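
In code, the encoding described above amounts to roughly the following simplified pandas sketch (the column names and the exact class assignment are only illustrative):

```python
import pandas as pd

# Toy data: one row per person, a raw city code and the binary target 'signed'.
df = pd.DataFrame({
    "city_code": [101, 101, 102, 102, 103, 103],
    "signed":    [1,   1,   0,   1,   0,   0],
})

# Percentage of 'signed' per city, computed on the same data the model is trained on.
signed_rate = df.groupby("city_code")["signed"].mean()

# Cities with a higher signed rate get a higher class value.
df["class_city"] = df["city_code"].map(signed_rate.rank(method="dense"))
print(df)
```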










Tags: machine-learning, feature-selection, feature-engineering, correlation, data-leakage

asked Mar 19 at 14:08 by Blenzus, a new contributor (edited Mar 20 at 10:44)

– Ben Reiniger (Mar 19 at 18:24): This is called "target encoding." As you and the answers note, the encoding should not involve your test set(s). Also possibly problematic are rare levels in the variable.

– Blenzus (Mar 20 at 8:21): Thank you for your answer. My test set is untouched and unused during the whole process, if that's what you mean by not involving my test set.

2 Answers

You can create features based on output values, but you should be careful in doing this.

When you use the value of class_city (based on the percentage of signed for that city) for a given data point, note that this calculation cannot include the current data point, since you will not have the value of signed during prediction.

One way to handle this is to split the total data you have into three parts: estimation, train, and test. The estimation set is used only to estimate the class_city values for each city. These values can then be used in the train and test data. This way, you have the label-based values without your model doing anything 'unfair'. For testing, you can in fact use the data from the estimation and train sets together to estimate the class_city values for use in the test set. The same holds true for any unseen data: you can use the class_city values estimated from all the previous data points.

In the context of time series data, for example, the class_city value for any data point can potentially use information from all previous data points, but should not use any information from future data points!
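
A minimal sketch of this estimation/train/test idea, assuming pandas and scikit-learn (the column names, split sizes and synthetic data are only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: one row per person, a raw 'city' code and the binary target 'signed'.
rng = np.random.default_rng(0)
df = pd.DataFrame({"city": rng.integers(0, 50, 2000),
                   "signed": rng.integers(0, 2, 2000)})

# 1) Carve out an estimation set that is used ONLY to compute the encoding.
estimation, rest = train_test_split(df, train_size=0.3, random_state=0)
train, test = train_test_split(rest, test_size=0.3, random_state=0)

# 2) Per-city signed rate, computed on the estimation set alone.
city_rate = estimation.groupby("city")["signed"].mean()
fallback = estimation["signed"].mean()   # for cities absent from the estimation set

# 3) Apply the encoding to train and test; neither split's own labels are involved.
train = train.assign(class_city=train["city"].map(city_rate).fillna(fallback))
test = test.assign(class_city=test["city"].map(city_rate).fillna(fallback))
```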






answered Mar 19 at 17:07 by raghu (edited Mar 20 at 11:48)

– Blenzus (Mar 20 at 8:29): Thank you for your answer. I'm not quite familiar with the concept of data points, but this makes sense!

– Dan Carter (Mar 20 at 12:28): This last part about future data is important. It's also important not to use data from the current date, as this would also be a data leak.

– Blenzus (yesterday): Yes, this is exactly what I was looking for!

No, you should not do this; it causes a data leak. Data leaks happen when the data you use to train a machine learning algorithm happens to contain the information you are trying to predict.

It gives your model information about your test data during training. That makes your test scores overly optimistic and makes the model worse at generalizing to totally unseen data.

Good resource on data leakage
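
A small sketch of the effect, assuming pandas and scikit-learn (the column names mirror the question but the data is made up): here the target is pure noise, so the only way to beat 50% accuracy on the test set is leakage, and the encoding computed on all rows (test included) typically does exactly that, while the one computed on the training rows alone stays near chance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 'city' is a high-cardinality ID with no real relationship to the target.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({"city": rng.integers(0, 2000, n),
                   "signed": rng.integers(0, 2, n)})

train, test = train_test_split(df, test_size=0.2, random_state=0)
baseline = train["signed"].mean()

# Leaky encoding: per-city signed rate computed on ALL rows, test rows included.
leaky_rate = df.groupby("city")["signed"].mean()
# Clean encoding: per-city signed rate computed on the training rows only.
clean_rate = train.groupby("city")["signed"].mean()

for name, rate in [("leaky", leaky_rate), ("clean", clean_rate)]:
    X_train = train["city"].map(rate).fillna(baseline).to_frame()
    X_test = test["city"].map(rate).fillna(baseline).to_frame()
    model = LogisticRegression().fit(X_train, train["signed"])
    print(name, round(accuracy_score(test["signed"], model.predict(X_test)), 3))
```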






answered Mar 19 at 15:04 by Simon Larsson

– Blenzus (Mar 19 at 15:34): Thanks for the answer. What would you recommend in the case of IDs being given as input in a dataset (as with department_code)? What's a common approach for that kind of data to improve my model?

– Simon Larsson (Mar 19 at 16:17): Glad it helped! That is really a new question not directly related to this one, and not really for the comments. If you feel that the question you asked here has been answered, mark this as correct. If you have new things you want answered, you should open a new question.

– Blenzus (Mar 19 at 16:27): Your comment makes sense, but my results contradict your answer. I'm waiting for maybe a better answer; if not, I'll mark your answer as correct (again, it makes sense that I shouldn't give away information about the values I want to predict in the features). Note: I'm posting my results with the old feature and the new feature on unseen data.

– Simon Larsson (Mar 19 at 16:38): How does the result contradict my answer? You got a higher score on both the train and test sets, which is what you expect from data leakage. Or is there something I am missing? I am only addressing "shouldn't I create features based on the output?"

– Simon Larsson (Mar 19 at 16:53): Yes, that is what I am saying. Data leakage gives you an overly optimistic test score.










