Different number of features after using OneHotEncoder
I have train and test data in two separate files.
OneHotEncoder produces a different number of features for the train and test data, because the two files contain different category values. But the classifier requires the number of features for train and test data to be equal. How can I solve this problem?
scikit-learn feature-engineering feature-scaling
asked Mar 13 '18 at 12:54 by Sameed
Create features (one-hot encoding) before splitting into train and test sets.
– Ankit Seth Mar 13 '18 at 12:55
Yes. You have to make sure you apply one-hot encoding to all the data so that you get all the features. I would suggest creating a dummy indicator variable for the train and test rows so that you can split the combined data again after the encoding step.
– Ankit Seth Mar 13 '18 at 13:04
It's a red flag when a feature value occurs only in the train set or only in the test set, but not in both. Make sure to double-check whether it even makes sense to include that feature in the model at all!
– AlexR Mar 13 '18 at 13:25
If your categorical features have high cardinality (and, of course, distinct values in test and train), OneHotEncoder is not the way to go, besides other issues it may cause. Everyone here is talking about combining train and test before encoding, but that is usually bad practice. The test set is just a sample of unseen data, which can contain different subcategories of a particular feature. In machine learning, we do not want to build a model for one-time use; it is expected to predict on an incoming flow of unseen data, otherwise why bother? Combining train and test goes against this principle.
– TwinPenguins Mar 14 '18 at 22:07
@Sameed Well, I am not 100% sure what the best practices are. The more I study, the more I learn that OneHotEncoder is not the only option and often not the best one. I suggest you search a bit more. You may want to check CatBoost (tech.yandex.com/catboost/doc/dg/concepts/…) and the categorical encodings it offers beyond one-hot for gradient-boosted trees, or a Python implementation of various categorical encodings: github.com/scikit-learn-contrib/categorical-encoding
– TwinPenguins Mar 16 '18 at 8:56
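The indicator-variable trick suggested in the early comments can be sketched with pandas `get_dummies` (illustrative data and column names; note the later comments' caveat that combining train and test for encoding is often considered bad practice):

```python
import pandas as pd

# hypothetical train/test frames with a categorical column
train = pd.DataFrame({"color": ["red", "green"]})
test = pd.DataFrame({"color": ["blue"]})

train["is_train"] = True   # indicator so we can split again after encoding
test["is_train"] = False

combined = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["color"])

train_enc = encoded[encoded["is_train"]].drop(columns="is_train")
test_enc = encoded[~encoded["is_train"]].drop(columns="is_train")

# both frames now share the same one-hot columns
assert list(train_enc.columns) == list(test_enc.columns)
```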
3 Answers
Data preprocessing (including the creation of dummy variables from categorical features) needs to be done before splitting the data into train and test sets. That would solve your issue.
You don't give many details on what you are trying to do, so if what I say is irrelevant, just skip it; but the fact that your test set contains fewer categories than the training set is something I would try to avoid. If this happens for multiple categories, maybe you should group some of the categories together into more general ones (e.g. if the variable contains "ways of going to work", you could merge "bus", "tram" and "metro" into "public transport"). Also, why don't you try cross-validation instead of a static test set? (Useful link for cross-validation with scikit-learn.)
answered Mar 13 '18 at 13:31 by missrg
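This answer's encode-before-splitting approach can be sketched as follows (illustrative data; assumes a scikit-learn version whose OneHotEncoder accepts string input):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# hypothetical categorical feature and labels
X = np.array([["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]])
y = np.array([0, 1, 0, 1, 0, 1])

# encode the full dataset first, then split: train and test are
# guaranteed to share the same one-hot columns
enc = OneHotEncoder()
X_encoded = enc.fit_transform(X).toarray()

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.33, random_state=0
)
assert X_train.shape[1] == X_test.shape[1]
```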
One-hot encoding is only a symptom. The cause of the problem is that your factor variable does not have the same levels in the test and train data.
Here you should distinguish two cases. Is it only a problem of sampling? You created your test data as, say, a 20% sample of the original data, and some levels with small cardinality may simply have failed to make it into the sample. If that is the case, you must take care to sample all levels, taking 20% of the data within each level.
The other problem arises if your factor variable is not static and new levels can emerge over time. In that case it is really possible to encounter new levels in "unseen data".
One possible approach to handle this is to train an explicit "unknown" level based on some prepared average data. In the preprocessing phase, all new levels are recognised and mapped to this unknown level.
Periodically refresh the model to include recently appeared levels.
answered May 12 '18 at 19:22 by Marmite Bomber
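The mapping step of this approach might look like the following sketch, assuming pandas and an illustrative `UNKNOWN` level (how that level's training rows are prepared is left out here):

```python
import pandas as pd

# training data includes an explicit "UNKNOWN" level, as the answer suggests
train = pd.DataFrame({"transport": ["bus", "tram", "metro", "UNKNOWN"]})
known_levels = set(train["transport"]) - {"UNKNOWN"}

# at prediction time, map any level not seen during training to "UNKNOWN"
new_data = pd.DataFrame({"transport": ["bus", "scooter"]})
new_data["transport"] = new_data["transport"].where(
    new_data["transport"].isin(known_levels), "UNKNOWN"
)
```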
One-hot encoding converts a categorical label into a binary vector: with 3 categories, the label 2 becomes [0, 1, 0] and the label 3 becomes [0, 0, 1].
If you are using scikit-learn for one-hot encoding, then at training time you should use:
enc = OneHotEncoder()
enc.fit(x_train)
At testing time you should use enc.transform(x_test).
The reason we use the transform function at test time is that we must encode the data according to the label values seen at training time, because at test time we may not see all the labels for that column.
edited Apr 9 at 6:28, answered Apr 9 at 6:17 by Swapnil Pote
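The fit-on-train, transform-on-test pattern can be sketched with illustrative data; recent scikit-learn versions additionally let the encoder ignore categories never seen at training time via `handle_unknown='ignore'`, encoding them as all-zero rows instead of raising an error:

```python
from sklearn.preprocessing import OneHotEncoder

x_train = [["bus"], ["tram"], ["metro"]]
x_test = [["bus"], ["bike"]]  # "bike" was never seen during training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(x_train)  # categories are fixed by the training data

train_enc = enc.transform(x_train).toarray()
test_enc = enc.transform(x_test).toarray()

# train and test share the same feature count; "bike" becomes an all-zero row
assert train_enc.shape[1] == test_enc.shape[1]
```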
Thanks for contributing an answer to Data Science Stack Exchange!