Different number of features after using OneHotEncoder
I have train and test data in two separate files.
OneHotEncoder produces a different number of features for the train and test data, because the two files contain different category values. But the classifier requires the number of features for train and test data to be equal. How can I solve this problem?
scikit-learn feature-engineering feature-scaling
asked Mar 13 '18 at 12:54 by Sameed
Create features (one-hot encoding) before splitting into train and test sets.
– Ankit Seth Mar 13 '18 at 12:55
Yes. You have to make sure you apply one-hot encoding to all the data so that you get all the features. I would suggest creating a dummy indicator variable for the train and test rows so that you can split the combined data again after the encoding step.
– Ankit Seth Mar 13 '18 at 13:04
It's a red flag when a feature value occurs only in the train set or only in the test set, but not in both. Make sure to double-check whether it even makes sense to include that feature in the model at all!
– AlexR Mar 13 '18 at 13:25
If your categorical features have high cardinality (and, of course, distinct values in test and train), OneHotEncoder is not the way to go, besides other issues it may cause. Everyone here is talking about combining train and test before encoding, but that is usually bad practice. The test set is just a sample of unseen data, which can contain different subcategories of a particular feature. In machine learning, we do not want to build a model for one-time use; it is expected to predict on an incoming flow of unseen data, otherwise why bother? Combining train and test goes against this principle.
– TwinPenguins Mar 14 '18 at 22:07
@Sameed Well, I am not 100% sure what the best practices are. The more I study, the more I learn that OneHotEncoder is not the only option and often not the best one. I suggest you search a bit more. You may want to check CatBoost (tech.yandex.com/catboost/doc/dg/concepts/…) and the categorical encodings it offers beyond one-hot for gradient-boosted trees, or a Python implementation of various categorical encodings: github.com/scikit-learn-contrib/categorical-encoding
– TwinPenguins Mar 16 '18 at 8:56
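The indicator-variable trick suggested in the early comments can be sketched with pandas `get_dummies` (illustrative data and column names; note the later comments' caveat that combining train and test for encoding is often considered bad practice):

```python
import pandas as pd

# hypothetical train/test frames with a categorical column
train = pd.DataFrame({"color": ["red", "green"]})
test = pd.DataFrame({"color": ["blue"]})

train["is_train"] = True   # indicator so we can split again after encoding
test["is_train"] = False

combined = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["color"])

train_enc = encoded[encoded["is_train"]].drop(columns="is_train")
test_enc = encoded[~encoded["is_train"]].drop(columns="is_train")

# both frames now share the same one-hot columns
assert list(train_enc.columns) == list(test_enc.columns)
```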
3 Answers
Data preprocessing (including the creation of dummy variables from categorical features) needs to be done before splitting the data into train and test sets. That would solve your issue.
You don't give many details on what you are trying to do, so if what I say is irrelevant, just skip it; but the fact that your test set contains fewer categories than the training set is something I would try to avoid. If this happens for multiple categories, maybe you should group some of the categories together into more general ones (e.g. if the variable contains "ways of going to work", you could merge "bus", "tram" and "metro" into "public transport"). Also, why don't you try cross-validation instead of a static test set? (Useful link for cross-validation with scikit-learn.)
answered Mar 13 '18 at 13:31 by missrg
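This answer's encode-before-splitting approach can be sketched as follows (illustrative data; assumes a scikit-learn version whose OneHotEncoder accepts string input):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# hypothetical categorical feature and labels
X = np.array([["red"], ["green"], ["blue"], ["red"], ["green"], ["blue"]])
y = np.array([0, 1, 0, 1, 0, 1])

# encode the full dataset first, then split: train and test are
# guaranteed to share the same one-hot columns
enc = OneHotEncoder()
X_encoded = enc.fit_transform(X).toarray()

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.33, random_state=0
)
assert X_train.shape[1] == X_test.shape[1]
```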
One-hot encoding is only a symptom. The cause of the problem is that your factor variable does not have the same levels in the test and train data.
Here you should distinguish two cases. Is it only a problem of sampling? You created your test data as, say, a 20% sample of the original data, and some levels with small cardinality may simply have failed to make it into the sample. If that is the case, you must take care to sample all levels, taking 20% of the data within each level.
The other problem arises if your factor variable is not static and new levels can emerge over time. In that case it is really possible to encounter new levels in "unseen data".
One possible approach to handle this is to train an explicit "unknown" level based on some prepared average data. In the preprocessing phase, all new levels are recognised and mapped to this unknown level.
Periodically refresh the model to include recently appeared levels.
answered May 12 '18 at 19:22 by Marmite Bomber
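The mapping step of this approach might look like the following sketch, assuming pandas and an illustrative `UNKNOWN` level (how that level's training rows are prepared is left out here):

```python
import pandas as pd

# training data includes an explicit "UNKNOWN" level, as the answer suggests
train = pd.DataFrame({"transport": ["bus", "tram", "metro", "UNKNOWN"]})
known_levels = set(train["transport"]) - {"UNKNOWN"}

# at prediction time, map any level not seen during training to "UNKNOWN"
new_data = pd.DataFrame({"transport": ["bus", "scooter"]})
new_data["transport"] = new_data["transport"].where(
    new_data["transport"].isin(known_levels), "UNKNOWN"
)
```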
One-hot encoding converts a categorical label into a binary vector: with 3 categories, the label 2 becomes [0, 1, 0] and the label 3 becomes [0, 0, 1].
If you are using scikit-learn for one-hot encoding, then at training time you should use:
enc = OneHotEncoder()
enc.fit(x_train)
At testing time you should use enc.transform(x_test).
The reason we use the transform function at test time is that we must encode the data according to the label values seen at training time, because at test time we may not see all the labels for that column.
edited Apr 9 at 6:28, answered Apr 9 at 6:17 by Swapnil Pote
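The fit-on-train, transform-on-test pattern can be sketched with illustrative data; recent scikit-learn versions additionally let the encoder ignore categories never seen at training time via `handle_unknown='ignore'`, encoding them as all-zero rows instead of raising an error:

```python
from sklearn.preprocessing import OneHotEncoder

x_train = [["bus"], ["tram"], ["metro"]]
x_test = [["bus"], ["bike"]]  # "bike" was never seen during training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(x_train)  # categories are fixed by the training data

train_enc = enc.transform(x_train).toarray()
test_enc = enc.transform(x_test).toarray()

# train and test share the same feature count; "bike" becomes an all-zero row
assert train_enc.shape[1] == test_enc.shape[1]
```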
Thanks for contributing an answer to Data Science Stack Exchange!