Different number of features after using OneHotEncoder


I have train and test data in two separate files.

OneHotEncoder gives a different number of features for the train data and the test data, because the two files contain different category values. The classifier, however, requires the train and test data to have the same number of features. How can I solve this problem?

scikit-learn feature-engineering feature-scaling

asked Mar 13 '18 at 12:54 – Sameed
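For illustration, a minimal sketch of how the mismatch arises (the toy frames, the column name "city" and the separate fit calls below are assumptions, not from the original post; string categories need scikit-learn >= 0.20):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df_train = pd.DataFrame({"city": ["london", "paris", "berlin"]})
    df_test = pd.DataFrame({"city": ["london", "madrid"]})   # "madrid" never appears in train

    # Fitting a separate encoder on each file is what produces the mismatch.
    enc_train = OneHotEncoder().fit(df_train)
    enc_test = OneHotEncoder().fit(df_test)

    print(enc_train.transform(df_train).shape)   # (3, 3): three dummy columns
    print(enc_test.transform(df_test).shape)     # (2, 2): only two, so the classifier complains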











  • Create features (one-hot encoding) before splitting into train and test sets. – Ankit Seth, Mar 13 '18 at 12:55

  • Yes. You have to make sure you pass all of the data to the one-hot encoding so that you get all the features. I would suggest adding an indicator variable marking which rows are train and which are test, so that you can split the combined data again after the encoding step. – Ankit Seth, Mar 13 '18 at 13:04

  • It's a red flag when a feature value occurs only in the training set or only in the test set, but not in both. Double-check whether it even makes sense to include that feature in the model at all! – AlexR, Mar 13 '18 at 13:25

  • If your categorical features have high cardinality (and, of course, different values in test and train), OneHotEncoder is not the way to go, aside from the other issues it may cause. Everyone here is talking about combining train and test and then encoding, but that is usually bad practice. The test set is just a sample of unseen data, which can contain subcategories of a feature that training never saw. In machine learning we do not build a model for one-time use; it is expected to predict on an incoming stream of unseen data, otherwise why bother? Combining train and test violates this principle. – TwinPenguins, Mar 14 '18 at 22:07

  • @Sameed Well, I am not 100% sure what the best practices are. The more I study, the more I learn that OneHotEncoder is not the only option and often not the best one. I suggest you search a bit more. You may want to check CatBoost (tech.yandex.com/catboost/doc/dg/concepts/…) and the categorical encodings it offers beyond one-hot for gradient boosting trees, or a Python implementation of various categorical encoders: github.com/scikit-learn-contrib/categorical-encoding – TwinPenguins, Mar 16 '18 at 8:56















3 Answers


















Data preprocessing (including the creation of dummy variables from categorical features) needs to be done before splitting the data into train and test sets. That would solve your issue.

You haven't given many details about what you are trying to do, so if this is irrelevant just skip it; but the fact that your test set contains fewer categories than the training set is something I would try to avoid. If this happens for multiple categories, consider grouping some of them into more general ones (e.g. if the variable contains "ways of going to work", you could merge "bus", "tram" and "metro" into "public transport"). Also, why don't you try cross-validation instead of a static test set? (Useful link for cross-validation with scikit-learn.)

answered Mar 13 '18 at 13:31 – missrg
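Following the cross-validation suggestion, one way to keep the encoding honest is to put the encoder and the classifier in a single Pipeline, so the encoding is re-fit inside every fold. A minimal sketch with made-up toy data (the column name, the LogisticRegression choice and handle_unknown="ignore" are my assumptions, not part of the answer above):

    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Toy data: one categorical column and a binary target.
    X = pd.DataFrame({"transport": ["bus", "tram", "metro", "car", "bike",
                                    "bus", "car", "metro", "bike", "tram"]})
    y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

    model = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),  # a category unseen in a fold encodes as all zeros
        LogisticRegression(),
    )
    print(cross_val_score(model, X, y, cv=5))    # encoder and classifier are re-fit in every fold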




















One-hot encoding is only a symptom. The cause of the problem is that your factor variable does not have the same levels in the test and train data.

Here you should distinguish two cases. Is it only a sampling problem? You created your test data as, say, a 20% sample of the original data, and levels with small cardinality can fail to make it into the sample. If that is the case, you must take care to sample all levels, taking 20% of the data from each level.

The other problem is that your factor variable is not static and new levels can emerge over time. In that case it really is possible to encounter new levels in "unseen data".

One possible approach to handle this is to train with an explicit "unknown" level based on some prepared average data. In the preprocessing phase, all new levels are recognised and mapped to this unknown level.

Periodically refresh the model to include recently appeared levels.

answered May 12 '18 at 19:22 – Marmite Bomber
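A rough sketch of both ideas on made-up data (the column names, the 20% test size and the "unknown" label are illustrative assumptions): stratifying the split on the factor keeps every level in both sets, and remapping handles levels that training never saw.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.DataFrame({
        "level": ["a"] * 8 + ["b"] * 8 + ["c"] * 4,   # "c" is a small level that plain sampling could miss
        "target": [0, 1] * 10,
    })

    # 1) Stratified sampling: each level contributes ~20% of its rows to the test set.
    train, test = train_test_split(df, test_size=0.2, stratify=df["level"], random_state=0)

    # 2) Explicit "unknown" level: anything not seen in training is remapped before encoding.
    known_levels = set(train["level"])
    test = test.assign(level=test["level"].where(test["level"].isin(known_levels), "unknown"))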




















One-hot encoding converts a categorical label into a binary vector; with 3 categories, for example, 2 becomes [0, 1, 0] and 3 becomes [0, 0, 1].

If you are using scikit-learn to one-hot encode the values, then at training time you should use

    from sklearn.preprocessing import OneHotEncoder

    enc = OneHotEncoder()
    enc.fit(x_train)

and at testing time you should use only

    enc.transform(x_test)

The reason we use transform (rather than fitting again) at test time is that the test data has to be encoded using the category values learned from the training data, because at test time we may not see all the values of that column.

edited Apr 9 at 6:28, answered Apr 9 at 6:17 – Swapnil Pote
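Putting fit and transform together end to end (a sketch; the toy frames and handle_unknown="ignore" are assumptions on top of the answer, and string categories need scikit-learn >= 0.20):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    x_train = pd.DataFrame({"city": ["london", "paris", "berlin"]})
    x_test = pd.DataFrame({"city": ["paris", "madrid"]})     # "madrid" never seen during training

    enc = OneHotEncoder(handle_unknown="ignore")             # don't error on unseen categories
    enc.fit(x_train)                                         # learn the categories from the training data only

    train_enc = enc.transform(x_train)
    test_enc = enc.transform(x_test)                         # the "madrid" row encodes as all zeros
    print(train_enc.shape, test_enc.shape)                   # (3, 3) (2, 3) -> same number of columns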












