Dealing with multiple distinct-value categorical variables


I have a dataset in which almost all of the columns are categorical variables. The problem is that most of these variables have a very large number of distinct values.



For instance, one column has more than one million unique values; it is an IP address column, in case anyone is interested. Someone suggested using domain knowledge to split it into several other columns, such as network class type, host type, and so on. However, wouldn't that make my dataset lose some information? What if I wanted to work with the IP addresses as they are?
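To make the domain-knowledge split concrete, here is a minimal sketch using Python's standard `ipaddress` module. The feature names (`prefix_8`, `prefix_16`, `prefix_24`, `host_octet`, `as_int`) are hypothetical choices for illustration, not something taken from the dataset described here, and the sketch assumes IPv4 addresses only.

```python
import ipaddress


def ip_features(ip_string):
    """Derive a few coarse features from an IPv4 address string.

    The feature names are illustrative assumptions; keep only the ones
    that make sense for your data.
    """
    ip = ipaddress.ip_address(ip_string)
    if ip.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")

    def prefix(bits):
        # strict=False lets us build the enclosing network from a host address
        return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

    return {
        "prefix_8": prefix(8),        # very coarse grouping
        "prefix_16": prefix(16),      # medium grouping
        "prefix_24": prefix(24),      # fine grouping, often one subnet
        "host_octet": ip.packed[3],   # last byte of the address
        "is_private": ip.is_private,  # private address range or not
        "as_int": int(ip),            # full address, so nothing is discarded
    }


features = ip_features("192.168.1.37")
print(features["prefix_24"])  # 192.168.1.0/24
```

Keeping the full address (here as `as_int`) alongside the derived columns means the split adds information rather than replacing it, and the prefix columns have far fewer distinct values than the raw addresses.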



The domain-knowledge solution might work for the IP column, but I have other columns with more than 100K distinct values, where each value is a constant-length random string.
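For columns of random strings, where no domain structure can be exploited, one commonly used option is the hashing trick: map every value to one of a fixed number of buckets. This is a technique swapped in for illustration, not something proposed in the thread. The bucket count and the `salt` parameter below are assumptions.

```python
import hashlib


def hash_bucket(value, n_buckets=2**18, salt=""):
    """Map a string to a stable bucket index in [0, n_buckets).

    hashlib is used instead of the built-in hash(), which is randomised
    per process for strings and would not be reproducible across runs.
    """
    digest = hashlib.md5((salt + value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


# Hypothetical random-string identifiers
ids = ["a9f3c2e1", "77bd01aa", "a9f3c2e1"]
buckets = [hash_bucket(v) for v in ids]
print(buckets[0] == buckets[2])  # True: the same value always lands in the same bucket
```

Different values can collide in the same bucket, so the bucket count trades memory against collision rate. A different `salt` per column keeps identical strings in different columns from sharing bucket indices.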



I have worked with embedding layers before, but only with a few thousand distinct values at most, never with more than 10K, so I'm not sure whether that approach would work with millions.



Best regards







machine-learning neural-network categorical-data word-embeddings






asked yesterday by Abdullah Mohamed




  • Can you explain more about the problem you are trying to solve?
    – alireza zolanvari
    yesterday

  • Mainly, I'm trying to classify data based on some inputs. The inputs consist mostly of categorical data, and each categorical variable has a large number of distinct values. One of the independent variables is the IP address, which is essential for my classification problem. What I'm trying to do is binary classification based on the (mostly categorical) inputs. Does that help? Let me know if you need more details.
    – Abdullah Mohamed
    yesterday

  • Embeddings and domain-based features are the most promising options here. For the IP address, those would be subnet ID, geolocation, etc. Embeddings work for a large number of values (such as word embeddings for 10 million+ words).
    – Shamit Verma
    yesterday

  • What kind of information are you trying to extract from the IP?
    – alireza zolanvari
    yesterday

  • @ShamitVerma My dataset already contains countries; however, the country variable might differ from the IP country (because of VPNs or proxies, for instance). I didn't know that embeddings work for data with millions of distinct values; in that case, that would be a reasonable solution to my question.
    – Abdullah Mohamed
    yesterday
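As a rough illustration of why embeddings can scale to millions of values: an embedding layer is, at its core, a lookup table with one row of trainable numbers per category, so its size grows linearly with the number of categories. The sketch below is plain Python with no training step. The class name `EmbeddingTable`, the dimension of 32, and the 4 bytes per value (32-bit floats) are all assumptions for illustration.

```python
import random


def embedding_table_bytes(n_categories, dim, bytes_per_value=4):
    """Memory needed for an embedding table, assuming 32-bit floats by default."""
    return n_categories * dim * bytes_per_value


class EmbeddingTable:
    """Toy lookup table: one vector per known category, row 0 for unseen values."""

    def __init__(self, categories, dim, seed=0):
        rng = random.Random(seed)
        # Index 0 is reserved for values that were not seen when building the table
        self.index = {c: i + 1 for i, c in enumerate(sorted(set(categories)))}
        self.rows = [
            [rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(len(self.index) + 1)
        ]

    def lookup(self, category):
        return self.rows[self.index.get(category, 0)]


# One million categories with 32-dimensional vectors
print(embedding_table_bytes(1_000_000, 32) / 1e6)  # 128.0 (megabytes)

table = EmbeddingTable(["10.0.0.1", "10.0.0.2"], dim=4)
print(len(table.lookup("10.0.0.1")))  # 4
```

In a real model the rows would be learned by a deep-learning framework's embedding layer; the point here is only that a table of this size is small enough to fit in memory, provided there is enough data per category to learn useful vectors.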
















1 Answer
Have you heard of CatBoostClassifier?

https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier-docpage/

It is a type of boosting classifier developed to deal specifically with categorical features. It has achieved state-of-the-art results, and the package developed by the authors has excellent support and even GPU support. Take a look; this could be your solution.







answered yesterday by Victor Oliveira



















