Dealing with multiple distinct-value categorical variables
I've got a dataset in which almost all of the columns are categorical variables. The problem is that most of these variables have a very large number of distinct values.
For instance, one column has more than one million unique values; it's an IP address column, in case anyone is interested. Someone suggested using domain knowledge to split it into several other columns, such as network class type, host type and so on. However, wouldn't that make my dataset lose some information? What if I wanted to deal with the IP addresses as they are?
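To make the suggested split concrete, here is a minimal sketch using Python's standard `ipaddress` module. The function name `ip_features` and the particular features chosen (first octet, /16 and /24 prefixes, host part, private flag) are assumptions for illustration only, and the sketch handles IPv4 addresses only.

```python
import ipaddress


def ip_features(ip_string):
    """Derive lower-cardinality features from one IPv4 address string.

    The feature set is a hypothetical choice, not a recommendation
    taken from the question or its comments.
    """
    addr = ipaddress.IPv4Address(ip_string)
    # strict=False lets us pass a host address and get its enclosing network.
    net16 = ipaddress.ip_network(f"{ip_string}/16", strict=False)
    net24 = ipaddress.ip_network(f"{ip_string}/24", strict=False)
    return {
        "first_octet": int(ip_string.split(".")[0]),
        "prefix_16": str(net16.network_address),
        "prefix_24": str(net24.network_address),
        "host_in_24": int(addr) & 0xFF,  # last octet: position inside the /24
        "is_private": addr.is_private,
    }


features = ip_features("192.168.10.37")
print(features)
```

In this particular decomposition, `prefix_24` and `host_in_24` together still identify the full address, so nothing is discarded; the split only exposes coarser groupings that a model can share statistics across.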
The domain-knowledge approach might work for the IP column. However, I've got other columns with more than 100K distinct values, where each value is a constant-length random string.
I have worked with embedding layers before, but I was dealing with thousands of features at most and never with more than 10K, so I'm not sure whether that would work with millions.
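For columns of random strings, where domain knowledge gives nothing to split on, one option is the hashing trick: map each string to a fixed number of buckets and feed the bucket index to an embedding layer. This technique is not mentioned in the question or comments; it is offered here as an assumption about what might help. The names `hash_bucket`, `NUM_BUCKETS`, `session_token` and the sample tokens are all hypothetical.

```python
import hashlib


def hash_bucket(value, num_buckets, column=""):
    """Map a category string to a stable bucket index in [0, num_buckets).

    The column name is mixed into the hash so that the same string in two
    different columns does not have to land in the same bucket.
    """
    digest = hashlib.md5(f"{column}={value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


NUM_BUCKETS = 2 ** 16  # hypothetical table size; tune against collision rate

tokens = ["a9f3k2q7", "zz81mmp0", "a9f3k2q7"]  # made-up sample values
indices = [hash_bucket(t, NUM_BUCKETS, column="session_token") for t in tokens]
print(indices)
```

The trade-off is that distinct values can collide in the same bucket, so the table size bounds memory at the cost of some lost distinctions. Unlike Python's built-in `hash()` on strings, which is salted per process, the `hashlib` digest gives the same index on every run.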
Much Regards
machine-learning neural-network categorical-data word-embeddings
asked yesterday
Abdullah Mohamed
Can you explain more about the problem you are trying to solve?
– alireza zolanvari, yesterday
Mainly, I'm trying to classify data based on some inputs. The inputs consist mostly of categorical data, and each categorical variable has a great many distinct values. One of the independent variables is the IP address, which is essential for my classification problem. What I'm trying to do is binary classification based on the (mostly categorical) inputs. Does that help? Let me know if you need more details.
– Abdullah Mohamed, yesterday
Embeddings and domain-based features are the most promising options here. For the IP, the features would be subnet ID, geolocation, etc. Embeddings work for a large number of values (such as word embeddings for 10 million+ words).
– Shamit Verma, yesterday
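As a rough check on whether an embedding table of that size is practical, the arithmetic can be sketched as below. The embedding dimension of 32 and the 4 bytes per weight (32-bit floats) are assumptions, and `embedding_table_size` is a hypothetical helper; the figure covers only the table's weights, not optimizer state or the rest of the model.

```python
def embedding_table_size(num_values, dim, bytes_per_weight=4):
    """Return (parameter count, size in megabytes) for one embedding table."""
    params = num_values * dim
    megabytes = params * bytes_per_weight / 1e6
    return params, megabytes


# Assumed dimension of 32; the row counts echo the figures in the thread.
for num_values in (1_000_000, 10_000_000):
    params, megabytes = embedding_table_size(num_values, dim=32)
    print(f"{num_values:>10,} values -> {params:,} weights, {megabytes:,.0f} MB")
```

Under these assumptions, a million distinct IPs cost about 128 MB of weights, and ten million values about 1,280 MB.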
What kind of information are you trying to extract from the IP?
– alireza zolanvari, yesterday
@ShamitVerma My dataset already contains countries; however, the country variable might differ from the IP's country (because of VPNs or proxies, for instance). I didn't know that embeddings work for data with millions of features. In that case, that would be a reasonable solution for my question.
– Abdullah Mohamed, yesterday
1 Answer
Have you heard of CatBoostClassifier?
https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier-docpage/
It is a type of boosting classifier developed specifically to deal with categorical features. It has achieved state-of-the-art results, and the package developed by the authors has excellent support and even GPU portability. Take a look; this could be your solution.
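For anyone who wants to try this, a minimal sketch of how categorical columns can be passed to `CatBoostClassifier` might look like the following. The toy rows, column meanings, and hyperparameter values are made up for illustration and are not part of the answer above; `catboost` is a third-party package, so it is imported inside the function and the training call is not executed here.

```python
def make_toy_data():
    """Return a tiny, made-up dataset: two categorical columns and a binary label."""
    rows = [
        ["10.0.0.0/24", "tok_a"],
        ["10.0.0.0/24", "tok_b"],
        ["172.16.5.0/24", "tok_a"],
        ["172.16.5.0/24", "tok_c"],
        ["192.168.1.0/24", "tok_b"],
        ["192.168.1.0/24", "tok_c"],
        ["10.0.0.0/24", "tok_c"],
        ["172.16.5.0/24", "tok_b"],
    ]
    labels = [1, 1, 0, 0, 1, 0, 1, 0]
    return rows, labels


def train_catboost(rows, labels):
    """Fit a small CatBoostClassifier, treating both columns as categorical."""
    # Imported lazily so the rest of the file loads without the package installed.
    from catboost import CatBoostClassifier  # third-party: pip install catboost

    # Hyperparameter values here are arbitrary small numbers for a toy run.
    model = CatBoostClassifier(iterations=50, depth=4, verbose=0)
    # cat_features lists the column indices that hold categorical values,
    # so the raw strings can be passed without one-hot encoding them first.
    model.fit(rows, labels, cat_features=[0, 1])
    return model


rows, labels = make_toy_data()
```

With the package installed, `model = train_catboost(rows, labels)` followed by `model.predict_proba(rows[:2])` would fit the classifier and return class probabilities for the first two rows.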
answered yesterday
Victor Oliveira