Clustering for variables with large amount of categories
I have a dataset with variables that have a lot of categories (some more than 1,000). A large number of categories affects the accuracy of the model. I saw some literature stating that, if you do not have the domain knowledge to group categories, you can combine the categories that each have less than 5% representation in the dataset.
My idea is to cluster these variables instead. Would that be possible, and does it make sense to do so? For example, if many variables have a lot of categories, clustering one variable could be an issue because, as mentioned above, the accuracy would be lower due to the high number of categories in the other variables.
Please let me know whether this method is valid.
clustering unsupervised-learning
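For illustration, the 5% rule mentioned in the question could be sketched as below. This is a minimal sketch in plain Python, not taken from the literature the question refers to: the function name `group_rare_categories`, the label `"Other"`, and the sample column are all hypothetical.

```python
from collections import Counter

def group_rare_categories(values, min_share=0.05, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    # Keep only the categories that reach the minimum share of rows.
    keep = {cat for cat, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]

# Hypothetical column: 'a' and 'b' dominate, 'c', 'd' and 'e' are rare.
colours = ["a"] * 50 + ["b"] * 44 + ["c"] * 3 + ["d"] * 2 + ["e"]
grouped = group_rare_categories(colours)
print(Counter(grouped))
```

The threshold is a parameter, so the same helper could be applied per variable with a different cut-off when 5% is too coarse.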
edited Apr 8 at 5:34 by Anony-Mousse
asked Apr 8 at 4:03 by Michael Schroter
It is not really clear what you are asking for. Could you give us more details?
– MachineLearner, Apr 8 at 7:22
OK. Put simply: if we want to reduce the number of categories in a variable in a dataset (excluding the target variable), could we use clustering to do it, and does that make sense when there are many variables with a lot of categories? See this link for background on the question and the other methods used: analyticsvidhya.com/blog/2015/11/…
– Michael Schroter, Apr 8 at 7:39
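One possible reading of "using clustering to reduce the number of categories" is sketched below. It is only an assumption about what such a procedure could look like, not a method confirmed anywhere in this thread: each category is described by the mean of some numeric columns over its rows, and those profiles are then grouped with a plain k-means. All names and the toy data are hypothetical, and the numeric columns would normally be scaled first.

```python
from collections import defaultdict

def category_profiles(categories, features):
    """Mean feature vector of the rows belonging to each category."""
    sums, counts = {}, defaultdict(int)
    for cat, row in zip(categories, features):
        if cat not in sums:
            sums[cat] = [0.0] * len(row)
        sums[cat] = [s + x for s, x in zip(sums[cat], row)]
        counts[cat] += 1
    return {cat: [s / counts[cat] for s in total] for cat, total in sums.items()}

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's k-means using the first k points as initial centres."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda c: squared_distance(p, centres[c]))
                  for p in points]
        # Move each centre to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical data: a 'city' variable and two numeric columns.
cities = ["A", "A", "B", "B", "C", "C", "D", "D"]
numeric = [[1.0, 10.0], [1.2, 11.0], [1.1, 10.5], [0.9, 10.5],
           [5.0, 50.0], [5.2, 52.0], [4.9, 51.0], [5.1, 49.0]]

profiles = category_profiles(cities, numeric)
names = sorted(profiles)
labels = kmeans([profiles[n] for n in names], k=2)
city_to_group = dict(zip(names, labels))  # category -> merged group id
print(city_to_group)
```

Because the profiles are built from the other columns rather than from the target, the grouping stays unsupervised; the choice of `k` and of the profile columns is left open here.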
1 Answer
Assuming that you have done the EDA well, you can try the categorical_encoder categorization if you haven't tried it yet. We cannot use the well-known one-hot encoder here, that's for sure: it would create one new column for every category, which would be unnecessary. You may try backward difference encoding.
Check this beauty
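To make the column counts concrete, here is a rough sketch in plain Python. It does not use the categorical_encoder package mentioned in the answer; the helper names are hypothetical, and the contrast matrix follows the textbook definition of backward difference coding, where column j compares the levels above j with the levels up to j.

```python
def one_hot(values):
    """One indicator column per distinct category."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]

def backward_difference_matrix(k):
    """Contrast matrix with k rows (levels) and k - 1 columns."""
    return [[-(k - j) / k if i <= j else j / k for j in range(1, k)]
            for i in range(1, k + 1)]

def backward_difference(values):
    # Levels are ordered alphabetically here as a placeholder; the coding
    # is normally applied to levels that have a meaningful order.
    levels = sorted(set(values))
    matrix = backward_difference_matrix(len(levels))
    index = {level: i for i, level in enumerate(levels)}
    return [matrix[index[v]] for v in values]

sizes = ["small", "medium", "large", "medium"]  # hypothetical column
print(len(one_hot(sizes)[0]))              # width of the one-hot encoding
print(len(backward_difference(sizes)[0]))  # width of the backward difference encoding
print(backward_difference_matrix(4))
```

For k categories this coding yields k - 1 columns, so for a variable with more than 1,000 categories it may be worth reducing the number of levels before encoding.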
answered Apr 8 at 9:57 by Aditya Patnaik
Thanks for the info.
– Michael Schroter, Apr 8 at 13:22