Clustering for variables with a large number of categories


I have a dataset in which some variables have a lot of categories (some more than 1000). A large number of categories can hurt the accuracy of the model. I have seen some literature suggesting that, if you do not have the domain knowledge to group categories meaningfully, you can combine the categories that each have less than 5% representation in the dataset.

My idea is to cluster the categories of these variables instead. Would that be possible, and does it make any sense to do so? For example, if many variables have a lot of categories, clustering one variable could be an issue because, as mentioned above, the high number of categories in the other variables would reduce the accuracy.

Please let me know whether this method is valid.
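For reference, the 5% rule mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `lump_rare_categories`, the `"Other"` label, and the toy `colors` column are hypothetical and not taken from the literature in question.

```python
from collections import Counter

def lump_rare_categories(values, min_share=0.05, other_label="Other"):
    """Replace every category whose share of the rows is below min_share."""
    counts = Counter(values)
    total = len(values)
    # Categories that account for less than min_share of all rows.
    rare = {cat for cat, n in counts.items() if n / total < min_share}
    return [other_label if v in rare else v for v in values]

# Hypothetical toy column: two frequent and two rare categories.
colors = ["red"] * 50 + ["blue"] * 46 + ["teal"] * 2 + ["plum"] * 2
lumped = lump_rare_categories(colors)
print(Counter(lumped))
```

The threshold is a parameter, so the same sketch can be applied per variable with a different cutoff if 5% merges too much or too little.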











clustering unsupervised-learning














edited Apr 8 at 5:34









Anony-Mousse











asked Apr 8 at 4:03









Michael Schroter












  • It is not really clear what you are asking for. Could you give us more details?
    – MachineLearner
    Apr 8 at 7:22










  • OK. Put simply: if we want to reduce the number of categories in a variable in a dataset (excluding the target variable), could we use clustering to do it, and does that make sense when many variables have a lot of categories? See this link for background on the question and the other methods used: analyticsvidhya.com/blog/2015/11/…
    – Michael Schroter
    Apr 8 at 7:39
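One possible reading of the idea in the comment above is to summarise each category by a number and then cluster those summaries, so that categories in the same cluster are merged. The sketch below assumes that reading; the summary statistic (the per-category mean of another numeric column), the simple 1-D k-means, and all names (`cluster_categories`, `cities`, `income`) are hypothetical choices made for illustration, not something stated in the question.

```python
from collections import defaultdict

def category_means(categories, values):
    """Average a numeric column within each category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, val in zip(categories, values):
        sums[cat] += val
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

def kmeans_1d(points, k, n_iter=20):
    """Plain 1-D k-means; centers start at evenly spaced sorted points."""
    ordered = sorted(points)
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ended up empty.
        centers = [sum(g) / len(g) if g else centers[c]
                   for c, g in enumerate(groups)]
    return centers

def cluster_categories(categories, values, k):
    """Map each category to the index of the center nearest to its mean."""
    means = category_means(categories, values)
    centers = kmeans_1d(list(means.values()), k)
    return {cat: min(range(k), key=lambda c: abs(m - centers[c]))
            for cat, m in means.items()}

# Hypothetical data: a high-cardinality column and a numeric column.
cities = ["A", "A", "B", "B", "C", "C", "D", "D"]
income = [10, 12, 11, 13, 50, 52, 49, 55]
print(cluster_categories(cities, income, k=2))
```

Whether such merged categories are meaningful depends on the chosen summary; a different statistic would generally give a different grouping.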




























1 Answer












Assuming that you have done the EDA well, you can try encoding the variables with categorical_encoder if you haven't tried it yet. We certainly cannot use the well-known one-hot encoder here: it would create as many columns as there are categories, which would be unnecessary. You may try backward difference encoding instead.
Check this beauty
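To make the column counts concrete, here is a minimal sketch in plain Python of the two encodings named above. It assumes the usual textbook definition of backward difference contrasts and an ordered list of levels; the helper names and the `levels` list are hypothetical. Under that definition, backward difference coding yields k - 1 columns for k categories, so it is worth checking the resulting width before relying on it for a variable with more than 1000 categories.

```python
def one_hot(levels):
    """Map each level to a 0/1 indicator row: k levels give k columns."""
    k = len(levels)
    return {lvl: [1 if i == j else 0 for j in range(k)]
            for i, lvl in enumerate(levels)}

def backward_difference(levels):
    """Backward difference contrast rows: k ordered levels give k - 1 columns."""
    k = len(levels)
    # Column j compares level j + 1 with level j.
    return {lvl: [-(k - j - 1) / k if i <= j else (j + 1) / k
                  for j in range(k - 1)]
            for i, lvl in enumerate(levels)}

levels = ["low", "medium", "high", "extreme"]  # hypothetical ordered levels
print(len(one_hot(levels)["low"]))             # number of one-hot columns
print(len(backward_difference(levels)["low"])) # number of contrast columns
print(backward_difference(levels)["low"])
```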






















answered Apr 8 at 9:57









Aditya Patnaik












  • Thanks for the info.
    – Michael Schroter
    Apr 8 at 13:22















