Is it wrong if I cluster numerical attributes and categorical attributes separately?Clustering customer data stored in ElasticSearchNumerical data and different algorithmsBest approach for this unsupervised clustering problem with categorical data?Clustering big data by reducing data accuracy?Mixed geospatial and categorical clusteringWhat kind of classification should I use?How to choose the optimal k in k-protoypes?Applying machine learning algorithms to subset of attributes in dataframeSegmenting pandas dataframe with lists as elementsMixed types of data for clustering

Unable to get dependencies from jcenter with a new project

Does grappling negate Mirror Image?

Do we have to expect a queue for the shuttle from Watford Junction to Harry Potter Studio?

What does Apple's new App Store requirement mean

What are some good ways to treat frozen vegetables such that they behave like fresh vegetables when stir frying them?

How does electrical safety system work on ISS?

What to do when eye contact makes your coworker uncomfortable?

Confused about Cramer-Rao lower bound and CLT

How to make money from a browser who sees 5 seconds into the future of any web page?

Can you use Vicious Mockery to win an argument or gain favours?

Pre-mixing cryogenic fuels and using only one fuel tank

How can I write humor as character trait?

A Trivial Diagnosis

Why is it that I can sometimes guess the next note?

Microchip documentation does not label CAN buss pins on micro controller pinout diagram

Temporarily disable WLAN internet access for children, but allow it for adults

What does "Scientists rise up against statistical significance" mean? (Comment in Nature)

Giving feedback to someone without sounding prejudiced

Multiplicative persistence

Is this part of the description of the Archfey warlock's Misty Escape feature redundant?

How much of a Devil Fruit must be consumed to gain the power?

Which Article Helped Get Rid of Technobabble in RPGs?

How can ping know if my host is down

What's the name of the logical fallacy where a debater extends a statement far beyond the original statement to make it true?

Is it wrong if I cluster numerical attributes and categorical attributes separately?

Clustering customer data stored in ElasticSearchNumerical data and different algorithmsBest approach for this unsupervised clustering problem with categorical data?Clustering big data by reducing data accuracy?Mixed geospatial and categorical clusteringWhat kind of classification should I use?How to choose the optimal k in k-protoypes?Applying machine learning algorithms to subset of attributes in dataframeSegmenting pandas dataframe with lists as elementsMixed types of data for clustering

I have a dataset of credit customers containing mixed data types (numerical and categorical with several levels). I am trying to perform segmentation so that I can end up with k groups and then build definitions (based on attributes I have).

While there are solutions for clustering data with mixed data types (K-prototypes, hierarchical clustering with Gower's distance), why would it be wrong to cluster numerical attributes and categorical attributes separately and come up with definitions individually?

asked Mar 18 at 6:52

Rohit Gavval

617

$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04

$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19

$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58

add a comment |

asked Mar 18 at 6:52

Rohit Gavval

617

$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04

$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19

$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58

add a comment |

asked Mar 18 at 6:52

Rohit Gavval

617

clustering

asked Mar 18 at 6:52

Rohit Gavval

617

asked Mar 18 at 6:52

Rohit Gavval

617

asked Mar 18 at 6:52

Rohit Gavval

617

asked Mar 18 at 6:52

Rohit Gavval

617

asked Mar 18 at 6:52

Rohit Gavval

617

$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04

$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19

$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58

add a comment |

$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04

$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19

$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58

Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).

– Ashish
Mar 18 at 10:04

It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.

– Rohit Gavval
Mar 18 at 13:19

Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.

– Ashish
Mar 18 at 14:58

add a comment |

2 Answers
2

active

oldest

votes

There is nothing wrong with not using all attributes. In fact there are subspace clustering approaches that attempt to identify (partially) informative attributes along with clusters (but mostly for continuous variables).

On your data, you will have big data preparation issues, that would need careful weighting and nonlinear transformations. So it probably is a good idea to first try to understand each attribute before you go into any combinations.

Also bear in mind that a clustering never is correct or "optimal". A successful clustering is one that gave you a new insight. Any means that lead to verifiable insights is okay! Just don't assume that you could automate this.

answered Mar 18 at 18:50

Anony-Mousse

5,010624

$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago

add a comment |

Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:

effect of categorical numerical separation on clustering

If we cluster only on the categorical feature, clusters C1 and C2 would be merged. If we cluster only on the numerical feature, all three clusters would be merged. Therefore, clusters C1 and C2 could not be found separately.

As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.

edited 2 days ago

answered Mar 18 at 10:54

Esmailian

1,641114

$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47497%2fis-it-wrong-if-i-cluster-numerical-attributes-and-categorical-attributes-separat%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

answered Mar 18 at 18:50

Anony-Mousse

5,010624

$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago

add a comment |

answered Mar 18 at 18:50

Anony-Mousse

5,010624

$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago

add a comment |

answered Mar 18 at 18:50

Anony-Mousse

5,010624

answered Mar 18 at 18:50

Anony-Mousse

5,010624

answered Mar 18 at 18:50

Anony-Mousse

5,010624

answered Mar 18 at 18:50

Anony-Mousse

5,010624

answered Mar 18 at 18:50

Anony-Mousse

5,010624

$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago

add a comment |

$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago

I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.

– Rohit Gavval
2 days ago

Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?

– Anony-Mousse
2 days ago

add a comment |

Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:

effect of categorical numerical separation on clustering

As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.

edited 2 days ago

answered Mar 18 at 10:54

Esmailian

1,641114

$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago

add a comment |

Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:

effect of categorical numerical separation on clustering

As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.

edited 2 days ago

answered Mar 18 at 10:54

Esmailian

1,641114

$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago

add a comment |

Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:

effect of categorical numerical separation on clustering

As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.

edited 2 days ago

answered Mar 18 at 10:54

Esmailian

1,641114

Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:

effect of categorical numerical separation on clustering

As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.

edited 2 days ago

answered Mar 18 at 10:54

Esmailian

1,641114

edited 2 days ago

answered Mar 18 at 10:54

Esmailian

1,641114

answered Mar 18 at 10:54

Esmailian

1,641114

answered Mar 18 at 10:54

Esmailian

1,641114

$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago

add a comment |

$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago

$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago

Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?

– Rohit Gavval
2 days ago

@RohitGavval I updated the answer

– Esmailian
2 days ago

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

DRU,829QL

搜尋此網誌

Trjtdtk

2 Answers
2

Your Answer

Post as a guest

2 Answers
2

2 Answers
2

Post as a guest

Popular posts from this blog

Tähtien Talli Jäsenet | Lähteet | NavigointivalikkoSuomen Hippos – Tähtien Talli

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

2 Answers 2

2 Answers 2

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Tähtien Talli Jäsenet | Lähteet | NavigointivalikkoSuomen Hippos – Tähtien Talli

2 Answers
2

2 Answers
2

2 Answers
2