Is it wrong if I cluster numerical attributes and categorical attributes separately?Clustering customer data stored in ElasticSearchNumerical data and different algorithmsBest approach for this unsupervised clustering problem with categorical data?Clustering big data by reducing data accuracy?Mixed geospatial and categorical clusteringWhat kind of classification should I use?How to choose the optimal k in k-protoypes?Applying machine learning algorithms to subset of attributes in dataframeSegmenting pandas dataframe with lists as elementsMixed types of data for clustering
Unable to get dependencies from jcenter with a new project
Does grappling negate Mirror Image?
Do we have to expect a queue for the shuttle from Watford Junction to Harry Potter Studio?
What does Apple's new App Store requirement mean
What are some good ways to treat frozen vegetables such that they behave like fresh vegetables when stir frying them?
How does electrical safety system work on ISS?
What to do when eye contact makes your coworker uncomfortable?
Confused about Cramer-Rao lower bound and CLT
How to make money from a browser who sees 5 seconds into the future of any web page?
Can you use Vicious Mockery to win an argument or gain favours?
Pre-mixing cryogenic fuels and using only one fuel tank
How can I write humor as character trait?
A Trivial Diagnosis
Why is it that I can sometimes guess the next note?
Microchip documentation does not label CAN buss pins on micro controller pinout diagram
Temporarily disable WLAN internet access for children, but allow it for adults
What does "Scientists rise up against statistical significance" mean? (Comment in Nature)
Giving feedback to someone without sounding prejudiced
Multiplicative persistence
Is this part of the description of the Archfey warlock's Misty Escape feature redundant?
How much of a Devil Fruit must be consumed to gain the power?
Which Article Helped Get Rid of Technobabble in RPGs?
How can ping know if my host is down
What's the name of the logical fallacy where a debater extends a statement far beyond the original statement to make it true?
Is it wrong if I cluster numerical attributes and categorical attributes separately?
Clustering customer data stored in ElasticSearchNumerical data and different algorithmsBest approach for this unsupervised clustering problem with categorical data?Clustering big data by reducing data accuracy?Mixed geospatial and categorical clusteringWhat kind of classification should I use?How to choose the optimal k in k-protoypes?Applying machine learning algorithms to subset of attributes in dataframeSegmenting pandas dataframe with lists as elementsMixed types of data for clustering
$begingroup$
I have a dataset of credit customers containing mixed data types (numerical and categorical with several levels). I am trying to perform segmentation so that I can end up with k groups and then build definitions (based on attributes I have).
While there are solutions for clustering data with mixed data types (K-prototypes, hierarchical clustering with Gower's distance), why would it be wrong to cluster numerical attributes and categorical attributes separately and come up with definitions individually?
clustering
$endgroup$
add a comment |
$begingroup$
I have a dataset of credit customers containing mixed data types (numerical and categorical with several levels). I am trying to perform segmentation so that I can end up with k groups and then build definitions (based on attributes I have).
While there are solutions for clustering data with mixed data types (K-prototypes, hierarchical clustering with Gower's distance), why would it be wrong to cluster numerical attributes and categorical attributes separately and come up with definitions individually?
clustering
$endgroup$
$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04
$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19
$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58
add a comment |
$begingroup$
I have a dataset of credit customers containing mixed data types (numerical and categorical with several levels). I am trying to perform segmentation so that I can end up with k groups and then build definitions (based on attributes I have).
While there are solutions for clustering data with mixed data types (K-prototypes, hierarchical clustering with Gower's distance), why would it be wrong to cluster numerical attributes and categorical attributes separately and come up with definitions individually?
clustering
$endgroup$
I have a dataset of credit customers containing mixed data types (numerical and categorical with several levels). I am trying to perform segmentation so that I can end up with k groups and then build definitions (based on attributes I have).
While there are solutions for clustering data with mixed data types (K-prototypes, hierarchical clustering with Gower's distance), why would it be wrong to cluster numerical attributes and categorical attributes separately and come up with definitions individually?
clustering
clustering
asked Mar 18 at 6:52
Rohit GavvalRohit Gavval
617
617
$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04
$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19
$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58
add a comment |
$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04
$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19
$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58
$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04
$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04
$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19
$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19
$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58
$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58
add a comment |
2 Answers
2
active
oldest
votes
$begingroup$
There is nothing wrong with not using all attributes. In fact there are subspace clustering approaches that attempt to identify (partially) informative attributes along with clusters (but mostly for continuous variables).
On your data, you will have big data preparation issues, that would need careful weighting and nonlinear transformations. So it probably is a good idea to first try to understand each attribute before you go into any combinations.
Also bear in mind that a clustering never is correct or "optimal". A successful clustering is one that gave you a new insight. Any means that lead to verifiable insights is okay! Just don't assume that you could automate this.
$endgroup$
$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago
add a comment |
$begingroup$
Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:
If we cluster only on the categorical feature, clusters C1 and C2 would be merged. If we cluster only on the numerical feature, all three clusters would be merged. Therefore, clusters C1 and C2 could not be found separately.
As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.
$endgroup$
$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47497%2fis-it-wrong-if-i-cluster-numerical-attributes-and-categorical-attributes-separat%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
There is nothing wrong with not using all attributes. In fact there are subspace clustering approaches that attempt to identify (partially) informative attributes along with clusters (but mostly for continuous variables).
On your data, you will have big data preparation issues, that would need careful weighting and nonlinear transformations. So it probably is a good idea to first try to understand each attribute before you go into any combinations.
Also bear in mind that a clustering never is correct or "optimal". A successful clustering is one that gave you a new insight. Any means that lead to verifiable insights is okay! Just don't assume that you could automate this.
$endgroup$
$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago
add a comment |
$begingroup$
There is nothing wrong with not using all attributes. In fact there are subspace clustering approaches that attempt to identify (partially) informative attributes along with clusters (but mostly for continuous variables).
On your data, you will have big data preparation issues, that would need careful weighting and nonlinear transformations. So it probably is a good idea to first try to understand each attribute before you go into any combinations.
Also bear in mind that a clustering never is correct or "optimal". A successful clustering is one that gave you a new insight. Any means that lead to verifiable insights is okay! Just don't assume that you could automate this.
$endgroup$
$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago
add a comment |
$begingroup$
There is nothing wrong with not using all attributes. In fact there are subspace clustering approaches that attempt to identify (partially) informative attributes along with clusters (but mostly for continuous variables).
On your data, you will have big data preparation issues, that would need careful weighting and nonlinear transformations. So it probably is a good idea to first try to understand each attribute before you go into any combinations.
Also bear in mind that a clustering never is correct or "optimal". A successful clustering is one that gave you a new insight. Any means that lead to verifiable insights is okay! Just don't assume that you could automate this.
$endgroup$
There is nothing wrong with not using all attributes. In fact there are subspace clustering approaches that attempt to identify (partially) informative attributes along with clusters (but mostly for continuous variables).
On your data, you will have big data preparation issues, that would need careful weighting and nonlinear transformations. So it probably is a good idea to first try to understand each attribute before you go into any combinations.
Also bear in mind that a clustering never is correct or "optimal". A successful clustering is one that gave you a new insight. Any means that lead to verifiable insights is okay! Just don't assume that you could automate this.
answered Mar 18 at 18:50
Anony-MousseAnony-Mousse
5,010624
5,010624
$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago
add a comment |
$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago
$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
I am already looking at feature selection. Considering I am looking at most informative features, I would like to understand what will I be missing if I separate the two types (numerical and categorical) variables and cluster them individually.
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago
$begingroup$
Same as if you only select numeric or only categoricial features by feature selection... But you do understand that feature selection best works supervised? Because otherwise you don't really know what you will be missing if you drop a feature?
$endgroup$
– Anony-Mousse
2 days ago
add a comment |
$begingroup$
Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:
If we cluster only on the categorical feature, clusters C1 and C2 would be merged. If we cluster only on the numerical feature, all three clusters would be merged. Therefore, clusters C1 and C2 could not be found separately.
As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.
$endgroup$
$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago
add a comment |
$begingroup$
Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:
If we cluster only on the categorical feature, clusters C1 and C2 would be merged. If we cluster only on the numerical feature, all three clusters would be merged. Therefore, clusters C1 and C2 could not be found separately.
As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.
$endgroup$
$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago
add a comment |
$begingroup$
Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:
If we cluster only on the categorical feature, clusters C1 and C2 would be merged. If we cluster only on the numerical feature, all three clusters would be merged. Therefore, clusters C1 and C2 could not be found separately.
As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.
$endgroup$
Generally, clustering on separate categorical and numerical features is wrong since it could lead to merging the otherwise separate clusters. Here is a visual example of why this may fail:
If we cluster only on the categorical feature, clusters C1 and C2 would be merged. If we cluster only on the numerical feature, all three clusters would be merged. Therefore, clusters C1 and C2 could not be found separately.
As a side note, this blind separation is different than a careful feature selection (mentioned in this answer) which could end up with both categorical and numerical features.
edited 2 days ago
answered Mar 18 at 10:54
EsmailianEsmailian
1,641114
1,641114
$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago
add a comment |
$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago
$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
Thanks for the visual example. The plot seems to represent the distribution of data with both numerical and categorical variables. If I separate the two types of variables, wouldn't the plots also be different? And if that is true, any partition that exists in the data would still show up. Am I missing something here?
$endgroup$
– Rohit Gavval
2 days ago
$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago
$begingroup$
@RohitGavval I updated the answer
$endgroup$
– Esmailian
2 days ago
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47497%2fis-it-wrong-if-i-cluster-numerical-attributes-and-categorical-attributes-separat%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
Question, why do you want to group the variables separately? I don't think it's wrong as long as the variables in k groups are significant (statistically speaking).
$endgroup$
– Ashish
Mar 18 at 10:04
$begingroup$
It's a task (read challenge) to cluster mixed data. If I can cluster numerical and categorical variables separately and come up with definitions separately, it will make things easy or even possible in some places. I can even have k=x for categorical dataset and k=y for numerical dataset. Surely there must be a problem or constraints here and I want to know what they are.
$endgroup$
– Rohit Gavval
Mar 18 at 13:19
$begingroup$
Think about it. Real world data is often mixed in terms of data types and together it makes sense. For example, flight numbers (continuous) and passengers on board the flight. Passengers can be construed as name, sex, meal preference etc (categorical). I think it boils down to the question you want to answer. If you just want to find out about flights OR people then separate the variables, ELSE group them.
$endgroup$
– Ashish
Mar 18 at 14:58