How to validate clusters after calculating Gower distances and Ward's clustering in R
$begingroup$
I am trying to apply Ward's clustering to a mixed-type dataset. I want to explain what I did (it may be helpful to others) and ask some questions about the analysis, mainly how to validate my clusters.
Here is what I did, in detail:
I started with a dataset containing 53 variables, each either numerical or binary. The first variable contains the participant number, so it is not used in the clustering. I coded the categorical data directly as binary variables so that I didn't have to transform them.
Then I checked all numerical variables for normality using the Shapiro-Wilk test. Variables with a p-value below 0.05 are log-transformed in the next step. Binary variables don't need a normality check.
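The screening step described above could look roughly like the following sketch. The data frame `toy` and its column names are invented for illustration; only `shapiro.test()` from base R is used.

```r
set.seed(1)

# Hypothetical stand-in for the real data (column names are invented)
toy <- data.frame(
  id     = 1:50,
  income = rlnorm(50, meanlog = 3, sdlog = 1),  # right-skewed
  height = rnorm(50, mean = 170, sd = 10),      # roughly symmetric
  smoker = rbinom(50, size = 1, prob = 0.3)     # binary, not tested
)

# Drop the id column, then keep numeric columns with more than two distinct values
candidates <- toy[, -1]
is_continuous <- vapply(
  candidates,
  function(x) is.numeric(x) && length(unique(x)) > 2,
  logical(1)
)

# Shapiro-Wilk p-value for each continuous column
p_values <- vapply(
  candidates[is_continuous],
  function(x) shapiro.test(x)$p.value,
  numeric(1)
)

# Variables that would be flagged under the 0.05 rule
flagged <- names(p_values)[p_values < 0.05]
print(round(p_values, 4))
print(flagged)
```

Selecting columns by a rule like this, instead of by hand, keeps the list of transformed variables reproducible if the data change.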
I calculated the distances using the daisy() function from the cluster package. Here, I log-transformed the variables that had a p-value below 0.05 in the previous step by declaring them as 'logratio'; the other variables were declared as asymmetric binary or symmetric binary. I read that Gower distance is an appropriate metric for mixed data types, which is why I carried out this step this way.
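    library(cluster)  # provides daisy()

    # Gower dissimilarities on all columns except the participant number.
    # Column numbers in `type` count columns of mydata[, -1], i.e. after dropping column 1.
    gower_dist <- daisy(
      mydata[, -1],
      metric = "gower",
      type = list(
        logratio = c(4,5,6,7,8,9,10,11,12,13,17,24,25,26,44,45,49,50,51,52),
        asymm = c(14,15,16,18,19,20,21,23,27,28,29,30,31,32,34,35,36,37,38,39,40,41,42,43,46,47,48),
        symm = c(1,3)
      )
    )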
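As a small illustration of how the `type` argument is interpreted, here is a sketch on invented data. `daisy()` accepts column names as well as column numbers in `type`, and the numbers count columns of the object actually passed in (here `toy[, -1]`), not of the original data frame. The data frame and its column names are assumptions made for the example.

```r
library(cluster)
set.seed(42)

# Hypothetical data: an id column followed by mixed-type variables
toy <- data.frame(
  id      = 1:20,
  sex     = rbinom(20, 1, 0.5),   # symmetric binary
  income  = rlnorm(20, 3, 1),     # positive, skewed: log-transformed by daisy
  symptom = rbinom(20, 1, 0.2),   # asymmetric binary
  age     = rnorm(20, 40, 10)     # plain interval-scaled
)

x <- toy[, -1]  # drop id; "sex" is now column 1 of x

# Same specification, once by name and once by position within x
d_names <- daisy(x, metric = "gower",
                 type = list(symm = "sex", logratio = "income", asymm = "symptom"))
d_index <- daisy(x, metric = "gower",
                 type = list(symm = 1, logratio = 2, asymm = 3))

print(all.equal(as.vector(d_names), as.vector(d_index)))
print(summary(d_names))
```

Using names avoids off-by-one mistakes when a column is dropped before the call.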
Then I used Ward's method for clustering and plotted a dendrogram of the result.
    # "ward" was renamed to "ward.D" in R 3.1.0; the result is the same
    hc1 <- hclust(gower_dist, method = "ward.D")
    plot(hc1)  # display dendrogram
This resulted in the following output:
I think I am heading in the right direction with my analysis, but I am having trouble finding the right number of clusters. I wanted to use pvclust() since it provides p-values for hierarchical clustering, which is what I am interested in. However, it seems that this package cannot be used with Gower distances. Does anyone know another way to obtain p-values in R for Ward's clusters based on the Gower distance?
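One option that works directly on a dissimilarity matrix, although it yields a descriptive index rather than p-values, is the average silhouette width. The sketch below uses invented data and average linkage (swapped in for Ward's method, chosen only for illustration); `daisy()` and `silhouette()` come from the cluster package.

```r
library(cluster)
set.seed(7)

# Hypothetical mixed data with two loosely separated groups
toy <- data.frame(
  a = c(rnorm(15, 0), rnorm(15, 5)),
  b = c(rbinom(15, 1, 0.1), rbinom(15, 1, 0.9)),
  c = c(rlnorm(15, 0), rlnorm(15, 2))
)

d  <- daisy(toy, metric = "gower", type = list(symm = "b", logratio = "c"))
hc <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  labels <- cutree(hc, k = k)
  mean(silhouette(labels, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]
print(data.frame(k = ks, avg_sil = round(avg_sil, 3)))
print(best_k)
```

A higher average width suggests tighter, better separated clusters under the chosen dissimilarity; it does not by itself say whether the structure is statistically significant.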
Besides this, I have some doubts about the correctness of my analysis:
- I read in https://www.researchgate.net/publication/223532418_Hierarchical_clustering_of_mixed_data_based_on_distance_hierarchy that using binary variables can be problematic. I assumed that with a high-dimensional dataset this would not be a large problem. Is that true?
- According to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/, log-transforming data does not help make it less variable or more normal and may, in some circumstances, make it more variable and more skewed. Is it still wise to log-transform the variables? Also, I think Ward's method is a sensible choice for high-dimensional datasets. Is that true?
- I wanted to use 'partial data cluster analysis' to deal with missing values in the data (as explained in https://www.displayr.com/5-ways-deal-missing-data-cluster-analysis/), but I am not sure how missing data were handled in the analysis as I carried it out. This also doesn't seem to be documented.
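On the missing-data point: the Details section of `?daisy` states that missing values in a row are left out of the dissimilarities involving that row, so each pairwise Gower dissimilarity is averaged over the variables observed in both rows. A tiny sketch with invented numbers shows the effect:

```r
library(cluster)

# Three hypothetical observations; row 2 is missing y
toy <- data.frame(
  x = c(0, 10, 5),
  y = c(0, NA, 4)
)

d <- daisy(toy, metric = "gower")
print(d)

# Each variable contributes |difference| / range (range of x is 10, of y is 4).
# Pairs involving row 2 use x only: (1,2) -> 10/10 = 1, (2,3) -> 5/10 = 0.5
# Pair (1,3) averages both variables: (5/10 + 4/4) / 2 = 0.75
```

No distance is missing here because every pair shares at least one observed variable; a pair with no variable in common would get `NA`.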
clustering
$endgroup$
$begingroup$
Ward's criterion really relies on being used with squared Euclidean distances. Using it with Gower distance supposedly is improper.
$endgroup$
– Anony-Mousse
Mar 31 at 19:32
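For context on this comment: base R's `hclust()` offers two Ward variants. According to its documentation, `"ward.D2"` squares the dissimilarities before cluster updating and implements Ward's criterion, while `"ward.D"` uses them as given. A sketch on invented Euclidean data shows how the two relate:

```r
set.seed(3)

# Hypothetical purely numeric data, so Euclidean distance is appropriate
x <- matrix(rnorm(40 * 2), ncol = 2)
d <- dist(x)

hc_d2 <- hclust(d,   method = "ward.D2")  # squares d internally
hc_sq <- hclust(d^2, method = "ward.D")   # squared distances supplied by hand

# Same tree: identical partitions, heights differ only by a square root
same_partition <- identical(cutree(hc_d2, k = 3), cutree(hc_sq, k = 3))
same_heights   <- isTRUE(all.equal(hc_d2$height, sqrt(hc_sq$height)))
print(c(same_partition = same_partition, same_heights = same_heights))
```

Either variant will run on a Gower dissimilarity, but the variance-minimising interpretation of Ward's criterion is tied to Euclidean geometry, which is the concern the comment raises.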
$begingroup$
Also, I don't think a normality test is sufficient reason to do a log transformation; that does not necessarily make the data "more normal". I'd only use log transforms if they improve normality (and I'd consider further alternatives from the Box-Cox family and others).
$endgroup$
– Anony-Mousse
Mar 31 at 19:36
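One way to act on this comment is to compare a normality measure before and after the transformation, rather than transforming on the basis of the test alone. The sketch below uses the Shapiro-Wilk W statistic (closer to 1 means closer to normal) on invented data; the variable names are hypothetical.

```r
set.seed(5)

# Hypothetical positive-valued variables
toy <- data.frame(
  skewed    = rlnorm(60, meanlog = 2, sdlog = 0.8),
  symmetric = rnorm(60, mean = 50, sd = 5)
)

# Shapiro-Wilk W statistic as a plain number
w_stat <- function(x) unname(shapiro.test(x)$statistic)

comparison <- data.frame(
  variable = names(toy),
  W_raw    = vapply(toy, w_stat, numeric(1)),
  W_log    = vapply(toy, function(x) w_stat(log(x)), numeric(1)),
  row.names = NULL
)

# Transform only where the log actually moves W closer to 1
comparison$log_helps <- comparison$W_log > comparison$W_raw
print(comparison)
```

The same comparison could be repeated for other candidate transformations, such as members of the Box-Cox family mentioned in the comment.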
$begingroup$
Thanks for your answer. Would you recommend using Euclidean distances instead, or would you recommend a different clustering algorithm?
$endgroup$
– ItK
Apr 1 at 9:06
$begingroup$
Probably a different algorithm. Depends on your needs.
$endgroup$
– Anony-Mousse
Apr 1 at 15:09
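The comment does not name an alternative. Purely as one illustration, and not necessarily what the commenter had in mind, partitioning around medoids accepts an arbitrary dissimilarity matrix, so it can be run on Gower dissimilarities. The data and column names below are invented; `daisy()` and `pam()` come from the cluster package.

```r
library(cluster)
set.seed(11)

# Hypothetical mixed data with two loosely separated groups
toy <- data.frame(
  score  = c(rnorm(15, 0), rnorm(15, 4)),
  flag   = c(rbinom(15, 1, 0.1), rbinom(15, 1, 0.9)),
  amount = c(rlnorm(15, 0), rlnorm(15, 2))
)

d <- daisy(toy, metric = "gower", type = list(symm = "flag", logratio = "amount"))

# PAM works on the dissimilarities alone; no Euclidean assumption is needed
fit <- pam(d, k = 2, diss = TRUE)

print(table(fit$clustering))     # cluster sizes
print(fit$id.med)                # row indices of the two medoids
print(fit$silinfo$avg.width)     # average silhouette width of this solution
```

The number of clusters `k = 2` is fixed here only to keep the example short; in practice it would be chosen by comparing several values.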
asked Mar 31 at 18:04 by ItK