How to validate clusters after calculating Gower distances and Ward's clustering in R


$begingroup$


I am trying to apply Ward's clustering to a mixed-type dataset. I want to explain what I did (it may be helpful to others), and I have some questions about the analysis, mainly how to validate my clusters.



So, let me explain what I did in detail:



I started with a dataset containing 53 variables, each either numerical or binary. The first variable contains the participant number, so it is not used in the clustering. I coded the categorical data directly as binary variables so that I didn't have to transform them later.



Then I checked all numerical variables for normality using the Shapiro-Wilk test. Variables with a p-value below 0.05 are log-transformed in the next step. Binary variables don't need a normality check.
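A quick way to see whether a log transform actually helps is to compare the Shapiro-Wilk W statistic before and after transforming. The sketch below uses synthetic data, not the dataset from this question; the variable names and the helper `compare_log` are made up for illustration.

```r
set.seed(1)

# Synthetic stand-ins for two numeric variables (assumption: not the real data)
x_skewed  <- rlnorm(200, meanlog = 0, sdlog = 1)  # right-skewed; log should help
x_uniform <- runif(200, min = 1, max = 2)         # non-normal, but not skewed

# Hypothetical helper: Shapiro-Wilk W on the raw and on the log-transformed values
compare_log <- function(x) {
  c(W_raw = unname(shapiro.test(x)$statistic),
    W_log = unname(shapiro.test(log(x))$statistic))
}

res_skewed  <- compare_log(x_skewed)
res_uniform <- compare_log(x_uniform)

# W closer to 1 means closer to normal; transform only where W_log > W_raw
print(round(rbind(skewed = res_skewed, uniform = res_uniform), 3))
```

A small p-value only says the variable is not normal; it does not say that a log transform is the right remedy, so a per-variable comparison like this is one possible check.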



I calculated the distances with the daisy() function from the cluster package. Here, the variables with a Shapiro-Wilk p-value below 0.05 in the second step were log-transformed via 'logratio', and the other variables were designated as asymmetric binary or symmetric binary. I read that Gower distance is an appropriate metric for mixed data types, which is why I carried out this step this way.



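library(cluster)  # provides daisy()

# Gower dissimilarities on all columns except the participant number (column 1).
# The indices in `type` refer to columns of mydata[, -1], not of mydata.
gower_dist <- daisy(
  mydata[, -1],
  metric = "gower",
  type = list(
    logratio = c(4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 17, 24, 25, 26, 44, 45, 49, 50, 51, 52),
    asymm = c(14, 15, 16, 18, 19, 20, 21, 23, 27, 28, 29, 30, 31, 32, 34, 35, 36, 37,
              38, 39, 40, 41, 42, 43, 46, 47, 48),
    symm = c(1, 3)
  )
)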


Then I used Ward's method for clustering and plotted a dendrogram of the results.



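# "ward" was renamed "ward.D" in R 3.1.0; "ward.D2" is the variant that implements Ward's criterion
hc1 <- hclust(gower_dist, method = "ward.D")
plot(hc1)  # display dendrogram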


This resulted in the following output:



[Figure: cluster dendrogram]



I think I am on the right track with my analysis, but I have trouble finding the right number of clusters. I wanted to use pvclust() because it provides p-values for hierarchical clustering, which is what I am interested in. However, it seems that this package cannot be used with Gower distances. Does anyone know another way to obtain p-values in R for Ward's clusters based on the Gower distance?
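For context, one validation approach that works with any dissimilarity, including Gower, is the average silhouette width. It does not give p-values, so it is a different kind of check than pvclust(). The sketch below uses a hypothetical toy data frame with two planted groups and assumes the "ward.D2" linkage; both are illustrative choices, and the linkage can be swapped.

```r
library(cluster)  # daisy(), silhouette()

set.seed(42)
n <- 60
group <- rep(1:2, each = n / 2)  # planted groups (assumption, toy data only)

toy <- data.frame(
  num1 = rnorm(n, mean = c(0, 4)[group]),
  num2 = rnorm(n, mean = c(10, 14)[group]),
  bin1 = factor(rbinom(n, 1, c(0.1, 0.9)[group]), levels = c(0, 1))
)

d  <- daisy(toy, metric = "gower")       # Gower dissimilarities
hc <- hclust(d, method = "ward.D2")      # linkage is an assumption; swap as needed

# Average silhouette width for each candidate number of clusters
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  labels <- cutree(hc, k = k)
  mean(silhouette(labels, d)[, "sil_width"])
})
names(avg_sil) <- ks

print(round(avg_sil, 3))
best_k <- ks[which.max(avg_sil)]  # candidate k with the highest average width
```

With real data, `d` would be the `gower_dist` object and `hc` the fitted `hclust` tree; the k with the largest average width is a candidate, not a proof of structure.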



In addition, I have some doubts about the correctness of my analysis:



  • I read in https://www.researchgate.net/publication/223532418_Hierarchical_clustering_of_mixed_data_based_on_distance_hierarchy that using binary variables can be problematic. I assumed that this would not be a large problem with a high-dimensional dataset. Is that true?

  • According to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/, log-transforming data does not help make data less variable or more normal and may, in some circumstances, make data more variable and more skewed. Is it still wise to log-transform the variables? Also, I assumed that Ward's method is a sensible choice for high-dimensional datasets. Is that true?

  • I wanted to use 'partial data cluster analysis' to deal with missing values (as explained in https://www.displayr.com/5-ways-deal-missing-data-cluster-analysis/), but I am not sure how missing data are handled in the analysis as I carried it out. I also could not find this documented.
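On the missing-data point: as far as the cluster package documentation describes it, daisy() leaves a variable out of the comparison for a given pair of rows when either value is missing, and averages over the remaining variables. A small check on a hypothetical data frame (values invented for illustration):

```r
library(cluster)  # daisy()

# Toy data with missing values (assumption: not the real dataset)
toy <- data.frame(
  age    = c(20, 30, NA, 40),
  income = c(1000, 2000, 3000, NA),
  smoker = factor(c("no", "yes", "no", "yes"))
)

d <- daisy(toy, metric = "gower")

# Each pairwise dissimilarity is averaged over the variables observed in both rows
d_mat <- as.matrix(d)
print(round(d_mat, 3))
```

For rows 1 and 3, age is missing in row 3, so only income (range-scaled difference 1) and smoker (match, 0) contribute, giving (1 + 0) / 2 = 0.5. A pair would only come out as NA if the two rows shared no observed variable.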


















$endgroup$











  • $begingroup$
    Ward's criterion really relies on being used with squared Euclidean distances. Using it with Gower distance supposedly is improper.
    $endgroup$
    – Anony-Mousse
    Mar 31 at 19:32










  • $begingroup$
    Also, I don't think a normality test is sufficient reason to do a log transformation - that does not necessarily make the data "more normal". I'd only use a log transform if it improves normality (and consider further alternatives from the Box-Cox family and others).
    $endgroup$
    – Anony-Mousse
    Mar 31 at 19:36










  • $begingroup$
    Thanks for your answer. Would you recommend using Euclidean distances instead, or would you recommend a different clustering algorithm?
    $endgroup$
    – ItK
    Apr 1 at 9:06










  • $begingroup$
    Probably a different algorithm. Depends on your needs.
    $endgroup$
    – Anony-Mousse
    Apr 1 at 15:09
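As an illustration of what "a different algorithm" could look like (one option among several, not necessarily what the commenter had in mind), partitioning around medoids works directly on a dissimilarity matrix and does not assume Euclidean geometry. A minimal sketch on hypothetical toy data:

```r
library(cluster)  # daisy(), pam()

set.seed(7)
n <- 40
group <- rep(1:2, each = n / 2)  # planted groups (assumption, toy data only)

toy <- data.frame(
  x    = rnorm(n, mean = c(0, 5)[group]),
  flag = factor(rbinom(n, 1, c(0.2, 0.8)[group]), levels = c(0, 1))
)

d   <- daisy(toy, metric = "gower")
fit <- pam(d, k = 2, diss = TRUE)  # k = 2 is chosen here only because the toy data has two groups

# Average silhouette width of the PAM solution and agreement with the planted groups
print(fit$silinfo$avg.width)
print(table(planted = group, cluster = fit$clustering))
```

With the real data, `d` would be `gower_dist`, and k would need to be chosen, for example by comparing average silhouette widths across several values.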















clustering
















asked Mar 31 at 18:04 by ItK
