How to validate clusters after calculating Gower distances and Ward's clustering in R



$begingroup$


I am trying to apply Ward's clustering to a mixed-type dataset. I want to explain what I did (it may be helpful to others) and ask some questions about the analysis, mainly how to validate my clusters.



So, let me explain what I did in detail:



I started with a dataset containing 53 variables, each either numerical or binary. The first variable contains the participant number, so it is not used in the clustering. I coded the categorical data directly as binary variables so that I didn't have to transform them.



Then I checked all numerical variables for normality using the Shapiro-Wilk test. Variables with a p-value below 0.05 are log-transformed in the next step. Binary variables don't need a normality check.
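The normality screening described above could be sketched roughly as follows. This is only an illustration on simulated data: the data frame `toy`, its column names, and the helper `is_continuous` are hypothetical and do not come from the original dataset.

```r
# Toy data standing in for `mydata` (hypothetical values)
set.seed(1)
toy <- data.frame(
  id     = 1:50,                        # participant number, not a feature
  normal = rnorm(50, mean = 10, sd = 2),
  skewed = rexp(50, rate = 1),
  flag   = rbinom(50, 1, 0.3)           # binary, so no normality check
)

# Treat a column as continuous if it is numeric with more than two distinct values
is_continuous <- function(x) is.numeric(x) && length(unique(na.omit(x))) > 2
num_cols <- names(toy)[vapply(toy, is_continuous, logical(1))]
num_cols <- setdiff(num_cols, "id")

# Shapiro-Wilk p-value for each continuous column
sw_p <- vapply(toy[num_cols], function(x) shapiro.test(x)$p.value, numeric(1))

# Columns for which normality is rejected at the 0.05 level
non_normal <- names(sw_p)[sw_p < 0.05]
print(round(sw_p, 4))
print(non_normal)
```

Note that with many variables and moderate sample sizes, the test tends to reject normality even for mild departures, so the resulting list is best read as a screening aid rather than a rule.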



I calculated the distances using the daisy() function from the cluster package. Here, the variables with a Shapiro-Wilk p-value below 0.05 in the second step were log-transformed via the 'logratio' type; the other variables were designated as asymmetric binary or symmetric binary. I read that Gower distance is an appropriate metric for mixed data types, which is why I carried out this step this way.



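library(cluster)  # provides daisy()

# Column indices below refer to mydata[, -1], i.e. after dropping the participant number
gower_dist <- daisy(
  mydata[, -1],
  metric = "gower",
  type = list(
    logratio = c(4,5,6,7,8,9,10,11,12,13,17,24,25,26,44,45,49,50,51,52),
    asymm = c(14,15,16,18,19,20,21,23,27,28,29,30,31,32,34,35,36,37,38,39,40,41,42,43,46,47,48),
    symm = c(1,3)
  )
)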
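To see what this call does on a small scale, here is a hedged sketch on a five-row toy data frame. The column names and values are invented for illustration. The example also includes one missing value: as far as the daisy() documentation describes, a variable that is missing for either member of a pair is simply left out of that pair's dissimilarity, which is relevant to the missing-data question in this post.

```r
library(cluster)  # provides daisy()

# Hypothetical toy data: one skewed positive numeric column and two 0/1 columns
toy <- data.frame(
  income = c(1200, 1500, 90000, 2000, NA),
  smoker = c(1, 0, 0, 1, 0),  # treated as asymmetric binary (shared 0s are ignored)
  female = c(0, 1, 1, 0, 1)   # treated as symmetric binary
)

d <- daisy(
  toy,
  metric = "gower",
  type = list(logratio = 1, asymm = 2, symm = 3)
)

print(d)
summary(d)
```

Row 5 has no income value, yet its dissimilarities to the other rows are still defined, because they are computed from the remaining variables only.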


Then I used Ward's method for clustering and plotted a dendrogram of the results.



hc1 <- hclust(gower_dist, method = "ward.D")  # "ward" was renamed "ward.D" in R 3.1.0
plot(hc1)  # display dendrogram


This resulted in the following output:



[Figure: cluster dendrogram]



I think I am heading in the right direction with my analysis, but I have trouble finding the right number of clusters. I wanted to use pvclust() because it provides p-values for hierarchical clustering, which is what I am interested in. However, it seems that this package cannot be used with Gower distances. Does anyone know another way to obtain p-values in R for Ward's clusters based on the Gower distance?
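Silhouette widths are not p-values, but they are one widely used internal validity measure that works directly on a dissimilarity object, so they can be computed from Gower distances. The following is a minimal sketch under stated assumptions: the data are simulated, the names `toy`, `ks`, and `avg_sil` are hypothetical, and the range of k is arbitrary.

```r
library(cluster)  # daisy(), silhouette()

set.seed(42)
# Hypothetical toy data with two loosely separated groups
toy <- data.frame(
  x1 = c(rnorm(20, 0), rnorm(20, 4)),
  x2 = c(rnorm(20, 0), rnorm(20, 4)),
  b1 = c(rbinom(20, 1, 0.2), rbinom(20, 1, 0.8))
)

d  <- daisy(toy, metric = "gower", type = list(symm = 3))
hc <- hclust(d, method = "ward.D")

# Average silhouette width for k = 2..6 clusters cut from the same tree
ks <- 2:6
avg_sil <- vapply(ks, function(k) {
  labels <- cutree(hc, k = k)
  mean(silhouette(labels, d)[, "sil_width"])
}, numeric(1))
names(avg_sil) <- ks

print(round(avg_sil, 3))
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

A higher average silhouette width suggests better-separated clusters; it gives a relative comparison between values of k rather than a significance test.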



Besides this, I have some small doubts about the correctness of my analysis:



  • I read in https://www.researchgate.net/publication/223532418_Hierarchical_clustering_of_mixed_data_based_on_distance_hierarchy that using binary variables can be problematic. I assumed that with a high-dimensional dataset this would not be a large problem; is that true?

  • According to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4120293/, log-transforming data does not necessarily make it less variable or more normal and may, in some circumstances, make it more variable and more skewed. Is it still wise to log-transform the variables? Also, I think that Ward's method is a sensible choice for high-dimensional datasets; is that true?

  • I wanted to use 'Partial data cluster analysis' to deal with missing values in the data (as explained in https://www.displayr.com/5-ways-deal-missing-data-cluster-analysis/), but I am not sure how missing data were handled in the analysis as I carried it out. This also doesn't seem to be documented.
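On the log-transformation question in the second bullet, one way to check whether the transform actually helps for a given variable is to compare the Shapiro-Wilk W statistic before and after. This is a sketch on simulated data: the variable `x` and its distribution are assumptions, not taken from the dataset above.

```r
set.seed(7)
# Hypothetical strictly positive, right-skewed variable (log() needs values > 0)
x <- rlnorm(100, meanlog = 0, sdlog = 1)

w_raw <- shapiro.test(x)$statistic       # W closer to 1 means closer to normal
w_log <- shapiro.test(log(x))$statistic

# Keep the log transform only if it moves W towards 1
use_log <- unname(w_log > w_raw)
print(c(raw = unname(w_raw), log = unname(w_log)))
print(use_log)
```

Applied per variable, this keeps the transform only where it demonstrably improves the fit to normality, instead of applying it to every variable that fails the test.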









$endgroup$











clustering






asked Mar 31 at 18:04









ItK












  • $begingroup$
    Ward's criterion really relies on being used with squared Euclidean distances. Using it with Gower distance supposedly is improper.
    $endgroup$
    – Anony-Mousse
    Mar 31 at 19:32










  • $begingroup$
    Also, I don't think a normality test is sufficient reason to do a log transformation - that does not necessarily make the data "more normal". I'd only use log transforms if this improves normality (and consider further alternatives from the Box-Cox family and others).
    $endgroup$
    – Anony-Mousse
    Mar 31 at 19:36










  • $begingroup$
    Thanks for your answer. Would you recommend using Euclidean distances instead, or would you recommend a different clustering algorithm?
    $endgroup$
    – ItK
    Apr 1 at 9:06










  • $begingroup$
    Probably a different algorithm. Depends on your needs.
    $endgroup$
    – Anony-Mousse
    Apr 1 at 15:09
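The comments above do not name a specific alternative algorithm. Purely as an illustration, one method that accepts an arbitrary dissimilarity object is partitioning around medoids, available as pam() in the cluster package. The sketch below uses simulated data, and the choice of k = 2 is an assumption made for the example.

```r
library(cluster)  # daisy(), pam()

set.seed(3)
# Hypothetical toy data: one numeric and one binary variable, two groups
toy <- data.frame(
  x1 = c(rnorm(15, 0), rnorm(15, 5)),
  b1 = c(rbinom(15, 1, 0.1), rbinom(15, 1, 0.9))
)
d <- daisy(toy, metric = "gower", type = list(symm = 2))

# diss = TRUE tells pam() that `d` is already a dissimilarity object
fit <- pam(d, k = 2, diss = TRUE)

print(table(fit$clustering))      # cluster sizes
print(fit$silinfo$avg.width)      # average silhouette width of this partition
```

Whether this suits a given analysis depends on the needs mentioned in the last comment, for example whether a hierarchy is required or a flat partition is enough.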















