How do I identify clusters that match on categorical data?
I am looking for direction on how to research a solution to this problem:
My company had all our employees take a "StrengthFinders" test, which assigns every employee their top five (ordered) "strengths" from a possible list of 34 strengths. We have 500 employees. I am supposed to identify all the employees who match each other on the same 5 strengths (order not important), and also employees who match each other on 4 out of 5 strengths (again, order doesn't matter). I could potentially have multiple groups matching on different sets of strengths, e.g.:
Group 1: Billy, Sally, Michael have strengths A, H, I, K, Z
Group 2: Bobby and Suzy have strengths A, B, L, S, W
For the case where strengths match on 4 out of 5, I might have the same people from Group 1 above, plus Joe, whose strengths are A, H, M, K, Z; and Seth, whose strengths are A, H, G, K, Z. I would expect more groupings for the 4-out-of-5 case than for the 5-out-of-5 case.
The strengths are categorical in nature, but what I've read so far has largely revolved around clustering of continuous numerical variables.
I am looking for an algorithmic way to identify clusters and the members of those clusters for this situation. I think I could do this by brute force, repeatedly sorting the data in Excel, but I'm confident that a better way must exist, and I ask you to point me in that direction. Thank you.
clustering categorical-data
asked Mar 28 at 16:10
wackojacko1997
4 Answers
You have just 500 data points, so this is a small problem.
Excel is about the worst possible tool for it, though.
Build a dictionary keyed by sets of strengths. Put everybody in there 6 times: once with all five strengths, and 5 times with one strength omitted. Then you can easily identify the largest groups, and you can also perform various completion operations: if you have identified a group with strengths A B C D E, you can use the dictionary to add everyone who has A B C D, and so on.
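The dictionary idea above can be sketched in Python roughly as follows. This is a minimal illustration, not the answerer's own code: the `employees` data, the function names `build_index` and `groups_of_size`, and the choice of `frozenset` keys are all assumptions made for the example. Using `frozenset` keys makes the lookup independent of the order in which strengths are listed.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: employee name -> their five strengths.
employees = {
    "Billy":   ["A", "H", "I", "K", "Z"],
    "Sally":   ["Z", "K", "I", "H", "A"],  # same set, listed in another order
    "Michael": ["A", "H", "I", "K", "Z"],
    "Bobby":   ["A", "B", "L", "S", "W"],
    "Suzy":    ["A", "B", "L", "S", "W"],
    "Joe":     ["A", "H", "M", "K", "Z"],
    "Seth":    ["A", "H", "G", "K", "Z"],
}

def build_index(people):
    """Map each 5-set and each 4-subset of strengths to the people who have it."""
    index = defaultdict(set)
    for name, strengths in people.items():
        full = frozenset(strengths)           # order-independent key
        index[full].add(name)                 # one entry with all five strengths
        for subset in combinations(sorted(full), 4):
            index[frozenset(subset)].add(name)  # five entries, one strength omitted
    return index

def groups_of_size(index, k):
    """Return {key: members} for keys of k strengths shared by two or more people."""
    return {key: members for key, members in index.items()
            if len(key) == k and len(members) > 1}

index = build_index(employees)
exact = groups_of_size(index, 5)  # people matching on all five strengths
near = groups_of_size(index, 4)   # people matching on at least four strengths

for key, members in sorted(exact.items(), key=lambda kv: -len(kv[1])):
    print(sorted(key), sorted(members))
```

With 500 employees the index holds at most 3,000 entries, so building it and scanning it for the largest groups is effectively instantaneous.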
answered Mar 28 at 19:30
Anony-Mousse
@wackojacko1997 I think this is the solution. Note that if each key is a string, the strengths need to be sorted alphabetically so that ABCD and CABD land in the same group.
– Esmailian
Mar 30 at 22:25
While I need to think about the coding with the dictionary a little bit, this answer does make sense to me. When I look at @QuantifiedMe's answer (which I perceive as essentially the same thing, but using prime numbers), I think I can use that even without coding (directly in Excel). I'm inclined to mark this as the answer, though, since it is the more general approach.
– wackojacko1997
Apr 3 at 20:20
Yes, he is suggesting the same thing, using a prime-factor coding instead of a string coding.
– Anony-Mousse
Apr 3 at 21:16
Okay, thank you. I am accepting this answer as the general case then.
– wackojacko1997
Apr 4 at 1:18
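Assign each of the 34 traits a unique prime number.
Compute the product of the 5 prime numbers for each person. Because prime factorization is unique, two people have the same product exactly when they have the same 5 traits, regardless of order.
Compare every person's product to find matches.
To find people who match on 4 out of 5 traits, compute the product for each subset of 4 of a person's 5 traits. There are 5 such subsets; numbering the person's traits 1 through 5, they are {1,2,3,4}, {1,2,3,5}, {1,2,4,5}, {1,3,4,5}, and {2,3,4,5}. Compare these values again to find the 4-out-of-5 matches.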
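As a rough sketch of the prime-product encoding described above, the following Python assigns the first 34 primes to placeholder trait names and computes the 5-trait and 4-trait products. The labels `S1` to `S34`, the sample `people`, and the helper names `first_primes`, `signature` and `four_signatures` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import prod  # available in Python 3.8+

def first_primes(n):
    """Return the first n primes by trial division (fine for n = 34)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Hypothetical trait labels "S1" .. "S34", each mapped to a distinct prime.
trait_names = [f"S{i}" for i in range(1, 35)]
prime_of = dict(zip(trait_names, first_primes(34)))

def signature(traits):
    """Product of the primes for the given traits; independent of order."""
    return prod(prime_of[t] for t in traits)

def four_signatures(traits):
    """The five products obtained by leaving out one trait at a time."""
    return {signature(subset) for subset in combinations(traits, 4)}

# Hypothetical sample data.
people = {
    "Billy": ["S1", "S8", "S9", "S11", "S26"],
    "Sally": ["S26", "S11", "S9", "S8", "S1"],   # same traits, other order
    "Joe":   ["S1", "S8", "S13", "S11", "S26"],  # shares four traits with Billy
}

# Group people by their 5-trait product to find exact matches.
by_product = defaultdict(list)
for name, traits in people.items():
    by_product[signature(traits)].append(name)

# Two people match on at least four traits when their 4-trait products overlap.
shared = four_signatures(people["Billy"]) & four_signatures(people["Joe"])
```

The 34th prime is 139, and the largest possible product of five distinct primes from this list is about 3.6 × 10^10, so the values stay small enough to be stored exactly as ordinary integers.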
edited Apr 2 at 1:18
Stephen Rauch♦
answered Apr 1 at 18:55
QuantifiedMe
I like this approach for its simplicity and how easy it is to apply.
– wackojacko1997
Apr 3 at 20:21
You can try k-modes or ROCK, which are designed specifically for categorical data. I don't have experience with them myself, but implementations exist for both:
- K-Modes
- ROCK
answered Mar 28 at 17:24
Simon Larsson
If I were you, I would approach this as an Association Mining problem. You most likely will have to pre-process your data for this type of analysis, but it shouldn't be too difficult.
Here is an example in R
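To illustrate what treating this as an association mining problem could look like, here is a small Python sketch that treats each employee's five strengths as one transaction and counts how many transactions contain each itemset. It is a brute-force enumeration rather than Apriori or any particular library's API; the `transactions` data, the function name `frequent_itemsets`, and the `min_support` threshold of 2 are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: one set of five strengths per employee.
transactions = [
    {"A", "H", "I", "K", "Z"},
    {"A", "H", "I", "K", "Z"},
    {"A", "H", "M", "K", "Z"},
    {"A", "B", "L", "S", "W"},
]

def frequent_itemsets(rows, min_support=2, max_size=5):
    """Count every itemset of up to max_size items and keep the frequent ones.

    Each row has only five items, so it contributes at most 31 itemsets and
    plain enumeration is cheap; no Apriori-style pruning is attempted.
    """
    counts = Counter()
    for row in rows:
        items = sorted(row)  # sorted so equal itemsets produce equal tuples
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

frequent = frequent_itemsets(transactions)

# Itemsets of size 5 are exact matches; size 4 covers the 4-out-of-5 case.
for itemset, support in sorted(frequent.items(), key=lambda kv: (-len(kv[0]), kv[0])):
    if len(itemset) >= 4:
        print(itemset, support)
```

Smaller itemsets in `frequent` additionally show which pairs or triples of strengths tend to occur together, which goes beyond the exact grouping asked for in the question.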
answered Mar 28 at 20:45
Rajat S. Subedi