How do I identify clusters that match on categorical data?


I am looking for direction on how to research a solution to this problem:

My company had all of our employees take a "StrengthsFinder" test, which assigns every employee their top five (ordered) "strengths" from a list of 34 possible strengths. We have 500 employees. I am supposed to identify all the employees who share the same 5 strengths (order not important), and also the employees who match each other on 4 out of 5 strengths (again, order doesn't matter). I could potentially have multiple groups matching on different sets of strengths, e.g.:
Group 1: Billy, Sally, Michael have strengths A, H, I, K, Z
Group 2: Bobby and Suzy have strengths A, B, L, S, W

For the case where strengths match on 4 out of 5, I might have the same people from Group 1 above, plus Joe, whose strengths are A, H, M, K, Z, and Seth, whose strengths are A, H, G, K, Z. I would expect more groupings for the 4-out-of-5 case than for the 5-out-of-5 case.

The strengths are categorical in nature, but what I've read so far has largely revolved around clustering of continuous numerical variables.

I am looking for an algorithmic way to identify clusters and the members of those clusters in this situation. I think I could do this by brute force, repeatedly sorting the data in Excel, but I'm confident that a better way must exist, and I ask you to point me in that direction. Thank you.

asked Mar 28 at 16:10

wackojacko1997


clustering categorical-data


4 Answers

You have just 500 data points...

Excel of course is the worst possible tool though.

Anyway, build a dictionary. Put everybody in there 6 times: once keyed by all five strengths, and five times keyed by four strengths, with a different strength omitted each time. Then you can easily identify the largest groups, and you can also perform various completion operations easily: once you have identified a group with strengths A B C D E, you can use the dictionary to add everyone who shares ABCD, ABCE, and so on.
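The dictionary idea above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `employees` sample data and the `build_index` helper are hypothetical, and sorted tuples are used as keys so that the order of the strengths does not matter.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: name -> top five strengths (order is ignored).
employees = {
    "Billy":   ["A", "H", "I", "K", "Z"],
    "Sally":   ["Z", "K", "I", "H", "A"],
    "Michael": ["H", "A", "K", "I", "Z"],
    "Joe":     ["A", "H", "M", "K", "Z"],
    "Seth":    ["A", "H", "G", "K", "Z"],
    "Bobby":   ["A", "B", "L", "S", "W"],
    "Suzy":    ["W", "S", "L", "B", "A"],
}

def build_index(people):
    """Map each 5-set and each of its 4-subsets to the people who have it."""
    index = defaultdict(set)
    for name, strengths in people.items():
        full = tuple(sorted(strengths))        # canonical, order-free key
        index[full].add(name)                  # 1 entry with all five
        for subset in combinations(full, 4):   # 5 entries, one strength omitted
            index[subset].add(name)
    return index

index = build_index(employees)

# Keep only keys shared by two or more people, split by key size.
exact = {k: v for k, v in index.items() if len(k) == 5 and len(v) > 1}
four_of_five = {k: v for k, v in index.items() if len(k) == 4 and len(v) > 1}
```

With 500 employees this creates at most 3,000 dictionary entries, so the whole index is built in a single pass over the data.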

answered Mar 28 at 19:30

Anony-Mousse


@wackojacko1997 I think this is the solution. Note that if each key is a string, the strengths need to be sorted alphabetically so that ABCD and CABD land in the same group.

– Esmailian
Mar 30 at 22:25

While I need to think about the coding with the dictionary a little bit, this answer does make sense to me. When I look at @QuantifiedMe's answer (which I perceive as essentially the same thing, but using prime numbers), I think I can use that even without coding (directly in Excel). I'm inclined to mark this as the answer, though, as the more general approach.

– wackojacko1997
Apr 3 at 20:20

Yes, he is suggesting the same thing, using a prime-factor coding instead of a string coding.

– Anony-Mousse
Apr 3 at 21:16

Okay, thank you. I am accepting this answer as the general case then.

– wackojacko1997
Apr 4 at 1:18

Assign each of the 34 traits a unique prime number.

Compute the product of the 5 prime numbers of each person.

Compare every person's value to find a match.

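To find 4 matching traits out of 5, make the product from 4 of the 5 traits. Each person has 5 such combinations; numbering that person's traits 1 to 5, they are 1*2*3*4, 1*2*3*5, 1*2*4*5, 1*3*4*5, and 2*3*4*5 (these are trait positions, not the primes themselves). Compare these values across people to find the 4-out-of-5 matches.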
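As a rough sketch of the prime-product coding described above, assuming hypothetical trait labels `T1` to `T34` and a small made-up `people` table: each person gets one product for all five traits, plus five products with one trait divided out.

```python
from collections import defaultdict
from math import prod

def first_primes(n):
    """Return the first n primes by trial division (fine for n = 34)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

# Hypothetical trait labels T1..T34; the real strength names would go here.
traits = [f"T{i}" for i in range(1, 35)]
prime_of = dict(zip(traits, first_primes(34)))

def codes(strengths):
    """Return (product of all 5 primes, set of the five 4-prime products)."""
    ps = [prime_of[s] for s in strengths]
    full = prod(ps)
    return full, {full // p for p in ps}   # dropping a trait = dividing it out

# Hypothetical sample data.
people = {
    "Billy": ["T1", "T8", "T9", "T11", "T26"],
    "Sally": ["T26", "T11", "T9", "T8", "T1"],
    "Joe":   ["T1", "T8", "T13", "T11", "T26"],
}

# Group names by code; equal codes mean equal trait sets.
by_code = defaultdict(set)
for name, strengths in people.items():
    full, partial = codes(strengths)
    by_code[full].add(name)
    for code in partial:
        by_code[code].add(name)
```

Because prime factorizations are unique, two products are equal only when the underlying trait sets are equal, and a 4-trait product can never collide with a 5-trait product. Multiplication is also order-independent, so no sorting step is needed.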

edited Apr 2 at 1:18

Stephen Rauch♦


answered Apr 1 at 18:55

QuantifiedMe


I like this approach for the simplicity and the ease of employing it.

– wackojacko1997
Apr 3 at 20:21

You can try k-modes or ROCK, both of which are designed specifically for categorical values. I don't have experience with them myself, but you can look at these implementations:

K-Modes

ROCK
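To give a feel for how a categorical method such as k-modes compares records, here is a small sketch of the simple matching dissimilarity it is usually described as using, applied to a binary encoding of the strengths. This is not an implementation of k-modes or ROCK; the `ALL_TRAITS` labels and helper names are assumptions for illustration, and a real run would list all 34 strengths.

```python
# Hypothetical trait labels A..Z; a real run would list all 34 strengths.
ALL_TRAITS = [chr(ord("A") + i) for i in range(26)]

def one_hot(strengths):
    """Encode a set of strengths as one binary attribute per trait."""
    chosen = set(strengths)
    return [int(t in chosen) for t in ALL_TRAITS]

def matching_dissimilarity(x, y):
    """Count the attributes on which two records disagree."""
    return sum(a != b for a, b in zip(x, y))

billy = one_hot("AHIKZ")
sally = one_hot("ZKIHA")   # same strengths, different order
joe = one_hot("AHMKZ")     # shares 4 of 5 strengths with Billy

d_same = matching_dissimilarity(billy, sally)
d_four = matching_dissimilarity(billy, joe)
```

Under this encoding, two people who share k of their 5 strengths differ in 10 - 2k attributes, so an exact match has dissimilarity 0 and a 4-out-of-5 match has dissimilarity 2. A clustering algorithm would still need its number of clusters chosen, and it is not guaranteed to return exactly the matching groups asked for in the question.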

answered Mar 28 at 17:24

Simon Larsson


If I were you, I would approach this as an association mining problem. You will most likely have to pre-process your data for this type of analysis, but that shouldn't be too difficult.

Here is an example in R
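The R example this answer pointed to is not reproduced here. As a stand-in, the following is a minimal Python sketch of the underlying idea: treat each employee as a "basket" of five strengths and count how many baskets contain each itemset of size 4 or 5. The `baskets` data and the `frequent_itemsets` helper are hypothetical, and this brute-force support count is not Apriori or any particular library's algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: one "basket" of five strengths per employee.
baskets = [
    {"A", "H", "I", "K", "Z"},
    {"A", "H", "I", "K", "Z"},
    {"A", "H", "M", "K", "Z"},
    {"A", "B", "L", "S", "W"},
]

def frequent_itemsets(transactions, size, min_support=2):
    """Return itemsets of the given size found in >= min_support baskets."""
    counts = Counter(
        itemset
        for basket in transactions
        for itemset in combinations(sorted(basket), size)
    )
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

fours = frequent_itemsets(baskets, 4)   # 4-out-of-5 matches
fives = frequent_itemsets(baskets, 5)   # exact matches
```

The pre-processing mentioned in the answer amounts to turning each employee's ranked list into an unordered set like the ones in `baskets`; the support count of an itemset is then the size of the matching group.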

answered Mar 28 at 20:45

Rajat S. Subedi


