Perform k-means clustering over multiple columns
I am trying to perform k-means clustering on multiple columns. My data set is composed of 4 numerical columns and 1 categorical column. I have researched previous questions, but the answers were not satisfactory.

I know how to run the algorithm on two columns, but I am finding it difficult to apply the same approach to 4 numerical columns.

I am not interested in visualizing the data for now, but in having the cluster labels displayed in the table: each row should be assigned to a cluster (for example, the first row belongs to cluster 2, and so on). That is exactly what I need to achieve, but using 4 numerical columns.

Do you have any idea how to achieve this? Any help would be greatly appreciated.
python clustering pandas
asked Apr 5 at 13:20 by Lina
Comment (Anony-Mousse, Apr 5 at 21:16): Note that the age attribute is effectively ignored; you get the same result using only income, because the data is not appropriately prepared for this analysis.
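The comment above is about feature scaling: k-means uses Euclidean distance, so a feature with a large numeric range (such as income) dominates one with a small range (such as age) unless the columns are standardized first. A minimal sketch, with made-up column names and values for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: income spans a far larger range than age,
# so unscaled k-means would effectively cluster on income alone.
df = pd.DataFrame({'age': [23, 45, 31, 52, 28, 60],
                   'income': [30000, 90000, 42000, 110000, 35000, 95000]})

# Standardize each column to zero mean and unit variance so both
# features contribute comparably to the Euclidean distance.
scaled = StandardScaler().fit_transform(df)

df['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(df)
```

The labels are attached to the original, unscaled frame, so the table stays readable while the clustering itself runs on the standardized values.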
2 Answers
There is no difference in methodology between 2 and 4 columns. If you have issues, they are probably due to the contents of your columns: k-means expects numerical columns with no null or infinite values, and categorical data should be avoided (or encoded first). Here I do it with 4 numerical features:

import pandas as pd
from sklearn.datasets import make_blobs  # in scikit-learn < 0.24 this lived in sklearn.datasets.samples_generator
from sklearn.cluster import KMeans

# Generate 10 samples with 4 features drawn from 3 blobs
X, _ = make_blobs(n_samples=10, centers=3, n_features=4)
df = pd.DataFrame(X, columns=['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4'])

# Fit k-means on all four columns and attach the labels to the frame
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])
df['Cluster'] = y
print(df.head())

Which outputs:

     Feat_1    Feat_2    Feat_3    Feat_4  Cluster
0  0.005875  4.387241 -1.093308  8.213623        2
1  8.763603 -2.769244  4.581705  1.355389        1
2 -0.296613  4.120262 -1.635583  7.533157        2
3 -1.576720  4.957406  2.919704  0.155499        0
4  2.470349  4.098629  2.368335  0.043568        0
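Since the question also mentions one categorical column: k-means cannot consume it directly, but a common workaround is to one-hot encode it before clustering (whether that is statistically sensible depends on the data; k-prototypes is an alternative for mixed types). A minimal sketch with hypothetical column names:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frame: four numeric features plus one categorical column.
df = pd.DataFrame({
    'f1': [1.0, 2.0, 8.0, 9.0],
    'f2': [1.5, 2.5, 8.5, 9.5],
    'f3': [0.1, 0.2, 0.8, 0.9],
    'f4': [5.0, 6.0, 1.0, 2.0],
    'color': ['red', 'red', 'blue', 'blue'],
})

# One-hot encode the categorical column, then cluster on all resulting columns.
encoded = pd.get_dummies(df, columns=['color'])
df['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(encoded)
print(df)
```

Note that one-hot columns take values 0/1, so after scaling they may carry more or less weight than intended; treat this as a starting point rather than a universally correct recipe.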
answered Apr 5 at 13:56 by Simon Larsson (edited Apr 5 at 14:12)

Comment (Lina, Apr 5 at 14:06): Thanks for your help Simon!
Let's take as an example the Breast Cancer dataset from the UCI Machine Learning Repository. This is how it looks:
>> _data.head(5)
Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635
MCP.1 Classification
0 417.114 1
1 468.786 1
2 554.697 1
3 928.220 1
4 773.920 1
As you can see, all the columns are numerical. Let's now see how we can cluster the dataset with k-means. We don't need the Classification label; note that the `[:-2]` slice below drops the last two columns, so MCP.1 is excluded along with the label.
### Get all the features columns except the class
features = list(_data.columns)[:-2]
### Get the features data
data = _data[features]
Now, perform the actual clustering, simple as that:

### precompute_distances and n_jobs were removed in scikit-learn 1.0; use plain KMeans(n_clusters=2) there
clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
### this may raise a SettingWithCopyWarning; data = _data[features].copy() above avoids it
data['clusters'] = clustering_kmeans.fit_predict(data)
There is no difference at all between 2 or more features; I just pass the DataFrame with all my numeric columns.
Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635
cluster
0 0
1 0
2 0
3 0
4 0
How can you visualize the clustering now? You cannot do it directly if you have more than 3 columns. However, you can apply Principal Component Analysis to reduce the space to 2 dimensions and visualize that instead.
### Run PCA on the data and reduce the dimensions to pca_num_components dimensions
pca_num_components = 2  # must be defined before use (2 so the result can be plotted)
reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
results = pd.DataFrame(reduced_data, columns=['pca1', 'pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
Here are the imports I used
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
And this is the visualization: a scatter plot of the two principal components, colored by cluster.
answered Apr 6 at 12:50 by Tasos
Comment (Lina, Apr 6 at 13:34): Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for pandas to display just the first 9 columns when computing the clustering?

Comment (Tasos, Apr 6 at 13:44): The dataset I used has just 9 columns, so that's all of it.

Comment (Lina, Apr 6 at 13:56): Pardon me, I meant rows, not columns.

Comment (Tasos, Apr 6 at 14:04): I am not sure what you mean. When you cluster, the whole dataset is used, not just the first X rows. Maybe you can elaborate more?

Comment (Lina, Apr 6 at 14:39): Sure. In my data set I have 4 columns composed of 64 rows each. Once clustered, I expect a result of 64 rows instead of just 9. Is that how it works, or do I need to add a couple of lines of code to display the whole data frame?
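Regarding the last comment: clustering always assigns a label to every row; pandas merely truncates the printed output by default. A hypothetical sketch showing that all 64 rows receive a label, and how to print the whole frame:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical 64-row, 4-column data set like the one in the comment.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(64, 4)), columns=['a', 'b', 'c', 'd'])

df['cluster'] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)
print(len(df))  # 64: every row got a cluster label

# pandas truncates long frames when printing; lift the limit to see all rows.
with pd.option_context('display.max_rows', None):
    print(df)
```

`pd.set_option('display.max_rows', None)` does the same thing globally, while the context manager above restores the default afterwards.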