Perform k-means clustering over multiple columns Unicorn Meta Zoo #1: Why another podcast? Announcing the arrival of Valued Associate #679: Cesar Manara 2019 Moderator Election Q&A - Questionnaire 2019 Community Moderator Election ResultsHandling covariance matrices of changing sizeNumerical data and different algorithmsCan I apply Clustering algorithms to the result of Manifold Visualization Methods?Clusering based on categorical variables?How to cluster multiple time-series from one data frameUnsupervised Classification for documentsClustering with multiple distance measuresclustering 2-dimensional euclidean vectors - appropriate dissimilarity measureNLP - How to perform semantic analysis?How to calculate a weighted Hierarchical clustering in Orange

Why did Israel vote against lifting the American embargo on Cuba?

yticklabels on the right side of yaxis

Putting Ant-Man on house arrest

Where to find documentation for `whois` command options?

Israeli soda type drink

What's parked in Mil Moscow helicopter plant?

What is a good proxy for government quality?

What is the evidence that custom checks in Northern Ireland are going to result in violence?

Does Prince Arnaud cause someone holding the Princess to lose?

What is good way to write CSS for multiple borders?

How would it unbalance gameplay to rule that Weapon Master allows for picking a fighting style?

Where can I find how to tex symbols for different fonts?

Why do people think Winterfell crypts is the safest place for women, children and old people?

Array Dynamic resize in heap

Why does the Cisco show run command not show the full version, while the show version command does?

Are `mathfont` and `mathspec` intended for same purpose?

Bright yellow or light yellow?

Can I criticise the more senior developers around me for not writing clean code?

Errors in solving coupled pdes

Writing a T-SQL stored procedure to receive 4 numbers and insert them into a table

What *exactly* is electrical current, voltage, and resistance?

Why isn't everyone flabbergasted about Bran's "gift"?

Could a cockatrice have parasitic embryos?

When speaking, how do you change your mind mid-sentence?



Perform k-means clustering over multiple columns



Unicorn Meta Zoo #1: Why another podcast?
Announcing the arrival of Valued Associate #679: Cesar Manara
2019 Moderator Election Q&A - Questionnaire
2019 Community Moderator Election ResultsHandling covariance matrices of changing sizeNumerical data and different algorithmsCan I apply Clustering algorithms to the result of Manifold Visualization Methods?Clusering based on categorical variables?How to cluster multiple time-series from one data frameUnsupervised Classification for documentsClustering with multiple distance measuresclustering 2-dimensional euclidean vectors - appropriate dissimilarity measureNLP - How to perform semantic analysis?How to calculate a weighted Hierarchical clustering in Orange










0












$begingroup$


I am trying to perform k-means clustering on multiple columns. My data set is composed of 4 numerical columns and 1 categorical column. I already researched previous questions but the answers are not satisfactory.



I know how to perform the algorithm on two columns, but I'm finding it quite difficult to apply the same algorithm on 4 numerical columns.



I am not really interested in visualizing the data for now, but in having the clusters displayed in the table.The picture shows that the first row belongs to cluster number 2, and so on. That is exactly what I need to achieve, but using 4 numerical columns, therefore each row must belong to a certain cluster.



Do you have any idea on how to achieve this? Any idea would be of great help. Thanks in advance! :enter image description here










share|improve this question









$endgroup$











  • $begingroup$
    Note that the age attribute is effectively ignored. You get the same result using only income. Because the data is not appropriately prepared for this analysis.
    $endgroup$
    – Anony-Mousse
    Apr 5 at 21:16















0












$begingroup$


I am trying to perform k-means clustering on multiple columns. My data set is composed of 4 numerical columns and 1 categorical column. I already researched previous questions but the answers are not satisfactory.



I know how to perform the algorithm on two columns, but I'm finding it quite difficult to apply the same algorithm on 4 numerical columns.



I am not really interested in visualizing the data for now, but in having the clusters displayed in the table.The picture shows that the first row belongs to cluster number 2, and so on. That is exactly what I need to achieve, but using 4 numerical columns, therefore each row must belong to a certain cluster.



Do you have any idea on how to achieve this? Any idea would be of great help. Thanks in advance! :enter image description here










share|improve this question









$endgroup$











  • $begingroup$
    Note that the age attribute is effectively ignored. You get the same result using only income. Because the data is not appropriately prepared for this analysis.
    $endgroup$
    – Anony-Mousse
    Apr 5 at 21:16













0












0








0





$begingroup$


I am trying to perform k-means clustering on multiple columns. My data set is composed of 4 numerical columns and 1 categorical column. I already researched previous questions but the answers are not satisfactory.



I know how to perform the algorithm on two columns, but I'm finding it quite difficult to apply the same algorithm on 4 numerical columns.



I am not really interested in visualizing the data for now, but in having the clusters displayed in the table.The picture shows that the first row belongs to cluster number 2, and so on. That is exactly what I need to achieve, but using 4 numerical columns, therefore each row must belong to a certain cluster.



Do you have any idea on how to achieve this? Any idea would be of great help. Thanks in advance! :enter image description here










share|improve this question









$endgroup$




I am trying to perform k-means clustering on multiple columns. My data set is composed of 4 numerical columns and 1 categorical column. I already researched previous questions but the answers are not satisfactory.



I know how to perform the algorithm on two columns, but I'm finding it quite difficult to apply the same algorithm on 4 numerical columns.



I am not really interested in visualizing the data for now, but in having the clusters displayed in the table.The picture shows that the first row belongs to cluster number 2, and so on. That is exactly what I need to achieve, but using 4 numerical columns, therefore each row must belong to a certain cluster.



Do you have any idea on how to achieve this? Any idea would be of great help. Thanks in advance! :enter image description here







python clustering pandas






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Apr 5 at 13:20









LinaLina

6




6











  • $begingroup$
    Note that the age attribute is effectively ignored. You get the same result using only income. Because the data is not appropriately prepared for this analysis.
    $endgroup$
    – Anony-Mousse
    Apr 5 at 21:16
















  • $begingroup$
    Note that the age attribute is effectively ignored. You get the same result using only income. Because the data is not appropriately prepared for this analysis.
    $endgroup$
    – Anony-Mousse
    Apr 5 at 21:16















$begingroup$
Note that the age attribute is effectively ignored. You get the same result using only income. Because the data is not appropriately prepared for this analysis.
$endgroup$
– Anony-Mousse
Apr 5 at 21:16




$begingroup$
Note that the age attribute is effectively ignored. You get the same result using only income. Because the data is not appropriately prepared for this analysis.
$endgroup$
– Anony-Mousse
Apr 5 at 21:16










2 Answers
2






active

oldest

votes


















2












$begingroup$

There is no difference in methodology between 2 and 4 columns. If you have issues then they are probably due to the contents of your columns. K-Means wants numerical columns, with no null/infinite values and avoid categorical data. Here I do it with 4 numerical features:



import pandas as pd
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10, centers=3, n_features=4)

df = pd.DataFrame(X, columns=['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4'])

kmeans = KMeans(n_clusters=3)

y = kmeans.fit_predict(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])

df['Cluster'] = y

print(df.head())


Which outputs:



 Feat_1 Feat_2 Feat_3 Feat_4 Cluster
0 0.005875 4.387241 -1.093308 8.213623 2
1 8.763603 -2.769244 4.581705 1.355389 1
2 -0.296613 4.120262 -1.635583 7.533157 2
3 -1.576720 4.957406 2.919704 0.155499 0
4 2.470349 4.098629 2.368335 0.043568 0





share|improve this answer











$endgroup$












  • $begingroup$
    Thanks for your help Simon!
    $endgroup$
    – Lina
    Apr 5 at 14:06


















0












$begingroup$

Let's take as an example the Breast Cancer Dataset from the UCI Machine Learning.



This is how it looks



>> _data.head(5)

Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

MCP.1 Classification
0 417.114 1
1 468.786 1
2 554.697 1
3 928.220 1
4 773.920 1


As you can see, all the columns are numerical. Let's see now, how we can cluster the dataset with K-Means. We don't need the last column which is the Label.



### Get all the features columns except the class
features = list(_data.columns)[:-2]

### Get the features data
data = _data[features]


Now, perform the actual Clustering, simple as that.



clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)


There is no difference at all with 2 or more features. I just pass the Dataframe with all my numeric columns.



 Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin 
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

cluster
0 0
1 0
2 0
3 0
4 0


How you can visualize the clustering now? Well, you cannot do it directly if you have more than 3 columns. However, you can apply a Principal Component Analysis to reduce the space in 2 columns and visualize this instead.



### Run PCA on the data and reduce the dimensions in pca_num_components dimensions
reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])

sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()


Here are the imports I used



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


And this is the visualization



K-Means after PCA






share|improve this answer









$endgroup$












  • $begingroup$
    Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for Pandas to display just the first 9 columns when computing the clustering?
    $endgroup$
    – Lina
    Apr 6 at 13:34










  • $begingroup$
    The dataset I used has just 9 columns. So, that's all of it.
    $endgroup$
    – Tasos
    Apr 6 at 13:44










  • $begingroup$
    Pardon me, I meant rows, not columns.
    $endgroup$
    – Lina
    Apr 6 at 13:56










  • $begingroup$
    I am not sure what you mean. When you cluster, the whole dataset is used. Not just the first X rows. Maybe you can elaborate more?
    $endgroup$
    – Tasos
    Apr 6 at 14:04










  • $begingroup$
    Sure. In my data set I have 4 columns composed of 64 rows each. Once I clustered, I expect a result of 64 rows, instead of just 9. I'm wondering if that is how it works or maybe I need to add a couple of lines of code to display the whole data frame?
    $endgroup$
    – Lina
    Apr 6 at 14:39











Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48693%2fperform-k-means-clustering-over-multiple-columns%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









2












$begingroup$

There is no difference in methodology between 2 and 4 columns. If you have issues then they are probably due to the contents of your columns. K-Means wants numerical columns, with no null/infinite values and avoid categorical data. Here I do it with 4 numerical features:



import pandas as pd
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10, centers=3, n_features=4)

df = pd.DataFrame(X, columns=['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4'])

kmeans = KMeans(n_clusters=3)

y = kmeans.fit_predict(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])

df['Cluster'] = y

print(df.head())


Which outputs:



 Feat_1 Feat_2 Feat_3 Feat_4 Cluster
0 0.005875 4.387241 -1.093308 8.213623 2
1 8.763603 -2.769244 4.581705 1.355389 1
2 -0.296613 4.120262 -1.635583 7.533157 2
3 -1.576720 4.957406 2.919704 0.155499 0
4 2.470349 4.098629 2.368335 0.043568 0





share|improve this answer











$endgroup$












  • $begingroup$
    Thanks for your help Simon!
    $endgroup$
    – Lina
    Apr 5 at 14:06















2












$begingroup$

There is no difference in methodology between 2 and 4 columns. If you have issues then they are probably due to the contents of your columns. K-Means wants numerical columns, with no null/infinite values and avoid categorical data. Here I do it with 4 numerical features:



import pandas as pd
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10, centers=3, n_features=4)

df = pd.DataFrame(X, columns=['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4'])

kmeans = KMeans(n_clusters=3)

y = kmeans.fit_predict(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])

df['Cluster'] = y

print(df.head())


Which outputs:



 Feat_1 Feat_2 Feat_3 Feat_4 Cluster
0 0.005875 4.387241 -1.093308 8.213623 2
1 8.763603 -2.769244 4.581705 1.355389 1
2 -0.296613 4.120262 -1.635583 7.533157 2
3 -1.576720 4.957406 2.919704 0.155499 0
4 2.470349 4.098629 2.368335 0.043568 0





share|improve this answer











$endgroup$












  • $begingroup$
    Thanks for your help Simon!
    $endgroup$
    – Lina
    Apr 5 at 14:06













2












2








2





$begingroup$

There is no difference in methodology between 2 and 4 columns. If you have issues then they are probably due to the contents of your columns. K-Means wants numerical columns, with no null/infinite values and avoid categorical data. Here I do it with 4 numerical features:



import pandas as pd
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10, centers=3, n_features=4)

df = pd.DataFrame(X, columns=['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4'])

kmeans = KMeans(n_clusters=3)

y = kmeans.fit_predict(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])

df['Cluster'] = y

print(df.head())


Which outputs:



 Feat_1 Feat_2 Feat_3 Feat_4 Cluster
0 0.005875 4.387241 -1.093308 8.213623 2
1 8.763603 -2.769244 4.581705 1.355389 1
2 -0.296613 4.120262 -1.635583 7.533157 2
3 -1.576720 4.957406 2.919704 0.155499 0
4 2.470349 4.098629 2.368335 0.043568 0





share|improve this answer











$endgroup$



There is no difference in methodology between 2 and 4 columns. If you have issues then they are probably due to the contents of your columns. K-Means wants numerical columns, with no null/infinite values and avoid categorical data. Here I do it with 4 numerical features:



import pandas as pd
from sklearn.datasets.samples_generator import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=10, centers=3, n_features=4)

df = pd.DataFrame(X, columns=['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4'])

kmeans = KMeans(n_clusters=3)

y = kmeans.fit_predict(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])

df['Cluster'] = y

print(df.head())


Which outputs:



 Feat_1 Feat_2 Feat_3 Feat_4 Cluster
0 0.005875 4.387241 -1.093308 8.213623 2
1 8.763603 -2.769244 4.581705 1.355389 1
2 -0.296613 4.120262 -1.635583 7.533157 2
3 -1.576720 4.957406 2.919704 0.155499 0
4 2.470349 4.098629 2.368335 0.043568 0






share|improve this answer














share|improve this answer



share|improve this answer








edited Apr 5 at 14:12

























answered Apr 5 at 13:56









Simon LarssonSimon Larsson

1,130215




1,130215











  • $begingroup$
    Thanks for your help Simon!
    $endgroup$
    – Lina
    Apr 5 at 14:06
















  • $begingroup$
    Thanks for your help Simon!
    $endgroup$
    – Lina
    Apr 5 at 14:06















$begingroup$
Thanks for your help Simon!
$endgroup$
– Lina
Apr 5 at 14:06




$begingroup$
Thanks for your help Simon!
$endgroup$
– Lina
Apr 5 at 14:06











0












$begingroup$

Let's take as an example the Breast Cancer Dataset from the UCI Machine Learning.



This is how it looks



>> _data.head(5)

Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

MCP.1 Classification
0 417.114 1
1 468.786 1
2 554.697 1
3 928.220 1
4 773.920 1


As you can see, all the columns are numerical. Let's see now, how we can cluster the dataset with K-Means. We don't need the last column which is the Label.



### Get all the features columns except the class
features = list(_data.columns)[:-2]

### Get the features data
data = _data[features]


Now, perform the actual Clustering, simple as that.



clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)


There is no difference at all with 2 or more features. I just pass the Dataframe with all my numeric columns.



 Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin 
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

cluster
0 0
1 0
2 0
3 0
4 0


How you can visualize the clustering now? Well, you cannot do it directly if you have more than 3 columns. However, you can apply a Principal Component Analysis to reduce the space in 2 columns and visualize this instead.



### Run PCA on the data and reduce the dimensions in pca_num_components dimensions
reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])

sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()


Here are the imports I used



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


And this is the visualization



K-Means after PCA






share|improve this answer









$endgroup$












  • $begingroup$
    Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for Pandas to display just the first 9 columns when computing the clustering?
    $endgroup$
    – Lina
    Apr 6 at 13:34










  • $begingroup$
    The dataset I used has just 9 columns. So, that's all of it.
    $endgroup$
    – Tasos
    Apr 6 at 13:44










  • $begingroup$
    Pardon me, I meant rows, not columns.
    $endgroup$
    – Lina
    Apr 6 at 13:56










  • $begingroup$
    I am not sure what you mean. When you cluster, the whole dataset is used. Not just the first X rows. Maybe you can elaborate more?
    $endgroup$
    – Tasos
    Apr 6 at 14:04










  • $begingroup$
    Sure. In my data set I have 4 columns composed of 64 rows each. Once I clustered, I expect a result of 64 rows, instead of just 9. I'm wondering if that is how it works or maybe I need to add a couple of lines of code to display the whole data frame?
    $endgroup$
    – Lina
    Apr 6 at 14:39















0












$begingroup$

Let's take as an example the Breast Cancer Dataset from the UCI Machine Learning.



This is how it looks



>> _data.head(5)

Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

MCP.1 Classification
0 417.114 1
1 468.786 1
2 554.697 1
3 928.220 1
4 773.920 1


As you can see, all the columns are numerical. Let's see now, how we can cluster the dataset with K-Means. We don't need the last column which is the Label.



### Get all the features columns except the class
features = list(_data.columns)[:-2]

### Get the features data
data = _data[features]


Now, perform the actual Clustering, simple as that.



clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)


There is no difference at all with 2 or more features. I just pass the Dataframe with all my numeric columns.



 Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin 
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

cluster
0 0
1 0
2 0
3 0
4 0


How you can visualize the clustering now? Well, you cannot do it directly if you have more than 3 columns. However, you can apply a Principal Component Analysis to reduce the space in 2 columns and visualize this instead.



### Run PCA on the data and reduce the dimensions in pca_num_components dimensions
reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])

sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()


Here are the imports I used



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


And this is the visualization



K-Means after PCA






share|improve this answer









$endgroup$












  • $begingroup$
    Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for Pandas to display just the first 9 columns when computing the clustering?
    $endgroup$
    – Lina
    Apr 6 at 13:34










  • $begingroup$
    The dataset I used has just 9 columns. So, that's all of it.
    $endgroup$
    – Tasos
    Apr 6 at 13:44










  • $begingroup$
    Pardon me, I meant rows, not columns.
    $endgroup$
    – Lina
    Apr 6 at 13:56










  • $begingroup$
    I am not sure what you mean. When you cluster, the whole dataset is used. Not just the first X rows. Maybe you can elaborate more?
    $endgroup$
    – Tasos
    Apr 6 at 14:04










  • $begingroup$
    Sure. In my data set I have 4 columns composed of 64 rows each. Once I clustered, I expect a result of 64 rows, instead of just 9. I'm wondering if that is how it works or maybe I need to add a couple of lines of code to display the whole data frame?
    $endgroup$
    – Lina
    Apr 6 at 14:39













0












0








0





$begingroup$

Let's take as an example the Breast Cancer Dataset from the UCI Machine Learning.



This is how it looks



>> _data.head(5)

Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

MCP.1 Classification
0 417.114 1
1 468.786 1
2 554.697 1
3 928.220 1
4 773.920 1


As you can see, all the columns are numerical. Let's see now, how we can cluster the dataset with K-Means. We don't need the last column which is the Label.



### Get all the features columns except the class
features = list(_data.columns)[:-2]

### Get the features data
data = _data[features]


Now, perform the actual Clustering, simple as that.



clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)


There is no difference at all with 2 or more features. I just pass the Dataframe with all my numeric columns.



 Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin 
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

cluster
0 0
1 0
2 0
3 0
4 0


How you can visualize the clustering now? Well, you cannot do it directly if you have more than 3 columns. However, you can apply a Principal Component Analysis to reduce the space in 2 columns and visualize this instead.



### Run PCA on the data and reduce the dimensions in pca_num_components dimensions
reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])

sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()


Here are the imports I used



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


And this is the visualization



K-Means after PCA






share|improve this answer









$endgroup$



Let's take as an example the Breast Cancer Dataset from the UCI Machine Learning.



This is how it looks



>> _data.head(5)

Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

MCP.1 Classification
0 417.114 1
1 468.786 1
2 554.697 1
3 928.220 1
4 773.920 1


As you can see, all the columns are numerical. Let's see now, how we can cluster the dataset with K-Means. We don't need the last column which is the Label.



### Get all the features columns except the class
features = list(_data.columns)[:-2]

### Get the features data
data = _data[features]


Now, perform the actual Clustering, simple as that.



clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
data['clusters'] = clustering_kmeans.fit_predict(data)


There is no difference at all with 2 or more features. I just pass the Dataframe with all my numeric columns.



 Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin 
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635

cluster
0 0
1 0
2 0
3 0
4 0


How you can visualize the clustering now? Well, you cannot do it directly if you have more than 3 columns. However, you can apply a Principal Component Analysis to reduce the space in 2 columns and visualize this instead.



### Run PCA on the data and reduce the dimensions in pca_num_components dimensions
reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
results = pd.DataFrame(reduced_data,columns=['pca1','pca2'])

sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()


Here are the imports I used



import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


And this is the visualization



K-Means after PCA







share|improve this answer












share|improve this answer



share|improve this answer










answered Apr 6 at 12:50









TasosTasos

1,62011138




1,62011138











  • $begingroup$
    Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for Pandas to display just the first 9 columns when computing the clustering?
    $endgroup$
    – Lina
    Apr 6 at 13:34










  • $begingroup$
    The dataset I used has just 9 columns. So, that's all of it.
    $endgroup$
    – Tasos
    Apr 6 at 13:44










  • $begingroup$
    Pardon me, I meant rows, not columns.
    $endgroup$
    – Lina
    Apr 6 at 13:56










  • $begingroup$
    I am not sure what you mean. When you cluster, the whole dataset is used. Not just the first X rows. Maybe you can elaborate more?
    $endgroup$
    – Tasos
    Apr 6 at 14:04










  • $begingroup$
    Sure. In my data set I have 4 columns composed of 64 rows each. Once I clustered, I expect a result of 64 rows, instead of just 9. I'm wondering if that is how it works or maybe I need to add a couple of lines of code to display the whole data frame?
    $endgroup$
    – Lina
    Apr 6 at 14:39
















  • $begingroup$
    Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for Pandas to display just the first 9 columns when computing the clustering?
    $endgroup$
    – Lina
    Apr 6 at 13:34










  • $begingroup$
    The dataset I used has just 9 columns. So, that's all of it.
    $endgroup$
    – Tasos
    Apr 6 at 13:44










  • $begingroup$
    Pardon me, I meant rows, not columns.
    $endgroup$
    – Lina
    Apr 6 at 13:56










  • $begingroup$
    I am not sure what you mean. When you cluster, the whole dataset is used. Not just the first X rows. Maybe you can elaborate more?
    $endgroup$
    – Tasos
    Apr 6 at 14:04










  • $begingroup$
    Sure. In my data set I have 4 columns composed of 64 rows each. Once I clustered, I expect a result of 64 rows, instead of just 9. I'm wondering if that is how it works or maybe I need to add a couple of lines of code to display the whole data frame?
    $endgroup$
    – Lina
    Apr 6 at 14:39















$begingroup$
Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for Pandas to display just the first 9 columns when computing the clustering?
$endgroup$
– Lina
Apr 6 at 13:34




$begingroup$
Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for Pandas to display just the first 9 columns when computing the clustering?
$endgroup$
– Lina
Apr 6 at 13:34












$begingroup$
The dataset I used has just 9 columns. So, that's all of it.
$endgroup$
– Tasos
Apr 6 at 13:44




$begingroup$
The dataset I used has just 9 columns. So, that's all of it.
$endgroup$
– Tasos
Apr 6 at 13:44












$begingroup$
Pardon me, I meant rows, not columns.
$endgroup$
– Lina
Apr 6 at 13:56




$begingroup$
Pardon me, I meant rows, not columns.
$endgroup$
– Lina
Apr 6 at 13:56












$begingroup$
I am not sure what you mean. When you cluster, the whole dataset is used. Not just the first X rows. Maybe you can elaborate more?
$endgroup$
– Tasos
Apr 6 at 14:04




$begingroup$
I am not sure what you mean. When you cluster, the whole dataset is used. Not just the first X rows. Maybe you can elaborate more?
$endgroup$
– Tasos
Apr 6 at 14:04












$begingroup$
Sure. In my data set I have 4 columns composed of 64 rows each. Once I clustered, I expect a result of 64 rows, instead of just 9. I'm wondering if that is how it works or maybe I need to add a couple of lines of code to display the whole data frame?
$endgroup$
– Lina
Apr 6 at 14:39




$begingroup$
Sure. In my data set I have 4 columns composed of 64 rows each. Once I clustered, I expect a result of 64 rows, instead of just 9. I'm wondering if that is how it works or maybe I need to add a couple of lines of code to display the whole data frame?
$endgroup$
– Lina
Apr 6 at 14:39

















draft saved

draft discarded
















































Thanks for contributing an answer to Data Science Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48693%2fperform-k-means-clustering-over-multiple-columns%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Adding axes to figuresAdding axes labels to LaTeX figuresLaTeX equivalent of ConTeXt buffersRotate a node but not its content: the case of the ellipse decorationHow to define the default vertical distance between nodes?TikZ scaling graphic and adjust node position and keep font sizeNumerical conditional within tikz keys?adding axes to shapesAlign axes across subfiguresAdding figures with a certain orderLine up nested tikz enviroments or how to get rid of themAdding axes labels to LaTeX figures

Luettelo Yhdysvaltain laivaston lentotukialuksista Lähteet | Navigointivalikko

Gary (muusikko) Sisällysluettelo Historia | Rockin' High | Lähteet | Aiheesta muualla | NavigointivalikkoInfobox OKTuomas "Gary" Keskinen Ancaran kitaristiksiProjekti Rockin' High