Perform k-means clustering over multiple columns
I am trying to perform k-means clustering on multiple columns. My data set is composed of 4 numerical columns and 1 categorical column. I have researched previous questions, but the answers were not satisfactory.

I know how to run the algorithm on two columns, but I am finding it difficult to apply the same approach to 4 numerical columns.

I am not interested in visualizing the data for now, but in having the cluster labels displayed in the table: each row should be assigned to a cluster (for example, the first row belongs to cluster 2, and so on). That is exactly what I need to achieve, but using 4 numerical columns.

Do you have any idea how to achieve this? Any help would be greatly appreciated.
python clustering pandas
asked Apr 5 at 13:20 by Lina
Comment (Anony-Mousse, Apr 5 at 21:16): Note that the age attribute is effectively ignored; you get the same result using only income, because the data is not appropriately prepared for this analysis.
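The comment above is about feature scaling: k-means uses Euclidean distance, so a feature with a large numeric range (such as income) dominates one with a small range (such as age) unless the columns are standardized first. A minimal sketch, with made-up column names and values for illustration:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: income spans a far larger range than age,
# so unscaled k-means would effectively cluster on income alone.
df = pd.DataFrame({'age': [23, 45, 31, 52, 28, 60],
                   'income': [30000, 90000, 42000, 110000, 35000, 95000]})

# Standardize each column to zero mean and unit variance so both
# features contribute comparably to the Euclidean distance.
scaled = StandardScaler().fit_transform(df)

df['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(df)
```

The labels are attached to the original, unscaled frame, so the table stays readable while the clustering itself runs on the standardized values.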
2 Answers
There is no difference in methodology between 2 and 4 columns. If you have issues, they are probably due to the contents of your columns: k-means expects numerical columns with no null or infinite values, and categorical data should be avoided (or encoded first). Here I do it with 4 numerical features:

import pandas as pd
from sklearn.datasets import make_blobs  # in scikit-learn < 0.24 this lived in sklearn.datasets.samples_generator
from sklearn.cluster import KMeans

# Generate 10 samples with 4 features drawn from 3 blobs
X, _ = make_blobs(n_samples=10, centers=3, n_features=4)
df = pd.DataFrame(X, columns=['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4'])

# Fit k-means on all four columns and attach the labels to the frame
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])
df['Cluster'] = y
print(df.head())

Which outputs:

     Feat_1    Feat_2    Feat_3    Feat_4  Cluster
0  0.005875  4.387241 -1.093308  8.213623        2
1  8.763603 -2.769244  4.581705  1.355389        1
2 -0.296613  4.120262 -1.635583  7.533157        2
3 -1.576720  4.957406  2.919704  0.155499        0
4  2.470349  4.098629  2.368335  0.043568        0
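Since the question also mentions one categorical column: k-means cannot consume it directly, but a common workaround is to one-hot encode it before clustering (whether that is statistically sensible depends on the data; k-prototypes is an alternative for mixed types). A minimal sketch with hypothetical column names:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frame: four numeric features plus one categorical column.
df = pd.DataFrame({
    'f1': [1.0, 2.0, 8.0, 9.0],
    'f2': [1.5, 2.5, 8.5, 9.5],
    'f3': [0.1, 0.2, 0.8, 0.9],
    'f4': [5.0, 6.0, 1.0, 2.0],
    'color': ['red', 'red', 'blue', 'blue'],
})

# One-hot encode the categorical column, then cluster on all resulting columns.
encoded = pd.get_dummies(df, columns=['color'])
df['cluster'] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(encoded)
print(df)
```

Note that one-hot columns take values 0/1, so after scaling they may carry more or less weight than intended; treat this as a starting point rather than a universally correct recipe.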
answered Apr 5 at 13:56 by Simon Larsson (edited Apr 5 at 14:12)

Comment (Lina, Apr 5 at 14:06): Thanks for your help Simon!
Let's take as an example the Breast Cancer dataset from the UCI Machine Learning Repository. This is how it looks:
>> _data.head(5)
Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635
MCP.1 Classification
0 417.114 1
1 468.786 1
2 554.697 1
3 928.220 1
4 773.920 1
As you can see, all the columns are numerical. Let's now see how we can cluster the dataset with k-means. We don't need the Classification label; note that the `[:-2]` slice below drops the last two columns, so MCP.1 is excluded along with the label.
### Get all the features columns except the class
features = list(_data.columns)[:-2]
### Get the features data
data = _data[features]
Now, perform the actual clustering, simple as that:

### precompute_distances and n_jobs were removed in scikit-learn 1.0; use plain KMeans(n_clusters=2) there
clustering_kmeans = KMeans(n_clusters=2, precompute_distances="auto", n_jobs=-1)
### this may raise a SettingWithCopyWarning; data = _data[features].copy() above avoids it
data['clusters'] = clustering_kmeans.fit_predict(data)
There is no difference at all between 2 or more features; I just pass the DataFrame with all my numeric columns.
Age BMI Glucose Insulin HOMA Leptin Adiponectin Resistin
0 48 23.500000 70 2.707 0.467409 8.8071 9.702400 7.99585
1 83 20.690495 92 3.115 0.706897 8.8438 5.429285 4.06405
2 82 23.124670 91 4.498 1.009651 17.9393 22.432040 9.27715
3 68 21.367521 77 3.226 0.612725 9.8827 7.169560 12.76600
4 86 21.111111 92 3.549 0.805386 6.6994 4.819240 10.57635
cluster
0 0
1 0
2 0
3 0
4 0
How can you visualize the clustering now? You cannot do it directly if you have more than 3 columns. However, you can apply Principal Component Analysis to reduce the space to 2 dimensions and visualize that instead.
### Run PCA on the data and reduce the dimensions to pca_num_components dimensions
pca_num_components = 2  # must be defined before use (2 so the result can be plotted)
reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
results = pd.DataFrame(reduced_data, columns=['pca1', 'pca2'])
sns.scatterplot(x="pca1", y="pca2", hue=data['clusters'], data=results)
plt.title('K-means Clustering with 2 dimensions')
plt.show()
Here are the imports I used
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
And this is the visualization: a scatter plot of the two principal components, colored by cluster.
answered Apr 6 at 12:50 by Tasos
Comment (Lina, Apr 6 at 13:34): Thank you Tasos for your response, very helpful! As I'm still quite new to this, I was wondering if it's normal for pandas to display just the first 9 columns when computing the clustering?

Comment (Tasos, Apr 6 at 13:44): The dataset I used has just 9 columns, so that's all of it.

Comment (Lina, Apr 6 at 13:56): Pardon me, I meant rows, not columns.

Comment (Tasos, Apr 6 at 14:04): I am not sure what you mean. When you cluster, the whole dataset is used, not just the first X rows. Maybe you can elaborate more?

Comment (Lina, Apr 6 at 14:39): Sure. In my data set I have 4 columns composed of 64 rows each. Once clustered, I expect a result of 64 rows instead of just 9. Is that how it works, or do I need to add a couple of lines of code to display the whole data frame?
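Regarding the last comment: clustering always assigns a label to every row; pandas merely truncates the printed output by default. A hypothetical sketch showing that all 64 rows receive a label, and how to print the whole frame:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical 64-row, 4-column data set like the one in the comment.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(64, 4)), columns=['a', 'b', 'c', 'd'])

df['cluster'] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)
print(len(df))  # 64: every row got a cluster label

# pandas truncates long frames when printing; lift the limit to see all rows.
with pd.option_context('display.max_rows', None):
    print(df)
```

`pd.set_option('display.max_rows', None)` does the same thing globally, while the context manager above restores the default afterwards.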