Data Navigator
Machine Learning, Deep Learning
[sklearn, KMeans] Data-Driven Customer Segmentation Using K-Means Clustering
코딩하고분석하는돌스 2021. 2. 1. 16:33
Data-Driven Customer Segmentation Using K-Means Clustering
Analyzing and grouping spending patterns by age, income level, and gender
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [5]:
data = pd.read_csv('Mall_Customers.csv', index_col=0)
In [6]:
data.head()
Out[6]:
Gender | Age | Annual Income (k$) | Spending Score (1-100) | |
---|---|---|---|---|
CustomerID | ||||
1 | Male | 19 | 15 | 39 |
2 | Male | 21 | 15 | 81 |
3 | Female | 20 | 16 | 6 |
4 | Female | 23 | 16 | 77 |
5 | Female | 31 | 17 | 40 |
In [7]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 200 non-null object
1 Age 200 non-null int64
2 Annual Income (k$) 200 non-null int64
3 Spending Score (1-100) 200 non-null int64
dtypes: int64(3), object(1)
memory usage: 7.8+ KB
In [8]:
data.describe()
Out[8]:
Age | Annual Income (k$) | Spending Score (1-100) | |
---|---|---|---|
count | 200.000000 | 200.000000 | 200.000000 |
mean | 38.850000 | 60.560000 | 50.200000 |
std | 13.969007 | 26.264721 | 25.823522 |
min | 18.000000 | 15.000000 | 1.000000 |
25% | 28.750000 | 41.500000 | 34.750000 |
50% | 36.000000 | 61.500000 | 50.000000 |
75% | 49.000000 | 78.000000 | 73.000000 |
max | 70.000000 | 137.000000 | 99.000000 |
Converting the Gender column to numeric with get_dummies
In [11]:
data = pd.get_dummies(data, columns=['Gender'], drop_first=True)
In [12]:
data
Out[12]:
Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male | |
---|---|---|---|---|
CustomerID | ||||
1 | 19 | 15 | 39 | 1 |
2 | 21 | 15 | 81 | 1 |
3 | 20 | 16 | 6 | 0 |
4 | 23 | 16 | 77 | 0 |
5 | 31 | 17 | 40 | 0 |
... | ... | ... | ... | ... |
196 | 35 | 120 | 79 | 0 |
197 | 45 | 126 | 28 | 0 |
198 | 32 | 126 | 74 | 1 |
199 | 32 | 137 | 18 | 1 |
200 | 30 | 137 | 83 | 1 |
200 rows × 4 columns
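As a quick illustration on a toy frame (not the mall dataset), `pd.get_dummies` with `drop_first=True` replaces the categorical column with a single 0/1 indicator, dropping the first category (here, Female) to avoid redundancy:

```python
import pandas as pd

# Toy frame to illustrate the encoding (hypothetical values)
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female']})
encoded = pd.get_dummies(toy, columns=['Gender'], drop_first=True)

print(encoded.columns.tolist())                    # ['Gender_Male']
print(encoded['Gender_Male'].astype(int).tolist())  # [1, 0, 0]
```

Without `drop_first=True` both `Gender_Female` and `Gender_Male` would be created, which carry the same information twice.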
Using scikit-learn's KMeans for the analysis
In [13]:
from sklearn.cluster import KMeans
In [14]:
model = KMeans(n_clusters=3)
In [15]:
model.fit(data)
Out[15]:
KMeans(n_clusters=3)
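One caveat: KMeans starts from random centroids, so the cluster labels (and sometimes the assignments themselves) can differ between runs. Fixing `random_state` (not done in the original, but useful for reproducibility) makes repeated fits identical; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(50, 2)  # small synthetic data for the sketch

# Same seed -> identical cluster assignments across runs
m1 = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
m2 = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
assert (m1.labels_ == m2.labels_).all()
```

`n_init=10` is spelled out explicitly because the default changed in recent scikit-learn versions.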
In [16]:
model.labels_
Out[16]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 2, 1, 2, 1, 2, 1,
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1,
2, 1])
In [17]:
result_df = data.copy()
In [18]:
result_df['label'] = model.labels_
In [19]:
result_df
Out[19]:
Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male | label | |
---|---|---|---|---|---|
CustomerID | |||||
1 | 19 | 15 | 39 | 1 | 0 |
2 | 21 | 15 | 81 | 1 | 0 |
3 | 20 | 16 | 6 | 0 | 0 |
4 | 23 | 16 | 77 | 0 | 0 |
5 | 31 | 17 | 40 | 0 | 0 |
... | ... | ... | ... | ... | ... |
196 | 35 | 120 | 79 | 0 | 1 |
197 | 45 | 126 | 28 | 0 | 2 |
198 | 32 | 126 | 74 | 1 | 1 |
199 | 32 | 137 | 18 | 1 | 2 |
200 | 30 | 137 | 83 | 1 | 1 |
200 rows × 5 columns
In [20]:
result_df.groupby('label').mean()
Out[20]:
Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male | |
---|---|---|---|---|
label | ||||
0 | 40.325203 | 44.154472 | 49.829268 | 0.406504 |
1 | 32.692308 | 86.538462 | 82.128205 | 0.461538 |
2 | 40.394737 | 87.000000 | 18.631579 | 0.526316 |
Among customers in their 40s, those with higher income spend less than those with lower income.
In [21]:
result_df['label'].value_counts()
Out[21]:
0 123
1 39
2 38
Name: label, dtype: int64
In [23]:
distance = []
for i in range(2, 11):
    model = KMeans(n_clusters=i)
    model.fit(data)
    distance.append(model.inertia_)
In [24]:
distance
Out[24]:
[212889.44245524294,
143391.59236035674,
104414.67534220174,
75399.61541401486,
58348.64136331504,
51130.69008126375,
44392.11566567933,
40874.585811688296,
37165.67927209666]
In [25]:
sns.lineplot(x=list(range(2,11)), y=distance)
Out[25]:
<AxesSubplot:>
The curve flattens too gradually to pick an optimal k from the elbow
Finding the optimal k with the silhouette score (silhouette_score)
In [26]:
from sklearn.metrics import silhouette_score
In [27]:
silhouette_score(data, model.labels_)
Out[27]:
0.38468263023102545
As with the elbow method, vary k and compute the silhouette score at each value
In [29]:
sil = []
for i in range(2, 11):
    model = KMeans(n_clusters=i)
    model.fit(data)
    sil.append(silhouette_score(data, model.labels_))
In [30]:
sil
Out[30]:
[0.29307334005502633,
0.383798873822341,
0.4052954330641215,
0.443430209791173,
0.45205475380756527,
0.44096462877395787,
0.425945425758392,
0.40878413366295513,
0.38201656603451434]
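Rather than eyeballing the line plot, the best k can be read straight off the score list (assuming `sil` holds the scores for k = 2 through 10 as above; the values below are rounded copies of those scores):

```python
import numpy as np

# Silhouette scores for k = 2..10 (rounded from the output above)
sil = [0.293, 0.384, 0.405, 0.443, 0.452, 0.441, 0.426, 0.409, 0.382]
ks = list(range(2, 11))

best_k = ks[int(np.argmax(sil))]
print(best_k)  # 6
```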
In [31]:
sns.lineplot(x=list(range(2,11)), y=sil)
Out[31]:
<AxesSubplot:>
The silhouette scores indicate that k = 6 is optimal
In [32]:
model = KMeans(n_clusters=6)
In [33]:
model.fit(data)
Out[33]:
KMeans(n_clusters=6)
In [34]:
data['label'] = model.labels_
In [35]:
data.groupby('label').mean()
Out[35]:
Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male | |
---|---|---|---|---|
label | ||||
0 | 32.692308 | 86.538462 | 82.128205 | 0.461538 |
1 | 41.685714 | 88.228571 | 17.285714 | 0.571429 |
2 | 25.272727 | 25.727273 | 79.363636 | 0.409091 |
3 | 56.155556 | 53.377778 | 49.088889 | 0.444444 |
4 | 27.000000 | 56.657895 | 49.131579 | 0.342105 |
5 | 44.142857 | 25.142857 | 19.523810 | 0.380952 |
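A caveat worth noting (not addressed in the original analysis): KMeans is distance-based, so columns with large ranges, such as Annual Income here, dominate the clustering while the 0/1 Gender_Male column barely contributes. Standardizing the features first is a common variation; a minimal sketch with `StandardScaler` on synthetic stand-in data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Synthetic stand-in: one wide-range column, one narrow-range column
X = np.column_stack([rng.uniform(15, 137, 200),  # income-like scale
                     rng.uniform(0, 1, 200)])    # dummy-like scale

X_scaled = StandardScaler().fit_transform(X)  # mean 0, std 1 per column
model = KMeans(n_clusters=6, random_state=42, n_init=10).fit(X_scaled)
```

After scaling, every feature pulls on the distance metric with equal weight; whether that is desirable depends on whether income differences should count more than gender differences.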
In [36]:
sns.boxplot(x='label', y='Age', data=data)
Out[36]:
<AxesSubplot:xlabel='label', ylabel='Age'>
In [37]:
sns.boxplot(x='label', y='Annual Income (k$)', data=data)
Out[37]:
<AxesSubplot:xlabel='label', ylabel='Annual Income (k$)'>
In [38]:
sns.boxplot(x='label', y='Spending Score (1-100)', data=data)
Out[38]:
<AxesSubplot:xlabel='label', ylabel='Spending Score (1-100)'>
In [40]:
model.labels_
Out[40]:
array([5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2,
5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 5, 2, 3, 2, 3, 4,
5, 2, 3, 4, 4, 4, 3, 4, 4, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4,
3, 3, 4, 4, 3, 3, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4,
4, 3, 3, 4, 3, 4, 4, 4, 3, 4, 3, 4, 4, 3, 3, 4, 3, 4, 3, 3, 3, 3,
3, 4, 4, 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 0, 4, 0, 1, 0, 1, 0, 1, 0,
4, 0, 1, 0, 1, 0, 1, 0, 1, 0, 4, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0,
1, 0])
In [41]:
data.drop('label', axis=1, inplace=True)
In [42]:
data
Out[42]:
Age | Annual Income (k$) | Spending Score (1-100) | Gender_Male | |
---|---|---|---|---|
CustomerID | ||||
1 | 19 | 15 | 39 | 1 |
2 | 21 | 15 | 81 | 1 |
3 | 20 | 16 | 6 | 0 |
4 | 23 | 16 | 77 | 0 |
5 | 31 | 17 | 40 | 0 |
... | ... | ... | ... | ... |
196 | 35 | 120 | 79 | 0 |
197 | 45 | 126 | 28 | 0 |
198 | 32 | 126 | 74 | 1 |
199 | 32 | 137 | 18 | 1 |
200 | 30 | 137 | 83 | 1 |
200 rows × 4 columns
In [44]:
from sklearn.decomposition import PCA
In [45]:
pca = PCA(n_components=2)
In [46]:
pca.fit(data)
Out[46]:
PCA(n_components=2)
In [49]:
pca_df = pca.transform(data)
In [51]:
pca_df = pd.DataFrame(pca_df, columns=['PC1', 'PC2'])
In [55]:
plt.figure(figsize=(10, 10))
sns.scatterplot(x=pca_df['PC1'], y=pca_df['PC2'], hue=model.labels_, palette='Set2')
Out[55]:
<AxesSubplot:xlabel='PC1', ylabel='PC2'>
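How faithful the 2-D scatter is depends on how much of the data's variance the two components retain; `explained_variance_ratio_` reports this. A sketch on synthetic stand-in data (random values, so the actual ratios will differ from the customer data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(200, 4)  # stand-in for the 4-column customer frame

pca = PCA(n_components=2).fit(X)
ratio = pca.explained_variance_ratio_
print(ratio.sum())  # fraction of total variance kept by PC1 + PC2
```

If this sum is low, clusters that look overlapped in the scatter plot may in fact be well separated in the original feature space.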