[sklearn] Predicting Customer Churn with the KNN (K-Nearest Neighbors) Algorithm
Machine Learning, Deep Learning
코딩하고분석하는돌스 2021. 1. 27. 23:14
Predicting Customer Churn with the KNN Algorithm
In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
KNN classifies a point by the group of its K nearest neighbors, so the result changes depending on how K is set.
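To make the effect of K concrete, here is a minimal sketch on toy 1-D data (the points and classes are illustrative, not from the churn dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: three class-0 points clustered low, two class-1 points high
X = np.array([[1.0], [1.5], [2.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1])

# With K=1 the query point 7.0 follows its single nearest neighbor (8.0, class 1);
# with K=5 every training point votes and the three class-0 points win the majority.
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K={k} -> predicted class {knn.predict([[7.0]])[0]}")
```

The same query point flips class purely because K changed, which is why the notebook later sweeps K over a range and picks the best value.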
Loading the telecom customer data
In [136]:
data = pd.read_csv('churn.csv')
Use pd.set_option('display.max_columns', 100) to display up to 100 columns
In [244]:
pd.set_option('display.max_columns',100)
In [245]:
data.head(100)
Out[245]:
SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | InternetService_Fiber optic | InternetService_No | OnlineSecurity_No internet service | OnlineSecurity_Yes | OnlineBackup_No internet service | OnlineBackup_Yes | DeviceProtection_No internet service | DeviceProtection_Yes | TechSupport_No internet service | TechSupport_Yes | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 29.85 | 29.85 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 0 | 34 | 56.95 | 1889.50 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 2 | 53.85 | 108.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
3 | 0 | 45 | 42.30 | 1840.75 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 2 | 70.70 | 151.65 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 0 | 12 | 78.95 | 927.35 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
96 | 0 | 71 | 66.85 | 4748.70 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
97 | 0 | 5 | 21.05 | 113.85 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
98 | 0 | 52 | 21.00 | 1107.20 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
99 | 1 | 25 | 98.50 | 2514.50 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
100 rows × 31 columns
EDA: exploring the data
7,043 rows and 21 columns; most columns are of object (string) dtype.
In [138]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
TotalCharges should be numeric, so convert it with pd.to_numeric
In [139]:
pd.to_numeric(data['TotalCharges'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-139-ff40986ab5f9> in <module>
----> 1 pd.to_numeric(data['TotalCharges'])
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
152 coerce_numeric = errors not in ("ignore", "raise")
153 try:
--> 154 values = lib.maybe_convert_numeric(
155 values, set(), coerce_numeric=coerce_numeric
156 )
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " " at position 488
Inspecting row 488, where the error occurred
In [140]:
data.iloc[488]
Out[140]:
customerID 4472-LVYGI
gender Female
SeniorCitizen 0
Partner Yes
Dependents Yes
tenure 0
PhoneService No
MultipleLines No phone service
InternetService DSL
OnlineSecurity Yes
OnlineBackup No
DeviceProtection Yes
TechSupport Yes
StreamingTV Yes
StreamingMovies No
Contract Two year
PaperlessBilling Yes
PaymentMethod Bank transfer (automatic)
MonthlyCharges 52.55
TotalCharges
Churn No
Name: 488, dtype: object
Row 488's TotalCharges value is a blank string, so replace it with NaN
In [143]:
data['TotalCharges'] = data['TotalCharges'].replace(" ", np.nan)
Converting to numeric again
In [144]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])
In [145]:
data.describe()
Out[145]:
SeniorCitizen | tenure | MonthlyCharges | TotalCharges | |
---|---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 | 7032.000000 |
mean | 0.162147 | 32.371149 | 64.761692 | 2283.300441 |
std | 0.368612 | 24.559481 | 30.090047 | 2266.771362 |
min | 0.000000 | 0.000000 | 18.250000 | 18.800000 |
25% | 0.000000 | 9.000000 | 35.500000 | 401.450000 |
50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
75% | 0.000000 | 55.000000 | 89.850000 | 3794.737500 |
max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |
In [146]:
sns.distplot(data['TotalCharges'])
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[146]:
<AxesSubplot:xlabel='TotalCharges', ylabel='Density'>
Converting nominal (categorical) columns to numeric form
Converting the gender column from male/female to 0/1
In [148]:
data['gender'].nunique()
Out[148]:
2
pd.get_dummies creates 0/1 columns for male and female; drop_first=True drops one of the two redundant columns
In [150]:
pd.get_dummies(data, columns=['gender'], drop_first=True)
Out[150]:
customerID | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | gender_Male | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 |
1 | 5575-GNVDE | 0 | No | No | 34 | Yes | No | DSL | Yes | No | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No | 1 |
2 | 3668-QPYBK | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 |
3 | 7795-CFOCW | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | 1 |
4 | 9237-HQITU | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | 6840-RESVB | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | ... | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No | 1 |
7039 | 2234-XADUH | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | ... | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No | 0 |
7040 | 4801-JZAZL | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No | 0 |
7041 | 8361-LTMKD | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | ... | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes | 1 |
7042 | 3186-AJIEK | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | ... | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No | 1 |
7043 rows × 21 columns
Filtering by dtype to extract the list of columns to convert to 0/1
In [151]:
data['gender'].dtype == 'O'
Out[151]:
True
In [152]:
data.columns
Out[152]:
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
In [153]:
for i in data.columns:
print(data[i].dtype)
object
object
int64
object
object
int64
object
object
object
object
object
object
object
object
object
object
object
object
float64
float64
object
In [154]:
col_list = []
for i in data.columns:
if data[i].dtype =='O':
col_list.append(i)
In [155]:
col_list
Out[155]:
['customerID',
'gender',
'Partner',
'Dependents',
'PhoneService',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies',
'Contract',
'PaperlessBilling',
'PaymentMethod',
'Churn']
In [156]:
for i in col_list:
print(i, data[i].nunique())
customerID 7043
gender 2
Partner 2
Dependents 2
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
Churn 2
In [157]:
col_list = col_list[1:]
col_list
Out[157]:
['gender',
'Partner',
'Dependents',
'PhoneService',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies',
'Contract',
'PaperlessBilling',
'PaymentMethod',
'Churn']
Using col_list to convert multiple columns to 0/1 at once
In [158]:
data = pd.get_dummies(data, columns=col_list, drop_first=True)
In [159]:
data
Out[159]:
customerID | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | ... | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | 0 | 1 | 29.85 | 29.85 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 5575-GNVDE | 0 | 34 | 56.95 | 1889.50 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 3668-QPYBK | 0 | 2 | 53.85 | 108.15 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
3 | 7795-CFOCW | 0 | 45 | 42.30 | 1840.75 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 9237-HQITU | 0 | 2 | 70.70 | 151.65 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | 6840-RESVB | 0 | 24 | 84.80 | 1990.50 | 1 | 1 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
7039 | 2234-XADUH | 0 | 72 | 103.20 | 7362.90 | 0 | 1 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
7040 | 4801-JZAZL | 0 | 11 | 29.60 | 346.45 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
7041 | 8361-LTMKD | 1 | 4 | 74.40 | 306.60 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
7042 | 3186-AJIEK | 0 | 66 | 105.65 | 6844.50 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
7043 rows × 32 columns
Checking and handling missing values
In [160]:
data.isna().sum()
Out[160]:
customerID 0
SeniorCitizen 0
tenure 0
MonthlyCharges 0
TotalCharges 11
gender_Male 0
Partner_Yes 0
Dependents_Yes 0
PhoneService_Yes 0
MultipleLines_No phone service 0
MultipleLines_Yes 0
InternetService_Fiber optic 0
InternetService_No 0
OnlineSecurity_No internet service 0
OnlineSecurity_Yes 0
OnlineBackup_No internet service 0
OnlineBackup_Yes 0
DeviceProtection_No internet service 0
DeviceProtection_Yes 0
TechSupport_No internet service 0
TechSupport_Yes 0
StreamingTV_No internet service 0
StreamingTV_Yes 0
StreamingMovies_No internet service 0
StreamingMovies_Yes 0
Contract_One year 0
Contract_Two year 0
PaperlessBilling_Yes 0
PaymentMethod_Credit card (automatic) 0
PaymentMethod_Electronic check 0
PaymentMethod_Mailed check 0
Churn_Yes 0
dtype: int64
In [161]:
data['TotalCharges'].mean()
Out[161]:
2283.3004408418656
In [162]:
data['TotalCharges'].median()
Out[162]:
1397.475
In [163]:
sns.distplot(data['TotalCharges'])
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[163]:
<AxesSubplot:xlabel='TotalCharges', ylabel='Density'>
In [164]:
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
Types of feature scaling and their characteristics
Different scales?
When variables differ widely in scale, they must be rescaled before they can be compared
Standard Scaler: standardizes values using the mean and standard deviation (zero mean, unit variance)
Robust Scaler: scales using the 25% and 75% quantiles, so it is less affected by outliers
Min-Max Scaler: maps values to [0, 1] while preserving the shape of the distribution, which is why it is used here
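To see the contrast, here is a quick sketch of the three scalers applied to one toy column containing an outlier (illustrative values, not the churn data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# A single feature with one large outlier at 100
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X).ravel()
    print(f"{type(scaler).__name__:14s}", np.round(scaled, 2))
```

MinMaxScaler squeezes the non-outlier points near 0 while keeping their relative spacing; RobustScaler keeps them spread out because the quantile-based statistics ignore the outlier.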
Importing MinMaxScaler from sklearn.preprocessing and scaling the data
In [165]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
In [166]:
minmax = MinMaxScaler()
Dropping customerID
In [167]:
minmax.fit(data)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-167-4b9c9ec5942c> in <module>
----> 1 minmax.fit(data)
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\preprocessing\_data.py in fit(self, X, y)
361 # Reset internal state before fitting
362 self._reset()
--> 363 return self.partial_fit(X, y)
364
365 def partial_fit(self, X, y=None):
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\preprocessing\_data.py in partial_fit(self, X, y)
394
395 first_pass = not hasattr(self, 'n_samples_seen_')
--> 396 X = self._validate_data(X, reset=first_pass,
397 estimator=self, dtype=FLOAT_DTYPES,
398 force_all_finite="allow-nan")
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
419 out = X
420 elif isinstance(y, str) and y == 'no_validation':
--> 421 X = check_array(X, **check_params)
422 out = X
423 else:
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
614 array = array.astype(dtype, casting="unsafe", copy=False)
615 else:
--> 616 array = np.asarray(array, order=order, dtype=dtype)
617 except ComplexWarning as complex_warning:
618 raise ValueError("Complex data not supported\n"
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\pandas\core\generic.py in __array__(self, dtype)
1894
1895 def __array__(self, dtype=None) -> np.ndarray:
-> 1896 return np.asarray(self._values, dtype=dtype)
1897
1898 def __array_wrap__(
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
ValueError: could not convert string to float: '7590-VHVEG'
In [168]:
data.drop('customerID', axis=1, inplace=True)
In [169]:
minmax.fit(data)
Out[169]:
MinMaxScaler()
In [170]:
scaled_data = minmax.transform(data)
In [171]:
data.columns
Out[171]:
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
'MultipleLines_No phone service', 'MultipleLines_Yes',
'InternetService_Fiber optic', 'InternetService_No',
'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
'OnlineBackup_No internet service', 'OnlineBackup_Yes',
'DeviceProtection_No internet service', 'DeviceProtection_Yes',
'TechSupport_No internet service', 'TechSupport_Yes',
'StreamingTV_No internet service', 'StreamingTV_Yes',
'StreamingMovies_No internet service', 'StreamingMovies_Yes',
'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check',
'Churn_Yes'],
dtype='object')
In [173]:
scaled_data = pd.DataFrame(scaled_data, columns=data.columns)
In [174]:
scaled_data
Out[174]:
SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | ... | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.013889 | 0.115423 | 0.001275 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.472222 | 0.385075 | 0.215867 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.027778 | 0.354229 | 0.010310 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
3 | 0.0 | 0.625000 | 0.239303 | 0.210241 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.027778 | 0.521891 | 0.015330 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | 0.0 | 0.333333 | 0.662189 | 0.227521 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
7039 | 0.0 | 1.000000 | 0.845274 | 0.847461 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
7040 | 0.0 | 0.152778 | 0.112935 | 0.037809 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
7041 | 1.0 | 0.055556 | 0.558706 | 0.033210 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
7042 | 0.0 | 0.916667 | 0.869652 | 0.787641 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7043 rows × 31 columns
Splitting into training and test sets with train_test_split
In [178]:
from sklearn.model_selection import train_test_split
In [179]:
X = scaled_data.drop('Churn_Yes', axis=1)
y = scaled_data['Churn_Yes']
In [192]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
Importing the KNN module (from sklearn.neighbors import KNeighborsClassifier) and fitting the model
In [196]:
from sklearn.neighbors import KNeighborsClassifier
In [200]:
knn = KNeighborsClassifier(n_neighbors=10)
In [201]:
knn.fit(X_train, y_train)
Out[201]:
KNeighborsClassifier(n_neighbors=10)
In [204]:
pred = knn.predict(X_test)
In [207]:
pd.DataFrame({'actual value':y_test, 'pred_value':pred}).head(30)
Out[207]:
actual value | pred_value | |
---|---|---|
4880 | 0.0 | 0.0 |
1541 | 0.0 | 0.0 |
1289 | 0.0 | 0.0 |
5745 | 0.0 | 0.0 |
4873 | 0.0 | 0.0 |
4168 | 0.0 | 0.0 |
1557 | 0.0 | 0.0 |
2892 | 0.0 | 0.0 |
664 | 0.0 | 0.0 |
1588 | 0.0 | 0.0 |
1338 | 1.0 | 0.0 |
6000 | 0.0 | 0.0 |
2310 | 0.0 | 0.0 |
3294 | 1.0 | 1.0 |
290 | 1.0 | 1.0 |
2505 | 0.0 | 0.0 |
3171 | 0.0 | 0.0 |
1366 | 1.0 | 1.0 |
6560 | 0.0 | 0.0 |
2420 | 0.0 | 0.0 |
5210 | 1.0 | 0.0 |
2836 | 0.0 | 1.0 |
1325 | 1.0 | 1.0 |
4900 | 1.0 | 0.0 |
6311 | 0.0 | 0.0 |
1025 | 0.0 | 0.0 |
2031 | 0.0 | 0.0 |
4459 | 1.0 | 1.0 |
5324 | 0.0 | 0.0 |
3441 | 0.0 | 0.0 |
In [213]:
from sklearn.metrics import accuracy_score, confusion_matrix
from tqdm import tqdm
Accuracy: 75.81%
In [210]:
accuracy_score(y_test, pred)
Out[210]:
0.7581637482252721
FP = 194, FN = 317
In [211]:
confusion_matrix(y_test, pred)
Out[211]:
array([[1353, 194],
[ 317, 249]], dtype=int64)
In [212]:
error_list = []
In [215]:
for i in tqdm(range(1,101)):
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
error_list.append(accuracy_score(y_test, pred))
100%|█████████████████████████████████████████████████████| 100/100 [00:30<00:00, 3.33it/s]
In [217]:
len(error_list)
Out[217]:
100
Checking how accuracy changes with k using a lineplot
In [228]:
plt.figure(figsize=(20,10))
sns.lineplot(x=range(1, 101), y=error_list, marker='o', markersize=10, markerfacecolor='red')
Out[228]:
<AxesSubplot:>
Finding the highest accuracy with max()
In [230]:
max(error_list)
Out[230]:
0.780407004259347
Passing max(error_list) to .index to find its position: the 54th value (index 53)
In [247]:
error_list.index(max(error_list))
Out[247]:
53
Converting the list to a NumPy array and calling .argmax() is an easier way to find the index
In [233]:
np.array(error_list)
Out[233]:
array([0.71367724, 0.74917179, 0.74396593, 0.75579744, 0.74822527,
0.75816375, 0.74585897, 0.75911027, 0.75295788, 0.75816375,
0.7624231 , 0.76100331, 0.76005679, 0.7671557 , 0.76620918,
0.76810222, 0.76904875, 0.77283483, 0.76810222, 0.77236157,
0.76810222, 0.77141505, 0.76999527, 0.76526266, 0.76762896,
0.77141505, 0.77330809, 0.77283483, 0.77330809, 0.77141505,
0.76999527, 0.76762896, 0.77094179, 0.77046853, 0.77425461,
0.77378135, 0.77330809, 0.77520114, 0.77188831, 0.77378135,
0.77236157, 0.77756744, 0.77662092, 0.77472788, 0.7756744 ,
0.77662092, 0.77756744, 0.77756744, 0.77425461, 0.77094179,
0.77141505, 0.7756744 , 0.77520114, 0.780407 , 0.77425461,
0.77709418, 0.77614766, 0.77520114, 0.77472788, 0.76999527,
0.77046853, 0.76952201, 0.77094179, 0.7671557 , 0.76904875,
0.76810222, 0.77094179, 0.77283483, 0.77094179, 0.76952201,
0.77425461, 0.77141505, 0.77094179, 0.77188831, 0.76904875,
0.77236157, 0.77188831, 0.77425461, 0.77188831, 0.77236157,
0.77330809, 0.76904875, 0.77094179, 0.77141505, 0.76857549,
0.76526266, 0.76952201, 0.76857549, 0.76810222, 0.77094179,
0.76904875, 0.76999527, 0.76857549, 0.77188831, 0.76668244,
0.76904875, 0.76762896, 0.76952201, 0.76810222, 0.76904875])
In [234]:
np.array(error_list).argmax()
Out[234]:
53
Re-running the KNN analysis with the most accurate setting, n_neighbors=54
In [238]:
knn = KNeighborsClassifier(n_neighbors=54)
In [239]:
knn.fit(X_train, y_train)
Out[239]:
KNeighborsClassifier(n_neighbors=54)
In [240]:
pred = knn.predict(X_test)
Accuracy rises to 78.04%, an improvement of about 2.23 percentage points
In [242]:
accuracy_score(y_test, pred)
Out[242]:
0.780407004259347
In [243]:
confusion_matrix(y_test, pred)
Out[243]:
array([[1332, 215],
[ 249, 317]], dtype=int64)
Differences between KNN and Logistic Regression
Logistic regression fundamentally assumes the variables are in a roughly linear relationship; it is fast, and because it reports rich information (coefficients) it communicates results well.
KNN assumes no particular relationship between the variables, and it is slower.
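To see the trade-off in code, the sketch below fits both models on synthetic data standing in for the scaled churn features (the dataset and any resulting numbers here are illustrative, not the post's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data in place of the churn features
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Logistic regression: assumes a linear decision boundary, trains fast,
# and its coefficients show which features drive the prediction
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("LogisticRegression:", accuracy_score(y_test, lr.predict(X_test)))

# KNN: no assumed functional form, but every prediction must search
# the stored training set, so prediction is slower
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print("KNN:", accuracy_score(y_test, knn.predict(X_test)))
```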