[sklearn] Predicting Customer Churn with the KNN (K-Nearest Neighbors) Algorithm
Machine Learning, Deep Learning
코딩하고분석하는돌스 2021. 1. 27. 23:14
Predicting Customer Churn with the KNN Algorithm
In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
KNN classifies a point by the group of its K nearest neighbors, so the result changes depending on how K is set.
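To make the effect of K concrete, here is a minimal sketch on toy 1-D data (the points and classes are illustrative, not from the churn dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: three class-0 points clustered low, two class-1 points high
X = np.array([[1.0], [1.5], [2.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1])

# With K=1 the query point 7.0 follows its single nearest neighbor (8.0, class 1);
# with K=5 every training point votes and the three class-0 points win the majority.
for k in (1, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"K={k} -> predicted class {knn.predict([[7.0]])[0]}")
```

The same query point flips class purely because K changed, which is why the notebook later sweeps K over a range and picks the best value.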
Loading the telecom customer data
In [136]:
data = pd.read_csv('churn.csv')
Use pd.set_option('display.max_columns', 100) to display up to 100 columns
In [244]:
pd.set_option('display.max_columns',100)
In [245]:
data.head(100)
Out[245]:
SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | InternetService_Fiber optic | InternetService_No | OnlineSecurity_No internet service | OnlineSecurity_Yes | OnlineBackup_No internet service | OnlineBackup_Yes | DeviceProtection_No internet service | DeviceProtection_Yes | TechSupport_No internet service | TechSupport_Yes | StreamingTV_No internet service | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 29.85 | 29.85 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 0 | 34 | 56.95 | 1889.50 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 2 | 53.85 | 108.15 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
3 | 0 | 45 | 42.30 | 1840.75 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 2 | 70.70 | 151.65 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 0 | 12 | 78.95 | 927.35 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
96 | 0 | 71 | 66.85 | 4748.70 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
97 | 0 | 5 | 21.05 | 113.85 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
98 | 0 | 52 | 21.00 | 1107.20 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
99 | 1 | 25 | 98.50 | 2514.50 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
100 rows × 31 columns
EDA: exploring the data
7,043 rows and 21 columns; most columns are of object (string) dtype.
In [138]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 customerID 7043 non-null object
1 gender 7043 non-null object
2 SeniorCitizen 7043 non-null int64
3 Partner 7043 non-null object
4 Dependents 7043 non-null object
5 tenure 7043 non-null int64
6 PhoneService 7043 non-null object
7 MultipleLines 7043 non-null object
8 InternetService 7043 non-null object
9 OnlineSecurity 7043 non-null object
10 OnlineBackup 7043 non-null object
11 DeviceProtection 7043 non-null object
12 TechSupport 7043 non-null object
13 StreamingTV 7043 non-null object
14 StreamingMovies 7043 non-null object
15 Contract 7043 non-null object
16 PaperlessBilling 7043 non-null object
17 PaymentMethod 7043 non-null object
18 MonthlyCharges 7043 non-null float64
19 TotalCharges 7043 non-null object
20 Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
TotalCharges should be numeric, so convert it with pd.to_numeric
In [139]:
pd.to_numeric(data['TotalCharges'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " "
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-139-ff40986ab5f9> in <module>
----> 1 pd.to_numeric(data['TotalCharges'])
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
152 coerce_numeric = errors not in ("ignore", "raise")
153 try:
--> 154 values = lib.maybe_convert_numeric(
155 values, set(), coerce_numeric=coerce_numeric
156 )
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string " " at position 488
Inspecting row 488, where the error occurred
In [140]:
data.iloc[488]
Out[140]:
customerID 4472-LVYGI
gender Female
SeniorCitizen 0
Partner Yes
Dependents Yes
tenure 0
PhoneService No
MultipleLines No phone service
InternetService DSL
OnlineSecurity Yes
OnlineBackup No
DeviceProtection Yes
TechSupport Yes
StreamingTV Yes
StreamingMovies No
Contract Two year
PaperlessBilling Yes
PaymentMethod Bank transfer (automatic)
MonthlyCharges 52.55
TotalCharges
Churn No
Name: 488, dtype: object
Row 488's TotalCharges value is a blank string, so replace it with NaN
In [143]:
data['TotalCharges'] = data['TotalCharges'].replace(" ", np.nan)
Converting to numeric again
In [144]:
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'])
In [145]:
data.describe()
Out[145]:
SeniorCitizen | tenure | MonthlyCharges | TotalCharges | |
---|---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 | 7032.000000 |
mean | 0.162147 | 32.371149 | 64.761692 | 2283.300441 |
std | 0.368612 | 24.559481 | 30.090047 | 2266.771362 |
min | 0.000000 | 0.000000 | 18.250000 | 18.800000 |
25% | 0.000000 | 9.000000 | 35.500000 | 401.450000 |
50% | 0.000000 | 29.000000 | 70.350000 | 1397.475000 |
75% | 0.000000 | 55.000000 | 89.850000 | 3794.737500 |
max | 1.000000 | 72.000000 | 118.750000 | 8684.800000 |
In [146]:
sns.distplot(data['TotalCharges'])
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[146]:
<AxesSubplot:xlabel='TotalCharges', ylabel='Density'>
Converting nominal (categorical) columns to numeric form
Converting the gender column from male/female to 0/1
In [148]:
data['gender'].nunique()
Out[148]:
2
pd.get_dummies creates 0/1 columns for male and female; drop_first=True drops one of the two redundant columns
In [150]:
pd.get_dummies(data, columns=['gender'], drop_first=True)
Out[150]:
customerID | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | gender_Male | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | 0 |
1 | 5575-GNVDE | 0 | No | No | 34 | Yes | No | DSL | Yes | No | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No | 1 |
2 | 3668-QPYBK | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 1 |
3 | 7795-CFOCW | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | 1 |
4 | 9237-HQITU | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | 6840-RESVB | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | ... | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No | 1 |
7039 | 2234-XADUH | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | ... | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No | 0 |
7040 | 4801-JZAZL | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No | 0 |
7041 | 8361-LTMKD | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | ... | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes | 1 |
7042 | 3186-AJIEK | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | ... | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No | 1 |
7043 rows × 21 columns
Filtering by dtype to extract the list of columns to convert to 0/1
In [151]:
data['gender'].dtype == 'O'
Out[151]:
True
In [152]:
data.columns
Out[152]:
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
dtype='object')
In [153]:
for i in data.columns:
print(data[i].dtype)
object
object
int64
object
object
int64
object
object
object
object
object
object
object
object
object
object
object
object
float64
float64
object
In [154]:
col_list = []
for i in data.columns:
if data[i].dtype =='O':
col_list.append(i)
In [155]:
col_list
Out[155]:
['customerID',
'gender',
'Partner',
'Dependents',
'PhoneService',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies',
'Contract',
'PaperlessBilling',
'PaymentMethod',
'Churn']
In [156]:
for i in col_list:
print(i, data[i].nunique())
customerID 7043
gender 2
Partner 2
Dependents 2
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
Churn 2
In [157]:
col_list = col_list[1:]
col_list
Out[157]:
['gender',
'Partner',
'Dependents',
'PhoneService',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies',
'Contract',
'PaperlessBilling',
'PaymentMethod',
'Churn']
Using col_list to convert multiple columns to 0/1 at once
In [158]:
data = pd.get_dummies(data, columns=col_list, drop_first=True)
In [159]:
data
Out[159]:
customerID | SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | ... | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | 0 | 1 | 29.85 | 29.85 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
1 | 5575-GNVDE | 0 | 34 | 56.95 | 1889.50 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 3668-QPYBK | 0 | 2 | 53.85 | 108.15 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
3 | 7795-CFOCW | 0 | 45 | 42.30 | 1840.75 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 9237-HQITU | 0 | 2 | 70.70 | 151.65 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | 6840-RESVB | 0 | 24 | 84.80 | 1990.50 | 1 | 1 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
7039 | 2234-XADUH | 0 | 72 | 103.20 | 7362.90 | 0 | 1 | 1 | 1 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
7040 | 4801-JZAZL | 0 | 11 | 29.60 | 346.45 | 0 | 1 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
7041 | 8361-LTMKD | 1 | 4 | 74.40 | 306.60 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
7042 | 3186-AJIEK | 0 | 66 | 105.65 | 6844.50 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
7043 rows × 32 columns
Checking and handling missing values
In [160]:
data.isna().sum()
Out[160]:
customerID 0
SeniorCitizen 0
tenure 0
MonthlyCharges 0
TotalCharges 11
gender_Male 0
Partner_Yes 0
Dependents_Yes 0
PhoneService_Yes 0
MultipleLines_No phone service 0
MultipleLines_Yes 0
InternetService_Fiber optic 0
InternetService_No 0
OnlineSecurity_No internet service 0
OnlineSecurity_Yes 0
OnlineBackup_No internet service 0
OnlineBackup_Yes 0
DeviceProtection_No internet service 0
DeviceProtection_Yes 0
TechSupport_No internet service 0
TechSupport_Yes 0
StreamingTV_No internet service 0
StreamingTV_Yes 0
StreamingMovies_No internet service 0
StreamingMovies_Yes 0
Contract_One year 0
Contract_Two year 0
PaperlessBilling_Yes 0
PaymentMethod_Credit card (automatic) 0
PaymentMethod_Electronic check 0
PaymentMethod_Mailed check 0
Churn_Yes 0
dtype: int64
In [161]:
data['TotalCharges'].mean()
Out[161]:
2283.3004408418656
In [162]:
data['TotalCharges'].median()
Out[162]:
1397.475
In [163]:
sns.distplot(data['TotalCharges'])
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\seaborn\distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Out[163]:
<AxesSubplot:xlabel='TotalCharges', ylabel='Density'>
In [164]:
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
Types of feature scaling and their characteristics
Different scales?
When variables differ widely in scale, they must be rescaled before they can be compared
Standard Scaler: standardizes values using the mean and standard deviation (zero mean, unit variance)
Robust Scaler: scales using the 25% and 75% quantiles, so it is less affected by outliers
Min-Max Scaler: maps values to [0, 1] while preserving the shape of the distribution, which is why it is used here
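To see the contrast, here is a quick sketch of the three scalers applied to one toy column containing an outlier (illustrative values, not the churn data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# A single feature with one large outlier at 100
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X).ravel()
    print(f"{type(scaler).__name__:14s}", np.round(scaled, 2))
```

MinMaxScaler squeezes the non-outlier points near 0 while keeping their relative spacing; RobustScaler keeps them spread out because the quantile-based statistics ignore the outlier.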
Importing MinMaxScaler from sklearn.preprocessing and scaling the data
In [165]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
In [166]:
minmax = MinMaxScaler()
Dropping customerID
In [167]:
minmax.fit(data)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-167-4b9c9ec5942c> in <module>
----> 1 minmax.fit(data)
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\preprocessing\_data.py in fit(self, X, y)
361 # Reset internal state before fitting
362 self._reset()
--> 363 return self.partial_fit(X, y)
364
365 def partial_fit(self, X, y=None):
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\preprocessing\_data.py in partial_fit(self, X, y)
394
395 first_pass = not hasattr(self, 'n_samples_seen_')
--> 396 X = self._validate_data(X, reset=first_pass,
397 estimator=self, dtype=FLOAT_DTYPES,
398 force_all_finite="allow-nan")
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
419 out = X
420 elif isinstance(y, str) and y == 'no_validation':
--> 421 X = check_array(X, **check_params)
422 out = X
423 else:
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
614 array = array.astype(dtype, casting="unsafe", copy=False)
615 else:
--> 616 array = np.asarray(array, order=order, dtype=dtype)
617 except ComplexWarning as complex_warning:
618 raise ValueError("Complex data not supported\n"
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\pandas\core\generic.py in __array__(self, dtype)
1894
1895 def __array__(self, dtype=None) -> np.ndarray:
-> 1896 return np.asarray(self._values, dtype=dtype)
1897
1898 def __array_wrap__(
d:\ProgramData\Anaconda3\envs\mdai\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
ValueError: could not convert string to float: '7590-VHVEG'
In [168]:
data.drop('customerID', axis=1, inplace=True)
In [169]:
minmax.fit(data)
Out[169]:
MinMaxScaler()
In [170]:
scaled_data = minmax.transform(data)
In [171]:
data.columns
Out[171]:
Index(['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges',
'gender_Male', 'Partner_Yes', 'Dependents_Yes', 'PhoneService_Yes',
'MultipleLines_No phone service', 'MultipleLines_Yes',
'InternetService_Fiber optic', 'InternetService_No',
'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
'OnlineBackup_No internet service', 'OnlineBackup_Yes',
'DeviceProtection_No internet service', 'DeviceProtection_Yes',
'TechSupport_No internet service', 'TechSupport_Yes',
'StreamingTV_No internet service', 'StreamingTV_Yes',
'StreamingMovies_No internet service', 'StreamingMovies_Yes',
'Contract_One year', 'Contract_Two year', 'PaperlessBilling_Yes',
'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check',
'Churn_Yes'],
dtype='object')
In [173]:
scaled_data = pd.DataFrame(scaled_data, columns=data.columns)
In [174]:
scaled_data
Out[174]:
SeniorCitizen | tenure | MonthlyCharges | TotalCharges | gender_Male | Partner_Yes | Dependents_Yes | PhoneService_Yes | MultipleLines_No phone service | MultipleLines_Yes | ... | StreamingTV_Yes | StreamingMovies_No internet service | StreamingMovies_Yes | Contract_One year | Contract_Two year | PaperlessBilling_Yes | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | Churn_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.013889 | 0.115423 | 0.001275 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.472222 | 0.385075 | 0.215867 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.027778 | 0.354229 | 0.010310 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
3 | 0.0 | 0.625000 | 0.239303 | 0.210241 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.027778 | 0.521891 | 0.015330 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7038 | 0.0 | 0.333333 | 0.662189 | 0.227521 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
7039 | 0.0 | 1.000000 | 0.845274 | 0.847461 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
7040 | 0.0 | 0.152778 | 0.112935 | 0.037809 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
7041 | 1.0 | 0.055556 | 0.558706 | 0.033210 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
7042 | 0.0 | 0.916667 | 0.869652 | 0.787641 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
7043 rows × 31 columns
Splitting into training and test sets with train_test_split
In [178]:
from sklearn.model_selection import train_test_split
In [179]:
X = scaled_data.drop('Churn_Yes', axis=1)
y = scaled_data['Churn_Yes']
In [192]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)
Importing the KNN module (from sklearn.neighbors import KNeighborsClassifier) and fitting the model
In [196]:
from sklearn.neighbors import KNeighborsClassifier
In [200]:
knn = KNeighborsClassifier(n_neighbors=10)
In [201]:
knn.fit(X_train, y_train)
Out[201]:
KNeighborsClassifier(n_neighbors=10)
In [204]:
pred = knn.predict(X_test)
In [207]:
pd.DataFrame({'actual value':y_test, 'pred_value':pred}).head(30)
Out[207]:
actual value | pred_value | |
---|---|---|
4880 | 0.0 | 0.0 |
1541 | 0.0 | 0.0 |
1289 | 0.0 | 0.0 |
5745 | 0.0 | 0.0 |
4873 | 0.0 | 0.0 |
4168 | 0.0 | 0.0 |
1557 | 0.0 | 0.0 |
2892 | 0.0 | 0.0 |
664 | 0.0 | 0.0 |
1588 | 0.0 | 0.0 |
1338 | 1.0 | 0.0 |
6000 | 0.0 | 0.0 |
2310 | 0.0 | 0.0 |
3294 | 1.0 | 1.0 |
290 | 1.0 | 1.0 |
2505 | 0.0 | 0.0 |
3171 | 0.0 | 0.0 |
1366 | 1.0 | 1.0 |
6560 | 0.0 | 0.0 |
2420 | 0.0 | 0.0 |
5210 | 1.0 | 0.0 |
2836 | 0.0 | 1.0 |
1325 | 1.0 | 1.0 |
4900 | 1.0 | 0.0 |
6311 | 0.0 | 0.0 |
1025 | 0.0 | 0.0 |
2031 | 0.0 | 0.0 |
4459 | 1.0 | 1.0 |
5324 | 0.0 | 0.0 |
3441 | 0.0 | 0.0 |
In [213]:
from sklearn.metrics import accuracy_score, confusion_matrix
from tqdm import tqdm
Accuracy: 75.81%
In [210]:
accuracy_score(y_test, pred)
Out[210]:
0.7581637482252721
FP = 194, FN = 317
In [211]:
confusion_matrix(y_test, pred)
Out[211]:
array([[1353, 194],
[ 317, 249]], dtype=int64)
In [212]:
error_list = []
In [215]:
for i in tqdm(range(1,101)):
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
error_list.append(accuracy_score(y_test, pred))
100%|█████████████████████████████████████████████████████| 100/100 [00:30<00:00, 3.33it/s]
In [217]:
len(error_list)
Out[217]:
100
Checking how accuracy changes with k using a lineplot
In [228]:
plt.figure(figsize=(20,10))
sns.lineplot(x=range(1, 101), y=error_list, marker='o', markersize=10, markerfacecolor='red')
Out[228]:
<AxesSubplot:>
Finding the highest accuracy with max()
In [230]:
max(error_list)
Out[230]:
0.780407004259347
Passing max(error_list) to .index to find its position: the 54th value (index 53)
In [247]:
error_list.index(max(error_list))
Out[247]:
53
Converting the list to a NumPy array and calling .argmax() is an easier way to find the index
In [233]:
np.array(error_list)
Out[233]:
array([0.71367724, 0.74917179, 0.74396593, 0.75579744, 0.74822527,
0.75816375, 0.74585897, 0.75911027, 0.75295788, 0.75816375,
0.7624231 , 0.76100331, 0.76005679, 0.7671557 , 0.76620918,
0.76810222, 0.76904875, 0.77283483, 0.76810222, 0.77236157,
0.76810222, 0.77141505, 0.76999527, 0.76526266, 0.76762896,
0.77141505, 0.77330809, 0.77283483, 0.77330809, 0.77141505,
0.76999527, 0.76762896, 0.77094179, 0.77046853, 0.77425461,
0.77378135, 0.77330809, 0.77520114, 0.77188831, 0.77378135,
0.77236157, 0.77756744, 0.77662092, 0.77472788, 0.7756744 ,
0.77662092, 0.77756744, 0.77756744, 0.77425461, 0.77094179,
0.77141505, 0.7756744 , 0.77520114, 0.780407 , 0.77425461,
0.77709418, 0.77614766, 0.77520114, 0.77472788, 0.76999527,
0.77046853, 0.76952201, 0.77094179, 0.7671557 , 0.76904875,
0.76810222, 0.77094179, 0.77283483, 0.77094179, 0.76952201,
0.77425461, 0.77141505, 0.77094179, 0.77188831, 0.76904875,
0.77236157, 0.77188831, 0.77425461, 0.77188831, 0.77236157,
0.77330809, 0.76904875, 0.77094179, 0.77141505, 0.76857549,
0.76526266, 0.76952201, 0.76857549, 0.76810222, 0.77094179,
0.76904875, 0.76999527, 0.76857549, 0.77188831, 0.76668244,
0.76904875, 0.76762896, 0.76952201, 0.76810222, 0.76904875])
In [234]:
np.array(error_list).argmax()
Out[234]:
53
Re-running the KNN analysis with the most accurate setting, n_neighbors=54
In [238]:
knn = KNeighborsClassifier(n_neighbors=54)
In [239]:
knn.fit(X_train, y_train)
Out[239]:
KNeighborsClassifier(n_neighbors=54)
In [240]:
pred = knn.predict(X_test)
Accuracy rises to 78.04%, an improvement of about 2.23 percentage points
In [242]:
accuracy_score(y_test, pred)
Out[242]:
0.780407004259347
In [243]:
confusion_matrix(y_test, pred)
Out[243]:
array([[1332, 215],
[ 249, 317]], dtype=int64)
Differences between KNN and Logistic Regression
Logistic regression fundamentally assumes the variables are in a roughly linear relationship; it is fast, and because it reports rich information (coefficients) it communicates results well.
KNN assumes no particular relationship between the variables, and it is slower.
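To see the trade-off in code, the sketch below fits both models on synthetic data standing in for the scaled churn features (the dataset and any resulting numbers here are illustrative, not the post's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data in place of the churn features
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Logistic regression: assumes a linear decision boundary, trains fast,
# and its coefficients show which features drive the prediction
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("LogisticRegression:", accuracy_score(y_test, lr.predict(X_test)))

# KNN: no assumed functional form, but every prediction must search
# the stored training set, so prediction is slower
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print("KNN:", accuracy_score(y_test, knn.predict(X_test)))
```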