Machine Learning, Deep Learning

[sklearn, NLP] Product Review Analysis: NLP, CountVectorizer, Naive Bayes Classifier

코딩하고분석하는돌스 2021. 1. 24. 00:09

NLP Product Review Analysis
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
data = pd.read_csv('./09. 상품 리뷰 분석(NLP)/yelp.csv', index_col=0)
In [4]:
data.head()
Out[4]:
review_id | user_id | business_id | stars | date | text | useful | funny | cool | |
---|---|---|---|---|---|---|---|---|---|
2967245 | aMleVK0lQcOSNCs56_gSbg | miHaLnLanDKfZqZHet0uWw | Xp_cWXY5rxDLkX-wqUg-iQ | 5 | 2015-09-30 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | Hs1f--t9JnVKW9A1U2uhKA | r_RUQSGZcd5bSgmTcS5IfQ | NuGZD3yBVqzpY1HuzT26mQ | 5 | 2015-06-04 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | i7aiPgNrNaFoM8J_j2OSyQ | zz7lojg6QdZbKFCJiHsj7w | ii8sAGBexBOJoYRFafF9XQ | 1 | 2016-07-03 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | uft6iMwNQh4I2UDpmbXggA | p_oXN3L9oi8nmmJigf8c9Q | r0j4IpUbcdC1-HfoMYae4w | 5 | 2016-10-15 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | y9QmJ16mrfBZS6Td6Yqo0g | jovtGPaHAqP6XfG9BFwY7A | j6UwIfXrSkGTdVkRu7K6WA | 5 | 2017-03-14 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 2967245 to 838267
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_id 10000 non-null object
1 user_id 10000 non-null object
2 business_id 10000 non-null object
3 stars 10000 non-null int64
4 date 10000 non-null object
5 text 10000 non-null object
6 useful 10000 non-null int64
7 funny 10000 non-null int64
8 cool 10000 non-null int64
dtypes: int64(4), object(5)
memory usage: 781.2+ KB
In [6]:
data.describe()
Out[6]:
stars | useful | funny | cool | |
---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 4.012800 | 1.498800 | 0.464200 | 0.542500 |
std | 1.724684 | 6.339355 | 1.926523 | 2.010273 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 5.000000 | 2.000000 | 0.000000 | 0.000000 |
max | 5.000000 | 533.000000 | 83.000000 | 97.000000 |
In [10]:
data.drop(['review_id', 'user_id', 'business_id', 'date'], axis=1, inplace=True)  # drop ID and date columns that carry no sentiment signal
In [11]:
data.head()
Out[11]:
stars | text | useful | funny | cool | |
---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [12]:
len(data.iloc[0]['text'])
Out[12]:
347
In [14]:
data['text_length'] = data['text'].apply(len)
In [15]:
data.head()
Out[15]:
stars | text | useful | funny | cool | text_length | |
---|---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 | 347 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 | 377 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 | 663 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 | 141 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 | 455 |
In [16]:
data['stars'].value_counts()
Out[16]:
5    7532
1    2468
Name: stars, dtype: int64

The dataset contains only 1-star and 5-star reviews, so the task is effectively binary sentiment classification.
In [17]:
sns.countplot(x='stars', data=data)  # pass x as a keyword; bare positional arguments are deprecated in seaborn
Out[17]:
<AxesSubplot:xlabel='stars', ylabel='count'>
In [18]:
sns.distplot(data['text_length'])  # note: distplot is deprecated in newer seaborn; histplot()/displot() are the replacements
Out[18]:
<AxesSubplot:xlabel='text_length', ylabel='Density'>
In [19]:
data.corr()
Out[19]:
stars | useful | funny | cool | text_length | |
---|---|---|---|---|---|
stars | 1.000000 | -0.098825 | -0.089860 | 0.060101 | -0.221752 |
useful | -0.098825 | 1.000000 | 0.656630 | 0.525962 | 0.161592 |
funny | -0.089860 | 0.656630 | 1.000000 | 0.741797 | 0.215003 |
cool | 0.060101 | 0.525962 | 0.741797 | 1.000000 | 0.193500 |
text_length | -0.221752 | 0.161592 | 0.215003 | 0.193500 | 1.000000 |
In [21]:
sns.heatmap(data.corr(), cmap = 'coolwarm')
Out[21]:
<AxesSubplot:>
In [22]:
data['text']
Out[22]:
2967245 LOVE the cheeses here. They are worth the pri...
4773684 This has become our go-to sushi place. The sus...
1139855 I was very disappointed with the hotel. The re...
3997153 Love this place - super amazing - staff here i...
4262000 Thank you Dana!!!! Having dyed my hair black p...
...
1567641 I'm a sucker for places like this. Get me in f...
4910763 Extremely rude staff! Was told 4 min on a lar...
1036315 I live in NYC and went to the RTR here in the ...
555962 If you are looking for a trainer, then look no...
838267 Awesome food. Awesome beer. Awesome service. N...
Name: text, Length: 10000, dtype: object
In [23]:
import string
In [24]:
string.punctuation
Out[24]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [25]:
s = data.iloc[0]['text']
In [26]:
s
Out[26]:
'LOVE the cheeses here. They are worth the price. Great for finding treats for a special dinner or picnic. Nice on sample days. Yum!!! Top quality meats. Nice selection of non brand frozen veggies. Veggie chips are mega tasty. Always quick and friendly check out. Produce not as stellar as it once was, but also not finding better in Madison.'
In [33]:
def remove_punc(x):
    # keep only characters that are not ASCII punctuation
    new_s = []
    for i in x:
        if i not in string.punctuation:
            new_s.append(i)
    return ''.join(new_s)
In [35]:
remove_punc(s)  # try it on the sample review first
Out[35]:
'LOVE the cheeses here They are worth the price Great for finding treats for a special dinner or picnic Nice on sample days Yum Top quality meats Nice selection of non brand frozen veggies Veggie chips are mega tasty Always quick and friendly check out Produce not as stellar as it once was but also not finding better in Madison'
In [37]:
''.join([i for i in s if i not in string.punctuation])
Out[37]:
'LOVE the cheeses here They are worth the price Great for finding treats for a special dinner or picnic Nice on sample days Yum Top quality meats Nice selection of non brand frozen veggies Veggie chips are mega tasty Always quick and friendly check out Produce not as stellar as it once was but also not finding better in Madison'
In [40]:
data['text'] = data['text'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
In [41]:
data['text']
Out[41]:
2967245 LOVE the cheeses here They are worth the pric...
4773684 This has become our goto sushi place The sushi...
1139855 I was very disappointed with the hotel The res...
3997153 Love this place super amazing staff here is ...
4262000 Thank you Dana Having dyed my hair black previ...
...
1567641 Im a sucker for places like this Get me in fro...
4910763 Extremely rude staff Was told 4 min on a larg...
1036315 I live in NYC and went to the RTR here in the ...
555962 If you are looking for a trainer then look no ...
838267 Awesome food Awesome beer Awesome service Need...
Name: text, Length: 10000, dtype: object
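A quicker route to the same cleanup is `str.translate`, which deletes punctuation in C rather than in a per-character Python loop. This is an alternative sketch, not a step from the original notebook:

```python
# Build a translation table that maps every punctuation character to None.
table = str.maketrans('', '', string.punctuation)

# Same result as the list comprehension above, but the loop runs in C:
s.translate(table)

# Vectorised over the whole column via the pandas string accessor:
# data['text'] = data['text'].str.translate(table)
```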
In [42]:
from nltk.corpus import stopwords
In [139]:
# stopwords.words('english')
In [85]:
#s.split(' ')
In [45]:
s.lower()
Out[45]:
'love the cheeses here. they are worth the price. great for finding treats for a special dinner or picnic. nice on sample days. yum!!! top quality meats. nice selection of non brand frozen veggies. veggie chips are mega tasty. always quick and friendly check out. produce not as stellar as it once was, but also not finding better in madison.'
In [54]:
def stop_w(x):
    # lower-case each token and drop English stop words
    new_s = []
    for i in x.split():
        if i.lower() not in stopwords.words('english'):
            new_s.append(i.lower())
    return new_s
In [55]:
data['text'].apply(stop_w)
Out[55]:
2967245 [love, cheeses, worth, price, great, finding, ...
4773684 [become, goto, sushi, place, sushi, always, fr...
1139855 [disappointed, hotel, restaurants, good, booke...
3997153 [love, place, super, amazing, staff, always, f...
4262000 [thank, dana, dyed, hair, black, previously, k...
...
1567641 [im, sucker, places, like, get, front, meat, c...
4910763 [extremely, rude, staff, told, 4, min, large, ...
1036315 [live, nyc, went, rtr, flatiron, didnt, select...
555962 [looking, trainer, look, moment, humberto, met...
838267 [awesome, food, awesome, beer, awesome, servic...
Name: text, Length: 10000, dtype: object
In [60]:
data['text'] = data['text'].apply(lambda x: [i.lower() for i in x.split() if i.lower() not in stopwords.words('english')])
In [61]:
data.head()
Out[61]:
stars | text | useful | funny | cool | text_length | |
---|---|---|---|---|---|---|
2967245 | 5 | [love, cheeses, worth, price, great, finding, ... | 0 | 0 | 1 | 347 |
4773684 | 5 | [become, goto, sushi, place, sushi, always, fr... | 0 | 0 | 0 | 377 |
1139855 | 1 | [disappointed, hotel, restaurants, good, booke... | 2 | 1 | 1 | 663 |
3997153 | 5 | [love, place, super, amazing, staff, always, f... | 0 | 0 | 0 | 141 |
4262000 | 5 | [thank, dana, dyed, hair, black, previously, k... | 0 | 0 | 0 | 455 |
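A performance note: `stopwords.words('english')` is re-evaluated for every token inside the loop and the lambda above. Hoisting it into a `set` once makes each membership test O(1); a minimal sketch of the equivalent transformation:

```python
# Build the stop-word set once instead of once per token.
stop_set = set(stopwords.words('english'))

def tokenize(text):
    """Lower-case, split on whitespace, and drop English stop words."""
    return [w.lower() for w in text.split() if w.lower() not in stop_set]

# Equivalent to the .apply(lambda ...) above, but far faster:
# data['text'] = data['text'].apply(tokenize)
```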
In [66]:
word_split = []
for i in range(len(data)):
    for j in data.iloc[i]['text']:
        word_split.append(j)
In [68]:
len(word_split)
Out[68]:
542773
In [70]:
from nltk.probability import FreqDist
In [72]:
plt.figure(figsize=(20,10))
FreqDist(word_split).plot(30)
Out[72]:
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
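For a non-graphical view of the same counts, `collections.Counter` from the standard library yields the top-N list directly; a small sketch:

```python
from collections import Counter

# The same frequencies FreqDist plots, as (word, count) pairs.
top30 = Counter(word_split).most_common(30)
print(top30[:5])
```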
In [73]:
from wordcloud import WordCloud
In [76]:
wc = WordCloud().generate(str(data['text']))
plt.figure(figsize=(10,5))
plt.imshow(wc)
plt.axis('off')
Out[76]:
(-0.5, 399.5, 199.5, -0.5)
In [77]:
data['stars'].value_counts()
Out[77]:
5 7532
1 2468
Name: stars, dtype: int64
In [82]:
good = data[data['stars'] == 5]['text']
In [80]:
bad = data[data['stars'] == 1]['text']
In [83]:
wc = WordCloud().generate(str(good))
plt.figure(figsize=(10,5))
plt.imshow(wc)
plt.axis('off')
Out[83]:
(-0.5, 399.5, 199.5, -0.5)
In [84]:
wc = WordCloud().generate(str(bad))
plt.figure(figsize=(10,5))
plt.imshow(wc)
plt.axis('off')
Out[84]:
(-0.5, 399.5, 199.5, -0.5)
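A caveat on the clouds above: `str(good)` stringifies the pandas Series repr, which is truncated to a handful of rows, so each cloud is drawn from a small sample of reviews plus repr artefacts such as index numbers. A sketch that feeds in every token of the 5-star reviews instead:

```python
# Flatten the token lists of all 5-star reviews into one string,
# so the cloud reflects the full corpus rather than the truncated repr.
good_text = ' '.join(word for tokens in good for word in tokens)

wc = WordCloud().generate(good_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc)
plt.axis('off')
```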
Sentiment Prediction Model
In [88]:
data2 = pd.read_csv('./09. 상품 리뷰 분석(NLP)/yelp.csv', index_col=0)  # re-read the raw text; CountVectorizer does its own tokenising
In [93]:
X = data2['text']
y = data2['stars']
In [90]:
from sklearn.feature_extraction.text import CountVectorizer
In [94]:
cv = CountVectorizer()
In [95]:
cv.fit(X)
Out[95]:
CountVectorizer()
In [96]:
X = cv.transform(X)  # bag-of-words counts as a sparse matrix
In [97]:
X
Out[97]:
<10000x28253 sparse matrix of type '<class 'numpy.int64'>'
with 679150 stored elements in Compressed Sparse Row format>
In [140]:
# print(X)
In [100]:
cv.get_feature_names()[28048]  # renamed to get_feature_names_out() in scikit-learn 1.0+
Out[100]:
'yourself'
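To make the vectoriser's behaviour concrete (lower-casing, tokenising, counting), here is an illustrative two-document toy corpus, not part of the original notebook:

```python
toy = ['The cheese is great', 'great cheese, great price']
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy)

print(toy_cv.get_feature_names())  # ['cheese', 'great', 'is', 'price', 'the']
print(toy_X.toarray())
# [[1 1 1 0 1]
#  [1 2 0 1 0]]
```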
In [101]:
from sklearn.model_selection import train_test_split
In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
In [104]:
from sklearn.naive_bayes import MultinomialNB
In [105]:
model = MultinomialNB()
In [106]:
model.fit(X_train, y_train)
Out[106]:
MultinomialNB()
In [107]:
pred = model.predict(X_test)
In [108]:
pred
Out[108]:
array([5, 5, 5, ..., 1, 5, 1], dtype=int64)
In [109]:
y_test
Out[109]:
1373705 5
3128713 5
212088 1
1622136 5
2380124 5
..
3548316 5
38943 5
2423674 1
1564863 5
3629333 1
Name: stars, Length: 2000, dtype: int64
In [110]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Naive Bayes accuracy: 92.65%
In [111]:
accuracy_score(y_test, pred)
Out[111]:
0.9265
In [113]:
confusion_matrix(y_test, pred)
Out[113]:
array([[ 421, 65],
[ 82, 1432]], dtype=int64)
In [114]:
print(classification_report(y_test, pred))
precision recall f1-score support
1 0.84 0.87 0.85 486
5 0.96 0.95 0.95 1514
accuracy 0.93 2000
macro avg 0.90 0.91 0.90 2000
weighted avg 0.93 0.93 0.93 2000
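One methodological note: the vectoriser was fitted on all 10,000 reviews before the split, so test-set words leak into the vocabulary. Wrapping both steps in a scikit-learn Pipeline fits the vocabulary on the training fold only; a sketch under that assumption (the `nb_pipe` and `Xr_*` names are mine):

```python
from sklearn.pipeline import make_pipeline

# Split the raw text first, then let the pipeline fit the vocabulary
# on the training fold only.
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    data2['text'], data2['stars'], test_size=0.2, random_state=100)

nb_pipe = make_pipeline(CountVectorizer(), MultinomialNB())
nb_pipe.fit(Xr_train, yr_train)
print(nb_pipe.score(Xr_test, yr_test))  # should land close to the 92.65% above
```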
Comparison with Random Forest
In [116]:
from sklearn.ensemble import RandomForestClassifier
In [118]:
rf = RandomForestClassifier(max_depth=10)
In [121]:
rf.fit(X_train, y_train)
Out[121]:
RandomForestClassifier(max_depth=10)
In [122]:
pred2 = rf.predict(X_test)
In [124]:
accuracy_score(y_test, pred2)
Out[124]:
0.787
In [131]:
confusion_matrix(y_test, pred2)
Out[131]:
array([[ 64, 422],
[ 4, 1510]], dtype=int64)
Re-run with more trees (n_estimators=1000; max_depth unchanged at 10)
In [132]:
rf = RandomForestClassifier(max_depth=10, n_estimators=1000)
In [133]:
rf.fit(X_train, y_train)
Out[133]:
RandomForestClassifier(max_depth=10, n_estimators=1000)
In [135]:
pred3 = rf.predict(X_test)
Random Forest accuracy 78.45%: lower than Naive Bayes at 92.65%

With roughly 28,000 sparse bag-of-words features, trees capped at depth 10 can each consult only a tiny slice of the vocabulary, whereas multinomial Naive Bayes uses every word's count; the confusion matrix below shows the forest predicting 5 stars almost across the board.
In [136]:
accuracy_score(y_test, pred3)
Out[136]:
0.7845
In [137]:
confusion_matrix(y_test, pred3)
Out[137]:
array([[ 59, 427],
[ 4, 1510]], dtype=int64)
In [138]:
print(classification_report(y_test, pred3))
precision recall f1-score support
1 0.94 0.12 0.21 486
5 0.78 1.00 0.88 1514
accuracy 0.78 2000
macro avg 0.86 0.56 0.55 2000
weighted avg 0.82 0.78 0.71 2000
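As a closing sanity check, the fitted vectoriser and Naive Bayes model can score unseen text; the review below is invented for illustration:

```python
# A hypothetical new review; cv and model were fitted earlier in the notebook.
new_review = ['The staff was rude and the food was cold. Never coming back.']
new_X = cv.transform(new_review)  # reuse the fitted vocabulary, no re-fitting
print(model.predict(new_X))       # a clearly negative review should map to 1
```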