Machine Learning, Deep Learning

[sklearn, NLP] Product Review Analysis: NLP, CountVectorizer, Naive Bayes Classifier

코딩하고분석하는돌스 2021. 1. 24. 00:09

NLP Product Review Analysis
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [3]:
data = pd.read_csv('./09. 상품 리뷰 분석(NLP)/yelp.csv', index_col=0)
In [4]:
data.head()
Out[4]:
review_id | user_id | business_id | stars | date | text | useful | funny | cool | |
---|---|---|---|---|---|---|---|---|---|
2967245 | aMleVK0lQcOSNCs56_gSbg | miHaLnLanDKfZqZHet0uWw | Xp_cWXY5rxDLkX-wqUg-iQ | 5 | 2015-09-30 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | Hs1f--t9JnVKW9A1U2uhKA | r_RUQSGZcd5bSgmTcS5IfQ | NuGZD3yBVqzpY1HuzT26mQ | 5 | 2015-06-04 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | i7aiPgNrNaFoM8J_j2OSyQ | zz7lojg6QdZbKFCJiHsj7w | ii8sAGBexBOJoYRFafF9XQ | 1 | 2016-07-03 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | uft6iMwNQh4I2UDpmbXggA | p_oXN3L9oi8nmmJigf8c9Q | r0j4IpUbcdC1-HfoMYae4w | 5 | 2016-10-15 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | y9QmJ16mrfBZS6Td6Yqo0g | jovtGPaHAqP6XfG9BFwY7A | j6UwIfXrSkGTdVkRu7K6WA | 5 | 2017-03-14 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 2967245 to 838267
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review_id 10000 non-null object
1 user_id 10000 non-null object
2 business_id 10000 non-null object
3 stars 10000 non-null int64
4 date 10000 non-null object
5 text 10000 non-null object
6 useful 10000 non-null int64
7 funny 10000 non-null int64
8 cool 10000 non-null int64
dtypes: int64(4), object(5)
memory usage: 781.2+ KB
In [6]:
data.describe()
Out[6]:
stars | useful | funny | cool | |
---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 4.012800 | 1.498800 | 0.464200 | 0.542500 |
std | 1.724684 | 6.339355 | 1.926523 | 2.010273 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 5.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 5.000000 | 2.000000 | 0.000000 | 0.000000 |
max | 5.000000 | 533.000000 | 83.000000 | 97.000000 |
In [10]:
data.drop(['review_id', 'user_id', 'business_id', 'date'], axis=1, inplace=True)  # drop ID and date columns that carry no sentiment signal
In [11]:
data.head()
Out[11]:
stars | text | useful | funny | cool | |
---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 |
In [12]:
len(data.iloc[0]['text'])
Out[12]:
347
In [14]:
data['text_length'] = data['text'].apply(len)
In [15]:
data.head()
Out[15]:
stars | text | useful | funny | cool | text_length | |
---|---|---|---|---|---|---|
2967245 | 5 | LOVE the cheeses here. They are worth the pri... | 0 | 0 | 1 | 347 |
4773684 | 5 | This has become our go-to sushi place. The sus... | 0 | 0 | 0 | 377 |
1139855 | 1 | I was very disappointed with the hotel. The re... | 2 | 1 | 1 | 663 |
3997153 | 5 | Love this place - super amazing - staff here i... | 0 | 0 | 0 | 141 |
4262000 | 5 | Thank you Dana!!!! Having dyed my hair black p... | 0 | 0 | 0 | 455 |
In [16]:
data['stars'].value_counts()
Out[16]:
5    7532
1    2468
Name: stars, dtype: int64

The dataset contains only 1-star and 5-star reviews, so the task is effectively binary sentiment classification.
In [17]:
sns.countplot(x='stars', data=data)  # pass x as a keyword; bare positional arguments are deprecated in seaborn
Out[17]:
<AxesSubplot:xlabel='stars', ylabel='count'>
In [18]:
sns.distplot(data['text_length'])  # note: distplot is deprecated in newer seaborn; histplot()/displot() are the replacements
Out[18]:
<AxesSubplot:xlabel='text_length', ylabel='Density'>
In [19]:
data.corr()
Out[19]:
stars | useful | funny | cool | text_length | |
---|---|---|---|---|---|
stars | 1.000000 | -0.098825 | -0.089860 | 0.060101 | -0.221752 |
useful | -0.098825 | 1.000000 | 0.656630 | 0.525962 | 0.161592 |
funny | -0.089860 | 0.656630 | 1.000000 | 0.741797 | 0.215003 |
cool | 0.060101 | 0.525962 | 0.741797 | 1.000000 | 0.193500 |
text_length | -0.221752 | 0.161592 | 0.215003 | 0.193500 | 1.000000 |
In [21]:
sns.heatmap(data.corr(), cmap = 'coolwarm')
Out[21]:
<AxesSubplot:>
In [22]:
data['text']
Out[22]:
2967245 LOVE the cheeses here. They are worth the pri...
4773684 This has become our go-to sushi place. The sus...
1139855 I was very disappointed with the hotel. The re...
3997153 Love this place - super amazing - staff here i...
4262000 Thank you Dana!!!! Having dyed my hair black p...
...
1567641 I'm a sucker for places like this. Get me in f...
4910763 Extremely rude staff! Was told 4 min on a lar...
1036315 I live in NYC and went to the RTR here in the ...
555962 If you are looking for a trainer, then look no...
838267 Awesome food. Awesome beer. Awesome service. N...
Name: text, Length: 10000, dtype: object
In [23]:
import string
In [24]:
string.punctuation
Out[24]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [25]:
s = data.iloc[0]['text']
In [26]:
s
Out[26]:
'LOVE the cheeses here. They are worth the price. Great for finding treats for a special dinner or picnic. Nice on sample days. Yum!!! Top quality meats. Nice selection of non brand frozen veggies. Veggie chips are mega tasty. Always quick and friendly check out. Produce not as stellar as it once was, but also not finding better in Madison.'
In [33]:
def remove_punc(x):
    # keep only characters that are not ASCII punctuation
    new_s = []
    for i in x:
        if i not in string.punctuation:
            new_s.append(i)
    return ''.join(new_s)
In [35]:
remove_punc(s)  # try it on the sample review first
Out[35]:
'LOVE the cheeses here They are worth the price Great for finding treats for a special dinner or picnic Nice on sample days Yum Top quality meats Nice selection of non brand frozen veggies Veggie chips are mega tasty Always quick and friendly check out Produce not as stellar as it once was but also not finding better in Madison'
In [37]:
''.join([i for i in s if i not in string.punctuation])
Out[37]:
'LOVE the cheeses here They are worth the price Great for finding treats for a special dinner or picnic Nice on sample days Yum Top quality meats Nice selection of non brand frozen veggies Veggie chips are mega tasty Always quick and friendly check out Produce not as stellar as it once was but also not finding better in Madison'
In [40]:
data['text'] = data['text'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
In [41]:
data['text']
Out[41]:
2967245 LOVE the cheeses here They are worth the pric...
4773684 This has become our goto sushi place The sushi...
1139855 I was very disappointed with the hotel The res...
3997153 Love this place super amazing staff here is ...
4262000 Thank you Dana Having dyed my hair black previ...
...
1567641 Im a sucker for places like this Get me in fro...
4910763 Extremely rude staff Was told 4 min on a larg...
1036315 I live in NYC and went to the RTR here in the ...
555962 If you are looking for a trainer then look no ...
838267 Awesome food Awesome beer Awesome service Need...
Name: text, Length: 10000, dtype: object
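A quicker route to the same cleanup is `str.translate`, which deletes punctuation in C rather than in a per-character Python loop. This is an alternative sketch, not a step from the original notebook:

```python
# Build a translation table that maps every punctuation character to None.
table = str.maketrans('', '', string.punctuation)

# Same result as the list comprehension above, but the loop runs in C:
s.translate(table)

# Vectorised over the whole column via the pandas string accessor:
# data['text'] = data['text'].str.translate(table)
```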
In [42]:
from nltk.corpus import stopwords
In [139]:
# stopwords.words('english')
In [85]:
#s.split(' ')
In [45]:
s.lower()
Out[45]:
'love the cheeses here. they are worth the price. great for finding treats for a special dinner or picnic. nice on sample days. yum!!! top quality meats. nice selection of non brand frozen veggies. veggie chips are mega tasty. always quick and friendly check out. produce not as stellar as it once was, but also not finding better in madison.'
In [54]:
def stop_w(x):
    # lower-case each token and drop English stop words
    new_s = []
    for i in x.split():
        if i.lower() not in stopwords.words('english'):
            new_s.append(i.lower())
    return new_s
In [55]:
data['text'].apply(stop_w)
Out[55]:
2967245 [love, cheeses, worth, price, great, finding, ...
4773684 [become, goto, sushi, place, sushi, always, fr...
1139855 [disappointed, hotel, restaurants, good, booke...
3997153 [love, place, super, amazing, staff, always, f...
4262000 [thank, dana, dyed, hair, black, previously, k...
...
1567641 [im, sucker, places, like, get, front, meat, c...
4910763 [extremely, rude, staff, told, 4, min, large, ...
1036315 [live, nyc, went, rtr, flatiron, didnt, select...
555962 [looking, trainer, look, moment, humberto, met...
838267 [awesome, food, awesome, beer, awesome, servic...
Name: text, Length: 10000, dtype: object
In [60]:
data['text'] = data['text'].apply(lambda x: [i.lower() for i in x.split() if i.lower() not in stopwords.words('english')])
In [61]:
data.head()
Out[61]:
stars | text | useful | funny | cool | text_length | |
---|---|---|---|---|---|---|
2967245 | 5 | [love, cheeses, worth, price, great, finding, ... | 0 | 0 | 1 | 347 |
4773684 | 5 | [become, goto, sushi, place, sushi, always, fr... | 0 | 0 | 0 | 377 |
1139855 | 1 | [disappointed, hotel, restaurants, good, booke... | 2 | 1 | 1 | 663 |
3997153 | 5 | [love, place, super, amazing, staff, always, f... | 0 | 0 | 0 | 141 |
4262000 | 5 | [thank, dana, dyed, hair, black, previously, k... | 0 | 0 | 0 | 455 |
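A performance note: `stopwords.words('english')` is re-evaluated for every token inside the loop and the lambda above. Hoisting it into a `set` once makes each membership test O(1); a minimal sketch of the equivalent transformation:

```python
# Build the stop-word set once instead of once per token.
stop_set = set(stopwords.words('english'))

def tokenize(text):
    """Lower-case, split on whitespace, and drop English stop words."""
    return [w.lower() for w in text.split() if w.lower() not in stop_set]

# Equivalent to the .apply(lambda ...) above, but far faster:
# data['text'] = data['text'].apply(tokenize)
```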
In [66]:
word_split = []
for i in range(len(data)):
    for j in data.iloc[i]['text']:
        word_split.append(j)
In [68]:
len(word_split)
Out[68]:
542773
In [70]:
from nltk.probability import FreqDist
In [72]:
plt.figure(figsize=(20,10))
FreqDist(word_split).plot(30)
Out[72]:
<AxesSubplot:xlabel='Samples', ylabel='Counts'>
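For a non-graphical view of the same counts, `collections.Counter` from the standard library yields the top-N list directly; a small sketch:

```python
from collections import Counter

# The same frequencies FreqDist plots, as (word, count) pairs.
top30 = Counter(word_split).most_common(30)
print(top30[:5])
```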
In [73]:
from wordcloud import WordCloud
In [76]:
wc = WordCloud().generate(str(data['text']))
plt.figure(figsize=(10,5))
plt.imshow(wc)
plt.axis('off')
Out[76]:
(-0.5, 399.5, 199.5, -0.5)
In [77]:
data['stars'].value_counts()
Out[77]:
5 7532
1 2468
Name: stars, dtype: int64
In [82]:
good = data[data['stars'] == 5]['text']
In [80]:
bad = data[data['stars'] == 1]['text']
In [83]:
wc = WordCloud().generate(str(good))
plt.figure(figsize=(10,5))
plt.imshow(wc)
plt.axis('off')
Out[83]:
(-0.5, 399.5, 199.5, -0.5)
In [84]:
wc = WordCloud().generate(str(bad))
plt.figure(figsize=(10,5))
plt.imshow(wc)
plt.axis('off')
Out[84]:
(-0.5, 399.5, 199.5, -0.5)
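A caveat on the clouds above: `str(good)` stringifies the pandas Series repr, which is truncated to a handful of rows, so each cloud is drawn from a small sample of reviews plus repr artefacts such as index numbers. A sketch that feeds in every token of the 5-star reviews instead:

```python
# Flatten the token lists of all 5-star reviews into one string,
# so the cloud reflects the full corpus rather than the truncated repr.
good_text = ' '.join(word for tokens in good for word in tokens)

wc = WordCloud().generate(good_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc)
plt.axis('off')
```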
Sentiment Prediction Model
In [88]:
data2 = pd.read_csv('./09. 상품 리뷰 분석(NLP)/yelp.csv', index_col=0)  # re-read the raw text; CountVectorizer does its own tokenising
In [93]:
X = data2['text']
y = data2['stars']
In [90]:
from sklearn.feature_extraction.text import CountVectorizer
In [94]:
cv = CountVectorizer()
In [95]:
cv.fit(X)
Out[95]:
CountVectorizer()
In [96]:
X = cv.transform(X)  # bag-of-words counts as a sparse matrix
In [97]:
X
Out[97]:
<10000x28253 sparse matrix of type '<class 'numpy.int64'>'
with 679150 stored elements in Compressed Sparse Row format>
In [140]:
# print(X)
In [100]:
cv.get_feature_names()[28048]  # renamed to get_feature_names_out() in scikit-learn 1.0+
Out[100]:
'yourself'
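To make the vectoriser's behaviour concrete (lower-casing, tokenising, counting), here is an illustrative two-document toy corpus, not part of the original notebook:

```python
toy = ['The cheese is great', 'great cheese, great price']
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy)

print(toy_cv.get_feature_names())  # ['cheese', 'great', 'is', 'price', 'the']
print(toy_X.toarray())
# [[1 1 1 0 1]
#  [1 2 0 1 0]]
```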
In [101]:
from sklearn.model_selection import train_test_split
In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
In [104]:
from sklearn.naive_bayes import MultinomialNB
In [105]:
model = MultinomialNB()
In [106]:
model.fit(X_train, y_train)
Out[106]:
MultinomialNB()
In [107]:
pred = model.predict(X_test)
In [108]:
pred
Out[108]:
array([5, 5, 5, ..., 1, 5, 1], dtype=int64)
In [109]:
y_test
Out[109]:
1373705 5
3128713 5
212088 1
1622136 5
2380124 5
..
3548316 5
38943 5
2423674 1
1564863 5
3629333 1
Name: stars, Length: 2000, dtype: int64
In [110]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Naive Bayes accuracy: 92.65%
In [111]:
accuracy_score(y_test, pred)
Out[111]:
0.9265
In [113]:
confusion_matrix(y_test, pred)
Out[113]:
array([[ 421, 65],
[ 82, 1432]], dtype=int64)
In [114]:
print(classification_report(y_test, pred))
precision recall f1-score support
1 0.84 0.87 0.85 486
5 0.96 0.95 0.95 1514
accuracy 0.93 2000
macro avg 0.90 0.91 0.90 2000
weighted avg 0.93 0.93 0.93 2000
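One methodological note: the vectoriser was fitted on all 10,000 reviews before the split, so test-set words leak into the vocabulary. Wrapping both steps in a scikit-learn Pipeline fits the vocabulary on the training fold only; a sketch under that assumption (the `nb_pipe` and `Xr_*` names are mine):

```python
from sklearn.pipeline import make_pipeline

# Split the raw text first, then let the pipeline fit the vocabulary
# on the training fold only.
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    data2['text'], data2['stars'], test_size=0.2, random_state=100)

nb_pipe = make_pipeline(CountVectorizer(), MultinomialNB())
nb_pipe.fit(Xr_train, yr_train)
print(nb_pipe.score(Xr_test, yr_test))  # should land close to the 92.65% above
```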
Comparison with Random Forest
In [116]:
from sklearn.ensemble import RandomForestClassifier
In [118]:
rf = RandomForestClassifier(max_depth=10)
In [121]:
rf.fit(X_train, y_train)
Out[121]:
RandomForestClassifier(max_depth=10)
In [122]:
pred2 = rf.predict(X_test)
In [124]:
accuracy_score(y_test, pred2)
Out[124]:
0.787
In [131]:
confusion_matrix(y_test, pred2)
Out[131]:
array([[ 64, 422],
[ 4, 1510]], dtype=int64)
Re-run with more trees (n_estimators=1000; max_depth unchanged at 10)
In [132]:
rf = RandomForestClassifier(max_depth=10, n_estimators=1000)
In [133]:
rf.fit(X_train, y_train)
Out[133]:
RandomForestClassifier(max_depth=10, n_estimators=1000)
In [135]:
pred3 = rf.predict(X_test)
Random Forest accuracy 78.45%: lower than Naive Bayes at 92.65%

With roughly 28,000 sparse bag-of-words features, trees capped at depth 10 can each consult only a tiny slice of the vocabulary, whereas multinomial Naive Bayes uses every word's count; the confusion matrix below shows the forest predicting 5 stars almost across the board.
In [136]:
accuracy_score(y_test, pred3)
Out[136]:
0.7845
In [137]:
confusion_matrix(y_test, pred3)
Out[137]:
array([[ 59, 427],
[ 4, 1510]], dtype=int64)
In [138]:
print(classification_report(y_test, pred3))
precision recall f1-score support
1 0.94 0.12 0.21 486
5 0.78 1.00 0.88 1514
accuracy 0.78 2000
macro avg 0.86 0.56 0.55 2000
weighted avg 0.82 0.78 0.71 2000
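As a closing sanity check, the fitted vectoriser and Naive Bayes model can score unseen text; the review below is invented for illustration:

```python
# A hypothetical new review; cv and model were fitted earlier in the notebook.
new_review = ['The staff was rude and the food was cold. Never coming back.']
new_X = cv.transform(new_review)  # reuse the fitted vocabulary, no re-fitting
print(model.predict(new_X))       # a clearly negative review should map to 1
```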