Notice
Recent Posts
Recent Comments
Link
일 | 월 | 화 | 수 | 목 | 금 | 토 |
---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 | 12 | 13 | 14 |
15 | 16 | 17 | 18 | 19 | 20 | 21 |
22 | 23 | 24 | 25 | 26 | 27 | 28 |
29 | 30 | 31 |
Tags
- 파이썬웹개발
- 비지도학습
- OpenAIAPI
- 딥러닝
- python 정렬
- 판다스
- chatGPT
- 판다스 데이터정렬
- HTML
- programmablesearchengine
- 챗gpt
- fastapi
- Python
- pandas
- 파이썬
- langchain
- 머신러닝
- konlpy
- 파이토치기본
- MachineLearning
- fastapi #파이썬웹개발
- 사이킷런
- 랭체인
- fastapi #python웹개발
- deeplearning
- 자연어분석
- 파이토치
- NLP
- sklearn
- pytorch
Archives
- Today
- Total
Data Navigator
[NLP, gensim] 영문 데이터 Word2Vec 만들고 Embedding Visualization으로 시각화 하기 본문
Machine Learning, Deep Learning
[NLP, gensim] 영문 데이터 Word2Vec 만들고 Embedding Visualization으로 시각화 하기
코딩하고분석하는돌스 2021. 2. 17. 23:491. 영어 Word2Vec 만들기 gensim 패키지 사용¶
In [13]:
import nltk
nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/haram4th/nltk_data... [nltk_data] Package punkt is already up-to-date!
Out[13]:
True
In [14]:
import urllib.request
import zipfile
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
2. Data load (ted_en-20160408.xml, 출처: wit3.fbk.eu)¶
In [15]:
targetXML = open('ted_en-20160408.xml', 'r', encoding='utf-8')
In [16]:
target_text = etree.parse(targetXML)
In [17]:
# xml 파일로부터 <content>와 </content> 사이의 내용만 가져옴
parse_text = '\n'.join(target_text.xpath('//content/text()'))
In [18]:
# 정규표현식의 sub 모듈을 사용해 (Ausio, Laughter)등 배경음 부분 제거
content_text = re.sub(r'\([^])]*\)','', parse_text)
In [19]:
# 입력 말뭉치에 대해서 NLTK를 이용해 문장 토큰화
sent_text = sent_tokenize(content_text)
In [20]:
# 문장 중 글자와 숫자만 추출하고 소문자로 변경
normalized_text = []
for string in sent_text:
tokens = re.sub(r"[^a-z0-9]+"," ", string.lower())
normalized_text.append(tokens)
# 각 문장을 NLTK를 이용해서 단어로 토큰화
result = []
result = [word_tokenize(sentence) for sentence in normalized_text]
In [21]:
len(result)
Out[21]:
273425
In [22]:
print(result[:10])
[['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new'], ['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation'], ['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing'], ['consider', 'facit'], ['i', 'm', 'actually', 'old', 'enough', 'to', 'remember', 'them'], ['facit', 'was', 'a', 'fantastic', 'company'], ['they', 'were', 'born', 'deep', 'in', 'the', 'swedish', 'forest', 'and', 'they', 'made', 'the', 'best', 'mechanical', 'calculators', 'in', 'the', 'world'], ['everybody', 'used', 'them'], ['and', 'what', 'did', 'facit', 'do', 'when', 'the', 'electronic', 'calculator', 'came', 'along'], ['they', 'continued', 'doing', 'exactly', 'the', 'same']]
In [45]:
len(result[150])
Out[45]:
24
3. Word2Vec 훈련시키기¶
In [30]:
from gensim.models import Word2Vec, KeyedVectors
In [29]:
model = Word2Vec(sentences=result, size=100, window=5, min_count=5, workers=4, sg=9)
In [32]:
# model.wv.most_similar 입력한 단어와 가장 비슷한 단어 출력
model_result = model.wv.most_similar('man')
print(model_result)
[('guy', 0.757636547088623), ('woman', 0.7501822710037231), ('soldier', 0.7123635411262512), ('lady', 0.6881694793701172), ('boy', 0.684287428855896), ('testament', 0.6794043779373169), ('rabbi', 0.6783125996589661), ('comedian', 0.6772958040237427), ('michelangelo', 0.6743930578231812), ('policeman', 0.6743665933609009)]
In [ ]:
4. Word2Vec 모델 저장하고 로드하기¶
In [36]:
model.wv.save_word2vec_format('./eng_w2v') #모델저장
loaded_model = KeyedVectors.load_word2vec_format('./eng_w2v') #모델로드
In [37]:
model_result = loaded_model.most_similar('man')
print(model_result)
[('guy', 0.757636547088623), ('woman', 0.7501822710037231), ('soldier', 0.7123635411262512), ('lady', 0.6881694793701172), ('boy', 0.684287428855896), ('testament', 0.6794043779373169), ('rabbi', 0.6783125996589661), ('comedian', 0.6772958040237427), ('michelangelo', 0.6743930578231812), ('policeman', 0.6743665933609009)]
In [ ]:
5. Google의 Word2Vec 모델을 이용해서 훈련하기¶
In [47]:
import gensim
In [49]:
model = gensim.models.KeyedVectors.load_word2vec_format('./datas/GoogleNews-vectors-negative300.bin', binary=True)
In [50]:
print(model.vectors.shape)
(3000000, 300)
총 300만개의 단어가 300차원으로 있음¶
In [52]:
type(model)
Out[52]:
gensim.models.keyedvectors.Word2VecKeyedVectors
In [53]:
print(model.similarity('this','is'))
0.40797037
In [54]:
print(model.similarity('travel','family'))
0.13925761
In [57]:
print(model['AI'])
[ 0.18066406 0.01342773 0.14746094 0.00302124 -0.16699219 0.00540161 -0.25976562 0.01556396 -0.18457031 -0.11035156 -0.02893066 0.00170135 0.10107422 -0.19433594 -0.05249023 0.00146484 0.28125 -0.02954102 -0.06030273 -0.03833008 -0.0378418 0.08984375 0.234375 0.10888672 -0.10839844 -0.06103516 0.02307129 0.16601562 -0.11669922 -0.17285156 -0.14160156 -0.2265625 -0.08935547 -0.08496094 -0.27539062 0.17480469 0.02062988 0.12158203 -0.0703125 -0.00286865 0.328125 -0.00318909 -0.07666016 0.43554688 0.00619507 -0.39453125 0.16699219 -0.11621094 0.14648438 -0.04101562 -0.12695312 -0.04980469 -0.09082031 0.05712891 -0.21484375 0.04101562 0.21875 -0.20117188 0.05078125 0.32617188 -0.046875 -0.05395508 0.08349609 0.04516602 -0.20410156 -0.07910156 0.35351562 -0.06787109 -0.24804688 0.11035156 -0.12304688 0.33203125 0.12011719 -0.19238281 -0.20410156 -0.34179688 0.20019531 -0.125 0.06201172 0.1328125 -0.25976562 0.27734375 0.06933594 0.01879883 0.04052734 -0.08740234 0.04370117 0.3671875 0.18066406 0.05761719 0.02514648 0.05273438 -0.55078125 0.02514648 0.16015625 0.16894531 0.12304688 -0.10742188 0.05200195 -0.25195312 0.15234375 -0.28710938 -0.08496094 0.09082031 0.1640625 0.09326172 -0.00909424 -0.23730469 0.09033203 0.00314331 -0.19140625 -0.13964844 -0.21972656 0.0189209 -0.11035156 -0.10107422 -0.27148438 -0.17480469 0.08935547 -0.34765625 -0.18164062 -0.18261719 -0.17480469 -0.12353516 -0.296875 -0.16796875 -0.2265625 -0.140625 -0.01586914 -0.21582031 -0.12060547 -0.01104736 -0.4375 0.09521484 0.02648926 -0.24121094 0.05175781 -0.15136719 -0.27734375 -0.04589844 0.35351562 -0.20605469 -0.046875 0.25390625 -0.2734375 0.05932617 0.24902344 -0.1328125 0.17480469 0.05737305 -0.28125 0.09619141 -0.1875 -0.22265625 -0.09033203 -0.484375 0.06787109 -0.06640625 -0.03222656 -0.14746094 -0.03344727 -0.00115204 -0.19433594 -0.08398438 -0.36523438 -0.27148438 0.09716797 -0.05175781 -0.12451172 0.2265625 -0.12890625 -0.34179688 -0.01538086 -0.15820312 0.34960938 -0.23828125 -0.21289062 -0.48828125 0.3203125 -0.04223633 0.0546875 0.03735352 0.078125 0.33007812 -0.06982422 0.01470947 0.25 -0.17382812 -0.20019531 -0.13476562 -0.20898438 0.05151367 -0.13574219 -0.15429688 0.03515625 0.13085938 0.0222168 -0.35546875 -0.11279297 -0.16894531 0.00212097 -0.03491211 -0.42578125 0.0267334 -0.05761719 -0.08886719 -0.1796875 0.0246582 -0.03063965 -0.00860596 -0.17285156 0.18359375 0.04638672 -0.07128906 -0.12792969 0.06738281 0.21289062 0.13476562 -0.18457031 -0.171875 -0.27734375 0.17285156 0.10449219 0.05444336 0.01977539 -0.00830078 -0.27148438 0.22167969 -0.05273438 0.06835938 0.14941406 0.1796875 0.08935547 -0.109375 -0.04516602 -0.08300781 0.08007812 -0.26757812 0.16601562 -0.00854492 -0.12988281 0.02416992 -0.16699219 -0.20410156 -0.28515625 0.03039551 -0.13476562 0.09228516 -0.13867188 -0.0534668 0.14550781 -0.0402832 0.03369141 0.07470703 0.05737305 -0.14453125 0.21484375 0.13964844 0.11572266 -0.16992188 0.13867188 0.15917969 0.15234375 -0.11035156 -0.0456543 -0.13769531 -0.30273438 -0.00686646 -0.17480469 0.24707031 0.02246094 0.10742188 -0.09814453 -0.14941406 0.05908203 -0.15429688 -0.0390625 0.15820312 -0.05004883 0.14453125 -0.22460938 0.04711914 0.04443359 -0.140625 0.12011719 -0.03393555 -0.25976562 -0.23730469 0.12695312 0.25 -0.33789062 0.15820312 0.01196289 -0.04150391 -0.11767578 -0.06884766 -0.16796875 0.10253906 -0.07128906 0.18359375]
In [ ]:
6. 학습된 임베딩 벡터의 시각화(Embedding Visualization)¶
google의 embedding projector를 사용한 시각화¶
In [58]:
!python -m gensim.scripts.word2vec2tensor --input eng_w2v --output eng_w2v
2021-02-17 22:42:34,280 - word2vec2tensor - INFO - running /home/haram4th/anaconda3/envs/mdai/lib/python3.8/site-packages/gensim/scripts/word2vec2tensor.py --input eng_w2v --output eng_w2v 2021-02-17 22:42:34,281 - utils_any2vec - INFO - loading projection weights from eng_w2v 2021-02-17 22:42:35,886 - utils_any2vec - INFO - loaded (21613, 100) matrix from eng_w2v 2021-02-17 22:42:37,145 - word2vec2tensor - INFO - 2D tensor file saved to eng_w2v_tensor.tsv 2021-02-17 22:42:37,145 - word2vec2tensor - INFO - Tensor metadata file saved to eng_w2v_metadata.tsv 2021-02-17 22:42:37,147 - word2vec2tensor - INFO - finished running word2vec2tensor.py
In [ ]:
2) Embedding Projector를 사용하여 시각화 하기¶
In [59]:
from IPython.display import Image
In [60]:
Image('./2021-02-17 22-46-35.png')
Out[60]:
In [61]:
Image('./2021-02-17 22-57-16.png')
Out[61]:
화면 오른쪽 위의 search에 travel을 넣고 neibothers 636개, T-SNE로 돌려 나온 결과¶
In [62]:
Image('./2021-02-17 23-20-40.png')
Out[62]:
In [ ]:
In [ ]: