[파이썬 머신러닝 완벽가이드] Chap08-5. Sentiment Analysis
aiclaudev · 2022. 1. 18. 19:25
In [ ]:
# - This post is my study notes on 파이썬 머신러닝 완벽 가이드 (by 권철민), with additional commentary of my own.
# It was created for personal study; should any problem arise, this post will be made private.
In [1]:
# Supervised learning-based sentiment analysis
In [2]:
import pandas as pd
review_df = pd.read_csv('./labeledTrainData.tsv', header = 0, sep = "\t", quoting = 3) # a .tsv file is tab-separated, hence sep='\t'
review_df.head(3)
Out[2]:
| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
In [3]:
print(review_df['review'][0])
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
In [4]:
# Since the reviews were extracted from HTML, <br /> tags remain -> not needed as features, so remove them
# Non-alphabetic special characters are not needed as features either, so remove them too
import re # Python module supporting regular expressions
# Replace <br /> HTML tags with a space using the replace function
review_df['review'] = review_df['review'].str.replace('<br />', ' ') # str enables pandas' vectorized string operations
# Use Python's regular-expression module re to replace every non-alphabetic character with a space
review_df['review'] = review_df['review'].apply(lambda x : re.sub("[^a-zA-Z]", " ", x))
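In [ ]:
# Quick sanity check (my own example; the sample string below is hypothetical, not from the dataset):
# the substitution keeps letters only, turning digits and punctuation into spaces.
import re
sample = "It's 100% worth watching! (IMDb: 8/10)" # hypothetical sample string
print(re.sub("[^a-zA-Z]", " ", sample)) # every non-alphabetic character becomes a space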
In [5]:
# Use the sentiment column as a separate target-label data set, and drop id and sentiment to build the feature data set
# Split into train and test sets
from sklearn.model_selection import train_test_split
class_df = review_df['sentiment']
feature_df = review_df.drop(['id', 'sentiment'], axis = 1, inplace = False)
X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size = 0.3, random_state = 156)
X_train.shape, X_test.shape
Out[5]:
((17500, 1), (7500, 1))
In [6]:
# Use a Pipeline object to perform feature vectorization and performance evaluation in one pass
# Using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Run CountVectorizer with stop_words = 'english' for stop-word filtering and ngram_range = (1, 2)
# Set LogisticRegression's C to 10
pipeline = Pipeline([
('cnt_vect', CountVectorizer(stop_words = 'english', ngram_range = (1, 2))),
('lr_clf', LogisticRegression(C = 10))])
# Train/predict with the Pipeline object's fit() and predict(). predict_proba() is called for roc_auc
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]
print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred), roc_auc_score(y_test, pred_probs)))
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Accuracy: 0.8860, ROC-AUC: 0.9503
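In [ ]:
# The ConvergenceWarning above means lbfgs hit its iteration limit before converging.
# A minimal fix sketch (my addition, not from the book): raise max_iter, or switch to the
# liblinear solver, which converges quickly on sparse, high-dimensional text features.
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(C = 10, max_iter = 1000) # give lbfgs more iterations
# lr_clf = LogisticRegression(C = 10, solver = 'liblinear') # ...or swap the solver instead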
In [7]:
# Using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
# Run TfidfVectorizer with stop_words = 'english' for stop-word filtering and ngram_range = (1, 2)
# Set LogisticRegression's C to 10
pipeline = Pipeline([
('tfidf_vect', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2))),
('lr_clf', LogisticRegression(C = 10))])
# Train/predict with the Pipeline object's fit() and predict(). predict_proba() is called for roc_auc
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]
print('Accuracy: {0:.4f}, ROC-AUC: {1:.4f}'.format(accuracy_score(y_test, pred), roc_auc_score(y_test, pred_probs)))
Accuracy: 0.8936, ROC-AUC: 0.9598
In [8]:
# Unsupervised sentiment analysis
In [9]:
# The supervised approach above relies on label values, but most sentiment-analysis datasets have no labels
# Unsupervised sentiment analysis is therefore based on a Lexicon
# The Lexicon here is a sentiment lexicon: each entry carries a sentiment score, a numeric value quantifying how positive or negative it is
In [10]:
# The WordNet module of the NLTK package
# The WordNet module provided by NLTK is a vast English lexical database
# Semantic: refers to the meaning a word carries in context
# The NLTK package offers various ways to interface with this semantic information programmatically
# WordNet provides the semantic information of a word as it is used differently across contexts; to do so,
# each sense of a word, per part of speech, is represented through the concept of a Synset
# A Synset is not merely a single word: it is WordNet's core concept, carrying the context and semantic information the word has
In [11]:
# The sentiment lexicon NLTK provides serves as an excellent dictionary, but its prediction performance is not that good
In [12]:
# Representative sentiment lexicons, including NLTK's, are as follows
In [13]:
# SentiWordNet
# Implements a WordNet dedicated to sentiment words, similar to the NLTK package's WordNet
# Applies WordNet's Synset concept to sentiment analysis
# Quantifies a positivity score, a negativity score, and an objectivity score, assigned per WordNet Synset
# For each sentence, the words' positivity and negativity scores are summed into a final sentiment score, which decides whether the sentence is positive or negative
In [14]:
# VADER
# A package built mainly to provide sentiment analysis for social-media text
# Delivers strong sentiment-analysis results with comparatively fast execution, so it is commonly used on large text datasets
In [15]:
# Pattern
# The most noteworthy package in terms of prediction performance, but it is not compatible with Python 3.x and only runs on Python 2.x
In [16]:
# Sentiment analysis using SentiWordNet
In [17]:
import nltk
nltk.download('all') # Download NLTK's WordNet subpackage and all of its datasets
[nltk_data] Downloading collection 'all'
[nltk_data]  | Downloading package abc to C:\Users\82102\AppData\Roaming\nltk_data...
[nltk_data]  | Package abc is already up-to-date!
[nltk_data]  | ... (dozens of further packages omitted; each was reported as already up-to-date)
[nltk_data]  |
[nltk_data] Done downloading collection all
Out[17]:
True
In [18]:
from nltk.corpus import wordnet as wn
term = 'present'
# Create WordNet synsets for the word 'present'
synsets = wn.synsets(term) # wordnet's synsets() returns every Synset object registered in WordNet for the word given as its parameter
print('synsets() return type : ', type(synsets))
print('synsets() return count : ', len(synsets))
print('synsets() return value : ', synsets)
synsets() return type :  <class 'list'>
synsets() return count :  18
synsets() return value :  [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]
In [19]:
# synsets() returns a list of Synset objects (cf. there is also a singular synset())
# In Synset('present.n.01'), the name 'present.n.01' encodes the word, its POS tag, and a sense index:
# 'present' is the word itself, 'n' is the noun part of speech, and '01' is an index distinguishing the multiple senses the word has as a noun
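In [ ]:
# Small illustration (my addition): the parts of a synset name can be read back from the
# Synset object itself through standard NLTK accessors.
from nltk.corpus import wordnet as wn
syn = wn.synset('present.n.01') # look up one specific sense by its full name
print(syn.name()) # 'present.n.01'
print(syn.pos()) # 'n' -> noun
print(syn.definition()) # the gloss of this particular sense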
In [20]:
# A look at the various attributes a Synset object carries
# POS (part of speech), Definition, and Lemmas
for synset in synsets :
    print('##### Synset name : ', synset.name(), '#####')
    print('POS : ', synset.lexname())
    print('Definition : ', synset.definition())
    print('Lemmas : ', synset.lemma_names())
    print('\n')
##### Synset name :  present.n.01 #####
POS :  noun.time
Definition :  the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas :  ['present', 'nowadays']

##### Synset name :  present.n.02 #####
POS :  noun.possession
Definition :  something presented as a gift
Lemmas :  ['present']

##### Synset name :  present.n.03 #####
POS :  noun.communication
Definition :  a verb tense that expresses actions or states at the time of speaking
Lemmas :  ['present', 'present_tense']

##### Synset name :  show.v.01 #####
POS :  verb.perception
Definition :  give an exhibition of to an interested audience
Lemmas :  ['show', 'demo', 'exhibit', 'present', 'demonstrate']

##### Synset name :  present.v.02 #####
POS :  verb.communication
Definition :  bring forward and present to the mind
Lemmas :  ['present', 'represent', 'lay_out']

##### Synset name :  stage.v.01 #####
POS :  verb.creation
Definition :  perform (a play), especially on a stage
Lemmas :  ['stage', 'present', 'represent']

##### Synset name :  present.v.04 #####
POS :  verb.possession
Definition :  hand over formally
Lemmas :  ['present', 'submit']

##### Synset name :  present.v.05 #####
POS :  verb.stative
Definition :  introduce
Lemmas :  ['present', 'pose']

##### Synset name :  award.v.01 #####
POS :  verb.possession
Definition :  give, especially as an honor or reward
Lemmas :  ['award', 'present']

##### Synset name :  give.v.08 #####
POS :  verb.possession
Definition :  give as a present; make a gift of
Lemmas :  ['give', 'gift', 'present']

##### Synset name :  deliver.v.01 #####
POS :  verb.communication
Definition :  deliver (a speech, oration, or idea)
Lemmas :  ['deliver', 'present']

##### Synset name :  introduce.v.01 #####
POS :  verb.communication
Definition :  cause to come to know personally
Lemmas :  ['introduce', 'present', 'acquaint']

##### Synset name :  portray.v.04 #####
POS :  verb.creation
Definition :  represent abstractly, for example in a painting, drawing, or sculpture
Lemmas :  ['portray', 'present']

##### Synset name :  confront.v.03 #####
POS :  verb.communication
Definition :  present somebody with something, usually to accuse or criticize
Lemmas :  ['confront', 'face', 'present']

##### Synset name :  present.v.12 #####
POS :  verb.communication
Definition :  formally present a debutante, a representative of a country, etc.
Lemmas :  ['present']

##### Synset name :  salute.v.06 #####
POS :  verb.communication
Definition :  recognize with a gesture prescribed by a military regulation; assume a prescribed position
Lemmas :  ['salute', 'present']

##### Synset name :  present.a.01 #####
POS :  adj.all
Definition :  temporal sense; intermediate between past and future; now existing or happening or in consideration
Lemmas :  ['present']

##### Synset name :  present.a.02 #####
POS :  adj.all
Definition :  being or existing in a specified place
Lemmas :  ['present']
In [21]:
# As shown above, a Synset represents each of the semantic senses a single word can carry as an individual class
In [22]:
# WordNet can express the relationship between one word and another as a similarity score
# It provides the path_similarity() method for this
# Create a Synset object for each word
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')
entities = [tree, lion, tiger, cat, dog]
similarities = []
entity_names = [entity.name().split('.')[0] for entity in entities]
# Iterate over each word's synset, measuring its similarity to every other word's synset
for entity in entities :
    similarity = [round(entity.path_similarity(compared_entity), 2) for compared_entity in entities]
    similarities.append(similarity)
# Store each synset's similarities to the other synsets as a DataFrame
similarity_df = pd.DataFrame(similarities, columns = entity_names, index = entity_names)
similarity_df
Out[22]:
| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | 1.00 | 0.07 | 0.07 | 0.08 | 0.12 |
| lion | 0.07 | 1.00 | 0.33 | 0.25 | 0.17 |
| tiger | 0.07 | 0.33 | 1.00 | 0.25 | 0.17 |
| cat | 0.08 | 0.25 | 0.25 | 1.00 | 0.20 |
| dog | 0.12 | 0.17 | 0.17 | 0.20 | 1.00 |
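In [ ]:
# For reference (my note, based on NLTK's documented behavior rather than the book):
# path_similarity is computed as 1 / (shortest hypernym-path length + 1), so a synset's
# similarity to itself is 1.0 and the score shrinks as the taxonomy path grows.
print(tree.path_similarity(tree)) # 1.0 : path length 0 to itself
print(lion.path_similarity(tiger)) # 0.33... : closely related, as in the table above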
In [23]:
# SentiWordNet has a Senti_Synset class, analogous to WordNet's Synset
# The SentiWordNet module's senti_synsets(), much like the WordNet module's synsets(), returns Senti_Synset objects as a list
from nltk.corpus import sentiwordnet as swn
senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type : ', type(senti_synsets))
print('senti_synsets() return count : ', len(senti_synsets))
print('senti_synsets() return value : ', senti_synsets)
senti_synsets() return type :  <class 'list'>
senti_synsets() return count :  11
senti_synsets() return value :  [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]
In [25]:
# A SentiSynset object carries sentiment scores (positive and negative) and an objectivity score
# If a word is not sentimental at all, its objectivity score is 1 and both sentiment scores are 0
father = swn.senti_synset('father.n.01')
print('father positive score : ', father.pos_score())
print('father negative score : ', father.neg_score())
print('father objectivity score : ', father.obj_score())
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score : ', fabulous.pos_score())
print('fabulous negative score : ', fabulous.neg_score())
print('fabulous objectivity score : ', fabulous.obj_score())
father positive score :  0.0
father negative score :  0.0
father objectivity score :  1.0
fabulous positive score :  0.875
fabulous negative score :  0.125
fabulous objectivity score :  0.0
In [26]:
# Movie-review sentiment analysis using SentiWordNet
In [27]:
from nltk.corpus import wordnet as wn
# Convert simple NLTK PennTreebank POS tags into WordNet-style POS tags
def penn_to_wn(tag) :
    if tag.startswith('J') :
        return wn.ADJ
    elif tag.startswith('N') :
        return wn.NOUN
    elif tag.startswith('R') :
        return wn.ADV
    elif tag.startswith('V') :
        return wn.VERB
In [35]:
# Build a function that splits a document into sentences -> word tokens -> POS tags, then creates SentiSynset objects and sums their polarity scores
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag
def swn_polarity(text) :
    # Initialize the sentiment score
    sentiment = 0.0
    tokens_count = 0
    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # For each sentence: word tokenization -> POS tagging -> create SentiSynset -> accumulate the sentiment score
    for raw_sentence in raw_sentences :
        # Extract the POS-tagged sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence)) # pos_tag returns a list of (word, tag) tuples
        for word, tag in tagged_sentence :
            # WordNet-style POS tagging and lemmatization
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV) :
                continue
            lemma = lemmatizer.lemmatize(word, pos = wn_tag) # lemmatization needs the POS passed in alongside the word
            if not lemma :
                continue
            # Create Synset objects from the lemmatized word and the WordNet POS tag
            synsets = wn.synsets(lemma, pos = wn_tag)
            if not synsets :
                continue
            # Extract the sentiment synset from sentiwordnet's sentiment lexicon
            # For every word, add positive scores as + and negative scores as - to compute the sentiment score
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            tokens_count += 1
    if not tokens_count :
        return 0
    # Return positive (1) if the total score is >= 0, otherwise negative (0)
    if sentiment >= 0 :
        return 1
    return 0
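In [ ]:
# Quick usage check (my addition; the sample sentence is made up, and the returned label
# depends entirely on the SentiWordNet scores of its words):
print(swn_polarity('This movie was wonderful and the acting was superb.')) # 1 = positive, 0 = negative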
In [36]:
# Add a new prediction column and measure accuracy, precision, and recall
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score
import numpy as np
review_df['preds'] = review_df['review'].apply(lambda x : swn_polarity(x))
y_target = review_df['sentiment'].values
preds = review_df['preds'].values
print(confusion_matrix(y_target, preds))
print('Accuracy : ', np.round(accuracy_score(y_target, preds), 4))
print('Precision : ', np.round(precision_score(y_target, preds), 4))
print('Recall : ', np.round(recall_score(y_target, preds), 4))
[[7668 4832]
 [3636 8864]]
Accuracy :  0.6613
Precision :  0.6472
Recall :  0.7091
In [37]:
# Sentiment analysis using VADER
# VADER is available both as a submodule of the NLTK package and as a standalone package
# The NLTK submodule was already installed by nltk.download('all') above
In [39]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)
{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}
In [40]:
# The SentimentIntensityAnalyzer's polarity_scores() method makes it easy to obtain sentiment scores
# neg is the negative, neu the neutral, and pos the positive sentiment score; compound combines neg, neu, and pos into a single sentiment score between -1 and 1
# Typically compound >= 0.1 is judged positive and anything below negative, but the threshold can be tuned to adjust prediction performance
In [41]:
# vader_polarity() takes a movie-review text and a threshold deciding positive vs. negative as input parameters,
# then calls the SentimentIntensityAnalyzer object's polarity_scores() method and returns the sentiment result
def vader_polarity(review, threshold = 0.1) :
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    # Based on the compound value: return 1 if it is at least the threshold, otherwise 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment
# Use an apply-lambda expression to run vader_polarity() on every record and store the results in 'vader_preds'
review_df['vader_preds'] = review_df['review'].apply(lambda x : vader_polarity(x, 0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values
print(confusion_matrix(y_target, vader_preds))
print("정확도 : ", np.round(accuracy_score(y_target, vader_preds), 4))
print('정밀도 : ', np.round(precision_score(y_target, vader_preds), 4))
print('재현율 : ', np.round(recall_score(y_target, vader_preds), 45))
[[ 6747 5753] [ 1858 10642]] 정확도 : 0.6956 정밀도 : 0.6491 재현율 : 0.8513600000000001
In [42]:
# A clear improvement over SentiWordNet.