AI 공부 저장소 (AI Study Repository)

[파이썬 머신러닝 완벽가이드] Chap08-8. Document Similarity - Cosine Similarity

aiclaudev 2022. 1. 28. 15:39

In [1]:

# This post is a set of study notes, with my own additions, based on 파이썬 머신러닝 완벽 가이드 (by 권철민).
# It was written for personal study; if any problem arises, the post will be made private.

In [41]:

# Cosine similarity
# When comparing two vectors, it measures how similar their directions are rather than how large they are.
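
To make the "direction, not magnitude" point concrete, here is a minimal sketch. The helper name `cosine` and the three vectors are made up for illustration and are not from the book. Scaling a vector by 10 leaves its cosine similarity with the original at 1, even though the Euclidean distance between the two is large.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a                      # same direction, ten times the magnitude
c = np.array([-3.0, 0.0, 1.0])    # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

same_direction = cosine(a, b)                 # direction is identical
orthogonal = cosine(a, c)                     # no shared direction
euclidean_ab = float(np.linalg.norm(a - b))   # distance grows with magnitude

print(same_direction, orthogonal, euclidean_ab)
```

The Euclidean distance between `a` and `b` is nine times the length of `a`, although the two vectors point the same way.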

In [2]:

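# Why cosine similarity is the most widely used measure for comparing documents
# Vectorizing documents into features tends to produce a sparse matrix with a very large number of dimensions.
# In that setting, a similarity measure based on vector magnitude (Euclidean distance)
# tends to be inaccurate.
# Also, a very long document will naturally contain more occurrences of each word,
# so comparing documents by raw frequency is not fair.
# For example, suppose the word '머신러닝' (machine learning) appears 5 times in document A
# and 3 times in document B.
# That alone does not mean document A is more closely related to '머신러닝'.
# If document A is 10 times longer than document B,
# document B is arguably the one more closely related to '머신러닝'.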
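
The arithmetic behind this example can be sketched as follows. The document lengths (1000 and 100 words) are assumed values chosen only to match the "10 times longer" condition; the original gives no actual lengths.

```python
# Assumed lengths, not from the original: A has 1000 words, B has 100 words.
mentions_a, length_a = 5, 1000
mentions_b, length_b = 3, 100

# Relative frequency of the word in each document
rate_a = mentions_a / length_a
rate_b = mentions_b / length_b

# B mentions the word less often in absolute terms but far more often per word
b_is_more_focused = rate_b > rate_a
print(rate_a, rate_b, b_is_more_focused)
```

Under these assumed lengths, the word takes up about six times as large a share of document B as of document A, which is why raw counts are misleading here.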

In [3]:

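# Write a cos_similarity() function that computes the cosine similarity of two NumPy arrays
import numpy as np

def cos_similarity(v1, v2) :
    # Dot product of the two vectors
    dot_product = np.dot(v1, v2)
    # Product of the two L2 norms (np.sum instead of the slower built-in sum)
    l2_norm = (np.sqrt(np.sum(np.square(v1)))*np.sqrt(np.sum(np.square(v2))))
    similarity = dot_product / l2_norm

    return similarity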

In [5]:

# Compare the similarity of the 3 short documents defined in doc_list
# Vectorize them with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

doc_list = ['if you take the blue pill, the story ends',
           'if you take the red pill, you stay in Wonderland',
           'if you take the red pill, I show you how deep the rabbit hole goes']

tfidf_vect_simple = TfidfVectorizer()
feature_vect_simple = tfidf_vect_simple.fit_transform(doc_list)
print(feature_vect_simple.shape)

(3, 18)

In [12]:

# The returned matrix is a sparse matrix.
# Convert it to a dense array before measuring similarity.

# The output of TfidfVectorizer is sparse; toarray() returns a plain ndarray
# (todense() would return the legacy np.matrix type).
feature_vect_dense = feature_vect_simple.toarray()

# Feature vectors of the first and second sentences
vect1 = feature_vect_dense[0]
vect2 = feature_vect_dense[1]

# Cosine similarity between the first and second sentences
similarity_simple = cos_similarity(vect1, vect2)
print('Sentence 1, Sentence 2 cosine similarity : {0:.3f}'.format(similarity_simple))

# Similarity between the first and third sentences, and between the second and third
vect3 = feature_vect_dense[2]

similarity_simple = cos_similarity(vect1, vect3)
print('Sentence 1, Sentence 3 cosine similarity : {0:.3f}'.format(similarity_simple))

similarity_simple = cos_similarity(vect2, vect3)
print('Sentence 2, Sentence 3 cosine similarity : {0:.3f}'.format(similarity_simple))

Sentence 1, Sentence 2 cosine similarity : 0.402
Sentence 1, Sentence 3 cosine similarity : 0.404
Sentence 2, Sentence 3 cosine similarity : 0.456

In [14]:

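# scikit-learn provides cosine similarity through the cosine_similarity() API
# The first parameter is the feature matrix of the reference document(s)
# The second parameter is the feature matrix of the documents to compare against
# cosine_similarity() accepts both sparse and dense input, as matrices or 2-D arrays.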

from sklearn.metrics.pairwise import cosine_similarity

similarity_simple_pair = cosine_similarity(feature_vect_simple[0], feature_vect_simple)
print(similarity_simple_pair)

[[1.         0.40207758 0.40425045]]

In [15]:

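# The first value, 1, is the similarity of the first document with itself
# The second value, 0.402, is the similarity between the first and second documents
# The third value, 0.404, is the similarity between the first and third documents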
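
As a quick check that sparse and dense inputs give the same result, here is a small sketch using a made-up 3x3 matrix rather than the TF-IDF features from this post. Note that the input must be 2-D, so a single row is taken with a slice (`dense[0:1]`) rather than as a 1-D vector.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (illustrative only): row 2 is exactly 2 * row 0, row 1 shares no features with row 0
dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0],
                  [2.0, 0.0, 4.0]])
sparse = csr_matrix(dense)

# Compare the first row against every row, once with dense and once with sparse input
sim_dense = cosine_similarity(dense[0:1], dense)
sim_sparse = cosine_similarity(sparse[0], sparse)

print(sim_dense)
print(sim_sparse)
```

In both cases the result comes back as a dense ndarray of shape (1, 3).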

In [16]:

similarity_simple_pair = cosine_similarity(feature_vect_simple[0], feature_vect_simple[1:])
print(similarity_simple_pair)

[[0.40207758 0.40425045]]

In [17]:

# As shown above, slicing can be used as well

In [19]:

# Cosine similarity can also be returned for every pair of documents
similarity_simple_pair = cosine_similarity(feature_vect_simple, feature_vect_simple)
print(similarity_simple_pair)

[[1.         0.40207758 0.40425045]
 [0.40207758 1.         0.45647296]
 [0.40425045 0.45647296 1.        ]]
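
The shape of this pairwise output (ones on the diagonal, symmetric across it) follows directly from the definition. A minimal NumPy sketch on a made-up matrix `X` (not the TF-IDF features above): scale each row to unit length, then multiply the matrix by its own transpose.

```python
import numpy as np

# Toy document-term matrix (illustrative values only)
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])

# Scale each row to unit L2 length
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities
pairwise = unit_rows @ unit_rows.T
print(pairwise)
```

Entry (i, j) is the cosine similarity between rows i and j, so the diagonal is each row compared with itself.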

In [20]:

# Measuring document similarity with the Opinion Review dataset

In [33]:

import pandas as pd
import glob, os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Directory setting (adjust to wherever the dataset is unpacked)
path = r'C:\Users\82102\Desktop\sw\PythonMLGuide\OpinosisDataset1.0\OpinosisDataset1.0\topics'

# Collect the names of all .data files under the directory given by path
all_files = glob.glob(os.path.join(path, "*.data"))
filename_list = []
opinion_text = []

# File names are collected in filename_list
# File contents are loaded into a DataFrame, converted back to a string, and collected in opinion_text
for file_ in all_files :
    # Read each file into a DataFrame
    df = pd.read_table(file_, index_col = None, header = 0, encoding = 'latin1')

    # Strip the directory part of the absolute path (os.path.basename works regardless of the path separator),
    # then drop everything from the first '.', which also removes the trailing .data extension
    filename_ = os.path.basename(file_)
    filename = filename_.split('.')[0]

    # Append the file name and the file contents to their respective lists
    filename_list.append(filename)
    opinion_text.append(df.to_string())

# Build a DataFrame from the file name list and the file contents list
document_df = pd.DataFrame({'filename' : filename_list, 'opinion_text' : opinion_text})

In [34]:

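# Define the tokenizer function before running TF-IDF feature vectorization
# Note: word_tokenize and WordNetLemmatizer need NLTK data downloaded beforehand,
# e.g. nltk.download('punkt') and nltk.download('wordnet') (resource names can vary by NLTK version)
from nltk.stem import WordNetLemmatizer
import nltk
import string

# ord returns the Unicode code point of a character; string.punctuation holds the ASCII punctuation characters
# Mapping each code point to None makes str.translate() delete that character
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)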
lemmar = WordNetLemmatizer()

def LemTokens(tokens) : 
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text) :
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
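
To see what the `translate(remove_punct_dict)` step does on its own, here is a small standard-library sketch. The sample sentence is adapted from `doc_list` above with an exclamation mark added. The `str.maketrans` call is an alternative way to build the same table, not something used in the original code.

```python
import string

# Same table as in the post: each punctuation code point maps to None (i.e. delete)
remove_punct_dict = {ord(punct): None for punct in string.punctuation}

text = "If you take the red pill, you stay in Wonderland!"
cleaned = text.lower().translate(remove_punct_dict)
print(cleaned)

# Equivalent table built with str.maketrans (third argument = characters to delete)
same_table = str.maketrans("", "", string.punctuation)
print(same_table == remove_punct_dict)
```

Lowercasing and punctuation removal happen before tokenization, so `nltk.word_tokenize` only ever sees plain lowercase words.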

In [35]:

tfidf_vect = TfidfVectorizer(tokenizer = LemNormalize, stop_words = 'english',
                            ngram_range = (1, 2), min_df = 0.05, max_df = 0.85)
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

C:\Users\82102\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
  warnings.warn('Your stop_words may be inconsistent with '

In [36]:

km_cluster = KMeans(n_clusters = 3, max_iter = 10000, random_state = 0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_
document_df['cluster_label'] = cluster_label

In [37]:

# First extract the documents that were clustered under the hotel topic
# Then extract the corresponding rows of the TfidfVectorizer output

from sklearn.metrics.pairwise import cosine_similarity

# In this run, cluster_label == 2 is the hotel cluster. Extract those indexes from the DataFrame
hotel_indexes = document_df[document_df['cluster_label'] == 2].index
print('DataFrame Index of documents clustered as hotel : ', hotel_indexes)

# Take the first document in the hotel cluster and show its file name
# hotel_indexes holds index labels, so look the row up with .loc rather than .iloc
comparison_docname = document_df.loc[hotel_indexes[0], 'filename']
print('##### Similarity between reference document', comparison_docname, 'and the other documents #####')

'''
Index feature_vect with the Index object taken from document_df to get the hotel-cluster rows of feature_vect.
Use them to measure the cosine similarity between the first hotel document and the other hotel documents.
'''
similarity_pair = cosine_similarity(feature_vect[hotel_indexes[0]], feature_vect[hotel_indexes])
print(similarity_pair)

DataFrame Index of documents clustered as hotel :  Int64Index([1, 13, 14, 15, 20, 21, 24, 28, 30, 31, 32, 38, 39, 40, 45, 46], dtype='int64')
##### Similarity between reference document bathroom_bestwestern_hotel_sfo and the other documents #####
[[1.         0.0430688  0.05221059 0.06189595 0.05846178 0.06193118
  0.03638665 0.11742762 0.38038865 0.32619948 0.51442299 0.11282857
  0.13989623 0.1386783  0.09518068 0.07049362]]

In [39]:

# Sort the other documents by their similarity to the first document, highest first, and visualize
# cosine_similarity() returns a 2-D ndarray, so use reshape(-1) to flatten it
# before using it as a pandas index

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Positions sorted by descending similarity to the first document, excluding the document itself
sorted_index = similarity_pair.argsort()[:, ::-1]
sorted_index = sorted_index[:, 1:]

# Reorder hotel_indexes by descending similarity
hotel_sorted_indexes = hotel_indexes[sorted_index.reshape(-1)]

# Reorder the similarity values in descending order, again excluding the document itself
hotel_1_sim_value = np.sort(similarity_pair.reshape(-1))[::-1]
hotel_1_sim_value = hotel_1_sim_value[1:]

# Use the sorted indexes and similarity values to plot file name vs. similarity as a bar chart
# hotel_sorted_indexes holds index labels, so look the rows up with .loc
hotel_1_sim_df = pd.DataFrame()
hotel_1_sim_df['filename'] = document_df.loc[hotel_sorted_indexes, 'filename']
hotel_1_sim_df['similarity'] = hotel_1_sim_value

sns.barplot(x = 'similarity', y = 'filename', data = hotel_1_sim_df)
plt.title(comparison_docname)

Out[39]:

Text(0.5, 1.0, 'bathroom_bestwestern_hotel_sfo')
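
Since the chart itself is not reproduced here, the ranking step can be checked with a short NumPy sketch. The similarity row and the index labels are copied from the printed output earlier in this post. The variable names `order` and `top3_*` are introduced only for this illustration.

```python
import numpy as np

# Values copied from the printed cosine_similarity output and DataFrame index
similarity_pair = np.array([[1.0, 0.0430688, 0.05221059, 0.06189595, 0.05846178, 0.06193118,
                             0.03638665, 0.11742762, 0.38038865, 0.32619948, 0.51442299, 0.11282857,
                             0.13989623, 0.1386783, 0.09518068, 0.07049362]])
hotel_indexes = np.array([1, 13, 14, 15, 20, 21, 24, 28, 30, 31, 32, 38, 39, 40, 45, 46])

# argsort is ascending, so reverse each row; then drop the first entry (the document itself)
order = similarity_pair.argsort()[:, ::-1]
order = order[:, 1:].reshape(-1)

top3_positions = order[:3].tolist()                          # positions within the hotel cluster
top3_labels = hotel_indexes[order[:3]].tolist()              # DataFrame index labels
top3_values = similarity_pair.reshape(-1)[order[:3]].tolist()

print(top3_positions, top3_labels, top3_values)
```

Indexing the similarity row with the same `order` array keeps labels and values aligned, which is a slightly safer pattern than sorting the values separately with `np.sort`.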
