Notice

Recent Posts

Recent Comments

Link

Tags more

Archives

Today

Total

관리 메뉴

AI 공부 저장소

[파이썬 머신러닝 완벽가이드] Chap08-4. 20 뉴스그룹 분류 본문

Artificial Intelligence/[파이썬 머신러닝 완벽가이드]

[파이썬 머신러닝 완벽가이드] Chap08-4. 20 뉴스그룹 분류

aiclaudev 2022. 1. 16. 22:30

- 본 글은 파이썬 머신러닝 완벽 가이드 (권철민 지음)을 공부하며 내용을 추가한 정리글입니다. 개인 공부를 위해 만들었으며, 만약 문제가 발생할 시 글을 비공개로 전환함을 알립니다.

In [1]:

from sklearn.datasets import fetch_20newsgroups
import pandas as pd

# news_data는 파이썬 딕셔너리 형태와 비슷한 Bunch 객체
news_data = fetch_20newsgroups(subset = 'all', random_state = 156) # subset을 통해 갖고오고자 하는 데이터를 선정해 가져올 수 있음.
                                                                   # 예를 들어, subset = 'test'라면 미리 분류된 test set을 가져옴

print('target 클래스의 값과 분포도 \n', pd.Series(news_data.target).value_counts().sort_index())
print('target 클래스의 이름들 \n', news_data.target_names)

target 클래스의 값과 분포도 
 0     799
1     973
2     985
3     982
4     963
5     988
6     975
7     990
8     996
9     994
10    999
11    991
12    984
13    990
14    987
15    997
16    910
17    940
18    775
19    628
dtype: int64
target 클래스의 이름들 
 ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

In [2]:

print(news_data.data[0])

From: egreen@east.sun.com (Ed Green - Pixel Cruncher)
Subject: Re: Observation re: helmets
Organization: Sun Microsystems, RTP, NC
Lines: 21
Distribution: world
Reply-To: egreen@east.sun.com
NNTP-Posting-Host: laser.east.sun.com

In article 211353@mavenry.altcit.eskimo.com, maven@mavenry.altcit.eskimo.com (Norman Hamer) writes:
> 
> The question for the day is re: passenger helmets, if you don't know for 
>certain who's gonna ride with you (like say you meet them at a .... church 
>meeting, yeah, that's the ticket)... What are some guidelines? Should I just 
>pick up another shoei in my size to have a backup helmet (XL), or should I 
>maybe get an inexpensive one of a smaller size to accomodate my likely 
>passenger? 

If your primary concern is protecting the passenger in the event of a
crash, have him or her fitted for a helmet that is their size.  If your
primary concern is complying with stupid helmet laws, carry a real big
spare (you can put a big or small head in a big helmet, but not in a
small one).

---
Ed Green, former Ninjaite |I was drinking last night with a biker,
  Ed.Green@East.Sun.COM   |and I showed him a picture of you.  I said,
DoD #0111  (919)460-8302  |"Go on, get to know her, you'll like her!"
 (The Grateful Dead) -->  |It seemed like the least I could do...

In [3]:

# 뉴스그룹 기사의 내용뿐만 아니라, 뉴스그룹 제목, 작성자, 소속, 이메일 등의 다양한 정보가 있음
# 내용을 제외하고 제목 등의 다른 정보는 제거.
# 제목과 소속, 이메일 주소 등의 헤더와 푸터 정보들은 Target클래스 값과 유사한 데이터를 가지고 있는 경우가 많기에 이를 포함하면 
# 웬만한 ML 알고리즘을 적용해도 높은 성능을 보임

In [5]:

# remove 파라미터를 이용하여 뉴스그룹 기사의 헤더와 푸터를 제거할 수 있음 => 내용만 추출하기 위해 headers, footer, quotes를 remove
train_news = fetch_20newsgroups(subset = 'train', remove = ('headers', 'footer', 'quotes'), random_state = 156) # subset을 통해 train 셋만을 추출

X_train = train_news.data
y_train = train_news.target

# test에 대해 remove 파라미터 적용
test_news = fetch_20newsgroups(subset = 'test', remove = ('headers', 'footer', 'quotes'), random_state = 156)

X_test = test_news.data
y_test = test_news.target

print('데이터 타입 : ', type(X_test))
print('학습 데이터 수 : {0}, 테스트 데이터 수 : {1}'.format(len(X_train), len(X_test)))

데이터 타입 :  <class 'list'>
학습 데이터 수 : 11314, 테스트 데이터 수 : 7532

In [9]:

# CountVectorizer로 피처 추출(벡터화)하기.
# 주의점 : CountVectorizer 사용 시 fit()과 transform()을 하는데, 학습데이터를 fit()한 객체에 그대로 테스트데이터를 변환해야함
# 테스트 데이터에 다시 fit()한 후 transform할 경우 피처의 개수가 달라질 수도 있음.
# 비슷한 맥락으로, fit_transform()과 같이 fit과 transform을 동시에 수행하는 것을 지양해야함.

from sklearn.feature_extraction.text import CountVectorizer

# Count Vectorization으로 피처 벡터화 수행
cnt_vect = CountVectorizer()
cnt_vect.fit(X_train)
X_train_cnt_vect = cnt_vect.transform(X_train)

# 학습 데이터로 fit()된 CountVectorizer 객체를 이용해 테스트 데이터를 변환해야함
X_test_cnt_vect = cnt_vect.transform(X_test)

print('학습 데이터 텍스트의 CountVectorizer Shape : ', X_train_cnt_vect.shape)

학습 데이터 텍스트의 CountVectorizer Shape :  (11314, 113393)

In [10]:

# 위 결과를 통해 11314개의 문서에서 113393개의 단어가 feature로 만들어졌음을 알 수 있음

In [12]:

# 로지스틱 회귀를 적용해 뉴스그룹에 대한 분류 예측하기
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# LogisticRegression을 이용해 학습/예측/평가 수행
lr_clf = LogisticRegression()
lr_clf.fit(X_train_cnt_vect, y_train)
pred = lr_clf.predict(X_test_cnt_vect)
print('CountVectorized Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

CountVectorized Logistic Regression의 예측 정확도는 0.665

C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [13]:

# TF-IDF 피처 벡터화로 모델링 수행하기
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF 벡터화를 적용해 학습 데이터 세트와 테스트 데이터 세트 변환
tfidf_vect = TfidfVectorizer()
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

# LogisticRegression 수행
lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
print('TF-IDF Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

TF-IDF Logistic Regression의 예측 정확도는 0.719

In [14]:

# CountVectorizer보다 TF-IDF 기반의 피처 벡터화가 일반적으로 더 높은 예측 정확도를 가짐

In [15]:

# 머신러닝 모델의 성능을 향상시키는 중요한 2가지 방법
# 1) 최적의 ML 알고리즘 선정   2) 최상의 피처 전처리 수행하는 것
# 텍스트 정규화나 Count/TF-IDF 기반 피처 벡터화를 어떻게 효과적으로 적용했는지가 성능에 큰 영향을 미칠 수 있음

# TfidVectorizer 클래스의 스톱워드를 기존 'None'에서 'english'로, ngram_range는 (1,1)에서 (1,2)로, max_df=300으로
tfidf_vect = TfidfVectorizer(stop_words = 'english', ngram_range = (1,2), max_df = 300)
tfidf_vect.fit(X_train)
X_train_tfidf_vect = tfidf_vect.transform(X_train)
X_test_tfidf_vect = tfidf_vect.transform(X_test)

# Logistic Regression 수행
lr_clf = LogisticRegression()
lr_clf.fit(X_train_tfidf_vect, y_train)
pred = lr_clf.predict(X_test_tfidf_vect)
print('TF-IDF Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

TF-IDF Logistic Regression의 예측 정확도는 0.736

In [17]:

# GridSearchCV를 이용해 Logistic Regression의 하이퍼 파라미터 최적화 수행
from sklearn.model_selection import GridSearchCV

# 최적 C 값 도출 튜닝 수행. CV는 3 폴드 세트로 설정
params = {'C' : [0.01, 0.1, 1, 5, 10]}
grid_cv_lr = GridSearchCV(lr_clf, param_grid = params, cv=3, scoring = 'accuracy', verbose = 1)
grid_cv_lr.fit(X_train_tfidf_vect, y_train)
print('Logistic Regression best C parameter : ', grid_cv_lr.best_params_)

# 최적 C 값으로 학습된 grid_cv로 예측 및 정확도 평기
pred = grid_cv_lr.predict(X_test_tfidf_vect)
print('TF-IDF Vectorized Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

Fitting 3 folds for each of 5 candidates, totalling 15 fits

C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic Regression best C parameter :  {'C': 10}
TF-IDF Vectorized Logistic Regression의 예측 정확도는 0.741

In [18]:

# 하이퍼 파라미터 튜닝으로 Accuracy가 더욱 향상된 것을 알 수 있음

In [21]:

# 사이킷런의 파이프라인(Pipeline) 사용 및 GridSearchCV와의 결합
# Pipeline 클래스 사용 시 피처 벡터화와 ML 알고리즘 학습/예측을 위한 코드 작성을 한 번에 진행할 수 있음
# 일반적으로 사이킷런 파이프라인은 텍스트 기반 피처 벡터화뿐만 아니라 모든 데이터 전처리 작업과 Estimator를 결합할 수 있음
# 예를 들어, 스케일링 또는 벡터 정규화, PCA 등의 변환 작업과 분류, 회귀 등의 Estimator를 한번에 결합하는 것

from sklearn.pipeline import Pipeline

# TfidVectorizer 객체를 tfidf_vect로, LogisticRegression객체를 lr_clf로 생성하는 Pipeline 객체 생성
pipeline = Pipeline([('tfidf_vect', TfidfVectorizer(stop_words = 'english', ngram_range = (1, 2), max_df = 300)),
                   ('lr_clf', LogisticRegression(C = 10))])

# 별도의 TfidVectorizer 객체의 fit(), transform()과 LogisticRegression의 fit(), predict()가 필요없음
# pipeline의 fit()과 predict()만으로 한꺼번에 피처 벡터화와 ML 학습/예측이 가능
pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
print('Pipeline을 통한 Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred)))

C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Pipeline을 통한 Logistic Regression의 예측 정확도는 0.741

In [ ]:

# GridSearchCV 클래스의 생성 파라미터로 Pipeline을 입력하기

pipeline = Pipeline([('tfidf_vect', TfidfVectorizer(stop_words = 'english',)), 
                    ('lr_clf', LogisticRegression())])

# Pipeline에 기술된 각각의 객체 변수에 언더바(_) 2개를 연달아 붙여 GridSearchCV에 사용될 파라미터/하이퍼 파라미터
# 이름과 값을 설정한다.
# 앞에 객체 변수 이름을 붙여야 하는 이유는, 최적화 할 파라미터가 피처 벡터화 시 사용되는 파라미터인지, 
# 모델링의 하이퍼 파라미터인지 구분하기 위함임
params = {'tfidf_vect__ngram_range' : [(1, 1), (1, 2), (1, 3)],
         'tfidf_vect__max_df' : [100, 300, 700],
         'lr_clf__C' : [1, 5, 10]}

# GridSearchCV 생성자에 Estimator 대신 Pipeline 입력
grid_cv_pipe = GridSearchCV(pipeline, param_grid = params, cv = 3, scoring = 'accuracy', verbose = 1)
grid_cv_pipe.fit(X_train, y_train)
print(gric_cv_pipe.best_params_, grid_cv_pipe_best_score_)

pred = grid_cv_pipe.predict(X_test)
print('Pipeline을 통한 Logistic Regression의 예측 정확도는 {0:.3f}'.format(accuracy_score(y_test, pred))) # 0.702 나옴

Fitting 3 folds for each of 27 candidates, totalling 81 fits

C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
C:\Users\82102\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

- 본 글은 파이썬 머신러닝 완벽 가이드 (권철민 지음)을 공부하며 내용을 추가한 정리글입니다. 개인 공부를 위해 만들었으며, 만약 문제가 발생할 시 글을 비공개로 전환함을 알립니다.

'Artificial Intelligence > [파이썬 머신러닝 완벽가이드]' 카테고리의 다른 글

[파이썬 머신러닝 완벽가이드] Chap08-6. 토픽 모델링(Topic Modeling) - 20 뉴스그룹 (0)	2022.01.22
[파이썬 머신러닝 완벽가이드] Chap08-5. 감정 분석 (0)	2022.01.18
[파이썬 머신러닝 완벽가이드] Chap08-3. 피처 벡터화 - Bag Of Words(BOW) (0)	2022.01.14
[파이썬 머신러닝 완벽가이드] Chap08-2. 텍스트 사전 준비 작업(텍스트 전처리) - 텍스트 정규화 (0)	2022.01.12
[파이썬 머신러닝 완벽가이드] Chap08-1. 텍스트 분석 이해 (0)	2022.01.12

'Artificial Intelligence/[파이썬 머신러닝 완벽가이드]' Related Articles

« 2025/07 »
일	월	화	수	목	금	토
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31

AI 공부 저장소

[파이썬 머신러닝 완벽가이드] Chap08-4. 20 뉴스그룹 분류 본문

[파이썬 머신러닝 완벽가이드] Chap08-4. 20 뉴스그룹 분류

- 본 글은 파이썬 머신러닝 완벽 가이드 (권철민 지음)을 공부하며 내용을 추가한 정리글입니다. 개인 공부를 위해 만들었으며, 만약 문제가 발생할 시 글을 비공개로 전환함을 알립니다.

- 본 글은 파이썬 머신러닝 완벽 가이드 (권철민 지음)을 공부하며 내용을 추가한 정리글입니다. 개인 공부를 위해 만들었으며, 만약 문제가 발생할 시 글을 비공개로 전환함을 알립니다.

'Artificial Intelligence > [파이썬 머신러닝 완벽가이드]' 카테고리의 다른 글

티스토리툴바