[ML] Supervised Learning (Classification) Models, Hyperparameter Tuning, and Model Evaluation

jpocket 2025. 5. 12. 19:41

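Supervised Learning Models and Model Evaluation

1. Supervised learning

Decision tree
- Decision tree hyperparameters
Random forest
- Random forest hyperparameters
XGBoost
- XGBoost hyperparameters
Early stopping

2. Model evaluation

Cross-validation
- KFold
- StratifiedKFold
- Cross-validation the easy way
Evaluation
- Evaluation metrics

Tool used: tldraw

Overall flowchart

Supervised learning models and model evaluation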

Loading the data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Build the dataset (breast cancer, binary classification)
from sklearn.datasets import load_breast_cancer

def make_dataset():
    cancer = load_breast_cancer()
    df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
    df['target'] = cancer.target
    # Hold out half of the rows as the test set
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop('target', axis=1), df['target'], test_size=0.5, random_state=1004)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = make_dataset()
X_train.shape, X_test.shape, y_train.shape, y_test.shape

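Supervised Learning (Classification)

Decision Tree

A supervised learning algorithm (classification and regression)
Intuitive, so it is easy to understand
Prone to overfitting, so the tree depth needs to be limited
Each split uses the feature (and threshold) with the largest information gain; impurity is measured with either the Gini index or entropy
If a node contains only one class, both entropy and Gini impurity are 0; for two classes in equal proportion, entropy reaches its maximum of 1 and Gini impurity its maximum of 0.5
Information gain is the parent node's impurity minus the weighted impurity of its child nodes, so maximizing it means searching for the split whose children have the lowest impurity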
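
The impurity measures and information gain described above can be worked through by hand. The following is a minimal sketch in plain Python; the helper names (gini, entropy, information_gain) and the toy label lists are made up for illustration and are not part of scikit-learn.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

pure = [1, 1, 1, 1]     # a node holding a single class
mixed = [0, 0, 1, 1]    # a node holding two classes in equal proportion

gini_pure, gini_mixed = gini(pure), gini(mixed)
entropy_pure, entropy_mixed = entropy(pure), entropy(mixed)

# A split that separates the classes perfectly recovers all of the parent's entropy
gain_perfect = information_gain(mixed, [0, 0], [1, 1])
# A split whose children are as mixed as the parent gains nothing
gain_useless = information_gain(mixed, [0, 1], [0, 1])

print(gini_pure, gini_mixed, entropy_pure, entropy_mixed)
print(gain_perfect, gain_useless)
```

For the 50/50 node, Gini impurity comes out at 0.5 and entropy at 1, which is why the two measures have different maxima in the binary case.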

from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0) # choose the model

model.fit(X_train, y_train) # train the model
pred = model.predict(X_test) # predict

accuracy_score(y_test, pred) # evaluate accuracy

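Decision Tree Hyperparameters

criterion (default gini): impurity measure
max_depth (default None): maximum depth of the tree; when the dataset is small, the tree is fully grown after a few levels, so accuracy stops changing beyond that depth
min_samples_split (default 2): minimum number of samples a node must hold before it can be split into child nodes
min_samples_leaf (default 1): minimum number of samples required at a leaf node

model = DecisionTreeClassifier(criterion='entropy',
                               max_depth=7,
                               min_samples_split=2,
                               min_samples_leaf=2,
                               random_state=0) # choose the model and set hyperparameters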
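
To see why accuracy levels off once max_depth passes a certain value, one option is to sweep max_depth and record how deep each tree actually grows. The sketch below assumes the same breast cancer data and split settings used earlier; the variable names (X_tr, X_te, results) are chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1004)

results = {}
for depth in [1, 2, 3, 5, 7, None]:   # None lets the tree grow until it cannot split further
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = {
        "actual_depth": tree.get_depth(),      # depth the fitted tree really reached
        "train_acc": tree.score(X_tr, y_tr),
        "test_acc": tree.score(X_te, y_te),
    }
    print(depth, results[depth])
```

Once actual_depth stops increasing, a larger max_depth no longer changes the fitted tree, so the scores stay the same. A widening gap between train_acc and test_acc is the usual sign that the tree has started to overfit.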

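Random Forest

A supervised learning algorithm (classification and regression)
Built from many decision trees, i.e. an ensemble of decision trees
Performs well (less prone to overfitting than a single tree)
Bootstrap sampling (rows are drawn from the dataset with replacement)
The final prediction is a majority vote across the trees
Ensembles -> bagging (random forest), boosting (XGBoost)
- *Bagging: train several models of the same algorithm on different bootstrap samples and combine their predictions
- *Boosting: train models sequentially, with each round putting more weight on what the previous rounds got wrong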
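
Bootstrap sampling and majority voting can be sketched by hand with plain decision trees. This is a simplified illustration under assumed settings (25 trees, max_features="sqrt"), not how RandomForestClassifier is implemented; scikit-learn's forest averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1004)

rng = np.random.default_rng(0)
n_trees = 25                       # odd, so a 0/1 vote can never tie
all_preds = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

votes = np.stack(all_preds)        # shape: (n_trees, n_test_samples)
# Majority vote for 0/1 labels: predict 1 when more than half of the trees say 1
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
ensemble_acc = (ensemble_pred == y_te).mean()
print(ensemble_acc)
```

Because every tree sees a different resample of the rows, their individual errors differ, and the vote tends to cancel some of them out.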

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

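Random Forest Hyperparameters

n_estimators (default 100) : number of trees
criterion (default gini) : impurity measure
max_depth (default None) : maximum depth of each tree
min_samples_split (default 2) : minimum number of samples a node must hold before it can be split into child nodes
min_samples_leaf (default 1) : minimum number of samples required at a leaf node

model = RandomForestClassifier(
    n_estimators=200,           # more trees means slower training and prediction
    criterion='gini',           # impurity measure ('gini' or 'entropy')
    max_depth=3,                # maximum depth of each tree
    min_samples_split=2,        # minimum number of samples needed to split a node
    min_samples_leaf=1,         # minimum number of samples required at a leaf node
    random_state=0              # seed for reproducibility
)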

XGBoost(eXtreme Gradient Boosting)

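One of the best-performing tree ensemble algorithms
A boosting-based (ensemble) algorithm
Weak learners are added one after another, each correcting the errors of the previous ones, to build up a strong model
Became popular after delivering outstanding results in Kaggle competitions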

from xgboost import XGBClassifier
model = XGBClassifier(random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
accuracy_score(y_test, pred)

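# To silence warnings on older XGBoost 1.x releases, pass the following
# (use_label_encoder is deprecated in newer releases):
# 1) use_label_encoder=False
# 2) eval_metric='logloss'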

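XGBoost Hyperparameters

booster (default gbtree) : boosting algorithm (alternatives: dart, gblinear)
objective (default binary:logistic) : binary classification (for multiclass: multi:softmax)
max_depth (default 6) : maximum depth of each tree
learning_rate (default 0.3 in current XGBoost; older releases of the scikit-learn wrapper used 0.1) : learning rate
- Gradient descent: iteratively moves toward the point where the gradient is 0
n_estimators (default 100) : number of trees
- If you lower learning_rate, raise n_estimators to compensate
subsample (default 1) : fraction of training rows sampled for each tree
colsample_bytree (default 1) : fraction of features sampled for each tree
n_jobs (default 1 in older releases; recent releases use the available threads when it is left unset) : number of cores to use (-1: use all cores)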
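
The trade-off between learning_rate and n_estimators can be illustrated with ordinary gradient descent on a toy function. This is only an analogy under stated assumptions: the function f(x) = x**2 and the helper gradient_descent are invented for illustration and are not XGBoost's algorithm, where each new tree plays the role of one step scaled by the learning rate.

```python
def gradient_descent(learning_rate, n_steps, start=10.0):
    """Minimize f(x) = x**2, whose gradient is 2*x and whose minimum is at x = 0."""
    x = start
    for _ in range(n_steps):
        grad = 2 * x                     # slope at the current position
        x = x - learning_rate * grad     # step against the slope, scaled by the learning rate
    return x

x_fast = gradient_descent(learning_rate=0.3, n_steps=10)        # large steps, few rounds
x_slow = gradient_descent(learning_rate=0.03, n_steps=10)       # small steps, same few rounds
x_slow_more = gradient_descent(learning_rate=0.03, n_steps=100) # small steps, many more rounds

print(x_fast, x_slow, x_slow_more)
```

With the smaller learning rate, ten steps leave x far from the minimum, while a hundred steps bring it close again. That is the intuition behind raising n_estimators when learning_rate is lowered.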

model = XGBClassifier(random_state=0,
                     booster='gbtree',
                     objective='binary:logistic',
                     max_depth=5,
                     learning_rate=0.1,
                     n_estimators=500,
                     subsample=1,
                     colsample_bytree=1,
                     n_jobs=-1)

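Early Stopping

model = XGBClassifier(random_state=0,
                      use_label_encoder=False,   # only needed on older 1.x releases
                      eval_metric='logloss',
                      learning_rate=0.1,
                      n_estimators=500,          # upper bound; training may stop earlier
                      early_stopping_rounds=10)  # stop if the eval metric has not improved for 10 rounds
# Note: the test set doubles as the early-stopping set here, so the score below is optimistic;
# a separate validation split is the safer choice.
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_set=eval_set)
pred = model.predict(X_test)
accuracy_score(y_test, pred)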
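
The "no improvement for N rounds" rule can be sketched without XGBoost at all. The function below is a simplified, hypothetical version of the patience logic (early_stopping_round and the loss values are made up for illustration); XGBoost's own bookkeeping may differ in detail.

```python
def early_stopping_round(eval_losses, patience):
    """Return (best_round, stop_round) for a sequence of per-round validation losses.

    Training stops once `patience` consecutive rounds pass without a new best loss.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx      # new best: reset the patience window
        elif round_idx - best_round >= patience:
            return best_round, round_idx                 # patience exhausted: stop here
    return best_round, len(eval_losses) - 1              # ran through every round

# Hypothetical validation losses: improvement until round 3, then a plateau
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.39, 0.40]
best_round, stop_round = early_stopping_round(losses, patience=3)
print(best_round, stop_round)
```

Here the best loss occurs at round 3 and training stops at round 6, after three rounds without improvement, so the remaining rounds are never run.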

Model Evaluation

KFold

Preparing the data

# Load the dataset
import pandas as pd
from sklearn.datasets import load_breast_cancer

def make_dataset2():
    cancer = load_breast_cancer()
    df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
    df['target'] = cancer.target
    return df.drop('target', axis=1), df['target']
X, y = make_dataset2()

KFold code

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(random_state=0)
kfold = KFold(n_splits=5)

for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(accuracy_score(y_test, pred))

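StratifiedKFold

Prevents folds from being skewed toward one class when the target classes are imbalanced

StratifiedKFold is a cross-validation scheme that splits the data while preserving the class ratio (e.g. the ratio of 0s to 1s)
In other words, both the training and validation portions of every fold keep roughly the original class ratio, so no class ends up ignored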
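
The difference is easiest to see on a deliberately imbalanced, sorted target. The sketch below uses a made-up label array (90 zeros followed by 10 ones) and a hypothetical helper, positive_rate_per_fold, to compare the share of positives in each held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced target: 90 negatives followed by 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))      # feature values do not matter for splitting

def positive_rate_per_fold(splitter):
    """Share of class 1 in the held-out part of each fold."""
    return [float(y_demo[test_idx].mean())
            for _, test_idx in splitter.split(X_demo, y_demo)]

kfold_rates = positive_rate_per_fold(KFold(n_splits=5))
stratified_rates = positive_rate_per_fold(StratifiedKFold(n_splits=5))

print(kfold_rates)        # positives are concentrated in the last fold
print(stratified_rates)   # every fold keeps the overall 10% positive rate
```

With plain KFold and no shuffling, four of the five validation folds contain no positives at all, so scores on those folds say nothing about the minority class.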

from sklearn.model_selection import StratifiedKFold
model = DecisionTreeClassifier(random_state=0)

stratifiedKfold = StratifiedKFold(n_splits=5)
for train_idx, test_idx in stratifiedKfold.split(X,y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(accuracy_score(y_test, pred))

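Cross-Validation the Easy Way

scikit-learn's built-in API runs fit -> predict -> evaluation in a single call

When cv is given as an integer, cross_val_score picks the splitter based on whether the estimator is a classifier and whether y holds class labels or continuous values.

Class labels (with a classifier) -> StratifiedKFold
Continuous values -> KFold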
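
One way to check which splitter an integer cv resolves to is scikit-learn's check_cv helper, which applies the same rule. The sketch below uses made-up targets (y_class, y_cont) and treats the classifier flag as a stand-in for "the estimator is a classifier".

```python
import numpy as np
from sklearn.model_selection import check_cv

y_class = np.array([0, 1] * 10)          # discrete class labels
y_cont = np.linspace(0.0, 1.0, 20)       # continuous target

cv_for_classifier = check_cv(5, y_class, classifier=True)
cv_for_regressor = check_cv(5, y_cont, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

The first call resolves to StratifiedKFold and the second to KFold. Passing an explicit splitter object as cv, as in the code that follows, bypasses this automatic choice.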

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5) # StratifiedKFold for a classifier with class labels in y, KFold otherwise

# StratifiedKFold
stratifiedKfold = StratifiedKFold(n_splits=5)
scores = cross_val_score(model,X,y,cv=stratifiedKfold)

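Evaluation (Classification Models)

Accuracy: the share of predictions that match the actual values
Precision: of the samples predicted positive, the share that are actually positive (of those predicted to have cancer, how many actually do)
Recall: of the actual positives, the share predicted positive (cancer cases correctly identified as cancer)
F1: the harmonic mean of precision and recall
ROC-AUC
- ROC: the curve of the true positive rate (TPR) plotted against the false positive rate (FPR) as the decision threshold varies
- AUC: the area under the ROC curve (1 for a perfect classifier, 0.5 for random guessing)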
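
The definitions above can be checked against scikit-learn by computing each metric from the confusion-matrix counts. The labels below are a small made-up example (4 actual positives, 6 actual negatives), not output from the models in this post.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # true positives / predicted positives
recall = tp / (tp + fn)                      # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(accuracy, precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

In this example precision (3 of 5 predicted positives) is lower than recall (3 of 4 actual positives), which shows the two metrics penalize different kinds of mistakes: false positives versus false negatives.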

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

print(accuracy_score(y_test, pred)) # accuracy
print(precision_score(y_test, pred)) # precision
print(recall_score(y_test, pred)) # recall
print(f1_score(y_test, pred)) # F1

# ROC-AUC
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

model = XGBClassifier(random_state=0)
model.fit(X_train, y_train)
pred = model.predict_proba(X_test) # class probabilities instead of 0/1 labels

roc_auc_score(y_test, pred[:,1]) # score with the probability of the positive class (column 1)
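
AUC also has a ranking interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks this on four made-up scores; the variable names are illustrative only.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]     # hypothetical predicted probabilities of class 1

pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]

# Count a win when the positive outranks the negative, and half a win on ties
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
auc_by_ranking = wins / (len(pos) * len(neg))
auc_sklearn = roc_auc_score(y_true, scores)

print(auc_by_ranking, auc_sklearn)
```

Three of the four positive/negative pairs are ranked correctly, so both calculations give 0.75. This is also why roc_auc_score needs probabilities or scores rather than hard 0/1 predictions.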
