[ML수업] 10주차 실습5: pipeline

Notice

Recent Posts

Recent Comments

Link

« 2025/08 »
일	월	화	수	목	금	토
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30
31

Tags more

Archives

Today

Total

관리 메뉴

AI·빅데이터 융합 경영학 Study Note

[ML수업] 10주차 실습5: pipeline_stacking 예시 코드 본문

AI·ML

[ML수업] 10주차 실습5: pipeline_stacking 예시 코드

SubjectOwner 2024. 10. 24. 20:34

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>

# Maria Telenczuk <https://github.com/maikia>

# License: BSD 3 clause

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # still experimental 
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler, PowerTransformer 
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from category_encoders import TargetEncoder  # scikit-learn과 호환됨
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn import set_config

import optuna

1. Download the dataset

data = pd.read_csv('allstate_train.csv')
data.head()

# 수치형/범주형 피처 분리
numeric_features = ['group_size','car_age','age_oldest','age_youngest','duration_previous','cost']
categorical_features = ['day','homeowner','car_value','risk_factor','married_couple','C_previous','state','shopping_pt']

# 학습/평가 데이터 분할
X_train, X_test, y_train, y_test = train_test_split(data[numeric_features+categorical_features], 
                                                    data['record_type'], test_size=0.9, 
                                                    stratify=data['record_type'], random_state=0)

2. Make pipeline to preprocess the data

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

cat_tree_processor = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    encoded_missing_value=-2,
)
num_tree_processor = SimpleImputer(strategy="mean", add_indicator=True)

tree_preprocessor = make_column_transformer(
    (num_tree_processor, numeric_features), (cat_tree_processor, categorical_features)
)
tree_preprocessor

from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_linear_processor = OneHotEncoder(handle_unknown="ignore")
num_linear_processor = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
)

linear_preprocessor = make_column_transformer(
    (num_linear_processor, numeric_features), (cat_linear_processor, categorical_features)
)
linear_preprocessor

3. Stack of predictors on a single data set

from sklearn.linear_model import LogisticRegression

lr_pipeline = make_pipeline(linear_preprocessor, LogisticRegression())
lr_pipeline

from sklearn.ensemble import RandomForestClassifier

rf_pipeline = make_pipeline(tree_preprocessor, RandomForestClassifier(random_state=42))
rf_pipeline

from sklearn.ensemble import HistGradientBoostingClassifier

gbdt_pipeline = make_pipeline(
    tree_preprocessor, HistGradientBoostingClassifier(random_state=0)
)
gbdt_pipeline

4. Measure and plot the results

from sklearn.model_selection import cross_val_score

for m in estimators + [("Stacking Classifier", stacking_cla)]:
    scores = cross_val_score(m[1], X_train, y_train, cv=3)
    print(f"{m[0]}: {scores.mean():.3f}")

Random Forest: 0.866

LR: 0.869

Gradient Boosting: 0.869

Stacking Classifier: 0.870

저작자표시 비영리 변경금지 (새창열림)

'AI·ML' 카테고리의 다른 글

[ML수업] 5주차 실습1: Feature Engineering- Missing Data Handling (0)	2024.10.28
scikit-learn의 make_column_selector로 범주형 수치형 자동 구분 (0)	2024.10.24
깃 충돌 나지 않게 pull requests로 올리는 방법 (0)	2024.07.18
[GPT뿌시기] ChatGPT 출력 형식 지정 (1)	2024.01.04
[GPT 뿌시기] ChatGPT에서 저작권 없는 이미지 생성 (2)	2024.01.04

'AI·ML' Related Articles

AI·빅데이터 융합 경영학 Study Note

[ML수업] 10주차 실습5: pipeline_stacking 예시 코드 본문

[ML수업] 10주차 실습5: pipeline_stacking 예시 코드

'AI·ML' 카테고리의 다른 글

티스토리툴바