AI and Big Data Convergence Management Study Note

[ML Class] Week 10 Lab 1: pipeline_basics

SubjectOwner 2023. 11. 21. 15:53
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectPercentile
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
# load and split the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)
### Building Pipelines
The **Pipeline** is built using a list of **(key, value)** pairs, where the **key** is a string containing the name you want to give this step and **value** is an estimator object:
# Build Pipelines
from sklearn.pipeline import Pipeline

# Each step must be passed as a (name, estimator) tuple.
pipe = Pipeline([('scaler', MinMaxScaler()), ('selector', SelectPercentile()), ("svm", SVC())])
You only have to call **fit** once to train the whole sequence of estimators, and **predict** (or **score**) once to run new data through all of them:
pipe.fit(X_train, y_train).score(X_test, y_test)
#0.9440559440559441
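To see what the single **fit**/**score** call saves, here is a minimal sketch that runs the same three steps by hand and compares the result with the pipeline. The variable names `manual_score` and `pipe_score` are only for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Manual version: fit each step on the training data, then pass its output on.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
selector = SelectPercentile().fit(X_train_scaled, y_train)
X_train_selected = selector.transform(X_train_scaled)
svm = SVC().fit(X_train_selected, y_train)

# At test time only transform() is called; nothing is re-fitted on the test set.
X_test_selected = selector.transform(scaler.transform(X_test))
manual_score = svm.score(X_test_selected, y_test)

# Pipeline version: the same sequence behind one fit call and one score call.
pipe = Pipeline([('scaler', MinMaxScaler()),
                 ('selector', SelectPercentile()),
                 ('svm', SVC())])
pipe_score = pipe.fit(X_train, y_train).score(X_test, y_test)

print(manual_score, pipe_score)
```

Both numbers should match, since the pipeline performs the same fit and transform calls in the same order.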

 

### Using Pipelines in Grid-searches
Parameters of the estimators in the pipeline should be defined using the **estimator__parameter** syntax, i.e. the step name, two underscores, then the parameter name:
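A quick way to check which names are valid is to ask the pipeline itself. The sketch below rebuilds the same three-step pipeline and lists its parameters with `get_params()`; the values passed to `set_params` are arbitrary and only show that the naming works.

```python
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', MinMaxScaler()),
                 ('selector', SelectPercentile()),
                 ('svm', SVC())])

# Every nested parameter is exposed as <step name>__<parameter name>.
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)

# set_params uses the same naming, so one call can reach into any step.
pipe.set_params(svm__C=10, selector__percentile=50)
print(pipe.named_steps['svm'].C, pipe.named_steps['selector'].percentile)
```

Any key that shows up in `tunable` can be used as a key of a `param_grid` dictionary.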
 

 

param_grid = {
    'selector__percentile': range(10, 100, 10),
    'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(
    grid.best_score_))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))
print(grid.score(X_test, y_test))
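After the search, `grid.best_estimator_` is itself a fitted pipeline, so each step can be inspected. The following is a minimal sketch that uses a reduced grid (an assumption, chosen only to keep the run short) and checks which features the selector kept. Because the whole pipeline is re-fitted inside every cross-validation fold, the scaler and the selector only ever see the training part of that fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

pipe = Pipeline([('scaler', MinMaxScaler()),
                 ('selector', SelectPercentile()),
                 ('svm', SVC())])

# Reduced grid for illustration only; it is smaller than the one in the lab.
small_grid = {
    'selector__percentile': [30, 70, 100],
    'svm__C': [1, 10, 100],
    'svm__gamma': [0.01, 0.1, 1],
}
grid = GridSearchCV(pipe, param_grid=small_grid, cv=5)
grid.fit(X_train, y_train)

# best_estimator_ is the pipeline re-fitted on all of X_train with the best parameters.
best_pipe = grid.best_estimator_
mask = best_pipe.named_steps['selector'].get_support()
print("Best parameters:", grid.best_params_)
print("Selected features:", mask.sum(), "of", mask.size)
print(cancer.feature_names[mask])
```

Calling `grid.score` is the same as calling `score` on `best_pipe`.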
C parameter: hard margin vs. soft margin

Kernel Trick

Gamma parameter
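As a small numeric illustration of the gamma parameter: `SVC` uses the RBF kernel by default, which is k(x, z) = exp(-gamma * ||x - z||^2). The sketch below computes this by hand for two made-up points and compares it with `sklearn.metrics.pairwise.rbf_kernel`. The points and the gamma values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])  # squared Euclidean distance to x is 2

results = []
for gamma in [0.01, 1, 100]:
    # RBF kernel written out by hand
    manual = np.exp(-gamma * np.sum((x - z) ** 2))
    # Same quantity computed by scikit-learn
    library = rbf_kernel(x, z, gamma=gamma)[0, 0]
    results.append((gamma, manual, library))
    print(gamma, manual, library)
```

The similarity between the two points drops as gamma grows, so a large gamma makes each training point influence only its close neighbourhood. For C, scikit-learn's regularization strength is inversely proportional to C: a small C gives a softer margin, a large C a harder one.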

 

### Convenient Pipeline creation with *make_pipeline*
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), 
                      ('selector', SelectPercentile(percentile=70)),
                      ("svm", SVC(C=100,gamma=0.01))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SelectPercentile(percentile=70), SVC(C=100,gamma=0.01))

print("Pipeline steps:\n{}".format(pipe_short.steps))
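The step names that `make_pipeline` generates are the lowercased class names, and these are the names to use with `named_steps` or in a parameter grid. A minimal sketch, where `step_names` and `grid_keys` are names made up for this example:

```python
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

pipe_short = make_pipeline(MinMaxScaler(),
                           SelectPercentile(percentile=70),
                           SVC(C=100, gamma=0.01))

# Names are generated automatically from the class names.
step_names = [name for name, _ in pipe_short.steps]
print(step_names)

# named_steps gives dictionary-style access to a step by its generated name.
svc = pipe_short.named_steps['svc']
print(svc.C, svc.gamma)

# Grid-search keys must use the generated names, e.g. 'svc__C' rather than 'svm__C'.
grid_keys = [key for key in pipe_short.get_params() if key.startswith('svc__')]
print('svc__C' in grid_keys)
```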

"""
0.9440559440559441
Best cross-validation accuracy: 0.98
Test set score: 0.98
Best parameters: {'selector__percentile': 70, 'svm__C': 100, 'svm__gamma': 0.01}
0.9790209790209791
Pipeline steps:
[('minmaxscaler', MinMaxScaler()), ('selectpercentile', SelectPercentile(percentile=70)), ('svc', SVC(C=100, gamma=0.01))]
"""

 

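# Additional example: when the same estimator class appears more than once,
# make_pipeline appends a number to the generated step names.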
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), 
                     StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))
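A pipeline like this one has no classifier at the end, so it can be used as a transformer. Here is a minimal sketch, assuming the breast cancer data from the beginning of the lab, that shows the numbered step names and reaches into the fitted PCA step:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())

# The two StandardScaler steps get numbered names so they stay unique.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# All steps are transformers, so fit_transform runs scale -> PCA -> scale.
X_reduced = pipe.fit_transform(cancer.data)
print(X_reduced.shape)

# Fitted attributes of a step are reachable through named_steps.
components = pipe.named_steps['pca'].components_
print(components.shape)
```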