[ML Class] Week 8 Lab 4: Feature Engineering - Feature Selection

SubjectOwner 2023. 11. 21. 15:53

* Apply Model based feature selection, Univariate feature selection, or both.

* After that, apply Recursive feature elimination if the situation calls for it.

### Model based feature selection

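- SelectFromModel is a model-based feature selection technique: it keeps only the features whose importance, as computed by the given model, is at or above a threshold.
- The transformed data (X_train_sc3_fs1 and X_test_sc3_fs1) contains only the selected features, which can help improve model performance or reduce overfitting.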

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

select1 = SelectFromModel(RandomForestClassifier(random_state=0), threshold=None)

# fit learns the feature selection criterion from the X_train_sc3 data and the y_train labels.
X_train_sc3_fs1 = select1.fit(X_train_sc3, y_train).transform(X_train_sc3)
# transform keeps only the selected (important) features out of the original features of X_train_sc3 and X_test_sc3.
X_test_sc3_fs1 = select1.transform(X_test_sc3)

print("X_train_sc3.shape: {}, X_train_sc3_fs1.shape: {}".format(X_train_sc3.shape, X_train_sc3_fs1.shape))
#X_train_sc3.shape: (426, 30), X_train_sc3_fs1.shape: (426, 10)

This print statement compares the shape of X_train_sc3 with the shape of X_train_sc3_fs1. Here the number of rows stays at 426 while the number of features drops from 30 to 10.

mask = select1.get_support() creates a boolean mask. It is an array of True and False values with one entry per feature: selected features are marked True and features that were not selected are marked False.

plt.matshow(mask.reshape(1,-1), cmap="gray_r") visualizes this mask as a one-row black-and-white image. Because gray_r is the reversed gray colormap, True is drawn in the dark color (black) and False in the light color (white). The plot therefore shows at a glance which features were selected and which were not.

import matplotlib.pyplot as plt

mask = select1.get_support()
plt.matshow(mask.reshape(1,-1), cmap="gray_r")
plt.show()
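As a quick check of how the mask plot should be read, the sketch below asks the gray_r colormap directly which colors it assigns to the smallest and largest values. The mask here (demo_mask) is a hypothetical five-feature example, not the lab's mask, and the Agg backend is chosen only so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical mask: True marks a selected feature, False a dropped one.
demo_mask = np.array([True, False, True, False, False])

cmap = plt.get_cmap("gray_r")
color_for_true = cmap(1.0)   # RGBA used for the largest value (True)
color_for_false = cmap(0.0)  # RGBA used for the smallest value (False)
print("True  ->", color_for_true)
print("False ->", color_for_false)

# The selected feature positions can also be read straight from the mask.
selected_idx = np.flatnonzero(demo_mask)
print("selected feature indices:", selected_idx)
```

When the plot is hard to read, printing the indices from the mask like this is often the more practical option.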

# svm is assumed to be the classifier object created in an earlier lab of this series
svm.fit(X_train_sc3_fs1, y_train).score(X_test_sc3_fs1, y_test)
#0.9370629370629371
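To see what threshold=None does in practice, here is a minimal, self-contained sketch. It assumes scikit-learn's breast cancer dataset as a stand-in for the lab's X_train_sc3 / y_train (the lab's actual data and scaling step are not shown in this post), and the names selector, X_tr and so on are hypothetical. With a random forest and threshold=None, SelectFromModel falls back to the mean feature importance as the cut-off.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Assumption: a public dataset stands in for the lab's X_train_sc3 / y_train.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

selector = SelectFromModel(RandomForestClassifier(random_state=0), threshold=None)
selector.fit(X_tr, y_tr)

importances = selector.estimator_.feature_importances_
mask = selector.get_support()

# With threshold=None and a random forest, the cut-off is the mean importance.
print("threshold_     :", selector.threshold_)
print("mean importance:", importances.mean())
print("selected       :", mask.sum(), "of", mask.size)
```

Passing an explicit value such as threshold="median" or a float changes how many features survive, so it is worth printing threshold_ before relying on the default.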

### Univariate feature selection

from sklearn.feature_selection import SelectKBest

select2 = SelectKBest(k=10)
X_train_sc3_fs2 = select2.fit_transform(X_train_sc3, y_train)
X_test_sc3_fs2 = select2.transform(X_test_sc3)

X_train_sc3_fs2.shape

mask = select2.get_support()
plt.matshow(mask.reshape(1,-1), cmap="gray_r")
plt.show()
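SelectKBest scores each feature on its own and keeps the k features with the highest scores. The sketch below makes that explicit by comparing the selected indices with the top 10 entries of scores_. It assumes the breast cancer dataset as a stand-in for the lab's data, and writes out score_func=f_classif, which is the default scoring function for SelectKBest.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: a public dataset stands in for the lab's X_train_sc3 / y_train.
X, y = load_breast_cancer(return_X_y=True)

# f_classif (ANOVA F-value) is the default score function; it is spelled out here for clarity.
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Indices of the 10 largest univariate scores, sorted for comparison.
top10_by_score = np.sort(np.argsort(selector.scores_)[-10:])
selected = selector.get_support(indices=True)
print("top-10 by F score:", top10_by_score)
print("selected indices :", selected)
```

Because each feature is scored independently, this method is fast but does not account for interactions between features.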

svm.fit(X_train_sc3_fs2, y_train).score(X_test_sc3_fs2, y_test)
# 0.951048951048951

Finding the best percentile with SelectPercentile and cross-validation

#
# Example of selecting features automatically with univariate feature selection (beware of overfitting)
#

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.model_selection import cross_val_score
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

# Model to use (prefer one that is as fast as possible without losing performance)
model = LogisticRegression(random_state=0)

# Vary p (the percentage of features kept) in SelectPercentile() to find the best p
cv_scores = []
for p in tqdm(range(5,100,1)):
    X_new = SelectPercentile(percentile=p).fit_transform(X_train_sc3, y_train)
    cv_score = cross_val_score(model, X_new, y_train, cv=3).mean()
    cv_scores.append((p, cv_score))
# The percentile of SelectPercentile is varied from 5% to 99% to find the best ratio.
# percentile=p means that the top p% of the features are kept.
# For each p, performance is evaluated with 3-fold cross-validation and the mean score is stored in cv_scores.


# Print the best percentile
# Find the (percentile, score) pair with the highest cross-validation score, store it in best_score, and print it.
import numpy as np

best_score = cv_scores[np.argmax([score for _, score in cv_scores])]
print(best_score)

# Plot the performance change with p
plt.plot([k for k, _ in cv_scores], [score for _, score in cv_scores])
plt.xlabel('Percent of features')
plt.grid()
plt.show()

# To avoid overfitting, picking a value near the best p (rather than the best p itself) may give better results.
fs = SelectPercentile(percentile=best_score[0]).fit(X_train_sc3, y_train)
X_train_sc3_fs2p = fs.transform(X_train_sc3)
X_test_sc3_fs2p = fs.transform(X_test_sc3)

print(X_train_sc3_fs2p.shape)

svm.fit(X_train_sc3_fs2p, y_train).score(X_test_sc3_fs2p, y_test)
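One way to reduce the overfitting risk mentioned above is to put the selection step inside the cross-validation loop, so that each fold picks its features from its own training part only. The sketch below is a variant of the search, not the lab's exact code: it assumes the breast cancer dataset as a stand-in, adds a StandardScaler because the lab's scaling step is not shown, uses a coarser percentile grid to stay fast, and sets max_iter=1000 only to avoid convergence warnings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a public dataset stands in for the lab's data.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

percentiles = list(range(10, 101, 10))  # coarser grid than the lab's 5..99
results = []
for p in percentiles:
    # Scaling and selection are refit inside every CV fold via the pipeline.
    pipe = make_pipeline(
        StandardScaler(),
        SelectPercentile(percentile=p),
        LogisticRegression(random_state=0, max_iter=1000),
    )
    score = cross_val_score(pipe, X_tr, y_tr, cv=3).mean()
    results.append((p, score))

best_p, best_cv = max(results, key=lambda t: t[1])
print("best percentile:", best_p, "cv accuracy:", round(best_cv, 4))
```

The scores from this version tend to be a more honest estimate, since the validation folds never influence which features are kept.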

### Recursive feature elimination

Recursive Feature Elimination (RFE) is a feature selection method that uses a model to remove features step by step in search of the best feature subset. RFE proceeds as follows.

1. Start with the full feature set, containing every feature.
2. Train the model and evaluate the importance of each feature.
3. Remove the feature with the lowest importance.
4. Retrain the model on the remaining features and evaluate their importance again.
5. Repeat until the desired number of features or a performance criterion is reached.

estimator=RandomForestClassifier(random_state=0) defines which model RFE uses to select features. This code uses a random forest classifier (RandomForestClassifier), and random_state=0 fixes the random seed.

n_features_to_select=10 specifies how many features to keep. Here it is set to keep 10 features.

step=1 makes the RFE algorithm remove one feature per iteration.

from sklearn.feature_selection import RFE

select3 = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=10, step=1)
X_train_sc3_fs3 = select3.fit_transform(X_train_sc3, y_train)
X_test_sc3_fs3 = select3.transform(X_test_sc3)

X_train_sc3_fs3.shape

svm.fit(X_train_sc3_fs3, y_train).score(X_test_sc3_fs3, y_test)
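The elimination procedure described in this section can be sketched by hand and placed next to scikit-learn's RFE. This is an illustration under assumptions, not the lab's code: the breast cancer dataset stands in for the lab's data, n_estimators=30 is chosen only to keep the run short, and the names base and remaining are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Assumption: a public dataset stands in for the lab's data.
X, y = load_breast_cancer(return_X_y=True)
# Assumption: fewer trees than the default, only to keep the sketch fast.
base = RandomForestClassifier(n_estimators=30, random_state=0)

# Manual loop: refit on the remaining features and drop the least important one.
remaining = list(range(X.shape[1]))  # start from all features
while len(remaining) > 10:
    model = clone(base).fit(X[:, remaining], y)
    weakest = int(np.argmin(model.feature_importances_))
    del remaining[weakest]

# The same idea using scikit-learn's RFE.
rfe = RFE(estimator=base, n_features_to_select=10, step=1).fit(X, y)
print("manual loop kept:", remaining)
print("RFE kept        :", np.flatnonzero(rfe.support_).tolist())
print("worst rank      :", rfe.ranking_.max())
```

In ranking_, the kept features have rank 1 and larger ranks mean earlier elimination, which is useful when deciding how many features to keep.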

License: Attribution, NonCommercial, NoDerivatives
