AI·Big Data Convergence Business Administration Study Note

[ML Class] Week 8 Lab 0: Feature Engineering - Getting Started

SubjectOwner 2023. 11. 21. 15:52

allstate_train.zip
7.05MB

 

#### Scikit-Learn Preprocessor Interface      
① Import the preprocessor  
② Instantiate the preprocessor  
③ Fit the data to the preprocessor: .fit()  
④ Generate the preprocessed data: .transform()
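
The four steps above can be sketched as follows. This is a minimal illustration only: `StandardScaler` is chosen here as one example preprocessor, and the toy array `X_toy` is made up for the demonstration, not taken from the Allstate data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # (1) import the preprocessor

# Toy single-feature data, made up for illustration only
X_toy = np.array([[1.0], [2.0], [3.0]])

scaler = StandardScaler()            # (2) instantiate the preprocessor
scaler.fit(X_toy)                    # (3) fit: learn the column mean and std
X_scaled = scaler.transform(X_toy)   # (4) transform: apply the learned statistics

print(scaler.mean_)      # mean learned during fit
print(X_scaled.ravel())  # standardized values
```

In practice `.fit()` is usually called on the training split only, and the fitted object is then reused to `.transform()` both the training and the test split.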
 
1. [Missing Data Handling](#1.-Missing-Data-Hadling)
2. [Categorical Feature Transformation](#2.-Categorical-Feature-Transformation)
3. [Numerical Feature Transformation](#3.-Numerical-Feature-Transformation)
4. [Feature Selection](#4.-Feature-Selection)
5. [Dimensionality Reduction](#5.-Dimensionality-Reduction)
6. [Feature Generation](#6.-Feature-Generation)
 
 

 

allstate-purchase-prediction-challenge/data)**
- Transaction records covering each customer's path to purchasing an auto insurance product
- Includes the quote history for each customer ID
- The last row for each customer ID is the purchase point (record_type=1)
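
The structure described above (several quote rows per customer, ending in a purchase row) can be checked with a short sketch. The frame `toy` below is a made-up stand-in with the same column names, and it assumes rows can be ordered by `shopping_pt` within each customer; on the real data, the same check would run on `data` after it is loaded.

```python
import pandas as pd

# Hypothetical miniature of the dataset: two customers, five rows
toy = pd.DataFrame({
    "customer_ID": [1, 1, 1, 2, 2],
    "shopping_pt": [1, 2, 3, 1, 2],
    "record_type": [0, 0, 1, 0, 1],
})

# Take the last row of each customer's history (ordered by shopping_pt)
ordered = toy.sort_values(["customer_ID", "shopping_pt"])
last_rows = ordered.groupby("customer_ID").tail(1)

# True if every customer's final row is a purchase point
all_last_are_purchases = bool((last_rows["record_type"] == 1).all())

# Number of rows (quotes plus the purchase) per customer
rows_per_customer = toy.groupby("customer_ID")["shopping_pt"].count()

print(all_last_are_purchases)
print(rows_per_customer)
```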
 
*Variable Descriptions*
- customer_ID
- shopping_pt - Unique identifier for the shopping point of a given customer
- record_type - 0=shopping point, 1=purchase point
- day - Day of the week (0-6, 0=Monday)
- time - Time of day (HH:MM)
- state - State where shopping point occurred
- location - Location ID where shopping point occurred
- group_size - How many people will be covered under the policy (1, 2, 3 or 4)
- homeowner - Whether the customer owns a home or not (0=no, 1=yes)
- car_age - Age of the customer’s car
- car_value - How valuable was the customer’s car when new
- risk_factor - An ordinal assessment of how risky the customer is (1, 2, 3, 4)
- age_oldest - Age of the oldest person in customer's group
- age_youngest - Age of the youngest person in customer’s group
- married_couple - Does the customer group contain a married couple (0=no, 1=yes)
- C_previous - What the customer formerly had or currently has for product option C (0=nothing, 1, 2, 3, 4)
- duration_previous - How long (in years) the customer was covered by their previous issuer
- A,B,C,D,E,F,G - the coverage options
- cost - cost of the quoted coverage options

 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

pd.set_option("display.max_columns", 30)

data = pd.read_csv('allstate_train.csv')

 

 

Example: look up the rows whose customer_ID column equals 10152724

# Look up the rows whose customer_ID column equals 10152724
data.query('customer_ID == 10152724')
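
As a side note, `DataFrame.query` is one of several equivalent ways to filter rows. The sketch below compares it with a boolean mask on a made-up frame `toy` (the second customer ID is hypothetical), and shows the `@` prefix for referring to a Python variable inside the query string.

```python
import pandas as pd

# Hypothetical miniature frame; only the first ID appears in the original note
toy = pd.DataFrame({
    "customer_ID": [10152724, 10152724, 10000000],
    "shopping_pt": [1, 2, 1],
})

# Expression string evaluated against the column names
by_query = toy.query("customer_ID == 10152724")

# Equivalent boolean-mask indexing
by_mask = toy[toy["customer_ID"] == 10152724]

# A Python variable can be referenced inside query() with the @ prefix
target_id = 10152724
by_variable = toy.query("customer_ID == @target_id")

print(by_query)
```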

 

# For convenience, split the features to be used into continuous (con) and categorical (cat)
con = ['group_size','car_age','age_oldest','age_youngest','duration_previous','cost']
cat = ['day','homeowner','car_value','risk_factor','married_couple','C_previous','state','shopping_pt']
X = data[con+cat]
X.head()

 

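# Target: record_type (0 = shopping point, 1 = purchase point)
y = data['record_type']
from sklearn.model_selection import train_test_split

# Hold out half of the rows as the test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)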
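
To make the effect of `test_size=0.5` and `random_state=0` concrete, here is a minimal sketch on made-up data (`X_toy`, `y_toy` are hypothetical). The `stratify` argument at the end is not used in the call above; it is shown only as an option one might consider when the target classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 of them positive
X_toy = pd.DataFrame({"cost": np.arange(10), "car_age": np.arange(10, 20)})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="record_type")

# test_size=0.5 puts half of the rows in each split
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)

# The same random_state reproduces the same shuffle
X_tr2, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)
same_split = X_tr.index.equals(X_tr2.index)

# Optional: stratify keeps the class ratio equal in both splits
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape, same_split)
print(int(y_tr_s.sum()), int(y_te_s.sum()))
```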