[Kaggle Notebook Review] Predicting Employee Attrition with a Random Forest Model

Unknown user, 2020. 9. 15. 14:53

www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

IBM HR Analytics Employee Attrition & Performance

Predict attrition of your valuable employees

Kaggle hosts an HR dataset created by IBM data scientists. The goal of the analysis is to predict which employees will leave the company.

Description: Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists.

The notebook reviewed today predicts attrition with algorithms based on decision trees.

www.kaggle.com/faressayah/decision-trees-and-random-forest-tutorial

Decision Trees and Random Forest Tutorial


Key points from the notebook

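1. During preprocessing, astype("category").cat.codes converts a string column to the categorical dtype and then to integer codes

I used to do this with the LabelEncoder class from sklearn, but the same conversion can be done directly in pandas.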

df['Attrition'] = df.Attrition.astype("category").cat.codes
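To make the comparison with LabelEncoder concrete, here is a minimal sketch on a toy column. The five sample values are made up for illustration and are not taken from the IBM dataset; the point is only that both approaches sort the labels and number them in that order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the Attrition column (hypothetical values).
df = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes", "No"]})

# pandas: categories are sorted, so "No" -> 0 and "Yes" -> 1.
pandas_codes = df["Attrition"].astype("category").cat.codes

# scikit-learn: LabelEncoder also sorts its classes before numbering them.
sklearn_codes = LabelEncoder().fit_transform(df["Attrition"])

print(pandas_codes.tolist())   # [1, 0, 0, 1, 0]
print(sklearn_codes.tolist())  # [1, 0, 0, 1, 0]

# Keep the code -> label mapping so the numbers can be read back later.
mapping = dict(enumerate(df["Attrition"].astype("category").cat.categories))
print(mapping)  # {0: 'No', 1: 'Yes'}
```

One difference worth knowing: cat.codes marks missing values as -1, so it is worth checking for NaN before relying on the codes.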

2. Finding the model's best parameters with GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

params = {
    "criterion":("gini", "entropy"), 
    "splitter":("best", "random"), 
    "max_depth":(list(range(1, 20))), 
    "min_samples_split":[2, 3, 4], 
    "min_samples_leaf":list(range(1, 20)), 
}


model = DecisionTreeClassifier(random_state=42)
grid_search_cv = GridSearchCV(model, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=3)

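# Hyperparameter tuning: search the grid for the best parameters.
# X_train / y_train are assumed to be the training split prepared earlier in the notebook.
grid_search_cv.fit(X_train, y_train)
grid_search_cv.best_estimator_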

# Best estimator reported by the search. Arguments left at their defaults are omitted;
# presort and min_impurity_split no longer exist in recent scikit-learn releases.
tree = DecisionTreeClassifier(criterion='entropy', max_depth=6,
                              min_samples_leaf=10, min_samples_split=2,
                              random_state=42, splitter='best')

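If you already understand the basic logic of classification algorithms such as decision trees and random forests and want to find out which parameters to adjust to make them perform better, cross-validation is the way to go. After fitting, GridSearchCV exposes the winning model through its best_estimator_ attribute. This notebook is especially helpful for understanding how GridSearchCV can be used.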
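As a rough end-to-end sketch of that workflow, the snippet below runs a search on synthetic data. The data from make_classification and the reduced grid small_params are assumptions made so the example runs in a few seconds; they are not what the notebook uses. It also counts how many candidates the grid from this post implies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR features and the Attrition label (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The grid from the post: count the candidates it implies.
full_params = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": list(range(1, 20)),
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": list(range(1, 20)),
}
n_candidates = len(ParameterGrid(full_params))
print(n_candidates, n_candidates * 3)  # 4332 candidates, 12996 fits with cv=3

# A much smaller, hypothetical grid so the sketch finishes quickly.
small_params = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

grid_search_cv = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    small_params,
    scoring="accuracy",
    cv=3,
)
grid_search_cv.fit(X_train, y_train)  # best_* attributes exist only after fit

print(grid_search_cv.best_params_)    # winning parameter combination
print(grid_search_cv.best_score_)     # its mean cross-validated accuracy

# best_estimator_ is already refit on the full training split (refit=True by default).
tree = grid_search_cv.best_estimator_
print(tree.score(X_test, y_test))     # accuracy on the held-out split
```

The candidate count is the reason n_jobs=-1 appears in the original call: every extra value in the grid multiplies the number of fits.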

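What is GridSearchCV?

A very handy tool: it tries every parameter combination in the grid, scores each one with cross-validation, and hands back the best one!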
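To see what that means in practice, the search can be approximated by hand with ParameterGrid and cross_val_score. This is a simplified sketch on synthetic data with a small hypothetical grid, not the actual internals of GridSearchCV, but it should arrive at the same best score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small grid, both assumptions for illustration.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
params = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 10]}

# Manual version: loop over every combination and cross-validate each one.
best_score, best_params = -np.inf, None
for candidate in ParameterGrid(params):
    model = DecisionTreeClassifier(random_state=42, **candidate)
    score = cross_val_score(model, X, y, scoring="accuracy", cv=3).mean()
    if score > best_score:
        best_score, best_params = score, candidate

# GridSearchCV version: same grid, same scoring, same number of folds.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42), params, scoring="accuracy", cv=3
)
grid.fit(X, y)

print(best_params, round(best_score, 4))
print(grid.best_params_, round(grid.best_score_, 4))
```

Beyond saving the loop, GridSearchCV also refits the best model on all the data it was given and stores the per-candidate results in cv_results_.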

datascienceschool.net/view-notebook/ff4b5d491cc34f94aea04baca86fbef8/

Data Science School

Data Science School is an open space!

teddylee777.github.io/scikit-learn/grid-search-%EB%A1%9C-hyperparameter%EC%B5%9C%EC%A0%81%ED%99%94

Machine learning hyperparameter tuning with GridSearch

A look at how to tune hyperparameters with GridSearch.