Data Analytics and Modeling with XGBoost Classifier: WNS Hackathon Challenge

HR Analytics: Hackathon Challenge

I participated in the WNS Analytics Wizard hackathon, whose task was to predict whether an employee will be promoted or not. This blog post walks through the solution I submitted, which ranked me 138th (top 11%) in the challenge. The leaderboard ranking was decided on the F1-score, which is the harmonic mean of precision and recall.
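As a quick refresher, F1 combines precision (P) and recall (R) as 2*P*R/(P+R). A toy check with scikit-learn (made-up labels, purely illustrative):

from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up ground truth and predictions, just to illustrate the metric
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

p = precision_score(y_true, y_pred)   # 3 of 4 predicted positives are correct -> 0.75
r = recall_score(y_true, y_pred)      # 3 of 4 actual positives are found -> 0.75
print(2 * p * r / (p + r))            # 0.75
print(f1_score(y_true, y_pred))       # 0.75 (same thing)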

About Data
The data-set consists of 54,808 rows, each with 14 attributes including the target variable (i.e. "is_promoted"). There are 4,668 cases where employees have been promoted (8.5%). The data-set is provided in the GitHub link here.

Let’s get started in building the data analytics pipeline end to end.

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score

import xgboost as xgb
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

# Set all options
%matplotlib inline
plt.style.use('seaborn-notebook')
plt.rcParams["figure.figsize"] = (20, 3)
pd.options.display.float_format = '{:20,.4f}'.format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set(context="paper", font="monospace")

User Defined Functions

def convert_categorical_to_dummies(d_convert):

    """
    Author: Abhijeet Kumar
    Description: Returns a dataframe with all categorical variables converted into dummies
    Arguments: Dataframe (having categorical variables)
    """

    df = d_convert.copy()
    list_to_drop = []
    for col in df.columns:
        if df[col].dtype == 'object':
            list_to_drop.append(col)
            df = pd.concat([df, pd.get_dummies(df[col], prefix=col, prefix_sep='_', drop_first=False)], axis=1)
    # Drop all original categorical columns once, after the loop,
    # so that no column is dropped twice
    df = df.drop(list_to_drop, axis=1)
    return df
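For instance, a toy example (hypothetical data) of what the helper produces:

# Toy example (hypothetical data): each object column becomes a set of 0/1 dummies
toy = pd.DataFrame({'dept': ['HR', 'Legal', 'HR'], 'age': [30, 40, 35]})
print(convert_categorical_to_dummies(toy))
#    age  dept_HR  dept_Legal
# 0   30        1           0
# 1   40        0           1
# 2   35        1           0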

def quality_report(df):

    """
    Author: Abhijeet Kumar
    Description: Displays quality of data in terms of missing values, unique numbers, datatypes etc.
    Arguments: Dataframe
    """
    dtypes = df.dtypes
    nuniq = df.nunique()
    total = df.isnull().sum().sort_values(ascending = False)
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False)
    quality_df = pd.concat([total, percent, nuniq, dtypes], axis=1, keys=['Total', 'Percent','Nunique', 'Dtype'])
    display(quality_df)

def score_on_test_set(model, file_name, out_name):

    """
    Author: Abhijeet Kumar
    Description : It runs same steps of preprocessing as in training, scores
    on the test data provided in hackathon and generates the submission file.
    Argument : model, test data file, submission file
    """

    test_data = pd.read_csv(file_name)

    # Treating the missing values of education as a separate category
    test_data['education'] = test_data['education'].replace(np.NaN, 'NA')

    # Treating the missing values of previous year rating as 0
    test_data['previous_year_rating'] = test_data['previous_year_rating'].fillna(0)

    # Creating dummy variables for all the categorical columns, dropping the original columns
    master_test_data = convert_categorical_to_dummies(test_data)

    # Removing the id attributes
    df_test_data = master_test_data.drop(['employee_id'],axis=1)
    if out_name == "submission_lightgbm.csv":
        y_pred = model.predict_proba(df_test_data.values, num_iteration=model.best_iteration_)
    else:
        y_pred = model.predict_proba(df_test_data.values)
    submission_df = pd.DataFrame({'employee_id':master_test_data['employee_id'],'is_promoted':y_pred[:,1]})
    submission_df.to_csv(out_name, index=False)

    return test_data, y_pred

Reading Data

data = pd.read_csv("train.csv")
print("Shape of Data = ",data.shape)
data.sample(5)
Shape of Data =  (54808, 14)

Checking the event rate

plt.figure(figsize=(6,3))
sns.countplot(x='is_promoted',data=data)
plt.show()

# Checking the event rate : event is when an employee is promoted
data['is_promoted'].value_counts()
0    50140
1     4668
Name: is_promoted, dtype: int64
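In other words, roughly 8.5% of employees in the data are promoted:

# Fraction of positive (promoted) cases
print(data['is_promoted'].mean())   # ~0.0852, the ~8.5% event rate noted earlier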

Displaying the attributes

# Checking the attribute names
pd.DataFrame(data.columns)
 0    employee_id
 1    department
 2    region
 3    education
 4    gender
 5    recruitment_channel
 6    no_of_trainings
 7    age
 8    previous_year_rating
 9    length_of_service
10    KPIs_met >80%
11    awards_won?
12    avg_training_score
13    is_promoted

Checking Data Quality

# checking missing data
quality_report(data)
Attributes             Total   Percent  Nunique    Dtype
KPIs_met >80%              0    0.0000        2    int64
age                        0    0.0000       41    int64
avg_training_score         0    0.0000       61    int64
awards_won?                0    0.0000        2    int64
department                 0    0.0000        9   object
education               2409    4.3953        3   object
employee_id                0    0.0000    54808    int64
gender                     0    0.0000        2   object
is_promoted                0    0.0000        2    int64
length_of_service          0    0.0000       35    int64
no_of_trainings            0    0.0000       10    int64
previous_year_rating    4124    7.5244        5  float64
recruitment_channel        0    0.0000        3   object
region                     0    0.0000       34   object

Missing Value Treatment

# Treating the missing values of education as a separate category
data['education'] = data['education'].replace(np.NaN, 'NA')

# Treating the missing values of previous year rating as 0
data['previous_year_rating'] = data['previous_year_rating'].fillna(0)
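A quick sanity check that the two treatments above removed all missing values (education and previous_year_rating were the only columns with missing data per the quality report):

# No missing values should remain after the two treatments above
assert data.isnull().sum().sum() == 0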

Looking at attributes (EDA)

Can we make some inferences from the EDA?

  • Promotions are worst in the Legal department (5.1%) and best in the Technology department (10.7%).
  • Region 9 is worst (1.9%) and region 4 is best (14.4%) in terms of promotions.
  • Master's & above has a higher promotion percentage, but the difference is small.
  • Employees with a previous year rating of 5 have better chances of promotion (16.4%) than others.
  • Employees meeting more than 80% of KPIs have good chances of promotion (16.9%).
  • Employees winning awards are promoted far more often (44%).
for col in data.drop('is_promoted', axis=1).columns:
    # Display promotion rates for categorical and low-cardinality columns
    # (the < 10 cutoff is reconstructed to match the tables shown below)
    if data[col].dtype == 'object' or data[col].nunique() < 10:
        xx = data.groupby(col)['is_promoted'].value_counts().unstack(1)
        per_not_promoted = xx.iloc[:, 0] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        per_promoted = xx.iloc[:, 1] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        xx['%_0'] = per_not_promoted
        xx['%_1'] = per_promoted
        display(xx)
is_promoted            0     1      %_0      %_1
department
Analytics           4840   512  90.4335   9.5665
Finance             2330   206  91.8770   8.1230
HR                  2282   136  94.3755   5.6245
Legal                986    53  94.8989   5.1011
Operations         10325  1023  90.9852   9.0148
Procurement         6450   688  90.3614   9.6386
R&D                  930    69  93.0931   6.9069
Sales & Marketing  15627  1213  92.7969   7.2031
Technology          6370   768  89.2407  10.7593

is_promoted      0    1      %_0      %_1
region
region_1       552   58  90.4918   9.5082
region_10      597   51  92.1296   7.8704
region_11     1241   74  94.3726   5.6274
region_12      467   33  93.4000   6.6000
region_13     2418  230  91.3142   8.6858
region_14      765   62  92.5030   7.4970
region_15     2586  222  92.0940   7.9060
region_16     1363  102  93.0375   6.9625
region_17      687  109  86.3065  13.6935
region_18       30    1  96.7742   3.2258
region_19      821   53  93.9359   6.0641
region_2     11354  989  91.9874   8.0126
region_20      801   49  94.2353   5.7647
region_21      393   18  95.6204   4.3796
region_22     5694  734  88.5812  11.4188
region_23     1038  137  88.3404  11.6596
region_24      490   18  96.4567   3.5433
region_25      716  103  87.4237  12.5763
region_26     2117  143  93.6726   6.3274
region_27     1528  131  92.1037   7.8963
region_28     1164  154  88.3156  11.6844
region_29      951   43  95.6740   4.3260
region_3       309   37  89.3064  10.6936
region_30      598   59  91.0198   8.9802
region_31     1825  110  94.3152   5.6848
region_32      905   40  95.7672   4.2328
region_33      259   10  96.2825   3.7175
region_34      284    8  97.2603   2.7397
region_4      1457  246  85.5549  14.4451
region_5       731   35  95.4308   4.5692
region_6       658   32  95.3623   4.6377
region_7      4327  516  89.3454  10.6546
region_8       602   53  91.9084   8.0916
region_9       412    8  98.0952   1.9048

is_promoted            0     1      %_0     %_1
education
Bachelor's         33661  3008  91.7969  8.2031
Below Secondary      738    67  91.6770  8.3230
Master's & above   13454  1471  90.1441  9.8559
NA                  2287   122  94.9357  5.0643

is_promoted      0     1      %_0     %_1
gender
f            14845  1467  91.0066  8.9934
m            35295  3201  91.6849  8.3151

is_promoted              0     1      %_0      %_1
recruitment_channel
other                27890  2556  91.6048   8.3952
referred              1004   138  87.9159  12.0841
sourcing             21246  1974  91.4987   8.5013

is_promoted               0     1      %_0      %_1
previous_year_rating
0.0000                 3785   339  91.7798   8.2202
1.0000                 6135    88  98.5859   1.4141
2.0000                 4044   181  95.7160   4.2840
3.0000                17263  1355  92.7221   7.2779
4.0000                 9093   784  92.0624   7.9376
5.0000                 9820  1921  83.6385  16.3615

is_promoted       0     1      %_0      %_1
KPIs_met >80%
0             34111  1406  96.0413   3.9587
1             16029  3262  83.0906  16.9094

is_promoted      0     1      %_0      %_1
awards_won?
0            49429  4109  92.3251   7.6749
1              711   559  55.9843  44.0157

Preparing Data for Modeling

# Creating dummy variables for all the categorical columns, dropping the original columns
master_data = convert_categorical_to_dummies(data)
print("Total shape of Data :",master_data.shape)

# Extracting the target variable as an array
labels = np.array(master_data['is_promoted'].tolist())

# Removing the id attributes
df_data = master_data.drop(['is_promoted','employee_id'],axis=1)
print("Shape of Data:",df_data.shape)
df = df_data.values
Total shape of Data : (54808, 61)
Shape of Data: (54808, 59)

Model 1 – XGB Classifier

xgb_model = xgb.XGBClassifier()
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1')
print("F1-score = ",f1_scores," Mean F1 score = ",np.mean(f1_scores))

# Training the models
xgb_model.fit(df,labels)

# Scoring on test set
test_data,score_xgb = score_on_test_set(xgb_model,"test.csv","submission_xgb.csv")
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1-score = [ 0.4526749   0.41547519  0.43579122  0.43012552  0.43427621]  
Mean F1 score =  0.433668606717

XGB Classifier : Parameter Tuning

Our goal is to set the model's hyper-parameters to values that let it learn the task as well as possible. Tuning the XGBoost classifier therefore means searching over the parameters that most affect the model.
I patiently ran a lot of iterations, which led to fine-tuning of n_estimators, max_depth and L1 regularization (reg_alpha). A common norm is to take baby steps (a small learning rate) while tuning the other parameters. Here, I found that the F1-scores kept improving with a large number of trees (n_estimators).

# Create parameters to search
params = {
     'learning_rate': [0.01],
     'n_estimators': [900,1000,1100],
     'max_depth':[7,8,9],
     'reg_alpha':[0.3,0.4,0.5]
    }

# Initializing the XGBoost classifier
xgb_model = xgb.XGBClassifier()

# Grid search initialization
gsearch = GridSearchCV(xgb_model, params,
                    verbose=True,
                    cv=5,
                    n_jobs=2)

gsearch.fit(df, labels)

# Printing the best chosen params
print("Best Parameters :",gsearch.best_params_)

params = {'objective':'binary:logistic', 'booster':'gbtree'}

# Updating the parameter as per grid search
params.update(gsearch.best_params_)

# Initializing the XGBoost classifier with the tuned parameters
xgb_model = xgb.XGBClassifier(**params)
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1',n_jobs=2)
print("F1_scores per fold : ",f1_scores," \nMean F1_score= ",np.mean(f1_scores))

# Fitting model on tuned parameters
xgb_model.fit(df, labels)

# Scoring on test set
test_data,score_xgb_tuned = score_on_test_set(xgb_model,"test.csv","submission_xgb_tuned.csv")
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed: 13.0min finished
Best Parameters :{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 1000, 'reg_alpha': 0.4}

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.4, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1_scores per fold : [ 0.51014041  0.48657188  0.49528302  0.53054911  0.51130164]  
Mean F1_score= 0.506769210361

XGB Classifier : Setting threshold

How does the XGBoost classifier predict the class ('promoted' or 'not promoted')? It predicts a probability between 0 and 1 for each unseen case, and then assigns label 1 if that probability exceeds the default threshold of 0.5. On an imbalanced data-set like this one, that default can be a biased setting, as it is difficult to capture the rare event with a 0.5 threshold.

  • We can change the default threshold of 0.5 by finding the optimal threshold that increases the F1-score.
  • We need to find the threshold where the F1-score is highest.
  • I tried submissions at a few candidate cut-offs to get the maximum possible F1-score.

The following python code splits the data 90:10 and trains the XGBoost classifier with the tuned parameters. It calculates precision and recall at different thresholds and plots the precision-recall curve. We then calculate the F1-score at each threshold from the precision and recall values.

# Holding out 10% of the data to evaluate thresholds on unseen cases
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.10, stratify=labels)

xgb_model = xgb.XGBClassifier(**params)

# Training the models
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:,1])

thresholds = np.append(thresholds, 1)
f1_scores = 2*(precision*recall)/(precision+recall)
plt.step(recall, precision, color='b', alpha=0.4, where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve')
plt.show()

[Plot: 2-class precision-recall curve]

Getting optimal threshold

We plot the F1-score against the threshold on the x-axis to locate the F1-score peak. The python code below picks the threshold value where the F1-score is highest.

scrs = pd.DataFrame({'precision' : precision, 'recal' : recall, 'thresholds' : thresholds, 'f1_score':f1_scores})
print("Threshold cutoff: ",scrs.loc[scrs['f1_score'] == scrs.f1_score.max(),'thresholds'].iloc[0])
print("Max F1-score at cut-off : ",scrs.f1_score.max())
scrs.plot(x='thresholds', y='f1_score')
Threshold cutoff:  0.340377241373
Max F1-score at cut-off :  0.53791130186
[Plot: F1-score vs. threshold]

Once you have the optimal threshold, use it as the cutoff on the test-set probability predictions to produce the class labels 0 and 1 for the final submission.
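A minimal sketch of that final step, reusing the cutoff computed above and the test-set probabilities returned by score_on_test_set (the submission file name is my own choice):

# Hypothetical final-submission step: apply the tuned cutoff instead of the default 0.5
best_threshold = scrs.loc[scrs['f1_score'] == scrs.f1_score.max(), 'thresholds'].iloc[0]
final_pred = (score_xgb_tuned[:, 1] > best_threshold).astype(int)
submission_df = pd.DataFrame({'employee_id': test_data['employee_id'],
                              'is_promoted': final_pred})
submission_df.to_csv("submission_xgb_threshold.csv", index=False)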

What did not work?

I tried the following other techniques, which did not work, and hence my final submissions were based on the single "XGBoost classifier" model described in this post.

  • I tried logistic regression and SVM; the F1-score was low (less than 0.4).
  • I tried Random Forest. The F1-score was comparatively low.
  • I tried a LightGBM model. With default settings it gave a 0.50 F1-score, but somehow it did not improve with parameter tuning. There was a little improvement when early_stopping_rounds was used, taking the best iteration for predictions on the test set.
  • I created some interaction variables, such as "if previous_year_rating == 5 and KPIs_met >80% == 1 then 1 else 0" and "if awards_won? == 1 and KPIs_met >80% == 1 then 1 else 0" (sketched below). They did not help.
  • Finally, I took the best tuned params of all three (RF, XGBoost and LightGBM) and stacked them with logistic regression as the meta-classifier. It did not give a better F1 than the individual XGB Classifier model.
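A minimal sketch of those interaction variables (the new column names are my own, purely illustrative):

# Hypothetical interaction features described above; they did not improve F1
data['rating5_and_kpi'] = ((data['previous_year_rating'] == 5) &
                           (data['KPIs_met >80%'] == 1)).astype(int)
data['award_and_kpi'] = ((data['awards_won?'] == 1) &
                         (data['KPIs_met >80%'] == 1)).astype(int)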

At the End

Readers are encouraged to download the data-set and check whether they can reproduce the results. I would also love to see in the comments if you can surpass the F1-score achieved in this blog-post. Here are a few other things one could try.

  • Generally, stacking improves scores when there are a lot of models. One could train, say, hundreds of XGBoost and LightGBM models (with slightly different parameters) and then apply logistic regression on top (I tried with only 3 models and failed).
  • One could also try an interaction variable capturing the total score achieved in training (number of trainings * average training score).
  • One could try setting early_stopping_rounds in XGBoost classifier training, which I did not try; it prevents over-fitting and can improve results (see the sketch below).
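A hedged sketch of that last idea, assuming the sklearn-wrapper signature of the xgboost version used elsewhere in this post (newer releases move early stopping into the constructor or callbacks):

# Assumed sketch: early stopping on a held-out validation set
X_tr, X_val, y_tr, y_val = train_test_split(df, labels, test_size=0.10,
                                            stratify=labels, random_state=42)
xgb_es = xgb.XGBClassifier(**params)
xgb_es.fit(X_tr, y_tr,
           eval_set=[(X_val, y_val)],
           eval_metric='logloss',
           early_stopping_rounds=50,   # stop when the eval metric has not improved for 50 rounds
           verbose=False)
print("Best iteration:", xgb_es.best_iteration)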

The full implementation of the approach followed, along with a LightGBM model example (jupyter notebook), can be downloaded from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.

Happy data analytics 🙂