Data Analytics and Modeling with XGBoost Classifier: WNS Hackathon Challenge

HR Analytics: Hackathon Challenge

I participated in the WNS Analytics Wizard hackathon, whose task was to predict whether an employee will be promoted or not. This blog post walks through the solution I submitted, which ranked me 138th (top 11%) in the challenge. The leaderboard ranking was decided on the F1-score, which is the harmonic mean of precision and recall.
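As a quick refresher, F1 combines precision (P) and recall (R) as 2*P*R/(P+R). A toy check with scikit-learn (made-up labels, purely illustrative):

from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up ground truth and predictions, just to illustrate the metric
y_true = [0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1]

p = precision_score(y_true, y_pred)   # 3 of 4 predicted positives are correct -> 0.75
r = recall_score(y_true, y_pred)      # 3 of 4 actual positives are found -> 0.75
print(2 * p * r / (p + r))            # 0.75
print(f1_score(y_true, y_pred))       # 0.75 (same thing)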

About Data
The data-set consists of 54,808 rows, each with 14 attributes including the target variable (i.e. "is_promoted"). There are 4,668 cases where employees have been promoted (8.5%). The data-set is provided in the GitHub link here.

Let’s get started in building the data analytics pipeline end to end.

Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from sklearn.metrics import confusion_matrix, f1_score, precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score

import xgboost as xgb
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

# Set all options
%matplotlib inline
plt.style.use('seaborn-notebook')
plt.rcParams["figure.figsize"] = (20, 3)
pd.options.display.float_format = '{:20,.4f}'.format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
sns.set(context="paper", font="monospace")

User Defined Functions

def convert_categorical_to_dummies(d_convert):

    """
    Author: Abhijeet Kumar
    Description: Returns a dataframe with all categorical variables converted into dummies
    Arguments: Dataframe (having categorical variables)
    """

    df = d_convert.copy()
    list_to_drop = []
    for col in df.columns:
        if df[col].dtype == 'object':
            list_to_drop.append(col)
            df = pd.concat([df, pd.get_dummies(df[col], prefix=col, prefix_sep='_', drop_first=False)], axis=1)
    # Drop all original categorical columns once, after the loop,
    # so that no column is dropped twice
    df = df.drop(list_to_drop, axis=1)
    return df
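For instance, a toy example (hypothetical data) of what the helper produces:

# Toy example (hypothetical data): each object column becomes a set of 0/1 dummies
toy = pd.DataFrame({'dept': ['HR', 'Legal', 'HR'], 'age': [30, 40, 35]})
print(convert_categorical_to_dummies(toy))
#    age  dept_HR  dept_Legal
# 0   30        1           0
# 1   40        0           1
# 2   35        1           0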

def quality_report(df):

    """
    Author: Abhijeet Kumar
    Description: Displays quality of data in terms of missing values, unique numbers, datatypes etc.
    Arguments: Dataframe
    """
    dtypes = df.dtypes
    nuniq = df.nunique()
    total = df.isnull().sum().sort_values(ascending = False)
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False)
    quality_df = pd.concat([total, percent, nuniq, dtypes], axis=1, keys=['Total', 'Percent','Nunique', 'Dtype'])
    display(quality_df)

def score_on_test_set(model, file_name, out_name):

    """
    Author: Abhijeet Kumar
    Description : It runs same steps of preprocessing as in training, scores
    on the test data provided in hackathon and generates the submission file.
    Argument : model, test data file, submission file
    """

    test_data = pd.read_csv(file_name)

    # Treating the missing values of education as a separate category
    test_data['education'] = test_data['education'].replace(np.NaN, 'NA')

    # Treating the missing values of previous year rating as 0
    test_data['previous_year_rating'] = test_data['previous_year_rating'].fillna(0)

    # Creating dummy variables for all the categorical columns, dropping the original columns
    master_test_data = convert_categorical_to_dummies(test_data)

    # Removing the id attributes
    df_test_data = master_test_data.drop(['employee_id'],axis=1)
    if out_name == "submission_lightgbm.csv":
        y_pred = model.predict_proba(df_test_data.values, num_iteration=model.best_iteration_)
    else:
        y_pred = model.predict_proba(df_test_data.values)
    submission_df = pd.DataFrame({'employee_id':master_test_data['employee_id'],'is_promoted':y_pred[:,1]})
    submission_df.to_csv(out_name, index=False)

    return test_data, y_pred

Reading Data

data = pd.read_csv("train.csv")
print("Shape of Data = ",data.shape)
data.sample(5)
Shape of Data =  (54808, 14)

Checking the event rate

plt.figure(figsize=(6,3))
sns.countplot(x='is_promoted',data=data)
plt.show()

# Checking the event rate : event is when an employee is promoted
data['is_promoted'].value_counts()
0    50140
1     4668
Name: is_promoted, dtype: int64
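In other words, roughly 8.5% of employees in the data are promoted:

# Fraction of positive (promoted) cases
print(data['is_promoted'].mean())   # ~0.0852, the ~8.5% event rate noted earlier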

Displaying the attributes

# Checking the attribute names
pd.DataFrame(data.columns)
 0    employee_id
 1    department
 2    region
 3    education
 4    gender
 5    recruitment_channel
 6    no_of_trainings
 7    age
 8    previous_year_rating
 9    length_of_service
10    KPIs_met >80%
11    awards_won?
12    avg_training_score
13    is_promoted

Checking Data Quality

# checking missing data
quality_report(data)
Attributes             Total   Percent  Nunique    Dtype
KPIs_met >80%              0    0.0000        2    int64
age                        0    0.0000       41    int64
avg_training_score         0    0.0000       61    int64
awards_won?                0    0.0000        2    int64
department                 0    0.0000        9   object
education               2409    4.3953        3   object
employee_id                0    0.0000    54808    int64
gender                     0    0.0000        2   object
is_promoted                0    0.0000        2    int64
length_of_service          0    0.0000       35    int64
no_of_trainings            0    0.0000       10    int64
previous_year_rating    4124    7.5244        5  float64
recruitment_channel        0    0.0000        3   object
region                     0    0.0000       34   object

Missing Value Treatment

# Treating the missing values of education as a separate category
data['education'] = data['education'].replace(np.NaN, 'NA')

# Treating the missing values of previous year rating as 0
data['previous_year_rating'] = data['previous_year_rating'].fillna(0)
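A quick sanity check that the two treatments above removed all missing values (education and previous_year_rating were the only columns with missing data per the quality report):

# No missing values should remain after the two treatments above
assert data.isnull().sum().sum() == 0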

Looking at attributes (EDA)

Can we make some inferences from the EDA?

  • Promotions are worst in the Legal department (5.1%) and best in the Technology department (10.7%).
  • Region 9 is worst (1.9%) and region 4 is best (14.4%) in terms of promotions.
  • Master's & above has a higher promotion percentage, but the difference is small.
  • Employees with a previous year rating of 5 have better chances of promotion (16.4%) than others.
  • Employees meeting more than 80% of KPIs have good chances of promotion (16.9%).
  • Employees winning awards are promoted far more often (44%).
for col in data.drop('is_promoted', axis=1).columns:
    # Display promotion rates for categorical and low-cardinality columns
    # (the < 10 cutoff is reconstructed to match the tables shown below)
    if data[col].dtype == 'object' or data[col].nunique() < 10:
        xx = data.groupby(col)['is_promoted'].value_counts().unstack(1)
        per_not_promoted = xx.iloc[:, 0] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        per_promoted = xx.iloc[:, 1] * 100 / xx.apply(lambda x: x.sum(), axis=1)
        xx['%_0'] = per_not_promoted
        xx['%_1'] = per_promoted
        display(xx)
is_promoted            0     1      %_0      %_1
department
Analytics           4840   512  90.4335   9.5665
Finance             2330   206  91.8770   8.1230
HR                  2282   136  94.3755   5.6245
Legal                986    53  94.8989   5.1011
Operations         10325  1023  90.9852   9.0148
Procurement         6450   688  90.3614   9.6386
R&D                  930    69  93.0931   6.9069
Sales & Marketing  15627  1213  92.7969   7.2031
Technology          6370   768  89.2407  10.7593

is_promoted      0    1      %_0      %_1
region
region_1       552   58  90.4918   9.5082
region_10      597   51  92.1296   7.8704
region_11     1241   74  94.3726   5.6274
region_12      467   33  93.4000   6.6000
region_13     2418  230  91.3142   8.6858
region_14      765   62  92.5030   7.4970
region_15     2586  222  92.0940   7.9060
region_16     1363  102  93.0375   6.9625
region_17      687  109  86.3065  13.6935
region_18       30    1  96.7742   3.2258
region_19      821   53  93.9359   6.0641
region_2     11354  989  91.9874   8.0126
region_20      801   49  94.2353   5.7647
region_21      393   18  95.6204   4.3796
region_22     5694  734  88.5812  11.4188
region_23     1038  137  88.3404  11.6596
region_24      490   18  96.4567   3.5433
region_25      716  103  87.4237  12.5763
region_26     2117  143  93.6726   6.3274
region_27     1528  131  92.1037   7.8963
region_28     1164  154  88.3156  11.6844
region_29      951   43  95.6740   4.3260
region_3       309   37  89.3064  10.6936
region_30      598   59  91.0198   8.9802
region_31     1825  110  94.3152   5.6848
region_32      905   40  95.7672   4.2328
region_33      259   10  96.2825   3.7175
region_34      284    8  97.2603   2.7397
region_4      1457  246  85.5549  14.4451
region_5       731   35  95.4308   4.5692
region_6       658   32  95.3623   4.6377
region_7      4327  516  89.3454  10.6546
region_8       602   53  91.9084   8.0916
region_9       412    8  98.0952   1.9048

is_promoted            0     1      %_0     %_1
education
Bachelor's         33661  3008  91.7969  8.2031
Below Secondary      738    67  91.6770  8.3230
Master's & above   13454  1471  90.1441  9.8559
NA                  2287   122  94.9357  5.0643

is_promoted      0     1      %_0     %_1
gender
f            14845  1467  91.0066  8.9934
m            35295  3201  91.6849  8.3151

is_promoted              0     1      %_0      %_1
recruitment_channel
other                27890  2556  91.6048   8.3952
referred              1004   138  87.9159  12.0841
sourcing             21246  1974  91.4987   8.5013

is_promoted               0     1      %_0      %_1
previous_year_rating
0.0000                 3785   339  91.7798   8.2202
1.0000                 6135    88  98.5859   1.4141
2.0000                 4044   181  95.7160   4.2840
3.0000                17263  1355  92.7221   7.2779
4.0000                 9093   784  92.0624   7.9376
5.0000                 9820  1921  83.6385  16.3615

is_promoted       0     1      %_0      %_1
KPIs_met >80%
0             34111  1406  96.0413   3.9587
1             16029  3262  83.0906  16.9094

is_promoted      0     1      %_0      %_1
awards_won?
0            49429  4109  92.3251   7.6749
1              711   559  55.9843  44.0157

Preparing Data for Modeling

# Creating dummy variables for all the categorical columns, dropping the original columns
master_data = convert_categorical_to_dummies(data)
print("Total shape of Data :",master_data.shape)

# Extracting the target variable as an array
labels = np.array(master_data['is_promoted'].tolist())

# Removing the id attributes
df_data = master_data.drop(['is_promoted','employee_id'],axis=1)
print("Shape of Data:",df_data.shape)
df = df_data.values
Total shape of Data : (54808, 61)
Shape of Data: (54808, 59)

Model 1 – XGB Classifier

xgb_model = xgb.XGBClassifier()
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1')
print("F1-score = ",f1_scores," Mean F1 score = ",np.mean(f1_scores))

# Training the models
xgb_model.fit(df,labels)

# Scoring on test set
test_data,score_xgb = score_on_test_set(xgb_model,"test.csv","submission_xgb.csv")
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1-score = [ 0.4526749   0.41547519  0.43579122  0.43012552  0.43427621]  
Mean F1 score =  0.433668606717

XGB Classifier : Parameter Tuning

Our goal is to set the model's hyper-parameters to values that let it learn the task as well as possible. Tuning the XGBoost classifier therefore means searching over the parameters that most affect the model.
I patiently ran a lot of iterations, which led to fine-tuning of n_estimators, max_depth and L1 regularization (reg_alpha). A common norm is to take baby steps (a small learning rate) while tuning the other parameters. Here, I found that the F1-scores kept improving with a large number of trees (n_estimators).

# Create parameters to search
params = {
     'learning_rate': [0.01],
     'n_estimators': [900,1000,1100],
     'max_depth':[7,8,9],
     'reg_alpha':[0.3,0.4,0.5]
    }

# Initializing the XGBoost classifier
xgb_model = xgb.XGBClassifier()

# Grid search initialization
gsearch = GridSearchCV(xgb_model, params,
                    verbose=True,
                    cv=5,
                    n_jobs=2)

gsearch.fit(df, labels)

# Printing the best chosen params
print("Best Parameters :",gsearch.best_params_)

params = {'objective':'binary:logistic', 'booster':'gbtree'}

# Updating the parameter as per grid search
params.update(gsearch.best_params_)

# Initializing the XGBoost classifier with the tuned parameters
xgb_model = xgb.XGBClassifier(**params)
print(xgb_model)

# Cross validation scores
f1_scores = cross_val_score(xgb_model, df, labels, cv=5, scoring='f1',n_jobs=2)
print("F1_scores per fold : ",f1_scores," \nMean F1_score= ",np.mean(f1_scores))

# Fitting model on tuned parameters
xgb_model.fit(df, labels)

# Scoring on test set
test_data,score_xgb_tuned = score_on_test_set(xgb_model,"test.csv","submission_xgb_tuned.csv")
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed: 13.0min finished
Best Parameters :{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 1000, 'reg_alpha': 0.4}

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=8, min_child_weight=1, missing=None, n_estimators=1000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.4, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
F1_scores per fold : [ 0.51014041  0.48657188  0.49528302  0.53054911  0.51130164]  
Mean F1_score= 0.506769210361

XGB Classifier : Setting threshold

How does the XGBoost classifier predict the class ('promoted' or 'not promoted')? It predicts a probability between 0 and 1 for each unseen case, and then assigns label 1 if that probability exceeds the default threshold of 0.5. On an imbalanced data-set like this one, that default can be a biased setting, as it is difficult to capture the rare event with a 0.5 threshold.

  • We can change the default threshold of 0.5 by finding the optimal threshold that increases the F1-score.
  • We need to find the threshold where the F1-score is highest.
  • I tried submissions at a few candidate cut-offs to get the maximum possible F1-score.

The following python code splits the data 90:10 and trains the XGBoost classifier with the tuned parameters. It calculates precision and recall at different thresholds and plots the precision-recall curve. We then calculate the F1-score at each threshold from the precision and recall values.

# Holding out 10% of the data to evaluate thresholds on unseen cases
X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.10, stratify=labels)

xgb_model = xgb.XGBClassifier(**params)

# Training the models
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict_proba(X_test)
precision, recall, thresholds = precision_recall_curve(y_test, y_pred[:,1])

thresholds = np.append(thresholds, 1)
f1_scores = 2*(precision*recall)/(precision+recall)
plt.step(recall, precision, color='b', alpha=0.4, where='post')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve')
plt.show()

[Plot: 2-class precision-recall curve]

Getting optimal threshold

We plot the F1-score against the threshold on the x-axis to locate the F1-score peak. The python code below picks the threshold value where the F1-score is highest.

scrs = pd.DataFrame({'precision' : precision, 'recal' : recall, 'thresholds' : thresholds, 'f1_score':f1_scores})
print("Threshold cutoff: ",scrs.loc[scrs['f1_score'] == scrs.f1_score.max(),'thresholds'].iloc[0])
print("Max F1-score at cut-off : ",scrs.f1_score.max())
scrs.plot(x='thresholds', y='f1_score')
Threshold cutoff:  0.340377241373
Max F1-score at cut-off :  0.53791130186
[Plot: F1-score vs. threshold]

Once you have the optimal threshold, use it as the cutoff on the test-set probability predictions to produce the class labels 0 and 1 for the final submission.
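A minimal sketch of that final step, reusing the cutoff computed above and the test-set probabilities returned by score_on_test_set (the submission file name is my own choice):

# Hypothetical final-submission step: apply the tuned cutoff instead of the default 0.5
best_threshold = scrs.loc[scrs['f1_score'] == scrs.f1_score.max(), 'thresholds'].iloc[0]
final_pred = (score_xgb_tuned[:, 1] > best_threshold).astype(int)
submission_df = pd.DataFrame({'employee_id': test_data['employee_id'],
                              'is_promoted': final_pred})
submission_df.to_csv("submission_xgb_threshold.csv", index=False)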

What did not work?

I tried the following other techniques, which did not work, and hence my final submissions were based on the single "XGBoost classifier" model described in this post.

  • I tried logistic regression and SVM; the F1-score was low (less than 0.4).
  • I tried Random Forest. The F1-score was comparatively low.
  • I tried a LightGBM model. With default settings it gave a 0.50 F1-score, but somehow it did not improve with parameter tuning. There was a little improvement when early_stopping_rounds was used, taking the best iteration for predictions on the test set.
  • I created some interaction variables, such as "if previous_year_rating == 5 and KPIs_met >80% == 1 then 1 else 0" and "if awards_won? == 1 and KPIs_met >80% == 1 then 1 else 0" (sketched below). They did not help.
  • Finally, I took the best tuned params of all three (RF, XGBoost and LightGBM) and stacked them with logistic regression as the meta-classifier. It did not give a better F1 than the individual XGB Classifier model.
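A minimal sketch of those interaction variables (the new column names are my own, purely illustrative):

# Hypothetical interaction features described above; they did not improve F1
data['rating5_and_kpi'] = ((data['previous_year_rating'] == 5) &
                           (data['KPIs_met >80%'] == 1)).astype(int)
data['award_and_kpi'] = ((data['awards_won?'] == 1) &
                         (data['KPIs_met >80%'] == 1)).astype(int)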

At the End

Readers are encouraged to download the data-set and check whether they can reproduce the results. I would also love to see in the comments if you can surpass the F1-score achieved in this blog-post. Here are a few other things one could try.

  • Generally, stacking improves scores when there are a lot of models. One could train, say, hundreds of XGBoost and LightGBM models (with slightly different parameters) and then apply logistic regression on top (I tried with only 3 models and failed).
  • One could also try an interaction variable capturing the total score achieved in training (number of trainings * average training score).
  • One could try setting early_stopping_rounds in XGBoost classifier training, which I did not try; it prevents over-fitting and can improve results (see the sketch below).
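A hedged sketch of that last idea, assuming the sklearn-wrapper signature of the xgboost version used elsewhere in this post (newer releases move early stopping into the constructor or callbacks):

# Assumed sketch: early stopping on a held-out validation set
X_tr, X_val, y_tr, y_val = train_test_split(df, labels, test_size=0.10,
                                            stratify=labels, random_state=42)
xgb_es = xgb.XGBClassifier(**params)
xgb_es.fit(X_tr, y_tr,
           eval_set=[(X_val, y_val)],
           eval_metric='logloss',
           early_stopping_rounds=50,   # stop when the eval metric has not improved for 50 rounds
           verbose=False)
print("Best iteration:", xgb_es.best_iteration)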

The full implementation of the approach followed, along with a LightGBM model example (jupyter notebook), can be downloaded from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post; I would love to hear your feedback.

Happy data analytics 🙂