Wednesday, July 2, 2014

Improving the Performance of your Model

In my first tutorial (part 1, part 2), I presented some basic concepts about Machine Learning for classification by implementing Logistic Regression through Gradient Descent. Then I presented one of several ways to scale that same algorithm by rewriting it to run on a Hadoop cluster. The main goal was to focus on core concepts and not so much on the results produced.

Now I want to guide you through some investigative approaches to assess the performance of the model constructed. By performance, I mean how well the model is capable of predicting future outcomes.

At the end of that tutorial, in part 2, I gave some hints by showing a cross tabulation of the results:

Cross Tabulation:
      y_pred
y_test   0   1
     0 137   6
     1  43  14

That means we correctly predicted 137 out of (137+6), or 95.8%, of the good credit cases, and 14 out of (43+14), or 24.6%, of the bad credit cases.

Do you notice the problem? Although we got a general accuracy of 75.5%, we incorrectly classified 75.4% (100 - 24.6) of the bad credit applications as good credit. If this model were used by a bank, that would mean huge financial losses.
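These rates can be computed directly in R from the confusion matrix. A small sketch, using the counts from the cross tabulation above:

# confusion matrix from the cross tabulation above
# (rows: actual class, columns: predicted class; 0 = good credit, 1 = bad credit)
cm <- matrix(c(137, 6, 43, 14), nrow = 2, byrow = TRUE,
             dimnames = list(y_test = c(0, 1), y_pred = c(0, 1)))

sum(diag(cm)) / sum(cm)   # overall accuracy: 0.755
cm[1, 1] / sum(cm[1, ])   # good credit correctly predicted: 0.958
cm[2, 2] / sum(cm[2, ])   # bad credit correctly predicted: 0.246
cm[2, 1] / sum(cm[2, ])   # bad credit misclassified as good: 0.754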

Now let's see how this can be improved. The first step is to take a look at the available data. If you have some knowledge of the specific business domain in question, it certainly helps a lot. Otherwise, there are some simple things that are general best practices, like looking for possible outliers, missing values, and the overall data distribution.

So, let's start by loading the data into R:

data <- read.table("german.data-numeric")
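The file is the numeric version of the Statlog German Credit data set from the UCI Machine Learning Repository. Assuming the usual repository path (an assumption; check the data set page if it has moved), it could also be read directly from the web:

# assuming the standard UCI repository location for the Statlog German Credit data
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric"
data <- read.table(url)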

This gives us an R data frame object ('data'). We can start by inspecting its overall structure:

str(data)

This will show us an output like this:

'data.frame': 1000 obs. of  25 variables:
 $ V1 : int  1 2 4 1 1 4 4 2 4 2 ...
 $ V2 : int  6 48 12 42 24 36 24 36 12 30 ...
 $ V3 : int  4 2 4 2 3 2 2 2 2 4 ...
 $ V4 : int  12 60 21 79 49 91 28 69 31 52 ...
 $ V5 : int  5 1 1 1 1 5 3 1 4 1 ...
 ...

You can explore some characteristics of this data through some simple R plots. For example, let's see the histogram for the variable 'V2', which is the duration of the loan, in months:

hist(data$V2)

And we will get a histogram showing how the loan durations are distributed.

We can also see its summary statistics:

summary(data$V2)

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 4.0    12.0    18.0    20.9    24.0    72.0

And notice that the loan periods range from 4 to 72 months, with an average of 20.9 months.

There are no obvious outliers here, but box plots also help with visualizing them. For example, for the variable 'V10', the age of the customers:

boxplot(data$V10)

The resulting box plot shows that the age of the customers seems to be in a reasonable range.

If you have some business domain knowledge in credit risk assessment (not my case), you could explore more complex visualizations, like this one:

library(plyr)      # provides the count() function
library(ggplot2)

# number of loans per duration (V2), separately for good (V25==1) and bad (V25==2) credit
data_to_plot1 <- data.frame(cbind(1,count(data[data$V25==1,]$V2)))
names(data_to_plot1) <- c('credit','months','count')
data_to_plot2 <- data.frame(cbind(2,count(data[data$V25==2,]$V2)))
names(data_to_plot2) <- c('credit','months','count')
data_to_plot <- rbind(data_to_plot1, data_to_plot2)

ggplot(data_to_plot, aes(x=factor(months), y=count, fill=factor(credit))) +
 geom_bar(stat='identity') +
 labs(x='months of loan', y='number of loans', fill='type of loan', title='')

Which produces the following plot:

It suggests that the proportion of bad loans relative to good loans tends to be larger for loans with durations greater than 36 months.
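That impression can also be checked numerically. A quick sketch, assuming as before that V25 == 2 marks bad credit:

# percentage of good (1) and bad (2) credit for loans up to 36 months vs. longer
round(prop.table(table(over_36_months = data$V2 > 36, credit = data$V25), 1) * 100, 1)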

Now, let's take a look at the data preparation step. I would recommend getting rid of extreme outliers, null values, and missing values. That is not necessary for this data set, as it seems to have been purposely prepared for machine learning research.

For example, if that were the case, we could check for and eliminate missing values, which are represented in R by the value NA:

# check every column for missing values (all results are integer(0), i.e. none found)
for(i in 1:ncol(data)) print(which(is.na(data[,i])))

integer(0)
integer(0)
integer(0)
integer(0)
integer(0)
...

# if any missing values had been found, the affected rows could be removed:
data <- na.omit(data)
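Likewise, if a variable such as the loan duration contained extreme outliers, a simple rule of thumb like the 1.5 * IQR criterion could be used to drop them. A sketch (not applied here, since this data does not need it):

# example: keep only rows whose loan duration (V2) lies within 1.5 * IQR of the quartiles
q <- quantile(data$V2, c(0.25, 0.75))
limits <- c(q[1] - 1.5 * IQR(data$V2), q[2] + 1.5 * IQR(data$V2))
data_filtered <- data[data$V2 >= limits[1] & data$V2 <= limits[2], ]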

Now let's see the proportion of good and bad credit examples:

round(prop.table(table(data$V25))*100, 2)

 1  2 
70 30

This shows us that, of the 1,000 examples in the data set, 70% are examples of good credit and 30% are examples of bad credit. We have to take that into consideration when we split this data set for training and testing our model. So, let's prepare our training set so that it contains the same number of good and bad credit examples and see if its performance on the test set improves:

# undersample the good credit examples (y_train == 0) so that the training set
# ends up with as many good as bad credit examples
index <- sample(which(y_train==0),length(which(y_train==1)))
index <- c(index, which(y_train==1))
X_train <- X_train[index,]
y_train <- y_train[index]
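As a side note, the X_train and y_train used above come from the data preparation of the earlier tutorial parts, where the outcome was recoded from 1 and 2 to 0 and 1 and the data was split into training and test sets. A minimal sketch of that preparation (the simple random split and the seed are assumptions; the 200-example test set matches the cross table further down):

# recode the outcome: 1 (good credit) -> 0, 2 (bad credit) -> 1
y <- ifelse(data$V25 == 2, 1, 0)
X <- as.matrix(data[, 1:24])

# hold out 200 of the 1,000 examples for testing (simple random split; seed assumed)
set.seed(1)
test_index <- sample(nrow(data), 200)
X_test <- X[test_index, ]
y_test <- y[test_index]
X_train <- X[-test_index, ]
y_train <- y[-test_index]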

Now let's check the proportion of good (0) and bad (1) credit in the training data, to see that the split is what we want. Notice that this step is performed after adjusting the outcome values for good and bad credit from 1 and 2 to 0 and 1, respectively, which is needed to perform binary classification with Logistic Regression:

round(prop.table(table(y_train))*100, 2)

y_train
 0  1 
50 50

After running the training and test steps, we get the following results:

steps:  1351 
corrects:  145 
wrongs:  55 
accuracy:  0.725

But let's see what really matters, the cross table results:

library(gmodels)    # provides the CrossTable() function

print(CrossTable(y_test,round(y_pred), dnn=c('y_test','y_pred'), prop.chisq = FALSE,
prop.c = FALSE, prop.r = FALSE))

...
Total Observations in Table:  200 

             | y_pred 
      y_test |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------|
           0 |       104 |        39 |       143 | 
             |     0.520 |     0.195 |           | 
-------------|-----------|-----------|-----------|
           1 |        16 |        41 |        57 | 
             |     0.080 |     0.205 |           | 
-------------|-----------|-----------|-----------|
Column Total |       120 |        80 |       200 | 
-------------|-----------|-----------|-----------|

$t
   y
x     0   1
  0 104  39
  1  16  41

$prop.row
   y
x           0         1
  0 0.7272727 0.2727273
  1 0.2807018 0.7192982
...

That is a big improvement over the previous results. By better balancing our training data, we reduced the misclassification of bad credit (bad credit classified as good credit) from 75.4% to 28.1%. The trade-off is that the misclassification of good credit increased from 4.2% to 27.3%.

Now that our results are better balanced, let's see how our model performs on its own training data. With that, we aim to find out whether the model is overfitting or underfitting the data:

# 'hypot' is the hypothesis (sigmoid) function from the earlier tutorial parts
y_pred <- hypot(X_train%*%theta)

print(CrossTable(y_train,round(y_pred), dnn=c('y_train','y_pred'), prop.chisq = FALSE,
prop.c = FALSE, prop.r = FALSE))

...
Total Observations in Table:  486 
 
             | y_pred 
     y_train |         0 |         1 | Row Total | 
-------------|-----------|-----------|-----------|
           0 |       181 |        62 |       243 | 
             |     0.372 |     0.128 |           | 
-------------|-----------|-----------|-----------|
           1 |        63 |       180 |       243 | 
             |     0.130 |     0.370 |           | 
-------------|-----------|-----------|-----------|
Column Total |       244 |       242 |       486 | 
-------------|-----------|-----------|-----------|
 
$t
   y
x     0   1
  0 181  62
  1  63 180

$prop.row
   y
x           0         1
  0 0.7448560 0.2551440
  1 0.2592593 0.7407407
...

It does not seem to be a case of overfitting, as the errors on the training data are very similar to the errors on the test data, even with a relatively small number of examples to train on. We should investigate further to see if the model is underfitting, or if it is performing just right given the available data.

Let's then analyze how the error on both the training and test data varies with the number of examples in the training data. We will plot the Mean Squared Error (MSE) of the predicted classes for both the training and test data against the number of examples used to train the model, varying the training data size from 25 to 550 examples, equally divided between the good and bad credit classes. We will also set aside 10% of the data set as the test data, keeping the original proportion of 70% good credit and 30% bad credit:
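The code that produced the learning curve is not reproduced here, but the idea can be sketched as follows. R's built-in glm() with a binomial family is used as a stand-in for the gradient descent trainer of parts 1 and 2, and the step size of 25 examples and the random seed are assumptions:

library(ggplot2)

set.seed(42)                           # seed is an assumption
y_all <- ifelse(data$V25 == 2, 1, 0)   # recode outcome: good -> 0, bad -> 1

# hold out 10% of the data as the test set, keeping the 70/30 class proportion
test_index <- c(sample(which(y_all == 0), 70), sample(which(y_all == 1), 30))
test <- data[test_index, 1:24]
y_test <- y_all[test_index]
pool <- data[-test_index, 1:24]
y_pool <- y_all[-test_index]

curve <- data.frame()
for (m in seq(25, 550, by = 25)) {     # training set sizes
  # sample m/2 good and m/2 bad examples (the pool has only 270 bad examples,
  # so the bad-class sample is capped at that number for the largest sizes)
  n_half <- m %/% 2
  idx <- c(sample(which(y_pool == 0), n_half),
           sample(which(y_pool == 1), min(n_half, sum(y_pool == 1))))
  train <- data.frame(pool[idx, ], y = y_pool[idx])

  # glm() with a binomial family stands in for the gradient descent trainer
  fit <- glm(y ~ ., data = train, family = binomial)

  # MSE of the predicted classes on the training and test data
  mse_train <- mean((round(predict(fit, train, type = "response")) - train$y)^2)
  mse_test  <- mean((round(predict(fit, test, type = "response")) - y_test)^2)
  curve <- rbind(curve, data.frame(size = nrow(train), set = c("train", "test"),
                                   mse = c(mse_train, mse_test)))
}

ggplot(curve, aes(x = size, y = mse, colour = set)) + geom_line() +
  labs(x = 'number of training examples', y = 'MSE', colour = '')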

This learning curve does not look like a typical overfitting curve, as there is no persistent gap between the training and test errors after a certain number of examples. Therefore, measures aimed at overfitting, such as adding more examples or adding a regularization term to the cost function of Logistic Regression, are unlikely to help here.

If this were a case of underfitting, we should try to come up with new relevant features that help to better describe our data. That would be a task for someone with access to the right data and a good understanding of the credit business problem.

But even if we cannot add more features, we can still get a better understanding of how well the features we have at hand describe our data. Two common approaches for that are Exploratory Factor Analysis for feature selection and Principal Component Analysis.

Trying to understand attribute importance through Exploratory Factor Analysis or Principal Component Analysis allows us to identify a subset of variables that better describes our data and accounts for as much of the variability in the data as possible (in case such a subset exists).

Unlike feature selection methods, Principal Component Analysis creates a new set of linearly uncorrelated features from the original variables; the number of new features is at most equal to the number of original variables.
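As an illustration, such a PCA pre-processing step could look like the sketch below, which uses R's prcomp() on the training variables and keeps the first 8 components (the number used in the results further down); the projection learned on the training data is then applied to the test data:

# principal components derived from the (centered and scaled) training variables
pca <- prcomp(X_train, center = TRUE, scale. = TRUE)

# keep the first 8 components as the new, linearly uncorrelated features
X_train_pca <- pca$x[, 1:8]

# apply the same projection to the test data
X_test_pca <- predict(pca, newdata = X_test)[, 1:8]

The Logistic Regression model would then be trained on X_train_pca, with the intercept term added exactly as before.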

It is out of the scope of this tutorial to dive deeper into the field of Exploratory Data Analysis. Nonetheless, I will show you some of the benefits we can get when applying those techniques as a pre-processing step for our Logistic Regression classifier. The results below are averages over 50 independent runs of Logistic Regression:

=> using all original variables:
about 1300 steps to converge
error (good credit classified as bad):  0.2288
error (bad credit classified as good):  0.2421333

=> 12 most important original variables (from Feature Selection):
about 2400 steps to converge
error (good credit classified as bad):  0.184
error (bad credit classified as good):  0.2874667

=> 8 most important new features (from Principal Component Analysis):
about 250 steps to converge
error (good credit classified as bad):  0.1696
error (bad credit classified as good):  0.2888

We can clearly see that by applying Principal Component Analysis and keeping the 8 most important derived features, we can further reduce the false positive error (good credit classified as bad) to about 17%. This seems to be a very desirable effect for the credit business, even at the expense of a slight increase in the false negative error.

Another desirable benefit of this approach is the simplification of the model. We are able to reduce the 24 original variables to only 8 new features that seem to model the data better. With this simplification, training through Gradient Descent takes only about 250 steps, instead of about 1,300 in the original setup.

©2014 - Alexandre Vilcek