With machine learning, a common concern is that it can be difficult to see how decisions are made, or why a seemingly opaque model works. In this example, we look at a class of machine learning models whose decision-making processes are easier and more intuitive to analyse. Credit risk poses a serious threat to the operations of a lending business, and this is where machine learning comes in. In this article, we explore machine learning models in MATLAB that aim to classify whether or not somebody will default on their credit card payment in the next 30 days. We develop models that are intuitive to understand and can be quickly prototyped and improved, and we provide a live script so you can get started with model development immediately.
Given a variety of customer input data – age, gender, educational level, and billing history, amongst others – we develop a decision tree to classify credit card default. While this can also be done graphically using MATLAB’s Classification Learner app, we train models in code so that we can develop and refine them quickly. This is a swift, ad-hoc way to feed our data into the most basic model and generate predictions. We train a decision tree as below:
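A minimal sketch of that one-line training call, assuming the customer records are held in a table named `creditData` with a response variable named `Default` (both names are illustrative, not from the original script):

```matlab
% Train a single classification tree; 'Default' is the response
% variable, and all other table columns are used as predictors.
mdl1 = fitctree(creditData, 'Default');
```

Note that fitctree handles mixed numeric and categorical predictors in a table automatically, which is what makes this one-liner possible.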
While this crudely developed model is only approximately 73% accurate, the training took just one line of code!
Let’s look at the model’s confusion matrix below in Figure 1:
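A confusion matrix like the one in Figure 1 can be produced in a couple of lines. This sketch assumes a held-out table `testData` with the same variables as the training data (the partitioning step is omitted here):

```matlab
% Predict labels on held-out data and plot the confusion matrix.
predicted = predict(mdl1, testData);
confusionchart(testData.Default, predicted);
```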
The true values lie on the rows of the matrix and the model’s predictions on the columns, so elements on the diagonal are correctly classified. We see a rather high number of misclassifications in the top-right corner of the matrix: customers who did default but were predicted not to. This is a real problem, because a customer who is going to default on their next payment but is identified as not defaulting represents a direct liability for the bank. To probe this further, let’s look at the occurrences of “default” and “not default” in the training data set in Figure 2:
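The class counts shown in Figure 2 can be inspected with a one-line histogram, again assuming the response column is named `Default`:

```matlab
% Plot the number of "default" vs "no default" cases in the training set.
histogram(categorical(creditData.Default));
```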
We can see that there are far more cases of “no default” in the training data than “default”. This class imbalance biases the classification tree towards predicting the majority class. To fix this, let’s train a new model: an ensemble of 100 classification trees, where each tree tries to reduce the error made by the trees before it. This approach also lets us apply random under-sampling, in which the over-represented outcome (in our case, “no default”) contributes proportionally fewer observations to each round of training. By using the simple line of code below, we train an improved machine learning model:
This results in a more balanced proportion of “no default” and “default” cases for the model. The confusion matrix for this model is seen in Figure 3:
This model has an accuracy of about 76%, but the main attraction is that we have successfully improved on the first model, as shown by the significantly lower number in the top-right corner of the matrix. Misclassifying a “no default” customer as “default” is far less costly than the reverse, so this trade-off is acceptable. An accuracy of 76% is still not ideal, but MATLAB has an arsenal of other functions and approaches to try; this serves as a useful starting point.
A great advantage of using trees is that it is easy to follow the decision-making process involved with classification. This is important in the context of a bank where increased levels of transparency are needed. We can have a look at one of the trees (remember, the model is made of 100 trees!) from model 2, below in Figure 4. The decision-making process is clearly illustrated by means of this tree.
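An individual learner from the ensemble, as shown in Figure 4, can be displayed with the `view` function; this sketch pulls out the first of the 100 trained trees:

```matlab
% Display the first tree of the ensemble as an interactive graph.
view(mdl2.Trained{1}, 'Mode', 'graph');
```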
We have provided you with a MATLAB live script for you to experiment with machine learning for credit classification. In doing so, you will be able to appreciate the strength of MATLAB for developing and improving machine learning models rapidly and with ease. Through this example, we have also shown how you can create models that are intuitive and that aim to demystify machine learning.