Feature Selection made easy with ‘screenpredictors’

Pre-processing features, or predictor variables, for machine learning models is usually a tedious process. MathWorks has come to the rescue with a function called screenpredictors, released with MATLAB R2019a. Several customers have commented on its usefulness, which inspired me to write this post.

The function evaluates the predictive power of each predictor variable on its own (a univariate test) before the variables are used in a logistic regression model, specifically a probability of default model for a credit scorecard. The test does not consider interactions between variables, so it does not remove the need for factor analysis altogether. It is, however, useful for eliminating variables with poor predictive power, which saves you time by keeping weak variables out of your model. This is especially helpful if your credit data consists of thousands of variables. The screenpredictors function gives you a convenient and methodical way to assess the predictive power of these variables, and an objective measure by which to accept or reject them. It calculates several metrics (listed later in this post), which can also be used for reporting purposes.

The figure below shows the information value (InfoValue) for a dataset containing customer credit card data (customer age, time with current bank, etc.). The green predictors have passed the test and are deemed likely to increase the predictive power of a model you subsequently develop, whereas the red ones have failed it.

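[Figure: InfoValue per predictor, with passing predictors in green, failing predictors in red, and the threshold shown as a dashed line.]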

The screenpredictors function calculates the metrics listed below, each of which measures the predictive strength of a variable based on a different property of the data. For example, the InfoValue measures how much the distribution of a predictor differs between the two classes of the binary response variable. Each predictor can then be deemed suitable for further evaluation or not, based on a threshold value that you determine (the dashed line in the figure above). Predictors that “fail” all or several of these tests (that is, fall below or above the threshold value, depending on the test) can then be disregarded. In this way, the screenpredictors function is a very useful filtering tool.
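To make the InfoValue idea concrete, here is a minimal Python sketch of the textbook information value calculation for a predictor that has already been binned. It is an illustration only, not the toolbox's implementation: the function name, the 0.5 count shift, the toy data and the 0.05 threshold are all assumptions made for this example.

```python
import math
from collections import Counter

def information_value(bins, response, shift=0.5):
    """Textbook information value of a binned predictor against a 0/1 response.

    bins     : bin label of each observation
    response : 0 or 1 for each observation (1 = default, by assumption)
    shift    : small count added to every cell so that an empty cell
               does not lead to log(0) (an assumption of this sketch)
    """
    zeros = Counter(b for b, y in zip(bins, response) if y == 0)
    ones = Counter(b for b, y in zip(bins, response) if y == 1)
    labels = list(dict.fromkeys(bins))  # unique bin labels, in order of appearance
    total_zeros = sum(zeros[b] + shift for b in labels)
    total_ones = sum(ones[b] + shift for b in labels)
    iv = 0.0
    for b in labels:
        p_zero = (zeros[b] + shift) / total_zeros  # share of class 0 in this bin
        p_one = (ones[b] + shift) / total_ones     # share of class 1 in this bin
        iv += (p_zero - p_one) * math.log(p_zero / p_one)
    return iv

# Toy data: one predictor that separates the classes, one that does not.
strong_bins = ["young"] * 10 + ["old"] * 10
strong_resp = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
noise_bins = ["a", "b"] * 10
noise_resp = [1, 1, 0, 0] * 5

threshold = 0.05  # hypothetical cut-off; choose your own
iv_strong = information_value(strong_bins, strong_resp)
iv_noise = information_value(noise_bins, noise_resp)
print("strong predictor passes:", iv_strong > threshold)
print("noise predictor passes:", iv_noise > threshold)
```

The predictor whose bins are distributed very differently across the two classes gets a large value, while the one with identical distributions in both classes gets a value of zero and falls below the threshold.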

You can see how the function is used in this MathWorks example. An alternative filtering approach is shown in the example and summarised in the picture below, where you select the predictors that “pass” all or several of the tests.
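The "pass all or several tests" idea can be sketched in a few lines of Python. Everything here is hypothetical: the predictor names, the metric values and the thresholds are made-up numbers for illustration, not output from screenpredictors.

```python
import operator

# Made-up metric values for three hypothetical predictors.
metrics = {
    "predictor_a": {"InfoValue": 0.20, "AUROC": 0.63, "PercentMissing": 0.00},
    "predictor_b": {"InfoValue": 0.03, "AUROC": 0.58, "PercentMissing": 0.02},
    "predictor_c": {"InfoValue": 0.01, "AUROC": 0.51, "PercentMissing": 0.25},
}

# One rule per metric: (comparison, threshold). Some tests pass above the
# threshold and some below it, so the direction is stored with the rule.
rules = {
    "InfoValue": (operator.gt, 0.05),
    "AUROC": (operator.gt, 0.55),
    "PercentMissing": (operator.lt, 0.10),
}

def count_passes(row, rules):
    """Number of tests that one predictor passes."""
    return sum(1 for name, (compare, limit) in rules.items()
               if compare(row[name], limit))

def select_predictors(metrics, rules, min_passes):
    """Keep the predictors that pass at least `min_passes` tests."""
    return [name for name, row in metrics.items()
            if count_passes(row, rules) >= min_passes]

pass_all = select_predictors(metrics, rules, min_passes=len(rules))
pass_two = select_predictors(metrics, rules, min_passes=2)
print("pass all tests:", pass_all)
print("pass at least two tests:", pass_two)
```

Requiring every test to pass is the strictest filter; lowering `min_passes` keeps predictors that look weak on one metric but acceptable on the others.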

[Figure: summary of which predictors pass each test.]

The metrics output by the screenpredictors function:

  • Information value (InfoValue)
  • Accuracy ratio
  • Area under the ROC curve
  • Entropy
  • Gini
  • Chi-square p-value
  • Percentage of missing values in the predictor
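Two of these metrics are simple enough to sketch by hand. The Python below uses the usual textbook definitions, which are not necessarily how the toolbox computes them: the area under the ROC curve for a single numeric predictor, the closely related accuracy ratio (taken here as 2 × AUROC − 1), and the share of missing values. The function names and the toy data are assumptions for this example, and missing values are represented as None.

```python
def percent_missing(values):
    """Share of observations where the predictor is missing (None here)."""
    return sum(v is None for v in values) / len(values)

def auroc(values, response):
    """Chance that a random class-1 case has a higher predictor value than
    a random class-0 case. Ties count as half; missing values are skipped."""
    ones = [v for v, y in zip(values, response) if v is not None and y == 1]
    zeros = [v for v, y in zip(values, response) if v is not None and y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in ones for b in zeros)
    return wins / (len(ones) * len(zeros))

# Toy predictor with one missing value; the response is 0/1.
values = [0.9, 0.8, None, 0.4, 0.3, 0.2]
response = [1, 1, 1, 0, 1, 0]

auroc_value = auroc(values, response)
accuracy_ratio = 2 * auroc_value - 1
missing_share = percent_missing(values)
print("AUROC:", round(auroc_value, 3))
print("Accuracy ratio:", round(accuracy_ratio, 3))
print("Share missing:", round(missing_share, 3))
```

An AUROC of 0.5 (accuracy ratio of 0) means the predictor ranks the two classes no better than chance, which is why values close to that level are natural candidates for a threshold.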

The screenpredictors function forms part of the Credit Scorecard development functionality found in the Risk Management Toolbox.

As you can see, the screenpredictors function can remove much of the tedium and uncertainty from variable selection and gives you a quick, systematic way to build a more robust model.

Meet the Author

Talita is a Finance Team Leader at Opti-Num Solutions and is a specialist in the area of data analysis using MathWorks tools. Prior to joining the team she worked as a research and development chemist as well as a production planner at Prominent Paints. Talita has a BSc(Hons) in Textile and Polymer Science from the University of Stellenbosch and an MSc in Mechanical Engineering from the University of Cape Town. During her MSc she investigated the adhesive properties of composite structures (Fibre Metal Laminates) under dynamic loading.
