Modelling with Categorical Variables

Building predictive models or analysing data to glean insights is relatively easy when working with numerical data. However, when our data consists mostly of categorical variables (e.g. survey data), it becomes tricky, because many of the techniques used to evaluate or understand numerical data cannot be used for categorical or discrete variables.

In this article we share some highlights from the process we used to predict used car sales prices from data downloaded from Kaggle. The data contains various car properties that you would find on a website selling used cars. You can download the code for all our workings here.

Some highlights to look forward to:

  • Preparing your data
  • Visualising categorical data
  • Feature Selection
  • Building the model

Preparing your data

Converting categorical variables to dummy variables is typically a step you need to take in order to use categorical variables in either classification or regression models. MATLAB has a dummyvar function that does this conversion. However, when fitting a linear regression model in MATLAB, you can include the categorical variables directly in the model, as the algorithm will do the dummy variable conversion for you. Another consideration: if a categorical variable consists of n discrete categories, the dummy variable equivalent will consist of n logical columns. One needs to be careful not to fall into the ‘dummy variable trap’ of including all n columns in your model, which causes multicollinearity. When fitting models in the Statistics and Machine Learning Toolbox, the algorithm will also only use n-1 of the logical variables for each feature.
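A minimal sketch of the point above, using a toy categorical variable (not taken from the car dataset):

```matlab
% A categorical variable with n = 3 categories.
bodyType = categorical({'sedan'; 'hatchback'; 'sedan'; 'suv'});

% dummyvar returns one column per category: a 4-by-3 matrix here.
D = dummyvar(bodyType);

% Dummy variable trap: each row of D sums to 1, so including all n
% columns alongside an intercept causes perfect multicollinearity.
% If building the design matrix by hand, drop one reference column:
Dref = D(:, 2:end);          % keep n-1 = 2 columns

% With fitlm you can skip dummyvar entirely: pass the categorical
% variable in a table and the fitting routine encodes n-1 dummies.
```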

Visualising categorical data

If we were modelling with numerical data, we would probably create a correlation matrix to look at the relationships between the predictor variables and the response variable, and between the predictors themselves. Identifying possible multicollinearity would be an important step. To inspect the inter-variable relationships between categorical variables, we can instead use cross tabulation. This does not provide a metric such as a correlation coefficient to rely on, but rather a visual interrogation method. A heat map is a great way of visualising it, where the colour in each block represents the number of cars for a given body type and manufacturer.
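A sketch of the cross tabulation and heat map, assuming the data is in a table named `cars` with categorical columns matching the Kaggle names:

```matlab
% Cross tabulation of the two categorical variables: counts per
% (manufacturer, body type) pair, plus the category labels.
[counts, ~, ~, labels] = crosstab(cars.manufacturer_name, cars.body_type);

% heatmap can also aggregate directly from the table: with no colour
% variable specified, the colour in each block is the count of rows.
figure
heatmap(cars, 'body_type', 'manufacturer_name');
```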

We suspected that model_name would be unique to manufacturer_name and would be an example of multicollinearity. There happened to be only 6 observations out of 38,521 where the model_name was not unique. Below we use word clouds to inspect how the model names overlap with the body_type variable for each manufacturer_name.
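A hedged sketch of one such word cloud, again assuming a table `cars` with the Kaggle column names (the manufacturer picked here is just an example):

```matlab
% Word cloud of model names for a single manufacturer; with a
% categorical input, wordcloud sizes each word by its category count.
mfr = "Audi";                                   % example manufacturer
idx = cars.manufacturer_name == mfr;

figure
wordcloud(categorical(cars.model_name(idx)));
title("Models listed for " + mfr)
```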

There seems to be significant overlap between models that are hatchback, sedan and universal. This is often because a specific model is available in more than one body type. Based on this, body_type appears to be an adequate differentiator within a given manufacturer, so model_name can be excluded from the model.

Feature Selection

We used F-tests to rank the importance of both the numerical and categorical variables, and then used the RReliefF algorithm to rank the importance of the numerical variables. The results were different, as you would expect from two different types of algorithm; however, in both cases the duration_listed variable was ranked low or lowest, and it was subsequently removed from the model. The R-squared reduced by only 0.001, confirming the variable’s insignificance.
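The two rankings can be sketched as follows (a sketch only — table and column names are assumed from the Kaggle dataset):

```matlab
% F-test ranking: fsrftest handles both numerical and categorical
% predictors, ranking each by a univariate F-test against the response.
[idxF, scoresF] = fsrftest(cars, 'price_usd');

% RReliefF ranking: relieff needs a numeric matrix, so it is applied
% to the numerical variables only.
numVars = cars(:, vartype('numeric'));
X = table2array(removevars(numVars, 'price_usd'));
[idxR, weightsR] = relieff(X, cars.price_usd, 10);   % 10 nearest neighbours
```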

Building a Model

We are going to build a linear model as it is the easiest model to understand and is typically a good model to use as a baseline to compare other models to. This wasn’t an exhaustive model development exercise, but rather an exploration of what approaches are useful when working with categorical variables.

There are several approaches we could take:

  1. Fit the model to the entire dataset (R-squared = 0.784, RMSE = 2.81 × 10³)
  2. Use stepwise regression to build the model
  3. Fit the model using the numerical variables only, then add the categorical variables (R-squared = 0.856, RMSE = 2.29 × 10³)

All features (1): Fitting the model to the entire dataset means you are limited to a very simple linear model with no higher-order terms. If you increase the ‘order’ of the model to include interaction terms, numerous interaction terms are added to the model that do not necessarily improve its predictive ability.
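A minimal sketch of approach 1, assuming the data is in a table `cars` with response price_usd:

```matlab
% fitlm treats categorical columns as categorical automatically and
% encodes n-1 dummy columns per variable.
mdl1 = fitlm(cars, 'linear', 'ResponseVar', 'price_usd');

mdl1.Rsquared.Ordinary   % 0.784 in our run
mdl1.RMSE                % 2.81e3 in our run
```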

Stepwise regression (2): When attempting a stepwise regression, only interaction terms were added to the model, and no linear terms. This not only took forever to run, but resulted in poor predictive ability and an overly complicated model.
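The stepwise attempt looked roughly like this (call details assumed):

```matlab
% Start from a constant model and let stepwiselm add or remove terms,
% allowing up to two-way interactions. In our case this was very slow
% and added only interaction terms.
mdl2 = stepwiselm(cars, 'constant', ...
                  'ResponseVar', 'price_usd', ...
                  'Upper', 'interactions');
```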

The winner is number 3!

Here we fit a linear model with only the numerical variables to start, and then use the step function to add terms.
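A sketch of approach 3, with column names taken from the Kaggle dataset and the exact formula assumed for illustration (duration_listed is left out, as above):

```matlab
% Fit on the full table but restrict the starting formula to a few
% numerical predictors only.
mdl3 = fitlm(cars, ...
    'price_usd ~ odometer_value + year_produced + engine_capacity');

% step then greedily adds (or removes) terms drawn from the remaining
% variables in `cars`, one change per step, while the fit improves.
mdl3 = step(mdl3, 'NSteps', 20);
```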

