Fraud Detection in Online Banking Data

Digital banking fraud resulted in a total financial loss of R284.2m in 2019, an 8% increase from 2018 [1]. Detecting fraud rapidly helps banking institutions prevent financial losses and maintain their relationships with clients. In this article we share our insights from an investigation in which we used a Hidden Markov Model to classify whether an online banking session is fraudulent, based on the posterior probability of fraud.

Description of Data

Traditional classification approaches train a model on specific features, either present in the data or engineered, together with a response variable that identifies whether an online banking session was fraudulent or not. The online banking data we explored is structured somewhat differently: it is “click” sequence data. Challenges in applying traditional classification techniques to this data include:

  1. The number of “clicks” in each online banking session is inconsistent.
  2. The order in which “clicks” occur is important.
  3. You can’t wait until an online banking session is complete to gather all the features before making a prediction (this is too late). A fraud prediction is required at each “click” in the online banking session to ensure timely fraud detection.

Figure 1 shows an example of a typical sequence of interactions or “clicks” made by the user during an online banking session.



What is a Hidden Markov Model?

A Hidden Markov Model was investigated as an alternative approach to classifying sequential “click” data. In a Hidden Markov Model, a sequence of emissions is observed (i.e. the “click” sequence data logged during the online banking session), but the sequence of states the model passes through to generate those emissions is unknown (i.e. the hidden states, which indicate whether the banking session is fraudulent or not) [2]. In this use case, the Hidden Markov Model helps determine the posterior probability of the model being in a particular state (fraud or non-fraud) at any point in the observed “click” sequence.
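
To make the posterior probability concrete, the standard forward recursion for a two-state Hidden Markov Model can be written as follows. The notation is ours and is not taken from the investigation: \(S_t\) is the hidden state at click \(t\), \(c_t\) is the observed click, \(a_{ij}\) is the transition probability from state \(i\) to state \(j\), \(b_j(c)\) is the probability of emitting click \(c\) in state \(j\), and \(\pi_j\) is an assumed initial state distribution.

```latex
\begin{aligned}
\alpha_1(j) &= \pi_j \, b_j(c_1), \\
\alpha_t(j) &= b_j(c_t) \sum_{i} \alpha_{t-1}(i) \, a_{ij}, \qquad t \ge 2, \\
P(S_t = j \mid c_1, \dots, c_t) &= \frac{\alpha_t(j)}{\sum_{k} \alpha_t(k)} .
\end{aligned}
```

The last line is the probability of state \(j\) given only the clicks seen so far, which is the quantity needed when a decision must be made before the session ends.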

Why did we use it?

A Hidden Markov Model allows a probability to be assigned to each “click” in the sequence. As each new action or “click” occurs, the model analyses the sequence of events so far and updates the probability of fraud as the online banking session progresses. A probability threshold is chosen and then used to classify whether the session is fraudulent at each “click” in the sequence. If the fraud probability from the Hidden Markov Model exceeds the threshold, the session is flagged for further investigation or for alternative security measures. Figure 2 shows a graphical representation of this process.
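
The per-click scoring described above can be sketched in Python with NumPy. This is a minimal illustration of the forward (filtering) algorithm, not the model from the investigation: the transition matrix, emission matrix, initial distribution, click encoding, and the 0.5 threshold are all hypothetical values chosen for the example.

```python
import numpy as np

def filtered_fraud_probabilities(clicks, trans, emis, start):
    """Return P(state = fraud | clicks seen so far) after every click."""
    trans = np.asarray(trans, dtype=float)
    emis = np.asarray(emis, dtype=float)
    belief = np.asarray(start, dtype=float)
    probs = []
    for t, symbol in enumerate(clicks):
        if t > 0:
            belief = belief @ trans            # predict: apply the state transition
        belief = belief * emis[:, symbol]      # update: weight by emission likelihood
        belief = belief / belief.sum()         # normalise to a probability distribution
        probs.append(belief[1])                # index 1 is the fraud state (assumption)
    return np.array(probs)

# Hypothetical parameters: state 0 = non-fraud, state 1 = fraud.
TRANS = [[0.95, 0.05],
         [0.10, 0.90]]
# Hypothetical click symbols: 0 = login, 1 = view balance, 2 = add beneficiary.
EMIS = [[0.6, 0.3, 0.1],
        [0.2, 0.2, 0.6]]
START = [0.9, 0.1]
THRESHOLD = 0.5  # hypothetical; in practice this would be tuned on labelled sessions

session = [0, 2, 2, 1]
fraud_probs = filtered_fraud_probabilities(session, TRANS, EMIS, START)
flags = fraud_probs > THRESHOLD

for click, p, flagged in zip(session, fraud_probs, flags):
    print(f"click={click} fraud_probability={p:.3f} flagged={flagged}")
```

With these made-up numbers the session is first flagged at the third click, once two consecutive "add beneficiary" actions have been observed. The function only uses clicks up to the current one, so it can be evaluated while the session is still in progress.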


Hidden Markov Models in MATLAB

Built-in MATLAB functions are available to train and decode Hidden Markov Models. The online banking data had to be processed into encoded sequences to be used in the Hidden Markov Model functions. The individual banking sessions are then represented by a sequence of identifiers which represent the “click” actions of the user as well as other features of the session. The MATLAB functions used to develop this model were hmmtrain and hmmdecode [2].
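
As a rough sketch of the encoding step, the Python snippet below maps each distinct action in a set of sessions to an integer identifier. The action names and sessions are hypothetical, and the 1-based numbering is an assumption made to match MATLAB's 1-based indexing; the actual encoding used in the investigation is not described in detail here.

```python
def build_vocabulary(sessions):
    """Assign each distinct click action a positive integer identifier."""
    vocab = {}
    for session in sessions:
        for action in session:
            if action not in vocab:
                vocab[action] = len(vocab) + 1  # 1-based identifiers (assumption)
    return vocab

def encode(session, vocab):
    """Translate a list of click actions into a list of integer identifiers."""
    return [vocab[action] for action in session]

# Hypothetical sessions of different lengths.
sessions = [
    ["login", "view_balance", "logout"],
    ["login", "add_beneficiary", "transfer", "logout"],
]

vocab = build_vocabulary(sessions)
encoded = [encode(s, vocab) for s in sessions]
print(vocab)
print(encoded)
```

Each session keeps its own length and click order after encoding, which is what allows sequences of inconsistent length to be handled by a sequence model.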

What have we learnt?

  • Hidden Markov Models work well when the data (a “click” sequence) is long enough for a pattern to be identified during the online banking session. We have seen that sessions, and thus sequences, must be longer than two actions or “clicks”.
  • The Hidden Markov Model in MATLAB cannot account for the time elapsed between “clicks”. It models only the order of the events and is not a time-based sequential model such as an LSTM (Long Short-Term Memory) network.
  • Transforming the “click” sequences into encoded sequences makes adding features difficult, because each combination of action and feature values needs its own unique identifier. Increasing the number of features therefore increases the number of unique identifiers, which in turn makes it harder to identify events that may be fraudulent.
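
The growth in unique identifiers mentioned in the last point can be illustrated with a small Python sketch. The feature names and values below are hypothetical and are not taken from the banking data; the point is only that the number of symbols is the product of the number of values of each feature.

```python
from itertools import product

# Hypothetical feature values; none of these come from the original data set.
actions = ["login", "view_balance", "add_beneficiary", "transfer", "logout"]
devices = ["web", "mobile"]
time_bands = ["day", "night"]

def build_symbol_table(*feature_values):
    """Map every combination of feature values to one integer identifier (1-based)."""
    return {combo: i for i, combo in enumerate(product(*feature_values), start=1)}

clicks_only = build_symbol_table(actions)
with_features = build_symbol_table(actions, devices, time_bands)

n_states = 2  # fraud and non-fraud
for name, table in [("clicks only", clicks_only),
                    ("clicks + device + time band", with_features)]:
    print(f"{name}: {len(table)} symbols, "
          f"{n_states * len(table)} emission probabilities")
```

In this example, adding two binary features takes the alphabet from 5 to 20 symbols, so the same number of training sessions has to support four times as many emission probabilities per state.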



  1. 2020. SABRIC Annual Crime Stats 2019. [online] Available at: <> [Accessed 14 October 2020].
  2. 2020. Hidden Markov Models (HMM) - MATLAB & Simulink. [online] Available at: <> [Accessed 7 October 2020].
