Fraud detection is like looking for a needle in a haystack. The behaviour of a fraudster differs from the behaviour of a legitimate user, but fraudsters also try to conceal their activities and hide in the mass of legitimate transactions. Machine learning can help to spot such transactions, but supervised learning methods may struggle to detect completely new fraud patterns, and in many cases most of the data isn't labeled at all, so supervised methods can't be applied.
This tutorial shows how to reuse ideas from language modeling and apply deep learning, in particular recurrent neural networks (LSTMs) and embedding layers, to learn behavioural patterns/profiles from transactions and to detect anomalies in these patterns (which could indicate a fraudulent transaction). The basic idea is to use a neural network to learn a lower-dimensional representation of the input and then apply a classical outlier detection method on top of it. This approach doesn't rely on labeled data. The network is implemented in Python using PyTorch.
First, one off-topic comment
I decided to clean up my GitHub repository and split it by topics. So there is a new dedicated repository for my fraud detection blog posts (https://www.github.com/mgroncki/frauddetection) and there will be another one about quantitative finance. Since I don't want to break all the links in my old posts, I will keep the old repos and mark them as legacy repos. Maybe there will also be a third repo about general data science topics later. But now let's come back to the topic.
Problem description
Assume we have transactional data (e.g. a payment history, or a log file of an application or website) and we want to identify suspicious (unusual) activities. We will use an RNN to learn user profiles based on their transactional behaviour and search for anomalies in these profiles. We will use an example with artificial data to train and test the network.
In our example, users can log in to our system and perform 5 different actions (action_1, …, action_5). We log all activities together with the user id, the time/date of the activity and the session id. An example session looks like this:
login -> action_1 -> action_2 -> action_5 -> logout
We have two different kinds of users (e.g. supervisors and regular staff, or retail and wholesale customers) who differ in their behaviour.
We simulate two hundred sessions for each of two hundred users (80% role A and 20% role B) using two different discrete Markov processes.
Brief reminder: in a Markov process the probability of the next action (state) depends only on the current action (state), and the Markov chain can be represented by a stochastic matrix in which the entry in the i-th row and j-th column is the transition probability from state i to state j.
Here are the transition matrices for our example:
actions = ['start', 'end', 'action_1', 'action_2', 'action_3', 'action_4', 'action_5']

# Normal behavior Role 1
np.array([
    [0.00, 0.00, 0.20, 0.20, 0.20, 0.20, 0.20],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.01, 0.09, 0.30, 0.30, 0.15, 0.15],
    [0.00, 0.60, 0.05, 0.10, 0.05, 0.05, 0.15],
    [0.00, 0.50, 0.05, 0.25, 0.05, 0.10, 0.05],
    [0.00, 0.60, 0.01, 0.10, 0.10, 0.10, 0.09],
    [0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
]),

# Normal behavior Role 2
np.array([
    [0.00, 0.00, 0.20, 0.10, 0.10, 0.30, 0.30],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.10, 0.20, 0.20, 0.20, 0.10, 0.20],
    [0.00, 0.70, 0.05, 0.05, 0.05, 0.05, 0.10],
    [0.00, 0.70, 0.05, 0.05, 0.05, 0.10, 0.05],
    [0.00, 0.50, 0.01, 0.01, 0.01, 0.10, 0.37],
    [0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
]),
The transition probabilities of the two user roles differ only slightly and are even identical for some states.
Let's now talk about the fraudsters in our example.
Two percent of our users are potential fraudsters, and for each session there is a 20% chance that a potential fraudster will actually commit fraud. If the user is in the 'fraud' state, the session is sampled from the following fraud transition matrix:
np.array([
    [0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.20, 0.70, 0.025, 0.025, 0.025, 0.025],
    [0.00, 0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
    [0.00, 0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
    [0.00, 0.50, 0.01, 0.01, 0.01, 0.10, 0.37],
    [0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
])
Fraudsters have a much higher probability of performing action_1 repeatedly or of returning to action_1 from other states (e.g. searching customer information to steal data).
In total we have 40,000 transactions, of which 111 are fraudulent, committed by 3 of the 200 users.
The Jupyter notebook 'SampleData' in the project folder generates this data sample and can easily be modified.
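For illustration, a single session could be sampled from one of these transition matrices roughly like this (a minimal sketch; role_matrix and fraud_matrix are placeholders for the matrices above, and the notebook's actual implementation may differ):

import numpy as np

def sample_session(transition_matrix, actions, max_len=50):
    # Walk the Markov chain from 'start' until we reach 'end'
    session = []
    state = actions.index('start')
    end_state = actions.index('end')
    for _ in range(max_len):
        state = np.random.choice(len(actions), p=transition_matrix[state])
        if state == end_state:
            break
        session.append(actions[state])
    return session

# For a potential fraudster, each session has a 20% chance of being
# sampled from the fraud transition matrix instead of the normal one
matrix = fraud_matrix if np.random.uniform() < 0.2 else role_matrix
session = sample_session(matrix, actions)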
This example is just for educational purposes, and it is by construction so simple that we could spot the fraudulent behaviour with very simple methods and some feature engineering (e.g. flagging the users with the highest average count of action_1). But in real-world applications we might have hundreds or thousands of different actions and more than two types of users, and the fraudulent behaviour would be more complex.
So how can a neural network learn the typical behaviour of our users?
Language models, word embeddings and how user sessions/activities relate to them
To determine whether a sequence of actions (an activity or session) is an anomaly, we need to know the underlying distribution. Our five actions can be compared to words in a natural language and user sessions to sentences or paragraphs in a text. So maybe we can solve our problem with techniques used in NLP (natural language processing).
In a language model the probability of the next word depends on the previous words in the same context. During training, a language model tries to predict the next word given the previous words; it's like filling in the blanks in a text. While minimizing the prediction error, the model learns the conditional probabilities. This is a supervised learning task, and recurrent neural networks with embedding layers have been applied to it very successfully. So the basic idea is to use a language-model network with an embedding layer, feed our sequences into it and use the latent representation (the embeddings) to derive user profiles.
There are several papers in which the authors transfer the idea of embeddings and RNNs from NLP to the context of user profiling for recommendation systems.
For more details on recurrent networks, language models and embeddings (word2vec), have a look here:
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://www.tensorflow.org/tutorials/representation/word2vec
- https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
- https://medium.com/@florijan.stamenkovic_99541/rnn-language-modelling-with-pytorch-packed-batching-and-tied-weights-9d8952db35a9
Network design and training
For our simple example we use a very simple network architecture:
- 3 dimensional Embedding Layer
- 2 layer LSTM network with 20 nodes per layer and 20% dropouts
- Decoder layer
This is a very simple RNN compared to recent state-of-the-art networks in natural language processing, so there is much room for improvement. But this design is sufficient to present the idea of the approach.
We split our data into 80% training and 20% validation data and train in mini batches of 100 user sessions per batch. Since the sessions differ in length, we apply zero-padding (filling up the shorter sequences).
We train the network for 20 epochs using RMSProp and learning rate decay with an initial learning rate of 0.05.
Implementation in PyTorch
You can download the complete source code from GitHub (https://github.com/mgroncki/FraudDetection).
This is my first PyTorch project. The implementation was quite straightforward, the documentation is very good and there are many good tutorials available.
My code is based on the official PyTorch language model example (https://www.github.com/pytorch/examples/tree/master/word_language_model) with some modifications/simplifications:
First we prepare the sequential data and convert the action strings into integers. In this example we do it by hand: we create a dictionary mapping each action to an id and apply this dictionary to our list of actions. There are also NLP libraries which provide this functionality (e.g. gensim).
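How exactly the ids are assigned only matters insofar as id 0 is treated as padding later on (padding_idx=0 in the embedding layer and ignore_index=0 in the loss). A minimal sketch of how the action2id and id2action mappings could be built (assigning 'start' to id 0 is an assumption of this sketch):

actions = ['start', 'end', 'action_1', 'action_2', 'action_3', 'action_4', 'action_5']
# Map each action to an integer id; here 'start' gets id 0, which also serves as the padding id (assumption)
action2id = {action: idx for idx, action in enumerate(actions)}
id2action = {idx: action for action, idx in action2id.items()}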
logfile['SessionActivityInt'] = logfile.SessionActivity.map(lambda ls: np.array([action2id[a] for a in ls]+[action2id['start']]))
The result is a column of sessions, each represented as an array of integers.
In the next step we write functions which split the data, generate the mini batches, pad the sequences, convert everything to PyTorch tensors and copy it to a GPU if one is available.
def split_train_test(input, device, prop=0.8, seed=42):
    # Split the encoded sessions into training and validation data and move them to the device
    np.random.seed(seed)
    mask = np.random.uniform(size=input.shape[0]) < prop
    train = [t.tensor(x, dtype=t.long, device=device) for x in input[mask]]
    test = [t.tensor(x, dtype=t.long, device=device) for x in input[~mask]]
    return train, test
def get_batch(i, batch_size, input):
    '''
    Takes a column/list of activity tensors of variable length
    and returns the padded i-th minibatch of batch_size activities
    '''
    data = input[i*batch_size : (i+1)*batch_size]
    data = sorted(data, key=len, reverse=True)
    x = nn.utils.rnn.pad_sequence([x[:-1] for x in data])
    y = nn.utils.rnn.pad_sequence([y[1:] for y in data])
    return x, y
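As a quick (hypothetical) usage example, assuming train is the list of encoded session tensors returned by split_train_test:

x, y = get_batch(0, 100, train)
# x and y both have shape (longest_session_length - 1, 100);
# shorter sessions are filled up with zeros (the padding id)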
We build the network following the official PyTorch example, with some slight modifications to add support for padded sequences.
class BehaviourNet(nn.Module):
    '''
    Very simple network consisting of an embedding layer, LSTM layers and a decoder with dropouts
    '''
    def __init__(self, n_actions=6, embedding_size=3, n_nodes=6, n_layers=2, dropout=0.2,
                 padding_idx=0, initrange=0.5):
        super(BehaviourNet, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.embedding = nn.Embedding(n_actions, embedding_size, padding_idx)
        self.rnn = nn.LSTM(embedding_size, n_nodes, n_layers, dropout=dropout)
        self.decoder = nn.Linear(n_nodes, n_actions)
        self.init_weights(initrange)
        self.n_nodes = n_nodes
        self.n_layers = n_layers

    def init_weights(self, initrange=0.1):
        self.embedding.weight.data.uniform_(-initrange, initrange)
        # Set the first row to zero (padding idx)
        self.embedding.weight.data[0, :] = 0
        print(self.embedding.weight)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def init_hidden(self, batch_size):
        # Initial hidden and cell state for the LSTM, on the same device as the model weights
        weight = next(self.parameters())
        return (weight.new_zeros(self.n_layers, batch_size, self.n_nodes),
                weight.new_zeros(self.n_layers, batch_size, self.n_nodes))

    def forward(self, input, hidden):
        emb = self.dropout(self.embedding(input))
        output, hidden = self.rnn(emb, hidden)
        output = self.dropout(output)
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden
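To illustrate the tensor shapes, a quick (hypothetical) smoke test of the network could look like this, assuming the id2action mapping and the train data from above:

model = BehaviourNet(n_actions=len(id2action), n_nodes=20, n_layers=2).to(device)
x, y = get_batch(0, 100, train)             # x, y: (seq_len, 100)
hidden = model.init_hidden(batch_size=100)  # two tensors of shape (n_layers, 100, n_nodes)
output, hidden = model(x, hidden)           # output: (seq_len, 100, n_actions)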
We use the standard cross-entropy loss function, which supports padded sequences (via ignore_index), so the padding is no concern during training. For the evaluation, however, we also want to calculate the accuracy of the model on the validation data set, and there we need to mask the padded time steps and exclude them from the calculation.
def training(model, optimizer, scheduler, loss_function, data, batch_size, n_actions, clipping=0.5):
    model.train()
    # Drop the last incomplete batch since the hidden state has a fixed batch size
    n_batch = len(data) // batch_size
    hidden = model.init_hidden(batch_size)
    scheduler.step()
    total_loss = 0.0
    for batch in range(n_batch):
        # Detach the hidden state so gradients don't flow across batches
        hidden = tuple(h.detach() for h in hidden)
        x, y = get_batch(batch, batch_size, data)
        optimizer.zero_grad()
        output, hidden = model(x, hidden)
        output_flatten = output.view(-1, n_actions)
        y_flatten = y.view(-1)
        loss = loss_function(output_flatten, y_flatten)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clipping)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / n_batch
def evaluate(model, loss_function, data, n_actions):
    model.eval()
    batch_size = len(data)
    hidden = model.init_hidden(batch_size)
    x, y = get_batch(0, batch_size, data)
    output, hidden = model(x, hidden)
    output_flatten = output.view(-1, n_actions)
    y_flatten = y.view(-1)
    loss = loss_function(output_flatten, y_flatten)
    y_probs = nn.Softmax(dim=2)(output)
    y_predict = t.argmax(output, 2)
    # Mask the padded time steps (id 0) and exclude them from the accuracy
    y_predict[y == 0] = 0
    acc = (y_predict == y).double()[y > 0].sum() / y[y > 0].size(0)
    return y_probs, y_predict, y, loss, acc
What's left is the training loop:
modelname = 'model_1'
model = BehaviourNet(initrange=10, n_layers=2, n_nodes=20, n_actions=len(id2action)).to(device)
loss_func = nn.CrossEntropyLoss(ignore_index=0)
optimizer = t.optim.RMSprop(model.parameters(), lr=0.05)
scheduler = t.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(20):
    training_loss = training(model, optimizer, scheduler, loss_func, train, 100, n_actions=len(id2action))
    y_prob, y_pred, y_true, test_loss, test_acc = evaluate(model, loss_func, test, n_actions=len(id2action))
    print(f'Epoch {epoch}\nTrain Loss : {training_loss} \t Val loss: {test_loss} \t Val Acc {test_acc}')
Results
First we take a look at the learned latent representation of the actions.

At this point the interpretation of this representation is quite difficult because of the nature of the data (the actions have no real meaning and the transition probabilities are random).
But if we map a sequence of activities into the same latent space by converting each action into its embedding vector and averaging over all actions of the sequence in each dimension, we can observe a quite interesting pattern.
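A minimal sketch of how such a session-level representation could be computed from the trained embedding layer (a hypothetical helper; the plotting itself is omitted):

def session_embedding(model, session_ids, device):
    # Average the embedding vectors of all actions in one session
    with t.no_grad():
        ids = t.tensor(session_ids, dtype=t.long, device=device)
        emb = model.embedding(ids)            # (session_length, embedding_size)
        return emb.mean(dim=0).cpu().numpy()  # (embedding_size,)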
We observe that sessions of Role A and Role B users overlap, but fraudulent sessions tend to lie in the upper left corner.

If we now average all sessions of a user in this 3-dimensional space, we can see that the network has learned a representation of the activities which allows us to identify each user's role and, more importantly, shows the three fraudsters as clear outliers (red):

Remark / Conclusion
In our example the network was able to learn a meaningful representation of the user actions, which can then be used to identify the fraudulent users and the suspicious transactions with classical outlier detection methods (in our simple case just through visual inspection). I think the idea of using embeddings to represent user profiles is very promising. But be aware that in this simple example you can reach the same result with much simpler methods, which would be preferable (e.g. counting the number of actions in one session, normalising the count vector and applying a PCA to it). This is a very simple approach and there is much room for further extensions (e.g. adding user embeddings analogous to doc2vec), or we can reuse the embeddings in other models (transfer learning).
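For comparison, the simpler baseline mentioned above could look roughly like this (a sketch assuming sessions is a list of the integer-encoded sessions and using scikit-learn's PCA):

from sklearn.decomposition import PCA

# Count how often each action occurs per session and normalise the counts
counts = np.zeros((len(sessions), len(id2action)))
for i, session in enumerate(sessions):
    for action_id in session:
        counts[i, action_id] += 1
counts = counts / counts.sum(axis=1, keepdims=True)
sessions_2d = PCA(n_components=2).fit_transform(counts)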
Using PyTorch for this project was very straightforward (comparable to using NumPy), much easier to debug than the low-level API of TensorFlow, and good fun. TensorFlow 2.x will address some of these issues (e.g. eager mode by default, a cleaner API, etc.). Although I am a big fan of TensorFlow and the Estimator API, especially in connection with the Google Cloud Platform, it will definitely not be the last time that I use PyTorch.
That's it for today, and I hope you enjoyed reading this post.
On my bucket list for the next posts are:
- porting the signature detection project from KNIME to Python and training it on the GCP
- extending this network and trying it on real-world data
- exploring GANs and variational auto-encoders for fraud detection
- a tutorial about up- and down-sampling methods to handle imbalanced data
I haven't decided what's coming next, so if you have any comments or questions please drop me a message.
So long…