Fraud detection is like looking for a needle in a haystack. The behaviour of a fraudster differs from the behaviour of a legitimate user, but fraudsters also try to conceal their activities and hide in the mass of legitimate transactions. Machine learning can help to spot such transactions, but supervised learning methods may struggle to detect completely new fraud patterns, and in many cases most of the data isn't labeled at all, so supervised methods can't be applied.
This tutorial shows how to reuse ideas from language modeling and apply deep learning, in particular recurrent neural networks (LSTMs) and embedding layers, to learn behavioural patterns/profiles from transactions and to detect anomalies in these patterns (which could indicate a fraudulent transaction). The basic idea is to use a neural network to learn a lower-dimensional representation of the input and then apply a classical outlier detection method on top of it. This approach doesn't rely on labeled data. The network is implemented in Python using PyTorch.
First, one off-topic comment
I decided to clean up my GitHub repository and split it by topics. So there is a new dedicated repository for my fraud detection blog posts (https://www.github.com/mgroncki/frauddetection) and there will be another one about quantitative finance. Since I don't want to break all the links in my old posts, I will keep the old repos and mark them as legacy repos. Maybe there will also be a third repo about general data science topics later. But now let's come back to the topic.
Problem description
Assume we have transactional data (e.g. a payment history, or a log file of an application or website) and we want to identify suspicious (unusual) activities. We will use an RNN to learn user profiles based on their transactional behaviour and search for anomalies in these profiles. We will use an example with artificial data to train and test the network.
In our example, users can log in to our system and perform 5 different actions (action_1, …, action_5). We log all activities together with the user id, the time/date of the activity and the session id. An example session looks like this:
login -> action_1 -> action_2 -> action_5 -> logout
We have two different kinds of users (e.g. supervisors and regular staff, or retail and wholesale customers) who differ in their behaviour.
We simulate two hundred sessions for each of two hundred users (80% role A and 20% role B) using two different discrete Markov processes.
Brief reminder: in a Markov process the probability of the next action (state) depends only on the current action (state), and the Markov chain can be represented by a stochastic matrix in which the entry in the i-th row and j-th column is the transition probability from state i to state j.
Here are the transition matrices for our example:
actions = ['start', 'end', 'action_1', 'action_2', 'action_3', 'action_4', 'action_5']

# Normal behavior Role 1
np.array([
    [0.00, 0.00, 0.20, 0.20, 0.20, 0.20, 0.20],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.01, 0.09, 0.30, 0.30, 0.15, 0.15],
    [0.00, 0.60, 0.05, 0.10, 0.05, 0.05, 0.15],
    [0.00, 0.50, 0.05, 0.25, 0.05, 0.10, 0.05],
    [0.00, 0.60, 0.01, 0.10, 0.10, 0.10, 0.09],
    [0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
]),

# Normal behavior Role 2
np.array([
    [0.00, 0.00, 0.20, 0.10, 0.10, 0.30, 0.30],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.10, 0.20, 0.20, 0.20, 0.10, 0.20],
    [0.00, 0.70, 0.05, 0.05, 0.05, 0.05, 0.10],
    [0.00, 0.70, 0.05, 0.05, 0.05, 0.10, 0.05],
    [0.00, 0.50, 0.01, 0.01, 0.01, 0.10, 0.37],
    [0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
]),
The transition probabilities of the two user roles differ only slightly and are even identical for some states.
Let's now talk about the fraudsters in our example.
Two percent of our users are potential fraudsters, and for each session there is a 20% chance that a potential fraudster will actually commit fraud. If the user is in the 'fraud' state, the session is sampled from the following fraud transition matrix:
np.array([
    [0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.20, 0.70, 0.025, 0.025, 0.025, 0.025],
    [0.00, 0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
    [0.00, 0.40, 0.40, 0.05, 0.05, 0.05, 0.05],
    [0.00, 0.50, 0.01, 0.01, 0.01, 0.10, 0.37],
    [0.00, 0.60, 0.09, 0.10, 0.10, 0.10, 0.01],
])
Fraudsters have a much higher probability of performing action_1 repeatedly or of returning to action_1 from other states (e.g. searching customer information to steal data).
In total we have 40,000 transactions, of which 111 are fraudulent, committed by 3 of the 200 users.
The Jupyter notebook 'SampleData' in the project folder generates this data sample and can easily be modified.
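For illustration, a single session could be sampled from one of these transition matrices roughly like this (a minimal sketch; role_matrix and fraud_matrix are placeholders for the matrices above, and the notebook's actual implementation may differ):

import numpy as np

def sample_session(transition_matrix, actions, max_len=50):
    # Walk the Markov chain from 'start' until we reach 'end'
    session = []
    state = actions.index('start')
    end_state = actions.index('end')
    for _ in range(max_len):
        state = np.random.choice(len(actions), p=transition_matrix[state])
        if state == end_state:
            break
        session.append(actions[state])
    return session

# For a potential fraudster, each session has a 20% chance of being
# sampled from the fraud transition matrix instead of the normal one
matrix = fraud_matrix if np.random.uniform() < 0.2 else role_matrix
session = sample_session(matrix, actions)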
This example is just for educational purposes, and it is by construction so simple that we could spot the fraudulent behaviour with very simple methods and some feature engineering (e.g. flagging the users with the highest average count of action_1). But in real-world applications we might have hundreds or thousands of different actions and more than two types of users, and the fraudulent behaviour would be more complex.
So how can a neural network learn the typical behaviour of our users?
Language models, word embeddings and how user sessions/activities relate to them
To determine whether a sequence of actions (an activity or session) is an anomaly, we need to know the underlying distribution. Our five actions can be compared to words in a natural language and user sessions to sentences or paragraphs in a text. So maybe we can solve our problem with techniques used in NLP (natural language processing).
In a language model the probability of the next word depends on the previous words in the same context. During training, a language model tries to predict the next word given the previous words; it's like filling in the blanks in a text. While minimizing the prediction error, the model learns the conditional probabilities. This is a supervised learning task, and recurrent neural networks with embedding layers have been applied to it very successfully. So the basic idea is to use a language-model network with an embedding layer, feed our sequences into it and use the latent representation (the embeddings) to derive user profiles.
There are several papers in which the authors transfer the idea of embeddings and RNNs from NLP to the context of user profiling for recommendation systems.
For more details on recurrent networks, language models and embeddings (word2vec), have a look here:
- http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://www.tensorflow.org/tutorials/representation/word2vec
- https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
- https://medium.com/@florijan.stamenkovic_99541/rnn-language-modelling-with-pytorch-packed-batching-and-tied-weights-9d8952db35a9
Network design and training
For our simple example we use a very simple network architecture:
- 3 dimensional Embedding Layer
- 2 layer LSTM network with 20 nodes per layer and 20% dropouts
- Decoder layer
This is a very simple RNN compared to recent state-of-the-art networks in natural language processing, so there is much room for improvement. But this design is sufficient to present the idea of the approach.
We split our data into 80% training and 20% validation data and train in mini batches of 100 user sessions per batch. Since the sessions differ in length, we apply zero-padding (filling up the shorter sequences).
We train the network for 20 epochs using RMSProp and learning rate decay with an initial learning rate of 0.05.
Implementation in PyTorch
You can download the complete source code from GitHub (https://github.com/mgroncki/FraudDetection).
This is my first PyTorch project. The implementation was quite straightforward, the documentation is very good and there are many good tutorials available.
My code is based on the official PyTorch language model example (https://www.github.com/pytorch/examples/tree/master/word_language_model) with some modifications/simplifications:
First we prepare the sequential data and convert the action strings into integers. In this example we do it by hand: we create a dictionary mapping each action to an id and apply this dictionary to our list of actions. There are also NLP libraries which provide this functionality (e.g. gensim).
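How exactly the ids are assigned only matters insofar as id 0 is treated as padding later on (padding_idx=0 in the embedding layer and ignore_index=0 in the loss). A minimal sketch of how the action2id and id2action mappings could be built (assigning 'start' to id 0 is an assumption of this sketch):

actions = ['start', 'end', 'action_1', 'action_2', 'action_3', 'action_4', 'action_5']
# Map each action to an integer id; here 'start' gets id 0, which also serves as the padding id (assumption)
action2id = {action: idx for idx, action in enumerate(actions)}
id2action = {idx: action for action, idx in action2id.items()}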
logfile['SessionActivityInt'] = logfile.SessionActivity.map(lambda ls: np.array([action2id[a] for a in ls]+[action2id['start']]))
The result is a column of sessions, each represented as an array of integers.
In the next step we write functions which split the data, generate the mini batches, pad the sequences, convert everything to PyTorch tensors and copy it to a GPU if one is available.
def split_train_test(input, device, prop=0.8, seed=42):
    # Split the encoded sessions into training and validation data and move them to the device
    np.random.seed(seed)
    mask = np.random.uniform(size=input.shape[0]) < prop
    train = [t.tensor(x, dtype=t.long, device=device) for x in input[mask]]
    test = [t.tensor(x, dtype=t.long, device=device) for x in input[~mask]]
    return train, test
def get_batch(i, batch_size, input):
    '''
    Takes a column/list of activity tensors of variable length
    and returns the padded i-th minibatch of batch_size activities
    '''
    data = input[i*batch_size : (i+1)*batch_size]
    data = sorted(data, key=len, reverse=True)
    x = nn.utils.rnn.pad_sequence([x[:-1] for x in data])
    y = nn.utils.rnn.pad_sequence([y[1:] for y in data])
    return x, y
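As a quick (hypothetical) usage example, assuming train is the list of encoded session tensors returned by split_train_test:

x, y = get_batch(0, 100, train)
# x and y both have shape (longest_session_length - 1, 100);
# shorter sessions are filled up with zeros (the padding id)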
We build the network following the official PyTorch example, with some slight modifications to add support for padded sequences.
class BehaviourNet(nn.Module):
    '''
    Very simple network consisting of an embedding layer, LSTM layers and a decoder with dropouts
    '''
    def __init__(self, n_actions=6, embedding_size=3, n_nodes=6, n_layers=2, dropout=0.2,
                 padding_idx=0, initrange=0.5):
        super(BehaviourNet, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.embedding = nn.Embedding(n_actions, embedding_size, padding_idx)
        self.rnn = nn.LSTM(embedding_size, n_nodes, n_layers, dropout=dropout)
        self.decoder = nn.Linear(n_nodes, n_actions)
        self.init_weights(initrange)
        self.n_nodes = n_nodes
        self.n_layers = n_layers

    def init_weights(self, initrange=0.1):
        self.embedding.weight.data.uniform_(-initrange, initrange)
        # Set the first row to zero (padding idx)
        self.embedding.weight.data[0, :] = 0
        print(self.embedding.weight)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def init_hidden(self, batch_size):
        # Initial hidden and cell state for the LSTM, on the same device as the model weights
        weight = next(self.parameters())
        return (weight.new_zeros(self.n_layers, batch_size, self.n_nodes),
                weight.new_zeros(self.n_layers, batch_size, self.n_nodes))

    def forward(self, input, hidden):
        emb = self.dropout(self.embedding(input))
        output, hidden = self.rnn(emb, hidden)
        output = self.dropout(output)
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden
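To illustrate the tensor shapes, a quick (hypothetical) smoke test of the network could look like this, assuming the id2action mapping and the train data from above:

model = BehaviourNet(n_actions=len(id2action), n_nodes=20, n_layers=2).to(device)
x, y = get_batch(0, 100, train)             # x, y: (seq_len, 100)
hidden = model.init_hidden(batch_size=100)  # two tensors of shape (n_layers, 100, n_nodes)
output, hidden = model(x, hidden)           # output: (seq_len, 100, n_actions)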
We use the standard cross-entropy loss function, which supports padded sequences (via ignore_index), so the padding is no concern during training. For the evaluation, however, we also want to calculate the accuracy of the model on the validation data set, and there we need to mask the padded time steps and exclude them from the calculation.
def training(model, optimizer, scheduler, loss_function, data, batch_size, n_actions, clipping=0.5):
    model.train()
    # Drop the last incomplete batch since the hidden state has a fixed batch size
    n_batch = len(data) // batch_size
    hidden = model.init_hidden(batch_size)
    scheduler.step()
    total_loss = 0.0
    for batch in range(n_batch):
        # Detach the hidden state so gradients don't flow across batches
        hidden = tuple(h.detach() for h in hidden)
        x, y = get_batch(batch, batch_size, data)
        optimizer.zero_grad()
        output, hidden = model(x, hidden)
        output_flatten = output.view(-1, n_actions)
        y_flatten = y.view(-1)
        loss = loss_function(output_flatten, y_flatten)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clipping)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / n_batch
def evaluate(model, loss_function, data, n_actions):
    model.eval()
    batch_size = len(data)
    hidden = model.init_hidden(batch_size)
    x, y = get_batch(0, batch_size, data)
    output, hidden = model(x, hidden)
    output_flatten = output.view(-1, n_actions)
    y_flatten = y.view(-1)
    loss = loss_function(output_flatten, y_flatten)
    y_probs = nn.Softmax(dim=2)(output)
    y_predict = t.argmax(output, 2)
    # Mask the padded time steps (id 0) and exclude them from the accuracy
    y_predict[y == 0] = 0
    acc = (y_predict == y).double()[y > 0].sum() / y[y > 0].size(0)
    return y_probs, y_predict, y, loss, acc
What's left is the training loop:
modelname = 'model_1'
model = BehaviourNet(initrange=10, n_layers=2, n_nodes=20, n_actions=len(id2action)).to(device)
loss_func = nn.CrossEntropyLoss(ignore_index=0)
optimizer = t.optim.RMSprop(model.parameters(), lr=0.05)
scheduler = t.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(20):
    training_loss = training(model, optimizer, scheduler, loss_func, train, 100, n_actions=len(id2action))
    y_prob, y_pred, y_true, test_loss, test_acc = evaluate(model, loss_func, test, n_actions=len(id2action))
    print(f'Epoch {epoch}\nTrain Loss : {training_loss} \t Val loss: {test_loss} \t Val Acc {test_acc}')
Results
First we take a look at the learned latent representation of the actions.

At this point the interpretation of this representation is quite difficult because of the nature of the data (the actions have no real meaning and the transition probabilities are random).
But if we map a sequence of activities into the same latent space by converting each action into its embedding vector and averaging over all actions of the sequence in each dimension, we can observe a quite interesting pattern.
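A minimal sketch of how such a session-level representation could be computed from the trained embedding layer (a hypothetical helper; the plotting itself is omitted):

def session_embedding(model, session_ids, device):
    # Average the embedding vectors of all actions in one session
    with t.no_grad():
        ids = t.tensor(session_ids, dtype=t.long, device=device)
        emb = model.embedding(ids)            # (session_length, embedding_size)
        return emb.mean(dim=0).cpu().numpy()  # (embedding_size,)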
We observe that sessions of Role A and Role B users overlap, but fraudulent sessions tend to lie in the upper left corner.

If we now average all sessions of a user in this 3-dimensional space, we can see that the network has learned a representation of the activities which allows us to identify each user's role and, more importantly, shows the three fraudsters as clear outliers (red):

Remark / Conclusion
In our example the network was able to learn a meaningful representation of the user actions, which can then be used to identify the fraudulent users and the suspicious transactions with classical outlier detection methods (in our simple case just through visual inspection). I think the idea of using embeddings to represent user profiles is very promising. But be aware that in this simple example you can reach the same result with much simpler methods, which would be preferable (e.g. counting the number of actions in one session, normalising the count vector and applying a PCA to it). This is a very simple approach and there is much room for further extensions (e.g. adding user embeddings analogous to doc2vec), or we can reuse the embeddings in other models (transfer learning).
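For comparison, the simpler baseline mentioned above could look roughly like this (a sketch assuming sessions is a list of the integer-encoded sessions and using scikit-learn's PCA):

from sklearn.decomposition import PCA

# Count how often each action occurs per session and normalise the counts
counts = np.zeros((len(sessions), len(id2action)))
for i, session in enumerate(sessions):
    for action_id in session:
        counts[i, action_id] += 1
counts = counts / counts.sum(axis=1, keepdims=True)
sessions_2d = PCA(n_components=2).fit_transform(counts)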
Using PyTorch for this project was very straightforward (comparable to using NumPy), much easier to debug than the low-level API of TensorFlow, and good fun. TensorFlow 2.x will address some of these issues (e.g. eager mode by default, a cleaner API, etc.). Although I am a big fan of TensorFlow and the Estimator API, especially in connection with the Google Cloud Platform, it will definitely not be the last time that I use PyTorch.
That's it for today, and I hope you enjoyed reading this post.
On my bucket list for the next posts are:
- porting the signature detection project from KNIME to Python and training it on the GCP
- extending this network and trying it on real-world data
- exploring GANs and variational auto-encoders for fraud detection
- a tutorial about up- and down-sampling methods to handle imbalanced data
I haven't decided what's coming next, so if you have any comments or questions please drop me a message.
So long…