After a series of posts about exotic option pricing (Asian, barrier and Bermudan options) with TensorFlow and about finding optimal hedging strategies with deep learning (using an LSTM network to learn a delta hedge), I come back to our credit card fraud detection case. In the previous part we built a logistic regression classifier in TensorFlow to detect fraudulent transactions. We will see that this classifier is equivalent to a very simple neural network with exactly one layer, one node and a sigmoid activation function. We will extend this simple network to a deep neural network by adding more hidden layers. We will use the low-level API of TensorFlow to build the networks. At the end of the post we will use Keras's high-level API to build the same network with just a few lines of code.
We will continue to use the same data and apply the same transformations that we have been using since the first part of this series.
As usual you can find the notebook on my GitHub repository.
Deep learning / neural networks in a nutshell
An artificial neural network (ANN) is a collection of connected nodes. In the first layer of the network the inputs to the nodes are the input features. In the following layers the outputs of the previous layer's nodes are the inputs to the nodes of the current layer. If the network has more than one hidden layer, we call it a deep neural network.

The picture is generated by a LaTeX script written by Kjell Magne Fauske (http://www.texample.net/tikz/examples/neural-network/), released under a Creative Commons license. Thanks for that.
The output of a node is the composition of the dot (scalar) product of a weight vector with the input vector and an activation function. Let $X$ be the vector of input features and $w_i$ the weight vector of node $i$; then the output of this node is given by
$$o_i = \phi(w_i^T X + b_i)$$
with an activation function $\phi$ and bias $b_i$.
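If a layer consists of more than one node, the layer can be represented as a matrix multiplication. Such a layer is often called a linear or dense layer. Typical choices for the activation function are tanh, ReLU and the sigmoid function.
$$O = \phi(XW + b)$$
Here $X$ is the input matrix of shape (n_samples, n_features), $W$ the weight matrix of shape (n_features, n_nodes), $b$ the row vector of biases and $\phi$ the activation function, applied element-wise.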
As we can see from this formula, a dense layer with one node and a sigmoid activation function is exactly our logistic regression model. The matrix product (plus the bias) is the logit, and the output of the activation function is the probability, as in a logistic regression model.
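This equivalence can be checked numerically. The following NumPy sketch is not part of the original notebook; the helper names `dense` and `sigmoid` and the random toy data are assumptions made for illustration. It evaluates a dense layer with one node and a sigmoid activation and compares the result with the logistic regression formula written out sample by sample.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def dense(X, W, b, activation=None):
    """Dense layer activation(X W + b) for X of shape (n_samples, n_features).
    Hypothetical NumPy helper, not the TensorFlow function used in this post."""
    z = X @ W + b
    return z if activation is None else activation(z)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 toy samples with 3 features (made-up data)
W = rng.normal(size=(3, 1))   # one node, so one column of weights
b = rng.normal(size=(1, 1))   # one bias for the single node

logit = dense(X, W, b)            # identity activation gives the logit
probs = dense(X, W, b, sigmoid)   # sigmoid activation gives the probability

# Logistic regression written out for each sample separately
probs_lr = np.array([[1.0 / (1.0 + np.exp(-(x @ W[:, 0] + b[0, 0])))] for x in X])
```

Both routes give the same probabilities, which is why the one-node network and the logistic regression can share the same training code.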
Let's review the logistic regression example in a neural network setting. We start with a function that constructs the computational graph for a dense (linear) layer, given an input, an activation function and the number of nodes.
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session, tf.layers)

def add_layer(X, n_features, n_nodes, activation=None):
    """
    Build a dense layer with n_features-dimensional input X and n_nodes-dimensional output.
    Parameters:
        X : 2D input tensor (n_samples, n_features)
        n_features : number of features in the tensor
        n_nodes : number of nodes in the layer (output dimension)
        activation : None or callable activation function
    Output:
        Operator which returns a 2D tensor (n_samples, n_nodes)
    """
    weights = tf.Variable(initial_value=tf.random_normal((n_features, n_nodes), 0, 0.1, seed=42), dtype=tf.float32)
    bias = tf.Variable(initial_value=tf.random_normal((1, n_nodes), 0, 0.1, seed=42), dtype=tf.float32)
    # Linear part of the layer: X * weights + bias
    layer = tf.add(tf.matmul(X, weights), bias)
    if activation is None:
        return layer
    else:
        return activation(layer)
We wrap our training and prediction functions in a class. The constructor of this class builds the computational graph in TensorFlow. The function create_logit builds the computational graph that computes the logits (in the logistic regression case: one layer with one node and the identity as activation function). We will override this function later to add more layers to our network.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, precision_score, recall_score

class model(object):
    def __init__(self, n_features, output_every_n_epochs=1, name='model'):
        # Placeholders for the features, the labels and the training flag (used by dropout)
        self.input = tf.placeholder(tf.float32, shape=(None, n_features))
        self.true_values = tf.placeholder(tf.float32, shape=(None, 1))
        self.training = tf.placeholder(tf.bool)
        self.logit = self.create_logit()
        self.loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=self.true_values,
                                                                           logits=self.logit))
        self.predicted_probs = tf.sigmoid(self.logit)
        self.output_every_n_epochs = output_every_n_epochs
        self.name = name
        self.saver = tf.train.Saver()

    def create_logit(self):
        # Logistic regression: one layer, one node, identity activation
        return add_layer(self.input, 30, 1)

    def evaluate_loss_and_probs(self, sess, X, y, training=False, output=False):
        loss, probs = sess.run([self.loss, self.predicted_probs], {self.input : X,
                                                                   self.true_values : y.reshape(-1, 1),
                                                                   self.training : training})
        probs = probs.reshape(-1)  # reshape returns a new array, so keep the result
        y_hat = (probs > 0.5).reshape(-1)*1
        auc = roc_auc_score(y, probs)
        precision = precision_score(y, y_hat)
        recall = recall_score(y, y_hat)
        fp = np.sum((y != y_hat) & (y == 0))
        fpr = fp / (y == 0).sum()
        if output:
            print('Loss: %.6f \t AUC %.6f \t Precision %.6f%% \t Recall %.6f%% \t FPR %.6f%%' % (loss, auc, precision*100, recall*100, fpr*100))
        return loss, probs, y_hat, auc, precision, recall

    def train(self, sess, X, y, n_epochs, batch_size, learning_rate):
        init = tf.global_variables_initializer()
        sess.run(init)
        optimizer = tf.train.GradientDescentOptimizer(learning_rate)
        train = optimizer.minimize(self.loss)
        n_samples = X.shape[0]
        n_iter = int(np.ceil(n_samples/batch_size))
        indices = np.arange(n_samples)
        training_losses = []
        training_aucs = []
        for epoch in range(0, n_epochs):
            # Shuffle the samples and run one gradient step per mini batch
            np.random.shuffle(indices)
            for i in range(n_iter):
                idx = indices[i*batch_size:(i+1)*batch_size]
                x_i = X[idx, :]
                y_i = y[idx].reshape(-1, 1)
                sess.run(train, {self.input : x_i,
                                 self.true_values : y_i,
                                 self.training : True})
            output = False
            if (epoch % self.output_every_n_epochs) == 0:
                print(epoch, 'th Epoch')
                output = True
            # Evaluate on the full training set after every epoch
            loss_train_epoch, predict_train_epoch, y_hat, auc_train_epoch, _, _ = self.evaluate_loss_and_probs(sess, X, y, False, output)
            training_losses.append(loss_train_epoch)
            training_aucs.append(auc_train_epoch)
        with plt.xkcd() as style:
            plt.figure(figsize=(7, 7))
            plt.subplot(2, 1, 1)
            plt.title('Loss')
            plt.plot(range(n_epochs), training_losses)
            plt.xlabel('# Epoch')
            plt.subplot(2, 1, 2)
            plt.title('AUC')
            plt.plot(range(n_epochs), training_aucs)
            plt.xlabel('# Epoch')
            plt.tight_layout()
            plt.savefig('training_loss_auc_%s.png' % self.name, dpi=300)
        self.saver.save(sess, "./%s/model.ckpt" % self.name)

    def restore(self, sess):
        self.saver.restore(sess, "./%s/model.ckpt" % self.name)
We use this class to build our logistic regression model:
np.random.seed(42)
lr = model(30, 10, 'lr')
n_epochs = 11
batch_size = 100
with tf.Session() as sess:
    lr.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set:')
    _, probs_lr, y_hat_lr, _, _, _ = lr.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)

0 th Epoch
Loss: 0.007944 AUC 0.980217 Precision 86.538462% Recall 56.675063% FPR 0.015388%
10 th Epoch
Loss: 0.004231 AUC 0.984984 Precision 87.591241% Recall 60.453401% FPR 0.014948%
Validation set:
Loss: 0.003721 AUC 0.977169 Precision 89.041096% Recall 68.421053% FPR 0.014068%
Backpropagation
In the previous parts we have seen how to learn the weights (parameters) of our logistic regression model. So we know how to train a network with one layer, but how can we train a network with more than one layer?
The concept is called backpropagation and is basically an application of the chain rule. In the first phase (the feed-forward phase) the input is fed through all layers of the network and the loss is calculated. Then, in the second (backward) phase, the weights are updated recursively from the last layer to the first.
At the last layer the derivative of the loss is straightforward. To calculate the derivatives with respect to the weights of the inner (hidden) layers we need the previously calculated derivatives.
With the calculated gradients we can again apply a gradient descent method to optimize our weights.
The power of TensorFlow and other deep learning libraries such as PyTorch is again automatic differentiation. We don't need to calculate the gradients ourselves.
A detailed derivation of the backpropagation algorithm, with an example for a quadratic loss function, can be found on Wikipedia.
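To make the two phases concrete, here is a minimal NumPy sketch of backpropagation for a tiny network with one tanh hidden layer and one output node trained with the sigmoid cross entropy loss. It is not taken from the notebook: the network size, the toy data and the function names `forward` and `backward` are assumptions chosen for illustration. One analytic gradient entry is compared with a finite-difference estimate, and a single gradient descent step is applied.

```python
import numpy as np

def forward(params, X, y):
    """Feed-forward phase: returns the loss and the values needed later."""
    W1, b1, W2, b2 = params
    a1 = np.tanh(X @ W1 + b1)              # hidden layer
    z2 = a1 @ W2 + b2                      # logit of the output node
    p = 1.0 / (1.0 + np.exp(-z2))          # predicted probability
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, (a1, p)

def backward(params, X, y, cache):
    """Backward phase: chain rule applied from the last layer to the first."""
    W1, b1, W2, b2 = params
    a1, p = cache
    n = X.shape[0]
    dz2 = (p - y) / n                      # derivative of the loss w.r.t. the logit
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    da1 = dz2 @ W2.T                       # reuses the derivative of the last layer
    dz1 = da1 * (1.0 - a1 ** 2)            # derivative of tanh
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))                              # made-up toy features
y = rng.integers(0, 2, size=(8, 1)).astype(float)        # made-up toy labels
shapes = [(3, 4), (1, 4), (4, 1), (1, 1)]                # W1, b1, W2, b2
params = [rng.normal(scale=0.1, size=s) for s in shapes]

loss, cache = forward(params, X, y)
grads = backward(params, X, y, cache)

# Finite-difference check of one entry of dW1
eps = 1e-6
shifted = [q.copy() for q in params]
shifted[0][0, 0] += eps
numeric = (forward(shifted, X, y)[0] - loss) / eps

# One gradient descent step with the backpropagated gradients
learning_rate = 0.1
new_params = [q - learning_rate * g for q, g in zip(params, grads)]
new_loss = forward(new_params, X, y)[0]
```

In TensorFlow the `backward` part is what `optimizer.minimize` derives automatically from the graph of the loss.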
First deep network
Now it's time for our first deep neural network. We will stack four layers with 120, 60, 30 and 1 nodes, respectively.
class model2(model):
    def create_logit(self):
        layer1 = add_layer(self.input, 30, 120)
        layer2 = add_layer(layer1, 120, 60)
        layer3 = add_layer(layer2, 60, 30)
        layer4 = add_layer(layer3, 30, 1)
        return layer4

np.random.seed(42)
dnn1 = model2(30, 10, 'model1')
n_epochs = 11
batch_size = 100
with tf.Session() as sess:
    dnn1.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn1, y_hat_dnn1, _, _, _ = dnn1.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)
The performance of this network is not really good. Actually, it is quite bad for the complexity of the model.
The AUC on the validation set is worse than the AUC of the logistic regression.


For low FPRs the logistic regression almost always outperforms the deep neural network (DNN). An FPR of 0.1% means that we will have 1 false positive in 1,000 genuine transactions. If you have millions of transactions, even such a low FPR can affect your customers. For very low FPRs (less than 0.0001) the DNN has a slightly higher true positive rate (TPR).
The problem is that we use the identity as the activation function. A composition of linear layers is itself linear, so the logit is still a linear function of the input.
If we want to capture non-linear dependencies, we have to add a non-linear activation function.
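The collapse of stacked identity-activation layers into a single linear layer can be verified numerically. The following NumPy sketch uses the layer sizes of the four-layer network above, but with random toy weights instead of the trained ones; all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 30))   # 6 made-up samples with 30 features

# Four dense layers with identity activation (random toy weights, not trained ones)
sizes = [(30, 120), (120, 60), (60, 30), (30, 1)]
weights = [rng.normal(scale=0.1, size=s) for s in sizes]
biases = [rng.normal(scale=0.1, size=(1, s[1])) for s in sizes]

# Output of the "deep" network: apply the layers one after another
out = X
for W, b in zip(weights, biases):
    out = out @ W + b

# Collapse all layers into one weight vector and one bias
W_total = np.eye(30)
b_total = np.zeros((1, 30))
for W, b in zip(weights, biases):
    W_total = W_total @ W
    b_total = b_total @ W + b

single_layer = X @ W_total + b_total   # one dense layer with one node
```

The deep network and the single layer produce identical logits, so without a non-linear activation the extra layers add parameters but no expressive power.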
Let's try the ReLU.
Time for non-linearity
class model2b(model):
    def create_logit(self):
        layer1 = add_layer(self.input, 30, 120, tf.nn.relu)
        layer2 = add_layer(layer1, 120, 60, tf.nn.relu)
        layer3 = add_layer(layer2, 60, 30, tf.nn.relu)
        layer4 = add_layer(layer3, 30, 1)
        return layer4

np.random.seed(42)
dnn1b = model2b(30, 10, 'model1b')
n_epochs = 31
batch_size = 100
with tf.Session() as sess:
    dnn1b.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn1b, y_hat_dnn1b, _, _, _ = dnn1b.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)
Another popular choice is tanh. We compare both activation functions with the logistic regression:


We see that both non-linear models outperform the logistic regression. For low FPRs the TPR is significantly higher.
Assume we accept an FPR of 0.01%: then the recall of our DNN is around 80%, versus 50% for the logistic regression.
We can detect many more fraudulent transactions at the same rate of false alarms.
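Reading the recall at a fixed FPR off the ROC curve can also be done in code. The following pure-Python sketch is an illustration only: the function name `tpr_at_fpr` and the toy labels and scores are made up, and it assumes that both classes occur in the data. It sweeps the decision threshold from the highest score downwards and keeps the best TPR that still respects the FPR budget.

```python
def tpr_at_fpr(y_true, scores, max_fpr):
    """Highest true positive rate reachable while keeping FPR <= max_fpr.

    Hypothetical helper; assumes y_true contains both 0s and 1s.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Tied scores are flagged together by any threshold, so count them as one step
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break                      # lowering the threshold further exceeds the budget
        best_tpr = tp / n_pos
        i = j
    return best_tpr

# Made-up labels (1 = fraud) and model scores for ten transactions
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
scores = [0.95, 0.9, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1]

recall_no_false_alarms = tpr_at_fpr(y_true, scores, 0.0)
recall_some_false_alarms = tpr_at_fpr(y_true, scores, 0.2)
```

Applied to the validation probabilities of two models with the same `max_fpr`, this gives the kind of recall comparison quoted above.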
Using TensorFlow layers
Instead of building the computational graph ourselves (weights, bias tensor, etc.) we can use TensorFlow Layers. The function tf.layers.dense builds a linear or dense layer. We can specify the input, the number of nodes and the activation function (similar to our own function).
In the next model we use the TensorFlow function and add one more layer.
class model3(model):
    def create_logit(self):
        layer1 = tf.layers.dense(self.input, 240, activation=tf.nn.tanh)
        layer2 = tf.layers.dense(layer1, 120, activation=tf.nn.tanh)
        layer3 = tf.layers.dense(layer2, 60, activation=tf.nn.tanh)
        layer4 = tf.layers.dense(layer3, 30, activation=tf.nn.tanh)
        layer5 = tf.layers.dense(layer4, 1)
        return layer5

np.random.seed(42)
dnn2 = model3(30, 10, 'model2')
n_epochs = 31
batch_size = 100
with tf.Session() as sess:
    dnn2.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn2, y_hat_dnn2, _, _, _ = dnn2.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)
0 th Epoch
Loss: 0.003000 AUC 0.986239 Precision 82.428941% Recall 80.352645% FPR 0.029897%
10 th Epoch
Loss: 0.002036 AUC 0.992393 Precision 95.626822% Recall 82.619647% FPR 0.006595%
20 th Epoch
Loss: 0.001598 AUC 0.995232 Precision 93.989071% Recall 86.649874% FPR 0.009673%
30 th Epoch
Loss: 0.001273 AUC 0.996695 Precision 99.137931% Recall 86.901763% FPR 0.001319%
Validation set
Loss: 0.002425 AUC 0.980571 Precision 91.764706% Recall 82.105263% FPR 0.012309%
The model didn't improve on the previous one. Maybe we are now overfitting. One way to prevent overfitting in DNNs is dropout. Dropout randomly deactivates a proportion of the nodes during training, which prevents our neural network from memorizing the training data. Let's add dropout layers to the four-layer network (120, 60, 30 and 1 nodes) with tanh activation.
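What a dropout layer does can be sketched in a few lines of NumPy. This is an illustration under assumptions, not TensorFlow's implementation: the function name `dropout` and the toy activations are made up. It uses the common "inverted dropout" variant, in which the surviving activations are rescaled by 1 / (1 - rate) during training so that nothing has to change at prediction time; to my understanding tf.layers.dropout applies the same rescaling.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout (hypothetical NumPy helper).

    While training, zero a random fraction `rate` of the activations and
    rescale the survivors; otherwise pass x through unchanged.
    """
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob   # True for nodes that stay active
    return x * mask / keep_prob              # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 120))           # made-up output of a layer with 120 nodes

train_out = dropout(activations, 0.2, training=True, rng=rng)
eval_out = dropout(activations, 0.2, training=False, rng=rng)
dropped_fraction = np.mean(train_out == 0.0)
```

The `training` argument plays the role of the `self.training` placeholder in our class: the nodes are only dropped while we train, never when we evaluate or predict.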
Let's use a dropout rate of 20%.
class model4(model):
    def create_logit(self):
        layer1 = tf.layers.dense(self.input, 120, activation=tf.nn.tanh)
        layer1 = tf.layers.dropout(layer1, 0.2, training=self.training)
        layer2 = tf.layers.dense(layer1, 60, activation=tf.nn.tanh)
        layer2 = tf.layers.dropout(layer2, 0.2, training=self.training)
        layer3 = tf.layers.dense(layer2, 30, activation=tf.nn.tanh)
        layer3 = tf.layers.dropout(layer3, 0.2, training=self.training)
        layer4 = tf.layers.dense(layer3, 1)
        return layer4

np.random.seed(42)
dnn3 = model4(30, 10, 'model3')
n_epochs = 31
batch_size = 100
with tf.Session() as sess:
    dnn3.train(sess, X_train, y_train, n_epochs, batch_size, 0.1)
    print('Validation set')
    _, probs_dnn3, y_hat_dnn3, _, _, _ = dnn3.evaluate_loss_and_probs(sess, X_valid, y_valid, False, True)


We see that all our deep learning models outperform the LR model on the validation set. The difference in AUC doesn't seem very big, but especially for very low FPRs the recall is much higher. The model with dropout (DNN3) performs slightly better than the others.
Let's go for model 3 (4 layers with dropout) and look at the AUC score of the model on the test data.
with tf.Session() as sess:
    dnn3.restore(sess)
    print('Test set')
    _, probs_dnn3_test, y_hat_dnn3_test, _, _, _ = dnn3.evaluate_loss_and_probs(sess, X_test, y_test, False, True)
Test set Loss: 0.001825 AUC 0.991294 Precision 97.619048% Recall 83.673469% FPR 0.003517%
This model performs very well on our test set. We have a high Recall with a very low FPR at a threshold of 50%.
Keras
The Keras library offers a very convenient API to TensorFlow (and it also supports other deep learning backends).
We can build the same model in just a few lines of code. For many standard problems there are predefined loss functions, but we can also write our own loss functions in Keras.
For the model training and the prediction we only need one line of code each.
from tensorflow import keras  # assumption: the Keras bundled with TensorFlow; the standalone keras package works similarly

model = keras.Sequential()
model.add(keras.layers.Dense(120, input_shape=(30,), activation='tanh'))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(60, activation='tanh'))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(30, activation='tanh'))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X_train, y_train, epochs=31, batch_size=100)
probs_keras = model.predict(X_test)
Conclusion
In this part we saw how to build and train a deep neural network with TensorFlow using the low-level and mid-level APIs, and as an outlook we saw how easy model development is with a high-level API like Keras.
For this fraud detection problem a very simple deep network can outperform a classical machine learning algorithm like logistic regression if we look at the regions of low false positive rate (FPR). If we can accept higher false positive rates, all models perform similarly.
To decide on a final model, one needs to specify the costs of an FP (a genuine transaction which we may block or at least investigate) and of an FN (a fraudulent transaction which we miss), so that we can balance the trade-off between recall (detection power) and FPR.
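Once such costs are fixed, the decision threshold can be chosen by minimizing the total cost on a validation set. The sketch below is purely illustrative: the function names `expected_cost` and `best_threshold`, the cost values and the toy labels and probabilities are all made-up assumptions, not figures from this series.

```python
def expected_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Total cost of the decisions made at a given probability threshold."""
    cost = 0.0
    for y, p in zip(y_true, probs):
        flagged = p > threshold
        if flagged and y == 0:
            cost += cost_fp      # genuine transaction blocked or investigated
        elif not flagged and y == 1:
            cost += cost_fn      # fraudulent transaction missed
    return cost

def best_threshold(y_true, probs, thresholds, cost_fp, cost_fn):
    """Candidate threshold with the lowest total cost (first one wins on ties)."""
    return min(thresholds,
               key=lambda t: expected_cost(y_true, probs, t, cost_fp, cost_fn))

# Made-up labels (1 = fraud) and predicted fraud probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
probs = [0.01, 0.02, 0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
thresholds = [0.1, 0.3, 0.5, 0.7]

# Hypothetical cost settings: a missed fraud is expensive vs. a false alarm is expensive
t_costly_fn = best_threshold(y_true, probs, thresholds, cost_fp=1.0, cost_fn=50.0)
t_costly_fp = best_threshold(y_true, probs, thresholds, cost_fp=50.0, cost_fn=1.0)
```

When a missed fraud is expensive the lower threshold wins (more alarms, higher recall); when false alarms are expensive the higher threshold wins.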
Our fraud detection problem is, as we know, an imbalanced class problem. We can maybe improve the quality of the logistic regression by oversampling the minority class or undersampling the majority class. Or we can try other 'classical' machine learning methods like random forests or boosted trees, which often outperform logistic regression.
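As a rough idea of what undersampling could look like, here is a minimal NumPy sketch. It is not part of the notebook: the function name `undersample_majority` and the toy data are assumptions, and it presumes that the fraud class (label 1) is the smaller one. Only the training set would be resampled this way; the validation and test sets should keep their original class balance.

```python
import numpy as np

def undersample_majority(X, y, rng):
    """Keep all minority samples (y == 1) and an equally sized random subset
    of the majority samples (y == 0). Hypothetical helper for illustration."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([minority_idx, sampled_majority])
    rng.shuffle(idx)                         # mix both classes again
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))              # made-up features
y = np.zeros(1000, dtype=int)
y[:10] = 1                                   # 1% "fraud" in the toy data

X_bal, y_bal = undersample_majority(X, y, rng)
```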
Another interesting unsupervised deep learning method to detect anomalies in transactions is the autoencoder.
I think I will cover these topics in later posts. So stay tuned.
As usual you can find the notebook on my GitHub repo. Please download it and play with the model parameters: for example, change the number of training epochs, apply adaptive learning rates, add more layers, change the number of nodes in each layer or play with the dropout rate to find better models. And please share your ideas and results.
So long…






