Hey! I was curious why you are using tanh here:
attn_src = torch.matmul(F.tanh(h_prime), self.a_src) # bs x n_head x n x 1
in BatchMultiHeadGraphAttention in get_layers.py. Did it stabilize training? Is it some form of feature normalization?
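For context, here is roughly how I understand the scoring step (a minimal sketch with assumed shapes and names, not your exact module), where tanh bounds the transformed features to [-1, 1] before they are projected onto the attention vectors:

```python
import torch
import torch.nn.functional as F

# Assumed shapes for illustration: h_prime are per-head transformed node
# features, a_src / a_dst are the learned attention projection vectors.
bs, n_head, n, f_out = 2, 4, 5, 8
h_prime = torch.randn(bs, n_head, n, f_out)
a_src = torch.randn(n_head, f_out, 1)
a_dst = torch.randn(n_head, f_out, 1)

# tanh squashes each feature to [-1, 1] before the dot product with a_src /
# a_dst, which keeps the pre-softmax attention logits in a bounded range.
attn_src = torch.matmul(torch.tanh(h_prime), a_src)  # bs x n_head x n x 1
attn_dst = torch.matmul(torch.tanh(h_prime), a_dst)  # bs x n_head x n x 1

# Pairwise logits, then the usual GAT-style LeakyReLU + softmax over neighbors.
attn = attn_src.expand(-1, -1, -1, n) + attn_dst.expand(-1, -1, -1, n).permute(0, 1, 3, 2)
attn = F.softmax(F.leaky_relu(attn, negative_slope=0.2), dim=-1)  # bs x n_head x n x n
print(attn.shape)
```

My guess is that without the tanh the logits can grow large and saturate the softmax, so I wanted to check whether that was the motivation.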