Hey! I was curious why you are using tanh here:
attn_src = torch.matmul(F.tanh(h_prime), self.a_src) # bs x n_head x n x 1
in BatchMultiHeadGraphAttention in get_layers.py. Did it stabilize training? Is it some form of feature normalization?
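For context, here is roughly how I understand the scoring step (a minimal sketch with assumed shapes and names, not your exact module), where tanh bounds the transformed features to [-1, 1] before they are projected onto the attention vectors:

```python
import torch
import torch.nn.functional as F

# Assumed shapes for illustration: h_prime are per-head transformed node
# features, a_src / a_dst are the learned attention projection vectors.
bs, n_head, n, f_out = 2, 4, 5, 8
h_prime = torch.randn(bs, n_head, n, f_out)
a_src = torch.randn(n_head, f_out, 1)
a_dst = torch.randn(n_head, f_out, 1)

# tanh squashes each feature to [-1, 1] before the dot product with a_src /
# a_dst, which keeps the pre-softmax attention logits in a bounded range.
attn_src = torch.matmul(torch.tanh(h_prime), a_src)  # bs x n_head x n x 1
attn_dst = torch.matmul(torch.tanh(h_prime), a_dst)  # bs x n_head x n x 1

# Pairwise logits, then the usual GAT-style LeakyReLU + softmax over neighbors.
attn = attn_src.expand(-1, -1, -1, n) + attn_dst.expand(-1, -1, -1, n).permute(0, 1, 3, 2)
attn = F.softmax(F.leaky_relu(attn, negative_slope=0.2), dim=-1)  # bs x n_head x n x n
print(attn.shape)
```

My guess is that without the tanh the logits can grow large and saturate the softmax, so I wanted to check whether that was the motivation.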