This repo contains our work for the DS-GA 1012 - Natural Language Understanding course project.
Softmax has been the default choice of output layer for most language models in use today. Current state-of-the-art contextual models such as ELMo, GPT, and BERT are all pretrained on a language modeling objective, in which the contextualized word representations produced by the underlying network are fed into a final softmax layer to produce a distribution over the vocabulary. However, Yang et al. (2017) showed that the softmax layer limits the expressiveness of neural language models by constraining the matrix of output log-probabilities to be low-rank. The authors proposed a Mixture of Softmaxes (MoS) and showed that it leads to a significant improvement in results by 'breaking the softmax bottleneck'. This bottleneck suggests a possible direction for improving the training of the current family of transformer models, where large, over-parameterized models dominate. Addressing it could reduce the complexity of the transformer component by allowing training signals to flow back into the network more efficiently, without being limited by a low-rank output layer. A mixture of softmaxes yields a high-rank model, but it also increases the number of parameters significantly (e.g. a mixture of 20 softmaxes adds more than 300M parameters).
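For concreteness, here is a minimal sketch of what a Mixture-of-Softmaxes output layer looks like in PyTorch. The dimensions, the tanh non-linearity, and the choice to share a single decoder matrix across components are illustrative assumptions rather than a faithful reimplementation of Yang et al. (2017); the extra parameters come from the per-component context projections (and from per-component output matrices if the decoder is not shared).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Sketch of a Mixture-of-Softmaxes (MoS) output layer.

    Sizes are hypothetical; the parameter cost grows with the number of
    components K through the per-component context projections.
    """

    def __init__(self, hidden_dim, vocab_size, n_components=20):
        super().__init__()
        # Mixture weights (prior) over the K softmax components.
        self.prior = nn.Linear(hidden_dim, n_components)
        # One context projection per component.
        self.latent = nn.Linear(hidden_dim, n_components * hidden_dim)
        # Word decoder, shared across components in this sketch.
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h):
        # h: (batch, hidden_dim)
        batch, d = h.shape
        pi = F.softmax(self.prior(h), dim=-1)             # (batch, K)
        latent = torch.tanh(self.latent(h))               # (batch, K * d)
        latent = latent.view(batch, -1, d)                 # (batch, K, d)
        probs = F.softmax(self.decoder(latent), dim=-1)    # (batch, K, V)
        # Mixture: weighted average of the K softmax distributions,
        # which is what lifts the rank of the resulting log-prob matrix.
        return torch.einsum('bk,bkv->bv', pi, probs)       # (batch, V)
```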
Our proposal is to use an ensemble of hierarchical softmaxes. Using a different tree hierarchy for each softmax increases the rank of the ensemble and decreases the correlation between the predictions of the individual softmaxes, which should make averaging more effective, much as in Random Forests. We also borrow another feature of Random Forests and select a random subset of input features for each softmax, both to reduce the number of parameters and to further decorrelate the predictions. A shallow hierarchical softmax has been shown to work well in a follow-up paper (Yang et al., 2019), and we hope to improve upon it by exploiting the properties of ensembles of trees.
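The sketch below illustrates the idea, assuming PyTorch: each ensemble member is a two-level hierarchical softmax built over its own random partition of the vocabulary and its own random subset of the input features, and the members' word distributions are averaged. The class names, the balanced random partition, and the feature fraction are illustrative choices, not a fixed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelHSoftmax(nn.Module):
    """One ensemble member: a two-level hierarchical softmax over a
    random vocabulary partition, using a random subset of the inputs."""

    def __init__(self, hidden_dim, vocab_size, n_clusters=100, feat_frac=0.5):
        super().__init__()
        # Random balanced assignment of words to clusters (a random "tree").
        perm = torch.randperm(vocab_size)
        cluster_of = torch.empty(vocab_size, dtype=torch.long)
        cluster_of[perm] = torch.arange(vocab_size) % n_clusters
        self.register_buffer("cluster_of", cluster_of)
        # Random subset of the input dimensions seen by this member.
        n_feat = max(1, int(hidden_dim * feat_frac))
        self.register_buffer("feat_idx", torch.randperm(hidden_dim)[:n_feat])
        # Cluster-level and word-level scorers on the reduced input.
        self.cluster_scorer = nn.Linear(n_feat, n_clusters)
        self.word_scorer = nn.Linear(n_feat, vocab_size)

    def forward(self, h):
        # h: (batch, hidden_dim) -> log p(w | h), shape (batch, vocab_size)
        x = h[:, self.feat_idx]
        log_p_cluster = F.log_softmax(self.cluster_scorer(x), dim=-1)  # (batch, C)
        word_scores = self.word_scorer(x)                              # (batch, V)
        # Normalize word scores within each cluster only.
        log_p_word = torch.full_like(word_scores, float("-inf"))
        for c in range(self.cluster_scorer.out_features):
            idx = (self.cluster_of == c).nonzero(as_tuple=True)[0]
            log_p_word[:, idx] = F.log_softmax(word_scores[:, idx], dim=-1)
        # log p(w | h) = log p(cluster(w) | h) + log p(w | cluster(w), h)
        return log_p_cluster[:, self.cluster_of] + log_p_word


class HSoftmaxEnsemble(nn.Module):
    """Average the word distributions of several decorrelated members."""

    def __init__(self, hidden_dim, vocab_size, n_members=5, n_clusters=100):
        super().__init__()
        self.members = nn.ModuleList(
            TwoLevelHSoftmax(hidden_dim, vocab_size, n_clusters)
            for _ in range(n_members)
        )

    def forward(self, h):
        # Average probabilities (not logits) over members, Random-Forest style.
        return torch.stack([m(h).exp() for m in self.members]).mean(dim=0)


# Example usage with hypothetical sizes (hidden_dim=512, |V|=10000).
head = HSoftmaxEnsemble(hidden_dim=512, vocab_size=10000)
probs = head(torch.randn(8, 512))  # (8, 10000), rows sum to 1
```

Because each member sees only a fraction of the hidden dimensions and normalizes words within its own random clusters, the members make partially independent errors, which is what makes averaging their predictions worthwhile.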