Currently, it is not clear to me if this model can work. After preliminary experiments and observations it doesn't seem like this architecture can learn latent topics very well. So for now, it will be archived.
Implementation of Dieng et al.'s TopicRNN: a neural topic model & RNN hybrid that learns global semantic dependencies via latent topics and local, syntactic dependencies through an RNN.
- The model learns a `beta` matrix of size `(V x K)` where `V` is the size of the vocabulary and `K` is the number of latent topics. Each column of `beta` (one per topic) represents a distinct distribution over the vocabulary.
- A variational distribution is learned using word frequencies as input to produce the parameters for the Gaussian distribution from which each topic proportion vector `theta` of length `K` is sampled.
- `beta * theta` then results in the logits over the vocabulary at the given time step, which allow learned topics to be properly weighted before influencing inference of the next word. Topic additions for each word are zeroed out if the index of the logit belongs to a stop word; this allows only semantically significant words to be influenced by the topics.
- The topic additions `beta * theta` are added to the vocabulary projection of the RNN hidden state `W * ht`, resulting in a final distribution over the vocabulary that is normalized via softmax.
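The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's PyTorch code: the function names, the toy sizes (`V = 4`, `K = 2`) and all numeric values are made up, and `theta` is drawn directly from a Gaussian with given `mu` and `sigma` rather than from a trained inference network.

```python
import math
import random


def sample_theta(mu, sigma, rng):
    """Draw a topic proportion vector theta (length K) from N(mu, sigma^2)."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


def next_word_distribution(rnn_logits, beta, theta, stop_word_ids):
    """Combine the RNN projection (W * ht) with topic additions (beta * theta).

    rnn_logits: list of V floats, the vocabulary projection of the hidden state.
    beta: V x K nested list; theta: list of K floats.
    stop_word_ids: set of vocabulary indices whose topic addition is zeroed.
    """
    logits = []
    for index, (base, beta_row) in enumerate(zip(rnn_logits, beta)):
        topic_addition = sum(b * t for b, t in zip(beta_row, theta))
        if index in stop_word_ids:
            topic_addition = 0.0  # stop words receive no topic influence
        logits.append(base + topic_addition)
    # Numerically stable softmax over the vocabulary.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 4 vocabulary entries, 2 topics, index 2 is a stop word.
rng = random.Random(0)
beta = [[0.5, -0.2], [0.1, 0.3], [1.0, 1.0], [-0.4, 0.6]]
theta = sample_theta(mu=[0.0, 0.0], sigma=[1.0, 1.0], rng=rng)
rnn_logits = [0.2, 0.1, 0.0, -0.1]
probs = next_word_distribution(rnn_logits, beta, theta, stop_word_ids={2})
```

Masking the topic addition before the softmax, as done here, means a stop word's probability depends only on the RNN projection (up to normalization), which matches the description above.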
The system is built with PyTorch and AllenNLP, which are the main dependencies.
- Python 3.6 (3.6.5+ recommended)
- AllenNLP 0.6.0
It is recommended to create a virtual environment before installing dependencies, using either conda or venv:
conda create --name topic_rnn python=3.6
python3 -m venv /path/to/new/virtual/environment
Install PyTorch and AllenNLP via
`pip install -r requirements.txt`
imdb_review_reader.py contains a dataset reader primed to take a .jsonl file where each entry is of the form
{
'id': <integer id>,
'text': <raw text of movie review>,
'sentiment': <integer value representing sentiment>
}
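As a rough illustration of this format, the sketch below writes and re-reads two entries as JSON Lines (one JSON object per line). The ids, texts and sentiment values are invented, and the helper names `to_jsonl` and `from_jsonl` are hypothetical. Note that standard JSON uses double quotes, so the single-quoted form above should be read as schematic.

```python
import json

# Made-up example entries; the integer encoding of sentiment is an assumption.
entries = [
    {"id": 0, "text": "A surprisingly moving film.", "sentiment": 1},
    {"id": 1, "text": "Two hours I will never get back.", "sentiment": 0},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(entries)
parsed = from_jsonl(jsonl_text)
```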
You can download the IMDB 100K dataset here.
Upon extracting the dataset from the tar, the resulting directory will look like
    aclImdb/
        train/
            unsup/
                <review id>_<sentiment>.txt
                ...
            pos/
                <review id>_<sentiment>.txt
                ...
            neg/
                <review id>_<sentiment>.txt
                ...
        test/
            pos/
                <review id>_<sentiment>.txt
                ...
            neg/
                <review id>_<sentiment>.txt
                ...
        ...
You can generate the .jsonl files needed to reproduce the results of the paper via scripts/generate_imdb_corpus.py. The script expects the aclImdb file structure above; run it with
python generate_imdb_corpus.py --data-path <path to aclImdb> --save-dir <directory to save the .jsonl files>
The directory specified by --save-dir will then contain five files: train_unsup.jsonl, valid_unsup.jsonl, train_labeled.jsonl, valid_labeled.jsonl, and test.jsonl. You will need to write the relative path to training/testing .jsonl files within your experiment JSON config.
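To make the mapping from the aclImdb layout to the entry format concrete, here is a minimal sketch of how a single `<review id>_<sentiment>.txt` file name could be turned into one entry. It is not the code in `generate_imdb_corpus.py`: the helper name `entry_from_review_file` is hypothetical, and how the real script encodes `sentiment` or splits the data is not shown here.

```python
from pathlib import Path


def entry_from_review_file(path, text):
    """Build one .jsonl entry from a `<review id>_<sentiment>.txt` file name."""
    # Path.stem drops the directory and the .txt suffix, e.g. "12_8".
    review_id, sentiment = Path(path).stem.split("_", 1)
    return {"id": int(review_id), "text": text, "sentiment": int(sentiment)}


# Hypothetical file name and review text, for illustration only.
entry = entry_from_review_file("aclImdb/train/pos/12_8.txt", "Great cast, great script.")
```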
tests/fixtures/smoke_imdb_language_model.json contains a base specification for TopicRNN (i.e., hyperparameters, relative paths to training/testing .jsonl, etc.). The fixtures also include a subset of the IMDB dataset in the expected format.
Training this simple model can be done right out of the box after installing requirements. To ensure things are running smoothly, run
allennlp train tests/fixtures/smoke_imdb_language_model.json -s /tmp/topic_rnn_imdb_smoke --include-package library
To ensure that the model runs properly with a GPU, change cuda_device under trainer in the config JSON to point to an available device.
So long as the model can save a checkpoint when using either a CPU or GPU, you're good to go.
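If you prefer to switch devices programmatically, the change amounts to setting one field. The sketch below assumes the config has already been loaded into a Python dict (for example with `json.load`, which only works if the file is plain JSON without comments). The helper name `with_cuda_device` and the minimal config are hypothetical.

```python
import copy


def with_cuda_device(config, device):
    """Return a copy of an experiment config with trainer.cuda_device set."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated.setdefault("trainer", {})["cuda_device"] = device
    return updated


# Hypothetical minimal config; in AllenNLP, cuda_device = -1 selects the CPU.
cpu_config = {"trainer": {"cuda_device": -1, "num_epochs": 1}}
gpu_config = with_cuda_device(cpu_config, 0)
```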
In any file in experiments, you must specify at minimum
- The dataset reader with `type` (i.e. `imdb_review_reader`) and `words_per_instance` (backpropagation-through-time limit)
- The relative paths to the training and validation `.jsonl` files (`generate_imdb_corpus.py` will be extended to produce training and validation splits at a later time)
- Vocabulary with `max_vocab_size`
- The model with `type` (base implementation of `topic_rnn` is currently the only model), `text_field_embedder` (specify whether to use pretrained embeddings, embedding size, etc.), `text_encoder` (encoding the utterance via RNN, GRU, LSTM, etc.), and `topic_dim` (number of latent topics)
An example, experiments/imdb_language_model.json, is provided.
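For orientation, the minimum fields listed above could be laid out roughly as follows, shown here as a Python dict that serializes to JSON. This is a sketch under assumptions: the top-level key names follow common AllenNLP conventions, while the paths, sizes and the contents of `text_field_embedder` and `text_encoder` are illustrative guesses. The provided example config remains the authoritative reference.

```python
import json

config = {
    "dataset_reader": {
        "type": "imdb_review_reader",
        "words_per_instance": 35,  # illustrative BPTT limit
    },
    "train_data_path": "data/train_unsup.jsonl",       # hypothetical path
    "validation_data_path": "data/valid_unsup.jsonl",  # hypothetical path
    "vocabulary": {"max_vocab_size": 5000},            # illustrative size
    "model": {
        "type": "topic_rnn",
        # Assumed embedder/encoder settings, not taken from the repository.
        "text_field_embedder": {"tokens": {"type": "embedding", "embedding_dim": 50}},
        "text_encoder": {"type": "gru", "input_size": 50, "hidden_size": 100},
        "topic_dim": 50,
    },
}

# Required fields from the list above, written as key paths into the config.
REQUIRED = [
    ("dataset_reader", "type"),
    ("dataset_reader", "words_per_instance"),
    ("train_data_path",),
    ("validation_data_path",),
    ("vocabulary", "max_vocab_size"),
    ("model", "type"),
    ("model", "text_field_embedder"),
    ("model", "text_encoder"),
    ("model", "topic_dim"),
]


def missing_fields(cfg):
    """Return the required key paths that are absent from cfg."""
    missing = []
    for path in REQUIRED:
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


config_json = json.dumps(config, indent=2)
```

Calling `missing_fields` on a loaded config before training gives a quick check that none of the minimum fields were forgotten.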
To train the model with an experimental config, run
allennlp train <path to the current experiment's JSON configuration> \
-s <directory for serialization> \
--include-package library
- Tam Dang
This project is licensed under the Apache License - see the LICENSE.md file for details.