
Dialog Response Ranking with Reddit Data

DSCI 691: Natural Language Processing with Deep Learning

Drexel University

06/10/2022


Project Description

The goal of this project is to build dialog system evaluation models that rank dialog responses based on measurements of engagement. We hope to apply the deep learning architectures we learned in DSCI 691 to this problem. This repo provides a PyTorch implementation of the models.

Our project falls in the realm of dialog system evaluation: we try to predict how likely a dialog response is to elicit a positive reaction from the interlocutor. The models are trained on Reddit threads and comments, using feedback metrics such as the number of replies to a post and the number of upvotes/downvotes.

The project is inspired by the dialog response ranking models (DialogRPT) proposed by the Microsoft Research NLP Group in the EMNLP 2020 paper "Dialogue Response Ranking Training with Large-Scale Human Feedback Data", which were trained on more than 100 million human feedback examples. Such models can be used to create more engaging dialog agents by re-ranking the generated response candidates.
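For intuition, the ranking objective can be sketched as a contrastive pairwise loss: given two responses to the same context, the model is trained to score the response that received more feedback higher than the one that received less. The snippet below is a minimal PyTorch sketch under our own simplifications (a toy scorer over precomputed feature vectors), not the actual model code in this repo.

```python
import torch
import torch.nn as nn

class ToyResponseScorer(nn.Module):
    """Toy stand-in: maps a (context, response) feature vector to a scalar score."""
    def __init__(self, dim=768):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, feats):                # feats: (batch, dim)
        return self.fc(feats).squeeze(-1)    # scores: (batch,)

scorer = ToyResponseScorer()

# Hypothetical features for pairs of responses to the same context:
# "pos" received more feedback (e.g. more upvotes) than "neg".
pos_feats = torch.randn(8, 768)
neg_feats = torch.randn(8, 768)

# Contrastive pairwise loss: push score(pos) above score(neg).
loss = -torch.log(torch.sigmoid(scorer(pos_feats) - scorer(neg_feats))).mean()
loss.backward()
```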

Requirements

Installation

  1. Clone the repository from GitHub:
    git clone https://github.com/ductai199x/DialogRPT
    cd DialogRPT
    
  2. Use virtualenv to create a python virtual environment:
    virtualenv . --python=python3.9
    source bin/activate
    
  3. Install all pip package dependencies:
    pip install -r requirements.txt
    
  4. Download GPT-2 pretrained + finetuned weights:
    python src/shared.py
    
  5. Use gdown to download the checkpoints for our trained models:
    gdown --fuzzy https://drive.google.com/file/d/1ChcIv8kZAolLB7GYWQwHqqfm_sLnKKwO/view\?usp\=sharing
    unzip lightning_checkpoints.zip
    
  6. Download the training data:
    gdown --fuzzy https://drive.google.com/file/d/1qbrxAO8rPqBhT1QGKe5LBX6D4NR2edD3/view\?usp\=sharing
    unzip training_data.zip
    
  7. Download the testing/evaluating data:
    wget https://xiagnlp2.blob.core.windows.net/dialogrpt/test.zip 
    unzip test.zip -d data
    

Usage

The entry point of our code is src/main.py. To see the usage, run the following from the root of the project:

python src/main.py -h

You will see:

usage: main.py [-h] {train,eval,predict} ...

positional arguments:
  {train,eval,predict}
    train               Train existing architectures.
    eval                Evaluate existing architectures.
    predict             Predict using existing architectures.

optional arguments:
  -h, --help            show this help message and exit

To view the available architectures for each functionality (train, eval, predict), run python src/main.py <function> -h.
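For reference, the help text above is the kind produced by an argparse parser with one subparser per functionality. The sketch below is hypothetical (the real flags live in src/main.py) and shows only the --arch, --feedback, and --cpu options used in the examples that follow.

```python
import argparse

parser = argparse.ArgumentParser(prog="main.py")
subparsers = parser.add_subparsers(dest="command")

for name, desc in [("train", "Train existing architectures."),
                   ("eval", "Evaluate existing architectures."),
                   ("predict", "Predict using existing architectures.")]:
    sub = subparsers.add_parser(name, help=desc)
    sub.add_argument("--arch", required=True)       # e.g. "FC-GPT", "RPT"
    sub.add_argument("--feedback", required=True)   # "updown", "width", or "depth"
    sub.add_argument("--cpu", action="store_true")  # run without a GPU

args = parser.parse_args(["train", "--arch", "FC-GPT", "--feedback", "updown"])
print(args.command, args.arch, args.feedback)       # train FC-GPT updown
```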

NOTE: If you don't have a GPU-enabled device, run the following commands with the --cpu flag.

Training

For example, to train the fully connected model on the updown set, run:

python src/main.py train --arch "FC-GPT" --feedback "updown"

Evaluating

For example, to evaluate the fully connected model on the updown set, run:

python src/main.py eval --arch "FC-GPT" --feedback "updown"

Predicting

For example, to predict with the context "Can we restart 2020?", seq1 "I think we should go back to the beginning, and start from the beginning.", and seq2 "I think so, yes.", run:

python src/main.py predict --arch "RPT" --feedback "updown" --context="Can we restart 2020?" --seq1="I think we should go back to the beginning, and start from the beginning." --seq2="I think so, yes."

Data

In order to obtain the raw data for this project, run python src/downloader.py -h to see the usage and help for downloading the compressed raw data. Then, process the downloaded raw data with python src/data.py (use -h to see the usage and help messages).

Overall, this will download ~44 GB of compressed data.

The testing data can be downloaded using the command in step 7 of the Installation section.

Results and Discussion

NOTE: Due to limited storage and time, we were unable to upload the FullyConnected-GloVe models to the shared directory. Please contact us directly for further information.

The pairwise accuracy and Spearman correlation scores on 5,000 test samples are listed in the tables below.
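For reference, both metrics can be computed as in the sketch below (our own simplified version using NumPy and SciPy, not necessarily identical to the project's evaluation script): pairwise accuracy counts how often the model orders a pair of test responses the same way their human feedback does, and the Spearman correlation compares the full rankings.

```python
import numpy as np
from scipy.stats import spearmanr

def pairwise_accuracy(scores, labels):
    """Fraction of response pairs whose model scores are ordered
    the same way as their human feedback labels."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    correct = total = 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if labels[i] == labels[j]:
                continue  # ties carry no ordering information
            total += 1
            correct += (scores[i] > scores[j]) == (labels[i] > labels[j])
    return correct / total

model_scores = [0.9, 0.2, 0.7, 0.4]   # hypothetical model scores
feedback     = [35, 1, 12, 3]         # hypothetical feedback (e.g. upvotes)
print(pairwise_accuracy(model_scores, feedback))       # 1.0
print(spearmanr(model_scores, feedback).correlation)   # 1.0 (perfect rank agreement)
```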

Baseline Models

| Feedback | Method | Pairwise Acc. | Spearman $\rho$ |
|---|---|---|---|
| Width | DialogRPT (not fine-tuned) | 0.5146 | 0.0036 |
| Width | DialogRPT | 0.7581 | 0.4247 |
| Depth | DialogRPT (not fine-tuned) | 0.4962 | -0.0012 |
| Depth | DialogRPT | 0.6893 | 0.3159 |
| Updown | DialogRPT (not fine-tuned) | 0.5059 | -0.0018 |
| Updown | DialogRPT | 0.6808 | 0.2619 |

Our Models

| Feedback | Method | Pairwise Acc. | Spearman $\rho$ |
|---|---|---|---|
| Width | FullyConnected with GloVe Embeddings | 0.5000 | 0.1937 |
| Width | FullyConnected with GPT-2 Embeddings | 0.6568 | 0.1900 |
| Width | CNN with GPT-2 Embeddings | 0.6653 | 0.2170 |
| Width | LSTM with GPT-2 Embeddings | 0.6502 | 0.1942 |
| Depth | FullyConnected with GloVe Embeddings | 0.3667 | -0.0864 |
| Depth | FullyConnected with GPT-2 Embeddings | 0.6077 | 0.1388 |
| Depth | CNN with GPT-2 Embeddings | 0.6070 | 0.1410 |
| Depth | LSTM with GPT-2 Embeddings | 0.5969 | 0.1285 |
| Updown | FullyConnected with GloVe Embeddings | 0.5444 | 0.0532 |
| Updown | FullyConnected with GPT-2 Embeddings | 0.6122 | 0.1043 |
| Updown | CNN with GPT-2 Embeddings | 0.5972 | 0.0921 |
| Updown | LSTM with GPT-2 Embeddings | 0.5648 | 0.0573 |

  1. We built smaller-footprint models for the same task. The purpose of DialogRPT is to evaluate the responses generated by dialog generation models. DialogRPT is built on GPT-2 and is very large (roughly 400M parameters), so while it performs well, it may not be efficient enough for direct real-world deployment due to its high serving cost. There is also a very real carbon-footprint concern: training a large deep learning model can emit as much carbon as five cars do over their lifetimes. In this project, we therefore tried to build more efficient, smaller-footprint models for the same task while aiming for comparable quality.

  2. Fine-tuning significantly improved the performance of DialogRPT. DialogRPT is initialized from DialoGPT-medium weights (a GPT-2 model trained on Reddit dialogs) and then fine-tuned on the human feedback data; the fine-tuned models performed significantly better than their non-fine-tuned counterparts.

  3. We could not reproduce the level of performance of the DialogRPT models. The DialogRPT models are based on the GPT-2 architecture, a large-scale transformer language model that uses masked self-attention, multiple attention heads, residual connections, layer normalization, etc., making it one of the strongest text generators available. Our models are much simpler and lack these features.

  4. GPT-2 embeddings outperform GloVe. We built models based on the FullyConnected architecture from Assignment 4 of this class using both GPT-2 and GloVe embeddings. GPT-2 embeddings clearly outperform GloVe embeddings because GloVe is static, assigning each word a fixed context-independent representation, whereas GPT-2 embeddings are contextualized (see the sketch after this list).

  5. Dialog response ranking models can be fine-tuned for misuse. Dialog response ranking models, like dialog response generation models, have the potential to be fine-tuned for misuse. Just as GPT-2 models could be used by extremist groups to generate synthetic propaganda for dangerous ideologies, our models could be trained to predict highly down-voted responses and be used in malicious applications.
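
As a concrete illustration of point 4, the sketch below (using the Hugging Face transformers library as an assumption; it is not part of this repo's code) shows that GPT-2 gives the same word a different vector depending on its context, which a static GloVe lookup table cannot do.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def last_token_embedding(sentence):
    """Contextual embedding of the final token of a sentence."""
    with torch.no_grad():
        inputs = tok(sentence, return_tensors="pt")
        return model(**inputs).last_hidden_state[0, -1]   # (768,)

# The word "bank" gets a different vector depending on its context ...
river = last_token_embedding("I sat by the river bank")
money = last_token_embedding("I deposited cash at the bank")
print(torch.cosine_similarity(river, money, dim=0))  # noticeably below 1.0

# ... whereas a static GloVe table assigns one fixed vector per word,
# so glove["bank"] would be identical in both sentences.
```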

Limitations/Challenges

  • Large dataset: The uncompressed training data is over 100 GB, which requires substantial storage to hold and process.
  • Limited resources: Because of the dataset size, we used Google Cloud and AWS virtual machines to train our baselines and models. Unfortunately, these VMs did not come with GPUs, which resulted in longer training times than we expected. Each model and baseline had to be trained on three different tasks (updown, width, and depth), which took a lot of time to train and evaluate. We did not have GPU-enabled VMs until this week.
  • Inaccessible data: We first followed the instructions from the reference paper to download the dataset. However, the instructions were out of date and could not retrieve the entire dataset. We built our own data pipeline with multiprocessing so that it could process data for three different years.
  • Data preprocessing: We also had to write dataloader.py to speed up data loading. Thanks to efficient multiprocessing and prefetching, our dataloader.py loads data about three times faster than pandas.read_csv (a sketch of the idea follows this list).
  • Limited time: We had only four weeks to work on the project, including generating ideas, understanding the datasets, building the pipelines, training and evaluating the baselines and models, and completing Assignments 4 and 5. With more time, we would try to improve our pairwise accuracy and Spearman correlation.
  • Finally, most of the team members were new to NLP and deep learning. It took us a little while to fully understand contrastive learning and the DialogRPT model, both of which are central to this project. However, we worked closely together and were able to finish the project in the limited amount of time.
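
A minimal sketch of the idea behind our loader follows, under our own simplifications (it is not the actual dataloader.py): chunks of the raw file are parsed in worker processes while a bounded number of chunks are kept "in flight" so parsing stays ahead of consumption.

```python
import csv
from collections import deque
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def parse_chunk(lines):
    """Worker process: parse a block of raw TSV lines into rows."""
    return list(csv.reader(lines, delimiter="\t"))

def load_rows_parallel(path, chunk_size=50_000, workers=4, prefetch=8):
    """Yield parsed rows while worker processes parse upcoming chunks ahead of time."""
    with open(path, newline="") as f, ProcessPoolExecutor(workers) as pool:
        pending = deque()
        while True:
            lines = list(islice(f, chunk_size))
            if not lines:
                break
            pending.append(pool.submit(parse_chunk, lines))
            if len(pending) >= prefetch:        # bounded prefetch queue
                yield from pending.popleft().result()
        while pending:                          # drain remaining in-flight chunks
            yield from pending.popleft().result()

# Example usage (hypothetical path); on spawn-based platforms, call from under
# an `if __name__ == "__main__":` guard:
# for row in load_rows_parallel("data/out/updown/train.tsv"):
#     ...
```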
