OpenViDial

This repo contains download instructions for the OpenViDial dataset introduced in

OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts

along with the code to reproduce results in the paper

Modeling Text-visual Mutual Dependency for Multi-modal dialog Generation

Introduction

When humans converse, what a speaker says next depends significantly on what they see. OpenViDial is a large-scale multi-modal dialogue dataset built for this purpose. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.

The following are two short conversations where visual contexts are crucial.

Detailed statistics for OpenViDial

Attribute	value
Number of turns	1.1M
Number of images	1.1M
Vocab size before BPE	70K
Vocab size after BPE	30K
Average length of each episode	14
Average length of each turn	7.6

Download the Dataset

***** New March 12th, 2021: New cnn/rcnn feature on test/valid dataset *****

We fixed a bug in the cnn/rcnn features of the valid/test sets and re-ran the experiments on the new data. The evaluation metrics have been updated accordingly.

The main folder origin_dir contains the training/valid/test sets, each of which is made up of the following files:

├──origin_dir
      └── train.dialogue.jsonl // each line is an episode of dialogue, which is a list of IDs.
      └── train.origin.txt // each line corresponds to a dialogue text utterance, with the ID being its line number (starting with 0).
      └── train_images // contains the images (visual contexts) in which the text utterances take place, with the ID being the image filename (0, 1, 2, etc.)
            └── 0.jpg
            └── 1.jpg
            └── ...
      └── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
      └── test.*  (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
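
The layout above can be read with a few lines of Python. The following is a minimal sketch, not code from this repo: the function name load_episodes is hypothetical, and it assumes each line of *.dialogue.jsonl is a JSON list of integer IDs, as the comments in the listing suggest.

```python
import json
from pathlib import Path


def load_episodes(origin_dir, split="train"):
    """Yield each episode as a list of (utterance, image_path) pairs."""
    origin_dir = Path(origin_dir)
    # The line number (starting with 0) in <split>.origin.txt is the utterance ID.
    with open(origin_dir / f"{split}.origin.txt", encoding="utf-8") as f:
        utterances = [line.rstrip("\n") for line in f]
    image_dir = origin_dir / f"{split}_images"
    with open(origin_dir / f"{split}.dialogue.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ids = json.loads(line)  # assumption: a JSON list of integer IDs
            # The same ID names the image file that accompanies the utterance.
            yield [(utterances[i], image_dir / f"{i}.jpg") for i in ids]
```

The image paths are built from the IDs only; the sketch does not check that the files exist.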

If you'd like to take a glance at a sample of the dataset instead of downloading the full dataset, we provide a data sample here

Data download:

Download [train|valid|test].origin.txt and [train|valid|test].dialogue.jsonl here
Download test_images (~ 20G) here
Download valid_images (~ 20G) here
Download train_images: Since train_images is too big (~ 170G), we split it into 12 zip files. Download the separate zip_train files here. Then download and run cat.sh here to include all files in the same directory.
Move all files to origin_dir.

Vanilla Visual Dialog Models

We proposed three models for this dataset. Please refer to the paper for details:

Model #1 - NoVisual: use only dialog texts without visual information

Model #2 - CoarseVisual: use texts and a ResNet50 pretrained on ImageNet to compute a 1000-d feature from each picture

Model #3 - FineVisual: use texts and a Faster R-CNN pretrained on Visual Genome to compute 2048-d * K object features from each picture

Requirements

python >= 3.6
pip install -r requirements.txt

Preprocess directory structure

preprocessed_data_dir is a directory that contains all the preprocessed files (text, image feature mmap, offsets, etc.) generated from origin_data_dir, which we use to train the models. The directory structure is shown below.

Note: every train* file or directory should have a 'valid' and a 'test' counterpart; we omit them below for simplicity.

├──preprocessed_data_dir
      └── train.features.mmap  // numpy mmap array file of shape [num_sents, 1000], each row is a 1000-d ResNet-50 feature
      └── train.objects.mmap  // numpy mmap array file of shape [num_sents, 20, 2048], Faster R-CNN object feature file, each row contains the features of 20 objects, each of which is 2048-d
      └── train.objects_mask.mmap  // numpy mmap array file of shape [num_sents, 20], Faster R-CNN mask file, each row contains the masks of 20 objects, 1 for valid, 0 for masked
      └── train.offsets.npy  // numpy array file of shape [num_episodes], each item is the offset of one episode
      └── train.sent_num.npy // numpy array file of shape [num_episodes], each item is the number of sentences in one episode
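
As an illustration of how these files fit together, here is a minimal sketch that opens the ResNet-50 feature file with numpy and slices out one episode. It is not the repo's loader. The function names are hypothetical, and three things are assumptions: the float32 dtype, that the sum of sent_num equals num_sents, and that each offset is the row index of the first sentence of its episode.

```python
from pathlib import Path

import numpy as np


def open_split(data_dir, split="train", dtype=np.float32):
    """Memory-map the ResNet-50 features and load the episode index arrays."""
    data_dir = Path(data_dir)
    sent_num = np.load(data_dir / f"{split}.sent_num.npy")
    offsets = np.load(data_dir / f"{split}.offsets.npy")
    num_sents = int(sent_num.sum())  # assumption: sentence counts sum to num_sents
    # dtype is an assumption; the shape follows the directory listing.
    features = np.memmap(
        data_dir / f"{split}.features.mmap",
        dtype=dtype,
        mode="r",
        shape=(num_sents, 1000),
    )
    return features, offsets, sent_num


def episode_features(features, offsets, sent_num, episode):
    """Return the feature rows of one episode, assuming offsets index its first row."""
    start = int(offsets[episode])
    return features[start:start + int(sent_num[episode])]
```

The object files could be opened the same way with shapes (num_sents, 20, 2048) and (num_sents, 20). Because the arrays are memory-mapped, only the rows actually accessed are read from disk.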

Preprocess text data

We use the Moses tokenizer to tokenize texts and generate offsets arrays: bash ./scripts/preprocess_video_data.sh, followed by byte-pair encoding and fairseq-preprocess binarization: bash ./scripts/preprocess_text_data.sh

Note: You need to change DATA_DIR, ORIGIN_DIR, OUTPUT_DIR to your own paths

Prepare pre-computed CNN features and Faster-RCNN features

Download CNN-pooling features (used for Model #2 - CoarseVisual)

The compressed archive of preprocessed ResNet50 features (feature_files.tar.gz) (~3.7G) can be downloaded from here. Extract the preprocessed ResNet50 features (*.features.mmap) with the command tar zxvf feature_files.tar.gz and move them under preprocessed_data_dir/

Download Faster R-CNN features (used for Model #3 - FineVisual)

The compressed archive of preprocessed Faster R-CNN object features (object_files.tar.gz) (~50G) can be downloaded from here. Extract the preprocessed Faster R-CNN object features (*objects.mmap, *objects_mask.mmap) with the command tar zxvf object_files.tar.gz and move them under preprocessed_data_dir/

Checkout

Each file has an MD5 hash value, which can be computed with the command md5sum fileName. You can get the expected values from here, and we suggest you check each file's hash value before training.
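
If md5sum is not available on your system, the same digest can be computed with Python's standard library. This is a minimal sketch; the function name md5sum and the chunk size are illustrative choices, not part of this repo.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so that multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the returned string with the published hash value of the same file; any mismatch indicates an incomplete or corrupted download.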

(Optional) Extract features on your own

If you want to extract features on your own, or you'd like to know the details of extracting visual features, see video_dialogue_model/extract_features/extract_features.md

Train and Evaluate Model #1 - NoVisual

bash scripts/reproduce_baselines/text_only.sh will train and evaluate NoVisual. Remember to change MODEL_DIR and DATA_DIR for your setup. Note: fairseq may use all GPUs on your machine, and the actual batch size is multiplied by the number of GPUs. Therefore, if you use multiple GPUs, the batch size should be divided by the number of GPUs.
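
The batch size adjustment described in the note is a simple division. The sketch below only illustrates the arithmetic; the function name is hypothetical, and which script variable holds the batch size depends on your copy of the training script.

```python
def per_gpu_batch_size(target_batch_size, num_gpus):
    """Return the per-GPU batch size that keeps the effective batch size unchanged."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if target_batch_size % num_gpus != 0:
        raise ValueError("target_batch_size must be divisible by num_gpus")
    # Effective batch size = per-GPU batch size * number of GPUs.
    return target_batch_size // num_gpus
```

For example, to keep an effective batch size of 64 on 4 GPUs, each GPU would be given a batch size of 16.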

Train and Evaluate Model #2 - CoarseVisual

bash scripts/reproduce_baselines/text_and_img_feature.sh will train and evaluate CoarseVisual. Remember to change MODEL_DIR and DATA_DIR for your setup. Please make sure you use one single gpu to reproduce our results.

Train and Evaluate Model #3 - FineVisual

bash scripts/reproduce_baselines/text_and_img_objects.sh will train and evaluate FineVisual. Remember to change MODEL_DIR and DATA_DIR for your setup. Please make sure you use one single gpu to reproduce our results.

MMI

Prepare training data

For NV, see ./mmi/text/README.md. The structure of the training data used in both CV and FV is the same as in the previous part.

Train and Evaluate Model #4 - MI-NV

bash ./mmi/text/train.sh && bash ./mmi/text/mmi_generate.sh will train and evaluate MI-NV. Remember to change all the MODEL_DIR and DATA_DIR for your setup. Please make sure you use one single gpu to reproduce our results.

Train and Evaluate Model #5 - MI-CV

bash ./mmi/feature/scrtpts/train_image.sh && bash ./mmi/feature/scrtpts/mmi_feature_generate.sh will train and evaluate MI-CV. Remember to change all the MODEL_DIR and DATA_DIR for your setup. Please make sure you use one single gpu to reproduce our results.

Train and Evaluate Model #6 - MI-FV

bash ./mmi/feature/scrtpts/train_object.sh && bash ./mmi/feature/scrtpts/mmi_object_generate.sh will train and evaluate MI-FV. Remember to change all the MODEL_DIR and DATA_DIR for your setup. Please make sure you use one single gpu to reproduce our results.

Other Statistics

Get diversity statistics of the system output: train/stats.py
Get ROUGE statistics of the system output: train/rouge.py
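
For readers unfamiliar with the diversity metric, the Dis-n columns in the benchmark are commonly computed as the ratio of distinct n-grams to total n-grams in the system output. The sketch below follows that common definition with whitespace tokenization. Whether train/stats.py uses exactly this tokenization and normalization is an assumption, and the function name distinct_n is hypothetical.

```python
from collections import Counter


def distinct_n(sentences, n):
    """Return the number of unique n-grams divided by the total number of n-grams."""
    ngrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    # Return 0.0 when no sentence is long enough to contain an n-gram.
    return len(ngrams) / total if total else 0.0
```

A low value means the system repeats the same n-grams across its outputs, while a value close to 1 means almost every n-gram is unique.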

Model benchmark

Model	BLEU-1	BLEU-2	BLEU-4	Dis-1	Dis-2	Dis-3	Dis-4	ROUGE-1	ROUGE-2	ROUGE-4
1-NV	14.06	3.80	0.95	0.0006	0.0019	0.0031	0.0043	0.06787	0.01464	0.00224
2-CV	14.70	4.38	1.14	0.0023	0.0090	0.0178	0.0272	0.08773	0.02067	0.00347
3-FV	14.85	4.61	1.19	0.0026	0.0112	0.0246	0.0406	0.09083	0.02085	0.00329
4-MI-NV	14.27	3.89	0.99	0.0006	0.0022	0.0036	0.0043	0.06918	0.01497	0.00238
5-MI-CV	14.77	4.46	1.16	0.0023	0.0091	0.0181	0.0272	0.08791	0.02077	0.00350
6-MI-FV	14.95	4.67	1.22	0.0027	0.0117	0.0261	0.0433	0.09100	0.02090	0.00338
