Research project on image captioning using a diffusion language model. Our model is named CLIP-Diffusion-LM.
Inspired by the recent success of denoising diffusion models on image synthesis tasks, we apply denoising diffusion probabilistic models (DDPM) to text generation for image captioning. We show that our CLIP-Diffusion-LM generates image captions using significantly fewer inference steps than autoregressive models. On the Flickr8k dataset, the model achieves a BLEU-4 score of 0.1876. Trained on the combined Flickr8k and Flickr30k dataset, it achieves a BLEU-4 score of 0.2470.
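To make the caption generation process concrete, below is a minimal sketch (not the repository's code) of how a DDPM-style sampler can produce a caption: starting from Gaussian noise over token embeddings, a denoiser conditioned on a CLIP image feature iteratively refines the embeddings, which are finally rounded to the nearest vocabulary tokens. The denoiser architecture, noise schedule, toy dimensions, and step count here are illustrative assumptions, not the settings used in CLIP-DDPM.py.

```python
# Illustrative DDPM sampling loop for caption embeddings conditioned on a CLIP feature.
# All sizes and the network are toy placeholders.
import torch
import torch.nn as nn

VOCAB, DIM, LEN, STEPS = 1000, 64, 16, 50  # toy sizes; caption length 16

class Denoiser(nn.Module):
    """Predicts the clean embeddings x0 from noisy embeddings x_t,
    the timestep t, and the CLIP image feature."""
    def __init__(self, clip_dim=512):
        super().__init__()
        self.cond = nn.Linear(clip_dim, DIM)
        self.time = nn.Embedding(STEPS, DIM)
        self.net = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))

    def forward(self, x_t, t, clip_feat):
        h = x_t + self.time(t)[:, None, :] + self.cond(clip_feat)[:, None, :]
        return self.net(h)  # predicted x0

# linear beta schedule and the usual DDPM quantities
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_caption(model, clip_feat, token_emb):
    x_t = torch.randn(1, LEN, DIM)                       # start from pure noise
    for t in reversed(range(STEPS)):
        t_batch = torch.full((1,), t, dtype=torch.long)
        x0_hat = model(x_t, t_batch, clip_feat)          # predict clean embeddings
        if t > 0:
            # posterior q(x_{t-1} | x_t, x0_hat): mean is a weighted sum of x0_hat and x_t
            ab_t, ab_prev = alphas_bar[t], alphas_bar[t - 1]
            coef_x0 = ab_prev.sqrt() * betas[t] / (1 - ab_t)
            coef_xt = alphas[t].sqrt() * (1 - ab_prev) / (1 - ab_t)
            var = (1 - ab_prev) / (1 - ab_t) * betas[t]
            x_t = coef_x0 * x0_hat + coef_xt * x_t + var.sqrt() * torch.randn_like(x_t)
        else:
            x_t = x0_hat
    # round each position to the nearest vocabulary embedding
    dists = torch.cdist(x_t[0], token_emb.weight)        # (LEN, VOCAB)
    return dists.argmin(dim=-1)                          # token ids of length LEN

model, token_emb = Denoiser(), nn.Embedding(VOCAB, DIM)
caption_ids = sample_caption(model, torch.randn(1, 512), token_emb)
print(caption_ids.shape)  # torch.Size([16])
```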
We provide the extracted CLIP features for the Flickr8k dataset in the repo https://github.com/xu-shitong/flickr8k-CLIP-freature; they can be downloaded as shown in the CLIP-DDPM.ipynb file. However, due to the file size limit, we do not release the extracted CLIP features for the Flickr30k dataset; users will need to extract their own.
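For reference, the sketch below shows one way to extract CLIP image features yourself (e.g. for Flickr30k) using the openai/CLIP package. The backbone ("ViT-B/32"), the image directory path, and the output filename are assumptions for illustration; CLIP-DDPM.ipynb shows the format the training code actually expects.

```python
# Hedged example: extract CLIP image features for a folder of images.
import os
import torch
import clip          # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # backbone is an assumption

features = {}
image_dir = "flickr30k_images"            # hypothetical path to the dataset images
with torch.no_grad():
    for name in sorted(os.listdir(image_dir)):
        image = preprocess(Image.open(os.path.join(image_dir, name))).unsqueeze(0).to(device)
        features[name] = model.encode_image(image).squeeze(0).cpu()  # (512,) for ViT-B/32

torch.save(features, "flickr30k_clip_features.pt")   # hypothetical output file
```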
The best model's hyperparameter configuration and training code are in the CLIP-DDPM.py file. The model uses a maximum output caption length of 16 tokens.
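As a rough illustration of the kind of configuration held in CLIP-DDPM.py, a sketch is shown below. Only the maximum caption length of 16 comes from the text above; every other field and value is a placeholder, not the repository's actual setting.

```python
# Illustrative configuration sketch; see CLIP-DDPM.py for the real values.
from dataclasses import dataclass

@dataclass
class Config:
    max_caption_len: int = 16      # maximum output caption length (stated above)
    diffusion_steps: int = 1000    # placeholder value
    embed_dim: int = 512           # placeholder value
    lr: float = 1e-4               # placeholder value
    batch_size: int = 64           # placeholder value

cfg = Config()
```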
We thank Mu Li and Yi Zhu for publicly sharing their insights on various models in the vision and NLP fields, and Boyang Gu for providing advice in the early stages of the research. Computational resources were provided by Imperial College London.
