A GAN-based Expectation-Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions.
Implementation of the algorithm proposed in the paper ZSCRGAN: A GAN-based Expectation-Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions (to be presented at the ACM International Conference on Information and Knowledge Management (CIKM 2020)) by Anurag Roy, Vinay Kumar Verma, Kripabandhu Ghosh, and Saptarshi Ghosh. The proposed model performs zero-shot retrieval of images from their textual descriptions. The following image gives a schematic view of our proposed model:

ZSCRGAN is a novel zero-shot cross-modal text-to-image retrieval model. It learns a joint probability distribution of text embeddings and relevant image embeddings; maximizing this distribution ensures high similarity between a text embedding and its relevant image embeddings. To learn the distribution, the model is trained with an Expectation-Maximization (EM) approach involving a Generative Adversarial Network (GAN), which is updated in the E-step, and a Common Space Embedding Mapper (CSEM), which is updated in the M-step. The GAN is used to generate a representative image embedding (a latent variable), and the CSEM maps these embeddings to a common embedding space.
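To make the alternation concrete, below is a minimal, framework-free sketch of this EM-style training schedule. All function names here (e_step_update_gan, m_step_update_csem, evaluate_prec_at_50, sample_batch) are hypothetical placeholders for illustration only and do not correspond to the actual API of this repository.

import random

def e_step_update_gan(text_batch, image_batch):
    # E-step (placeholder): update the WGAN so the generator produces a
    # representative image embedding (the latent variable) for each text.
    pass

def m_step_update_csem(text_batch, image_batch):
    # M-step (placeholder): update the Common Space Embedding Mapper
    # (CSEM) that maps the generated embeddings into the common space.
    pass

def evaluate_prec_at_50():
    # Placeholder for Precision@50 evaluation on the test set.
    return random.random()

def train(sample_batch, num_iterations=1000, eval_every=100):
    for it in range(1, num_iterations + 1):
        text_batch, image_batch = sample_batch()
        e_step_update_gan(text_batch, image_batch)    # E-step
        m_step_update_csem(text_batch, image_batch)   # M-step
        if it % eval_every == 0:
            print("iter %d: Prec@50 = %.3f" % (it, evaluate_prec_at_50()))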
If you use this code, please cite the following paper:
@inproceedings{roy-cikm20,
  author = {Roy, Anurag and Verma, Vinay and Ghosh, Kripabandhu and Ghosh, Saptarshi},
  title = {{ZSCRGAN: A GAN-based Expectation-Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions}},
  booktitle = {{Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM)}},
  year = {2020}
}
Python version: 2.7
Packages:
tensorflow-gpu
easydict
scipy
six
numpy
prettytensor
pyYAML
scikit-learn
To install the dependencies, run pip install -r requirements.txt
Hyperparameters and options in run_exp.py:
batch_size: batch size used during training
gf_dim: hidden-layer dimension of the generator
df_dim: hidden-layer dimension of the discriminator
embed_dim: dimension of mu and sigma each
CSEM_lr: learning rate of the CSEM
generator_lr: learning rate of the generator
discriminator_lr: learning rate of the discriminator
epochs: number of epochs
kl_div_coefficient: coefficient of the KL-divergence loss
mm_reg_coeff: coefficient of the max-margin regularizer
z_dim: dimension of the noise vector
clip_val: clipping value for the discriminator weights in the WGAN
dataset: training dataset folder name
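For illustration, here is a hedged sketch of how these options might be gathered into a config object using easydict (one of the listed dependencies). All values shown are placeholders, not the defaults actually used in run_exp.py.

from easydict import EasyDict as edict

cfg = edict()
cfg.batch_size = 64            # batch size used during training
cfg.gf_dim = 128               # hidden-layer dimension of the generator
cfg.df_dim = 64                # hidden-layer dimension of the discriminator
cfg.embed_dim = 128            # dimension of mu and sigma each
cfg.CSEM_lr = 1e-4             # learning rate of the CSEM
cfg.generator_lr = 1e-4        # learning rate of the generator
cfg.discriminator_lr = 1e-4    # learning rate of the discriminator
cfg.epochs = 600               # number of epochs
cfg.kl_div_coefficient = 2.0   # coefficient of the KL-divergence loss
cfg.mm_reg_coeff = 1.0         # coefficient of the max-margin regularizer
cfg.z_dim = 100                # dimension of the noise vector
cfg.clip_val = 0.01            # WGAN discriminator weight-clipping value
cfg.dataset = "CUB"            # training dataset folder name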
Following are the datasets on which our experiments have been run:
Download the zip files and extract them inside the datasets folder.
To run the model on a particular dataset use the command:
python2 run_exp.py --dataset <dataset_folder_name>
Precision@50 on the test set is printed after every 100 iterations of the E-step and the M-step.
The retrieved results can be found inside the retrieved_res/<dataset_folder_name>_res/ folder. For example, the command to run the model on the CUB dataset is python2 run_exp.py --dataset CUB. This creates a folder retrieved_res/CUB_res/ containing files with the retrieval results. Each file is named acc<Prec@50>.pkl; for example, a retrieval result with a Prec@50 value of 0.521 is saved in the file acc0.521.pkl. The output pickle file contains a list of key-value pairs, where the key is the class id of the text embedding and the value is the list of class ids of the retrieved images. For example, one element of the list will be:
{3: [3, 3, 3, 25, 25, 48, 3, 25, 3, 25, 25, 25, 3, 3, 3, 48, 3, 22, 25, 25, 3, 48, 25, 22, 48, 25, 3, 3, 25, 3, 48, 3, 48, 3, 3, 48, 3, 3, 48, 3, 25, 3, 3, 25, 48, 3, 3, 3, 25, 3]}
where the key 3 is the class id of the text embedding and the values in the list are the class ids of the retrieved images.
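As an illustration, such a result file can be loaded and the per-query precision recomputed as sketched below. This is a minimal sketch: the file name acc0.521.pkl is just the example from above, and the code assumes the list-of-dicts structure described in the preceding paragraph.

import pickle

with open("retrieved_res/CUB_res/acc0.521.pkl", "rb") as f:
    results = pickle.load(f)

# Each element maps a text-embedding class id to the class ids of the
# retrieved images; a retrieved image counts as correct if its class id
# matches the query's class id.
for entry in results:
    for query_class, retrieved in entry.items():
        correct = sum(1 for c in retrieved if c == query_class)
        print("class %d: Prec@%d = %.3f"
              % (query_class, len(retrieved), float(correct) / len(retrieved)))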
Some parts of the code have been borrowed from the StackGAN code repository.