This project implements an image captioning model using BERT for text embeddings and VGG16 for image feature extraction. The model architecture consists of an encoder and a decoder.
- Introduction
- Creating a Small Dataset
- Preprocessing Pipelines
- Model Architecture
- Training and Evaluation
This project aims to generate captions for images using a combination of BERT for text embeddings and VGG16 for image feature extraction. The model architecture includes an encoder to process image features and a decoder to generate captions.
To make initial experiments easier, we create a small dataset by sampling 1% of the total images and their corresponding captions. This reduces the computational load and speeds up experimentation.
- Directories and Files:
  - `Images`: Directory containing the original images.
  - `Small_Images`: Directory to store the sampled images.
  - `captions.txt`: File containing the original captions.
  - `small_captions.txt`: File to store the sampled captions.
- Script to Create Small Dataset:
  - The `data.py` script samples 1% of the images and their corresponding captions and saves them in the `Small_Images` directory and the `small_captions.txt` file.
```python
import os
import random
from PIL import Image

# Directories and files
images_dir = "Images"
small_images_dir = "Small_Images"
captions_file = "captions.txt"
small_captions_file = "small_captions.txt"

# Create the directory for small images if it doesn't exist
os.makedirs(small_images_dir, exist_ok=True)

# List all images in the directory
images = os.listdir(images_dir)

# Calculate 1% of the total images (at least one image)
images_to_keep = max(1, len(images) // 100)

# Select 1% of the images randomly
sampled_images = random.sample(images, images_to_keep)

# Copy the sampled images to the new directory
for image in sampled_images:
    img = Image.open(f'{images_dir}/{image}')
    img.save(f'{small_images_dir}/{image}')

# Read the original captions file
with open(captions_file, 'r') as file:
    lines = file.readlines()

# Filter captions for the sampled images
header = lines[0]  # Keep the header
sampled_captions = [header] + [line for line in lines[1:] if line.split(',')[0] in sampled_images]

# Write the sampled captions to a new file
with open(small_captions_file, 'w') as file:
    file.writelines(sampled_captions)

print(f'Sampled {images_to_keep} images out of {len(images)} total images.')
print(f'Sampled captions written to {small_captions_file}.')
```

Image Preprocessing Pipeline

The image preprocessing pipeline uses VGG16 to extract features from the images. The features are extracted from the `fc2` layer of VGG16.
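As a rough illustration of this step, the sketch below extracts `fc2` features with Keras' pre-trained VGG16. The `extract_features` helper name is illustrative, not necessarily the one used in this repository; the 224x224 input size and the 4096-dimensional `fc2` output are standard for VGG16.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Load VGG16 with ImageNet weights and keep the fc2 layer (4096-d) as the output
base = VGG16(weights='imagenet')
feature_extractor = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

def extract_features(image_path):
    # VGG16 expects 224x224 RGB inputs
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return feature_extractor.predict(x, verbose=0)[0]  # shape: (4096,)
```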
The text preprocessing pipeline uses BERT to encode the captions: the BERT tokenizer converts each caption into token IDs, and the BERT model maps them to embeddings.
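A minimal sketch of this encoding step with Hugging Face Transformers; `bert-base-uncased`, the TensorFlow variant of BERT, and the maximum caption length of 40 tokens are assumptions, not necessarily what the project uses.

```python
from transformers import BertTokenizer, TFBertModel

# Assumed checkpoint; the project may use a different BERT variant
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = TFBertModel.from_pretrained('bert-base-uncased')

def encode_caption(caption, max_length=40):
    # Tokenize and pad/truncate to a fixed length
    inputs = tokenizer(caption, return_tensors='tf',
                       padding='max_length', truncation=True,
                       max_length=max_length)
    outputs = bert(inputs)
    # last_hidden_state: (1, max_length, 768) token-level embeddings
    return outputs.last_hidden_state[0].numpy()
```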
The model architecture consists of an encoder and a decoder. The encoder processes the image features, and the decoder generates captions.
The encoder uses a pre-trained VGG16 model to extract features from the images.
The decoder uses BERT embeddings and an LSTM to generate captions.
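The sketch below shows one common way to realize such an encoder-decoder ("merge") architecture in Keras; the layer sizes, dropout rates, caption length, and vocabulary size are assumptions, and the repository's actual model may differ.

```python
from tensorflow.keras.layers import Input, Dense, LSTM, Dropout, add
from tensorflow.keras.models import Model

# Assumed dimensions: 4096-d fc2 features, 40-token captions,
# 768-d BERT embeddings, and BERT's WordPiece vocabulary size.
max_length = 40
vocab_size = 30522

# Encoder branch: project the VGG16 fc2 feature vector
image_input = Input(shape=(4096,))
img_dense = Dense(256, activation='relu')(Dropout(0.4)(image_input))

# Decoder branch: run the pre-computed BERT embeddings through an LSTM
caption_input = Input(shape=(max_length, 768))
lstm_out = LSTM(256)(Dropout(0.4)(caption_input))

# Merge both branches and predict the next token
merged = add([img_dense, lstm_out])
hidden = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```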
The model is trained using the preprocessed image features and encoded captions. The training process involves defining a data generator, creating a dataset, and fitting the model.
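A hedged sketch of how the generator, dataset, and fit step could be wired together with `tf.data`; the shapes follow the architecture sketch above, the dummy arrays stand in for the real preprocessed data, and the batch size and epoch count are placeholders.

```python
import numpy as np
import tensorflow as tf

# Dummy data standing in for the real preprocessed features and captions
# (assumed shapes: 4096-d fc2 features, 40x768 BERT embeddings, vocab-size targets)
num_samples, max_length, vocab_size = 8, 40, 30522
image_features = np.random.rand(num_samples, 4096).astype('float32')
caption_embeddings = np.random.rand(num_samples, max_length, 768).astype('float32')
targets = np.random.rand(num_samples, vocab_size).astype('float32')

def data_generator(features, captions, labels):
    # Yield one ((image_feature, caption_embedding), target) pair at a time
    for feat, cap, y in zip(features, captions, labels):
        yield (feat, cap), y

output_signature = (
    (tf.TensorSpec(shape=(4096,), dtype=tf.float32),
     tf.TensorSpec(shape=(max_length, 768), dtype=tf.float32)),
    tf.TensorSpec(shape=(vocab_size,), dtype=tf.float32),
)

dataset = tf.data.Dataset.from_generator(
    lambda: data_generator(image_features, caption_embeddings, targets),
    output_signature=output_signature,
).batch(4).prefetch(tf.data.AUTOTUNE)

# Fit the encoder-decoder model defined in the architecture sketch above
model.fit(dataset, epochs=20)
```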
The `main.py` script integrates the preprocessing pipelines and the model architecture to train and evaluate the model.