This project, built under the EVOASTRA Internship, focuses on generating human-like captions for images using Deep Learning.
The system uses a CNN encoder (InceptionV3) to extract visual features and an LSTM decoder to generate natural language captions.
Additionally, the Streamlit interface provides English → Hindi translation and Read-Aloud (Text-to-Speech) support.
The objective is to design and implement a complete deep-learning pipeline that:
- Extracts visual features from images using a pretrained CNN.
- Generates text captions using an LSTM-based decoder.
- Supports bilingual captioning (English and Hindi).
- Reads captions aloud using a built-in TTS system.
- Provides an interactive front-end using Streamlit.
- Loaded and cleaned the Flickr8k captions dataset.
- Converted the raw `captions.txt` into a structured CSV format.
- Added `<start>` and `<end>` tokens to every caption.
- Tokenized and padded captions to a uniform length.
- Prepared the final datasets for training (a minimal sketch of this step follows below).
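A minimal sketch of this preprocessing step, assuming the Kaggle `captions.txt` layout (a CSV with `image,caption` columns); the cleaning rules shown are illustrative, not the project's exact ones:

```python
# convert_to_csv.py-style preprocessing sketch (column names and cleaning
# rules are assumptions based on the Kaggle Flickr8k captions.txt format).
import pandas as pd

df = pd.read_csv("captions.txt")  # columns: image, caption

# Basic cleaning: lowercase, keep letters and spaces, collapse whitespace
df["caption"] = (
    df["caption"]
    .str.lower()
    .str.replace(r"[^a-z ]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Wrap each caption with the special tokens the decoder is trained on
df["caption"] = "<start> " + df["caption"] + " <end>"

df.to_csv("captions_10k.csv", index=False)
```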
- Used InceptionV3 to extract 2048-dimensional feature vectors.
- Saved features for efficient training.
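A sketch of the feature-extraction step, taking the 2048-dimensional pooled output of Keras' pretrained InceptionV3; the image path handling and the cached output are assumptions, not the project's exact code:

```python
# feature_extraction.py-style sketch: InceptionV3 without its classifier head.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = InceptionV3(weights="imagenet")
encoder = Model(base.input, base.layers[-2].output)  # 2048-d global-pool output

def extract_features(img_path):
    """Return a (2048,) feature vector for one image."""
    img = image.load_img(img_path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0).flatten()

# Features for all images in images/ would then be cached (e.g. pickled)
# so training never re-runs the CNN.
```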
- Converted text captions into numerical sequences.
- Applied padding for consistent input shape.
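A sketch of the sequence-generation step with Keras utilities; the sample caption and the custom `filters` string (so `<start>`/`<end>` are not stripped) are assumptions:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["<start> a brown dog running in the grass <end>"]  # sample input

# Keep '<' and '>' out of the filter list so the special tokens survive
tokenizer = Tokenizer(oov_token="<unk>",
                      filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~')
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1

sequences = tokenizer.texts_to_sequences(captions)
max_len = max(len(s) for s in sequences)

# Pad every sequence to the same length so batches share one input shape
padded = pad_sequences(sequences, maxlen=max_len, padding="post")
```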
- Built an Encoder–Decoder architecture:
- CNN encoder → extract visual features
- LSTM decoder → generate captions word-by-word
- Combined image embeddings with text embeddings.
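A sketch of a merge-style encoder–decoder in Keras consistent with the description above; layer sizes, dropout, and the `vocab_size`/`max_len` values are illustrative assumptions:

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 35  # assumed values from preprocessing

# Image branch: 2048-d InceptionV3 features -> 256-d embedding
img_in = Input(shape=(2048,))
img_emb = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: partial caption -> word embeddings -> LSTM summary
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_feat = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both branches and predict the next word over the vocabulary
merged = Dense(256, activation="relu")(add([img_emb, txt_feat]))
output = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=output)
```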
- Trained the model with image-caption pairs.
- Used Adam optimizer and tuned hyperparameters.
- Saved trained weights for inference.
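A sketch of compiling and fitting that model; `X_img`, `X_txt`, `y`, and the hyperparameters are placeholders for whatever the training script actually uses:

```python
from tensorflow.keras.optimizers import Adam

model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=1e-3))

# X_img: (N, 2048) image features, X_txt: (N, max_len) padded partial captions,
# y: (N, vocab_size) one-hot encoding of the next word to predict
model.fit([X_img, X_txt], y, epochs=20, batch_size=64, validation_split=0.1)

model.save("models/caption_model.h5")  # weights reused later for inference
```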
- Measured performance using:
- BLEU
- METEOR
- CIDEr
- Performed manual testing on unseen images.
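A sketch of corpus-level BLEU scoring with NLTK; the tokenised references and hypotheses are placeholders, and METEOR/CIDEr would need separate tooling (e.g. `pycocoevalcap`):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference captions (tokenised) per image, one hypothesis per image
references = [[["a", "brown", "dog", "runs", "in", "the", "grass"],
               ["a", "dog", "plays", "on", "the", "lawn"]]]
hypotheses = [["a", "dog", "running", "in", "the", "grass"]]

smooth = SmoothingFunction().method1  # avoid zero scores on short samples
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0),
                    smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```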
- Implemented English → Hindi translation using MarianMT.
- Added Text-to-Speech (gTTS) for both English and Hindi.
- Added a language selection radio button in Streamlit.
- Enabled dual-language captioning and voice output.
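A sketch of the translation and read-aloud features; `Helsinki-NLP/opus-mt-en-hi` is the standard MarianMT English-to-Hindi checkpoint on HuggingFace, and the output file names are placeholders:

```python
from transformers import MarianMTModel, MarianTokenizer
from gtts import gTTS

model_name = "Helsinki-NLP/opus-mt-en-hi"   # needs sentencepiece installed
tokenizer = MarianTokenizer.from_pretrained(model_name)
translator = MarianMTModel.from_pretrained(model_name)

caption_en = "A brown dog running in the grass."
batch = tokenizer([caption_en], return_tensors="pt", padding=True)
caption_hi = tokenizer.decode(translator.generate(**batch)[0],
                              skip_special_tokens=True)

# Read the selected caption aloud with gTTS
gTTS(caption_en, lang="en").save("caption_en.mp3")
gTTS(caption_hi, lang="hi").save("caption_hi.mp3")
```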
| Component | Technology |
|---|---|
| Language | Python 3 |
| Deep Learning | TensorFlow / Keras |
| Translation | HuggingFace Transformers |
| Image Processing | Pillow, OpenCV |
| Deployment | Streamlit |
| Audio Output | gTTS |
| Utilities | Pandas, NumPy, Matplotlib |
```
EVOASTRA/
│
├── convert_to_csv.py
├── captions.txt
├── captions_10k.csv
├── images/
├── feature_extraction.py
├── caption_preprocessing.py
├── caption_sequence_generation.py
├── train_model.py
├── models/
├── app.py
└── README.md
```
- 8,000 images
- 5 captions per image (40,000 total)
- Real, human-written captions
- Suitable for vision-language tasks
Dataset Source: Kaggle – Flickr8k
```bash
python convert_to_csv.py
python feature_extraction.py
python caption_preprocessing.py
python caption_sequence_generation.py
python train_model.py
streamlit run app.py
```
| Image | Generated Caption |
|---|---|
|  | "A brown dog running in the grass." |
|  | "A red car parked beside the road." |
Install all dependencies:

```bash
pip install -r req.txt
```

`req.txt` lists:

```
tensorflow
numpy
pandas
Pillow
tqdm
matplotlib
streamlit
transformers
gTTS
sentencepiece
```
- Add a Vision Transformer (ViT + GPT) model for more advanced captioning.
- Build a Flask-based web app for image uploads.
- Add CLIP-based image–caption retrieval.
- Implement a BLEU and CIDEr evaluation dashboard.
- Harsh Pandey — Data Processing, Caption Cleaning, Project Workflow, English & Hindi Translation Features
- Anish Mehra – Image Preprocessing, README, "Read Aloud" feature addition in Streamlit using gTTS
- Hitesh – Worked on training the image captioning model.
- Om – Worked on training the image captioning model.
- Chandrika – Front end built with Streamlit
- Florence - Presentation
- Supriya - Report
This project is for educational and research purposes only. Dataset © Flickr8k authors, used under academic usage terms.