Zero-shot retail product recognition and label understanding using CLIP + OCR.
Dataset: https://github.com/marcusklasson/GroceryStoreDataset
Multi-Object Grocery Product Recognition using Detection, Visual Retrieval, and OCR
VisionShelfIQ is an end-to-end computer vision system for recognizing grocery products in real-world images. Unlike traditional single-label classification, this project focuses on multi-object perception under clutter, combining:
- Object detection (YOLO)
- Vision–language retrieval (CLIP)
- Few-shot visual prototypes
- OCR-based semantic refinement
- Retrieval-centric evaluation (Recall@K, MRR)
The system is designed to reflect real retail scenarios, where images may contain multiple products, partial views, background clutter, and visually similar items. Its goals are to:
- Detect multiple products in a single image
- Rank candidate product identities instead of forcing a single prediction
- Separate perception errors from ranking errors
- Evaluate the system using retrieval metrics, not just accuracy
- Demonstrate system decomposition, not just model usage
The pipeline is intentionally modular:
```
Input Image
   │
   ▼
Object Detection (YOLO)
   │
   ▼
Image Cropping
   │
   ▼
Visual Recognition (CLIP + Prototypes)
   │
   ▼
Aggregation & Ranking
   │
   ▼
OCR-based Semantic Refinement
   │
   ▼
Final Ranked Product Predictions
```
Each stage is evaluated independently and jointly to expose failure modes.
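As a rough illustration of how these stages compose, here is a minimal sketch using YOLOv8 (ultralytics) for detection and OpenCLIP for crop embeddings. The checkpoint names, the max-score aggregation, and the `recognize` helper are illustrative assumptions, not the project's actual API; the OCR re-ranking step is sketched separately below.

```python
from ultralytics import YOLO
import open_clip
import torch
from PIL import Image

# Assumed checkpoints; the project may use different ones.
detector = YOLO("yolov8n.pt")
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model.eval()

@torch.no_grad()
def recognize(image_path, prototypes, top_k=5):
    """Detect products, embed each crop, and rank labels against stored prototypes.

    prototypes: dict mapping label -> L2-normalized CLIP embedding (1-D tensor).
    """
    image = Image.open(image_path).convert("RGB")
    boxes = detector(image_path)[0].boxes.xyxy.tolist()      # YOLO bounding boxes

    scores = {label: 0.0 for label in prototypes}
    for x1, y1, x2, y2 in boxes:
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        feat = clip_model.encode_image(preprocess(crop).unsqueeze(0))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        for label, proto in prototypes.items():
            # Max-aggregation across crops (one illustrative choice).
            scores[label] = max(scores[label], float(feat @ proto))

    # OCR-based re-ranking would adjust `scores` here (see below).
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```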
Two datasets are used.

GroceryStoreDataset (linked above):
- Swedish grocery products
- Strong on dairy and packaged goods
- Provides both reference and raw (in-situ) images
- Dataset link: https://github.com/marcusklasson/GroceryStoreDataset

A second retail dataset:
- US grocery and household products
- Rich brand diversity
- Used to improve packaged-goods coverage
- Link to the dataset
| Split | Purpose |
|---|---|
| Reference images | Build visual prototypes |
| Raw / in-situ images | End-to-end evaluation |
No supervised training is performed; the system is zero-shot / few-shot.
For each product:
- CLIP image embeddings are extracted from reference images
- Embeddings are averaged into a visual prototype
- Prototypes are stored and reused (not recomputed at runtime)
This enables scalable few-shot recognition without retraining.
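A minimal sketch of the prototype-building step with OpenCLIP follows; the checkpoint name, file handling, and persistence format are assumptions, not the project's exact code.

```python
import torch
import open_clip
from PIL import Image

# Assumed OpenCLIP checkpoint; any CLIP image encoder works the same way.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

@torch.no_grad()
def build_prototype(reference_image_paths):
    """Average the L2-normalized CLIP embeddings of one product's reference images."""
    feats = []
    for path in reference_image_paths:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feat = model.encode_image(image).squeeze(0)
        feats.append(feat / feat.norm())            # normalize each embedding
    prototype = torch.stack(feats).mean(dim=0)      # average into one vector
    return prototype / prototype.norm()             # re-normalize the prototype

# Computed once per product and persisted for reuse at inference time, e.g.:
# prototypes = {label: build_prototype(paths) for label, paths in reference_sets.items()}
# torch.save(prototypes, "prototypes.pt")
```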
OCR is used as a semantic refinement layer, not a primary classifier.
- Text is extracted from detected crops
- Unicode-normalized and cleaned
- Matched against candidate labels using string similarity
- Used to re-rank, not override, visual predictions
This helps resolve visually ambiguous products (e.g., similar packaging).
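A sketch of the re-ranking step, assuming EasyOCR for text extraction and difflib for string similarity; the normalization and the `alpha` blending weight are illustrative choices.

```python
import unicodedata
from difflib import SequenceMatcher

import easyocr

reader = easyocr.Reader(["en"])   # initialized once; English-only is an assumption

def normalize(text):
    """Unicode-normalize, strip accents, and lowercase text before matching."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return folded.lower().strip()

def rerank_with_ocr(crop_path, candidates, alpha=0.3):
    """Blend OCR text similarity into (label, visual_score) candidates.

    The OCR signal only nudges the visual score (weight alpha); it never
    overrides the visual ranking on its own.
    """
    ocr_text = normalize(" ".join(reader.readtext(crop_path, detail=0)))
    reranked = [
        (label, (1 - alpha) * score
                + alpha * SequenceMatcher(None, normalize(label), ocr_text).ratio())
        for label, score in candidates
    ]
    return sorted(reranked, key=lambda kv: kv[1], reverse=True)
```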
In multi-object scenes:
- There is no single “correct” prediction
- Ranking matters more than top-1 correctness
Therefore, VisionShelfIQ is evaluated using retrieval metrics.
| Metric | Meaning |
|---|---|
| Recall@K | Is the ground-truth label present in top-K predictions? |
| MRR | How high is the ground-truth label ranked, on average (mean reciprocal rank)? |
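A minimal sketch of how these two metrics can be computed from ranked prediction lists (function names are illustrative):

```python
def recall_at_k(ranked_labels, true_label, k):
    """1.0 if the ground-truth label appears in the top-k predictions, else 0.0."""
    return float(true_label in ranked_labels[:k])

def reciprocal_rank(ranked_labels, true_label):
    """1 / rank of the ground-truth label, or 0.0 if it never appears."""
    if true_label not in ranked_labels:
        return 0.0
    return 1.0 / (ranked_labels.index(true_label) + 1)

def evaluate(results, ks=(1, 3, 5)):
    """results: list of (ranked_labels, true_label) pairs for one system stage."""
    n = len(results)
    metrics = {f"Recall@{k}": sum(recall_at_k(r, t, k) for r, t in results) / n for k in ks}
    metrics["MRR"] = sum(reciprocal_rank(r, t) for r, t in results) / n
    return metrics
```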
Metrics are reported at three system stages:
- Image-only (no detection)
- Detection + aggregation
- Detection + aggregation + OCR
Image-only (no detection):
- Recall@1: 0.309
- Recall@3: 0.464
- Recall@5: 0.537
- MRR: 0.395

Interpretation: CLIP retrieves semantically similar products, but clutter reduces how often the correct label dominates the top rank.
Detection + aggregation:
- Recall@1: 0.090
- Recall@3: 0.134
- Recall@5: 0.138
- MRR: 0.111

Interpretation: Object detection introduces ambiguity by detecting multiple valid objects. This exposes the difficulty of identifying the “product of interest” in cluttered scenes.
Detection + aggregation + OCR:
- Recall@1: 0.248
- Recall@3: 0.369
- Recall@5: 0.380
- MRR: 0.305

Interpretation: OCR recovers semantic cues lost by visual detection, improving recall by ~2.7× over detection alone.
Key findings:
- Detection is the main bottleneck, not visual embedding quality
- OCR acts as a semantic safety net
- Retrieval-based evaluation exposes system behavior better than accuracy
- Reference-set performance is treated as a sanity check, not a claim
What this project demonstrates:
- System-level thinking in computer vision
- Proper decomposition of detection, recognition, and ranking
- Few-shot learning without retraining
- Honest evaluation of real-world failure modes
- Metrics aligned with application goals
Planned improvements:
- Improve aggregation using detection confidence and crop size
- Add error categorization (detection miss vs ranking miss vs OCR rescue)
- Replace prototype storage with FAISS for scale (see the sketch after this list)
- Optimize inference via ONNX / TensorRT
- Add Streamlit UI for interactive demos
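For the planned FAISS replacement, a sketch might look like the following, assuming prototypes stay L2-normalized so inner product equals cosine similarity; function names and shapes are illustrative.

```python
import faiss
import numpy as np

def build_faiss_index(prototypes):
    """prototypes: dict of label -> L2-normalized embedding (numpy array or tensor)."""
    labels = list(prototypes.keys())
    matrix = np.stack([np.asarray(p, dtype="float32") for p in prototypes.values()])
    index = faiss.IndexFlatIP(matrix.shape[1])   # inner product == cosine on unit vectors
    index.add(matrix)
    return index, labels

def query_index(index, labels, crop_embedding, k=5):
    """Return the top-k (label, score) candidates for one crop embedding."""
    query = np.asarray(crop_embedding, dtype="float32").reshape(1, -1)
    scores, ids = index.search(query, k)
    return [(labels[i], float(s)) for i, s in zip(ids[0], scores[0])]
```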
Tech stack:
- Python
- PyTorch
- OpenCLIP
- YOLOv8
- EasyOCR
- OpenCV
- FAISS (planned)
VisionShelfIQ is built as a portfolio-grade project for Computer Vision / Machine Learning Engineer roles.