
VisionShelfIQ

Multi-Object Grocery Product Recognition using Detection, Visual Retrieval, and OCR

Zero-shot retail product recognition and label understanding using CLIP + OCR.

Dataset: https://github.com/marcusklasson/GroceryStoreDataset


Overview

VisionShelfIQ is an end-to-end computer vision system for recognizing grocery products in real-world images. Unlike traditional single-label classification, this project focuses on multi-object perception under clutter, combining:

  • Object detection (YOLO)
  • Vision–language retrieval (CLIP)
  • Few-shot visual prototypes
  • OCR-based semantic refinement
  • Retrieval-centric evaluation (Recall@K, MRR)

The system is designed to reflect real retail scenarios, where images may contain multiple products, partial views, background clutter, and visually similar items.


Key Goals

  • Detect multiple products in a single image
  • Rank candidate product identities instead of forcing a single prediction
  • Separate perception errors from ranking errors
  • Evaluate the system using retrieval metrics, not just accuracy
  • Demonstrate system decomposition, not just model usage

System Architecture

The pipeline is intentionally modular:

Input Image
   │
   ▼
Object Detection (YOLO)
   │
   ▼
Image Cropping
   │
   ▼
Visual Recognition (CLIP + Prototypes)
   │
   ▼
Aggregation & Ranking
   │
   ▼
OCR-based Semantic Refinement
   │
   ▼
Final Ranked Product Predictions

Each stage is evaluated independently and jointly to expose failure modes.
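
A minimal sketch of how these stages could be chained, assuming Ultralytics YOLOv8 for detection and OpenCLIP for embeddings; the `prototypes` dictionary, model choices, and max-score aggregation below are illustrative assumptions rather than the repository's exact API:

```python
# Hypothetical end-to-end sketch: detect -> crop -> CLIP-score -> aggregate.
import torch
import open_clip
from PIL import Image
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
clip_model.eval()

def rank_products(image_path, prototypes):
    """prototypes: dict of product name -> L2-normalized CLIP embedding (1D tensor)."""
    image = Image.open(image_path).convert("RGB")
    boxes = detector(image_path)[0].boxes.xyxy.tolist()      # detected regions (x1, y1, x2, y2)
    names = list(prototypes.keys())
    proto = torch.stack([prototypes[n] for n in names])      # (num_products, dim)
    scores = torch.full((len(names),), -1.0)
    for x1, y1, x2, y2 in boxes:
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        with torch.no_grad():
            emb = clip_model.encode_image(preprocess(crop).unsqueeze(0))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        sims = (emb @ proto.T).squeeze(0)                     # cosine similarity per product
        scores = torch.maximum(scores, sims)                  # keep each product's best crop score
    order = torch.argsort(scores, descending=True)
    return [(names[int(i)], scores[int(i)].item()) for i in order]
```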

Datasets

Grocery Store Dataset (GSD)

  • Swedish grocery products
  • Strong on dairy and packaged goods
  • Reference + raw images
  • Dataset link: https://github.com/marcusklasson/GroceryStoreDataset

GroZi-120

  • US grocery and household products
  • Rich brand diversity
  • Used to improve packaged-goods coverage
  • Link to the dataset

Dataset Usage

Split                  Purpose
Reference images       Build visual prototypes
Raw / in-situ images   End-to-end evaluation

No supervised training is performed; the system is zero-shot / few-shot.


Visual Prototypes (Few-Shot Retrieval)

For each product:

  • CLIP image embeddings are extracted from reference images
  • Embeddings are averaged into a visual prototype
  • Prototypes are stored and reused (not recomputed at runtime)

This enables scalable few-shot recognition without retraining.
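
A minimal sketch of prototype construction under the same assumptions (OpenCLIP ViT-B/32; `reference_images` is a hypothetical mapping from product name to reference image paths):

```python
# Hypothetical prototype builder: average normalized CLIP embeddings per product.
import torch
import open_clip
from PIL import Image

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
clip_model.eval()

def build_prototypes(reference_images):
    """reference_images: dict of product name -> list of reference image paths."""
    prototypes = {}
    for name, paths in reference_images.items():
        batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
        with torch.no_grad():
            embs = clip_model.encode_image(batch)
        embs = embs / embs.norm(dim=-1, keepdim=True)    # normalize each reference embedding
        proto = embs.mean(dim=0)                         # few-shot prototype = mean embedding
        prototypes[name] = proto / proto.norm()          # re-normalize for cosine similarity
    return prototypes

# Persist once, reuse at runtime instead of recomputing:
# torch.save(build_prototypes(reference_images), "prototypes.pt")
```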


OCR Integration

OCR is used as a semantic refinement layer, not a primary classifier.

  • Text is extracted from detected crops
  • Unicode-normalized and cleaned
  • Matched against candidate labels using string similarity
  • Used to re-rank, not override, visual predictions

This helps resolve visually ambiguous products (e.g., similar packaging).
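
A minimal sketch of the re-ranking step, assuming EasyOCR for text extraction and a simple fuzzy-match blend; the weighting `alpha` and function names are illustrative assumptions:

```python
# Hypothetical OCR re-ranking: blend visual scores with fuzzy text similarity, never override them.
import unicodedata
from difflib import SequenceMatcher

import easyocr

reader = easyocr.Reader(["en"], gpu=False)

def normalize(text):
    """Unicode-normalize and keep only alphanumerics and spaces."""
    text = unicodedata.normalize("NFKD", text)
    return "".join(c for c in text if c.isalnum() or c.isspace()).lower().strip()

def rerank_with_ocr(crop, ranked, alpha=0.7):
    """crop: image array or path; ranked: list of (product_name, visual_score)."""
    ocr_text = normalize(" ".join(reader.readtext(crop, detail=0)))
    rescored = []
    for name, visual_score in ranked:
        text_score = SequenceMatcher(None, ocr_text, normalize(name)).ratio()
        rescored.append((name, alpha * visual_score + (1 - alpha) * text_score))
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```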


Evaluation Methodology

Why Not Accuracy?

In multi-object scenes:

  • There is no single “correct” prediction
  • Ranking matters more than top-1 correctness

Therefore, VisionShelfIQ is evaluated using retrieval metrics.


Metrics

Metric     Meaning
Recall@K   Is the ground-truth label present in the top-K predictions?
MRR        How highly is the ground-truth label ranked?
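
A minimal sketch of how these metrics can be computed from ranked prediction lists, assuming one ranked list and one ground-truth label per evaluated image:

```python
# Retrieval metrics over per-image ranked prediction lists.
def recall_at_k(ranked_lists, ground_truths, k):
    hits = sum(gt in ranked[:k] for ranked, gt in zip(ranked_lists, ground_truths))
    return hits / len(ground_truths)

def mean_reciprocal_rank(ranked_lists, ground_truths):
    total = 0.0
    for ranked, gt in zip(ranked_lists, ground_truths):
        if gt in ranked:
            total += 1.0 / (ranked.index(gt) + 1)   # reciprocal of the 1-based rank, 0 if absent
    return total / len(ground_truths)

# Example:
#   ranked_lists  = [["oat milk", "whole milk", "yoghurt"], ["apple", "pear"]]
#   ground_truths = ["whole milk", "pear"]
#   recall_at_k(ranked_lists, ground_truths, 1)        -> 0.0
#   recall_at_k(ranked_lists, ground_truths, 3)        -> 1.0
#   mean_reciprocal_rank(ranked_lists, ground_truths)  -> 0.5
```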

Metrics are reported at three system stages:

  1. Image-only (no detection)
  2. Detection + aggregation
  3. Detection + aggregation + OCR

Results — Raw / In-Situ Dataset

IMAGE-ONLY (No Detection)

Recall@1: 0.309
Recall@3: 0.464
Recall@5: 0.537
MRR:      0.395

Interpretation: CLIP retrieves semantically similar products, but background clutter dilutes the whole-image embedding, so the correct product is ranked less dominantly.


DETECTION + AGGREGATION

Recall@1: 0.090
Recall@3: 0.134
Recall@5: 0.138
MRR:      0.111

Interpretation: Object detection introduces ambiguity by detecting multiple valid objects. This exposes the difficulty of identifying the “product of interest” in cluttered scenes.


+ OCR (Full System)

Recall@1: 0.248
Recall@3: 0.369
Recall@5: 0.380
MRR:      0.305

Interpretation: OCR recovers semantic cues lost by visual detection, improving recall by ~2.7× over detection alone.


Key Insights

  • Detection is the main bottleneck, not visual embedding quality
  • OCR acts as a semantic safety net
  • Retrieval-based evaluation exposes system behavior better than accuracy
  • Reference-set performance is treated as a sanity check, not a claim

What This Project Demonstrates

  • System-level thinking in computer vision
  • Proper decomposition of detection, recognition, and ranking
  • Few-shot learning without retraining
  • Honest evaluation of real-world failure modes
  • Metrics aligned with application goals

Limitations & Future Work

  • Improve aggregation using detection confidence and crop size
  • Add error categorization (detection miss vs ranking miss vs OCR rescue)
  • Replace prototype storage with FAISS for scale
  • Optimize inference via ONNX / TensorRT
  • Add Streamlit UI for interactive demos

Technologies Used

  • Python
  • PyTorch
  • OpenCLIP
  • YOLOv8
  • EasyOCR
  • OpenCV
  • FAISS (planned)

Author

VisionShelfIQ was built as a portfolio-grade project for Computer Vision / Machine Learning Engineer roles.
