Zero-shot retail product recognition and label understanding using CLIP + OCR.
Dataset: https://github.com/marcusklasson/GroceryStoreDataset
Multi-Object Grocery Product Recognition using Detection, Visual Retrieval, and OCR
VisionShelfIQ is an end-to-end computer vision system for recognizing grocery products in real-world images. Unlike traditional single-label classification, this project focuses on multi-object perception under clutter, combining:
- Object detection (YOLO)
- Vision–language retrieval (CLIP)
- Few-shot visual prototypes
- OCR-based semantic refinement
- Retrieval-centric evaluation (Recall@K, MRR)
The system is designed to reflect real retail scenarios, where images may contain multiple products, partial views, background clutter, and visually similar items. Its goals are to:
- Detect multiple products in a single image
- Rank candidate product identities instead of forcing a single prediction
- Separate perception errors from ranking errors
- Evaluate the system using retrieval metrics, not just accuracy
- Demonstrate system decomposition, not just model usage
The pipeline is intentionally modular:
```
Input Image
   │
   ▼
Object Detection (YOLO)
   │
   ▼
Image Cropping
   │
   ▼
Visual Recognition (CLIP + Prototypes)
   │
   ▼
Aggregation & Ranking
   │
   ▼
OCR-based Semantic Refinement
   │
   ▼
Final Ranked Product Predictions
```
Each stage is evaluated independently and jointly to expose failure modes.
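As a rough illustration of how these stages compose, here is a minimal sketch using YOLOv8 (ultralytics) for detection and OpenCLIP for crop embeddings. The checkpoint names, the max-score aggregation, and the `recognize` helper are illustrative assumptions, not the project's actual API; the OCR re-ranking step is sketched separately below.

```python
from ultralytics import YOLO
import open_clip
import torch
from PIL import Image

# Assumed checkpoints; the project may use different ones.
detector = YOLO("yolov8n.pt")
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model.eval()

@torch.no_grad()
def recognize(image_path, prototypes, top_k=5):
    """Detect products, embed each crop, and rank labels against stored prototypes.

    prototypes: dict mapping label -> L2-normalized CLIP embedding (1-D tensor).
    """
    image = Image.open(image_path).convert("RGB")
    boxes = detector(image_path)[0].boxes.xyxy.tolist()      # YOLO bounding boxes

    scores = {label: 0.0 for label in prototypes}
    for x1, y1, x2, y2 in boxes:
        crop = image.crop((int(x1), int(y1), int(x2), int(y2)))
        feat = clip_model.encode_image(preprocess(crop).unsqueeze(0))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        for label, proto in prototypes.items():
            # Max-aggregation across crops (one illustrative choice).
            scores[label] = max(scores[label], float(feat @ proto))

    # OCR-based re-ranking would adjust `scores` here (see below).
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```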
Two datasets are used.

GroceryStoreDataset (linked above):
- Swedish grocery products
- Strong on dairy and packaged goods
- Provides both reference and raw (in-situ) images
- Dataset link: https://github.com/marcusklasson/GroceryStoreDataset

A second retail dataset:
- US grocery and household products
- Rich brand diversity
- Used to improve packaged-goods coverage
- Link to the dataset
| Split | Purpose |
|---|---|
| Reference images | Build visual prototypes |
| Raw / in-situ images | End-to-end evaluation |
No supervised training is performed; the system is zero-shot / few-shot.
For each product:
- CLIP image embeddings are extracted from reference images
- Embeddings are averaged into a visual prototype
- Prototypes are stored and reused (not recomputed at runtime)
This enables scalable few-shot recognition without retraining.
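A minimal sketch of the prototype-building step with OpenCLIP follows; the checkpoint name, file handling, and persistence format are assumptions, not the project's exact code.

```python
import torch
import open_clip
from PIL import Image

# Assumed OpenCLIP checkpoint; any CLIP image encoder works the same way.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

@torch.no_grad()
def build_prototype(reference_image_paths):
    """Average the L2-normalized CLIP embeddings of one product's reference images."""
    feats = []
    for path in reference_image_paths:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feat = model.encode_image(image).squeeze(0)
        feats.append(feat / feat.norm())            # normalize each embedding
    prototype = torch.stack(feats).mean(dim=0)      # average into one vector
    return prototype / prototype.norm()             # re-normalize the prototype

# Computed once per product and persisted for reuse at inference time, e.g.:
# prototypes = {label: build_prototype(paths) for label, paths in reference_sets.items()}
# torch.save(prototypes, "prototypes.pt")
```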
OCR is used as a semantic refinement layer, not a primary classifier.
- Text is extracted from detected crops
- Unicode-normalized and cleaned
- Matched against candidate labels using string similarity
- Used to re-rank, not override, visual predictions
This helps resolve visually ambiguous products (e.g., similar packaging).
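A sketch of the re-ranking step, assuming EasyOCR for text extraction and difflib for string similarity; the normalization and the `alpha` blending weight are illustrative choices.

```python
import unicodedata
from difflib import SequenceMatcher

import easyocr

reader = easyocr.Reader(["en"])   # initialized once; English-only is an assumption

def normalize(text):
    """Unicode-normalize, strip accents, and lowercase text before matching."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return folded.lower().strip()

def rerank_with_ocr(crop_path, candidates, alpha=0.3):
    """Blend OCR text similarity into (label, visual_score) candidates.

    The OCR signal only nudges the visual score (weight alpha); it never
    overrides the visual ranking on its own.
    """
    ocr_text = normalize(" ".join(reader.readtext(crop_path, detail=0)))
    reranked = [
        (label, (1 - alpha) * score
                + alpha * SequenceMatcher(None, normalize(label), ocr_text).ratio())
        for label, score in candidates
    ]
    return sorted(reranked, key=lambda kv: kv[1], reverse=True)
```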
In multi-object scenes:
- There is no single “correct” prediction
- Ranking matters more than top-1 correctness
Therefore, VisionShelfIQ is evaluated using retrieval metrics.
| Metric | Meaning |
|---|---|
| Recall@K | Is the ground-truth label present in top-K predictions? |
| MRR | How high is the ground-truth label ranked, on average (mean reciprocal rank)? |
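A minimal sketch of how these two metrics can be computed from ranked prediction lists (function names are illustrative):

```python
def recall_at_k(ranked_labels, true_label, k):
    """1.0 if the ground-truth label appears in the top-k predictions, else 0.0."""
    return float(true_label in ranked_labels[:k])

def reciprocal_rank(ranked_labels, true_label):
    """1 / rank of the ground-truth label, or 0.0 if it never appears."""
    if true_label not in ranked_labels:
        return 0.0
    return 1.0 / (ranked_labels.index(true_label) + 1)

def evaluate(results, ks=(1, 3, 5)):
    """results: list of (ranked_labels, true_label) pairs for one system stage."""
    n = len(results)
    metrics = {f"Recall@{k}": sum(recall_at_k(r, t, k) for r, t in results) / n for k in ks}
    metrics["MRR"] = sum(reciprocal_rank(r, t) for r, t in results) / n
    return metrics
```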
Metrics are reported at three system stages:
- Image-only (no detection)
- Detection + aggregation
- Detection + aggregation + OCR
Image-only (no detection):
- Recall@1: 0.309
- Recall@3: 0.464
- Recall@5: 0.537
- MRR: 0.395

Interpretation: CLIP retrieves semantically similar products, but clutter reduces how often the correct label dominates the top rank.
Detection + aggregation:
- Recall@1: 0.090
- Recall@3: 0.134
- Recall@5: 0.138
- MRR: 0.111

Interpretation: Object detection introduces ambiguity by detecting multiple valid objects. This exposes the difficulty of identifying the “product of interest” in cluttered scenes.
Detection + aggregation + OCR:
- Recall@1: 0.248
- Recall@3: 0.369
- Recall@5: 0.380
- MRR: 0.305

Interpretation: OCR recovers semantic cues lost by visual detection, improving recall by ~2.7× over detection alone.
Key findings:
- Detection is the main bottleneck, not visual embedding quality
- OCR acts as a semantic safety net
- Retrieval-based evaluation exposes system behavior better than accuracy
- Reference-set performance is treated as a sanity check, not a claim
What this project demonstrates:
- System-level thinking in computer vision
- Proper decomposition of detection, recognition, and ranking
- Few-shot learning without retraining
- Honest evaluation of real-world failure modes
- Metrics aligned with application goals
Planned improvements:
- Improve aggregation using detection confidence and crop size
- Add error categorization (detection miss vs ranking miss vs OCR rescue)
- Replace prototype storage with FAISS for scale (see the sketch after this list)
- Optimize inference via ONNX / TensorRT
- Add Streamlit UI for interactive demos
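For the planned FAISS replacement, a sketch might look like the following, assuming prototypes stay L2-normalized so inner product equals cosine similarity; function names and shapes are illustrative.

```python
import faiss
import numpy as np

def build_faiss_index(prototypes):
    """prototypes: dict of label -> L2-normalized embedding (numpy array or tensor)."""
    labels = list(prototypes.keys())
    matrix = np.stack([np.asarray(p, dtype="float32") for p in prototypes.values()])
    index = faiss.IndexFlatIP(matrix.shape[1])   # inner product == cosine on unit vectors
    index.add(matrix)
    return index, labels

def query_index(index, labels, crop_embedding, k=5):
    """Return the top-k (label, score) candidates for one crop embedding."""
    query = np.asarray(crop_embedding, dtype="float32").reshape(1, -1)
    scores, ids = index.search(query, k)
    return [(labels[i], float(s)) for i, s in zip(ids[0], scores[0])]
```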
Tech stack:
- Python
- PyTorch
- OpenCLIP
- YOLOv8
- EasyOCR
- OpenCV
- FAISS (planned)
VisionShelfIQ is built as a portfolio-grade project for Computer Vision / Machine Learning Engineer roles.