DocScope-R1

A powerful multi-modal AI application that combines four state-of-the-art vision-language models for comprehensive image and video analysis. DocScope-R1 provides OCR, detailed scene understanding, and video content analysis through an intuitive Gradio interface.

[Screenshot: the DocScope-R1 interface running as a Hugging Face Space by prithivMLmods]

Important

Note: remove the kernels dependency and the flash_attn3 attention implementation if you are running on non-Hopper GPUs; FlashAttention-3 kernels target Hopper-class hardware (e.g., H100).
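
On such GPUs, a minimal sketch of loading one of the listed models with PyTorch's built-in SDPA attention instead (illustrative; not the exact app.py code):

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Fall back to SDPA attention on GPUs without FlashAttention-3 support
# (anything pre-Hopper, e.g. RTX 30/40-series or A100).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/docscopeOCR-7B-050425-exp",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",  # instead of a flash_attn3 kernel
    device_map="auto",
)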

Features

  • Multi-Model Support: Choose from four specialized models for different tasks
  • Image Analysis: Upload images for OCR, scene description, and detailed captioning
  • Video Processing: Analyze videos with frame-by-frame understanding
  • Real-time Streaming: Get responses as they are generated
  • Advanced Controls: Fine-tune generation parameters for optimal results

Supported Models

1. Cosmos-Reason1-7B (NVIDIA)

  • Purpose: Physical common sense understanding and embodied decision making
  • Best for: Reasoning about physical interactions and spatial relationships
  • Model: nvidia/Cosmos-Reason1-7B

2. DocScope OCR-7B

  • Purpose: Document-level optical character recognition
  • Best for: Text extraction from documents and long-context vision-language understanding
  • Model: prithivMLmods/docscopeOCR-7B-050425-exp

3. Captioner-Relaxed-7B

  • Purpose: Detailed image captioning and description
  • Best for: Generating comprehensive descriptions for text-to-image training data
  • Model: Ertugrul/Qwen2.5-VL-7B-Captioner-Relaxed

4. visionOCR-3B-061125

  • Purpose: Detailed image OCR and description
  • Best for: Extracting text and generating comprehensive, structured descriptions of images
  • Model: prithivMLmods/visionOCR-3B-061125

Installation

Prerequisites

  • Python 3.8+
  • CUDA-compatible GPU (recommended)
  • At least 16GB RAM
  • 20GB+ free disk space for models

Dependencies

git+https://github.com/huggingface/transformers.git@v4.57.6
git+https://github.com/huggingface/accelerate.git
git+https://github.com/huggingface/peft.git
transformers-stream-generator
huggingface_hub
qwen-vl-utils
sentencepiece
opencv-python
torch==2.8.0
torchvision
matplotlib
requests
kernels
hf_xet
spaces
pillow
gradio  # gradio@6.3.0
av

Clone Repository

git clone https://github.com/PRITHIVSAKTHIUR/DocScope-R1.git
cd DocScope-R1
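
Install the dependencies listed above (the repository ships them as requirements.txt):

pip install -r requirements.txt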

Usage

Running the Application

python app.py

The application will start and provide you with a local URL (typically http://127.0.0.1:7860) to access the web interface.
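
If port 7860 is already in use, Gradio falls back to the next free port; you can also set the GRADIO_SERVER_PORT environment variable to pin one explicitly.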

Image Analysis

  1. Select the "Image Inference" tab
  2. Enter your query in the text box
  3. Upload an image
  4. Choose your preferred model
  5. Adjust advanced parameters if needed
  6. Click "Submit"

Example Queries:

  • "Perform OCR on the text in the image"
  • "Explain the scene in detail"
  • "Describe all objects and their relationships"

Video Analysis

  1. Select the "Video Inference" tab
  2. Enter your query describing what you want to analyze
  3. Upload a video file
  4. Select the appropriate model
  5. Configure generation parameters
  6. Click "Submit"

Example Queries:

  • "Explain the advertisement in detail"
  • "Identify the main actions in the video"
  • "Describe the sequence of events"

Configuration

Advanced Parameters

  • Max New Tokens (1-2048): Maximum length of generated response
  • Temperature (0.1-4.0): Controls randomness in generation
  • Top-p (0.05-1.0): Nucleus sampling parameter
  • Top-k (1-1000): Limits vocabulary for each step
  • Repetition Penalty (1.0-2.0): Reduces repetitive outputs
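
These sliders map one-to-one onto transformers' generate() arguments. An illustrative configuration (example values, not the app's exact defaults):

generation_kwargs = dict(
    max_new_tokens=1024,     # Max New Tokens
    do_sample=True,          # enable sampling so the knobs below take effect
    temperature=0.6,         # Temperature
    top_p=0.9,               # Top-p (nucleus sampling)
    top_k=50,                # Top-k
    repetition_penalty=1.2,  # Repetition Penalty
)
output_ids = model.generate(**inputs, **generation_kwargs)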

Environment Variables

  • MAX_INPUT_TOKEN_LENGTH: Maximum input context length (default: 4096)
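
This is typically read once at startup; the equivalent sketch:

import os

MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))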

Technical Details

Video Processing

Videos are automatically downsampled to 10 evenly spaced frames for analysis. Each frame is processed with its timestamp and combined into a comprehensive understanding of the video content.
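
A sketch of that sampling step with OpenCV (opencv-python is already in the dependency list; the function name is illustrative):

import cv2
import numpy as np
from PIL import Image

def downsample_video(path, num_frames=10):
    """Return (frame, timestamp-in-seconds) pairs, evenly spaced over the video."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # guard against missing metadata
    frames = []
    for idx in np.linspace(0, total - 1, num_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes as BGR
        frames.append((Image.fromarray(frame), round(idx / fps, 2)))
    cap.release()
    return frames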

Model Architecture

All models are based on the Qwen2.5-VL architecture with different fine-tuning objectives, and they share the same inference setup:

  • Half-precision (float16) inference for efficiency
  • GPU acceleration with CUDA support
  • Streaming text generation for real-time responses
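
The streaming behavior follows transformers' standard TextIteratorStreamer pattern; a sketch reusing processor, model, and inputs from the image example above:

from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True,
                                skip_special_tokens=True)
# generate() runs on a background thread; the streamer yields decoded text
# chunks as they are produced, which a Gradio generator can re-emit.
thread = Thread(target=model.generate,
                kwargs=dict(**inputs, streamer=streamer, max_new_tokens=1024))
thread.start()
buffer = ""
for chunk in streamer:
    buffer += chunk
    # yield buffer  # inside the Gradio callback
thread.join()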

Performance Optimization

  • Models are loaded once at startup
  • GPU memory is efficiently managed
  • Streaming responses provide immediate feedback
  • Automatic device detection (CUDA/CPU)

File Structure

DocScope-R1/
├── app.py              # Main application file
├── README.md           # This file
├── requirements.txt    # Python dependencies
├── images/            # Example images
│   ├── 1.jpg
│   └── 2.jpg
└── videos/            # Example videos
    ├── 1.mp4
    └── 2.mp4

System Requirements

Minimum Requirements

  • GPU: 8GB VRAM (RTX 3070 or equivalent)
  • RAM: 16GB system memory
  • Storage: 25GB free space
  • CPU: Multi-core processor (Intel i5/AMD Ryzen 5 or better)

Recommended Requirements

  • GPU: 12GB+ VRAM (RTX 4070 Ti or better)
  • RAM: 32GB system memory
  • Storage: SSD with 50GB free space
  • CPU: High-performance processor (Intel i7/AMD Ryzen 7 or better)

Troubleshooting

Common Issues

CUDA Out of Memory

  • Reduce max_new_tokens
  • Lower the input resolution
  • Use CPU inference (much slower, but sidesteps VRAM limits entirely; see the sketch below)
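
For the CPU fallback, a sketch (a 7B model in float32 needs roughly 28 GB of system RAM and is far slower than GPU inference):

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # float16 matmuls are poorly supported on CPU
)
# No device_map or .to("cuda"): the model stays on the CPU.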

Model Loading Errors

  • Ensure stable internet connection for initial model download
  • Check available disk space
  • Verify Hugging Face access for gated models

Video Processing Issues

  • Ensure video format is supported (MP4, AVI, MOV)
  • Check video file isn't corrupted
  • Reduce video length for large files

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

Acknowledgments

  • NVIDIA for the Cosmos-Reason1-7B model
  • Qwen team for the base architecture
  • Hugging Face for the transformers library
  • Gradio team for the interface framework

Contact

For questions, issues, or collaborations, please open an issue on GitHub or contact the maintainer.
