A modular ROS 2 package for vision-language model (VLM)-based perception, supporting zero-shot object detection, entity property recognition (gestures, emotions), and relation detection.
Key Features:
- Modular ROS 2 Architecture: Separate nodes for prompt management and inference
- Plugin-Based Adapters: Dynamically load vision model adapters via YAML configuration
- OpenAI-Compatible API Support: Included adapter works with any OpenAI-compatible endpoint
- Multi-Output Parsing: Extracts bounding boxes, gestures, emotions, and spatial relations
- Configuration-Driven: YAML-based model selection and parameter tuning
- Production-Ready: Real-time performance metrics, diagnostics, and monitoring GUI
- Install ROS 2 dependencies:
sudo apt install ros-jazzy-cv-bridge ros-jazzy-vision-msgs
- Set up Python environment:
The package requires Python dependencies that should be installed in a virtual environment to avoid conflicts with system packages.
# Create and activate virtual environment
cd /path/to/vlm-detections
python3 -m venv vlm_venv
source vlm_venv/bin/activate
# Install package dependencies
pip install -r requirements.txt
# Install colcon for building
pip install colcon-common-extensions lark
Note: The base package only requires minimal dependencies. If using the standalone testing app (see standalone_app/README.md), additional dependencies like transformers and torch may be needed.
- Build the package:
cd /path/to/ros2_workspace
# Build using the virtual environment's Python
/path/to/vlm-detections/vlm_venv/bin/python -m colcon build --packages-select vlm_detections perception_pipeline_msgs --symlink-install
source install/setup.bash
- Configure your model:
Edit config/model_config.yaml:
model: "OpenAI Vision (API)"
model_variant: "Qwen/Qwen2.5-VL-32B-Instruct-AWQ"
threshold: 0.25
device: "auto"
generation_params:
  max_new_tokens: 400
  temperature: 0.0
Configure API endpoint in config/adapters_config.yaml:
adapters:
  "OpenAI Vision (API)":
    module: "vlm_detections.adapters.openai_vision_adapter"
    class: "OpenAIVisionAdapter"
    constructor_args:
      base_url: "http://your-api-endpoint:8000/v1"
- Launch the detection and prompt manager nodes:
# Activate your virtual environment first
cd /path/to/vlm-detections
source vlm_venv/bin/activate
# Set API key (optional)
export OPENAI_API_KEY=your_api_key
# Launch
ros2 launch vlm_detections vlm_node.launch.py
- Launch the monitoring GUI (optional):
In a separate terminal:
# Activate your virtual environment
cd /path/to/vlm-detections
source vlm_venv/bin/activate
# Launch GUI
ros2 launch vlm_detections vlm_gui.launch.py
The GUI provides real-time visualization of detections, performance metrics, and control to pause/resume inference.
The package provides three main ROS 2 nodes that work together:
PromptManagerNode (C++)
Purpose: Manages prompt templates and publishes images with prompts for inference.
Subscribes:
- /camera/image_raw (or configured topic) - sensor_msgs/CompressedImage
- /people - strawberry_ros_msgs/People (optional, for annotation)
- /faces - strawberry_ros_msgs/Faces (optional, temporary solution)
Publishes:
- /vlm_prompts - perception_pipeline_msgs/VLMPrompts - Batched images with system/user prompts
  - Supports multiple prompt templates per image
- /vlm_debug_image - sensor_msgs/Image (debug visualization)
Parameters:
- input_image_topic (string, default: /camera/image_raw) - Input image topic
- people_topic (string, default: /people) - People detection topic for annotation
- faces_topic (string, default: /faces) - Faces detection topic for annotation
- output_topic (string, default: /vlm_prompts) - Output prompts topic
- fps (double, default: 5.0) - Image sampling rate in Hz
- batch_capacity (int, default: 1) - Number of images to batch together
- prompts_dictionary (string, default: "") - Path to prompts YAML file
- enable_people_annotation (bool, default: false) - Annotate images with people bounding boxes
- enable_faces_annotation (bool, default: false) - Annotate images with face bounding boxes (temporary)
- people_sync_tolerance (double, default: 0.2) - Time tolerance in seconds for syncing people data
- debug_image_topic (string, default: /vlm_debug_image) - Debug visualization topic
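For example, these parameters can be set from a Python launch file; a minimal sketch (the executable name prompt_manager_node is assumed here for illustration — check the package's launch files for the actual name):
from launch import LaunchDescription
from launch_ros.actions import Node

def generate_launch_description():
    return LaunchDescription([
        Node(
            package='vlm_detections',
            executable='prompt_manager_node',   # assumed executable name
            parameters=[{
                'input_image_topic': '/camera/image_raw',
                'fps': 5.0,                      # sample images at 5 Hz
                'batch_capacity': 1,             # publish one image per prompt batch
                'enable_people_annotation': False,
            }],
        ),
    ])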
Timing Configuration:
The node provides fine-grained control over inference timing through two parameters:
- fps: Image sampling frequency
  - Higher fps = more frequent sampling = lower sample period
  - Example: fps=10.0 → sample every 0.1 s
- batch_capacity: Number of images per inference call
  - batch_capacity=1: Publish immediately after each sample (real-time mode)
  - batch_capacity>1: Accumulate images before publishing (batch mode)
- Effective inference period: (1/fps) × batch_capacity
  - Example 1: fps=5.0, batch_capacity=1 → 0.2 s inference period (5 Hz)
  - Example 2: fps=5.0, batch_capacity=4 → 0.8 s inference period (1.25 Hz)
  - Example 3: fps=10.0, batch_capacity=2 → 0.2 s inference period (5 Hz)
This decoupling allows you to:
- Control VLM inference rate without changing image sampling
- Process video sequences by batching frames
- Balance latency vs. temporal context
Configuration:
- prompts_dictionary.yaml - Prompt templates per model
  - Loads from package share directory
Responsibilities:
- Subscribe to image sources
- Apply prompt templates with variable substitution
- Batch images with prompts
- Optionally annotate images with people/face bounding boxes
- Publish VLMPrompts messages
Note: Prompt augmentation capabilities (e.g., variable substitution, dynamic class lists) are currently work in progress. The node currently loads static prompt templates from the dictionary file.
VLMDetectionNode (Python)
Purpose: Performs VLM inference on images with prompts and extracts structured outputs.
Subscribes:
- /vlm_prompts - perception_pipeline_msgs/VLMPrompts - Receives images + prompts from PromptManagerNode
Publishes:
- /vlm_detections - perception_pipeline_msgs/VLMOutputs - Structured detections, gestures, emotions, relations
- /vlm_detections/image - sensor_msgs/Image - Annotated image with bounding boxes
- /vlm_detections/status - std_msgs/String - JSON status (FPS, latency, detection count)
- /diagnostics - diagnostic_msgs/DiagnosticArray - System health metrics
Services:
- /vlm_detections/set_pause - std_srvs/SetBool - Pause/resume inference
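Inference can also be paused or resumed from another node or script; a minimal rclpy sketch using the service name and type listed above:
import rclpy
from rclpy.node import Node
from std_srvs.srv import SetBool

def set_paused(paused: bool) -> None:
    rclpy.init()
    node = Node('pause_client')
    client = node.create_client(SetBool, '/vlm_detections/set_pause')
    client.wait_for_service()
    future = client.call_async(SetBool.Request(data=paused))
    rclpy.spin_until_future_complete(node, future)
    node.destroy_node()
    rclpy.shutdown()

set_paused(True)   # pause inference; set_paused(False) resumes it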
Parameters:
- prompts_topic (string, default: /vlm_prompts) - Input topic
- paused (bool, default: false) - Start paused
- output_image_topic (string, default: /vlm_detections/image)
- output_raw_topic (string, default: /vlm_detections)
- model_config_file (string, default: "") - Custom config path
- device (string, default: "auto") - Inference device
Responsibilities:
- Load VLM adapter dynamically from configuration
- Receive VLMPrompts messages
- Perform inference with configured model
- Parse JSON outputs (detections, properties, relations)
- Publish structured ROS messages
- Monitor performance and publish diagnostics
VLMMonitorGUI (Python)
Purpose: Real-time monitoring and control interface.
Subscribes:
- /vlm_detections - Displays structured outputs
- /vlm_detections/image - Shows annotated images
- /vlm_detections/status - Monitors performance
Publishes:
- None (monitoring only)
Services Called:
- /vlm_detections/set_pause - Control inference
Responsibilities:
- Display real-time detections
- Show performance metrics (FPS, latency)
- Visualize entity properties and relations
- Provide pause/resume controls
Typical Pipeline:
┌─────────────┐ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Camera │─────>│ PromptManager │─────>│ VLMDetection │─────>│ Downstream │
│ │ │ Node (C++) │ │ Node (Python) │ │ Nodes │
└─────────────┘ └──────────────────┘ └──────────────────┘ └──────────────┘
│ │
↓ ↓
/vlm_prompts /vlm_detections
(images + prompts) (structured outputs)
│
↓
/vlm_detections/image
(annotated image)
With Monitoring GUI:
┌─────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Camera │─────>│ PromptManager │─────>│ VLMDetection │
│ │ │ Node │ │ Node │
└─────────────┘ └──────────────────┘ └─────────┬────────┘
│
↓
┌──────────────┐
│ VLMMonitor │
│ GUI │
└──────────────┘
vlm_detections/
├── core/
│ ├── adapter_base.py # BaseVisionAdapter interface
│ ├── runtime.py # Dynamic adapter loading
│ ├── parsed_items.py # Data structures
│ └── visualize.py # Drawing utilities
├── adapters/
│ └── openai_vision_adapter.py # OpenAI-compatible API adapter
├── utils/
│ ├── json_parser.py # Extract JSON from mixed text
│ ├── bbox_parser.py # Parse bounding boxes
│ ├── entity_parser.py # Parse properties/relations
│ └── config_loader.py # Load YAML configs
├── ros_node.py # VLMDetectionNode implementation
└── ros_gui.py # VLMMonitorGUI implementation
config/ # Package configuration
├── adapters_config.yaml # Adapter registry
├── model_config.yaml # Model selection & parameters
└── model_variants.yaml # Available model variants
src/
└── prompt_manager_node.cpp # PromptManagerNode (C++)
The ROS node uses three main YAML configuration files located in the config/ directory at the package root:
Controls which model to use and inference settings. This is the main configuration file loaded by the ROS node.
Location: config/model_config.yaml
Structure:
# Model selection
model: "OpenAI Vision (API)" # Model name (must match entry in adapters_config.yaml)
model_variant: "Qwen/Qwen2.5-VL-32B-Instruct-AWQ" # Specific model variant
# Detection parameters
threshold: 0.25 # Confidence threshold for detections
# Device configuration
device: "auto" # Inference device: auto, cuda, cpu
# Generation parameters (model-specific)
generation_params:
  max_new_tokens: 400       # Maximum output tokens
  temperature: 0.0          # Sampling temperature (0.0 = deterministic)
  top_p: 1.0                # Nucleus sampling threshold
Notes:
- generation_params are model-specific and validated by each adapter
Defines available model adapters and their configuration. This enables the dynamic loading system.
Location: config/adapters_config.yaml
Structure:
adapters:
  "OpenAI Vision (API)":                                       # Display name used in model selection
    module: "vlm_detections.adapters.openai_vision_adapter"    # Python module path
    class: "OpenAIVisionAdapter"                               # Adapter class name
    constructor_args:                                          # Optional: passed to __init__(**kwargs)
      base_url: "http://100.115.56.116:8000/v1"                # API endpoint
      # Can reference environment variables: "${API_KEY}"
How It Works:
- runtime.py reads this file and dynamically imports adapters using importlib
- Constructor args are passed to the adapter's __init__() method
- Environment variables can be referenced with ${VAR_NAME} syntax
- New adapters can be added without modifying core code
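As a rough sketch of how this kind of importlib-based loading works (illustrative only — the actual runtime.py may differ, and expand_env below is a hypothetical helper):
import importlib
import os
import re
import yaml

def expand_env(value):
    # Replace ${VAR_NAME} references with environment variable values.
    if isinstance(value, str):
        return re.sub(r'\$\{(\w+)\}', lambda m: os.environ.get(m.group(1), ''), value)
    return value

def load_adapter(config_path: str, name: str):
    with open(config_path) as f:
        entry = yaml.safe_load(f)['adapters'][name]
    module = importlib.import_module(entry['module'])       # dynamic import
    adapter_cls = getattr(module, entry['class'])
    kwargs = {k: expand_env(v) for k, v in entry.get('constructor_args', {}).items()}
    return adapter_cls(**kwargs)                             # constructor_args -> __init__

# Example:
# adapter = load_adapter('config/adapters_config.yaml', 'OpenAI Vision (API)')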
Adding Custom Adapters:
adapters:
  "MyCustomModel":
    module: "my_package.my_adapter"
    class: "MyCustomAdapter"
    constructor_args:
      api_key: "${MY_API_KEY}"
      timeout: 30
Defines which model variants (versions/sizes) are available for each adapter.
Location: config/model_variants.yaml
Structure:
"OpenAI Vision (API)":
- "Qwen/Qwen2.5-VL-32B-Instruct-AWQ"
- "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"Purpose:
- Validates that selected
model_variantexists for the chosenmodel - Provides dropdown options in GUI applications
- Used by
runtime.pyfunctions:default_variant_for(),ensure_valid_variant()
Example with Multiple Models:
"OpenAI Vision (API)":
- "Qwen/Qwen2.5-VL-32B-Instruct-AWQ"
- "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8"
"MyCustomModel":
- "myorg/model-v1"
- "myorg/model-v2-large"The ROS node loads configuration in this order:
1. Custom config file (if specified via the model_config_file parameter)
2. Package default (config/model_config.yaml)
3. Fallback defaults (hardcoded in runtime.py)
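A sketch of this precedence (the helper name load_model_config and the fallback values are illustrative, not the actual runtime.py API):
import os
import yaml
from ament_index_python.packages import get_package_share_directory

FALLBACK_DEFAULTS = {'model': 'OpenAI Vision (API)', 'threshold': 0.25, 'device': 'auto'}

def load_model_config(custom_path: str = '') -> dict:
    # 1. Custom file passed via the model_config_file parameter
    if custom_path and os.path.isfile(custom_path):
        path = custom_path
    else:
        # 2. Package default shipped in the share directory
        path = os.path.join(get_package_share_directory('vlm_detections'),
                            'config', 'model_config.yaml')
    try:
        with open(path) as f:
            return yaml.safe_load(f)
    except OSError:
        # 3. Hardcoded fallback defaults
        return dict(FALLBACK_DEFAULTS)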
Override at launch:
ros2 launch vlm_detections vlm_node.launch.py \
  model_config_file:=/path/to/custom_config.yaml
The package uses a dynamic adapter loading system that allows adding new models without modifying core code.
Adapter Protocols:
The package defines an adapter protocol for prompt-based vision-language models:
Creating a Prompt-Based VLM adapter:
from vlm_detections.core.adapter_base import PromptBasedVLM
from vlm_detections.core.parsed_items import Detection, EntityProperty, EntityRelation
from typing import List, Tuple
import numpy as np

class MyVLMAdapter(PromptBasedVLM):
    def __init__(self, model_id: str, **kwargs):
        self.model_id = model_id
        # Handle additional constructor args from config

    def name(self) -> str:
        return f"MyVLM ({self.model_id})"

    def load(self, device: str = "auto") -> None:
        # Load vision-language model
        self.model = ...
        self.processor = ...

    def infer(
        self,
        image_bgr: np.ndarray,
        user_prompt: str,
        system_prompt: str = "",
        threshold: float = 0.1
    ) -> Tuple[List[Detection], List[EntityProperty], List[EntityRelation], str]:
        # Run inference with natural language prompts
        # Parse outputs using utils/bbox_parser.py and utils/entity_parser.py
        # Return detections, properties, relations, and raw text
        return detections, properties, relations, raw_text

Register your adapter in config/adapters_config.yaml:
adapters:
  "MyModel":
    module: "my_adapters_package.my_adapter"
    class: "MyAdapter"
    constructor_args:                    # Optional, passed to __init__
      base_url: "http://localhost:8000"
      api_key: "${MY_API_KEY}"           # Can use environment variables
Add model variants in config/model_variants.yaml:
"MyModel":
- "org/mymodel-small"
- "org/mymodel-large"This approach allows you to:
- Keep proprietary adapters in separate packages
- Load adapters from different sources without code changes
- Configure adapter-specific parameters via YAML
- Easily switch between public and private adapter sets
Adapters expose model-specific generation parameters for fine-tuning:
Common parameters (subset varies by adapter):
- max_new_tokens / max_tokens: Maximum output length
- temperature: Sampling randomness (0.0 = deterministic, 1.0 = creative)
- top_p: Nucleus sampling threshold
- num_beams: Beam search width (Florence-2, InternVL)
- do_sample: Enable/disable sampling
Implementation in adapters:
from typing import Dict

# Implemented as methods on the adapter class:
def generation_config_spec(self) -> Dict[str, Dict[str, object]]:
    return {
        "max_new_tokens": {"type": "int", "default": 400, "min": 16, "max": 4096, "step": 16},
        "temperature": {"type": "float", "default": 0.0, "min": 0.0, "max": 2.0, "step": 0.1},
        "top_p": {"type": "float", "default": 1.0, "min": 0.0, "max": 1.0, "step": 0.05},
    }

def update_generation_params(self, params: Dict[str, object]) -> None:
    self.gen_params.update(params)
    # Apply in model.generate(**self.gen_params)

ROS Integration:
- Configuration via model_config.yaml
- Dynamic parameter updates via ROS parameter server
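One way such dynamic updates could be wired is a parameter callback that forwards changes to the adapter's update_generation_params(); the class below is a hypothetical sketch, not the package's actual node code:
import rclpy
from rclpy.node import Node
from rcl_interfaces.msg import SetParametersResult

class GenParamBridge(Node):
    # Hypothetical sketch: forwards ROS parameter changes to an adapter.
    def __init__(self, adapter):
        super().__init__('gen_param_bridge')
        self.adapter = adapter
        self.declare_parameter('max_new_tokens', 400)
        self.declare_parameter('temperature', 0.0)
        self.add_on_set_parameters_callback(self.on_params)

    def on_params(self, params):
        updates = {p.name: p.value for p in params
                   if p.name in ('max_new_tokens', 'temperature')}
        if updates:
            self.adapter.update_generation_params(updates)
        return SetParametersResult(successful=True)

# Usage: rclpy.init(); rclpy.spin(GenParamBridge(adapter))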
Important: Always activate your virtual environment before launching ROS nodes:
source vlm_venv/bin/activate
Launch detection node:
ros2 launch vlm_detections vlm_node.launch.py
Launch with custom config:
ros2 launch vlm_detections vlm_node.launch.py \
model_config_file:=/path/to/custom_config.yaml \
  device:=cuda
Launch with monitoring GUI:
ros2 launch vlm_detections vlm_gui.launch.py
Published by PromptManagerNode, consumed by VLMDetectionNode:
VLMPrompts:
  header: std_msgs/Header
  images[]: sensor_msgs/Image        # Batch of images
  system_prompts[]: string           # System prompt per image
  user_prompts[]: string             # User prompt per image
  classes[]: string[]                # Optional: classes per image
Published by VLMDetectionNode:
VLMOutputs:
  header: std_msgs/Header
  outputs[]: VLMOutput               # One per input image/prompt

VLMOutput:
  vlm_metadata: VLMMetadata          # Model name, prompts
  detections[]: Detection2D          # Bounding boxes
  person_gestures[]: PersonGesture   # Gestures (waving, pointing, etc.)
  person_emotions[]: PersonEmotion   # Emotions (happy, sad, etc.)
  actor_relations[]: ActorRelation   # Spatial relations
  threshold: float32
  raw_output: string                 # Raw model text
  caption: string                    # Optional caption
Example Detection:
detections:
  - bbox: [x1, y1, x2, y2]
    label: "person"
    score: 0.95
    uuid: 1
Example Gesture:
person_gestures:
  - gesture_id: 0      # WAVING
    person_uuid: 1
    person_trackid: 1
Example Relation:
actor_relations:
  - relation_id: 2     # IS_LOOKING_AT
    subject_uuid: 1
    object_uuid: 2
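As an illustration of consuming these messages in a downstream node (the import path for VLMOutputs is assumed from the package name above):
import rclpy
from rclpy.node import Node
from perception_pipeline_msgs.msg import VLMOutputs  # assumed import path

class DetectionConsumer(Node):
    def __init__(self):
        super().__init__('detection_consumer')
        self.create_subscription(VLMOutputs, '/vlm_detections', self.on_outputs, 10)

    def on_outputs(self, msg):
        # Fields follow the VLMOutputs/VLMOutput definitions above.
        for out in msg.outputs:
            self.get_logger().info(
                f'{len(out.detections)} detections, '
                f'{len(out.person_gestures)} gestures, '
                f'{len(out.actor_relations)} relations')

rclpy.init()
rclpy.spin(DetectionConsumer())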
VLMDetectionNode extracts structured data from VLM text outputs:
1. Bounding Box Detections
- Parsed by utils/bbox_parser.py
- Supports multiple JSON formats (Qwen-style, OpenAI-style)
- Example: {"bbox_2d": [x1, y1, x2, y2], "label": "person", "score": 0.95}
2. Entity Properties (Gestures, Emotions)
- Parsed by utils/entity_parser.py
- Example: {"entity": "person_1", "property": "gesture", "value": "waving", "score": 0.95}
- Mapped to PersonGesture and PersonEmotion messages
3. Entity Relations
- Parsed by utils/entity_parser.py
- Example: {"subject": "person_1", "predicate": "looking_at", "object": "person_2"}
- Mapped to ActorRelation message
JSON Extraction:
- json_parser.py extracts JSON from mixed text/code blocks
- Handles nested structures: {"entities": [...], "properties": [...], "relations": [...]}
- Robust to markdown formatting and code fences
Node not publishing outputs:
- Check /vlm_prompts topic: ros2 topic echo /vlm_prompts
- Verify PromptManagerNode is running
- Check logs: ros2 topic echo /rosout
No detections found:
- Lower threshold in config/model_config.yaml
- Check raw output in logs or /vlm_detections/status
- Verify prompt format matches model expectations
Model fails to load:
- Check GPU memory: nvidia-smi
- Try CPU: device: cpu in config
- Verify API endpoint is accessible (for OpenAI adapter)
Performance issues:
- Monitor /diagnostics topic for latency metrics
- Reduce image resolution from camera
- Use smaller model variant
- Enable model quantization (if supported)
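For instance, the reported FPS and latency can be checked by watching the status topic (published as a std_msgs/String JSON payload, as listed above; exact keys depend on the node):
import json
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class StatusWatcher(Node):
    def __init__(self):
        super().__init__('vlm_status_watcher')
        self.create_subscription(String, '/vlm_detections/status', self.on_status, 10)

    def on_status(self, msg):
        # Print whatever metrics the node reports (FPS, latency, detection count).
        self.get_logger().info(str(json.loads(msg.data)))

rclpy.init()
rclpy.spin(StatusWatcher())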
Parsing errors:
- Check logs for JSON extraction failures
- Verify model outputs valid JSON
- Test prompt with standalone app first (see standalone_app/README.md)