This package integrates two complementary vision-language model (VLM) modalities:
- Open-vocabulary object detection with 3D spatial grounding.
- Binary visual question-answering (Yes/No) with reasoning.
Both are wrapped as ROS2/ROS nodes.
Important Note: Only ROS2 Humble and ROS Noetic are currently supported. These instructions are for ROS2 Humble. For ROS Noetic, check this branch.
These instructions assume that ros-humble-desktop is installed on Ubuntu 22.04.
Build the repository:

```bash
mkdir -p vlm_ws/src
cd vlm_ws/src
git clone git@github.com:ntnu-arl/detection_vlm.git detection_vlm
rosdep install --from-paths . --ignore-src -r -y
cd ..
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release
```

It is highly recommended to set up a Python virtual environment to run the ROS Python nodes:
```bash
cd vlm_ws/src/detection_vlm/detection_vlm_python
python3.8 -m venv --system-site-packages detection_vlm_env
source detection_vlm_env/bin/activate
pip install -U pip
pip install -r requirements.txt
```

Object detection is performed using either an open-vocabulary object detector (YOLOe) or a VLM-based detector (GPT-4V via API call), initialized with a set of labels or a description of the objects to detect. These models operate on an image and produce 2D bounding boxes. In parallel, a downsampled voxel grid derived from a LiDAR point cloud and odometry estimates is maintained. LiDAR points are projected into the camera frame using the current pose estimate and the camera projection matrix, and the valid projected points that fall within each 2D detection/mask are clustered. This produces aligned 2D detections and corresponding 3D bounding volumes.
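As a rough illustration of the 3D grounding step described above, the sketch below projects LiDAR points into the camera frame with a pinhole model and keeps those that land inside a 2D detection. The function and variable names (`points_in_bbox`, `T_cam_lidar`, `K`, `bbox`) are illustrative assumptions, not taken from the package, and the node's actual implementation may differ.

```python
import numpy as np

def points_in_bbox(points_lidar, T_cam_lidar, K, bbox):
    """Return LiDAR points whose projection falls inside a 2D detection box.

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) LiDAR-to-camera transform (e.g., from TF and odometry).
    K:            (3, 3) camera intrinsic matrix.
    bbox:         (x_min, y_min, x_max, y_max) detection in pixel coordinates.
    """
    # Transform the points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0.0]

    # Project with the pinhole model and normalize by depth.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Select points whose pixel coordinates lie inside the bounding box.
    x_min, y_min, x_max, y_max = bbox
    inside = (
        (uv[:, 0] >= x_min) & (uv[:, 0] <= x_max)
        & (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max)
    )
    # These points are the candidates that get clustered into a 3D bounding volume.
    return pts_cam[inside]
```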
The detection VLM can be run with:

```bash
ros2 launch detection_vlm_ros detection_vlm.launch.yaml
```

Input topics and the necessary frame names (for TF querying) are set in detection_vlm.launch.yaml.
Note that in this launch file you can select which config file to use. We provide two example config files:
- YOLOe: detection_yoloe.yaml
- OpenAI: detection_vlm.yaml
For high-level semantic assessment, a VLM (GPT-4V via API call) processes the front-camera image together with a binary "Yes/No" question, for example a query assessing safety- or navigation-related properties of the scene (e.g., "Is the exit of this environment blocked?"). The model returns the binary answer, a color-coded confidence overlay on the input image, and a brief explanation of its reasoning.
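The sketch below shows how such a binary query could be issued with the OpenAI Python client (openai >= 1.x). The model name, prompt wording, and helper function are assumptions for illustration only; the node's actual prompting and response handling may differ.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_yes_no(image_path: str, question: str) -> str:
    # Encode the front-camera image as a base64 data URL for the API.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; substitute the one set in your config
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"{question} Answer 'Yes' or 'No', give a confidence "
                            "between 0 and 1, and add a one-sentence explanation."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

print(ask_yes_no("front_camera.jpg", "Is the exit of this environment blocked?"))
```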
The reasoning VLM can be run with:

```bash
export OPENAI_API_KEY=<Your OpenAI API key>
ros2 launch detection_vlm_ros reasoning_vlm.launch.yaml
```

We provide an example config file here: reasoning_vlm.yaml
Released under BSD-3-Clause.
For questions or support, reach out via GitHub Issues or contact the authors directly: