This package integrates two complementary vision-language model (VLM) modalities:
- Open-vocabulary object detection with 3D spatial grounding.
- Binary visual question-answering (Yes/No) with reasoning.
Both are wrapped as ROS2/ROS nodes.
Important Note: Only ROS2 Humble and ROS Noetic are currently supported. These instructions are for ROS2 Humble. For ROS Noetic, check this branch.
These instructions assume that ros-humble-desktop is installed on Ubuntu 22.04.
Build the repository:

```bash
mkdir -p vlm_ws/src
cd vlm_ws/src
git clone git@github.com:ntnu-arl/detection_vlm.git detection_vlm
rosdep install --from-paths . --ignore-src -r -y
cd ..
colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release
```

It is highly recommended to set up a Python virtual environment to run the ROS Python nodes:
```bash
cd vlm_ws/src/detection_vlm/detection_vlm_python
python3.8 -m venv --system-site-packages detection_vlm_env
source detection_vlm_env/bin/activate
pip install -U pip
pip install -r requirements.txt
```

Object detection is performed using either an open-vocabulary object detector (YOLOe) or a VLM-based detector (GPT-4V via API call), initialized with a set of labels or a description of the objects to detect. These models operate on an image and produce 2D bounding boxes. In parallel, a downsampled voxel grid derived from a LiDAR point cloud and odometry estimates is maintained. LiDAR points are projected into the camera frame using the current pose estimate and the camera projection matrix, and the valid projected points that fall within each 2D detection/mask are clustered. This produces aligned 2D detections and corresponding 3D bounding volumes.
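As a rough illustration of the 3D grounding step described above, the sketch below projects LiDAR points into the camera frame with a pinhole model and keeps those that land inside a 2D detection. The function and variable names (`points_in_bbox`, `T_cam_lidar`, `K`, `bbox`) are illustrative assumptions, not taken from the package, and the node's actual implementation may differ.

```python
import numpy as np

def points_in_bbox(points_lidar, T_cam_lidar, K, bbox):
    """Return LiDAR points whose projection falls inside a 2D detection box.

    points_lidar: (N, 3) points in the LiDAR frame.
    T_cam_lidar:  (4, 4) LiDAR-to-camera transform (e.g., from TF and odometry).
    K:            (3, 3) camera intrinsic matrix.
    bbox:         (x_min, y_min, x_max, y_max) detection in pixel coordinates.
    """
    # Transform the points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0.0]

    # Project with the pinhole model and normalize by depth.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    # Select points whose pixel coordinates lie inside the bounding box.
    x_min, y_min, x_max, y_max = bbox
    inside = (
        (uv[:, 0] >= x_min) & (uv[:, 0] <= x_max)
        & (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max)
    )
    # These points are the candidates that get clustered into a 3D bounding volume.
    return pts_cam[inside]
```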
The detection VLM can be run with:

```bash
ros2 launch detection_vlm_ros detection_vlm.launch.yaml
```

Input topics and the necessary frame names (for TF querying) are set in detection_vlm.launch.yaml.
Note that in this launch file you can select which config file to use. We provide two example config files:
- YOLOe: detection_yoloe.yaml
- OpenAI: detection_vlm.yaml
For high-level semantic assessment, a VLM (GPT-4V via API call) processes the front-camera image together with a binary "Yes/No" question, for example a query assessing safety- or navigation-related properties of the scene (e.g., "Is the exit of this environment blocked?"). The model returns the binary answer, a color-coded confidence overlay on the input image, and a brief explanation of its reasoning.
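The sketch below shows how such a binary query could be issued with the OpenAI Python client (openai >= 1.x). The model name, prompt wording, and helper function are assumptions for illustration only; the node's actual prompting and response handling may differ.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_yes_no(image_path: str, question: str) -> str:
    # Encode the front-camera image as a base64 data URL for the API.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; substitute the one set in your config
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"{question} Answer 'Yes' or 'No', give a confidence "
                            "between 0 and 1, and add a one-sentence explanation."
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

print(ask_yes_no("front_camera.jpg", "Is the exit of this environment blocked?"))
```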
The reasoning VLM can be run with:

```bash
export OPENAI_API_KEY=<Your OpenAI API key>
ros2 launch detection_vlm_ros reasoning_vlm.launch.yaml
```

We provide an example config file here: reasoning_vlm.yaml
Released under BSD-3-Clause.
For questions or support, reach out via GitHub Issues or contact the authors directly: