Conversation

@skoudmar (Contributor) commented Mar 6, 2025

This PR adds support for tracing by importing tracepoints used by rclcpp and creating new ones.

ROS 2 tracing is built on the LTTng tracing framework.

Motivation

Tracing ROS applications can help identify performance bottlenecks or measure and identify causes of latencies.

Various tools are available to analyze the resulting traces.

Overview

Traces of an R2R application without this PR already include tracepoints
from rcl and rmw, but these are not sufficient for extracting
useful information. Therefore, this PR extends R2R in the following
areas:

  • Creates the r2r_tracing module, which
    • Imports tracepoints from ROS package tracetools (a part of ros2_tracing).
    • Defines new R2R specific tracepoints (for Node::spin_once).
    • Uses the tracing feature flag. Without it nothing is imported/linked, and tracepoints are no-ops.
  • Calls tracepoint functions defined in r2r_tracing from r2r code.
  • Adds methods Node::subscribe_traced, Node::create_service_traced,
    and Timer::on_tick. The first two are similar to Node::subscribe
    and Node::create_service but add tracepoints around the
    user-supplied callback; Timer::on_tick adds tracepoints to the
    timer callback.
  • Updates R2R so that ROS entities are created at a stable memory
    address.

Stable memory address

Without this PR, ROS entities are first created on the stack, so initialization events like ros2:rcl_publisher_init record the stack address in the tracepoint event. Besides the address, these initialization events usually contain helpful information such as the topic name or timer period. R2R then moves the entity to the heap using Box or Arc. When a traced function, for example rcl_publish, is later called on the entity, the tracepoint event references the heap address instead. Since the heap address differs from the original stack address, the traced function cannot be matched with the corresponding initialization event.

This PR solves the problem for publishers (cb10d28), subscriptions (3f24514), services (e9dece1), and clients (c926fa6). For timers, the problem was already solved in PR #88.
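The address-stability issue can be illustrated with a minimal sketch (the `Handle` type and function names below are illustrative, not r2r's actual API): if the handle is initialized on the stack and only then boxed, the address recorded at init time differs from the one seen later; boxing first keeps it stable.

```rust
// Illustrative sketch of the stable-address fix; not r2r's actual code.

struct Handle {
    initialized: bool,
}

/// Problematic pattern: initialize on the stack, then move to the heap.
fn init_then_box() -> (usize, usize) {
    let h = Handle { initialized: true }; // an "rcl_*_init" event would record this address
    let init_addr = &h as *const Handle as usize;
    let boxed = Box::new(h); // the move changes the address
    let later_addr = &*boxed as *const Handle as usize;
    (init_addr, later_addr)
}

/// Fixed pattern: box first, then initialize in place.
fn box_then_init() -> (usize, usize) {
    let mut boxed = Box::new(Handle { initialized: false });
    let init_addr = &*boxed as *const Handle as usize;
    boxed.initialized = true; // init records the heap address
    let later_addr = &*boxed as *const Handle as usize;
    (init_addr, later_addr)
}

fn main() {
    let (a, b) = init_then_box();
    assert_ne!(a, b); // later trace events cannot be matched with init
    let (a, b) = box_then_init();
    assert_eq!(a, b); // later trace events match the init event
    println!("stable address: {:#x}", a);
}
```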

Importing tracepoints from ros2_tracing

To add tracepoints compatible with rclcpp to the R2R code, we first need to import them. This is handled by dynamically loading the tracetools library from ros2_tracing. tracetools is a part of ROS and is depended on by, e.g., rcl and rclcpp.

The following tracepoints are imported:

  • ros2:rclcpp_subscription_init
  • ros2:rclcpp_timer_link_node
  • ros2:rclcpp_subscription_callback_added
  • ros2:rclcpp_service_callback_added
  • ros2:rclcpp_timer_callback_added
  • ros2:rclcpp_callback_register
  • ros2:rclcpp_publish
  • ros2:rclcpp_take
  • ros2:callback_start
  • ros2:callback_end

Reference: https://github.com/ros2/ros2_tracing/blob/rolling/doc/design_ros_2.md

Defining new tracepoints

In rclcpp, sampling of ready entities is announced by the tracepoint rclcpp_executor_wait_for_work. However, this tracepoint is tied to the C++ executor lifecycle, and R2R does not include executors. In R2R, sampling is done by Node::spin_once. Therefore, to capture information about sampling, we define new tracepoints:

  • r2r:spin_start
  • r2r:spin_wake
  • r2r:spin_timeout
  • r2r:spin_end

These are executed according to the following table:

Spin result               Tracepoint sequence
Some entities are ready   spin_start -> spin_wake -> spin_end
The wait timed out        spin_start -> spin_timeout
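The two sequences above can be modeled by a small, runnable sketch (the control flow is a simplified model of spin_once, with tracepoint calls replaced by a recording vector so the flow is observable; it is not the PR's exact code):

```rust
// Simplified model of where the new r2r tracepoints fire in Node::spin_once.

#[derive(Debug, PartialEq)]
enum Tp {
    SpinStart,
    SpinWake,
    SpinTimeout,
    SpinEnd,
}

fn spin_once(entities_ready: bool, trace: &mut Vec<Tp>) {
    trace.push(Tp::SpinStart); // r2r:spin_start - waiting begins
    if entities_ready {
        trace.push(Tp::SpinWake); // r2r:spin_wake - the wait returned work
        // ... ready callbacks would execute here ...
        trace.push(Tp::SpinEnd); // r2r:spin_end - processing is done
    } else {
        trace.push(Tp::SpinTimeout); // r2r:spin_timeout - the wait timed out
    }
}

fn main() {
    let mut trace = Vec::new();
    spin_once(true, &mut trace);
    assert_eq!(trace, [Tp::SpinStart, Tp::SpinWake, Tp::SpinEnd]);

    trace.clear();
    spin_once(false, &mut trace);
    assert_eq!(trace, [Tp::SpinStart, Tp::SpinTimeout]);
}
```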

The tracepoints are generated by the crate lttng-ust-generate. Its released version is old and does not support usize, so I created a PR (sapphire-arches/lttng-ust-rs#5) that fixes it; since the maintainer is not active, my fork is used instead.

Additionally, the generated code needs to link against the lttng-ust library, which is only available on Linux. This is the main reason the tracing support is feature-gated.

tracing feature flag

Generation, inclusion, and linking are controlled by the tracing feature flag in r2r_tracing, which is also exported by r2r.

Without the tracing feature, nothing is imported or linked, and all tracepoints are no-ops.
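The no-op behavior can be sketched with cfg-gated stubs (the function name below is hypothetical; r2r_tracing's real code is generated by lttng-ust-generate):

```rust
// Sketch of feature-gated tracepoints. With the `tracing` feature off,
// the stub references no LTTng symbols, so nothing links against lttng-ust.

#[cfg(feature = "tracing")]
fn trace_spin_start(node_addr: usize) {
    // the LTTng-generated tracepoint would fire here
    println!("r2r:spin_start node={:#x}", node_addr);
}

#[cfg(not(feature = "tracing"))]
#[inline(always)]
fn trace_spin_start(_node_addr: usize) {
    // no-op: compiles away entirely
}

fn tracing_enabled() -> bool {
    cfg!(feature = "tracing")
}

fn main() {
    trace_spin_start(0x1000);
    println!("tracing enabled: {}", tracing_enabled());
}
```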

New callback tracing functions

To support tracing of callback execution, the new functions Node::subscribe_traced, Node::create_service_traced, and Timer::on_tick are added.

Example usage is provided in r2r/examples/callback_tracing.rs.
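The core idea of these functions can be modeled in a few lines: the user callback is wrapped so that start/end events surround each invocation. The types and names below are illustrative, not r2r's actual API.

```rust
// Minimal model of callback tracing: events bracket the user callback.

struct TracedCallback<F: FnMut(&str)> {
    user_cb: F,
    events: Vec<&'static str>, // stand-in for emitting real tracepoints
}

impl<F: FnMut(&str)> TracedCallback<F> {
    fn invoke(&mut self, msg: &str) {
        self.events.push("ros2:callback_start");
        (self.user_cb)(msg); // the user-supplied callback runs here
        self.events.push("ros2:callback_end");
    }
}

fn main() {
    let mut cb = TracedCallback {
        user_cb: |m: &str| println!("got {}", m),
        events: Vec::new(),
    };
    cb.invoke("hello");
    assert_eq!(cb.events, ["ros2:callback_start", "ros2:callback_end"]);
}
```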

CC: @wentasah

@m-dahl (Collaborator) commented Apr 1, 2025

Hi @skoudmar. Thanks for the contribution. I just wanted you to know that I have seen this, but I haven't had time to test it (on a Mac right now). I am not familiar with the ROS 2 tracing features, but I think this looks solid and non-intrusive, so I'm definitely open to accepting it.

What do you think about the slight API change when you want to trace (i.e., changing from stream style to callback style)? I realize that you need to call trace_callback_end somewhere, but it becomes a bit of a mismatch and forces you to change more code than simply tacking _traced onto your existing calls when you want to enable tracing for something. I also see that we could end up with a lot of different combinations (e.g., if you wanted to trace a "raw" subscriber). Do you think we can achieve the same thing by wrapping the stream? Upon wrapping we set up the id, and we put the tracing logic in poll_next. Then we could have a Traced trait, and you would just add .traced() to your existing code. Do you think that could work?
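The wrapping idea could be sketched like this, using Iterator as a synchronous stand-in for Stream so the sketch stays self-contained (all names illustrative; a real version would implement futures' Stream and emit tracepoints in poll_next):

```rust
// Sketch: a wrapper records trace events around each next() (poll_next)
// call, and an extension trait lets existing code just append .traced().

struct Traced<I> {
    inner: I,
    events: usize, // stand-in for emitting real tracepoints
}

impl<I: Iterator> Iterator for Traced<I> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        self.events += 1; // a "start"-like event would fire here
        let item = self.inner.next();
        self.events += 1; // an "end"-like event would fire here
        item
    }
}

trait TracedExt: Iterator + Sized {
    fn traced(self) -> Traced<Self> {
        Traced { inner: self, events: 0 }
    }
}
impl<I: Iterator> TracedExt for I {}

fn main() {
    let mut s = (0..3).traced();
    let collected: Vec<i32> = s.by_ref().collect();
    assert_eq!(collected, [0, 1, 2]);
    // 4 next() calls (3 items + the final None), 2 events each
    assert_eq!(s.events, 8);
}
```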

I hope to be able to test it a bit soon.

@skoudmar (Contributor, author) commented Apr 1, 2025

Hi @m-dahl, thanks for your feedback!

The callback approach is necessary because ROS's existing tracing API was designed specifically with callbacks in mind. An async function can pause execution via await, and this execution pattern can't be properly represented by the existing tracepoints. Creating new tracepoints would, in this case, break compatibility with existing tracing tools.

While I did consider wrapping the stream as you suggested, there is a fundamental mismatch between stream semantics and callback semantics. Treating poll_next entries and exits as callback boundaries isn't technically accurate: there may be multiple other async operations happening between these calls that aren't part of what we'd want to consider the callback execution.

However, I agree that wrapping the stream is better for maintainability. A combined approach, a wrapped stream that also offers a callback-style API, gives us both maintainability and correctness. This PR already includes something similar for timers.

What about something like this:

// Subscribe as usual
let stream = node.subscribe("/topic", QosProfile::default())?;

// Option 1 (untraced): Use the stream normally like now
let untraced_future = stream.for_each(|msg| {
    println!("Untraced: {msg:?}");
    async {
        // some async code
    }
});

// Option 2 (traced): Convert to traced callback style
// `traced_callback` is a new function on the stream wrapper.
let traced_future = stream.traced_callback(|msg| {
    println!("Traced: {msg:?}");
    // some sync code
});

What do you think? I'd like to hear your feedback before implementing this.

@m-dahl (Collaborator) commented Apr 17, 2025

Hi, again sorry for not being so responsive. Yes, I can see that there is a fundamental difference between the stream and callback styles here. I have wondered when the stream design would come back to bite us, and it seems the time has come :) But it's not so bad, and I think the wrapping approach you present is fine. If you would like to implement it, I will merge this PR.

Another thing: should we also gate the feature on cfg(target_os = "linux")?

This change is necessary because rcl uses the address of the rcl handle in
tracepoints. If the location changes, further publish events will not
be correctly assigned to the publisher.
This change is necessary because rcl uses the address of the rcl handle in
tracepoints. If the location changes, further `take` events will not
be correctly assigned to the subscriber.
This prevents another service from initializing at the same location,
thereby preventing conflicts in a trace.
This prevents another client from initializing at the same location,
thereby preventing conflicts in a trace.
skoudmar and others added 7 commits May 7, 2025 12:47
The crate contains tracepoints, either defined directly or imported
from the ros2_tracing project (tracetools package).

The crate defines a `tracing` feature, which should only be used on Linux
because it relies on LTTng, available only on Linux.
The TracingId type wraps the pointer to allow implementing the
Send and Sync traits.

The pointer is used only as an ID and is never dereferenced; therefore, it is safe.
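A minimal sketch of this idea (illustrative only; r2r_tracing's actual definition may differ):

```rust
// A raw pointer used purely as an opaque ID. Because it is never
// dereferenced, manually implementing Send/Sync is sound.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TracingId(*const ());

// SAFETY: the pointer is only compared/logged, never dereferenced.
unsafe impl Send for TracingId {}
unsafe impl Sync for TracingId {}

impl TracingId {
    fn new<T>(handle: &T) -> Self {
        TracingId(handle as *const T as *const ())
    }
    fn as_usize(self) -> usize {
        self.0 as usize
    }
}

fn main() {
    let value = 42u64;
    let id = TracingId::new(&value);
    std::thread::spawn(move || {
        // the ID can cross threads because it is never dereferenced
        assert_ne!(id.as_usize(), 0);
    })
    .join()
    .unwrap();
}
```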
The location and arguments of the tracepoints are consistent with rclcpp.
The StreamWithTracingData wrapper implements both Stream and FusedStream, similarly to Receiver.
@skoudmar (Contributor, author) commented May 7, 2025

Hi, I have updated the implementation (and rebased it onto the current master) to implement the Stream wrapper. However, I am not sure about its name, StreamWithTracingData. Is it fine, or do you have any suggestions?

I have also added a compile error in build.rs for when the feature is enabled on non-Linux targets.
