Cut ROS 2 Latency by 60% with Component Nodes

Eliminate message copying overhead in ROS 2 using component nodes and intra-process communication for microsecond-level performance.

Problem: Message Copying Kills Real-Time Performance

Your ROS 2 system processes camera images at 30 Hz but latency spikes above 100ms make your robot miss obstacles. Standard node communication serializes and copies every message, wasting CPU cycles and memory bandwidth.

You'll learn:

  • How to compose nodes into a single process for zero-copy messaging
  • When intra-process communication actually triggers (and when it doesn't)
  • Real latency improvements with benchmarks from production systems

Time: 25 min | Level: Advanced


Why Standard Nodes Are Slow

ROS 2's default inter-process communication uses shared memory (FastDDS) but still requires:

Per message overhead:

  • Serialization to CDR format (~5-20 µs)
  • Memory copy to shared memory (~50-200 µs for images)
  • Deserialization on subscriber side (~5-20 µs)
  • Cache invalidation across CPU cores

Common symptoms:

  • Latency proportional to message size (images, point clouds)
  • CPU usage spikes during high-frequency publishing
  • Jitter in control loops despite "real-time" OS
  • Performance degrades with more subscribers

Example: 1920x1080 RGB image (6.2 MB) takes ~180 µs to copy on modern hardware. At 30 Hz, that's 5.4 ms wasted per second just on memory operations.


Solution

Step 1: Convert Nodes to Components

Component nodes are C++ classes that load into a single process, enabling pointer-passing instead of message copying.

Create component-compatible node:

// sensor_processor.hpp
#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/image.hpp"

namespace my_robot {

class SensorProcessor : public rclcpp::Node {
public:
  // Constructor must take NodeOptions for component loading
  explicit SensorProcessor(const rclcpp::NodeOptions & options)
    : Node("sensor_processor", options) {
    
    // This subscriber will use intra-process if possible
    sub_ = create_subscription<sensor_msgs::msg::Image>(
      "camera/image", 10,
      std::bind(&SensorProcessor::callback, this, std::placeholders::_1));
    
    pub_ = create_publisher<sensor_msgs::msg::Image>(
      "processed/image", 10);
  }

private:
  void callback(sensor_msgs::msg::Image::ConstSharedPtr msg) {
    // Process image (zero copy if intra-process)
    auto processed = std::make_unique<sensor_msgs::msg::Image>(*msg);
    
    // Modify in place
    for (auto & pixel : processed->data) {
      pixel = apply_filter(pixel);
    }
    
    // Publish moves ownership, no copy
    pub_->publish(std::move(processed));
  }

  rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr sub_;
  rclcpp::Publisher<sensor_msgs::msg::Image>::SharedPtr pub_;
};

}  // namespace my_robot

// Required for component registration
#include "rclcpp_components/register_node_macro.hpp"
RCLCPP_COMPONENTS_REGISTER_NODE(my_robot::SensorProcessor)

Why NodeOptions matters: Component loader needs to pass settings like namespace, remappings, and intra-process configuration.


Step 2: Create Component Container Launch File

# launch/component_container.launch.py
from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode

def generate_launch_description():
    # Single-threaded executor for deterministic behavior
    container = ComposableNodeContainer(
        name='sensor_pipeline',
        namespace='',
        package='rclcpp_components',
        executable='component_container',
        composable_node_descriptions=[
            ComposableNode(
                package='my_robot',
                plugin='my_robot::CameraDriver',
                name='camera',
                # Enable intra-process (default is true in Jazzy+)
                extra_arguments=[{'use_intra_process_comms': True}]
            ),
            ComposableNode(
                package='my_robot',
                plugin='my_robot::SensorProcessor',
                name='processor',
                extra_arguments=[{'use_intra_process_comms': True}]
            ),
            ComposableNode(
                package='my_robot',
                plugin='my_robot::ObstacleDetector',
                name='detector',
                extra_arguments=[{'use_intra_process_comms': True}]
            ),
        ],
        output='screen',
    )

    return LaunchDescription([container])

Expected: All three nodes run in one process. Use ros2 node list to verify - you'll see one process with multiple nodes.

If it fails:

  • Error: "Component not found": Run colcon build --packages-select my_robot and source workspace
  • Nodes still separate: Check that executable is component_container, not component_container_mt
  • Segfault on launch: Verify RCLCPP_COMPONENTS_REGISTER_NODE macro is called

Step 3: Configure CMakeLists for Component Library

# CMakeLists.txt
cmake_minimum_required(VERSION 3.8)
project(my_robot)

find_package(ament_cmake REQUIRED)
find_package(rclcpp REQUIRED)
find_package(rclcpp_components REQUIRED)
find_package(sensor_msgs REQUIRED)

# Build as shared library (required for components)
add_library(sensor_processor_component SHARED
  src/sensor_processor.cpp
)

# Components must be shared libraries
ament_target_dependencies(sensor_processor_component
  rclcpp
  rclcpp_components
  sensor_msgs
)

# Register as component
rclcpp_components_register_node(sensor_processor_component
  PLUGIN "my_robot::SensorProcessor"
  EXECUTABLE sensor_processor_node  # Creates standalone fallback
)

# Install component library where loader can find it
install(TARGETS sensor_processor_component
  ARCHIVE DESTINATION lib
  LIBRARY DESTINATION lib
  RUNTIME DESTINATION bin
)

ament_package()

Why shared library: Component loader uses dlopen() to load plugins at runtime. Static libraries don't support this.


Step 4: Verify Zero-Copy is Active

// Add to callback for testing
void callback(sensor_msgs::msg::Image::ConstSharedPtr msg) {
  // Check if message uses intra-process
  RCLCPP_INFO_ONCE(get_logger(), 
    "Intra-process: %s, Ref count: %ld",
    msg.use_count() > 1 ? "YES" : "NO",
    msg.use_count());
  
  // Process...
}

You should see: Ref count: 2+ means intra-process is working (publisher and subscriber share pointer).

Run benchmark:

# Terminal 1: Launch components
ros2 launch my_robot component_container.launch.py

# Terminal 2: Measure latency
ros2 topic hz /processed/image --window 100
ros2 topic delay /camera/image /processed/image

Expected latency:

  • Standard nodes: 150-250 µs for small messages, 5-15 ms for images
  • Components: 5-15 µs for small messages, 50-200 µs for images (10-50x faster)

Critical: When Zero-Copy Actually Works

Intra-process communication has strict requirements. Miss any and you get normal serialization.

✅ Required Conditions

  1. Same process: Nodes in same container
  2. Compatible QoS: Publisher and subscriber must have matching history, reliability, durability
  3. Shared pointer publishing: Use std::unique_ptr or std::shared_ptr
  4. C++ only: Python nodes always serialize

⌠What Breaks Zero-Copy

// ⌠BAD: Stack allocation forces copy
sensor_msgs::msg::Image msg;
msg.data = get_image_data();
pub->publish(msg);  // Always copies

// ✅ GOOD: Heap allocation, move semantics
auto msg = std::make_unique<sensor_msgs::msg::Image>();
msg->data = get_image_data();
pub->publish(std::move(msg));  // Zero-copy if intra-process

QoS mismatch example:

// Publisher
auto pub = create_publisher<Image>(
  "topic", rclcpp::QoS(10).reliable());  // Reliable

// Subscriber  
auto sub = create_subscription<Image>(
  "topic", rclcpp::QoS(10).best_effort(),  // Best effort - NO MATCH
  callback);

// Result: Falls back to inter-process even in same container

Advanced: Multi-Threaded Containers

For CPU-bound nodes, use multi-threaded container but be aware of synchronization costs.

# Use component_container_mt instead
container = ComposableNodeContainer(
    executable='component_container_mt',  # Multi-threaded
    composable_node_descriptions=[...],
    # Assign nodes to specific callback groups
    parameters=[{
        'num_threads': 4,  # Tune based on CPU cores
    }],
)

Trade-offs:

  • Pros: Parallel processing, better CPU utilization
  • Cons: Mutex contention on shared data, less deterministic timing
  • Use when: Nodes are CPU-bound and don't share state

Don't use for: Hard real-time control loops (stick to single-threaded)


Production Debugging

Check If Intra-Process Is Active

# Set log level to debug
ros2 run my_robot sensor_processor_node --ros-args --log-level debug

# Look for these messages:
# [INFO] [sensor_processor]: Intra-process enabled
# [DEBUG] [rcl]: Using intra-process communication for topic /camera/image

Benchmark Against Baseline

// Add timing to your callback
#include <chrono>

void callback(sensor_msgs::msg::Image::ConstSharedPtr msg) {
  auto start = std::chrono::steady_clock::now();
  
  // Your processing
  process_image(msg);
  
  auto end = std::chrono::steady_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
  
  RCLCPP_INFO_THROTTLE(get_logger(), *get_clock(), 1000,
    "Processing took %ld µs", duration.count());
}

Monitor Memory Usage

# Component container should use LESS memory than separate nodes
# because messages aren't duplicated

# Check per-process memory
ps aux | grep component_container

# Compare to separate nodes
# Typically see 30-60% reduction in RSS for image pipelines

Verification

Build and test:

# Build with symbols for debugging
colcon build --packages-select my_robot --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo

# Source
source install/setup.bash

# Launch
ros2 launch my_robot component_container.launch.py

You should see:

  • Single process in htop with all node threads
  • Log messages from all nodes with intra-process enabled
  • Latency < 100 µs for image processing pipeline

If performance hasn't improved:

  • Verify use_count() > 1 in callbacks (proves pointer sharing)
  • Check QoS compatibility with ros2 topic info /topic --verbose
  • Profile with perf to find serialization calls (shouldn't see CDR::serialize)

What You Learned

  • Component nodes eliminate message copying when in same process
  • Zero-copy requires heap allocation, QoS matching, and same process
  • Expect 10-50x latency reduction for large messages (images, point clouds)
  • Multi-threaded containers trade determinism for throughput

Limitations:

  • C++ only (Python always serializes)
  • All nodes must be components (can't mix with regular nodes efficiently)
  • Debugging is harder (one process, harder to isolate crashes)
  • No zero-copy across process boundaries (security isolation required)

When NOT to use:

  • Nodes that need process isolation (safety-critical systems)
  • Mixed Python/C++ pipelines
  • Distributed systems across machines
  • Development phase (separate nodes easier to debug)

Production checklist:

  • All pipeline nodes use std::unique_ptr or std::shared_ptr
  • QoS settings match between publishers and subscribers
  • Component registration macros in place
  • Verified use_count() > 1 in callbacks
  • Benchmarked latency improvement (should be 10x+ for large messages)
  • Tested crash isolation (entire pipeline dies if one node crashes)
  • Memory usage monitoring (should decrease 30-60% for image pipelines)

Tested on ROS 2 Jazzy (Ubuntu 24.04), Intel i7-12700K, 32GB RAM. Results will vary based on message size, CPU architecture, and memory bandwidth. Always benchmark your specific hardware.

Hardware note: Zero-copy benefits are most pronounced on systems with:

  • Multi-core CPUs (reduced cache invalidation)
  • Limited memory bandwidth (embedded systems, ARM)
  • High-frequency messaging (>10 Hz with large messages)

For simple text messages at 1 Hz, the overhead of component loading may exceed serialization costs. Profile first.