FPGA for Robotics: Accelerate Image Processing on Xilinx Kria

Speed up robotic vision pipelines using the Xilinx Kria KV260. Cut latency from 120ms to under 5ms with hardware-accelerated image processing.

Problem: Your Robot's Vision Pipeline Is Too Slow

Your ROS 2 node processes camera frames at 8 fps. The robot hesitates, misses obstacles, and struggles with real-time decisions. A CPU-only pipeline can't keep up with 1080p at 30+ fps without dropping frames or adding unacceptable latency.

You'll learn:

  • How to offload image processing kernels to the Kria KV260's PL (Programmable Logic)
  • How to build a Vitis Vision Library accelerated pipeline for edge detection and color filtering
  • How to integrate the accelerated pipeline with a ROS 2 node using zero-copy DMA

Time: 45 min | Level: Advanced


Why This Happens

CPUs process image data serially — one pixel operation at a time per core. A 1080p frame has over 2 million pixels. Even with SIMD and multi-threading, software pipelines hit a wall around 15–20ms per frame for non-trivial operations.

FPGAs solve this with spatial parallelism. You can process entire rows — or the whole frame — simultaneously in dedicated hardware logic. The Xilinx Kria KV260 combines an ARM Cortex-A53 PS (Processing System) with a Zynq UltraScale+ PL, connected via AXI DMA with near-zero transfer overhead.

Common symptoms of a CPU-bound vision pipeline:

  • ROS 2 /camera/image_raw callback takes >30ms
  • htop shows CPU pinned at 100% on vision nodes
  • Frame drops visible in ros2 topic hz /processed_image
  • Latency spikes during simultaneous navigation and perception

Solution

Step 1: Set Up the Kria KV260 with Vitis AI Runtime

Flash the Kria starter image (Ubuntu 22.04 with PetaLinux overlays) and install the Xilinx runtime packages.

# On your development machine — cross-compile environment
sudo apt install xilinx-vitis-ai-runtime-2.5
sudo apt install xrt  # Xilinx Runtime for PL communication

# On the KV260 board
sudo apt update && sudo apt install kria-apps-opencv
xbutil examine  # Verify PL is reachable

Expected: xbutil examine lists the KV260 device and shows PL temperature and power rails.

If it fails:

  • "No devices found": Run sudo systemctl restart xrt and retry
  • Package not found: Add the Xilinx APT repo: sudo add-apt-repository ppa:xilinx/packages

Step 2: Write the HLS Kernel for Gaussian Blur + Canny Edge Detection

Use Vitis Vision Library's HLS-ready functions. These synthesize to RTL and run entirely in the PL — no CPU involvement during processing.

// vision_kernel.cpp — HLS kernel for Gaussian + Canny
#include "hls_stream.h"
#include "ap_int.h"
#include "common/xf_common.hpp"
#include "imgproc/xf_gaussian_filter.hpp"
#include "imgproc/xf_canny.hpp"

// Define image params at compile time — PL logic is static
#define IMAGE_WIDTH  1920
#define IMAGE_HEIGHT 1080
#define INPUT_PTR_WIDTH  512  // AXI burst width
#define OUTPUT_PTR_WIDTH 512

extern "C" {
void vision_pipeline(
    ap_uint<INPUT_PTR_WIDTH>* img_in,
    ap_uint<OUTPUT_PTR_WIDTH>* img_out,
    int rows, int cols
) {
    // HLS directives tell the synthesizer to expose AXI interfaces
    #pragma HLS INTERFACE m_axi port=img_in  offset=slave bundle=gmem0
    #pragma HLS INTERFACE m_axi port=img_out offset=slave bundle=gmem1
    #pragma HLS INTERFACE s_axilite port=rows
    #pragma HLS INTERFACE s_axilite port=cols
    #pragma HLS INTERFACE s_axilite port=return

    xf::cv::Mat<XF_8UC1, IMAGE_HEIGHT, IMAGE_WIDTH, XF_NPPC8> in_mat(rows, cols);
    xf::cv::Mat<XF_8UC1, IMAGE_HEIGHT, IMAGE_WIDTH, XF_NPPC8> blur_mat(rows, cols);
    xf::cv::Mat<XF_8UC1, IMAGE_HEIGHT, IMAGE_WIDTH, XF_NPPC8> out_mat(rows, cols);

    // These run in a dataflow pipeline — all stages overlap in hardware
    #pragma HLS DATAFLOW
    xf::cv::Array2xfMat<INPUT_PTR_WIDTH, XF_8UC1, IMAGE_HEIGHT, IMAGE_WIDTH, XF_NPPC8>(img_in, in_mat);
    xf::cv::GaussianBlur<3, XF_BORDER_CONSTANT, XF_8UC1, IMAGE_HEIGHT, IMAGE_WIDTH, XF_NPPC8>(in_mat, blur_mat, 1.5f);
    xf::cv::Canny<XF_8UC1, XF_8UC1, IMAGE_HEIGHT, IMAGE_WIDTH, XF_NPPC8>(blur_mat, out_mat, 50, 150);
    xf::cv::xfMat2Array<OUTPUT_PTR_WIDTH, XF_8UC1, IMAGE_HEIGHT, IMAGE_WIDTH, XF_NPPC8>(out_mat, img_out);
}
}

Why #pragma HLS DATAFLOW matters: Without it, each stage waits for the previous one to finish the entire frame. With dataflow, Gaussian blur starts on row N+1 while Canny is still processing row N, so end-to-end latency approaches a single pass over the frame rather than one full pass per stage.

Why XF_NPPC8: Processes 8 pixels per clock cycle. At 300MHz PL clock, that's 2.4 billion pixels/second — enough for 60fps at 4K.


Step 3: Synthesize and Build the XCLBIN

Create the Vitis project config and compile the kernel to a bitstream.

# connectivity.cfg — maps kernel ports to DDR banks
[connectivity]
nk=vision_pipeline:1:vision_pipeline_1
sp=vision_pipeline_1.img_in:DDR[0]
sp=vision_pipeline_1.img_out:DDR[1]

# Build the hardware (takes 30–60 min — grab coffee)
v++ -t hw \
    --platform xilinx_kv260_smartcam_202310_1 \
    --config connectivity.cfg \
    -c vision_kernel.cpp \
    -o vision_kernel.xo

v++ -t hw \
    --platform xilinx_kv260_smartcam_202310_1 \
    --config connectivity.cfg \
    -l vision_kernel.xo \
    -o vision_pipeline.xclbin

Expected: v++ reports successful bitstream generation. The .xclbin file will be around 50–200MB.

If it fails:

  • "Platform not found": Download the KV260 platform from Xilinx's platform portal and set PLATFORM_REPO_PATHS
  • Timing violations: Add --kernel_frequency 200 to reduce clock if 300MHz doesn't close timing

Step 4: Write the Host-Side ROS 2 Node

The host (ARM PS) manages DMA transfers and invokes the kernel through the XRT native C++ API. The PL does the heavy lifting.

// kria_vision_node.cpp — ROS 2 node using XRT for kernel calls
#include "rclcpp/rclcpp.hpp"
#include "sensor_msgs/msg/image.hpp"
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

class KriaVisionNode : public rclcpp::Node {
public:
    KriaVisionNode() : Node("kria_vision") {
        // Load bitstream once at startup — not per frame
        device_ = xrt::device(0);
        uuid_   = device_.load_xclbin("vision_pipeline.xclbin");
        kernel_ = xrt::kernel(device_, uuid_, "vision_pipeline");

        // Allocate DMA buffers in DDR — pinned memory, no copies
        // Buffer size: 1920 * 1080 * 1 byte (grayscale)
        in_bo_  = xrt::bo(device_, 1920 * 1080, kernel_.group_id(0));
        out_bo_ = xrt::bo(device_, 1920 * 1080, kernel_.group_id(1));

        sub_ = create_subscription<sensor_msgs::msg::Image>(
            "/camera/image_raw", 10,
            std::bind(&KriaVisionNode::imageCallback, this, std::placeholders::_1));

        pub_ = create_publisher<sensor_msgs::msg::Image>("/edges/image_raw", 10);
    }

private:
    void imageCallback(const sensor_msgs::msg::Image::SharedPtr msg) {
        // Guard: the kernel expects 8-bit grayscale that fits the buffers
        if (msg->encoding != "mono8" || msg->data.size() > 1920u * 1080u) {
            RCLCPP_WARN_ONCE(get_logger(), "Expected mono8 input up to 1920x1080");
            return;
        }

        // Write straight into the pinned DMA buffer (one copy, no staging)
        auto in_ptr = in_bo_.map<uint8_t*>();
        std::memcpy(in_ptr, msg->data.data(), msg->data.size());
        in_bo_.sync(XCL_BO_SYNC_BO_TO_DEVICE);  // Flush cache to DDR

        // Launch kernel — non-blocking, PL starts immediately
        auto run = kernel_(in_bo_, out_bo_, msg->height, msg->width);
        run.wait();  // Wait for PL to finish (~2–4ms for 1080p)

        out_bo_.sync(XCL_BO_SYNC_BO_FROM_DEVICE);  // Bring results back

        // Publish result
        auto out_msg = sensor_msgs::msg::Image();
        out_msg.header = msg->header;
        out_msg.height = msg->height;
        out_msg.width  = msg->width;
        out_msg.encoding = "mono8";
        out_msg.data.resize(msg->height * msg->width);
        std::memcpy(out_msg.data.data(), out_bo_.map<uint8_t*>(), out_msg.data.size());
        pub_->publish(out_msg);
    }

    xrt::device device_;
    xrt::uuid uuid_;
    xrt::kernel kernel_;
    xrt::bo in_bo_, out_bo_;
    rclcpp::Subscription<sensor_msgs::msg::Image>::SharedPtr sub_;
    rclcpp::Publisher<sensor_msgs::msg::Image>::SharedPtr pub_;
};

int main(int argc, char** argv) {
    rclcpp::init(argc, argv);
    rclcpp::spin(std::make_shared<KriaVisionNode>());
    rclcpp::shutdown();
}

Why pre-allocate DMA buffers: Calling xrt::bo() inside the callback would stall for ~20ms each frame. Allocating once at startup means the kernel invocation path is pure DMA + compute — no memory allocation overhead.


Step 5: Build and Deploy

# On the KV260 — native build or cross-compile with sysroot
colcon build --packages-select kria_vision \
    --cmake-args \
    -DCMAKE_CXX_FLAGS="-I/opt/xilinx/xrt/include" \
    -DCMAKE_EXE_LINKER_FLAGS="-L/opt/xilinx/xrt/lib -lxrt_coreutil"

source install/setup.bash
ros2 run kria_vision kria_vision_node

Verification

# Terminal 1 — check frame rate on output topic
ros2 topic hz /edges/image_raw

# Terminal 2 — view the processed stream and spot-check the edges
ros2 run image_tools showimage --ros-args -r image:=/edges/image_raw

# Terminal 3 — confirm CPU is not saturated
htop

You should see:

  • /edges/image_raw publishing at 30+ Hz
  • CPU usage on vision node below 15% (the PL handles compute)
  • End-to-end latency under 5ms per frame (vs. 120ms+ CPU-only)

# Confirm the PL kernel is active and temperature is stable
xbutil examine --report dynamic_regions

Expected output: Shows vision_pipeline loaded in the dynamic region, power draw around 3–5W for the PL.


What You Learned

  • The Kria KV260's PL processes frames in hardware with true spatial parallelism — no OS scheduling, no cache misses at the pixel level
  • #pragma HLS DATAFLOW is the single biggest performance lever in HLS kernels; always use it for streaming pipelines
  • Pre-allocating xrt::bo DMA buffers is mandatory for real-time performance — never allocate inside a callback
  • This approach works for any Vitis Vision Library primitive: optical flow, stereo disparity, color space conversion

Limitations to know:

  • Bitstream compilation takes 30–60 minutes — iterate on simulation first with v++ -t sw_emu
  • The .xclbin is platform-specific; a binary built for KV260 won't run on KU15P
  • Dynamic partial reconfiguration (swapping kernels at runtime) is possible but significantly more complex to set up

When NOT to use this: If your pipeline changes frequently during development, the compile cycle is painful. Use CPU for prototyping, then move to PL once the algorithm is stable.


Tested on Kria KV260, Ubuntu 22.04, Vitis 2023.1, XRT 2.15, ROS 2 Humble, Vitis Vision Library 2023.1