Keywords: SAM2, Instance Segmentation, TensorRT, ONNX Runtime, Optimization
I’m Hunter (Cheng Haoxuan), an AI engineer at TIER IV. I'm working on the Co-MLOps project, focusing on building scalable and efficient auto-labeling pipelines. Co-MLOps is an open, collaborative initiative aimed at generating and sharing high-quality, diverse datasets to accelerate the democratization of autonomous driving. In this blog post, I’m excited to share our journey optimizing SAM2 for real-world deployment, specifically focusing on high-performance inference using ONNX Runtime and TensorRT—both implemented from scratch in C++.
To enable efficient, large-scale GPU deployment, we first developed a custom C++ inference framework based on ONNX Runtime. This required overcoming limitations in the official C++ API, including unsupported operations and integration gaps. We also implemented batch processing support to improve GPU utilization and boost throughput, carefully ensuring the model and post-processing logic could correctly handle batched inputs. Through these efforts, we achieved a runtime of 356 ms per frame (around 100 object boxes) on a single Nvidia L40s GPU.
We then built a TensorRT-based C++ inference engine to push performance even further. By utilizing FP16 inference, Tensor Core acceleration, and custom memory optimizations, we reduced the per-frame runtime to 123 ms, compared with about 5000 ms for the prototype and 356 ms for ONNX Runtime, which is a speedup of roughly 3x over ONNX Runtime. For reference, we also provide benchmarks for a desktop GPU (Nvidia GeForce RTX 3070 Ti) and an edge device (Nvidia Jetson Orin) in later sections.
As a result of these efforts, we not only successfully integrated SAM2 into our auto-labeling pipeline, but also achieved a 40x speedup in per-frame generation time, turning a powerful research model into a scalable, real-world solution. The thumbnail image above shows the inference result of our optimized SAM2 on a set of surround frames captured by our in-house Driving Recording System (DRS), a custom sensor kit designed for collecting high-resolution images in real-world driving environments.
You can find the related code on GitHub.
High-quality labeled data is the backbone of modern AI development. However, manual data annotation is time-consuming, expensive, and difficult to scale—especially for complex tasks like instance segmentation (a computer vision task that detects and segments each individual object instance within an image). As the demand for large, diverse, and accurately labeled datasets continues to grow across industries, the need for a scalable and efficient approach to handle data has become increasingly urgent, making auto-labeling a game-changer.
For instance segmentation, leveraging a foundation model is not only helpful, but essential. These models, pre-trained on massive datasets, can generalize well to new domains with minimal adaptation. Among them, SAM2 stood out as a strong candidate: It offers powerful segmentation capabilities with impressive accuracy and generality.
An overview of our auto-labeling pipeline for instance segmentation is illustrated in Fig. 1. This framework is inspired by prior research, which extends SAM2 by integrating object detection-based prompt generation, enabling fully automated, memory-efficient video segmentation for real-world applications.
However, SAM2 also comes with a serious limitation: its inference latency is high, which rules out real-time use and leads to significant operational costs. Inference speed is particularly problematic for real-world deployment. In our case, automatic annotation is expected to run on GPU clusters at scale, so slow inference translates directly into high GPU resource demands and cost. Without optimization, the system would struggle to keep up with large-scale data processing, making it impractical for our production workflow.
To bridge this gap between research and real-world deployment, we decided to take on the task of adapting SAM2 for our needs. Our goal was to retain the strengths of the model while making it fast, portable, and ready for large-scale auto-labeling.
SAM2 (Segment Anything Model v2) is a foundation model for image segmentation, developed to produce high-quality, prompt-based masks with improved efficiency and generalization. As shown in Fig. 2, given a point prompt or a bounding box prompt from an image as input, SAM2 can generate a precise segmentation mask for the corresponding object or region in the image. This makes it especially valuable for applications such as auto-labeling, where fast, consistent, and accurate generation of segmentation masks is essential at scale.
Compared to its predecessor (SAM), SAM2 introduces several enhancements:
These improvements make SAM2 a more practical and scalable choice for production scenarios like ours, where segmentation needs to be performed over large driving datasets with domain-specific prompts and class constraints.
To prepare SAM2 for high-performance inference, we first converted the PyTorch model to ONNX format using torch.onnx.export. While the basic export process was supported by community examples, SAM2’s dynamic input shapes and custom layers introduced some challenges. For example, parts of the decoder required adjustment to ensure compatibility with ONNX inference engines. These modifications, particularly for supporting ONNX Runtime C++ API and TensorRT, will be discussed in detail later in this blog.
After conversion, we developed a Python-based ONNX Runtime prototype to quickly validate the model’s behavior. This prototype allowed us to:
There are many open-source references available for ONNX model conversion and Python ONNX Runtime inference, which made this stage relatively smooth. However, for production deployment, Python’s limitations in speed and scalability became a bottleneck.
We then moved on to building native C++ frameworks to meet high-speed processing and large-scale deployment requirements, described in the next sections.
Building an efficient inference pipeline in C++ required more than just plugging in the exported ONNX models. Due to the complexity of SAM2 and limitations in certain versions of ONNX Runtime and TensorRT APIs, we made minor modifications to parts of the model architecture. These changes ensured compatibility and stability across inference backends, and will be discussed further in later sections.
Since we had already validated the model’s correctness and logic using the Python ONNX Runtime prototype, we began by developing a native C++ inference framework based on ONNX Runtime. This allowed us to reuse the validated model and focus on achieving better runtime performance and system integration.
Compared to the Python version, the C++ implementation of ONNX Runtime offers significantly faster inference due to lower overhead, better memory control, and more efficient multi-threading. These characteristics are particularly important in high-throughput scenarios like auto-labeling, where every millisecond counts.
Initially, we were able to get the ONNX Runtime inference working smoothly in FP32 using the CUDA Execution Provider. The results were stable, but the performance was still not ideal for our auto-labeling pipeline. To push for higher throughput, we explored FP16 inference using ONNX Runtime.
However, despite switching to FP16, the speedup was marginal. After some investigation, we found that ONNX Runtime’s CUDA provider doesn’t fully take advantage of FP16 acceleration.
To overcome this limitation, we decided to build a TensorRT-based C++ inference framework from scratch. TensorRT provides enhanced support for FP16 inference by leveraging Nvidia’s Tensor Cores, enabling more efficient computation and finer control over the optimization pipeline.
Our inference pipeline—shared by both ONNX Runtime and TensorRT implementations—follows a clear and modular structure:
1. Input preparation
Preprocess raw images (resize, normalize) to match the encoder’s input format. Prompts (bounding boxes) are encoded into tensor structures.
2. Model loading and memory allocation
Load the ONNX model into memory. For ONNX Runtime, this involves creating a session with the CUDA Execution Provider. For TensorRT, this step includes parsing the ONNX model and building the inference engine. Memory buffers are allocated for inputs and outputs.
3. Forward inference
The encoder processes the image and produces intermediate features. These features, along with the encoded prompts, are fed into the decoder to generate segmentation masks.
4. Post-processing
The output masks are resized, thresholded, and formatted to integrate with downstream components like dataset exporters or annotation tools.
This pipeline structure is designed to support high efficiency while offering two well-integrated inference options, allowing us to choose between ONNX Runtime and TensorRT, based on performance needs and deployment constraints.
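As an illustration of step 4, the post-processing stage can be sketched with the standard library alone. This is a minimal sketch, not our production code: `resize_nearest` and `threshold_mask` are hypothetical helper names, nearest-neighbour interpolation stands in for whatever resize the real pipeline uses, and the default threshold of 0.0 on raw mask logits is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: nearest-neighbour resize of a row-major H x W float map.
std::vector<float> resize_nearest(const std::vector<float>& src,
                                  std::size_t src_h, std::size_t src_w,
                                  std::size_t dst_h, std::size_t dst_w) {
    assert(src_h > 0 && src_w > 0);
    assert(src.size() == src_h * src_w);
    std::vector<float> dst(dst_h * dst_w);
    for (std::size_t y = 0; y < dst_h; ++y) {
        const std::size_t sy = y * src_h / dst_h;  // source row for this output row
        for (std::size_t x = 0; x < dst_w; ++x) {
            const std::size_t sx = x * src_w / dst_w;  // source column
            dst[y * dst_w + x] = src[sy * src_w + sx];
        }
    }
    return dst;
}

// Hypothetical helper: turn mask logits into a binary mask (1 = foreground).
// The default threshold of 0.0 is an assumption, not a value from this post.
std::vector<std::uint8_t> threshold_mask(const std::vector<float>& logits,
                                         float threshold = 0.0f) {
    std::vector<std::uint8_t> mask(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        mask[i] = logits[i] > threshold ? 1 : 0;
    }
    return mask;
}
```

A caller would resize the decoder output to the original image resolution first and then threshold it, for example `threshold_mask(resize_nearest(logits, 256, 256, image_h, image_w))`, where the 256 x 256 decoder resolution is only a placeholder.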
While this initial framework provided a solid foundation for running SAM2 in C++, it quickly became clear that further optimization was necessary to meet the performance requirements of real-time, high-volume auto-labeling. In the following section, we dive into the key optimizations we implemented—such as batching, multi-threading, and memory reuse—to significantly improve inference speed without sacrificing accuracy.
We applied the following techniques in our deployment framework.
The following experiments were all conducted on a single L40s GPU. The target image is a front-view picture captured by our test car, which contains 135 detected traffic-related objects (135 box prompts).
The prototype inference framework is inefficient because it processes each bounding box as a separate prompt. The decoder must therefore be executed once for each of the N bounding boxes in a frame, which is time-consuming. To improve efficiency, the decoder's bounding box inputs need to be processed in batches. Fig. 3 and Fig. 4 show the difference between the prototype pipeline and the pipeline with decoder batch processing. The decoder was exported with a dynamic input size, which allows the batch size to vary at runtime.
Fig. 3. Prototype inference framework pipeline
Fig. 4. Pipeline with decoder batch processing
Fig. 5. Pipeline with decoder batch processing (with limitation)
As shown in Fig. 5, to stay within the GPU memory limits of the hardware, we process at most M boxes per iteration rather than batching all boxes together. The value of M can be adjusted to match the available GPU memory.
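The chunking itself is simple index arithmetic. The sketch below is an illustration under assumptions rather than our actual implementation: `make_batches` is a hypothetical helper that splits N box prompts into half-open index ranges of at most M boxes each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns [begin, end) index ranges that cover n_boxes
// prompts in chunks of at most max_batch boxes each.
std::vector<std::pair<std::size_t, std::size_t>> make_batches(std::size_t n_boxes,
                                                              std::size_t max_batch) {
    assert(max_batch > 0);
    std::vector<std::pair<std::size_t, std::size_t>> batches;
    for (std::size_t begin = 0; begin < n_boxes; begin += max_batch) {
        // The last chunk may be smaller than max_batch.
        batches.emplace_back(begin, std::min(begin + max_batch, n_boxes));
    }
    return batches;
}
```

For the 135-box test image and M = 64, this yields three decoder runs covering boxes [0, 64), [64, 128), and [128, 135). Because the decoder accepts a dynamic batch size, the smaller final chunk needs no padding.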
This optimization significantly reduces the total time consumed by the decoder. The results are illustrated in Fig. 6.
Speed evaluation
Fig. 6. Speed evaluation for decoder batch processing
Decoder batch processing evaluation
As shown in Fig. 7, a batch size of 64 is most efficient for this environment.
Fig. 8. Without encoder batch processing
In order to improve overall throughput, we initially attempted to batch multiple images during the encoder stage, which is illustrated in Fig. 8 and Fig. 9. The assumption was that leveraging batch processing could reduce per-image inference time due to more efficient GPU utilization.
We modified the encoder to accept batches of images and carefully ensured all layers, including normalization and positional encoding, behaved correctly with larger input dimensions.
Although we tested different batch sizes (1 to 3) for the encoder, the per-image processing time remained the same—or even worsened slightly at batch size 3—despite higher GPU memory usage. Detailed benchmarking results are summarized in the chart below. This is likely because the encoder is computationally intensive, and even at batch size 1, the L40s GPU’s compute units were already near saturation. As a result, increasing the batch size did not lead to better throughput. However, with more powerful GPUs like the H100, which offer significantly more compute capacity, batching may yield better performance improvements.
Encoder batch processing evaluation
As shown in Table 1 and Fig. 10, not only did the encoder fail to gain speed from batching, but GPU memory consumption increased and overall efficiency actually decreased. For this reason, we chose not to implement encoder-side batch processing in the final production pipeline.
During the deployment phase, we identified a few bottlenecks in post-processing and image drawing. As shown in Fig. 11, some of these image-level computations were relatively heavy and became a limiting factor in inference speed. To address this, we applied OpenMP-based parallelization to offload and accelerate the more time-consuming parts.
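The pattern is a parallel loop over independent per-object work. The following is a minimal sketch of that pattern, not the code we actually parallelized: `mask_areas` is a hypothetical example task (counting foreground pixels per mask), and the pragma is guarded so the same source also builds and runs serially when OpenMP is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-object post-processing step: count the foreground pixels
// of each binary mask. Every iteration is independent of the others.
std::vector<std::size_t> mask_areas(const std::vector<std::vector<std::uint8_t>>& masks) {
    std::vector<std::size_t> areas(masks.size(), 0);
    const long long n = static_cast<long long>(masks.size());
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
#endif
    for (long long i = 0; i < n; ++i) {
        std::size_t area = 0;
        for (std::uint8_t v : masks[static_cast<std::size_t>(i)]) {
            area += (v != 0) ? 1 : 0;
        }
        // Each iteration writes only its own slot, so no locking is needed.
        areas[static_cast<std::size_t>(i)] = area;
    }
    return areas;
}
```

With GCC or Clang the loop runs in parallel when compiled with `-fopenmp`. Dynamic scheduling is chosen here on the assumption that masks differ a lot in size, so a static split would leave some threads idle.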
Single thread time evaluation (ms)
Fig. 11. Time evaluation for pipeline (single thread)
Multi-threads for decoder’s post-processing
Multi-thread time evaluation (ms)
Fig. 12. Time evaluation for pipeline (multi-threads)
Inference pipeline optimization
The inference pipeline of SAM2, illustrated in Fig. 13, highlights the decoder's batch processing mode. Because of the batch size limit, the decoder has to be started several times, and each initial batch incurred a substantial cold-start time. To remove this overhead, we restructured the code to initialize the decoder once at frame 1 and reuse it for all subsequent frames.
Fig. 14 shows the performance evaluation for one frame (excluding the initial frame) in which the decoder needs three executions due to the batch size limit. After optimization, the execution time of the first batch in the frame is reduced to half of its previous duration, which gives a substantial speedup for the frame as a whole.
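The "initialize once, reuse afterwards" idea can be sketched as a small cache around the decoder. This is an illustration only: `Decoder` is a hypothetical stand-in for an expensive-to-create inference session, and `DecoderCache` is not a class from our codebase.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a decoder session that is expensive to create.
struct Decoder {
    std::size_t runs = 0;
    void run(std::size_t /*batch_size*/) { ++runs; }
};

// Creates the decoder on first use and hands back the same instance afterwards,
// so only the very first batch of the first frame pays the start-up cost.
class DecoderCache {
public:
    Decoder& get() {
        if (!decoder_) {
            decoder_ = std::make_unique<Decoder>();
            ++init_count_;
        }
        return *decoder_;
    }
    std::size_t init_count() const { return init_count_; }

private:
    std::unique_ptr<Decoder> decoder_;
    std::size_t init_count_ = 0;
};
```

If every frame and every batch calls `cache.get().run(batch_size)`, the decoder is constructed exactly once no matter how many frames are processed.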
Engine pipeline (ms)
Fig. 14. Inference time evaluation for one frame
*This time evaluation contains not only engine forward time, but also pre-processing, image I/O time, etc.
TensorRT offers more flexibility in GPU memory allocation than ONNX Runtime, which requires precise memory allocation for each input tensor during inference. TensorRT supports dynamic batch processing, which means the engine can handle varying batch sizes at runtime without being rebuilt. It also allows memory for the maximum batch size to be allocated once up front and reused for subsequent batches. This eliminates per-batch memory allocation for the decoder and saves time. Fig. 15 illustrates the TensorRT inference pipeline before and after the optimization.
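The allocate-once pattern is sketched below with host memory standing in for GPU memory. It is a simplified illustration under assumptions: `ReusableBuffer` is a hypothetical class, a `std::vector` replaces the device allocation, and a real engine would size the buffer from the maximum batch size it was built for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device buffer: storage is sized once for the
// largest batch and then reused by every smaller batch without reallocating.
class ReusableBuffer {
public:
    ReusableBuffer(std::size_t max_batch, std::size_t elems_per_item)
        : elems_per_item_(elems_per_item),
          max_batch_(max_batch),
          storage_(max_batch * elems_per_item) {}

    // Returns the start of the storage for a batch of `batch` items.
    // The pointer is the same on every call because nothing is reallocated.
    float* acquire(std::size_t batch) {
        assert(batch <= max_batch_);
        used_ = batch * elems_per_item_;
        return storage_.data();
    }

    std::size_t used() const { return used_; }          // elements in use by the current batch
    std::size_t capacity() const { return storage_.size(); }

private:
    std::size_t elems_per_item_;
    std::size_t max_batch_;
    std::vector<float> storage_;
    std::size_t used_ = 0;
};
```

With a maximum batch of 64, a full batch and a trailing batch of 7 boxes both receive the same pointer, and only the number of elements in use changes between them.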
As shown in Fig. 16, the overall execution time for a single frame was improved by 1.6x after GPU memory allocation optimization.
Engine pipeline (ms)
Fig. 16. Inference time evaluation for one frame
*This time evaluation contains not only engine forward time, but also pre-processing, image I/O time, etc.
Lastly, an overall performance evaluation is presented to demonstrate the performance of each milestone. The experimental environment is shown below.
Table 2 below shows the benchmark result of our experiment. The columns "encoder" and "decoder" only include the TensorRT inference engine forward time for one frame. The column “whole” shows the entire inference time per frame, which includes TensorRT inference engine forward, image read/write, and pre/post-process.
Table 2. Speed evaluation on different frameworks
*The original PyTorch version takes over 10 minutes to process a single frame with many boxes, which makes it unsuitable for testing on such images, so it is excluded from this comparison.
Speed evaluation
Fig. 19. Speed evaluation on different frameworks
The optimizations resulted in a 40x speedup for processing one frame (from about 5000 ms for the prototype to 123 ms), which significantly improved the auto-labeling functionality. We also list benchmarks for two devices other than the L40s in Table 3.
Table 3. Speed evaluation on other devices (ms)
These include the RTX 3070 Ti, a consumer-grade desktop GPU, and the Jetson Orin, an edge device. While their performance doesn’t match the L40s, both still demonstrated respectable inference speeds, validating the effectiveness of our optimizations across different hardware environments. For the Jetson Orin, we ensured maximum performance by setting the power mode to MAXN and enabling Jetson Clocks, with both CPU and GPU running at full speed during evaluation.
To ensure that our performance optimizations did not compromise accuracy, we evaluated the optimized SAM2 inference pipeline on the MOSE validation set (311 images). As shown in Table 4, there is no degradation in segmentation quality compared to the original PyTorch model—confirming that our acceleration efforts (including batch processing, multithreading, and FP16 inference) maintained full model fidelity while significantly improving speed.
The successful adaptation of SAM2 for our auto-labeling pipeline demonstrates how a powerful research model can be transformed into a scalable, production-ready solution. Through careful model conversion, operator-level optimization, batch processing, multi-threaded execution, GPU memory management, and the development of high-performance C++ inference engines with ONNX Runtime and TensorRT, we achieved a 40x speedup in per-frame processing without compromising segmentation quality. These gains dramatically improve inference throughput and reduce GPU costs, making large-scale, high-frequency annotation both feasible and cost-effective. We also plan to fully integrate this optimized pipeline into the Co-MLOps auto-labeling framework, enabling automated generation of high-quality, diverse labeled datasets, which is a key step toward accelerating the democratization of autonomous driving.
TIER IV is always on the lookout for passionate individuals to join our journey. If you're interested in working on large-scale data systems, edge sensor platforms, or scalable infrastructure to support real-world autonomous driving, check out these open positions on the Co-MLOps team.
Visit our careers page to view all job openings.
If you’re uncertain about which roles align best with your experience, or if the current job openings don’t quite match your preferences, register your interest here. We’ll get in touch if a role that matches your experience becomes available, and schedule an informal interview.