High-performance SAM2 inference framework with TensorRT

Written by TIER IV | 25-Jun-2025 01:00:00

Keywords: SAM2, Instance Segmentation, TensorRT, ONNX Runtime, Optimization

I’m Hunter (Cheng Haoxuan), an AI engineer at TIER IV. I'm working on the Co-MLOps project, focusing on building scalable and efficient auto-labeling pipelines. Co-MLOps is an open, collaborative initiative aimed at generating and sharing high-quality, diverse datasets to accelerate the democratization of autonomous driving. In this blog post, I’m excited to share our journey optimizing SAM2 for real-world deployment, specifically focusing on high-performance inference using ONNX Runtime and TensorRT—both implemented from scratch in C++.

To enable efficient, large-scale GPU deployment, we first developed a custom C++ inference framework based on ONNX Runtime. This required overcoming limitations in the official C++ API, including unsupported operations and integration gaps. We also implemented batch processing support to improve GPU utilization and boost throughput, carefully ensuring the model and post-processing logic could correctly handle batched inputs. Through these efforts, we achieved a runtime of 356 ms per frame (around 100 object boxes) on a single Nvidia L40s GPU.

We then built a TensorRT-based C++ inference engine to push performance even further. By utilizing FP16 inference, Tensor Core acceleration, and custom memory optimizations, we reduced the per-frame runtime to 123 ms, down from about 5000 ms for the original prototype and roughly 3x faster than our ONNX Runtime implementation. As a reference, we also provide benchmarks for a desktop GPU (Nvidia GeForce RTX 3070 Ti) and an edge device (Nvidia Jetson Orin) in later sections.

As a result of these efforts, we not only successfully integrated SAM2 into our auto-labeling pipeline, but also achieved a 40x speedup in per-frame generation time, turning a powerful research model into a scalable, real-world solution. The thumbnail image above shows the inference result of our optimized SAM2 on a set of surround frames captured by our in-house Driving Recording System (DRS), a custom sensor kit designed for collecting high-resolution images in real-world driving environments.

You can find the related code on GitHub.

Background

High-quality labeled data is the backbone of modern AI development. However, manual data annotation is time-consuming, expensive, and difficult to scale—especially for complex tasks like instance segmentation (a computer vision task that detects and segments each individual object instance within an image). As the demand for large, diverse, and accurately labeled datasets continues to grow across industries, the need for a scalable and efficient approach to handle data has become increasingly urgent, making auto-labeling a game-changer.

Fig. 1. Auto-labeling pipeline

For instance segmentation, leveraging a foundation model is not only helpful, but essential. These models, pre-trained on massive datasets, can generalize well to new domains with minimal adaptation. Among them, SAM2 stood out as a strong candidate: It offers powerful segmentation capabilities with impressive accuracy and generality.

An overview of our auto-labeling pipeline for instance segmentation is illustrated in Fig. 1. This framework is inspired by prior research, which extends SAM2 by integrating object detection-based prompt generation, enabling fully automated, memory-efficient video segmentation for real-world applications.

However, SAM2 also comes with some serious limitations: its real-time inference capability is limited, and its high inference latency leads to significant operational costs. Inference speed is particularly problematic for real-world deployment. In our case, automatic annotation is expected to run on GPU clusters at scale, so slow inference translates directly into high GPU resource demands and cost. Without optimization, the system would struggle to keep up with large-scale data processing, making it impractical for our production workflow.

To bridge this gap between research and real-world deployment, we decided to take on the task of adapting SAM2 for our needs. Our goal was to retain the strengths of the model while making it fast, portable, and ready for large-scale auto-labeling.

What is SAM2?

SAM2 (Segment Anything Model v2) is a foundation model for image segmentation, developed to produce high-quality, prompt-based masks with improved efficiency and generalization. As shown in Fig. 2, given a point prompt or a bounding box prompt from an image as input, SAM2 can generate a precise segmentation mask for the corresponding object or region in the image. This makes it especially valuable for applications such as auto-labeling, where fast, consistent, and accurate generation of segmentation masks is essential at scale.

Fig. 2. SAM2 architecture

Compared to its predecessor (SAM), SAM2 introduces several enhancements:

Smaller model size with significantly faster inference speed, while maintaining comparable segmentation quality.
A unified architecture for both image and video segmentation, whereas SAM uses separate models for different tasks.
Native support for video datasets and temporally consistent segmentation (via the MOSE dataset), making it more adaptable for dynamic scene understanding.

These improvements make SAM2 a more practical and scalable choice for production scenarios like ours, where segmentation needs to be performed over large driving datasets with domain-specific prompts and class constraints.

Model conversion and Python inference prototyping

To prepare SAM2 for high-performance inference, we first converted the PyTorch model to ONNX format using torch.onnx.export. While the basic export process was supported by community examples, SAM2’s dynamic input shapes and custom layers introduced some challenges. For example, parts of the decoder required adjustment to ensure compatibility with ONNX inference engines. These modifications, particularly for supporting ONNX Runtime C++ API and TensorRT, will be discussed in detail later in this blog.

After conversion, we developed a Python-based ONNX Runtime prototype to quickly validate the model’s behavior. This prototype allowed us to:

Verify the correctness of ONNX exports
Test prompt-based inputs (points or boxes)
Check numerical consistency with PyTorch

There are many open-source references available for ONNX model conversion and Python ONNX Runtime inference, which made this stage relatively smooth. However, for production deployment, Python’s limitations in speed and scalability became a bottleneck.

We then moved on to building native C++ frameworks to meet high-speed processing and large-scale deployment requirements, described in the next sections.

ONNX Runtime and TensorRT C++ inference framework development

Building an efficient inference pipeline in C++ required more than just plugging in the exported ONNX models. Due to the complexity of SAM2 and limitations in certain versions of ONNX Runtime and TensorRT APIs, we made minor modifications to parts of the model architecture. These changes ensured compatibility and stability across inference backends, and will be discussed further in later sections.

Since we had already validated the model’s correctness and logic using the Python ONNX Runtime prototype, we began by developing a native C++ inference framework based on ONNX Runtime. This allowed us to reuse the validated model and focus on achieving better runtime performance and system integration.

Compared to the Python version, the C++ implementation of ONNX Runtime offers significantly faster inference due to lower overhead, better memory control, and more efficient multi-threading. These characteristics are particularly important in high-throughput scenarios like auto-labeling, where every millisecond counts.

Initially, we were able to get the ONNX Runtime inference working smoothly in FP32 using the CUDA Execution Provider. The results were stable, but the performance was still not ideal for our auto-labeling pipeline. To push for higher throughput, we explored FP16 inference using ONNX Runtime.

However, despite switching to FP16, the speedup was marginal. After some investigation, we found that ONNX Runtime’s CUDA provider doesn’t fully take advantage of FP16 acceleration.

To overcome this limitation, we decided to build a TensorRT-based C++ inference framework from scratch. TensorRT provides enhanced support for FP16 inference by leveraging Nvidia’s Tensor Cores, enabling more efficient computation and finer control over the optimization pipeline.

Our inference pipeline—shared by both ONNX Runtime and TensorRT implementations—follows a clear and modular structure:

1. Input preparation

Preprocess raw images (resize, normalize) to match the encoder’s input format. Prompts (bounding boxes) are encoded into tensor structures.

2. Model loading and memory allocation

Load the ONNX model into memory. For ONNX Runtime, this involves creating a session with the CUDA Execution Provider. For TensorRT, this step includes parsing the ONNX model and building the inference engine. Memory buffers are allocated for inputs and outputs.

3. Forward inference

The encoder processes the image and produces intermediate features. These features, along with the encoded prompts, are fed into the decoder to generate segmentation masks.

4. Post-processing

The output masks are resized, thresholded, and formatted to integrate with downstream components like dataset exporters or annotation tools.

This pipeline structure is designed to support high efficiency while offering two well-integrated inference options, allowing us to choose between ONNX Runtime and TensorRT, based on performance needs and deployment constraints.
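To make steps 1 and 4 more concrete, here is a minimal sketch of the per-pixel work they involve. It is an illustration, not the framework's actual code: the function names `NormalizeToCHW` and `ThresholdMask` are hypothetical, the ImageNet mean/std constants and the logit threshold of 0 are assumptions, and resizing is assumed to have happened already.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed normalization constants (ImageNet mean/std, a common choice).
// The values used by the actual framework are not stated in this post.
constexpr float kMean[3] = {0.485f, 0.456f, 0.406f};
constexpr float kStd[3] = {0.229f, 0.224f, 0.225f};

// Step 1 (sketch): convert an already-resized 8-bit RGB image stored as
// HWC into a normalized float tensor stored as CHW.
std::vector<float> NormalizeToCHW(const std::vector<std::uint8_t>& rgb_hwc,
                                  std::size_t height, std::size_t width) {
  assert(rgb_hwc.size() == height * width * 3);
  const std::size_t plane = height * width;
  std::vector<float> chw(rgb_hwc.size());
  for (std::size_t i = 0; i < plane; ++i) {
    for (std::size_t c = 0; c < 3; ++c) {
      const float v = static_cast<float>(rgb_hwc[i * 3 + c]) / 255.0f;
      chw[c * plane + i] = (v - kMean[c]) / kStd[c];
    }
  }
  return chw;
}

// Step 4 (sketch): turn decoder logits into a binary mask (0 or 255).
// The default threshold of 0 is an assumption.
std::vector<std::uint8_t> ThresholdMask(const std::vector<float>& logits,
                                        float threshold = 0.0f) {
  std::vector<std::uint8_t> mask(logits.size());
  for (std::size_t i = 0; i < logits.size(); ++i) {
    mask[i] = logits[i] > threshold ? 255 : 0;
  }
  return mask;
}
```

Both functions are simple element-wise loops, which is why this kind of work is a natural candidate for the multi-threading discussed later.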

While this initial framework provided a solid foundation for running SAM2 in C++, it quickly became clear that further optimization was necessary to meet the performance requirements of real-time, high-volume auto-labeling. In the following section, we dive into the key optimizations we implemented—such as batching, multi-threading, and memory reuse—to significantly improve inference speed without sacrificing accuracy.

Optimizations to speed up deployment

We applied the following techniques in our deployment framework.

Decoder batch processing
Encoder batch processing (optional)
Multi-threading with OpenMP
Inference pipeline optimization
GPU Memory allocation optimization

The experiments in this section were conducted on a single L40s GPU. The target image is a front-view picture captured by our test car, containing 135 detected traffic-related objects (135 box prompts).

Decoder batch processing

The prototype inference framework is inefficient because it processes each bounding box as a separate prompt: the decoder must run once for each of the N bounding boxes in a frame, which is time-consuming. Batching the decoder's bounding box inputs removes this overhead. Fig. 3 and Fig. 4 show the difference between the prototype pipeline and the pipeline with decoder batch processing. The decoder was exported with a dynamic input size, so the batch size can vary at runtime.

Fig. 3. Prototype inference framework pipeline

Fig. 4. Pipeline with decoder batch processing

Fig. 5. Pipeline with decoder batch processing (with limitation)

As shown in Fig. 5, to accommodate the GPU memory constraints of the hardware, we use a configurable batch size and process M boxes per iteration rather than batching all boxes together. The value of M can be adjusted to fit the memory limits of your GPU.
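The chunking described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's code: the `Box` struct, the function names, and the flat `[batch, 4]` layout are hypothetical, and the actual decoder input layout is not described in this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical box type: top-left and bottom-right corners in pixels.
struct Box {
  float x0, y0, x1, y1;
};

// Splits num_boxes prompts into [begin, end) index ranges that each hold
// at most max_batch boxes (the "M" in the text).
std::vector<std::pair<std::size_t, std::size_t>> SplitIntoBatches(
    std::size_t num_boxes, std::size_t max_batch) {
  assert(max_batch > 0);
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (std::size_t begin = 0; begin < num_boxes; begin += max_batch) {
    ranges.emplace_back(begin, std::min(begin + max_batch, num_boxes));
  }
  return ranges;
}

// Packs boxes[begin, end) into one contiguous buffer shaped [end - begin, 4],
// ready to be copied into a decoder input with a dynamic batch dimension.
std::vector<float> PackBoxes(const std::vector<Box>& boxes,
                             std::size_t begin, std::size_t end) {
  assert(begin <= end && end <= boxes.size());
  std::vector<float> flat;
  flat.reserve((end - begin) * 4);
  for (std::size_t i = begin; i < end; ++i) {
    flat.insert(flat.end(),
                {boxes[i].x0, boxes[i].y0, boxes[i].x1, boxes[i].y1});
  }
  return flat;
}
```

With 135 box prompts and M = 64, `SplitIntoBatches(135, 64)` yields three ranges (64, 64, and 7 boxes), so the decoder runs three times instead of 135.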

This optimization significantly reduces the total time consumed by the decoder. The results are illustrated in Fig. 6.

Speed evaluation

Fig. 6. Speed evaluation for decoder batch processing

As a reference experiment, we evaluated the efficiency of decoder batch processing under the following conditions:

The test image contains 95 object boxes.
The experiment was conducted on a single L40s GPU.
Using an FP16 TensorRT engine.

Decoder batch processing evaluation

Fig. 7. Decoder evaluation with different batch sizes

As shown in Fig. 7, a batch size of 64 is most efficient for this environment.

Encoder batch processing

Fig. 8. Without encoder batch processing

Fig. 9. Encoder batch processing

In order to improve overall throughput, we initially attempted to batch multiple images during the encoder stage, which is illustrated in Fig. 8 and Fig. 9. The assumption was that leveraging batch processing could reduce per-image inference time due to more efficient GPU utilization.

We modified the encoder to accept batches of images and carefully ensured all layers, including normalization and positional encoding, behaved correctly with larger input dimensions.

Although we tested different batch sizes (1 to 3) for the encoder, the per-image processing time remained the same—or even worsened slightly at batch size 3—despite higher GPU memory usage. Detailed benchmarking results are summarized in the chart below. This is likely because the encoder is computationally intensive, and even at batch size 1, the L40s GPU’s compute units were already near saturation. As a result, increasing the batch size did not lead to better throughput. However, with more powerful GPUs like the H100, which offer significantly more compute capacity, batching may yield better performance improvements.

Table 1. Encoder evaluation with different batch sizes

Encoder batch processing evaluation

Fig. 10. Encoder evaluation with different batch sizes

As shown in Table 1 and Fig. 10, not only did the encoder fail to gain speed from batching, but GPU memory consumption increased and overall efficiency actually decreased. Due to this, we chose not to implement encoder-side batch processing in the final production pipeline.

Multi-threading with OpenMP

During the deployment phase, we identified a few bottlenecks in post-processing and image drawing. As shown in Fig. 11, some of these image-level computations were relatively heavy and became a limiting factor in inference speed. To address this, we applied OpenMP-based parallelization to offload and accelerate the more time-consuming parts.

Single thread time evaluation (ms)

Fig. 11. Time evaluation for pipeline (single thread)

The implementation was straightforward but effective: by simply parallelizing loops over pixels or masks, we achieved a noticeable speedup in these stages. The result is shown in Fig. 12. This lightweight optimization further improved the overall throughput of the auto-labeling pipeline with minimal development overhead.
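As an illustration of this kind of loop-level parallelization, here is a minimal sketch of mask drawing with OpenMP. The function name `OverlayMask` and the alpha-blending formula are assumptions for the example, not the framework's actual drawing code. Compile with OpenMP enabled (for example `-fopenmp` on GCC or Clang); without it the pragma is ignored and the loop simply runs on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Blends a solid color into an interleaved RGB image wherever mask != 0.
// Hypothetical helper: names and blending rule are illustrative only.
void OverlayMask(std::vector<std::uint8_t>& rgb,
                 const std::vector<std::uint8_t>& mask,
                 std::uint8_t r, std::uint8_t g, std::uint8_t b,
                 float alpha) {
  assert(rgb.size() == mask.size() * 3);
  const std::uint8_t color[3] = {r, g, b};
  const long long num_pixels = static_cast<long long>(mask.size());

  // Each iteration reads and writes only its own pixel, so iterations are
  // independent and can be distributed across threads without locking.
  #pragma omp parallel for
  for (long long i = 0; i < num_pixels; ++i) {
    if (mask[i] == 0) continue;
    for (int c = 0; c < 3; ++c) {
      const float blended =
          (1.0f - alpha) * rgb[i * 3 + c] + alpha * color[c];
      rgb[i * 3 + c] = static_cast<std::uint8_t>(blended + 0.5f);
    }
  }
}
```

The same pattern applies to loops over masks, as long as each iteration writes to a separate output.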

Multi-threads for mask drawing

Multi-threads for decoder’s post-processing

Multi-thread time evaluation (ms)

Fig. 12. Time evaluation for pipeline (multi-threads)

Inference pipeline optimization

The SAM2 inference pipeline is illustrated in Fig. 13. In batch processing mode, the decoder must be started several times per frame because of the batch size limit, and the first batch of each frame incurs a substantial cold-start time. To avoid this, we restructured the code to initialize the decoder once at frame 1 and reuse it for all subsequent frames.
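The initialize-once-and-reuse idea can be sketched as below. This is a simplified model rather than the framework's code: `Decoder` is a hypothetical stand-in for a backend decoder session whose construction represents the expensive one-time setup, and `FramePipeline` is an assumed wrapper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a decoder session. Constructing it models the
// costly setup; Run() models one batched forward pass and returns the
// number of masks produced.
class Decoder {
 public:
  Decoder() { ++construction_count; }
  std::size_t Run(std::size_t batch_size) { return batch_size; }
  static int construction_count;
};
int Decoder::construction_count = 0;

// Keeps the decoder alive across frames instead of recreating it per frame.
class FramePipeline {
 public:
  std::size_t ProcessFrame(std::size_t num_boxes, std::size_t max_batch) {
    assert(max_batch > 0);
    if (!decoder_) {
      decoder_ = std::make_unique<Decoder>();  // happens on frame 1 only
    }
    std::size_t masks = 0;
    for (std::size_t begin = 0; begin < num_boxes; begin += max_batch) {
      masks += decoder_->Run(std::min(max_batch, num_boxes - begin));
    }
    return masks;
  }

 private:
  std::unique_ptr<Decoder> decoder_;
};
```

However many frames are processed, the decoder is constructed exactly once, so only the very first frame pays the setup cost.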

Fig. 13. SAM2 inference pipeline (ONNX Runtime)

Fig. 14 shows the time to process one frame (excluding the initial frame) when the decoder runs three times because of the batch size limit. After optimization, the execution time of the first batch is reduced to half of its previous duration. The speedup is therefore concentrated in the first batch of each frame.

Engine pipeline (ms)

Fig. 14. Inference time evaluation for one frame

*This time evaluation contains not only engine forward time, but also pre-processing, image I/O time, etc.

GPU memory allocation optimization (TensorRT)

TensorRT offers more flexibility in GPU memory allocation than ONNX Runtime, which requires memory to be allocated precisely for each input tensor at inference time. TensorRT supports dynamic batch processing, so the engine can handle varying batch sizes at runtime without being rebuilt. It also lets us allocate memory for the maximum batch size once and reuse it for subsequent batches, which eliminates per-batch decoder memory allocation and saves time. Fig. 15 illustrates the TensorRT inference pipeline before and after the optimization.
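The allocate-once pattern described above can be sketched as follows. To keep the example runnable anywhere, it uses host memory (`std::vector`) as a stand-in for a GPU buffer; in a real CUDA pipeline the storage would be a device allocation made once at startup. The class name `ReusableBuffer` and its interface are assumptions for illustration, and the maximum batch size is assumed to be known up front.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Buffer sized once for the largest batch and reused for every batch.
// Host memory stands in for device memory in this sketch.
class ReusableBuffer {
 public:
  ReusableBuffer(std::size_t max_batch, std::size_t elems_per_item)
      : elems_per_item_(elems_per_item),
        storage_(max_batch * elems_per_item),
        active_elems_(0) {}

  // Reserves room for `batch` items and returns the start of the buffer.
  // No allocation happens here, so the pointer is stable across calls.
  float* Acquire(std::size_t batch) {
    assert(batch * elems_per_item_ <= storage_.size());
    active_elems_ = batch * elems_per_item_;
    return storage_.data();
  }

  std::size_t active_elems() const { return active_elems_; }
  std::size_t capacity_elems() const { return storage_.size(); }

 private:
  std::size_t elems_per_item_;
  std::vector<float> storage_;
  std::size_t active_elems_;
};
```

A full batch of 64 boxes and a final partial batch of 7 boxes both use the same storage; only the active size changes between calls.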

Fig. 15. SAM2 inference pipeline (TensorRT)

As shown in Fig. 16, the overall execution time for a single frame was improved by 1.6x after GPU memory allocation optimization.

Engine pipeline (ms)

Fig. 16. Inference time evaluation for one frame

*This time evaluation contains not only engine forward time, but also pre-processing, image I/O time, etc.

Total performance evaluation

Lastly, an overall performance evaluation is presented to demonstrate the performance of each milestone. The experimental environment is shown below.

Experiment environment

Target image: Single car-front-view image shown in Fig. 17, captured by our in-house Driving Recording System (DRS).
Number of boxes: 94
Device: Single L40s GPU
- GPU RAM: 46068 MiB
- CPU frequency: 2.10–3.40 GHz

Fig. 17. Test image: original (top), inference result (bottom)

Speed evaluation

Table 2 below shows the benchmark result of our experiment. The columns "encoder" and "decoder" only include the TensorRT inference engine forward time for one frame. The column “whole” shows the entire inference time per frame, which includes TensorRT inference engine forward, image read/write, and pre/post-process.

Table 2. Speed evaluation on different frameworks

*The original PyTorch version takes over 10 minutes to process a single frame with many boxes, making it unsuitable for testing such images and thus is excluded from this comparison.

Speed evaluation

Fig. 19. Speed evaluation on different frameworks

The optimization process resulted in a 40x speedup for processing one frame, which significantly improved the auto-labeling pipeline. We also list benchmarks for two devices other than the L40s in Table 3.

Table 3. Speed evaluation on other devices (ms)

These include the RTX 3070 Ti, a consumer-grade desktop GPU, and the Jetson Orin, an edge device. While their performance doesn’t match the L40s, both still demonstrated respectable inference speeds, validating the effectiveness of our optimizations across different hardware environments. For the Jetson Orin, we ensured maximum performance by setting the power mode to MAXN and enabling Jetson Clocks, with both CPU and GPU running at full speed during evaluation.

Accuracy evaluation

Table 4. Mean IoU evaluation on different frameworks

To ensure that our performance optimizations did not compromise accuracy, we evaluated the optimized SAM2 inference pipeline on the MOSE validation set (311 images). As shown in Table 4, there is no degradation in segmentation quality compared to the original PyTorch model—confirming that our acceleration efforts (including batch processing, multithreading, and FP16 inference) maintained full model fidelity while significantly improving speed.
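For readers unfamiliar with the metric, mean IoU can be computed as in the sketch below. This is a generic illustration of the standard definition (intersection area divided by union area, averaged over mask pairs), not the evaluation script used for Table 4. The function names are hypothetical, and treating two empty masks as a perfect match is an assumed convention.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// IoU of two binary masks of equal size; any non-zero value is foreground.
double MaskIoU(const std::vector<std::uint8_t>& a,
               const std::vector<std::uint8_t>& b) {
  assert(a.size() == b.size());
  std::size_t intersection = 0;
  std::size_t union_area = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const bool in_a = a[i] != 0;
    const bool in_b = b[i] != 0;
    if (in_a && in_b) ++intersection;
    if (in_a || in_b) ++union_area;
  }
  // Assumed convention: two empty masks count as a perfect match.
  if (union_area == 0) return 1.0;
  return static_cast<double>(intersection) /
         static_cast<double>(union_area);
}

// Mean IoU over paired predicted and reference masks.
double MeanIoU(const std::vector<std::vector<std::uint8_t>>& predictions,
               const std::vector<std::vector<std::uint8_t>>& references) {
  assert(predictions.size() == references.size());
  assert(!predictions.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < predictions.size(); ++i) {
    sum += MaskIoU(predictions[i], references[i]);
  }
  return sum / static_cast<double>(predictions.size());
}
```

Comparing the masks produced by each backend against the same reference masks with a function like this gives one number per framework, which is the kind of comparison Table 4 reports.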

Conclusion

The successful adaptation of SAM2 for our auto-labeling pipeline demonstrates how a powerful research model can be transformed into a scalable, production-ready solution. Through careful model conversion, operator-level optimization, batch processing, multi-threaded execution, GPU memory management, and the development of high-performance C++ inference engines with ONNX Runtime and TensorRT, we achieved a 40x speedup in per-frame processing without compromising segmentation quality. These gains dramatically improve inference throughput and reduce GPU costs, making large-scale, high-frequency annotation both feasible and cost-effective. Next, we plan to fully integrate this optimized pipeline into the Co-MLOps auto-labeling framework, enabling automated generation of high-quality, diverse labeled datasets, a key step toward accelerating the democratization of autonomous driving.

TIER IV is always on the lookout for passionate individuals to join our journey. If you're interested in working on large-scale data systems, edge sensor platforms, or scalable infrastructure to support real-world autonomous driving, check out these open positions on the Co-MLOps team.

Visit our careers page to view all job openings.

If you’re uncertain about which roles align best with your experience, or if the current job openings don’t quite match your preferences, register your interest here. We’ll get in touch if a role that matches your experience becomes available, and schedule an informal interview.
