Advancing autonomous driving technology with MLOps, training infrastructure

Written by TIER IV | 19-Dec-2024 01:00:00

I’m Riki Ota, an MLOps engineer on TIER IV's Data/ML team. My role involves developing and operating services that underpin MLOps workflows. My team creates and oversees a suite of cloud-based services to tackle challenges related to data and machine learning. In this post, I’ll introduce what we’re working on and highlight challenges in achieving shorter lead times for refining machine learning models.

MLOps challenges

The number of autonomous driving hubs running TIER IV software has been growing every year. New services are regularly appearing in regional areas. As these areas of operation expand, so does the demand for machine learning model improvements, driving the need to further shorten the lead time for these improvements.

Here are some of the challenges in MLOps for autonomous driving, including issues common to MLOps in general:

Difficulty selecting appropriate data due to the large amount of unstructured data.
High computational and human costs for annotation and model training.
Rigorous safety standards must be met and thorough evaluations passed whenever machine learning models are integrated into autonomous driving software.

During the development phase, data is collected daily, including sensor data from cameras and LiDAR units covering multiple directions, and this produces massive volumes of information. Manually extracting relevant and useful data from such a large dataset is a significant challenge. Additionally, the characteristics of the data and the volume required lead to high annotation and training costs, which in turn drive up the associated management costs. The need to thoroughly evaluate autonomous driving software after machine learning models are integrated adds further to the lead time for model improvements.

In autonomous driving, where strict safety standards are a necessity, it is crucial to carefully manage models running in released software. This includes ensuring accountability in the event of accidents and maintaining robust tracking and reproducibility of experiments. Taking these challenges into account, we are currently developing an MLOps workflow to achieve efficient model improvements for the ever-increasing number of autonomous driving hubs.

MLOps workflow

Under the current MLOps workflow, the Data/ML team is responsible for the steps from data collection to training, with a primary focus on analyzing unstructured data and developing the annotation and ML pipelines.

The Data Search Service is where collected rosbag data is analyzed and extracted. Metadata can be used to extract data of interest. A simple visualization feature is also available to provide an overview of rosbag data via the web.
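As a rough illustration of metadata-based extraction, the sketch below filters a small in-memory catalog of rosbag records. The record fields (bag_id, location, weather, duration_s) and the find_bags helper are hypothetical names chosen for this example. They are not the Data Search Service's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BagRecord:
    """Hypothetical metadata attached to one collected rosbag."""
    bag_id: str
    location: str
    weather: str
    duration_s: float


def find_bags(catalog, *, min_duration_s=0.0, **equals):
    """Return records that are long enough and whose fields equal the given values."""
    return [
        rec for rec in catalog
        if rec.duration_s >= min_duration_s
        and all(getattr(rec, key) == value for key, value in equals.items())
    ]


# Illustrative catalog; the values are made up for the example.
catalog = [
    BagRecord("bag-001", "site-a", "rain", 120.0),
    BagRecord("bag-002", "site-a", "clear", 45.0),
    BagRecord("bag-003", "site-b", "rain", 300.0),
]

# Select rainy recordings that last at least 100 seconds.
rainy = find_bags(catalog, weather="rain", min_duration_s=100.0)
print([rec.bag_id for rec in rainy])
```

Keeping the filter criteria as plain keyword arguments makes it simple to add further metadata fields later without changing the helper.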

The Annotation Pipeline includes functions for management, search and visualization of annotated datasets, integrating with third-party annotation tools to create a seamless flow from collected rosbag data to annotation.

The ML Pipeline is for managing experiments and models, which will be discussed in more detail in the next section.

Finally, the model is integrated into the autonomous driving software via the CI/CD Pipeline for evaluation and then deployed to the vehicle.

This MLOps workflow structure streamlines the annotation and training cycles. Parts of this workflow are also used for creating evaluation data for autonomous driving software, further increasing development efficiency.

ML Pipeline

The ML Pipeline, which serves as the training infrastructure, has been released in beta and is now available for internal use, as well as for a select group of external users. The documentation can be found here. The main features include experiment management and model management. Users can easily execute training and release models via either the command line or graphical user interface, with seamless integration into Web.Auto services.

Experiment management

ML Experiments provides a pipeline for executing training on the cloud, along with functions for managing experimental conditions and outcomes.

The pipeline for executing training is divided into three stages: Build, Process, and Train.

In the Build stage, a Docker image is created from code uploaded via the Web.Auto command line interface or from code stored on GitHub.

In the Process stage, preprocessing of the dataset is performed using the specified dataset configuration and the image created in the Build stage.

In the final Train stage, training is executed using the image from the Build stage and the output – comprising logs, checkpoints, and the machine learning model – is processed.

Results from each stage can be reused, which keeps training efficient. For example, training settings can be modified and the code rerun, or training can be resumed from a specific checkpoint, without repeating stages whose inputs have not changed.
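One way to picture this kind of stage reuse is a cache keyed by a hash of each stage's inputs. The following is a minimal sketch under that assumption. StageCache, fingerprint and run_pipeline are hypothetical names, the stage bodies are placeholders that return strings, and none of this reflects how the ML Pipeline is actually implemented.

```python
import hashlib
import json


def fingerprint(*parts):
    """Stable short hash of JSON-serializable inputs, used as a cache key."""
    payload = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class StageCache:
    """Runs a stage only when its inputs have not been seen before."""

    def __init__(self):
        self.results = {}
        self.executed = []  # names of stages that actually ran

    def run(self, name, key, fn):
        cache_key = (name, key)
        if cache_key not in self.results:
            self.executed.append(name)
            self.results[cache_key] = fn()
        return self.results[cache_key]


def run_pipeline(cache, code_rev, dataset_cfg, train_cfg):
    # Build depends only on the code revision.
    build_key = fingerprint(code_rev)
    image = cache.run("build", build_key, lambda: f"image:{build_key}")

    # Process depends on the built image and the dataset configuration.
    process_key = fingerprint(build_key, dataset_cfg)
    data = cache.run("process", process_key, lambda: f"data:{process_key}")

    # Train depends on both earlier stages and the training settings.
    train_key = fingerprint(build_key, process_key, train_cfg)
    return cache.run("train", train_key, lambda: f"model:{train_key} ({image}, {data})")


cache = StageCache()
run_pipeline(cache, "rev-1", {"split": "train"}, {"lr": 0.01})
first_run = list(cache.executed)

# Change only the training settings: Build and Process results are reused.
cache.executed.clear()
run_pipeline(cache, "rev-1", {"split": "train"}, {"lr": 0.001})
second_run = list(cache.executed)

print(first_run, second_run)
```

In this sketch, changing only the training settings reruns the Train stage alone, while changing the code revision would invalidate all three stages because every later key includes the build key.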

Training code and settings management

Reproducibility of experiments is a requirement for any training infrastructure. It is especially critical in autonomous driving, where accountability in the event of an accident depends on being able to track and reproduce machine learning experiments. To ensure reproducibility, the ML Pipeline manages runtime training settings in a fixed format. Training settings written in YAML determine which tasks are executed, and computational resources for training are allocated accordingly. In addition, experiment logs and output models are linked, so the experimental conditions and the datasets used for training can still be traced after a model has been distributed to an autonomous vehicle.
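The link between settings, datasets and output models can be sketched as a small lineage record. The example below is illustrative only: the settings keys, dataset names, ExperimentRecord and record_experiment are all hypothetical, and a Python dictionary stands in for the YAML file so that the sketch needs nothing beyond the standard library.

```python
import hashlib
import json
from dataclasses import dataclass

# Hypothetical settings; in the text these would live in a YAML file.
settings = {
    "task": "train",
    "resources": {"gpus": 4},
    "datasets": ["dataset-a:v3", "dataset-b:v1"],
    "hyperparameters": {"epochs": 30, "lr": 0.001},
}


def settings_digest(cfg):
    """Hash the settings canonically so that key order does not matter."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ExperimentRecord:
    """Hypothetical record tying an output model to the conditions that produced it."""
    experiment_id: str
    settings_sha256: str
    datasets: tuple
    model_file: str


def record_experiment(experiment_id, cfg, model_file):
    return ExperimentRecord(
        experiment_id=experiment_id,
        settings_sha256=settings_digest(cfg),
        datasets=tuple(cfg["datasets"]),
        model_file=model_file,
    )


record = record_experiment("exp-0001", settings, "model.onnx")

# Index by model file so that a deployed model can be traced back.
by_model = {record.model_file: record}
print(by_model["model.onnx"].datasets)
```

Hashing a canonical form of the settings gives a compact identifier that changes whenever any training condition changes, which is one simple way to detect that two runs were not configured identically.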

Model management

ML Package provides functionality for managing machine learning model files and linking them to Web.Auto's CI/CD Pipeline.

A package is created by grouping the files of a machine learning model, and a version is assigned before releasing it. The version can be named arbitrarily, and user access to the package can be precisely controlled. While Web.Auto separates data management by project, ML Package supports cross-project sharing of machine learning models. This functionality allows machine learning models to be shared and collaboratively used in autonomous driving software across multiple locations.
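The packaging model described above can be sketched as a small data structure. The MLPackage class below, including its method names and the rule that a released version cannot be overwritten, is an assumption made for illustration and is not the actual ML Package API.

```python
from dataclasses import dataclass, field


@dataclass
class MLPackage:
    """Hypothetical model package: versioned file groups plus an access list."""
    name: str
    owner_project: str
    shared_with: set = field(default_factory=set)
    versions: dict = field(default_factory=dict)

    def release(self, version, files):
        """Register a group of model files under a free-form version name."""
        if version in self.versions:
            # Assumption: a released version is never overwritten.
            raise ValueError(f"version {version!r} already released")
        self.versions[version] = tuple(files)

    def share(self, project):
        """Grant another project read access to this package."""
        self.shared_with.add(project)

    def fetch(self, project, version):
        """Return the files for a version if the project is allowed to read them."""
        if project != self.owner_project and project not in self.shared_with:
            raise PermissionError(f"{project} cannot read {self.name}")
        return self.versions[version]


pkg = MLPackage(name="detector", owner_project="project-a")
pkg.release("2024-12-baseline", ["model.onnx", "labels.txt"])
pkg.share("project-b")
print(pkg.fetch("project-b", "2024-12-baseline"))
```

Here access is granted per project, which mirrors the idea of sharing one model across projects while data stays separated. Finer-grained, per-user control would need a richer access list than this sketch provides.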

Integration with Web.Auto CI/CD Pipeline

ML Package is linked to the CI/CD Pipeline, so released machine learning models can be integrated into autonomous driving software simulations for evaluation or built into firmware images for distribution to vehicles. Models linked to the CI/CD Pipeline are managed alongside firmware images, making it easy to track how a specific model was evaluated in simulations, which vehicles it was deployed to, and the operational environment in which it is currently running. This tracking supports incident investigations and also facilitates the continuous improvement of models after release.
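To make the tracking idea concrete, the sketch below keeps a simple index from a model version to its simulation results, the firmware images that contain it, and the vehicles running those images. ModelTracker and all identifiers in the example are hypothetical. The real CI/CD Pipeline's data model is not described in this post.

```python
from collections import defaultdict


class ModelTracker:
    """Hypothetical index linking a model version to evaluations and deployments."""

    def __init__(self):
        self.evaluations = defaultdict(list)  # model -> [(simulation, passed)]
        self.firmware = defaultdict(set)      # model -> {firmware image}
        self.vehicles = defaultdict(set)      # firmware image -> {vehicle}

    def record_evaluation(self, model, simulation, passed):
        self.evaluations[model].append((simulation, passed))

    def record_firmware(self, model, image):
        self.firmware[model].add(image)

    def record_deployment(self, image, vehicle):
        self.vehicles[image].add(vehicle)

    def trace(self, model):
        """Summarize where a model was evaluated and which vehicles run it."""
        vehicles = set()
        for image in self.firmware.get(model, set()):
            vehicles |= self.vehicles.get(image, set())
        return {
            "evaluations": list(self.evaluations.get(model, [])),
            "firmware": sorted(self.firmware.get(model, set())),
            "vehicles": sorted(vehicles),
        }


tracker = ModelTracker()
tracker.record_evaluation("detector@v2", "sim-scenario-1", True)
tracker.record_firmware("detector@v2", "firmware-10")
tracker.record_deployment("firmware-10", "vehicle-7")
tracker.record_deployment("firmware-10", "vehicle-3")

report = tracker.trace("detector@v2")
print(report)
```

Because vehicles are attached to firmware images rather than directly to models, the same lookup also answers the reverse question of which models a given vehicle is running, given one more index in the opposite direction.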

Wrap-up

In this post, I’ve focused on the MLOps workflow and the training infrastructure being developed by the Data/ML team. We’re building a system-managed MLOps workflow with experiment tracking and reproducibility, utilizing Web.Auto features to support the entire machine learning workflow. Many challenges lie ahead, but we are continuing to advance the MLOps workflow to further reduce lead times for model improvements.

The team is tackling development challenges that are likely to appeal to data and machine learning engineers, including some not mentioned here, such as automated labeling efforts in the Annotation Pipeline and LLM-based search in the Data Search Service.

We are always open to speaking with engineers who want to contribute to the development of MLOps for autonomous driving. If you’re interested in any of the following, get in touch:

Developing not just API servers, but complex workflows
Analysis and search of large-scale unstructured data
DevOps and MLOps for autonomous driving systems

Feel free to reach out to learn more about our open positions. If you're interested in a role, you can arrange a preliminary interview via the links in this blog or on our careers page. In your application, simply note that you'd like to have an informal conversation about the position.

Also, we’re planning a meetup event in February 2025, focusing on data infrastructure and MLOps! The event details will be published soon, so be sure to check TIER IV’s Connpass page. And don’t forget to subscribe to our newsletter to stay updated on this and other events!

Riki Ota | Data/ML Team
Riki joined TIER IV as an MLOps engineer in October 2022. He earned a master’s degree in engineering from Tohoku University and has experience in back-end development for taxi-hailing platforms and autonomous driving tools.

TIER IV is always on the lookout for passionate individuals to join our journey. If you share our vision of making autonomous driving accessible to all, get in touch.

Visit our careers page to view all job openings.

If you’re uncertain about which roles align best with your experience, or if the current job openings don’t quite match your preferences, register your interest here. We’ll get in touch if a role that matches your experience becomes available, and schedule an informal interview.

Inquiries

Media: pr@tier4.jp
Business: sales@tier4.jp
