Workshop
Machine Learning for Autonomous Driving
Jiachen Li · Nigamaa Nayakanti · Xinshuo Weng · Daniel Omeiza · Ali Baheri · German Ros · Rowan McAllister

Sat Dec 03 06:20 AM -- 03:00 PM (PST) @ Theater B

Welcome to the NeurIPS 2022 Workshop on Machine Learning for Autonomous Driving!

Autonomous vehicles (AVs) offer a rich source of high-impact research problems for the machine learning (ML) community, including perception, state estimation, probabilistic modeling, time series forecasting, gesture recognition, robustness guarantees, real-time constraints, user-machine communication, multi-agent planning, and intelligent infrastructure. Further, the interaction between ML subfields towards a common goal of autonomous driving can catalyze interesting inter-field discussions that spark new avenues of research, which this workshop aims to promote. As an application of ML, autonomous driving has the potential to greatly improve society by reducing road accidents, giving independence to those unable to drive, and even inspiring younger generations with tangible examples of ML-based technology clearly visible on local streets. All are welcome to attend! This will be the 7th NeurIPS workshop in this series. Previous workshops in 2016, 2017, 2018, 2019, 2020, and 2021 enjoyed wide participation from both academia and industry.

Sat 6:20 a.m. - 6:30 a.m. Welcome (Opening Remarks)
Sat 6:30 a.m. - 7:05 a.m. Vision-centric Autonomous Driving: from Perception to Prediction (Talk) Hang Zhao
Sat 7:05 a.m. - 7:40 a.m. Toward Generalizable Embodied AI for Autonomous Driving (Talk) Bolei Zhou
Sat 7:40 a.m. - 8:10 a.m. Paper Spotlight Talks (Talk)
Sat 8:10 a.m. - 9:00 a.m. Posters and Break (Poster)
Sat 9:00 a.m. - 9:35 a.m. Trustworthy Machine Learning in Autonomous Driving (Talk) Bo Li
Sat 9:45 a.m. - 10:10 a.m. Scenario generation for long-tail discovery (Talk) Yuning Chai
Sat 10:10 a.m. - 11:00 a.m. Lunch and Posters (Poster)
Sat 11:00 a.m. - 11:35 a.m. Contingency Planning with Learned Models of Behavioral and Perceptual Uncertainty (Talk) Nicholas Rhinehart
Sat 11:35 a.m. - 12:10 p.m. CausalAgents: A robustness benchmark for motion forecasting using causal relationships (Talk) Liting Sun
Sat 12:10 p.m. - 1:00 p.m. Posters and Break (Poster)
Sat 1:00 p.m. - 1:35 p.m. Operating Autonomous Vehicles In Different Contexts - On Roads and In Shared Spaces (Talk) Stewart Worrall
Sat 1:35 p.m. - 2:55 p.m. CARLA Challenge (Talk)
Sat 2:55 p.m. - 3:00 p.m. Closing Remarks

- Fast-BEV: Towards Real-time On-vehicle Bird’s-Eye View Perception (Poster)
Recently, the pure camera-based Bird’s-Eye-View (BEV) perception removes expensive Lidar sensors, making it a feasible solution for economical autonomous driving. However, most existing BEV solutions either suffer from modest performance or require considerable resources to execute on-vehicle inference. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing real-time BEV perception on the on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive view transformation or depth representation.
Starting from the M2BEV baseline, we further introduce (1) a strong data augmentation strategy for both image and BEV space to avoid over-fitting, (2) a multi-frame feature fusion mechanism to leverage the temporal information, and (3) an optimized deployment-friendly view transformation to speed up the inference. Through experiments, we show that the Fast-BEV model family achieves considerable accuracy and efficiency on edge. In particular, our M1 model (R18@256×704) can run over 50FPS on the Tesla T4 platform, with 46.9% NDS on the nuScenes validation set. Our largest model (R101@900x1600) establishes a new state-of-the-art 53.5% NDS on the nuScenes validation set. Code will be made publicly available.
Bin Huang · Yangguang Li · Feng Liang · Enze Xie · Luya Wang · Mingzhu Shen · Fenggang Liu · Tianqi Wang · Ping Luo · Jing Shao

- A Versatile and Efficient Reinforcement Learning Approach for Autonomous Driving (Poster)
Heated debates continue over the best solution for autonomous driving. The classic modular pipeline is widely adopted in the industry owing to its great interpretability and stability, whereas the fully end-to-end paradigm has demonstrated considerable simplicity and learnability along with the rise of deep learning. As a way of marrying the advantages of both approaches, learning a semantically meaningful representation and then using it in the downstream driving policy learning tasks provides a viable and attractive solution. However, several key challenges remain to be addressed, including identifying the most effective representation, alleviating the sim-to-real generalization issue, and balancing model training cost. In this study, we propose a versatile and efficient reinforcement learning approach and build a fully functional autonomous vehicle for real-world validation. Our method shows great generalizability to various complicated real-world scenarios and superior training efficiency against the competing baselines.
Guan Wang · Haoyi Niu · desheng zhu · Jianming HU · Xianyuan Zhan · Guyue Zhou

- VN-Transformer: Rotation-Equivariant Attention for Vector Neurons (Poster)
Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: (i) we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; (ii) we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; (iii) we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; (iv) we show that small tradeoffs in equivariance (epsilon-approximate equivariance) can yield large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
Serge Assaad · Carlton Downey · Rami Al-Rfou · Nigamaa Nayakanti · Benjamin Sapp

- Improving Motion Forecasting for Autonomous Driving with the Cycle Consistency Loss (Poster)
Robust motion forecasting of the dynamic scene is a critical component of an autonomous vehicle. It is a challenging problem due to the heterogeneity in the scene and the inherent uncertainties in the problem.
To improve the accuracy of motion forecasting, in this work, we identify a new consistency constraint in this task, namely that an agent's future trajectory should be coherent with its history observations and vice versa. To leverage this property, we propose a novel cycle consistency training scheme and define a novel cycle loss to encourage this consistency. In particular, we reverse the predicted future trajectory backward in time and feed it back into the prediction model to predict the history and compute the loss as an additional cycle loss term. Through our experiments on the Argoverse dataset, we demonstrate that cycle loss can improve the performance of competitive motion forecasting models.
Titas Chakraborty · Akshay Bhagat · Henggang Cui

- Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation (Poster)
Real-time video segmentation is a crucial task for many real-world applications such as autonomous driving and robot control. Since state-of-the-art semantic segmentation models are often too heavy for real-time applications despite their impressive performance, researchers have proposed lightweight architectures with speed-accuracy trade-offs, achieving real-time speed at the expense of reduced accuracy. In this paper, we propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks by exploiting the temporal locality in videos. Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins. We then perform partial computation of the backbone network on the regions of the current frame that capture temporal differences between the current and previous frame. This is done by dynamically dropping out residual blocks using a gating mechanism which decides which blocks to drop based on inter-frame distortion.
We validate our Spatial-Temporal Mask Generator (STMG) on video semantic segmentation benchmarks with multiple backbone networks, and show that our method largely speeds up inference with minimal loss of accuracy.
Hyunsu Rhee · Dongchan Min · Sunil Hwang · Bruno Andreis · Sung Ju Hwang

- An Intelligent Modular Real-Time Vision-Based System for Environment Perception (Poster)
A significant portion of driving hazards is caused by human error and disregard for local driving regulations; consequently, an intelligent assistance system can be beneficial. This paper proposes a novel vision-based modular package to ensure drivers' safety by perceiving the environment. Each module is designed based on accuracy and inference time to deliver real-time performance. As a result, the proposed system can be implemented on a wide range of vehicles with minimum hardware requirements. Our modular package comprises four main sections: lane detection, object detection, segmentation, and monocular depth estimation. Each section is accompanied by novel techniques to improve the accuracy of others along with the entire system. Furthermore, a GUI is developed to display perceived information to the driver. In addition to using public datasets, like BDD100K, we have also collected and annotated a local dataset that we utilize to fine-tune and evaluate our system. We show that the accuracy of our system is above 80% in all the sections. Our code and data will be available on our GitHub page.
Amirhossein Kazerouni · Amirhossein Heydarian · Milad Soltany · Aida Mohammadshahi · Abbas Omidi · Saeed Ebadollahi

- ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection (Poster)
Driver distraction detection is an important computer vision problem that can play a crucial role in enhancing traffic safety and reducing traffic accidents. This paper proposes a novel semi-supervised method for detecting driver distractions based on Vision Transformer (ViT).
Specifically, a multi-modal Vision Transformer (ViT-DD) is developed that makes use of inductive information contained in training signals of distraction detection as well as driver emotion recognition. Further, a self-learning algorithm is designed to include driver data without emotion labels into the multi-task training of ViT-DD. Extensive experiments conducted on the SFDDD and AUCDD datasets demonstrate that the proposed ViT-DD outperforms the best state-of-the-art approaches for driver distraction detection by 6.5% and 0.9%, respectively.
Yunsheng Ma · Ziran Wang

- PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking? (Poster)
Most (3D) multi-object tracking methods rely on object-level information, e.g. appearance, for data association. By contrast, we investigate how far we can get by considering only geometric relationships and interactions between objects over time. We represent each tracking sequence as a multiplex graph where 3D object detections are nodes, and spatial and temporal pairwise relations among them are encoded via two types of edges. This structure allows our graph neural network to consider all types of interactions and distinguish temporal, contextual and motion cues to obtain final scene interpretation by posing tracking as edge classification. The model outputs classification results after multiple rounds of neural message passing, during which it is able to reason about long-term object trajectories, influences and motion based solely on initial pairwise relationships. To enable our method for online (streaming) scenarios, we introduce a technique to continuously evolve our graph over long tracking sequences to achieve good performance while maintaining sparsity with linear complexity for the number of edges. We establish a new state-of-the-art on the nuScenes dataset.
Aleksandr Kim · Guillem Braso · Aljosa Osep · Laura Leal-Taixé

- A Graph Representation for Autonomous Driving (Poster)
For human drivers, an important aspect of learning to drive is knowing how to pay attention to areas of the roadway that are critical for decision-making while simultaneously ignoring distractions. Similarly, the choice of roadway representation is critical for good performance of an autonomous driving system. An effective representation should be compact and permutation-invariant, while still representing complex vehicle interactions that govern driving decisions. This paper introduces the Graph Representation for Autonomous Driving (GRAD); GRAD generates a global scene representation using a space-time graph which incorporates the estimated future trajectories of other vehicles. We demonstrate that GRAD outperforms the best performing social attention representation on a simulated highway driving task in high traffic densities and also has a low computational complexity in both single and multi-agent settings.
Zerong Xi · Gita Sukthankar

- Improving Predictive Performance and Calibration by Weight Fusion in Semantic Segmentation (Poster)
Averaging predictions of a deep ensemble of networks is a popular and effective method to improve predictive performance and calibration in various benchmarks and Kaggle competitions. However, the runtime and training cost of deep ensembles grow linearly with the size of the ensemble, making them unsuitable for many applications. Averaging ensemble weights instead of predictions circumvents this disadvantage during inference and is typically applied to intermediate checkpoints of a model to reduce training cost. Albeit effective, only a few works have improved the understanding and the performance of weight averaging. Here, we revisit this approach and show that a simple weight fusion (WF) strategy can lead to a significantly improved predictive performance and calibration.
We describe what prerequisites the weights must meet in terms of weight space, functional space and loss. Furthermore, we present a new test method (called oracle test) to measure the functional space between weights. We demonstrate the versatility of our WF strategy across state-of-the-art segmentation CNNs and Transformers as well as real-world datasets such as BDD100K and Cityscapes. We compare WF with similar approaches and show our superiority for in- and out-of-distribution data in terms of predictive performance and calibration.
Timo Saemann · Ahmed Hammam · Andrei Bursuc · Christoph Stiller · Horst-Michael Gross

- CAMEL: Learning Cost-maps Made Easy for Off-road Driving (Poster)
Cost-maps are used by robotic vehicles to plan collision-free paths. The cost associated with each cell in the map represents the sensed environment information, which is often determined manually after several trial-and-error efforts. In off-road environments, due to the presence of several types of features, it is challenging to handcraft the cost values associated with each feature. Moreover, different handcrafted cost values can lead to different paths for the same environment, which is not desirable. In this paper, we address the problem of learning the cost-map values from the sensed environment for robust vehicle path planning. We propose a novel framework called CAMEL, using a deep learning approach that learns the parameters through demonstrations, yielding an adaptive and robust cost-map for path planning. CAMEL has been trained on multi-modal datasets such as RELLIS-3D. The evaluation of CAMEL is carried out on an off-road scene simulator (MAVS) and on field data from the IISER-B campus. We also perform a real-world implementation of CAMEL on a ground rover. The results show flexible and robust motion of the vehicle without collisions in unstructured terrains.
Kasi Viswanath · PB Sujit · Srikanth Saripalli

- DiffStack: A Differentiable and Modular Control Stack for Autonomous Vehicles (Poster)
Autonomous vehicle (AV) stacks are typically built in a modular fashion, with explicit components performing detection, tracking, prediction, planning, control, etc. While modularity improves reusability, interpretability, and generalizability, it also suffers from compounding errors, information bottlenecks, and integration challenges. To overcome these challenges, a prominent approach is to convert the AV stack into an end-to-end neural network and train it with data. While such approaches have achieved impressive results, they typically lack interpretability and reusability, and they eschew principled analytical components, such as planning and control, in favor of deep neural networks. To enable the joint optimization of AV stacks while retaining modularity, we present DiffStack, a differentiable and modular stack for prediction, planning, and control. Crucially, our model-based planning and control algorithms leverage recent advancements in differentiable optimization to produce gradients, enabling optimization of upstream components, such as prediction, via backpropagation through planning and control. Our results on the nuScenes dataset indicate that end-to-end training with DiffStack yields substantial improvements in open-loop planning metrics by, e.g., learning to make fewer prediction errors that would affect planning. Beyond these immediate benefits, DiffStack opens up new opportunities for fully data-driven yet modular and interpretable AV architectures.
Peter Karkus · Boris Ivanovic · Shie Mannor · Marco Pavone

- Stress-Testing Point Cloud Registration on Automotive LiDAR (Poster)
Rigid Point Cloud Registration (PCR) algorithms aim to estimate the 6-DOF relative motion between two point clouds, which is important in various fields, including autonomous driving.
Recent years have seen a significant improvement in global PCR algorithms, i.e., algorithms that can handle a large relative motion. This has been demonstrated in various scenarios, including indoor scenes, but has only been minimally tested in the Automotive setting, where point clouds are produced by vehicle-mounted LiDAR sensors. In this work, we aim to answer questions that are important for automotive applications, including: which of the new algorithms is the most accurate, and which is fastest? How transferable are deep-learning approaches, e.g., what happens when you train a network with data from Boston, and run it in a vehicle in Singapore? How small can the overlap between point clouds be before the algorithms start to deteriorate? To what extent are the algorithms rotation invariant? Our results are at times surprising. When comparing robust parameter estimation methods for registration, we find that the fastest and most accurate is not one of the newest approaches. Instead, it is a modern variant of the well known RANSAC technique. We also suggest a new outlier filtering method, Grid-Prioritized Filtering (GPF), to further improve it. An additional contribution of this work is an algorithm for selecting challenging sets of frame-pairs from automotive LiDAR datasets. This enables meaningful benchmarking in the Automotive LiDAR setting, and can also improve training for learning algorithms.
Amnon Drory · Raja Giryes · Shai Avidan

- Analyzing Deep Learning Representations of Point Clouds for Real-Time In-Vehicle LiDAR Perception (Poster)
LiDAR sensors are an integral part of modern autonomous vehicles as they provide an accurate, high-resolution 3D representation of the vehicle's surroundings. However, it is computationally difficult to make use of the ever-increasing amounts of data from multiple high-resolution LiDAR sensors.
As frame-rates, point cloud sizes and sensor resolutions increase, real-time processing of these point clouds must still extract semantics from this increasingly precise picture of the vehicle's environment. One deciding factor of the run-time performance and accuracy of deep neural networks operating on these point clouds is the underlying data representation and the way it is computed. In this work, we examine the relationship between the computational representations used in neural networks and their performance characteristics. To this end, we propose a novel computational taxonomy of LiDAR point cloud representations used in modern deep neural networks for 3D point cloud processing. Using this taxonomy, we perform a structured analysis of different families of approaches. Thereby, we uncover common advantages and limitations in terms of computational efficiency, memory requirements, and representational capacity as measured by semantic segmentation performance. Finally, we provide some insights and guidance for future developments in neural point cloud processing methods.
Marc Uecker · Tobias Fleck · Marcel Pflugfelder · Marius Zöllner

- Direct LiDAR-based object detector training from automated 2D detections (Poster)
3D Object detection (3DOD) is an important component of many applications; however, existing methods rely heavily on datasets of depth and image data which require expensive annotation in 3D, limiting the ability to collect a diverse dataset that truly represents the long tail of potential scenes in the wild. In this work we propose to utilise a readily available robust 2D Object Detector and to transfer information about objects from 2D to 3D, allowing us to train a 3D Object Detector without the need for any human annotation in 3D.
We demonstrate that our method significantly outperforms previous 3DOD methods supervised by only 2D annotations, and that our method narrows the accuracy gap between methods that use 3D supervision and those that do not.
Robert McCraith · Eldar Insafutdinov · Lukas Neumann · Andrea Vedaldi

- Safe Real-World Autonomous Driving by Learning to Predict and Plan with a Mixture of Experts (Poster)
The goal of autonomous vehicles is to navigate public roads safely and comfortably. To enforce safety, traditional planning approaches rely on handcrafted rules to generate trajectories. Machine learning-based systems, on the other hand, scale with data and are able to learn more complex behaviors. However, they often ignore that agents and self-driving vehicle trajectory distributions can be leveraged to improve safety. In this paper, we propose modeling a distribution over multiple future trajectories for both the self-driving vehicle and other road agents, using a unified neural network architecture for prediction and planning. During inference, we select the planning trajectory that minimizes a cost taking into account safety and the predicted probabilities. Our approach does not depend on any rule-based planners for trajectory generation or optimization, improves with more training data and is simple to implement. We extensively evaluate our method through a realistic simulator and show that the predicted trajectory distribution corresponds to different driving profiles. We also successfully deploy it on a self-driving vehicle on urban public roads, confirming that it drives safely without compromising comfort. The code for training and testing our model on a public prediction dataset and the video of the road test are available at https://woven.mobi/safepathnet.
Stefano Pini · Christian Perone · Aayush Ahuja · Ana Sofia Rufino Ferreira · Moritz Niendorf · Sergey Zagoruyko

- Are All Vision Models Created Equal? A Study of the Open-Loop to Closed-Loop Causality Gap (Poster)
There is an ever-growing zoo of modern neural network models that can efficiently learn end-to-end control from visual observations. These advanced deep models, ranging from convolutional to patch-based networks, have been extensively tested on offline image classification and regression tasks. In this paper, we study these vision architectures with respect to the open-loop to closed-loop causality gap, i.e., offline training followed by an online closed-loop deployment. This causality gap emerges in end-to-end autonomous driving, where a network is trained to imitate the control commands of a human. In this setting, two situations arise: 1) Closed-loop testing in-distribution, where the test environment shares properties with those of offline training data. 2) Closed-loop testing under distribution shifts and out-of-distribution. Contrary to recently reported results, we show that under proper training guidelines, all vision models perform indistinguishably well on in-distribution deployment, resolving the causality gap. In situation 2, we observe that the causality gap disrupts performance regardless of the choice of the model architecture. Our results imply that the causality gap can be solved in situation one with our proposed training guideline with any modern network architecture, whereas achieving out-of-distribution generalization (situation two) requires further investigations, for instance, on data diversity rather than the model architecture.
Mathias Lechner · Ramin Hasani · Alexander Amini · Tsun-Hsuan Johnson Wang · Thomas Henzinger · Daniela Rus

- Verifiable Goal Recognition for Autonomous Driving with Occlusions (Poster)
Goal recognition (GR) allows the future behaviour of vehicles to be more accurately predicted. GR involves inferring the goals of other vehicles, such as a certain junction exit.
In autonomous driving, vehicles can encounter many different scenarios and the environment is partially observable due to occlusions. We present a novel GR method named Goal Recognition with Interpretable Trees under Occlusion (OGRIT). We demonstrate that OGRIT can handle missing data due to occlusions and make inferences across multiple scenarios using the same learned decision trees, while still being fast, accurate, interpretable and verifiable. We also present the inDO and roundDO datasets of occluded regions used to evaluate OGRIT.
Cillian Brewitt · Massimiliano Tamborski · Stefano Albrecht

- PnP-Nav: Plug-and-Play Policies for Generalizable Visual Navigation Across Robots (Poster)
Learning provides a powerful tool for vision-based navigation, but the capabilities of learning-based policies are constrained by limited training data. If we could combine data from all available sources, including multiple kinds of robots, we could train more powerful navigation models. In this paper, we study how goal-conditioned policies for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots, and enable broad generalization across environments and embodiments. We analyze the necessary design decisions for effective data sharing across different robots, including the use of temporal context and standardized action spaces, and demonstrate that an omnipolicy trained from heterogeneous datasets outperforms policies trained on any single dataset. We curate 60 hours of navigation trajectories from 6 distinct robots, and deploy the trained omnipolicy on a range of new robots, including an underactuated quadrotor. We also find that training on diverse, multi-robot datasets leads to robustness against degradation in sensing and actuation.
Using a pre-trained base navigational omnipolicy with broad generalization capabilities can bootstrap navigation applications on novel robots going forward, and we hope that PnP represents a step in that direction.
Dhruv Shah · Ajay Sridhar · Arjun Bhorkar · Noriaki Hirose · Sergey Levine

- Finding Safe Zones of Markov Decision Processes Policies (Poster)
Given a policy, we define a SafeZone as a subset of states, such that most of the policy's trajectories are confined to this subset. The quality of the SafeZone is parameterized by the number of states and the escape probability, i.e., the probability that a random trajectory will leave the subset. SafeZones are especially interesting when they have a small number of states and low escape probability. We study the complexity of finding optimal SafeZones and show that in general, the problem is computationally hard. For this reason, we concentrate on computing approximate SafeZones. Our main result is a bi-criteria approximation algorithm which gives a factor of almost 2 approximation for both the escape probability and SafeZone size, using a polynomial size sample complexity. We conclude the paper with an empirical demonstration of our algorithm.
Lee Cohen · Yishay Mansour · Michal Moshkovitz

- CW-ERM: Improving Autonomous Driving Planning with Closed-loop Weighted Empirical Risk Minimization (Poster)
The imitation learning of self-driving vehicle policies through behavioral cloning is often carried out in an open-loop fashion, ignoring the effect of actions on future states.
Training such policies purely with Empirical Risk Minimization (ERM) can be detrimental to real-world performance, as it biases policy networks towards matching only open-loop behavior, showing poor results when evaluated in closed-loop. In this work, we develop an efficient and simple-to-implement principle called Closed-loop Weighted Empirical Risk Minimization (CW-ERM), in which a closed-loop evaluation procedure is first used to identify training data samples that are important for practical driving performance and then we use these samples to help debias the policy network. We evaluate CW-ERM on a challenging urban driving dataset and show that this procedure yields a significant reduction in collisions as well as other non-differentiable closed-loop metrics.
Eesha Kumar · Yiming Zhang · Stefano Pini · Simon Stent · Ana Sofia Rufino Ferreira · Sergey Zagoruyko · Christian Perone

- Missing Traffic Data Imputation Using Multi-Trajectory Parameter Transferred LSTM (Poster)
We propose a lightweight data imputation algorithm for spatio-temporal data based on multi-trajectory parameter transferred Long Short-Term Memory (MPT-LSTM) in a traffic environment where the roadside units (RSUs) collect traffic information. In this paper, we consider a scenario where the RSUs have accidentally broken down or malfunctioned but cannot be immediately recovered, temporarily incurring massive data loss. Unlike existing imputation algorithms based on LSTM, the proposed architecture reduces the dimensions of input data by separating the data into trajectories in the spatio-temporal domain, thereby allowing us to train the model with irreducible LSTMs for each single data input. The designed approach can transfer parameters between irreducible LSTM phases, which trains data collected from an RSU, adopts spatial interpolation, and shows robust imputation accuracy for the trajectories.
We show that the proposed MPT-LSTM improves imputation accuracy and significantly reduces the number of parameters and operations, which leads to a reduction in LSTM memory space and execution time. The proposed algorithm can also show robust and efficient performance in the experiments using real-world vehicle speed data collected from expressways and urban areas.
Jungmin Kwon · Hyunggon Park

- Uncertainty-aware self-training with expectation maximization basis transformation (Poster)
Self-training is a powerful approach to deep learning. The key process is to find a pseudo-label for modeling. However, previous self-training algorithms suffer from the over-confidence issue brought by the hard labels; even some confidence-related regularizers cannot comprehensively capture the uncertainty. Therefore, we propose a new self-training framework to combine uncertainty information of both model and dataset. Specifically, we propose to use Expectation-Maximization (EM) to smooth the labels and comprehensively estimate the uncertainty information. We further design a basis extraction network to estimate the initial basis from the dataset. The obtained basis with uncertainty can be filtered based on uncertainty information. It can then be transformed into the real hard label to iteratively update the model and basis in the retraining process. Experiments on image classification and semantic segmentation show the advantages of our methods among confidence-aware self-training algorithms with 1-3 percentage improvement on different datasets.
Zijia Wang · Wenbin Yang · Zhi-Song Liu · Zhen Jia

- Potential Energy based Mixture Model for Noisy Label Learning (Poster)
Training deep neural networks (DNNs) from noisy labels is an important and challenging task. However, most existing approaches focus on the corrupted labels and ignore the importance of inherent data structure.
To bridge the gap between noisy labels and data, inspired by the concept of potential energy in physics, we propose a novel Potential Energy based Mixture Model (PEMM) for noisy-label learning. We innovate a distance-based classifier with the potential energy regularization on its class centers. Embedding our proposed classifier with existing deep learning backbones, we can have robust networks with better feature representations. They can preserve intrinsic structures from the data, resulting in a superior noise tolerance. We conducted extensive experiments to analyze the efficiency of our proposed model on several real-world datasets. Quantitative results show that it can achieve state-of-the-art performance. Zijia Wang · Wenbin Yang · Zhi-Song Liu · Zhen Jia 🔗 - Rationale-aware Autonomous Driving Policy utilizing Safety Force Field implemented on CARLA Simulator (Poster)    Despite the rapid improvement of autonomous driving technology in recent years, automotive manufacturers must resolve liability issues to commercialize autonomous passenger cars of SAE J3016 Level 3 or higher. To cope with the product liability law, manufacturers develop autonomous driving systems in compliance with international standards for safety such as ISO 26262 and ISO 21448. Concerning the safety of the intended functionality (SOTIF) requirement in ISO 26262, the driving policy recommends providing an explicit rational basis for maneuver decisions. In this case, mathematical models such as Safety Force Field (SFF) and Responsibility-Sensitive Safety (RSS), which offer interpretability of decisions, may be suitable. In this work, we implement SFF from scratch to substitute for NVIDIA's undisclosed source code and integrate it with the open-source CARLA simulator. 
Using SFF and CARLA, we present a predictor for claimed sets of vehicles, and based on the predictor, propose an integrated driving policy that consistently operates regardless of safety conditions it encounters while passing through dynamic traffic. The policy does not have a separate plan for each condition, but using safety potential, it aims for human-like driving that blends in with traffic flow. Ho Suk · Taewoo Kim · Hyungbin Park · PAMUL YADAV · Junyong Lee · Shiho Kim 🔗 - PlanT: Explainable Planning Transformers via Object-Level Representations (Poster)    Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations of the scene containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. With this representation, we demonstrate that information regarding the ego vehicle's route provides sufficient context regarding the road layout for planning. On the challenging Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 9 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. 
Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant. Katrin Renz · Kashyap Chitta · Otniel-Bogdan Mercea · A. Sophia Koepke · Zeynep Akata · Andreas Geiger 🔗 - Multi-Modal 3D GAN for Urban Scenes (Poster)    Recently, a number of works have explored training 3D-aware Generative Adversarial Networks (GANs) that include a neural rendering layer in the generative pipeline. Doing so, they succeed in building models that can infer impressive 3D information while being trained solely on 2D images. However, they have been mostly applied to images centered around an object. Transitioning to driving scenes is still a challenge, as not only are the scenes open and more complex, but one also usually does not have access to as many diverse viewpoints. Typically only the front camera view is available. We investigate in this work how amenable 3D GANs are to such a setup, and propose a method to leverage information from LiDAR sensors to alleviate the detected issues. Loïck Chambon · Mickael Chen · Tuan-Hung VU · Alexandre Boulch · Andrei Bursuc · Matthieu Cord · Patrick Pérez 🔗 - KING: Generating Safety-Critical Driving Scenarios for Robust Imitation via Kinematics Gradients (Poster)    Simulators offer the possibility of scalable development of self-driving systems. However, current driving simulators exhibit naïve behavior models for background traffic. Hand-tuned scenarios are typically used to induce safety-critical situations. An alternative approach is to adversarially perturb the background traffic trajectories. In this paper, we study this approach to safety-critical driving scenario generation using the CARLA simulator. We use a kinematic bicycle model as a proxy to the simulator's true dynamics and observe that gradients through this proxy model are sufficient for optimizing the background traffic trajectories. 
Based on this finding, we propose KING, which generates safety-critical driving scenarios with a 20% higher success rate than black-box optimization, which previous work relies on. Furthermore, we demonstrate that the generated scenarios can be used to fine-tune imitation learning agents, leading to improved collision avoidance. Niklas Hanselmann · Katrin Renz · Kashyap Chitta · Apratim Bhattacharyya · Andreas Geiger 🔗 - Enhancing System-level Safety in Autonomous Driving via Feedback Learning (Poster)    The perception component of autonomous driving systems is often designed and tuned in isolation from the control component based on well-known performance measures such as accuracy, precision, and recall. Commonly used loss functions such as cross-entropy loss and negative log-likelihood only focus on minimizing the loss with respect to misclassification without considering the consequences that follow after the misclassifications. In other words, this approach fails to take into account the difference in the severity of system-level failures due to misclassification and other errors in perception components. Therefore in this work, we proposed a novel feedback learning training framework to build the perception component of an autonomous system that is aware of system-level safety objectives, which in turn, enhances the safety of the vehicle as a whole. The crux of the idea is to utilize the concept of a rulebook to provide feedback on system-level performance as safety scores and leverage them in designing and computing the loss functions for the models in the framework. The experimental results show the perception model gets improved by feedback from the system-level safety rule. The framework was trained and tested on an open-sourced dataset, and the experimental results showed that the resulting model had shown superior system-level safety performance over the baseline perception model. 
Sin Yong Tan · Weisi Fan · Qisai Liu · Tichakorn Wongpiromsarn · Soumik Sarkar 🔗 - Imitation Is Not Enough: Robustifying Imitation with Reinforcement Learning for Challenging Driving Scenarios (Poster)    Imitation learning (IL) is a simple and powerful way to use high-quality human driving data, which can be collected at scale, to identify driving preferences and produce human-like behavior. However, policies based on imitation learning alone often fail to sufficiently account for safety and reliability concerns. In this paper, we show how imitation learning combined with reinforcement learning using simple rewards can substantially improve the safety and reliability of driving policies over those learned from imitation alone. In particular, we use imitation and reinforcement learning to train a policy on over 100k miles of urban driving data, and measure its effectiveness in test scenarios grouped by different levels of collision risk. To our knowledge, this is the first application of a combined imitation and reinforcement learning approach in autonomous driving that utilizes large amounts of real-world human driving data. Yiren Lu · Justin Fu · George Tucker · Xinlei Pan · Eli Bronstein · Rebecca Roelofs · Benjamin Sapp · Brandyn White · Aleksandra Faust · Shimon Whiteson · Dragomir Anguelov · Sergey Levine 🔗 - VISTA: VIrtual STereo based Augmentation for Depth Estimation in Automated Driving (Poster)    Depth estimation is the primary task for automated vehicles to perceive the 3D environment. The classical approach for depth estimation leverages stereo cameras on the cars. This approach can provide accurate and robust depth estimation, but also requires a more expensive setup and detailed calibration. The recent trend of depth estimation, therefore, focuses on learning the depth from monocular videos. These approaches only need an easy setup but may also be vulnerable to occlusion or light condition changes in the scene. 
In this work, we propose a novel idea that exploits the fact that data collected by large fleets naturally contains scenarios where vehicles with monocular cameras drive close to each other and are looking at the same scene. Our approach combines the monocular view of the ego vehicle and the neighboring vehicle to form a virtual stereo pair during training, while still only requiring the monocular image during inference. With such a virtual stereo view, we are able to train self-supervised depth estimation by two sources of constraints: 1) the spatial and temporal constraints between sequential monocular frames; 2) the geometric constraints between the frames from two cameras that form the virtual stereo. Public datasets for multiple vehicles sharing the common view to form possible virtual stereo views do not exist, and so we also created our own synthetic dataset using the CARLA simulator where multiple vehicles can observe the same scene at the same time. The evaluation shows that our virtual stereo approach can improve the ego vehicle's depth estimation accuracy by 8%, compared to the approaches that use monocular frames only. Bin Cheng · Kshitiz Bansal · Mehul Agarwal · Gaurav Bansal · Dinesh Bharadia 🔗 - Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-based Fault Detection and Identification (Poster)    Perception is a critical component of high-integrity applications of robotics and autonomous systems, such as self-driving cars. In these applications, failure of perception systems may put human life at risk, and a broad adoption of these technologies requires the development of methodologies to guarantee and monitor safe operation. Despite the paramount importance of perception, currently there is no formal approach for system-level perception monitoring. This paper investigates runtime monitoring of perception systems. 
We formalize the problem of runtime fault detection and identification in perception systems and present a framework to model diagnostic information using a diagnostic graph. We then provide a set of deterministic, probabilistic, and learning-based algorithms that use diagnostic graphs to perform fault detection and identification. Moreover, we investigate fundamental limits and provide deterministic and probabilistic guarantees on the fault detection and identification results. We conclude the paper with an extensive experimental evaluation, which recreates several realistic failure modes in the LGSVL open-source autonomous driving simulator, and applies the proposed system monitors to a state-of-the-art autonomous driving software stack (Baidu’s Apollo Auto). The results show that the proposed system monitors outperform baselines, have the potential of preventing accidents in realistic autonomous driving scenarios, and incur a negligible computational overhead. Pasquale Antonante · Heath Nilsen · Luca Carlone 🔗 - Risk Perception in Driving Scenes (Poster) The holy grail of intelligent vehicles is to enable a zero collision mobility experience. This endeavor requires an interdisciplinary effort to understand driver behavior and to assess risks surrounding the vehicle. A driver's perception of risk is a complex cognitive process that is largely manifested by the voluntary response of the driver to external stimuli as well as the apparent attentiveness of participants toward the ego-vehicle. In this work, we examine the problem of risk perception and introduce a new dataset to facilitate research in this domain. Our dataset consists of 4706 short video clips that include annotations of driver intent, road network topology, situation (e.g., crossing pedestrian), driver response, and pedestrian attentiveness using face annotations. We also provide a simple weakly supervised framework which performs favorably against state of the art methods. 
Nakul Agarwal · Yi-Ting Chen 🔗 - Robust Trajectory Prediction against Adversarial Attacks (Poster)    Trajectory prediction using deep neural networks (DNNs) is an essential component of autonomous driving (AD) systems. However, these methods are vulnerable to adversarial attacks, leading to serious consequences such as collisions. In this work, we identify two key ingredients to defend trajectory prediction models against adversarial attacks: (1) designing effective adversarial training methods and (2) adding domain-specific data augmentation to mitigate the performance degradation on clean data. We demonstrate that our method is able to improve the performance by 46% on adversarial data at the cost of only 3% performance degradation on clean data, compared to the model trained with clean data. Additionally, compared to existing robust methods, our method can improve performance by 21% on adversarial examples and 9% on clean data. Our robust model is evaluated with a planner to study its downstream impacts. We demonstrate that our model can significantly reduce the severe accident rates (e.g., collisions and off-road driving). Yulong Cao · Danfei Xu · Xinshuo Weng · Zhuoqing Morley Mao · Anima Anandkumar · Chaowei Xiao · Marco Pavone 🔗 - Calibrated Perception Uncertainty Across Objects and Regions in Bird's-Eye-View (Poster)    In driving scenarios with poor visibility or occlusions, it is important that the autonomous vehicle take into account all the uncertainties when making driving decisions, including choice of a safe speed. The grid-based perception outputs such as occupancy grids and object-based outputs such as lists of detected objects must then be accompanied by well-calibrated uncertainty estimates. We highlight limitations in the state-of-the-art and propose a more complete set of uncertainties to be reported, particularly including undetected-object-ahead probabilities. 
We suggest a novel way to get these probabilistic outputs from bird’s-eye-view probabilistic semantic segmentation, using the FIERY model as an example. We demonstrate that the obtained probabilities are not calibrated out-of-the-box and propose methods to achieve well-calibrated uncertainties. Markus Kängsepp · Meelis Kull 🔗 - DriveCLIP: Zero-shot transfer for distracted driving activity understanding using CLIP (Poster)    Distracted driving action recognition from naturalistic driving is crucial for both driver and pedestrian's safe and reliable experience. However, traditional computer vision techniques sometimes require a lot of supervision in terms of a large amount of annotated training data to detect distracted driving activities. Recently, the vision-language models have offered large-scale visual-textual pretraining that can be adapted to unsupervised task-specific learning like distracted activity recognition. The contrastive image-text pretraining models like CLIP have shown significant promise in learning natural language-guided visual representations. In this paper, we propose a CLIP-based driver activity recognition framework that predicts whether a driver is distracted or not while driving. CLIP's vision embedding offers zero-shot transfer, which can identify distracted activities by the driver from the driving videos. Our results suggest this framework offers SOTA performance on zero-shot transfer for predicting the driver's state on three public datasets. We also developed DriveCLIP, a classifier on top of the CLIP's visual representation for distracted driving detection tasks, and reported the results here. Md Zahid Hasan · Ameya Joshi · Mohammed Shaiqur Rahman · Venkatachalapathy Archana · Anuj Sharma · Chinmay Hegde · Soumik Sarkar 🔗 - AdvDO: Realistic Adversarial Attacks for Trajectory Prediction (Poster)    Trajectory prediction is essential for autonomous vehicles (AVs) to plan correct and safe driving behaviors. 
While many prior works aim to achieve higher prediction accuracy, few study the adversarial robustness of their methods. To bridge this gap, we propose to study the adversarial robustness of data-driven trajectory prediction systems. We devise an optimization-based adversarial attack framework that leverages a carefully-designed differentiable dynamic model to generate realistic adversarial trajectories. Empirically, we benchmark the adversarial robustness of state-of-the-art prediction models and show that our attack increases the prediction error for both general metrics and planning-aware metrics by more than 50% and 37%, respectively. We also show that our attack can lead an AV to drive off-road or collide with other vehicles in simulation. Finally, we demonstrate how to mitigate the adversarial attacks using an adversarial training scheme. Yulong Cao · Chaowei Xiao · Anima Anandkumar · Danfei Xu · Marco Pavone 🔗 - One-Shot Learning of Visual Path Navigation for Autonomous Vehicles (Poster)    Autonomous driving presents many challenges due to the large number of scenarios the autonomous vehicle (AV) may encounter. End-to-end deep learning models are comparatively simplistic models that can handle a broad set of scenarios. However, end-to-end models require large amounts of diverse data to perform. This paper presents a novel deep neural network that performs image-to-steering path navigation that helps with the data problem by adding one-shot learning to the system. Presented with a new path in a new but related environment, the vehicle can drive the path autonomously after being shown the path once and without model retraining. In fact, the full path is not needed and images of the road junctions are sufficient. In-vehicle testing and offline testing are performed to verify the performance of the proposed navigation by comparing different architectures. 
Zhongying CuiZhu · Francois Charette · Amin Ghafourian · Debo Shi · Matthew Cui · Anjali Krishnamachar · Iman Soltani 🔗 - TALISMAN: Targeted Active Learning for Object Detection with Rare Classes and Slices using Submodular Mutual Information (Poster)    Deep neural network based object detectors have shown great success in a variety of domains like autonomous vehicles, biomedical imaging, etc. It is known that their success depends on a large amount of data from the domain of interest. While deep models often perform well in terms of overall accuracy, they often struggle in performance on rare yet critical data slices. For example, data slices like "motorcycle at night" or "bicycle at night" are often rare but very critical slices for self-driving applications, and false negatives on such rare slices could result in ill-fated failures and accidents. Active learning (AL) is a well-known paradigm to incrementally and adaptively build training datasets with a human in the loop. However, current AL based acquisition functions are not well-equipped to tackle real-world datasets with rare slices, since they are based on uncertainty scores or global descriptors of the image. We propose TALISMAN, a novel framework for Targeted Active Learning for object detectIon with rare slices using Submodular MutuAl iNformation. Our method uses the submodular mutual information functions instantiated using features of the region of interest (RoI) to efficiently target and acquire data points with rare slices. We evaluate our framework on the standard PASCAL VOC07+12 and BDD100K, a real-world self-driving dataset. We observe that TALISMAN outperforms other methods in terms of average precision on rare slices, and in terms of mAP. Suraj Kothawade · Saikat Ghosh · Sumit Shekhar · Yu Xiang · Rishabh Iyer 🔗
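To make the targeted-selection idea in the TALISMAN abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a facility-location style mutual information score between a selected set and a small query set of rare-slice exemplars, maximized greedily. The abstract does not say which submodular mutual information function is used, so this choice is an assumption. The names `cosine`, `flqmi`, `greedy_select`, and `eta` are hypothetical, and the toy 2-D vectors stand in for the RoI features mentioned in the abstract.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity clamped to [0, 1] so the score stays monotone."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return max(0.0, dot / (nu * nv))


def flqmi(selected: List[int], pool: List[Vector], queries: List[Vector],
          eta: float = 1.0) -> float:
    """Facility-location style score between the selected pool items and the queries.

    Assumed form (hypothetical here, not taken from the abstract):
      sum over queries of the best similarity to any selected item
      + eta * sum over selected items of the best similarity to any query.
    """
    if not selected:
        return 0.0
    query_cover = sum(max(cosine(q, pool[i]) for i in selected) for q in queries)
    relevance = sum(max(cosine(pool[i], q) for q in queries) for i in selected)
    return query_cover + eta * relevance


def greedy_select(pool: List[Vector], queries: List[Vector], budget: int,
                  eta: float = 1.0) -> List[int]:
    """Greedily pick `budget` pool indices with the largest marginal gain."""
    selected: List[int] = []
    remaining = set(range(len(pool)))
    for _ in range(min(budget, len(pool))):
        current = flqmi(selected, pool, queries, eta)
        # Ties are broken by the lowest index because candidates are sorted.
        best = max(
            sorted(remaining),
            key=lambda i: flqmi(selected + [i], pool, queries, eta) - current,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: items 2 and 3 resemble the rare-slice exemplar, items 0 and 1 do not.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
queries = [[0.0, 1.0]]
picked = greedy_select(pool, queries, budget=2)
print(picked)
```

In this toy setup the two pool items closest to the exemplar are acquired first, which is the behavior a targeted acquisition function is meant to have. A real pipeline would replace the toy vectors with detector features and send the picked items to a human annotator.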