MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds
Shaocong Dong · Lihe Ding · Haiyang Wang · Tingfa Xu · Xinli Xu · Jie Wang · Ziyang Bian · Ying Wang · Jianan Li


3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes usually feature significant variance in instance scales, and thus require features rich in both long-range and fine-grained information to support accurate detection. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To bridge this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which captures both types of information simultaneously through a divide-and-conquer approach. Specifically, MsSVT explicitly divides attention heads into multiple groups, each in charge of attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we provide a novel chessboard sampling strategy to reduce the computational complexity of applying a window-based transformer in 3D voxel space. To improve efficiency, we also implement the voxel sampling and gathering operations sparsely with a hash map. Owing to its powerful and efficient modeling of mixed-scale information, our single-stage detector built on top of MsSVT outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
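
The sparse sampling and gathering ideas mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`build_voxel_hash`, `chessboard_mask`, `gather_window`), the parity rule used for the chessboard pattern, and the cubic window shape are all assumptions made for illustration. It shows how a hash map from voxel coordinates to feature rows lets a query gather only non-empty neighbors, at a small and a large range.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]  # integer (x, y, z) voxel coordinate


def build_voxel_hash(coords: List[Coord]) -> Dict[Coord, int]:
    """Map each non-empty voxel coordinate to its row index (hypothetical helper)."""
    return {c: i for i, c in enumerate(coords)}


def chessboard_mask(coords: List[Coord], phase: int = 0) -> List[bool]:
    """Keep voxels whose (x + y) parity equals `phase`.

    Assumption: this parity rule is one simple reading of a "chessboard"
    pattern; the paper's actual sampling rule may differ.
    """
    return [(x + y) % 2 == phase for x, y, _ in coords]


def gather_window(table: Dict[Coord, int], center: Coord, radius: int) -> List[int]:
    """Return row indices of non-empty voxels in a cubic window around `center`."""
    cx, cy, cz = center
    found = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                idx = table.get((cx + dx, cy + dy, cz + dz))
                if idx is not None:  # empty voxels are simply absent from the map
                    found.append(idx)
    return found


# Toy scene with five non-empty voxels.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (3, 3, 0), (1, 1, 1)]
table = build_voxel_hash(coords)

# Only a subset of voxels act as queries.
queries = [c for c, keep in zip(coords, chessboard_mask(coords)) if keep]

# The same query gathers keys at two ranges, as different head groups might.
small = gather_window(table, (0, 0, 0), radius=1)
large = gather_window(table, (0, 0, 0), radius=3)
```

In this sketch the small window misses the distant voxel at `(3, 3, 0)` while the large one includes it, which is the kind of range difference that separate attention-head groups could exploit.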

Author Information

Shaocong Dong (Beijing Institute of Technology)
Lihe Ding (Beijing Institute of Technology)
Haiyang Wang (Peking University)
Tingfa Xu (Beijing Institute of Technology, Tsinghua University)
Xinli Xu (Beijing Institute of Technology)
Jie Wang (Beijing Institute of Technology)
Ziyang Bian (Beijing Institute of Technology)
Ying Wang (Beijing Institute of Technology)
Jianan Li (Beijing Institute of Technology)
