DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
Overview of our contributions. DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding, transforms surround-view images into unified scene tokens that jointly encode textural, semantic, and geometric information, enabling high-quality reconstruction and scene understanding for autonomous driving.
We propose DriveTok for multi-view scene reconstruction and understanding. Vision-only surround-view images are processed by a 3D scene encoder to produce unified scene tokens on a BEV grid, independent of camera layout and resolution. A spatial-aware multi-view decoder renders predictions in both image and occupancy spaces. Through joint multi-task training, the scene tokens encode rich textural, semantic, and geometric information.
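To illustrate what "independent of camera layout and resolution" can mean in practice, here is a minimal toy sketch in which the number of scene tokens is fixed by the BEV grid alone. It is not DriveTok's encoder: the function `toy_scene_tokens`, the random "learned" queries and projection, and the use of plain cross-attention pooling are all assumptions made for illustration.

```python
import numpy as np


def toy_scene_tokens(images, bev_hw=(4, 4), dim=8, seed=0):
    """Map any number of RGB images of any resolution to (H*W, dim) tokens.

    Hypothetical stand-in for a 3D scene encoder; the weights are random
    placeholders, not trained parameters.
    """
    rng = np.random.default_rng(seed)
    h, w = bev_hw
    # Assumed: one query vector per BEV grid cell.
    queries = rng.standard_normal((h * w, dim))
    # Assumed: a linear map from RGB pixels to the feature dimension.
    proj = rng.standard_normal((3, dim))
    # Flatten every pixel of every camera into one shared feature set.
    feats = np.concatenate(
        [img.reshape(-1, 3) @ proj for img in images], axis=0
    )
    # Cross-attention: each BEV query takes a softmax-weighted average
    # over all image features, whatever their count.
    logits = queries @ feats.T / np.sqrt(dim)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats


rng = np.random.default_rng(1)
six_cams = [rng.random((8, 12, 3)) for _ in range(6)]
three_cams = [rng.random((16, 20, 3)) for _ in range(3)]

tokens_a = toy_scene_tokens(six_cams)
tokens_b = toy_scene_tokens(three_cams)
print(tokens_a.shape, tokens_b.shape)
```

Both calls return an array of the same shape, because the token count comes from `bev_hw` rather than from the number or size of the input images. A real encoder would additionally use camera geometry to relate pixels to BEV cells, which this sketch omits.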
@article{drivetok,
title={DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding},
author={Dong Zhuo and Wenzhao Zheng and Sicheng Zuo and Siming Yan and Lu Hou and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2603.19219},
year={2026}
}