DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Dong Zhuo*, Wenzhao Zheng*, †, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu
Tsinghua University
Yinwang Intelligent Technology Co. Ltd.
[Paper (arXiv)]     [Code (GitHub)]
*Equal contribution. †Project leader.

Overview of our contributions. DriveTok is an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. It transforms surround-view images into unified scene tokens that jointly encode textural, semantic, and geometric information, enabling high-quality reconstruction and scene understanding for autonomous driving.

Unified 3D Scene Tokenization for Multi-View Reconstruction and Understanding


Overall Framework of DriveTok


We propose DriveTok for multi-view scene reconstruction and understanding. Vision-only surround-view images are processed by a 3D scene encoder to produce unified scene tokens on a BEV grid, independent of camera layout and resolution. A spatially aware multi-view decoder then renders predictions in both image and occupancy spaces. Through joint multi-task training, the scene tokens learn to encode rich textural, semantic, and geometric information.
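The data flow above can be sketched at the shape level. This is a minimal, hypothetical illustration of the pipeline's interfaces (surround-view images in, a fixed-size BEV token grid in the middle, image and occupancy outputs out); the module internals here are placeholder projections, not the actual DriveTok model, and all dimensions are made up for illustration.

```python
import numpy as np

# Illustrative dimensions only; the real token grid size is a model choice.
BEV_H, BEV_W, TOKEN_DIM = 50, 50, 32

def encode_scene(images, rng):
    """Placeholder 3D scene encoder: surround-view images -> unified scene
    tokens on a BEV grid. The output shape is fixed regardless of how many
    cameras there are or what resolution they use."""
    n_views = images.shape[0]
    # Placeholder: average-pool each view, project, and broadcast over the grid.
    view_feats = images.reshape(n_views, -1).mean(axis=1)        # (n_views,)
    proj = rng.standard_normal((n_views, TOKEN_DIM))
    pooled = view_feats @ proj                                   # (TOKEN_DIM,)
    return np.broadcast_to(pooled, (BEV_H, BEV_W, TOKEN_DIM)).copy()

def decode_scene(tokens, n_views=6, img_hw=(28, 50), occ_shape=(200, 200, 16)):
    """Placeholder multi-view decoder: scene tokens -> per-camera images
    and a semantic occupancy grid (both stubs here)."""
    imgs = np.zeros((n_views, *img_hw, 3)) + tokens.mean()
    occ = np.zeros(occ_shape) + tokens.mean()
    return imgs, occ

rng = np.random.default_rng(0)
images = rng.random((6, 28, 50, 3))      # six surround-view cameras
tokens = encode_scene(images, rng)       # (BEV_H, BEV_W, TOKEN_DIM)
imgs, occ = decode_scene(tokens)
```

The point of the sketch is the interface: the BEV token grid is the single intermediate representation that both the image branch and the occupancy branch decode from.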

Results


Image Reconstruction


We evaluate the image reconstruction performance of DriveTok on the nuScenes dataset. Our method achieves performance comparable to all baselines across all six cameras, demonstrating that DriveTok adapts well to the multi-view inputs of autonomous driving.
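The page does not name the reconstruction metrics; PSNR is a standard choice for this task, so here is a minimal reference implementation under that assumption.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB, a standard image-reconstruction
    metric (assumed here; the page does not specify which metrics are used)."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# A reconstruction off by 0.1 everywhere on a [0, 1] image:
value = psnr(np.full((4, 4), 0.6), np.full((4, 4), 0.5))  # ≈ 20.0 dB
```

Per-camera PSNR would be averaged over the six surround views to produce a single number per method.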

Depth Estimation


We evaluate multi-view depth estimation on the nuScenes dataset. DriveTok achieves the lowest AbsRel and the highest δ < 1.25 accuracy among all compared methods. These results demonstrate that the unified scene tokens effectively capture both local geometry and global scene structure, forming a reliable geometric foundation for downstream perception tasks.
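The two reported metrics have standard definitions: AbsRel is the mean relative depth error (lower is better), and δ < 1.25 is the fraction of pixels whose predicted-to-true depth ratio (taken in whichever direction exceeds 1) stays below 1.25 (higher is better). A minimal sketch of both:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """AbsRel and delta < 1.25 accuracy, the standard monocular-depth
    metrics referenced in the comparison above."""
    valid = gt > eps                 # ignore pixels without ground-truth depth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)   # symmetric ratio, always >= 1
    delta1 = np.mean(ratio < 1.25)
    return abs_rel, delta1

# A prediction uniformly 10% too deep: AbsRel ≈ 0.10, delta1 = 1.0.
gt = np.full(100, 2.0)
abs_rel, delta1 = depth_metrics(gt * 1.1, gt)
```

In the multi-view setting, these are computed over all valid pixels of all six cameras.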

Occupancy Prediction


We evaluate DriveTok on the occupancy prediction task, which directly measures how well the learned tokens capture the geometric and semantic structure of the driving scene. The quantitative results show that DriveTok significantly outperforms existing methods under the same setting, highlighting its ability to learn rich spatial and semantic information from multi-view inputs.
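Occupancy prediction on nuScenes-style benchmarks is commonly scored by mean intersection-over-union (mIoU) across semantic classes; assuming that metric here, a minimal per-voxel implementation looks like:

```python
import numpy as np

def occupancy_miou(pred, gt, num_classes):
    """mIoU over semantic occupancy classes (assumed metric; pred and gt
    are integer class labels per voxel). Classes absent from both
    prediction and ground truth are skipped."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny example with two classes over four voxels:
score = occupancy_miou(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), 2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mIoU is 7/12.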

Visualizations


Bibtex

@article{drivetok,
      title={DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding}, 
      author={Dong Zhuo and Wenzhao Zheng and Sicheng Zuo and Siming Yan and Lu Hou and Jie Zhou and Jiwen Lu},
      journal={arXiv preprint arXiv:2603.19219},
      year={2026}
}
