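We are at a pivotal stage in which artificial intelligence is moving from the "world inside the screen" into the "real physical world." The goal is for AI not only to understand images and language, but also to move autonomously through real spaces, follow instructions, and complete tasks.
MTU3D brings "understanding" and "exploration" together: much as a person would, the AI explores its environment while interpreting instructions, gradually building up a picture of the world around it. Trained on a combination of real and simulated data, MTU3D not only performs well in simulators but can also complete tasks on a real robot, offering new ideas and more room for imagination for future work on embodied navigation.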