3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination
Abstract
The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
Community
3D-LLMs go brrrr! Excited to announce our latest research on scaling 3D-LLM training data to million scale with dense grounding.
Introducing 3D-GRAND: a pioneering dataset featuring 40,087 household scenes paired with 6.2 million densely grounded 3D-text pairs. https://3d-grand.github.io
We envision 3D-GRAND as the bedrock for future 3D-LLMs:
- 6.2 million instructions paired with 40k 3D household scenes
- Significantly enhances grounding and reduces hallucination in 3D-LLMs
- 3D-POPE: the first benchmark for systematic evaluation of hallucination in 3D-LLMs (see the sketch after this list)
- Data scaling and sim-to-real transfer provide strong early signals for a low-cost, scalable future for 3D-LLMs
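To make the hallucination-evaluation idea concrete, here is a minimal sketch of how a POPE-style benchmark can be scored: the model is polled with yes/no questions about whether an object exists in a scene, and standard classification metrics are computed over its answers. The data fields and question template below are illustrative assumptions, not the actual 3D-POPE format.

```python
# Minimal sketch of scoring a POPE-style existence poll.
# Assumption (not from the paper): each probe records whether an object
# category truly exists in the scene and whether the model answered "yes".
from dataclasses import dataclass
from typing import List

@dataclass
class ExistenceProbe:
    scene_id: str        # e.g. a scene identifier
    object_label: str    # category the model is asked about, e.g. "sofa"
    exists: bool         # ground truth: is the object really in the scene?
    model_says_yes: bool # model's answer to "Is there a {object_label} in the room?"

def score_pope(probes: List[ExistenceProbe]) -> dict:
    tp = sum(p.exists and p.model_says_yes for p in probes)
    fp = sum((not p.exists) and p.model_says_yes for p in probes)
    fn = sum(p.exists and (not p.model_says_yes) for p in probes)
    tn = sum((not p.exists) and (not p.model_says_yes) for p in probes)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(probes),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(probes),  # a high value suggests a "yes" bias, i.e. hallucination
    }
```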
What's special about this data?
- Dense grounding: unlike traditional 3D-text datasets, ours links every noun to an object in the 3D scene (see the sketch after this list).
- Large scale: we provide million-scale data, narrowing the gap between 3D and 2D vision-language datasets.
- Diverse tasks: we curated 8 diverse tasks to cover future 3D-LLM challenges.
- Hallucination: we took special care to curate a balanced dataset that helps reduce hallucination, and we introduce a benchmark for evaluating hallucination in 3D-LLMs.
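To illustrate what dense grounding means in practice, here is a hypothetical example of a densely grounded instruction: every noun phrase in the text carries a reference to an object instance in the scene. The exact schema used by 3D-GRAND may differ; the field names below are assumptions.

```python
# Hypothetical densely grounded annotation: each noun phrase in the text is
# tied to an object instance ID in the 3D scene. The real 3D-GRAND schema
# may use different field names and structure.
grounded_instruction = {
    "scene_id": "household_scene_00042",          # assumed identifier format
    "text": "The lamp is on the nightstand next to the bed.",
    "groundings": [
        {"phrase": "lamp",       "span": [4, 8],   "object_id": 17},
        {"phrase": "nightstand", "span": [19, 29], "object_id": 5},
        {"phrase": "bed",        "span": [42, 45], "object_id": 2},
    ],
}

# Because every noun resolves to a specific object, a model's output can be
# checked phrase-by-phrase against the scene's object set.
for g in grounded_instruction["groundings"]:
    assert grounded_instruction["text"][g["span"][0]:g["span"][1]] == g["phrase"]
```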
Results of 3D-LLMs trained on 3D-GRAND:
- Stronger grounding
- Less hallucination (a large improvement over previous 3D-LLMs)
- Data scaling: more data -> better performance
- Sim-to-real transfer: training on synthetic 3D scenes transfers effectively to real 3D scans in ScanNet
Let's build better 3D-LLMs together!
Paper: http://arxiv.org/abs/2406.05132
Website & Data: http://3d-grand.github.io
Demo: http://huggingface.co/spaces/jedyang97/3D-GRAND
3D-POPE Leaderboard: http://huggingface.co/spaces/sled-umich/3D-POPE-leaderboard
Code: http://github.com/sled-group/3D-GRAND