PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation

Text-to-3D asset generation has achieved significant progress under the supervision of 2D diffusion priors. However, when dealing with compositional scenes, existing methods encounter several challenges: (1) failure to ensure that composite scene layouts comply with physical laws; (2) difficulty in accurately capturing the assets and relationships described in complex scene descriptions; (3) limited autonomous asset generation capabilities among layout approaches leveraging large language models (LLMs). To address these shortcomings, we propose PhiP-G, a novel framework for compositional scene generation that seamlessly integrates generation techniques with layout guidance based on a world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene description to generate a scene graph, and integrates a multimodal 2D generation agent with a 3D Gaussian generation method for targeted asset creation. For the layout stage, PhiP-G employs a physical pool with adhesion capabilities and a visual supervision agent, forming a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP-G significantly enhances both the generation quality and the physical plausibility of compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by T3Bench, and improves efficiency by 24x.
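The abstract describes a pipeline of (1) parsing the scene description into a scene graph, (2) generating an asset per graph node, and (3) physics-guided layout. The toy sketch below illustrates that flow; all class and function names are hypothetical, and the LLM parsing and 3D Gaussian generation are replaced with a hard-coded example — this is not the paper's actual implementation.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the pipeline stages named in the abstract.
# The LLM agent and asset generators are stubbed out; only the
# physics-guided "supported objects rest on their supporters" idea
# is shown concretely.

@dataclass
class SceneNode:
    name: str
    height: float                # assumed bounding-box height of the asset
    supports: list = field(default_factory=list)  # nodes resting on this one

def parse_scene_graph(description: str) -> dict:
    """Stand-in for the LLM-based scene-graph agent (hard-coded toy output)."""
    table = SceneNode("table", height=0.8)
    vase = SceneNode("vase", height=0.3)
    table.supports.append(vase)  # relation: "a vase on a table"
    return {"table": table, "vase": vase}

def layout(root: SceneNode, base_z: float = 0.0) -> dict:
    """Physics-aware placement: each supported asset sits exactly on top
    of its supporter, so nothing floats or interpenetrates."""
    placements = {root.name: base_z}
    top = base_z + root.height
    for child in root.supports:
        placements.update(layout(child, base_z=top))
    return placements

graph = parse_scene_graph("a vase on a table")
print(layout(graph["table"]))  # table on the ground, vase at table height
```

In the real system, the layout step is driven by a physical pool with adhesion and a visual supervision agent rather than this simple stacking rule; the sketch only conveys why a world-model layout stage avoids physically implausible arrangements.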
