InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception

Haijie Li1, Yanmin Wu1, Jiarui Meng1, Qiankun Gao1, Zhiyao Zhang2, Ronggang Wang1, Jian Zhang1

1Peking University, 2Northeastern University

Framework

Framework image
Top: Appearance-semantic joint Gaussian representation avoids the imbalance and inconsistency in appearance-semantic learning.

Bottom: Bottom-up instantiation: over-segmentation is achieved via farthest point sampling (FPS) and clustering, followed by instantiation through graph-connectivity-based aggregation.

Abstract

3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations.

However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine-grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance-based Gaussians often misrepresent object boundaries; and 3) reliance on top-down instance segmentation methods, which struggle with uneven category distributions, leading to over- or under-segmentation.

In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic-Scaffold-GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance-semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom-up, category-agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state-of-the-art performance in category-agnostic, open-vocabulary 3D point-level segmentation, highlighting the effectiveness of the proposed representation and training strategies.
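To make the bottom-up aggregation pipeline concrete, the sketch below illustrates the general recipe of farthest point sampling, nearest-seed over-segmentation, and connected-component merging. This is a toy illustration, not the paper's implementation: the merge criterion (centroid distance plus feature cosine similarity) and the thresholds are assumptions chosen for clarity.

```python
import numpy as np

def farthest_point_sampling(points, k):
    """Greedily pick k seed indices that are maximally spread out."""
    n = points.shape[0]
    seeds = [0]
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[seeds[-1]], axis=1))
        seeds.append(int(dist.argmax()))
    return np.array(seeds)

def over_segment(points, seeds):
    """Assign every point to its nearest seed, yielding superpoints."""
    d = np.linalg.norm(points[:, None, :] - points[seeds][None, :, :], axis=2)
    return d.argmin(axis=1)

def aggregate_instances(points, feats, labels, pos_thresh, feat_thresh):
    """Merge superpoints whose centroids are close in space and similar in
    feature space, using union-find connected components (category-agnostic:
    no class labels are involved, only geometry and learned features)."""
    k = int(labels.max()) + 1
    centers = np.stack([points[labels == i].mean(0) for i in range(k)])
    fcent = np.stack([feats[labels == i].mean(0) for i in range(k)])
    fcent = fcent / np.linalg.norm(fcent, axis=1, keepdims=True)
    parent = list(range(k))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(k):
        for j in range(i + 1, k):
            close = np.linalg.norm(centers[i] - centers[j]) < pos_thresh
            if close and fcent[i] @ fcent[j] > feat_thresh:
                parent[find(i)] = find(j)  # connect the two superpoints
    roots = {find(i) for i in range(k)}
    remap = {r: n for n, r in enumerate(sorted(roots))}
    return np.array([remap[find(int(l))] for l in labels])
```

For example, over-segmenting two well-separated point clusters into four superpoints and then aggregating recovers two instances, since only same-object superpoints are both spatially close and feature-similar.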

Results

Instance segmentation result
Visualization comparison of category-agnostic 3D instance segmentation results.
Open vocabulary results
Open-vocabulary query point cloud understanding on the ScanNet dataset.
Open vocabulary results
Open-vocabulary 3D object selection and rendering on the LeRF dataset.
Open vocabulary results

Top: Reference images of the scenes. Middle: Constructed 3D Gaussians/points.

Bottom: Visualization results of category-agnostic 3D instance segmentation on the GraspNet dataset.

BibTeX


@misc{li2024instancegaussianappearancesemanticjointgaussian,
      title={InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception}, 
      author={Haijie Li and Yanmin Wu and Jiarui Meng and Qiankun Gao and Zhiyao Zhang and Ronggang Wang and Jian Zhang},
      year={2024},
      eprint={2411.19235},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19235}, 
}