OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding

1Nanjing University, School of Intelligent Science and Technology
2China Mobile Zijin Innovation Institute
Corresponding author

Abstract

3D visual grounding aims to locate objects in 3D scenes based on natural language descriptions. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits their application in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which overcomes the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to operate on both pre-defined and open-world categories. We also propose a new dataset, OpenTarget, which contains over 7,000 object-description pairs for evaluating our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art results on ScanRefer, and a substantial 17.6% improvement on OpenTarget.

Method

Figure 2. Overview of the OpenGround framework. The core of the framework is the Active Cognition-based Reasoning (ACR) module. First, the ACR module invokes the Cognitive Task Chain Construction module to obtain a sequential task chain that guides step-by-step grounding. Next, the ACR module progresses along the task chain, grounding objects step by step. For objects not present in the OLT, it activates the Active Cognition Enhancement module, which extends the OLT with newly perceived objects around previously grounded ones. The ACR module then performs Single-Step Grounding, prompting the VLM with annotated images rendered from perspectives focused on the candidates (with reference to previously grounded objects) to obtain the target object's ID for the current step. This ID is used to retrieve the object's 3D bounding box from the extended OLT. Upon completing the ACR workflow, we obtain the bounding box of the final target.

Objects Parsing

Parse the query into objects, attributes, and relations to form semantic nodes and constraints.
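
As a concrete illustration, here is a minimal sketch of the kind of structured parse this step might produce; the dataclass names and fields are assumptions, not the paper's exact schema.

```python
# Hypothetical data structures for a parsed query; names and fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class SemanticNode:
    name: str                                   # object category mentioned in the query
    attributes: list[str] = field(default_factory=list)

@dataclass
class Relation:
    subject: str                                # node name
    predicate: str                              # e.g. "on", "next to", "part of"
    obj: str                                    # node name

# Example parse for: "the small yellow soap on the edge of the white sink"
nodes = [SemanticNode("soap", ["small", "yellow"]),
         SemanticNode("sink", ["white"])]
constraints = [Relation("soap", "on edge of", "sink")]
```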

Objects Retrieval

Retrieve matching instances from the OLT to establish grounded intermediate goals and reference objects.
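
A toy sketch of the lookup, assuming the OLT maps labels to detected instances with IDs and 3D boxes (field names and values below are illustrative):

```python
# Hypothetical OLT: label -> instances with IDs and axis-aligned 3D boxes.
olt = {
    "cabinet": [{"id": 3, "center": (1.2, 0.4, 0.8), "size": (0.9, 0.5, 1.6)}],
    "sink":    [{"id": 12, "center": (2.0, 0.3, 0.9), "size": (0.5, 0.4, 0.2)}],
}

def retrieve(olt, label):
    """Return all OLT instances whose label matches a parsed object node."""
    return olt.get(label, [])

print(retrieve(olt, "cabinet"))  # grounded reference object(s)
print(retrieve(olt, "handle"))   # [] -> handled later by Active Cognition Enhancement
```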

Task Chain Construction

Construct a step-by-step grounding chain from known objects to the target using hierarchy and context.
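
A simplified sketch of one possible ordering heuristic: walk from the target up its relation chain until an object already in the OLT is reached, then reverse so grounding proceeds from known context toward the target. This is an illustrative assumption, not the paper's exact construction rule.

```python
def build_task_chain(target, parent_of, known_labels):
    """Walk from the target up its relation chain until a label already in
    the OLT is found, then reverse so grounding proceeds context-first."""
    chain = [target]
    while chain[-1] not in known_labels and chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return list(reversed(chain))

# "handle" is part of "drawer", which is part of "cabinet" (already in the OLT).
parent_of = {"handle": "drawer", "drawer": "cabinet"}
print(build_task_chain("handle", parent_of, known_labels={"cabinet"}))
# ['cabinet', 'drawer', 'handle']
```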

Perspectives Selection

Select viewpoints around grounded objects to cover local neighborhoods and reveal unseen parts.
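
A minimal viewpoint-sampling sketch: place cameras on a ring around a grounded object and aim them at its center. The radius, height offset, and view count are illustrative defaults, not the paper's settings.

```python
import math

def ring_viewpoints(center, radius=1.5, height=1.0, n_views=4):
    """Sample n_views camera poses on a horizontal ring around `center`."""
    cx, cy, cz = center
    views = []
    for k in range(n_views):
        theta = 2.0 * math.pi * k / n_views
        eye = (cx + radius * math.cos(theta),
               cy + radius * math.sin(theta),
               cz + height)
        views.append({"eye": eye, "look_at": center})
    return views

for view in ring_viewpoints(center=(1.2, 0.4, 0.8)):
    print(view)
```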

2D Segmentation and Lifting

Segment objects in 2D from selected views and lift them to 3D to add new instances to the extended OLT.
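
The lifting step can be sketched as a pinhole back-projection: pixels inside a 2D mask are unprojected with the depth map and camera intrinsics, transformed to world coordinates, and summarized as an axis-aligned box for the extended OLT. The 2D segmenter itself is abstracted away here.

```python
import numpy as np

def lift_mask_to_box(mask, depth, K, cam_to_world):
    """mask: HxW bool, depth: HxW (meters), K: 3x3 intrinsics, cam_to_world: 4x4."""
    v, u = np.nonzero(mask)                     # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]             # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])          # 4 x N homogeneous points
    pts_world = (cam_to_world @ pts_cam)[:3].T              # N x 3 world coordinates
    lo, hi = pts_world.min(axis=0), pts_world.max(axis=0)
    return {"center": (lo + hi) / 2, "size": hi - lo}       # new extended-OLT entry
```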

Candidates Selection

Select candidates from the extended OLT for the next step, narrowing the search space before single-step grounding.
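
A toy candidate filter, assuming the step's label and a previously grounded reference box are known; the distance threshold is an illustrative choice:

```python
import numpy as np

def select_candidates(extended_olt, label, ref_center, max_dist=1.0):
    """Keep instances of `label` that lie within `max_dist` of the reference object."""
    keep = []
    for inst in extended_olt.get(label, []):
        if np.linalg.norm(np.asarray(inst["center"]) - np.asarray(ref_center)) <= max_dist:
            keep.append(inst)
    return keep

extended_olt = {"handle": [{"id": 41, "center": (1.0, 0.5, 0.9)},
                           {"id": 42, "center": (3.5, 2.0, 0.9)}]}
print(select_candidates(extended_olt, "handle", ref_center=(1.2, 0.4, 0.8)))
# only the handle next to the reference object remains
```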

Perspectives Selection and Annotation

Render candidate-focused views and annotate reference objects and candidate regions to form multi-view inputs.
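
The annotation can be sketched with Pillow: draw each candidate's projected 2D box and its numeric ID onto the rendered view so the VLM can answer by ID. Projecting 3D boxes into pixel coordinates is assumed to happen upstream.

```python
from PIL import Image, ImageDraw

def annotate_view(image, boxes_2d):
    """boxes_2d: list of (candidate_id, (x0, y0, x1, y1)) in pixel coordinates."""
    img = image.copy()
    draw = ImageDraw.Draw(img)
    for cid, (x0, y0, x1, y1) in boxes_2d:
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)  # candidate region
        draw.text((x0 + 4, y0 + 4), str(cid), fill="red")         # candidate ID label
    return img

view = Image.new("RGB", (640, 480), "gray")          # stand-in for a rendered view
annotated = annotate_view(view, [(41, (100, 120, 180, 220))])
```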

VLM Reasoning

Prompt the VLM with the query and annotated views to predict the target ID and return its 3D bounding box.
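
A sketch of the final prompt-and-parse step; the commented-out query_vlm stands in for whatever multimodal API is used (e.g., GLM-4.5V) and is a hypothetical placeholder, not a real client call.

```python
import re

PROMPT = (
    "Query: {query}\n"
    "The images show candidate objects annotated with numeric IDs, "
    "along with previously grounded reference objects.\n"
    "Answer with the ID of the candidate that matches the query."
)

def parse_target_id(response):
    """Extract the first integer ID from the VLM's reply."""
    m = re.search(r"\b(\d+)\b", response)
    return int(m.group(1)) if m else None

# response = query_vlm(PROMPT.format(query=query), annotated_views)  # hypothetical call
response = "The matching candidate is object 41."
target_id = parse_target_id(response)
print(target_id)  # 41 -> retrieve this ID's 3D box from the extended OLT
```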

Dataset

OpenTarget Dataset

To rigorously evaluate the open-world grounding capability of our OpenGround framework, we construct a novel dataset named OpenTarget, built on ScanNet++ and Articulate3D. Existing 3DVG benchmarks focus on object-level instances with limited category diversity and fail to simulate open-world scenarios in which fine-grained, previously undefined objects (e.g., sink handles, cabinet doors) are pervasive.

In contrast, OpenTarget introduces objects from Articulate3D that are absent from the object lookup table (OLT) built on ScanNet++. These fine-grained parts mimic unforeseen objects in open-world scenarios, providing a realistic benchmark for open-world grounding.

OpenTarget statistics: 7,724 object-description pairs across 120 class labels (50 object classes and 70 part classes).

Data Collection Pipeline

Figure 3. Data Collection Pipeline. The pipeline generates discriminative object descriptions in three stages, followed by a two-stage verification process. It leverages a hierarchical label structure (e.g., cabinet→drawer→handle), selects target and distractor objects together with suitable viewing perspectives, and uses VLMs to generate context-aware descriptions (incorporating parent annotations for child objects). Quality control combines VLM majority voting with manual refinement to ensure high dataset accuracy.

Quality Verification

We employ a two-stage quality verification process that combines automatic filtering and manual refinement. First, for objects sharing the same label, we use multiple VLMs to vote on object-query pairs (presented via selected perspectives and object IDs), retaining only majority-approved pairs for manual review. Human annotators then verify these pairs and may mark them as "unidentifiable" or revise the queries, ensuring high dataset quality.
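
A toy sketch of the automatic filtering stage, assuming each VLM judge returns a boolean verdict for an object-query pair; the three-judge setup is illustrative:

```python
from collections import Counter

def majority_approved(votes):
    """Keep a pair only if more judges approve than reject it."""
    counts = Counter(votes)
    return counts[True] > counts[False]

print(majority_approved([True, True, False]))   # True  -> forwarded to manual review
print(majority_approved([True, False, False]))  # False -> discarded
```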

Results

Qualitative Results

Task Chain Process

Visualizations of four example task chains, each showing the step-by-step grounding process.

Method Comparison

Qualitative comparisons between our method ("Ours") and SeqVLM+GT on four example queries:

Find the outer lid of the box with the "DMS" mark on a cabinet.

Find a handle attached to a white window frame. The window frame is part of a tall, narrow window with blinds partially drawn; the window is situated beside a door with a window.

Locate a metallic hinge attached to the light blue door frame near a sink and wall-mounted dispenser. The hinge is lower than the sink.

Locate a small yellow soap placed on the edge of a white sink in a compact bathroom. The sink is mounted next to a toilet, with shelves holding toiletries visible above.

Interactive Demo

Here, we present an interactive 3D room visualization. When a query is selected, the left panel displays an interactive 3D scene where the target object's 3D bounding box is highlighted. The right panel shows the query-aligned 2D rendered image.


Quantitative Results

Results on ScanRefer

Method | Supervision | VLM | Unique Acc@0.25 | Unique Acc@0.5 | Multiple Acc@0.25 | Multiple Acc@0.5 | Overall Acc@0.25 | Overall Acc@0.5

Supervised Methods
ScanRefer [6] | Supervised | - | 67.6 | 46.2 | 32.1 | 21.3 | 39.0 | 26.1
ViewSRD [13] | Supervised | - | 82.1 | 68.2 | 37.4 | 29.0 | 45.4 | 36.0
GPT4Scene [38] | Supervised | - | 90.3 | 83.7 | 56.4 | 50.9 | 62.6 | 57.0
3D-R1 [15] | Supervised | - | - | - | - | - | 65.8 | 59.2

Zero-Shot Methods
SeeGround [24] | Zero-Shot | Qwen2-VL-72b [55] | 75.7 | 68.9 | 34.0 | 30.0 | 44.1 | 39.4
SeqVLM [25] | Zero-Shot | Doubao-1.5-pro [5] | 77.3 | 72.7 | 47.8 | 41.3 | 55.6 | 49.6
VLM-Grounder* [59] | Zero-Shot | GPT-4o [37] | 51.6 | 32.8 | 66.0 | 29.8 | 48.3 | 33.5
SPAZER [19] | Zero-Shot | GPT-4o [37] | 80.9 | 72.3 | 51.7 | 43.4 | 57.2 | 48.8
ZSVG3D [64] | Zero-Shot | GPT-4 turbo [37] | 63.8 | 58.4 | 27.7 | 24.6 | 36.4 | 32.7
Ours | Zero-Shot | GLM-4.5V [51] | 77.8 | 74.4 | 57.9 | 47.9 | 61.8 | 53.1
Quantitative Comparisons on ScanRefer [Chen et al., 2020]. Results are reported for "Unique" (scenes with a single target object) and "Multiple" (scenes with distractors of the same class) subsets, with metrics Acc@0.25 and Acc@0.50. * denotes results on selected 250 samples.
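
For reference, Acc@k counts a prediction as correct when the 3D IoU between the predicted and ground-truth boxes is at least k (0.25 or 0.5). A minimal sketch for axis-aligned boxes given as (min_xyz, max_xyz) pairs:

```python
import numpy as np

def iou_3d(box_a, box_b):
    """Axis-aligned 3D IoU; each box is a (min_xyz, max_xyz) pair of arrays."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter)

pred = (np.zeros(3), np.ones(3))
gt = (np.array([0.25, 0.0, 0.0]), np.array([1.25, 1.0, 1.0]))
print(iou_3d(pred, gt) >= 0.25)   # True: this prediction counts toward Acc@0.25
```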

Results on Nr3D

Method | Easy | Hard | Dep. | Indep. | Overall

Supervised Methods
3D-R1 [15] | - | - | - | - | 68.8
TSP3D [10] | - | - | - | - | 48.7
ViewSRD [13] | 75.3 | 64.8 | 68.6 | 70.6 | 69.9

Zero-Shot Methods
VLM-Grounder [59] | 55.2 | 39.5 | 45.8 | 49.4 | 48.0
SeeGround [24] | 54.5 | 38.3 | 42.3 | 48.2 | 46.1
ZSVG3D [64] | 46.5 | 31.7 | 36.8 | 40.0 | 39.0
SeqVLM [25] | 58.1 | 47.4 | 51.0 | 54.5 | 53.2
SPAZER [19] | 68.0 | 58.8 | 59.9 | 66.2 | 63.8
Ours | 59.1 | 54.7 | 54.1 | 58.3 | 56.8
Ours | 64.3 | 59.3 | 59.2 | 63.1 | 61.7
Detailed Performance on Nr3D [Achlioptas et al., 2020]. Queries are categorized as "Easy" (with one distractor) or "Hard" (with multiple distractors), and as "Dep." (View-Dependent) or "Indep." (View-Independent) based on viewpoint requirements for grounding. Marked methods use GPT-4o [37] as the VLM.

Results on OpenTarget

Method | OLT | Acc@0.25 | Acc@0.50
SeeGround [24] | GT | 17.9 | 17.4
VLM-Grounder* [59] | GT | 28.6 | 20.4
SeqVLM [25] | GT | 19.4 | 19.2
GPT4Scene [38] | GT | 12.1 | 11.8
Ours | Mask3D [44] + ACE | 46.2 | 34.2
Ours | GT | 54.8 | 54.3
Performance on OpenTarget. * denotes results on randomly selected 300 samples due to its low efficiency.

BibTeX

@misc{huang2025opengroundactivecognitionbasedreasoning,
      title={OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding}, 
      author={Wenyuan Huang and Zhao Wang and Zhou Wei and Ting Huang and Fang Zhao and Jian Yang and Zhenyu Zhang},
      year={2025},
      eprint={2512.23020},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.23020}, 
}