SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

We present SoFar, the first 6-DoF system for spatial reasoning and robotic manipulation.

We introduce the concept of semantic orientation, which represents object orientation conditioned on open-vocabulary language.
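As a concrete illustration of the idea, a semantic orientation can be viewed as a unit direction vector paired with an open-vocabulary phrase. The sketch below is purely illustrative and does not use the actual SoFar/PointSO API; the dictionary and its values are made-up examples.

```python
import math

def normalize(v):
    """Scale a non-zero 3D vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

# Hypothetical output of an orientation model: for each language phrase,
# a unit direction vector in the object's frame (values are made up).
semantic_orientations = {
    "handle of the mug": normalize((1.0, 0.2, 0.0)),
    "opening of the mug": normalize((0.0, 0.0, 1.0)),
}

direction = semantic_orientations["opening of the mug"]
```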

Zekun Qi *, Wenyao Zhang *, Yufei Ding *, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang and Li Yi.

Project Page Paper PDF Hugging Face Code License Data License



Quick-Start

Setup environment:

conda create -n sofar python=3.10 -y
conda activate sofar

git clone https://github.com/qizekun/SoFar.git
cd SoFar
pip install -e .
pip install -e segmentation/SAM

Download checkpoints:

mkdir checkpoints && cd checkpoints
# Florence-2
huggingface-cli download microsoft/Florence-2-base
# Segment Anything
wget -c https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
# PointSO
wget -c https://huggingface.co/qizekun/PointSO/resolve/main/small.pth
wget -c https://huggingface.co/qizekun/PointSO/resolve/main/base_finetune.pth

More detailed installation instructions can be found in INSTALL.md.

SoFar

Set OpenAI key:

export OPENAI_API_KEY=your_openai_key

Note that gemini-2.0-flash-exp is comparable to, and sometimes better than, gpt-4o, especially on the Open6DOR task.
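The note above suggests switching backends is worthwhile. A minimal sketch of choosing the VLM backend from environment variables is shown below; the helper name `pick_vlm_backend` is hypothetical, and the repo's actual configuration mechanism may differ.

```python
import os

# Hypothetical helper showing how the VLM backend might be selected;
# this is not the repo's actual configuration code.
def pick_vlm_backend() -> str:
    if os.environ.get("GOOGLE_API_KEY"):
        # Comparable to, and sometimes better than, gpt-4o on Open6DOR.
        return "gemini-2.0-flash-exp"
    if os.environ.get("OPENAI_API_KEY"):
        return "gpt-4o"
    raise RuntimeError("Set OPENAI_API_KEY or GOOGLE_API_KEY first.")

os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")  # illustration only
backend = pick_vlm_backend()
```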

Demo

6-DoF Object Rearrangement Demo

python scripts/open6dor_demo.py

Object Manipulation Demo

python scripts/manipulation_demo.py

Spatial Visual Question Answering Demo

python scripts/vqa_demo.py

Evaluation

Object Manipulation on SimplerEnv

Google Robot Visual Matching
| Method | Training Data | Pick Coke Can | Move Near | Open / Close Drawer | Average |
| --- | --- | --- | --- | --- | --- |
| Octo-Base | OXE | 0.170 | 0.042 | 0.227 | 0.168 |
| OpenVLA | OXE | 0.163 | 0.462 | 0.356 | 0.277 |
| RoboVLM | OXE | 0.727 | 0.663 | 0.268 | 0.563 |
| SpatialVLA | OXE | 0.810 | 0.696 | 0.593 | 0.719 |
| SoFar | - | 0.923 | 0.917 | 0.403 | 0.749 |
Widow-X Visual Matching

| Method | Training Data | Put Spoon on Towel | Put Carrot on Plate | Stack Green Block on Yellow Block | Put Eggplant in Yellow Basket | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Octo-Base | OXE | 0.170 | 0.042 | 0.227 | 0.168 | 0.160 |
| OpenVLA | OXE | 0.000 | 0.000 | 0.000 | 0.041 | 0.010 |
| RoboVLM | OXE | 0.208 | 0.250 | 0.083 | 0.000 | 0.135 |
| SpatialVLA | OXE | 0.208 | 0.208 | 0.250 | 0.708 | 0.344 |
| SoFar | - | 0.583 | 0.667 | 0.708 | 0.375 | 0.583 |

We evaluate SoFar on two SimplerEnv tracks, where it achieves the best average performance on both. Because SimplerEnv requires its own environment configuration, we provide detailed evaluation code in SimplerEnv-SOFAR.
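For readers reproducing the tables, the Average column appears to be an unweighted mean of the per-task success rates (tracks with subtasks, such as Open / Close Drawer, may be weighted differently internally). A quick sanity check on the SoFar Widow-X row:

```python
# Sanity-check the Widow-X "Average" column, assuming it is the unweighted
# mean of the four task success rates (subtask-weighted tracks may differ).
sofar_widowx = [0.583, 0.667, 0.708, 0.375]  # spoon, carrot, stack, eggplant
average = round(sum(sofar_widowx) / len(sofar_widowx), 3)
print(average)  # 0.583, matching the table
```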

6-DoF Object Rearrangement Perception on Open6DOR V2

| Method | Position Track (Level 0) | Position Track (Level 1) | Rotation Track (Level 0) | Rotation Track (Level 1) | Rotation Track (Level 2) | 6-DoF Track (Overall) |
| --- | --- | --- | --- | --- | --- | --- |
| Dream2Real | 17.2 | 11.0 | 37.3 | 27.6 | 26.2 | 13.5 |
| VoxPoser | 35.6 | 21.7 | - | - | - | - |
| Open6DOR-GPT | 78.6 | 60.3 | 45.7 | 32.5 | 49.8 | 35.6 |
| SoFar-LLaVA | 86.3 | 57.9 | 62.5 | 30.2 | 67.1 | 40.3 |
| SoFar | 96.0 | 81.5 | 68.6 | 42.2 | 70.1 | 48.7 |

Download the refined dataset following DATASET.md.

# Predict on Open6DOR dataset
python open6dor_eval_perception.py
# Evaluate the metrics
python eval_open6dor.py

Note that Open6DOR uses the observer's perspective: coordinates are expressed relative to a viewer facing the robotic arm, so the X- and Y-axes of the observer coordinate system point opposite to those of the robot's base coordinate system. This is reflected in the system prompt: in the observer frame, the Y-axis runs from left to right and the X-axis from far to near.
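The convention above can be sketched as a simple frame conversion. This is an illustrative reading of the note, not code from the repo:

```python
def observer_to_base(p):
    """Convert a point from the observer frame to the robot-base frame.

    Per the convention described above, the observer frame's X and Y axes
    point opposite to the robot base frame's, so both components flip sign;
    Z is shared. Illustrative sketch only, not repo code.
    """
    x, y, z = p
    return (-x, -y, z)

print(observer_to_base((0.3, -0.2, 0.5)))  # (-0.3, 0.2, 0.5)
```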

Additionally, for the Open6DOR task, we recommend using small_finetune.pth as the orientation model in pointso.py to achieve better performance.

6-DoF Spatial VQA on 6-DoF SpatialBench

Method Position (rel.) Position (abs.) Orientation (rel.) Orientation (abs.) Total
GPT-4o 49.4 28.4 44.2 25.8 36.2
SpaceLLaVA 32.4 30.5 30.9 24.9 28.2
SpatialBot 50.9 21.6 39.6 22.9 33.7
RoboPoint 43.8 30.8 33.8 25.8 33.5
SoFar 59.6 33.8 54.6 31.3 43.9

Download the refined dataset following DATASET.md.

python spatialbench/eval_spatialbench.py

PointSO

Pretrain

Download the Point-MAE checkpoint as initialization.

wget https://github.com/Pang-Yatian/Point-MAE/releases/download/main/pretrain.pth -P orientation/

Prepare the OrienText300K dataset following DATASET.md.

cd orientation
sh train_ddp.sh

Finetune

Prepare the Open6DOR finetuning dataset following DATASET.md. Finetuning PointSO significantly improves performance on the Open6DOR rotation and 6-DoF tracks.

cd orientation
sh train_ddp_ft.sh

Datasets & Benchmarks

OrienText300K

We built the OrienText300K dataset by rendering multi-view images of Objaverse objects and annotating them with ChatGPT. The release includes a filtered subset of Objaverse 1.0, 350K orientation-text pairs, and 8M multi-view images. The complete multi-view data will be uploaded.

In addition, if your work requires filtering 3D data, the attributes.zip we used to filter OrienText300K may be helpful. It provides multi-view annotations for each Objaverse object across multiple dimensions, which we used to remove low-quality, meaningless, and noisy 3D assets.
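A filtering pass with such annotations might look like the sketch below. The schema here (a UID-to-attributes mapping with `quality` and `meaningful` fields) is an assumption for illustration; the real field names inside attributes.zip may differ.

```python
import json

# Hypothetical schema: attributes.zip is assumed to map each Objaverse UID
# to quality tags. The real annotation fields may differ.
attributes = json.loads("""
{
  "uid-001": {"quality": "good", "meaningful": true},
  "uid-002": {"quality": "low",  "meaningful": true},
  "uid-003": {"quality": "good", "meaningful": false}
}
""")

# Keep only assets that are both high-quality and meaningful.
kept = [uid for uid, a in attributes.items()
        if a["quality"] == "good" and a["meaningful"]]
print(kept)  # ['uid-001']
```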

The data is open-sourced on Hugging Face: OrienText300K.

Open6DOR V2

We removed erroneous data from Open6DOR V1 and eliminated parts that required manual judgment, to facilitate replication. Open6DOR V2 contains ~4,500 tasks for 6-DoF object rearrangement and spatial-relationship evaluation.

The data is open-sourced on Hugging Face: Open6DOR V2.

6-DoF SpatialBench

Previous spatial-perception LLMs mainly focused on positional relationships, such as left/right, near/far, size, and counting. In real object manipulation, the orientation of the object is also an important factor. We therefore propose 6-DoF SpatialBench, a new benchmark for evaluating a model's reasoning about position, orientation, and position-orientation relationships, and we evaluate existing spatial-perception models on it.

The data is open-sourced on Hugging Face: 6-DoF SpatialBench.

TODO

  • Release the evaluation code for SimplerEnv for Google Robot & Widow-X.
  • Release the evaluation code for Open6DOR-Libero. (about 2 weeks)
  • Release more versions of PointSO. (about 2 weeks)
  • Release the improved version of OrienText300K. (about 1 month)
  • Release a Gradio demo for SoFar & PointSO. (about 1 month)
  • Release the Objaverse-XL version of the dataset & PointSO. (about 2 months or more)

Contact

If you have any questions related to the code or the paper, feel free to email Zekun ([email protected]).

Acknowledgements

Citation

If you find SoFar, PointSO, OrienText300K, Open6DOR V2 or 6-DoF SpatialBench helpful for your research, please consider citing the following BibTeX entry.

@article{qi2025sofar,
  author = {Qi, Zekun and Zhang, Wenyao and Ding, Yufei and Dong, Runpei and Yu, Xinqiang and Li, Jingwen and Xu, Lingyun and Li, Baoyu and He, Xialin and Fan, Guofan and Zhang, Jiazhao and He, Jiawei and Gu, Jiayuan and Jin, Xin and Ma, Kaisheng and Zhang, Zhizheng and Wang, He and Yi, Li},
  title = {SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation},
  journal = {arXiv preprint arXiv:2502.13143},
  year = {2025}
}
