We present Ca3DPose (Canonical 3D Pose), a method that estimates the canonical 3D pose of an object via object-level classification, together with a reimplementation of an NPCS (Normalized Part Coordinate Space) based method at the object level as a baseline. We use the newly released 3D scene dataset MultiScan, which contains over 200 scans. Our model extracts point-level features with a MinkowskiEngine-powered U-Net backbone or a PAConv backbone, aggregates them into object-level features by voxelization or max pooling, and performs object-level classification. We formalize the 3D pose as a combination of an up-direction class, a front-direction latitude class, and a front-direction longitude class. Using the NPCS method as the baseline, our Ca3DPose outperforms it on the MultiScan dataset, and the PAConv backbone outperforms the U-Net backbone.
We train our model on the newly released MultiScan dataset.
The main contribution is predicting the canonical 3D pose (front and up directions) of an object from its point cloud via object-level classification.
The basic code architecture of the W&B logger, the Hydra configuration, and the backbone model are from MINSU3D.
- ObjectClassifier is an efficient, MinkowskiEngine-based framework for point-cloud object-level pose estimation. It voxelizes the per-point features from the U-Net to obtain object-level features, discretizes the front/up directions into latitude and longitude classes, and then recovers the directions from the predicted classes (see the sketch after this list). Canonical pose estimation is thereby simplified to a classification problem solved with a 3-layer MLP.
- The Normalized Object Coordinate Space (NOCS) model is our reimplementation of the Normalized Part Coordinate Space model at the object level. We modified some model details and the loss function to fit the MultiScan dataset and the U-Net features from MinkowskiEngine.
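To make the direction discretization concrete, below is a minimal sketch of how directions might be mapped to latitude/longitude classes and recovered from them. The bin counts (`N_LAT`, `N_LON`) and the binning scheme are our assumptions for illustration, not the repo's exact implementation.

```python
import numpy as np

# Illustrative bin counts; the actual class granularity in the repo may differ.
N_LAT, N_LON = 18, 36  # 10-degree bins

def direction_to_class(d):
    """Map a unit direction vector to (latitude class, longitude class)."""
    d = d / np.linalg.norm(d)
    lat = np.arccos(np.clip(d[2], -1.0, 1.0))     # polar angle in [0, pi]
    lon = np.arctan2(d[1], d[0]) % (2 * np.pi)    # azimuth in [0, 2*pi)
    lat_cls = min(int(lat / np.pi * N_LAT), N_LAT - 1)
    lon_cls = min(int(lon / (2 * np.pi) * N_LON), N_LON - 1)
    return lat_cls, lon_cls

def class_to_direction(lat_cls, lon_cls):
    """Recover a direction from the predicted classes (bin centers)."""
    lat = (lat_cls + 0.5) / N_LAT * np.pi
    lon = (lon_cls + 0.5) / N_LON * 2 * np.pi
    return np.array([np.sin(lat) * np.cos(lon),
                     np.sin(lat) * np.sin(lon),
                     np.cos(lat)])
```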
- AC_(angle): accuracy under an angular threshold; a prediction counts as correct if the angle between the predicted and ground-truth direction is within (angle) degrees
- Rerr: the average angle between the predicted and ground-truth directions, in radians
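A short sketch of how these metrics can be computed (the function names are ours, not the repo's):

```python
import numpy as np

def angle_error(pred, gt):
    """Angle in radians between predicted and ground-truth directions."""
    cos = np.dot(pred, gt) / (np.linalg.norm(pred) * np.linalg.norm(gt))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def evaluate(preds, gts, thresholds_deg=(5, 10, 20)):
    """Return the AC_(angle) accuracies and Rerr over a set of predictions."""
    errs = np.array([angle_error(p, g) for p, g in zip(preds, gts)])
    acc = {f"AC_{t}": float((errs <= np.deg2rad(t)).mean()) for t in thresholds_deg}
    return acc, float(errs.mean())  # Rerr in radians
```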
Category | AC_5 | AC_10 | AC_20 | Rerr | Count |
---|---|---|---|---|---|
wall | 0.764 | 0.809 | 0.828 | 0.543 | 157 |
door | 0.610 | 0.659 | 0.683 | 0.937 | 41 |
table | 0.517 | 0.583 | 0.633 | 0.849 | 60 |
chair | 0.529 | 0.657 | 0.857 | 0.236 | 70 |
cabinet | 0.652 | 0.710 | 0.783 | 0.595 | 69 |
window | 0.824 | 0.941 | 1.000 | 0.046 | 17 |
sofa | 0.636 | 0.727 | 0.864 | 0.274 | 22 |
microwave | 0.500 | 0.667 | 0.667 | 1.061 | 6 |
pillow | 0.727 | 0.788 | 0.939 | 0.157 | 33 |
tv_monitor | 0.455 | 0.500 | 0.545 | 1.427 | 22 |
curtain | 0.591 | 0.591 | 0.682 | 0.791 | 22 |
trash_can | 0.875 | 0.875 | 0.875 | 0.393 | 8 |
suitcase | 0.594 | 0.625 | 0.688 | 0.669 | 32 |
sink | 0.286 | 0.500 | 0.500 | 1.479 | 14 |
backpack | 0.000 | 0.250 | 0.250 | 1.808 | 4 |
bed | 0.750 | 0.750 | 0.750 | 0.588 | 8 |
refrigerator | 0.600 | 0.600 | 0.600 | 0.909 | 10 |
toilet | 0.333 | 0.444 | 0.444 | 1.052 | 9 |
average | 0.631 | 0.697 | 0.763 | 0.621 | 604 |
These per-category results are on the newly released MultiScan dataset using our ObjectClassifier model. Note that the model is trained only on the 8 object categories with articulated parts.
The NOCS model yields lower results than our model, as shown below.
Method | AC_5 | AC_10 | AC_20 | Rerr |
---|---|---|---|---|
Ca3DPose (U-Net) | 0.328 | 0.349 | 0.387 | 1.812 |
Ca3DPose (PAConv) | 0.631 | 0.697 | 0.763 | 0.621 |
NOCS (baseline) | 0.004 | 0.040 | 0.175 | 1.359 |
- Designed a new object-level classifier method based on latitude and longitude classes; the model architecture is shown below (see also the sketch after this list).
- Preprocessed MultiScan to extract the objects with annotated canonical poses
- Highly modularized design that enables researchers to switch between the NOCS model and our ObjectClassifier model easily.
- Better logging with W&B, periodic evaluation during training, and easy experiment configuration with Hydra.
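As referenced in the list above, here is a hypothetical PyTorch sketch of the object-level classification head: per-point backbone features are max-pooled into a single object feature, and a 3-layer MLP predicts the up-direction, front-latitude, and front-longitude classes. All dimensions and class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ObjectPoseHead(nn.Module):
    """Illustrative object-level classification head (not the repo's exact code)."""

    def __init__(self, feat_dim=32, n_up=18, n_lat=18, n_lon=36):
        super().__init__()
        self.splits = (n_up, n_lat, n_lon)
        self.mlp = nn.Sequential(               # 3-layer MLP
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, sum(self.splits)),
        )

    def forward(self, point_feats):              # (num_points, feat_dim)
        obj_feat = point_feats.max(dim=0).values  # max-pool to object level
        logits = self.mlp(obj_feat)
        return torch.split(logits, self.splits)   # (up, lat, lon) logits
```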
Environment requirements
- CUDA 11.X
- Python 3.8
We recommend the use of miniconda to manage system dependencies.
# create and activate the conda environment
conda create -n min3dcapose python=3.8
conda activate min3dcapose
# install PyTorch 1.8.2
conda install pytorch cudatoolkit=11.1 -c pytorch-lts -c nvidia
# install Python libraries
pip install -e .
# verify the Python libraries installation
python -c "import min3dcapose"
# install OpenBLAS and SparseHash via conda
conda install openblas-devel -c anaconda
conda install -c bioconda google-sparsehash
export CPATH=$CONDA_PREFIX/include:$CPATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
# install MinkowskiEngine
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"
# install C++ extensions
cd minsu3d/common_ops
python setup.py develop
Note: Setting up with pip (without conda) requires OpenBLAS and SparseHash to be pre-installed on your system.
# create and activate the virtual environment
virtualenv --no-download env
source env/bin/activate
# install PyTorch 1.8.2
pip install torch==1.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
# install Python libraries
pip install -e .
# install OpenBLAS and SparseHash via APT
sudo apt install libopenblas-dev libsparsehash-dev
# install MinkowskiEngine
pip install MinkowskiEngine
# install C++ extensions
cd minsu3d/common_ops
python setup.py develop
- Download the MultiScan dataset and repo. To acquire access to the dataset, please refer to their instructions; you will get a download script once your request is approved. The downloaded dataset follows their file system structure.
- Substitute the `MULTISCAN/dataset/preprocess/gen_instsegm_dataset.py` file in the downloaded MultiScan repo with our `gen_instsegm_dataset.py`, and set up the environment following the MultiScan instructions.
- Preprocess the data: this converts the objects with annotated poses into `.pth` files and splits the dataset in the default MultiScan way.
# the raw MultiScan dataset is about 406.3GB in total
python gen_instsegm_dataset.py
# the processed data is about 5.9GB in total
Each `.pth` file is named after its scan and contains all the objects in that scan. Each object dictionary has the following keys:
- `"xyz"`: point coordinates
- `"rgb"`: point colors
- `"normal"`: point normals
- `"obb"`: oriented bounding box
- `"front"`: annotated front direction
- `"up"`: annotated up direction
- `"instance_ids"`: per-point instance IDs
- `"sem_labels"`: per-point semantic labels
Download the split MultiScan objects with metadata from Multiscan_objects.
Note: Configuration files are managed by Hydra; you can easily add or override any configuration attribute by passing it as an argument.
# log in to WandB
wandb login
# train a model from scratch
python train.py model={model_name} data={dataset_name}
# train a model from a checkpoint
python train.py model={model_name} data={dataset_name} model.ckpt_path={checkpoint_path}
# test a pretrained model
python test.py model={model_name} data={dataset_name} model.ckpt_path={pretrained_model_path}
# evaluate inference results
python eval.py model={model_name} data={dataset_name} model.model.experiment_name={experiment_name}
# examples:
# python train.py model=nocs data=multiscan model.trainer.max_epochs=120
# python test.py model=object_classifier data=multiscan model.ckpt_path=Object_Classifier_best.ckpt
# python eval.py model=nocs data=multiscan model.model.experiment_name=run_1
We provide pretrained models for MultiScan. The pretrained models and corresponding config files are given below. Note that all NOCS models are trained from scratch, while the ObjectClassifier model is fine-tuned from the pretrained HAIS-MultiScanObj-epoch=55.ckpt model (itself trained on the MultiScan dataset) and reuses the backbone U-Net hyper-parameters to accelerate training. After downloading a pretrained model, run `test.py` to do inference as described in the section above.
Model | Code | AC_5 | AC_10 | AC_20 | Rerr | Download |
---|---|---|---|---|---|---|
ObjectClassifier | config model | 0.318 | 0.337 | 0.348 | 1.337 | link |
NOCS | config model | | | | | link |
We provide scripts to visualize the predicted and ground-truth canonical 3D pose of an object. When testing or running inference, add the following option to generate visualizations:
model.show_visualization=True
The default visualization results are saved in the following file structure:
min3dcapose
├── visualization_results
# results whose average angle error is below 5 degrees
│ ├── Ac5-
# input object
│ │ ├── [object_name].png
# object in predicted canonical pose
│ │ ├── [object_name]_r.png
# results whose average angle error is over 30 degrees
│ ├── Ac30-
│ │ ├── [object_name].png
│ │ ├── [object_name]_r.png
Some result visualizations are shown below.
# the red arrow is the predicted front direction
# left: the original object with the default OBB from `pcd.get_oriented_bounding_box()` in Open3D; the pose is randomly rotated
# right: the object rotated to the predicted canonical pose, with the OBB aligned to the canonical axes
Good predictions (angle error < 5 degrees):
object name | uncanonicalized pose | predicted canonical pose |
---|---|---|
toilet | ||
chair | ||
cabinet | ||
door |
Bad predictions (angle error > 30 degrees):
object name | uncanonicalized pose | predicted canonical pose |
---|---|---|
toilet | ||
chair | ||
cabinet | ||
door |
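The side-by-side renders above could be reproduced along these lines, assuming the predicted front/up directions are available; the axis convention (front mapped to +X, up to +Z) and the input path are our assumptions:

```python
import numpy as np
import open3d as o3d

def canonicalize(pcd, front, up):
    """Rotate a point cloud so `front` maps to +X and `up` maps to +Z."""
    up = up / np.linalg.norm(up)
    front = front - np.dot(front, up) * up   # make front orthogonal to up
    front = front / np.linalg.norm(front)
    R = np.stack([front, np.cross(up, front), up])  # rows become +X, +Y, +Z
    pcd.rotate(R, center=pcd.get_center())
    return pcd

pcd = o3d.io.read_point_cloud("object.ply")  # illustrative input
pcd = canonicalize(pcd, front=np.array([0.3, 0.9, 0.1]), up=np.array([0.0, 0.0, 1.0]))
obb = pcd.get_oriented_bounding_box()        # OBB now aligned to canonical axes
o3d.visualization.draw_geometries([pcd, obb])
```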
We report the time it takes to train on the MultiScan data (134 scans).
Test environment
- CPU: Intel Core i7-12700 @ 2.10-4.90GHz × 12
- RAM: 32GB
- GPU: NVIDIA GeForce RTX 3090 Ti 24GB
- System: Ubuntu 20.04.2 LTS
Training time in total (without validation)
Model | Epochs | Batch Size | Time |
---|---|---|---|
ObjectClassifier | 15 | 8 | 12hr4min |
NOCS | 91 | 8 | 30hr10min |
Inference time per object (avg)
Model | Time |
---|---|
ObjectClassifier | 1.004s |
- It is hard to predict the canonical pose of some object categories due to annotation limitations. For instance, the front direction of some windows is defined as pointing into the room, so the front direction is hard to predict without the background.
- The results of our reimplementation of the NOCS model still need to be improved.
This repo is built upon MinkowskiEngine and MINSU3D. We train our models on MultiScan. If you use this repo and the pretrained models, please cite the original papers.
@inproceedings{mao2022multiscan,
author = {Mao, Yongsen and Zhang, Yiming and Jiang, Hanxiao and Chang, Angel X. and Savva, Manolis},
title = {MultiScan: Scalable RGBD scanning for 3D environments with articulated objects},
booktitle = {Advances in Neural Information Processing Systems},
year = {2022}
}
@article{Zhou2018,
author = {Qian-Yi Zhou and Jaesik Park and Vladlen Koltun},
title = {{Open3D}: {A} Modern Library for {3D} Data Processing},
journal = {arXiv:1801.09847},
year = {2018},
}
@article{ravi2020pytorch3d,
author = {Nikhila Ravi and Jeremy Reizenstein and David Novotny and Taylor Gordon
and Wan-Yen Lo and Justin Johnson and Georgia Gkioxari},
title = {Accelerating 3D Deep Learning with PyTorch3D},
journal = {arXiv:2007.08501},
year = {2020},
}
@inproceedings{choy20194d,
title={4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks},
author={Choy, Christopher and Gwak, JunYoung and Savarese, Silvio},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={3075--3084},
year={2019}
}
@article{geng2022gapartnet,
author = {Geng, Haoran and Xu, Helin and Zhao, Chengyang and Xu, Chao and Yi, Li and Huang, Siyuan and Wang, He},
title = {GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts},
journal = {arXiv:2211.05272},
year = {2022},
}