Human3D πŸ§‘β€πŸ€β€πŸ§‘

3D Segmentation of Humans in Point Clouds with Synthetic Data

¹ETH Zürich, Switzerland   ²RWTH Aachen University, Germany   ³ETH AI Center, Switzerland   *,† equal contribution

Abstract

TL;DR: We propose the first multi-human body-part segmentation model, called Human3D 🧑‍🤝‍🧑, that directly operates on 3D scenes. In an extensive analysis, we validate the benefits of training on synthetic data across multiple baselines and tasks.

Segmenting humans in 3D indoor scenes has become increasingly important with the rise of human-centered robotics and AR/VR applications. In this direction, we explore the tasks of 3D human semantic-, instance- and multi-human body-part segmentation. Few works have attempted to directly segment humans in point clouds (or depth maps), which is largely due to the lack of training data on humans interacting with 3D scenes. We address this challenge and propose a framework for synthesizing virtual humans in realistic 3D scenes. Synthetic point cloud data is attractive since the domain gap between real and synthetic depth is small compared to images. Our analysis of different training schemes using a combination of synthetic and realistic data shows that synthetic data for pre-training improves performance in a wide variety of segmentation tasks and models. We further propose the first end-to-end model for 3D multi-human body-part segmentation, called Human3D, that performs all the above segmentation tasks in a unified manner. Remarkably, Human3D even outperforms previous task-specific state-of-the-art methods. Finally, we manually annotate humans in test scenes from EgoBody to compare the proposed training schemes and segmentation models.

Explanatory Video

Point Cloud from iPhone LiDAR

Remarkably, our approach generalizes to out-of-distribution examples. Although trained on synthetic data and real Kinect depth data, Human3D shows promising results on reconstructed point clouds scanned with an iPhone LiDAR sensor.

Depth from Kinect Sensor

Human3D shows smooth and robust predictions on videos recorded with the Kinect Depth Sensor.

Synthetic Pretraining


Only EgoBody data: We observe that models trained only on EgoBody data do not generalize to scenes with more than two humans. Here, the instance masks of two people leak into the mask of the third person on the right. The reason is that the EgoBody dataset only contains scenes with at most two people at the same time. When trained only on EgoBody, Human3D inevitably learns this bias and consequently fails on scenes with more than two people.
Pretrained with synthetic data: In contrast, our synthetic dataset consists of scenes with up to 10 people. Human3D, pre-trained on synthetic data and fine-tuned on real EgoBody data, shows significantly better results for scenes with a larger number of people.
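The two-stage scheme above can be summarized in a short training sketch. The snippet below is only illustrative: Human3DModel, SyntheticHumansDataset, EgoBodyDataset and compute_loss are hypothetical placeholder names, not the released code or API.

# Sketch of the two-stage training scheme described above: pre-train on
# synthetic scenes with many humans, then fine-tune on real EgoBody scans.
# Human3DModel, SyntheticHumansDataset and EgoBodyDataset are hypothetical
# placeholders, not the released implementation.
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs, lr):
    # Generic loop reused for both the pre-training and the fine-tuning stage.
    loader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=lambda b: b)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = model.compute_loss(batch)  # hypothetical joint instance + body-part loss
            loss.backward()
            optimizer.step()

model = Human3DModel()

# Stage 1: pre-train on synthetic scenes (up to 10 humans per scene).
train(model, SyntheticHumansDataset(), epochs=50, lr=1e-4)

# Stage 2: fine-tune on real EgoBody scans (at most 2 humans per scene),
# starting from the weights learned during synthetic pre-training.
train(model, EgoBodyDataset(split="train"), epochs=20, lr=1e-5)

The key point is simply that the first stage exposes the model to crowded scenes that EgoBody cannot provide, so the fine-tuned model no longer inherits the "at most two people" bias.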

We conclude that pre-training with synthetic data helps to segment humans in 3D point clouds!

Publication

BibTeX


@article{Takmaz22,
  title   = {{3D Segmentation of Humans in Point Clouds with Synthetic Data}},
  author  = {Takmaz, Ay\c{c}a and Schult, Jonas and Kaftan, Irem and Ak\c{c}ay, Mertcan
             and Leibe, Bastian and Sumner, Robert and Engelmann, Francis and Tang, Siyu},
  journal = {arXiv:2212.00786},
  year    = {2022}
}