3D Segmentation of Humans in Point Clouds with Synthetic Data
TL;DR: We propose the first multi-human body-part segmentation model, called Human3D, that directly operates on 3D scenes.
In an extensive analysis, we validate the benefits of training on synthetic data across multiple baselines and tasks.
Segmenting humans in 3D indoor scenes has become increasingly important with the rise of human-centered
robotics and AR/VR applications.
In this direction, we explore the tasks of 3D human semantic segmentation, instance segmentation, and multi-human body-part segmentation.
Few works have attempted to directly segment humans in point clouds (or depth maps), which is largely due
to the lack of training data on humans interacting with 3D scenes.
We address this challenge and propose a framework for synthesizing virtual humans in realistic 3D scenes.
Synthetic point cloud data is attractive since the domain gap between real and synthetic depth data is small compared to the gap for color images.
Our analysis of different training schemes combining synthetic and real data shows that pre-training on synthetic data improves performance across a wide variety of segmentation tasks and models.
We further propose the first end-to-end model for 3D multi-human body-part segmentation, called Human3D,
that performs all the above segmentation tasks in a unified manner.
Remarkably, Human3D even outperforms previous task-specific state-of-the-art methods.
Finally, we manually annotate humans in test scenes from EgoBody to compare the proposed training schemes
and segmentation models.
Remarkably, our approach generalizes to out-of-distribution examples. Although trained on synthetic data and real Kinect depth data, Human3D shows promising results on reconstructed point clouds scanned with an iPhone LiDAR sensor.
Human3D shows smooth and robust predictions on videos recorded with a Kinect depth sensor.
Only EgoBody data:
We observe that models trained only on EgoBody data do not generalize to scenes with more than two humans. Here we can
see that the instance masks of two people leak into the third person's mask on the right.
The reason is that the EgoBody dataset only contains scenes with at most two people at the same time. When
trained only on EgoBody, Human3D inevitably learns this bias and consequently fails on scenes with more than two
people.
Pretrained with synthetic data:
In contrast, our synthetic dataset consists of scenes with up to 10 people. Human3D, pre-trained on synthetic data
and fine-tuned on real EgoBody data, shows significantly better results for scenes with a larger number of people.
We conclude that pre-training with synthetic data helps to segment humans in 3D point clouds!
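As a rough illustration of this two-stage training scheme, the sketch below shows a standard PyTorch pre-train/fine-tune loop: the same optimization is run first on synthetic scenes and then, with a smaller learning rate, on real EgoBody scans. This is a minimal sketch, not the authors' released code; the model class (Human3DModel), the dataset loaders (SyntheticHumansDataset, EgoBodyDataset), the loss function (segmentation_loss), and the hyperparameter values are all hypothetical placeholders.

import torch
from torch.utils.data import DataLoader

def train(model, loader, epochs, lr):
    # Generic per-point segmentation training loop.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for points, targets in loader:                 # point cloud + per-point labels
            preds = model(points)                      # semantic / instance / body-part predictions
            loss = segmentation_loss(preds, targets)   # hypothetical placeholder loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

model = Human3DModel()  # hypothetical placeholder for the Human3D network

# Stage 1: pre-train on synthetic scenes (up to 10 virtual humans per scene).
synthetic_loader = DataLoader(SyntheticHumansDataset(), batch_size=4, shuffle=True)
train(model, synthetic_loader, epochs=50, lr=1e-4)   # illustrative hyperparameters

# Stage 2: fine-tune on real EgoBody scans (at most two people per scene).
egobody_loader = DataLoader(EgoBodyDataset(split="train"), batch_size=4, shuffle=True)
train(model, egobody_loader, epochs=10, lr=1e-5)     # illustrative hyperparameters

The key point is simply that the same loop runs twice: pre-training on the many-person synthetic scenes removes the two-person bias before the model is adapted to the real data.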
@article{Takmaz22,
  title   = {{3D Segmentation of Humans in Point Clouds with Synthetic Data}},
  author  = {Takmaz, Ay\c{c}a and Schult, Jonas and Kaftan, Irem and Ak\c{c}ay, Mertcan
             and Leibe, Bastian and Sumner, Robert and Engelmann, Francis and Tang, Siyu},
  journal = {arXiv preprint arXiv:2212.00786},
  year    = {2022}
}