We present a machine learning technique to recognize gestures and estimate metric depth of hands for 3D interaction, relying only on monocular RGB video input. We aim to enable spatial interaction with small, body-worn devices where rich 3D input is desired but the usage of conventional depth sensors is prohibitive due to their power consumption and size. We propose a hybrid classification-regression approach to learn and predict a mapping of RGB colors to absolute, metric depth in real time. We also classify distinct hand gestures, allowing for a variety of 3D interactions. We demonstrate our technique with three mobile interaction scenarios and evaluate the method quantitatively and qualitatively.
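To make the idea of a hybrid classification-regression mapping concrete, here is a minimal toy sketch. It is not the method from the paper: it assumes a single hypothetical scalar feature per frame (for example, the apparent hand width in pixels), a nearest-centroid classifier that picks a coarse depth bin, and a per-bin least-squares line that refines the estimate to a metric value. The class name `HybridDepthEstimator`, the bin edges, and the training numbers are all invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (constant fit if xs do not vary)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, float(my)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


class HybridDepthEstimator:
    """Toy estimator: classify a coarse depth bin, then regress depth inside it."""

    def __init__(self, bin_edges):
        self.bin_edges = bin_edges  # hypothetical edges in mm, e.g. [0, 300, 600]
        self.centroids = {}         # bin index -> mean feature value
        self.lines = {}             # bin index -> (slope, intercept)

    def _bin_of(self, depth):
        # Index of the coarse bin that contains this ground-truth depth.
        for i in range(len(self.bin_edges) - 1):
            if depth < self.bin_edges[i + 1]:
                return i
        return len(self.bin_edges) - 2

    def fit(self, features, depths):
        groups = {}
        for f, d in zip(features, depths):
            groups.setdefault(self._bin_of(d), []).append((f, d))
        for b, pairs in groups.items():
            fs = [f for f, _ in pairs]
            ds = [d for _, d in pairs]
            self.centroids[b] = mean(fs)      # classification stage
            self.lines[b] = fit_line(fs, ds)  # regression stage
        return self

    def predict(self, feature):
        # Stage 1: choose the bin whose feature centroid is closest.
        b = min(self.centroids, key=lambda k: abs(self.centroids[k] - feature))
        # Stage 2: evaluate that bin's regression line.
        a, c = self.lines[b]
        return b, a * feature + c


# Invented training data: larger apparent hand width means a closer hand.
features = [100, 110, 120, 130, 200, 220, 240, 260]
depths = [450, 430, 410, 390, 250, 220, 190, 160]

est = HybridDepthEstimator([0, 300, 600]).fit(features, depths)
print(est.predict(125))  # (1, 400.0): far bin, 400 mm
print(est.predict(230))  # (0, 205.0): near bin, 205 mm
```

Splitting the problem this way lets each regressor fit a narrow depth range, which is one common motivation for coarse-to-fine designs. A real system would replace the scalar feature and the nearest-centroid step with image features and a learned classifier.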


Published at

ACM Conference on Human Factors in Computing Systems (CHI), 2015


@inproceedings{Song:2015,
  author    = {Song, Jie and Pece, Fabrizio and Sörös, Gábor and Koelle, Marion and Hilliges, Otmar},
  title     = {Joint Estimation of 3D Hand Position and Gestures from Monocular Video for Mobile Interaction},
  booktitle = {ACM Conference on Human Factors in Computing Systems (CHI)},
  series    = {CHI '15},
  year      = {2015},
  isbn      = {978-1-4503-3145-6},
  location  = {Seoul, Republic of Korea},
  pages     = {3657--3660},
  numpages  = {4},
  url       = {http://doi.acm.org/10.1145/2702123.2702601},
  doi       = {10.1145/2702123.2702601},
  acmid     = {2702601},
  publisher = {ACM},
  address   = {New York, NY, USA},
  keywords  = {gesture recognition, machine learning, mobile interaction}
}