Learning Human Motion Models for Long-term Predictions

Authors: P. Ghosh, J. Song, E. Aksan, O. Hilliges
Publication: In Proceedings of the International Conference on 3D Vision (3DV), Qingdao, China, 2017

This work received a Best Paper Award at 3DV 2017.


Schematic overview of the proposed method. (1) A variant of de-noising autoencoders learns the spatial configuration of the human skeleton: during training, entire joints are removed at random and must be reconstructed by the network. (2) We train a 3-layer LSTM recurrent neural network to predict skeletal configurations over time. (3) At inference time, both components are stacked and the dropout autoencoder filters the noisy predictions of the LSTM layers, preventing accumulation of error and hence pose drift over time.
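The joint-level corruption in step (1) can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, pose shape, and dropout probability are assumptions, but it captures the key idea that entire joints (all coordinates of a joint), rather than individual scalars, are removed and must be reconstructed.

```python
import numpy as np

def drop_joints(pose, num_joints, dims_per_joint=3, drop_prob=0.2, rng=None):
    """Zero out whole joints of a flattened pose vector at random.

    Hypothetical sketch of the DAE training corruption: each joint is
    dropped with probability `drop_prob`, removing all of its coordinates
    at once, and the autoencoder is trained to reconstruct the clean pose.
    """
    rng = np.random.default_rng() if rng is None else rng
    joints = pose.reshape(num_joints, dims_per_joint).copy()
    mask = rng.random(num_joints) < drop_prob   # which joints to remove
    joints[mask] = 0.0                          # remove the entire joint
    return joints.reshape(-1), mask
```

Dropping whole joints forces the network to infer a missing joint's position from the remaining skeleton, which is what lets it later repair structurally implausible poses.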


We propose a new architecture for the learning of predictive spatio-temporal motion models from data alone. Our approach, dubbed the Dropout Autoencoder LSTM (DAE-LSTM), is capable of synthesizing natural looking motion sequences over long time horizons¹ without catastrophic drift or motion degradation. The model consists of two components: a 3-layer recurrent neural network to model temporal aspects, and a novel autoencoder that is trained to implicitly recover the spatial structure of the human skeleton by randomly removing information about joints during training. This Dropout Autoencoder (DAE) is then used to filter each pose predicted by the 3-layer LSTM network, reducing accumulation of correlated error and hence drift over time. Furthermore, to alleviate the insufficiency of commonly used quality metrics, we propose a new evaluation protocol that uses action classifiers to assess the quality of synthetic motion sequences. The proposed protocol can be used to assess the quality of generated sequences of arbitrary length. Finally, we evaluate our proposed method on two of the largest motion-capture datasets available and show that our model outperforms state-of-the-art techniques on a variety of actions, including cyclic and acyclic motion, and that it can produce natural looking sequences over longer time horizons than previous methods.

¹ > 10s for periodic motions, e.g. walking; > 2s for aperiodic motion, e.g. eating
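The inference-time stacking described above can be sketched as a simple autoregressive loop. The stand-in models below are toys (the real components are the trained LSTM and DAE networks), but the control flow is the point: every predicted pose is filtered by the autoencoder before being fed back, so correlated errors are corrected at each step instead of accumulating into drift.

```python
import numpy as np

def synthesize(seed_pose, lstm_step, dae_filter, horizon):
    """Autoregressive rollout sketch: predict, filter, feed back.

    `lstm_step` and `dae_filter` are placeholders for the trained
    temporal model and dropout autoencoder, respectively.
    """
    poses = [seed_pose]
    for _ in range(horizon):
        noisy = lstm_step(poses[-1])   # temporal prediction (may drift)
        clean = dae_filter(noisy)      # project back toward valid poses
        poses.append(clean)
    return np.stack(poses)
```

For example, with a noisy identity map as the "LSTM" and a clipping function as the "DAE", the rollout stays bounded, whereas feeding the noisy predictions back directly would let the error compound step after step.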



@article{ghosh2017learning,
      title={Learning Human Motion Models for Long-term Predictions},
      author={Ghosh, Partha and Song, Jie and Aksan, Emre and Hilliges, Otmar},
      journal={arXiv preprint arXiv:1704.02827},
      year={2017}
}