In summary, our main contributions are: (A) a structured model that captures the inherent consistency of human poses in video sequences based on a loopy spatio-temporal graph; (B) an efficient and flexible inference layer that performs message passing along the spatial and temporal graph edges and significantly reduces joint position uncertainty; and (C) an architecture that integrates ConvNet-based joint regressors and a high-level structured inference model in a unified framework that can be optimized in an end-to-end manner.
</p>
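To make the idea of message passing over a loopy spatio-temporal graph concrete, the following is a minimal sketch and not the inference layer of the paper. It assumes each node is a (frame, joint) pair holding a 1D distribution over candidate positions, uses a fixed box-blur as a stand-in for a learned pairwise term, and applies a simplified synchronous belief update rather than exact sum-product. The names `build_edges`, `smooth`, and `message_passing` are hypothetical and chosen for illustration only.

```python
def build_edges(num_frames, skeleton):
    """Edges of a spatio-temporal graph whose nodes are (frame, joint) pairs."""
    num_joints = 1 + max(max(a, b) for a, b in skeleton)
    edges = []
    for t in range(num_frames):
        # Spatial edges: connected joints within the same frame.
        edges += [((t, a), (t, b)) for a, b in skeleton]
        # Temporal edges: the same joint in consecutive frames.
        if t + 1 < num_frames:
            edges += [((t, j), (t + 1, j)) for j in range(num_joints)]
    return edges


def normalize(p):
    total = sum(p)
    return [x / total for x in p]


def smooth(p, width=1):
    """Box-blur a 1D distribution (assumed stand-in for a pairwise kernel)."""
    n = len(p)
    return [sum(p[max(0, i - width):min(n, i + width + 1)]) for i in range(n)]


def message_passing(unary, edges, num_iters=3, eps=1e-6):
    """Refine per-node beliefs by multiplying in smoothed neighbor beliefs."""
    neighbors = {node: [] for node in unary}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    beliefs = {node: normalize(u) for node, u in unary.items()}
    for _ in range(num_iters):
        updated = {}
        for node, u in unary.items():
            b = list(u)
            for other in neighbors[node]:
                msg = smooth(beliefs[other])
                b = [x * (m + eps) for x, m in zip(b, msg)]
            updated[node] = normalize(b)
        beliefs = updated  # synchronous update of all nodes
    return beliefs


# Toy example: 3 frames, 2 joints, 7 candidate positions per joint.
# Every node is confident about position 3 except joint 0 in frame 1,
# whose unary is uniform (for instance because the joint is occluded).
peaked = [0.01, 0.01, 0.1, 1.0, 0.1, 0.01, 0.01]
unary = {(t, j): list(peaked) for t in range(3) for j in range(2)}
unary[(1, 0)] = [1.0] * 7
edges = build_edges(3, [(0, 1)])
beliefs = message_passing(unary, edges)
best = max(range(7), key=lambda i: beliefs[(1, 0)][i])
```

In this toy setting the uncertain node borrows evidence from its spatial neighbor and its two temporal neighbors, so its belief concentrates around the position its neighbors agree on. A real model would use 2D heatmaps and pairwise terms that encode joint offsets, which this sketch deliberately omits.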
Deep ConvNets have been shown to be effective for the
task of human pose estimation from single images. However, several challenging issues arise in the video-based
case such as self-occlusion, motion blur, and uncommon
poses with few or no examples in the training data. Temporal information can provide additional cues about the
location of body joints and help to alleviate these issues.
In this paper, we propose a deep structured model to estimate a sequence of human poses in unconstrained videos.
This model can be efficiently trained in an end-to-end manner and is capable of representing the appearance of body
joints and their spatio-temporal relationships simultaneously. Domain knowledge about the human body is explicitly incorporated into the network, providing effective priors
to regularize the skeletal structure and to enforce temporal
consistency. The proposed end-to-end architecture is evaluated on two widely used benchmarks for video-based pose
estimation (Penn Action and JHMDB datasets). Our approach outperforms several state-of-the-art methods.
Abstract
Thin-slicing network: A deep structured model for pose estimation in videos
Jie Song,
Limin Wang,
Luc Van Gool,
Otmar Hilliges