VNU Journal of Science: Comp. Science & Com. Eng., Vol. 35, No. 1 (2019) 1–10
Learning and transferring motion style using Sparse PCA
Do Khac Phong1∗, Nguyen Xuan Thanh1, Hongchuan Yu2
1Faculty of Information Technology, VNU University of Engineering and Technology,
No. 144 Xuan Thuy Street, Dich Vong Ward, Cau Giay District, Hanoi, Vietnam
2National Centre for Computer Animation, Bournemouth University,
Talbot Campus, Fern Barrow, Poole, Dorset, BH12 5BB, United Kingdom
Abstract
Motion style transfer is a primary problem in computer animation: it allows the motion of one actor to be
converted to that of another. Myriad approaches have been developed to perform this task. However, the majority
of them are data-driven and require a large dataset and a time-consuming training period in order to achieve
good results. In contrast, we propose a novel method that can be applied successfully to this task on a small
dataset. It exploits Sparse PCA to decompose the original motions into smaller components, which are then
learned under particular constraints. As shown in our experiments, the synthesized results are precise, smooth
motions that carry the intended emotion.
Received 07 May 2018, Revised 03 December 2018, Accepted 29 December 2018
Keywords: Sparse PCA, style learning, motion style transfer
1. Introduction
The precise, automatic stylization of human motion to express a state of mind
or identity plays a vital role in realistic humanoid animation. Motion
style transfer is a primary problem in computer animation, allowing us to
convert the motion of an actor to that of another character. Such characters
can express emotions such as happiness, sadness, or joy. Previously,
generating a large variety of motion data was manual work that took a great
deal of time, so automating this process is useful and essential for
applications such as films and computer games.
∗ Corresponding author. Email: phongdk@vnu.edu.vn
https://doi.org/10.25073/2588-1086/vnucsce.206
Many approaches have been developed for this style transfer task. Hsu et al. [1]
introduced a linear time-invariant (LTI) model for stylizing homogeneous human
motion, e.g. walking. For heterogeneous motion, Xia et al. [2] proposed a method
based on temporally local nearest-neighbor blending in spatial-temporal space.
More recently, along with the rapid development of deep learning techniques,
neural style transfer for images was introduced by Gatys and his colleagues [3].
Inspired by their work, Holden et al. [4] adapted it and used a deep neural
network to transform the style of motion data. Nevertheless, these data-driven
methods require a large amount of training data and manual alignment, which
typically makes them time-consuming.
In this paper, our goal is to build a framework for rapid style transfer
that eliminates the training time. To do this, we first decompose the original
motions into smaller factors: weights and components. A style transfer algorithm
is then performed to generate transferred components, followed by a motion
reconstruction step that satisfies the constraints with respect to the content,
style and bone length of a target character.
To summarize, our main contributions are:
(a) We propose a novel, fast and effective model
to transfer motion style based on matrix
factorization.
(b) Our model can be applied to small
datasets.
2. Related Work
2.1. Matrix factorization
Matrix factorization is a technique that factorizes a single matrix into a
product of matrices. It can be understood as a way to find a new
representation of the data with much lower dimension, i.e. as dimensionality
reduction. In particular, Principal Component Analysis (PCA) is a popular
method that decomposes multivariate data into a set of orthogonal components.
In other words, PCA represents each principal component as a linear
combination of the original variables such that the derived variables capture
maximal variance [5]. Nevertheless, the coefficients of all variables are
typically nonzero, which makes the derived principal components difficult to
interpret. Such global effects are not essential in some circumstances. For
example, in face decomposition, the extracted sparse components should
correspond to local parts such as an eye or a nose.
Sparse PCA, in contrast, is a variant of PCA that produces localized
components [5], [6] by introducing a sparsity-inducing norm such as the
$\ell_1$ norm. Such methods exploit a localized set of variables and have
therefore been applied successfully in computer vision, medical imaging and
signal processing. Neumann et al. [7] extended Sparse PCA to animation
processing by adding a local support map, which is suitable for surface
deformations of, for instance, faces or muscles. Localized components are
appropriate for motion-emotion data, where the state of mind is shown via
actions and each action is associated with several body parts. Such work
inspired us to use Sparse PCA in this paper.
2.2. Correlation, Covariance and Gram
matrix
Suppose we have a set of centered column vectors $X_i \in \mathbb{R}^{m \times 1}$,
$i = 1, \dots, n$, forming a matrix $X = [X_1, X_2, \dots, X_n] \in \mathbb{R}^{m \times n}$
with $m > n$. A Singular Value Decomposition (SVD) of $X$ expresses it as
$X = U D V^T$, where $D \in \mathbb{R}^{k \times k}$ is a diagonal matrix whose
positive diagonal entries are the singular values of $X$, and
$U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ have orthonormal
columns (so that $V^T$ is $k \times n$).
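The covariance and correlation matrices of $X$, denoted $\Sigma_X$ and $r_X$
respectively, are computed by the following formulas:
$$\Sigma_X = E[XX^T] = \frac{1}{m-1} XX^T = \frac{1}{m-1} U D^2 U^T \tag{1}$$
$$r_X = (\mathrm{diag}(\Sigma_X))^{-1/2} \, \Sigma_X \, (\mathrm{diag}(\Sigma_X))^{-1/2} \tag{2}$$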
where $\mathrm{diag}(\Sigma_X)$ is the matrix of the diagonal elements of
$\Sigma_X$. The correlation matrix can be seen as the covariance matrix of the
standardized $X_i$. Meanwhile, the Gram matrix of $X$, denoted
$\mathrm{Gram}(X)$, is calculated as:
$$\mathrm{Gram}(X) = X^T X = V D^2 V^T \tag{3}$$
As can be seen from Eq. (1) and Eq. (3), the Gram matrix and the covariance
matrix share the same nonzero eigenvalues up to the factor $(m - 1)$.
Therefore, minimizing the difference between two matrices through their
covariance or correlation matrices is equivalent to optimizing their Gram
matrices. That is the reason why many techniques, e.g. Multi-Dimensional
Scaling and Kernel PCA, use the Gram matrix instead of the covariance matrix
to compute the principal components [8], [9] in the case of $m \gg n$.
Additionally, Gatys et al. [3] exploited the Gram matrix to compute feature
correlations in the style representation of an image in order to transfer its
style to other images.
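The shared spectrum of the two matrices can be checked numerically. The following is a minimal NumPy sketch on random data; the sizes, the seed and all variable names are illustrative assumptions and are not taken from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 8                                  # illustrative sizes with m > n
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)            # center each variable across the n columns

outer = X @ X.T                               # m x m, proportional to the covariance matrix
gram = X.T @ X                                # n x n Gram matrix

singular_values = np.linalg.svd(X, compute_uv=False)   # returned in descending order
outer_eigs = np.linalg.eigvalsh(outer)[::-1][:n]       # n largest eigenvalues of X X^T
gram_eigs = np.linalg.eigvalsh(gram)[::-1]             # all n eigenvalues of X^T X

print(np.allclose(outer_eigs, gram_eigs, atol=1e-8))
print(np.allclose(gram_eigs, singular_values ** 2, atol=1e-8))
```

The remaining m - n eigenvalues of the larger matrix are zero, which is why working with the smaller n x n Gram matrix loses nothing when m is much larger than n.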
2.3. Motion Style transfer
Human motion expresses the action it embodies, while a considerable component
of any natural human act is the style of that action. Furthermore, the style
and emotion of a motion are more likely to convey meaningful information than
the underlying motion itself. The accurate stylization of human motion to
express mood or identity plays a key role in realistic humanoid animation.
Such work benefits a wide range of applications, especially virtual games and
films, since it avoids capturing an enormous number of all possible actions
and styles [2], [10].
Many solutions have been proposed for human motion stylization, and most of
them are data-driven techniques. A linear time-invariant model was proposed by
Hsu et al. [1] to encode style differences between motions, and the learned
model was used to transfer motion from one style to another in real time.
However, this method was developed for homogeneous human behaviors, e.g.
walking or kicking. Xia et al. [2] introduced a series of local mixtures of
autoregressive models to capture the complex relationships between styles of
heterogeneous motions (walking ⇒ running ⇒ jumping), together with a search
scheme to find appropriate style candidates in a huge motion dataset.
Yumer et al. [10] developed a method in which the style transfer task is
performed in the frequency domain. Nevertheless, their method requires a
costly search step to find the best candidates for a source style and a
reference style in the spectral domain from an available database. Holden et
al. [4] applied a convolutional autoencoder network and performed style
transfer over the hidden unit values of the network to generate a motion that
has the content of one input but the style of another. A large motion database
collected from many different motion capture sources (CMU, Xia et al. [2],
etc.) had to be converted into a suitable format for training the neural
network, which is a time-consuming process. These limitations motivated us to
find a new strategy that can be applied to a small stylized motion dataset.
3. Methodology
The architecture of our model for transferring the style of a motion $M_S$ to
a target (content) motion $M_C$ is shown in Fig. 1. Each motion
$M \in \mathbb{R}^{F \times 3N}$ consists of a mesh animation with $F$ frames, in
which each frame $f$ is a pose with $N$ joint positions in 3D. First, the
corresponding components $C_S$ (hidden style representation) and $C_C$ (hidden
content representation) of the two input motions are extracted in the
decomposition step. In the style transfer step, a white noise component $C_X$
is adjusted so that it matches both components $C_S$ and $C_C$. Finally, the
new motion $M_X$ is composed from the mixed components $\tilde{C}_X$ and the
target weights $W_C$.
3.1. Decomposition
Given a motion $M \in \mathbb{R}^{F \times 3N}$, we seek an appropriate matrix
factorization technique to decompose $M$ into $K$ deformation components $C$
with weights $W$:
$$M_{F \times 3N} = W_{F \times K} \, C_{K \times 3N} \tag{4}$$
The matrix $W$, with one dimension equal to $F$, is assumed to hold the
time-variant data, while the matrix $C$ contains the coordinates of $K$ basic
motions (see Section 3.3 for more details). Depending on the regularization
term added to Eq. (4), there are many different solutions for $W$ and $C$. In
PCA, this constraint is the orthogonality of the components, $C C^T = I$ (the
rows of $C$ are orthonormal). On the other hand, by imposing a
sparsity-inducing norm such as the $\ell_1$ norm, sparse components can be
obtained, as in Sparse PCA [5]. The matrix factorization then turns into a
joint regularized minimization problem:
$$\arg\min_{W, C} \; \|M - W C\|_F^2 + \Omega(C) \quad \text{s.t.} \quad \phi(W) \tag{5}$$
where $\Omega(C)$ is the regularizer on the components and $\phi(W)$ denotes a
constraint on the weights.
Each joint $i$ in component $k$ is identified by a coordinate triplet
$j_k^{(i)} = [x, y, z]_k^{(i)}$. Regularizing $C$ with the $\ell_1$ norm would
induce sparsity, but it would ignore this group structure and eliminate each
dimension separately. Consequently, the $\ell_1/\ell_2$ norm is used so that
the three dimensions of a joint vanish simultaneously [7], [11], [12]:
$$\Omega(C) = \sum_{k=1}^{K} \sum_{i=1}^{N} \left\| j_k^{(i)} \right\|_2 \tag{6}$$
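To make the group structure concrete, the following minimal NumPy sketch evaluates the group norm of Eq. (6) on a tiny hand-written component matrix and compares it with the plain l1 norm. The function name and the column layout (x1, y1, z1, x2, ...) are assumptions made for illustration.

```python
import numpy as np

def group_l1_l2_norm(C, n_joints):
    """Sum of the Euclidean norms of the per-joint (x, y, z) triplets in C."""
    K = C.shape[0]
    triplets = C.reshape(K, n_joints, 3)      # assumes columns ordered x1, y1, z1, x2, ...
    return np.linalg.norm(triplets, axis=2).sum()

# Tiny example with K = 2 components and N = 2 joints.
C = np.array([[3.0, 4.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 2.0, 2.0]])
omega = group_l1_l2_norm(C, n_joints=2)       # triplet norms are 5, 0, 0 and 3
l1 = np.abs(C).sum()                          # plain l1 norm, for comparison
print(omega, l1)
```

The group norm penalizes a joint as a whole, so a joint is either active in a component with all three coordinates or switched off entirely.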
The direct optimization of Eq. (5) is complicated because the problem is
non-convex. However, when either $W$ or $C$ is fixed, the problem becomes
convex and can be solved easily by an iterative refinement method that
alternates between the two optimization tasks [7], [13].
Figure 1. Our framework for motion style transfer.
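The alternating scheme can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the solver used in our experiments: the constraint on $W$ is omitted, the $W$-step is an unconstrained least-squares solve, the $C$-step is a single proximal gradient step with group soft-thresholding, and the weight `lam` on the regularizer as well as all function names are hypothetical.

```python
import numpy as np

def group_soft_threshold(C, n_joints, tau):
    """Proximal operator of tau * Omega(C): shrink each (x, y, z) triplet toward zero."""
    K = C.shape[0]
    triplets = C.reshape(K, n_joints, 3)
    norms = np.linalg.norm(triplets, axis=2, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return (triplets * scale).reshape(K, -1)

def objective(M, W, C, n_joints, lam):
    """Squared Frobenius fit term plus the weighted group norm."""
    omega = np.linalg.norm(C.reshape(C.shape[0], n_joints, 3), axis=2).sum()
    return np.linalg.norm(M - W @ C) ** 2 + lam * omega

def alternate(M, K, n_joints, lam=0.1, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((M.shape[0], K))
    C = rng.standard_normal((K, M.shape[1]))
    history = [objective(M, W, C, n_joints, lam)]
    for _ in range(n_iters):
        # W-step: with C fixed, the fit term is a least-squares problem in W.
        W = np.linalg.lstsq(C.T, M.T, rcond=None)[0].T
        # C-step: with W fixed, take one proximal gradient step on C.
        lipschitz = 2.0 * np.linalg.norm(W, 2) ** 2
        step = 1.0 / max(lipschitz, 1e-12)
        grad = 2.0 * W.T @ (W @ C - M)
        C = group_soft_threshold(C - step * grad, n_joints, step * lam)
        history.append(objective(M, W, C, n_joints, lam))
    return W, C, history

rng = np.random.default_rng(1)
M = rng.standard_normal((40, 3 * 5))          # toy motion: F = 40 frames, N = 5 joints
W, C, history = alternate(M, K=4, n_joints=5)
print(history[0], history[-1])
```

Each sub-step can only lower the objective, so the recorded values are non-increasing. Without a constraint on $W$ the scale can drift between the two factors, which is one reason a constraint on the weights appears in Eq. (5).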
3.2. Style Learning
To learn a new style for the target motion, we transform the basic motions
$C_C$ towards $C_S$, which results in the stylized components $\tilde{C}_X$.
In other words, $C_X$ is matched with both the content representation and the
style representation.
3.2.1. Content
Since the transferred motion is expected to contain the content of the target
motion $M_C$, the difference between the content representation of the target
motion and that of the transferred one is taken as the content loss of our
model:
$$L_{content} = c \, \|C_C - C_X\|^2 \tag{7}$$
where the user-defined weight $c$ is set to 1.0 in our experiments.
3.2.2. Style
In order to transfer the style of the input motion $M_S$ to the content motion
$M_C$, the style loss is defined as the difference between the style
representation of the input style motion and that of the transferred one. It
is scaled by a user-specified weight $s$ ($s = 0.01$ in our experiments) as
follows:
$$L_{style} = s \, \|\mathrm{Gram}(C_S) - \mathrm{Gram}(C_X)\|^2 \tag{8}$$
where the Gram matrix computes the inner products of the element values in the
components $C$ across the basic motions, with $C_i$ the $i$-th basic motion
(row of $C$):
$$\mathrm{Gram}(C) = \sum_{i=1}^{K} C_i^T C_i \tag{9}$$
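The two losses can be written in a few lines of NumPy. The sketch below assumes that the norms in Eq. (7) and Eq. (8) are Frobenius norms and uses random matrices in place of real components; the function names are illustrative.

```python
import numpy as np

def gram(C):
    """Gram(C): the sum over rows C_i of the outer products C_i^T C_i."""
    return C.T @ C

def content_loss(C_C, C_X, c=1.0):
    return c * np.sum((C_C - C_X) ** 2)

def style_loss(C_S, C_X, s=0.01):
    return s * np.sum((gram(C_S) - gram(C_X)) ** 2)

rng = np.random.default_rng(0)
K, N = 4, 3                                   # toy sizes
C_C = rng.standard_normal((K, 3 * N))
C_S = rng.standard_normal((K, 3 * N))

# The row-wise sum of outer products equals the matrix product C^T C.
row_sum = sum(np.outer(row, row) for row in C_S)
print(np.allclose(row_sum, gram(C_S)))
print(content_loss(C_C, C_C), style_loss(C_S, C_S))
```

Because the Gram matrix sums over the basic motions, it is a 3N x 3N matrix that no longer depends on the order of the components, which is what makes it a style statistic rather than a content descriptor.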
3.2.3. Constraint
Although a stylized motion can be obtained by multiplying the content weights
$W_C$ with the stylized components $C_X$ found by optimizing Eq. (7) and
Eq. (8), this does not guarantee that the composed motion keeps a human body
form. Consequently, an additional loss function related to human bone length
is used as a constraint on the body. Suppose that each bone $b$ in the
transformed motion $M_X$ connects two joints $j_1$ and $j_2$, so that its
length is the distance in 3D space between their coordinates $p_{b j_1}$ and
$p_{b j_2}$. Given a target length $l_b$, the bone length constraint takes the
form:
$$L_{bone} = \sum_{b} \left| \left\| p^{M_X}_{b j_1} - p^{M_X}_{b j_2} \right\| - l_b \right|^2 \tag{10}$$
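A minimal NumPy sketch of this constraint is given below. It assumes that the loss is also summed over all frames, that joint coordinates are stored as x1, y1, z1, x2, ... in each row, and that bones are given as pairs of joint indices; the skeleton in the example is hypothetical.

```python
import numpy as np

def bone_length_loss(M_X, bones, target_lengths):
    """Squared deviation of each bone length from its target, summed over bones and frames."""
    F = M_X.shape[0]
    joints = M_X.reshape(F, -1, 3)                 # assumes columns x1, y1, z1, x2, ...
    j1 = joints[:, [a for a, _ in bones], :]
    j2 = joints[:, [b for _, b in bones], :]
    lengths = np.linalg.norm(j1 - j2, axis=2)      # shape (F, number of bones)
    return np.sum((lengths - np.asarray(target_lengths)) ** 2)

# Hypothetical 3-joint chain (joint 0 -> joint 1 -> joint 2) in a single frame.
pose = np.array([[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 2.0]])
bones = [(0, 1), (1, 2)]
matching = bone_length_loss(pose, bones, target_lengths=[1.0, 2.0])   # lengths equal targets
stretched = bone_length_loss(pose, bones, target_lengths=[1.0, 1.0])  # second bone 1 unit too long
print(matching, stretched)
```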
The stylized components $C_X$ are first initialized from white noise.
Afterwards, they are adjusted via stochastic gradient descent, with
derivatives computed automatically by Theano, until the following total loss
converges to a given threshold:
$$L_{total} = L_{content} + L_{style} + L_{bone} \tag{11}$$
To speed up the learning of the stylized components, Adam [14] is used for
stochastic optimization in our experiments.
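The optimization loop can be illustrated without Theano. The sketch below is a simplified stand-in, not our implementation: it uses NumPy with hand-derived gradients, includes only the content and style terms (the bone-length term is left out for brevity), assumes Frobenius norms, and runs a fixed number of Adam steps instead of testing a convergence threshold. All names and hyperparameter values other than c and s are assumptions.

```python
import numpy as np

def loss_and_grad(C_X, C_C, gram_S, c=1.0, s=0.01):
    """Content + style loss and its gradient with respect to C_X."""
    diff = C_X - C_C
    gram_diff = C_X.T @ C_X - gram_S               # symmetric 3N x 3N matrix
    loss = c * np.sum(diff ** 2) + s * np.sum(gram_diff ** 2)
    grad = 2.0 * c * diff + 4.0 * s * C_X @ gram_diff
    return loss, grad

def stylize(C_C, C_S, n_steps=300, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    C_X = rng.standard_normal(C_C.shape)           # white-noise initialization
    gram_S = C_S.T @ C_S
    m = np.zeros_like(C_X)                         # first-moment estimate
    v = np.zeros_like(C_X)                         # second-moment estimate
    losses = []
    for t in range(1, n_steps + 1):
        loss, grad = loss_and_grad(C_X, C_C, gram_S)
        losses.append(loss)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)               # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        C_X = C_X - lr * m_hat / (np.sqrt(v_hat) + eps)
    return C_X, losses

rng = np.random.default_rng(1)
C_C = rng.standard_normal((3, 6))                  # toy content components
C_S = rng.standard_normal((3, 6))                  # toy style components
C_X, losses = stylize(C_C, C_S)
print(losses[0], losses[-1])
```

In a full version, the bone-length term would be evaluated on the composed motion and its gradient propagated back to $C_X$ through the fixed weights, which is where automatic differentiation becomes convenient.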
3.3. Composition
Since the original motion is decomposed into $K$ basic motions, synthesizing a
motion is simply the inverse process. The third row of Fig. 2 shows the motion
reconstructed from the first two basic motions (the first two rows of the
matrix $C$ with $K = 30$), which approximates about 70% of the content of the
original one. With the first four basic motions (bottom row), the majority of
the content motion is preserved, although the result is not as smooth as the
original. Our purpose is to retain the personality of the content motion, so
the target weight matrix $W_C$ is kept unchanged and taken as input to the
composition process, along with the transferred components $\tilde{C}_X$
output by the style transfer step. The transferred motion $M_X$ is then simply
reconstructed as:
$$M_X = W_C \, \tilde{C}_X \tag{12}$$
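Reconstruction from a subset of basic motions can be sketched as follows. For brevity this example factorizes a random stand-in motion with a plain truncated SVD rather than Sparse PCA, so the printed errors illustrate the mechanism only and do not reproduce the figures reported above; the sizes mirror those of our experiments and the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N, K = 200, 23, 30                         # frames, joints, components
M = rng.standard_normal((F, 3 * N))           # random stand-in for a real motion

# Plain SVD factorization M ~ W C, used here instead of Sparse PCA.
U, d, Vt = np.linalg.svd(M, full_matrices=False)
W = U[:, :K] * d[:K]                          # weights, F x K
C = Vt[:K, :]                                 # basic motions, K x 3N

def reconstruct(W, C, k):
    """Compose a motion from the first k basic motions."""
    return W[:, :k] @ C[:k, :]

errors = {}
for k in (1, 2, 4, K):
    errors[k] = np.linalg.norm(M - reconstruct(W, C, k)) / np.linalg.norm(M)
    print(k, round(errors[k], 3))
```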
4. Experimental Results
4.1. Dataset
We use the freely available Emotional Body Motion Database, which consists
of 1447 files in BVH format [15], [16]. However, we keep only the 323 files
for which two fields agree: ‘Intended emotion’ and ‘Perceived category’.
The former represents the emotion the actor intended to convey, whereas the
latter shows the emotion chosen by most of the observers. Eight participants
(four men and four women) freely expressed 11 emotion categories, namely
amusement, anger, disgust, fear, joy, neutral, pride, relief, sadness, shame,
and surprise, via their entire body, face, and voice. Our study, however,
focuses on their skeleton motion only. In addition, we observed that the
emotion is expressed mostly through upper body movement.
All of the motions in the database are downsampled from 120 frames per second
(fps) to 60 fps and converted from the rotational representation of the
original dataset into 3D joint positions. The origin is placed on the ground
at the projection of the root position. In addition, the joint positions are
expressed in the body’s local coordinate system. Finally, we subtract the mean
pose from the data and divide by their own standard deviation, resulting in
motion data with zero mean and unit standard deviation. Each pose is
represented by 23 joint positions, giving 69 degrees of freedom (DOF) in
total. Although the motion duration could in general be either variable or
fixed, the motions in our experiments have similar lengths and last for about
200 frames.
Figure 2. Reconstruction using several basic motions (K=30). Rows from top to bottom: Original; 1st; 1st & 2nd; 1st to 4th.
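The downsampling and standardization steps can be sketched as follows. The sketch assumes that downsampling keeps every second frame and that the mean and standard deviation are computed per coordinate over a single motion; the BVH parsing and the conversion from rotations to positions are not shown, and the input is a random stand-in.

```python
import numpy as np

def preprocess(positions_120fps):
    """Downsample from 120 fps to 60 fps and standardize each of the 3N coordinates."""
    motion = positions_120fps[::2]                 # keep every second frame (assumed scheme)
    mean_pose = motion.mean(axis=0)
    std = motion.std(axis=0)
    std = np.where(std < 1e-8, 1.0, std)           # avoid dividing by zero for static coordinates
    return (motion - mean_pose) / std, mean_pose, std

rng = np.random.default_rng(0)
raw = rng.standard_normal((400, 69)) * 5.0 + 2.0   # stand-in: 400 frames, 23 joints x 3
motion, mean_pose, std = preprocess(raw)
print(motion.shape)
```

Returning the mean pose and the standard deviation makes it possible to map a synthesized motion back to joint positions, which is needed before bone lengths can be measured.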
4.2. Results
4.2.1. Stylization
In this section, we demonstrate some results of our approach. As can be seen
from Fig. 3, the action of the first character expresses a Pride mood, whilst
the behavior of the second one expresses Disgust. The motion transferred using
Sparse PCA retains the personality of the former but takes on the style of the
latter. Consequently, we obtain a new motion in a Disgust mood. We also take
into account the effect of the parameter $K$ in our experiments. For $K = 10$,
the left hand folds too tightly, and the result looks less similar to the
input style figure than those with $K = 30$ or $K = 50$. Meanwhile, the spine
in some frames for $K = 50$ is not as straight as in the corresponding frames
for $K = 30$. This suggests that retaining too few deformation components
loses more information and makes the transferred motion less natural. A
similar outcome is shown in Fig. 4, where the new motion is synthesized from
two motions in Surprise and Anger moods. The stylized motion produced by our
model behaves as if the character were angry, and the most similar result is
obtained with $K = 30$.
4.2.2. Sparsity
Fig. 3 and Fig. 4 demonstrate the effects of sparse decomposition as well.
When the style components are learned with PCA, the style of the synthesized
motion is not transferred precisely, in contrast to the stylized motion
produced by our method using Sparse PCA. This indicates that localized
components are better suited than global components to this animation task.
In addition, the group structure imposed by Eq. (6) causes the dimensions of a
joint to be modified simultaneously, which contributes to spatial
preservation.
4.2.3. Constraint
The advantage of the bone length constraint is demonstrated explicitly in
Fig. 4. Without $L_{bone}$, the left hand of the synthesized motion is longer
than that of the Surprise one. Conversely, a shorter left hand appears in
Fig. 3 when compared with the target motion in the Pride mood. With the
constraint, the difference in bone length between the target and the
synthesized motion is minimized during the iterative process of learning
components and reconstructing motion, so that the stylized motion finally
captures the human body form of the target.
Figure 3. Animations are generated in time series. Blue: input style motion (Disgust). Green: input content motion
(Pride). Black: output transferred motion. Green circles/ellipses are invalid shapes. Rows from top to bottom: Style;
Content; PCA; No Lbone; K=10; K=30; K=50. The last four rows used Sparse PCA.
5. Conclusions
In this paper, we introduced a novel algorithm for the motion style transfer
task. Our method integrates matrix factorization with an artistic style
learning technique. It can cope with the shortage of large motion datasets,
since it can be applied to small ones. Despite some promising outcomes,
several limitations remain:
1. The number of deformation components
(K) has to be defined in advance.
2. The tuning parameters s and c are
user-specified. They represent a trade-off
between the style and the content we want
to transfer to a new motion.
3. The difference in style representation
between the two motions is minimized during
the learning period, but the best style is
in fact unknown.
4. The velocity and acceleration of the
human body, which are vital properties for
making motion smooth and natural, are
omitted in this study.
These limitations will be addressed in our future work.
Figure 4. Animations are generated in time series. Blue: input style motion (Anger). Green: input content motion
(Surprise). Black: output transferred motion. Green circles/ellipses are invalid shapes. Rows from top to bottom: Style;
Content; PCA; No Lbone; K=10; K=30; K=50. The last four rows used Sparse PCA.
Acknowledgments
We would like to thank the anonymous
reviewers for their detailed comments on
our paper. This work was supported by
EU H2020 project-AniAge (No.691215),
and by the project named “Multimedia
application tools for intangible cultural
heritage conservation and promotion”
(No. DTDL.CN-34/16). The emotional
body motion database was provided by
the Max-Planck Institute for Biological
Cybernetics in Tuebingen, Germany.
References
[1] E. Hsu, K. Pulli, J. Popović, Style translation
for human motion, in: ACM Transactions on
Graphics (TOG), Vol. 24, ACM, 2005, pp.
1082–1089.
[2] S. Xia, C. Wang, J. Chai, J. Hodgins, Realtime
style transfer for unlabeled heterogeneous human
motion, ACM Transactions on Graphics (TOG)
34 (4) (2015) 119.
[3] L. A. Gatys, A. S. Ecker, M. Bethge, A
neural algorithm of artistic style, arXiv preprint
arXiv:1508.06576.
[4] D. Holden, J. Saito, T. Komura, A deep learning
framework for character motion synthesis and
editing, ACM Transactions on Graphics (TOG)
35 (4) (2016) 138.
[5] H. Zou, T. Hastie, R. Tibshirani, Sparse principal
component analysis, Journal of Computational
and Graphical Statistics 15 (2) (2006) 265–286.
[6] I. T. Jolliffe, N. T. Trendafilov, M. Uddin, A
modified principal component technique based
on the lasso, Journal of Computational and
Graphical Statistics 12 (3) (2003) 531–547.
[7] T. Neumann, K. Varanasi, S. Wenger, M. Wacker,
M. Magnor, C. Theobalt, Sparse localized
deformation components, ACM Transactions on
Graphics (TOG) 32 (6) (2013) 179.
[8] T. F. Cox, M. A. Cox, Multidimensional scaling,
CRC press, 2000.
[9] J. Shawe-Taylor, C. K. Williams, N. Cristianini,
J. Kandola, On the eigenspectrum of the gram
matrix and the generalization error of kernel-pca,
IEEE Transactions on Information Theory 51 (7)
(2005) 2510–2522.
[10] M. E. Yumer, N. J. Mitra, Spectral style transfer
for human motion between independent actions,
ACM Transactions on Graphics (TOG) 35 (4)
(2016) 137.
[11] F. Bach, R. Jenatton, J. Mairal, G. Obozinski,
et al., Optimization with sparsity-inducing
penalties, Foundations and Trends® in Machine
Learning 4 (1) (2012) 1–106.
[12] S. J. Wright, R. D. Nowak, M. A.
Figueiredo, Sparse reconstruction by separable
approximation, IEEE Transactions on Signal
Processing 57 (7) (2009) 2479–2493.
[13] J. Mairal, F. Bach, J. Ponce, G. Sapiro,
Online dictionary learning for sparse coding,
in: Proceedings of the 26th annual international
conference on machine learning, ACM, 2009, pp.
689–696.
[14] D. Kingma, J. Ba, Adam: A method
for stochastic optimization, arXiv preprint
arXiv:1412.6980.
[15] E. Volkova, S. De La Rosa, H. H. Bülthoff,
B. Mohler, The mpi emotional body expressions
database for narrative scenarios, PloS one 9 (12)
(2014) e113647.
[16] E. P. Volkova, B. J. Mohler, T. J. Dodds, J. Tesch,
H. H. Bülthoff, Emotion categorization of body
expressions in narrative scenarios, Frontiers in
psychology 5.