We would like to give a general overview of our first results in applying deep learning to character animation in our Cascadeur program.
While working on Shadow Fight 3, we accumulated a lot of combat animation - about 1,100 movements with an average duration of about 4 seconds. We had long thought that this could be a good dataset for training some kind of neural network.
At some point we noticed that when animators make their first sketches of ideas on paper, a simple stick figure is enough for them to imagine the character's pose. We reasoned that if an experienced animator can read a pose from such a simple drawing, a neural network might well be able to do the same. From this observation, a simple idea was born: let's take only 6 key points from each pose - the wrists, the ankles, the pelvis and the base of the neck. If the neural network knows only the positions of these points, can it predict the rest of the pose - the positions of the 37 remaining points of the character?
How to set up training was clear from the very beginning: the network receives the positions of the 6 points of a specific pose as input, outputs the positions of the remaining 37 points, and we compare them with the positions in the original pose. As the loss function, you can use the sum of squared distances between the predicted positions of the points and the original ones.
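A loss of the kind described above could be sketched in Python roughly as follows. This is only an illustration: the function name `pose_loss` and the toy two-point pose are our assumptions here, not code from Cascadeur.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def pose_loss(predicted: List[Point], target: List[Point]) -> float:
    """Sum of squared Euclidean distances between matching points."""
    if len(predicted) != len(target):
        raise ValueError("poses must contain the same number of points")
    total = 0.0
    for (px, py, pz), (tx, ty, tz) in zip(predicted, target):
        total += (px - tx) ** 2 + (py - ty) ** 2 + (pz - tz) ** 2
    return total


# Toy example with 2 points instead of the 37 predicted points.
target_pose = [(0.0, 1.0, 0.0), (0.5, 1.5, 0.0)]
predicted_pose = [(0.0, 1.0, 0.0), (0.5, 1.5, 0.1)]
loss = pose_loss(predicted_pose, target_pose)
```

Only the second point is off, by 0.1 along one axis, so the loss is that offset squared.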
For the training dataset, we had all the movements of the characters from Shadow Fight 3. We took the pose from every frame and got about 115,000 poses. But this set was specific - the character almost always looked along the X axis, and the left leg was always in front at the beginning of the movement. To solve this problem, we artificially expanded the dataset by generating mirrored poses and by randomly rotating each pose in space. This also let us increase the dataset to two million poses. We used 95% of the dataset for training the network and 5% for hyperparameter tuning and testing.
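The augmentation and split described above might look something like the sketch below. Several details are assumptions made for illustration: Y is taken as the vertical axis and Z as the lateral one, rotation is restricted to the vertical axis, and the point names and helper functions are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Pose = Dict[str, Point]


def mirror_pose(pose: Pose) -> Pose:
    """Negate the lateral (Z) coordinate and swap left/right point labels."""
    mirrored = {}
    for name, (x, y, z) in pose.items():
        if name.startswith("left_"):
            name = "right_" + name[len("left_"):]
        elif name.startswith("right_"):
            name = "left_" + name[len("right_"):]
        mirrored[name] = (x, y, -z)
    return mirrored


def rotate_pose(pose: Pose, angle: float) -> Pose:
    """Rotate every point about the vertical (Y) axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return {n: (c * x + s * z, y, -s * x + c * z) for n, (x, y, z) in pose.items()}


def augment(pose: Pose, copies: int, rng: random.Random) -> List[Pose]:
    """Make `copies` variants: each is mirrored half the time, then rotated."""
    variants = []
    for _ in range(copies):
        base = mirror_pose(pose) if rng.random() < 0.5 else pose
        variants.append(rotate_pose(base, rng.uniform(0.0, 2.0 * math.pi)))
    return variants


def split_dataset(poses: List[Pose], train_fraction: float, rng: random.Random):
    """Shuffle a copy of the poses and cut it into training and held-out parts."""
    shuffled = list(poses)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)
example_pose = {
    "pelvis": (0.0, 1.0, 0.0),
    "left_wrist": (0.3, 1.2, 0.2),
    "right_wrist": (0.1, 1.1, -0.25),
}
dataset = augment(example_pose, 100, rng)
train, held_out = split_dataset(dataset, 0.95, rng)
```

Swapping the left/right labels during mirroring matters: without it, a mirrored pose would have its left wrist on the right side of the body.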
We took a fairly simple neural network architecture - a fully connected five-layer network with the activation function and initialization method from Self-Normalizing Neural Networks. No activation is applied on the last layer. With 3 coordinates for each point, we get an input layer of 6 * 3 elements and an output layer of 37 * 3 elements. We searched for the optimal hidden-layer architecture and settled on a five-layer network with 300, 400, 300 and 200 neurons in the hidden layers; however, networks with fewer hidden layers also produced good results.
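To make the layer sizes concrete, here is a minimal pure-Python sketch of a forward pass through such a network. It assumes the SELU activation and the zero-mean Gaussian initialization with variance 1 / fan_in proposed in the Self-Normalizing Neural Networks paper; the zero biases, the function names and the random input are illustrative assumptions, and a real implementation would use a deep learning framework.

```python
import math
import random
from typing import List, Tuple

# 6 input points and 37 output points, 3 coordinates each.
LAYER_SIZES = [6 * 3, 300, 400, 300, 200, 37 * 3]
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

Layer = Tuple[List[List[float]], List[float]]


def selu(x: float) -> float:
    """Scaled exponential linear unit."""
    return SELU_SCALE * (x if x > 0.0 else SELU_ALPHA * (math.exp(x) - 1.0))


def init_network(sizes: List[int], rng: random.Random) -> List[Layer]:
    """Gaussian weights with std sqrt(1 / fan_in); biases start at zero."""
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        std = math.sqrt(1.0 / fan_in)
        weights = [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers: List[Layer], inputs: List[float]) -> List[float]:
    """Apply each layer in turn; the last layer has no activation."""
    values = inputs
    last = len(layers) - 1
    for index, (weights, biases) in enumerate(layers):
        values = [sum(w * v for w, v in zip(row, values)) + b
                  for row, b in zip(weights, biases)]
        if index != last:
            values = [selu(v) for v in values]
    return values


rng = random.Random(0)
network = init_network(LAYER_SIZES, rng)
flat_input = [rng.uniform(-1.0, 1.0) for _ in range(LAYER_SIZES[0])]
output = forward(network, flat_input)  # 111 values: 37 points * 3 coordinates
parameter_count = sum(len(w) * len(w[0]) + len(b) for w, b in network)
```

With these sizes the network has 328,911 trainable parameters, which is small by modern standards and consistent with the observation that shallower variants also worked.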
L2 regularization of the network parameters was also very useful: it made the predictions smoother and more continuous.
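In its usual form, L2 regularization adds the sum of squared weights, scaled by a small coefficient, to the data loss. The sketch below assumes that only weight matrices are penalized and uses a made-up coefficient of 0.01; the actual value and whether biases were included are not stated here.

```python
from typing import List

Matrix = List[List[float]]


def l2_penalty(weight_matrices: List[Matrix], strength: float) -> float:
    """Coefficient times the sum of squares of all weights."""
    squares = sum(w * w for matrix in weight_matrices for row in matrix for w in row)
    return strength * squares


def regularized_loss(data_loss: float, weight_matrices: List[Matrix], strength: float) -> float:
    """Data term plus the L2 penalty on the weights."""
    return data_loss + l2_penalty(weight_matrices, strength)


# Toy example: one 2x2 weight matrix and a hypothetical strength of 0.01.
toy_weights = [[[1.0, -2.0], [0.5, 0.0]]]
total = regularized_loss(0.3, toy_weights, 0.01)
```

Because large weights are penalized, the network is pushed towards mappings that change less abruptly as the input points move.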
A neural network with these parameters predicts the positions of the points with an average error of 3.5 cm. This is a very high error, but the specifics of the problem must be taken into account: for one set of input values there may be many possible output values, so the network ends up learning to output the most probable, averaged prediction. However, when the number of input points was increased to 16, the error halved, which in practice yielded a very accurate prediction of the pose.
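The averaging effect follows directly from the squared-error loss: if the same input is paired with several different target positions, the prediction with the lowest mean squared error is the mean of those targets. A small illustration with two made-up elbow positions for one and the same 6-point input:

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def mean_point(points: List[Point]) -> Point:
    """Coordinate-wise mean of a list of 3D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def mean_squared_error(prediction: Point, targets: List[Point]) -> float:
    """Average squared distance from one prediction to all possible targets."""
    total = 0.0
    for target in targets:
        total += sum((a - b) ** 2 for a, b in zip(prediction, target))
    return total / len(targets)


# Two equally plausible elbow positions (hypothetical values, in metres).
candidates = [(0.30, 1.20, 0.10), (0.30, 1.00, -0.10)]
averaged = mean_point(candidates)
error_averaged = mean_squared_error(averaged, candidates)
error_first = mean_squared_error(candidates[0], candidates)
```

Here `error_averaged` is lower than `error_first`, even though the averaged elbow matches neither plausible pose exactly. Extra input points remove much of this ambiguity, which is consistent with the error dropping when 16 points are given.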
Even so, the neural network could not produce a completely correct pose, one that preserves the lengths of all bones and keeps the joints properly connected. Therefore, we additionally run an optimization process that aligns all the rigid bodies and joints of our physical model.
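The optimizer used in Cascadeur works on a full physical model and is not shown here. As a much simpler stand-in for the same idea, the sketch below restores bone lengths by repeatedly projecting each bone back to its rest length. The function, the point names, the choice of pinned points and all numbers are assumptions for illustration.

```python
import math
from typing import Dict, List, Set, Tuple

Bone = Tuple[str, str, float]  # (point a, point b, rest length)


def enforce_bone_lengths(points: Dict[str, Tuple[float, float, float]],
                         bones: List[Bone],
                         pinned: Set[str],
                         iterations: int = 50) -> Dict[str, List[float]]:
    """Iteratively move unpinned points so that each bone has its rest length."""
    result = {name: list(p) for name, p in points.items()}
    for _ in range(iterations):
        for a, b, rest in bones:
            pa, pb = result[a], result[b]
            delta = [pb[i] - pa[i] for i in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            wa = 0.0 if a in pinned else 1.0
            wb = 0.0 if b in pinned else 1.0
            if dist == 0.0 or wa + wb == 0.0:
                continue
            # Fraction of `delta` that removes the length error, shared by free ends.
            correction = (dist - rest) / dist / (wa + wb)
            for i in range(3):
                pa[i] += wa * correction * delta[i]
                pb[i] -= wb * correction * delta[i]
    return result


def distance(p, q) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy arm: the predicted elbow makes the upper arm too long and the forearm too short.
predicted = {"shoulder": (0.0, 1.5, 0.0), "elbow": (0.25, 1.3, 0.0), "wrist": (0.5, 1.35, 0.0)}
arm_bones = [("shoulder", "elbow", 0.30), ("elbow", "wrist", 0.28)]
fixed = enforce_bone_lengths(predicted, arm_bones, pinned={"shoulder", "wrist"})
```

Pinning the wrist reflects the fact that it is one of the points the user sets; pinning the shoulder is purely a simplification to keep the example small.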
In practice, the results were quite convincing - you can see them in our video. But they also have a specific character, because the training dataset consists of combat animations from a fighting game with weapons. For example, the network seems to assume that the character turns one shoulder towards the enemy, as in a fighting stance, and turns the feet and head accordingly. And when you stretch out his arm, the hand turns not as in a punch, but as in a sword strike.
This led to the logical next step - to train a few more networks with an expanded set of points that specify the orientation of the hands, feet and head, as well as the positions of the knees and elbows. We added 16-point and 28-point schemes. It turned out that the results of these networks can be combined so that the user can set positions for an arbitrary set of points. For example, suppose the user moves the left elbow but does not touch the right one. Then the positions of the right elbow and the right shoulder are predicted with the 6-point scheme, and the position of the left shoulder is predicted with the 16-point scheme.
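One possible way to express the elbow example above in code is to decide, for each point, which scheme supplies its position. The rule below (use the 16-point scheme for a limb as soon as the user has set one of that limb's extra points) is our guess at a reasonable selection rule, and the names `LIMB_OF`, `EXTRA_16_POINTS` and `choose_schemes` are hypothetical.

```python
from typing import Dict, Iterable, Set

# Hypothetical grouping of points into limbs, reduced to the arms for brevity.
LIMB_OF = {
    "left_elbow": "left_arm",
    "left_shoulder": "left_arm",
    "right_elbow": "right_arm",
    "right_shoulder": "right_arm",
}

# Points that are inputs of the 16-point scheme but not of the 6-point one
# (only the elbows are listed in this reduced example).
EXTRA_16_POINTS = {"left_elbow", "right_elbow"}


def choose_schemes(user_set: Set[str], targets: Iterable[str]) -> Dict[str, str]:
    """Map each target point to 'user', '16-point' or '6-point'."""
    active_limbs = {LIMB_OF[p] for p in user_set if p in EXTRA_16_POINTS}
    plan = {}
    for point in targets:
        if point in user_set:
            plan[point] = "user"  # position given directly by the user
        elif LIMB_OF[point] in active_limbs:
            plan[point] = "16-point"
        else:
            plan[point] = "6-point"
    return plan


# The user moved the left elbow and did not touch the right one.
plan = choose_schemes({"left_elbow"}, LIMB_OF.keys())
```

With this input the left shoulder is assigned to the 16-point scheme, while the right elbow and right shoulder stay with the 6-point scheme, matching the example in the text.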
We think this is turning out to be a really interesting tool for working with a character's pose. Its potential has not yet been fully revealed, and we have ideas on how to improve it and apply it to more than just posing. The first version of this tool is already available in the current version of Cascadeur. You can try it if you sign up for the closed beta test on our website, cascadeur.com. We will be glad to hear your opinion and answer questions.
The Banzai Games team is looking for a deep learning researcher. Read more about the vacancy here.