I Trained a 2D Game Animation Generation Model to Create Complex, Cool Game Actions (Fully Open-Source)

Community article, published August 18, 2024

One afternoon about six months ago, a friend came to me with an AI question.

He had tried using OpenAI’s text-to-image model to generate sprites for 2D game animations, but couldn’t get usable results due to character misalignment and consistency issues between frames.

My friend is a veteran in the gaming industry. He said that if AI could be used to generate 2D game animations, it might have significant value for the gaming industry.

I found it interesting and, on a whim, started training a 2D game animation generation model. After some time of exploration, I have some results. I’m now open-sourcing the model, code, data, and data preparation code.

We also investigated the possibilities of commercializing a game animation generation product, which I’ll share later in this post.

Game animations generated by the model:

*Generation samples*

01 Where Does the Satisfying Impact of Game Animations Come From?

Diving into animation generation was fun at first, but model training soon became extremely challenging. I quickly discovered this was a rabbit hole with no bottom in sight.

The natural first idea was to collect a large number of game animation resources and fine-tune the latest open-source video generation models on them.

Game animation resources are not scarce. We quickly gathered tens of thousands of game animations from the internet, mostly action animations from early console and arcade games, plus some public works by designers on art design platforms. We began experimenting with fine-tuning several common open-source models.

As we started training, we discovered that although a game animation is, like ordinary video, a time series of images, it is much more complex in many respects.

First, frames in everyday real-life videos change relatively smoothly, without drastic jumps.

In game animations, however, many actions have very large amplitudes, with abrupt changes between frames: complex turns, body parts occluding one another, and so on. On top of the motion itself, there are often fluttering clothes and light-and-shadow effects.

To highlight the impact and punch of an action, 2D animations often introduce drastic changes between frames. In many cases the frame rate of an action is deliberately reduced; for example, it’s said that Pixar animations run at half the frame rate of normal movies.

Early arcade and console games especially, with their limited resources, had to optimize everything to the extreme, saving frames on every action wherever possible.

It can be said that game animations all have severe frame dropping!

For example, the following fireball animation of Kyo Kusanagi from The King of Fighters has only 4 frames, which is hard to imagine.

The field of animation has its famous 12 principles, many of which achieve a more fluid and satisfying impact through artistic exaggeration that deliberately breaks everyday rules.

*The 12 principles of animation*

For example, according to the “anticipation” and “timing” principles, the wind-up of charging and raising the hands for a fireball is given many frames, while the actual motion of the arm coming down to release the fireball is shown only briefly: in one frame the arm is still behind the back, and in the next it has swung 180 degrees to the front. The impact of the action is expressed more through the swinging of hair and clothes after the motion completes.

So there are big, sudden changes between frames. The motion is very unsmooth, and the temporal dynamics are highly non-linear.

It is precisely this deliberate dropping of frames that gives game animations their satisfying impact.

But this greatly increases the difficulty of video model training. First, the data itself is unsmooth and non-fluid. Moreover, because most video backbone models are pre-trained on smooth everyday videos, there is a large gap between that pre-training data and game animations full of dropped frames and non-linear temporal changes.

Another difficulty comes from how variable game actions are, which makes them hard to express in words. Even for the simplest running action, every game has a different running style and cadence.

*Completely different styles of actions, but all can only be called “running”*

When prompting the model with text, these differences in action style are hard to describe; they can all only be called “running”.

A video model needs to learn how to move an image to express an action, but training samples with too much variation make it hard for the model to infer consistent rules of motion.

For the reasons above, our experiments fine-tuning video models directly on game animations produced very poor results.

02 All You Need Is to Repeat the Same Motion 1000 Times

I don’t think these challenges are unsolvable. Fundamentally, current video models just haven’t been trained on enough data, and in the AI field, as long as you stack enough data, there’s no challenge that can’t be solved.

It’s just that for broke folks like us, without the hardware to train on that much diverse data, we may have to wait for more powerful foundation models to appear.

So we thought of another approach: using 3D-rendered animations to train the 2D animation model.

While looking for higher-quality, higher-frame-rate training animations, we wondered whether we could render 3D game animations into high-quality 2D animations. Since we control the frame rate of the rendering ourselves, this solves the problem of insufficient coherence and smoothness between frames.

Mixamo is a free 3D animation library from Adobe. It provides hundreds of 3D game-character models and over 200 game actions, all free to use.

*Mixamo*

Mixamo data has been used to train many AI models.

We selected the actions we wanted, then wrote scripts to render each of them into 2D animations for all 200 characters.
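To give a feel for what such a script looks like, here is a minimal sketch of batch-rendering one Mixamo FBX with Blender’s Python API. This is not the exact script we open-sourced: the file paths, resolution, and camera and light placement are illustrative assumptions.

```python
# Run headless:  blender --background --python render_action.py
# A minimal sketch, not the exact open-sourced script: file paths,
# resolution, and camera/light placement are illustrative assumptions.
import math
import bpy

FBX_PATH = "assets/swordsman_slash.fbx"   # hypothetical Mixamo export
OUT_DIR = "renders/swordsman_slash/"

# Start from an empty scene so only the imported character is rendered
bpy.ops.wm.read_factory_settings(use_empty=True)
bpy.ops.import_scene.fbx(filepath=FBX_PATH)

scene = bpy.context.scene
scene.render.resolution_x = 512
scene.render.resolution_y = 512
scene.render.fps = 24                          # we choose the frame rate
scene.render.image_settings.file_format = "PNG"
scene.render.film_transparent = True           # transparent sprite background

# Match the scene frame range to the imported Mixamo action
action = bpy.data.actions[0]                   # assumes one action per FBX
scene.frame_start = int(action.frame_range[0])
scene.frame_end = int(action.frame_range[1])

# Mixamo FBX files ship without a camera or lights, so add simple ones
cam = bpy.data.objects.new("Cam", bpy.data.cameras.new("Cam"))
scene.collection.objects.link(cam)
cam.location = (4.0, 0.0, 1.0)                 # side view of the character
cam.rotation_euler = (math.radians(90), 0.0, math.radians(90))
scene.camera = cam

sun = bpy.data.objects.new("Sun", bpy.data.lights.new("Sun", type="SUN"))
scene.collection.objects.link(sun)
sun.rotation_euler = (math.radians(45), 0.0, math.radians(45))

# Render every frame of the action to numbered PNG files
scene.render.filepath = OUT_DIR + "frame_"
bpy.ops.render.render(animation=True)
```

Varying the camera location, light angle, and render settings across runs is what multiplies one action into many distinct training samples.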

Using 3D-rendered animations to train a 2D animation model has many benefits. First, plenty of open-source 3D models and actions are available, and every model can render an animation for every action, so each action is performed at least 200 times by 200 different characters. We can also vary the camera angle, lighting, and rendering style. This is almost equivalent to having thousands of repetitions of training data for the same action, letting the model see a complex action over and over again.

This solves the problem of actions being too diverse and variable. Through constant repetition of the same action, it’s easier for the model to understand the rules of the action we want.

We control the rendering process to make the changes between frames smoother. Moreover, most of Mixamo’s actions come from real human motion capture, so they are not as artistically abstract and frame-starved as the game animations we used before, which further lowers the difficulty for the model.

Using Mixamo rendered animations for training achieved very good results.

Prompt: Game character sword slash animation of an ancient Chinese swordsman, serene and wise, with flowing white robes and long beard, holding an oversized jian sword, side view, solid color background

03 Training Details

1. Selection of Base Pre-trained Models

We experimented with several different popular models for fine-tuning.

Of course, since this is just a side project, the budget was kept within what a single A100 can handle.

AnimateDiff is widely used in the open-source community, including the ComfyUI community. A big advantage is its design separating the temporal and spatial dimensions, which makes it convenient to swap out the spatial (image) part of the model later.

Its training results were not good in our game animation scenario, though. I suspect the spatial complexity of game animations is too high, and AnimateDiff’s simplified spatial design makes its spatial expressiveness too weak.

Latte uses the more advanced DiT (Diffusion Transformer) architecture, but its training results were also not ideal. We suspect the open-source model simply wasn’t trained on enough data, and the data quality wasn’t high enough.

The best results came from fine-tuning the VideoCrafter2 (VC2) and DynamiCrafter (DC) models.

Another thing I like about VC2 is that the paper mentions extra training specifically for style transfer. That capability is important for game animations.

2. Balancing Action (Temporal Dimension) and Image (Spatial Dimension)

An initial problem with training the animation model on 3D renders was that, because the large number of repetitive training samples were all 3D-rendered, the model’s outputs all took on a 3D-rendered style, hurting the spatial (image) part of the model.

We used GPT-4 to randomly generate 300 game-character prompts, then used the original VC2 model to generate animations from them. Adding this data to the training helped the model retain its original image capabilities, and it worked quite well.
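As a sketch of this regularization step (the instruction text and model name below are illustrative assumptions, not our exact setup), generating character prompts with the OpenAI API could look like:

```python
# A minimal sketch of generating regularization prompts with GPT-4.
# The instruction text and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_character_prompts(n: int = 300) -> list[str]:
    prompts = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=1.0,  # high temperature for variety
            messages=[{
                "role": "user",
                "content": (
                    "Invent one 2D game character and describe it in a "
                    "single short sentence suitable as a text-to-video "
                    "prompt. Output only the sentence."
                ),
            }],
        )
        prompts.append(resp.choices[0].message.content.strip())
    return prompts
```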

We also retained some non-3D-rendered game animations. We screened this training data manually, keeping animations whose motion was close to the 3D actions and filtering out anything too difficult or too abstract. In our experiments, this data helped improve the image quality of the model’s outputs.

3. Weapons

Weapons are another big difference between game animations and everyday videos. Many game characters carry or wield all sorts of exotic weapons, which is a challenge in itself.

Many image models still struggle to render a character correctly holding a weapon. We therefore rendered a certain proportion of actions with swords, axes, and other weapons added to each character, to strengthen the model’s ability to depict held weapons.

4. Training Details

We trained on a single A100 for 40~50 epochs.

The prompt for every animation has two parts: the first half describes the action (“Game character XXX animation of…”), and the second half describes the character.
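For example, a complete training caption is assembled like this (the action and character strings are invented for illustration):

```python
# Illustrative assembly of the two-part training caption;
# the action and character strings are invented examples.
action = "sword slash"
character = "an ancient Chinese swordsman with flowing white robes"
prompt = f"Game character {action} animation of {character}"
# -> "Game character sword slash animation of an ancient Chinese
#     swordsman with flowing white robes"
```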

The character descriptions were all generated with GPT-4. Because the Mixamo renders are sharp and depict relatively simple, concrete concepts, GPT-4’s descriptions were fairly accurate. For real game assets, however, especially low-resolution pixel-art images or less common concepts like bull-headed and pig-headed monsters, GPT-4 often failed to produce accurate captions, so we screened those manually.

In addition, because the CLIP text encoder used by VC2 has a 77-token limit, we strictly limited the length of GPT-4’s descriptions, retrying the generation whenever it exceeded 77 tokens until it fit.
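One simple way to enforce that limit (a sketch; the retry-then-truncate logic and tokenizer checkpoint are my assumptions, not necessarily what we shipped) is to count tokens with the CLIP tokenizer from Hugging Face transformers:

```python
# A minimal sketch: regenerate a caption until it fits CLIP's 77-token
# limit. generate_caption stands in for the GPT-4 call and is hypothetical.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
MAX_TOKENS = 77  # includes the start/end special tokens

def fit_caption(generate_caption, max_retries: int = 5) -> str:
    caption = generate_caption()
    for _ in range(max_retries):
        if len(tokenizer(caption).input_ids) <= MAX_TOKENS:
            return caption
        caption = generate_caption()  # too long: ask GPT-4 again
    # Fallback (an assumption): hard-truncate if retries are exhausted
    ids = tokenizer(caption, truncation=True, max_length=MAX_TOKENS).input_ids
    return tokenizer.decode(ids, skip_special_tokens=True)
```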

04 Image to Animation

If text-to-animation might be useful to indie developers, image-to-animation is far more practical for professional game animators.

Actual industrialized game development is art-first: a unique, stylish art style is crucial to a game’s success. Many games therefore start with the character’s art design, which is then handed to animators; once the animation is done, it may still go through repeated reviews with the art team to confirm details.

So if a model could generate animations directly from a character’s original art design, it would preserve the art style while cutting animation production costs. Controllability is the key to AI video models landing in real applications; the art style, for example, needs to be highly controllable.

Text, by contrast, is an extremely weak descriptor, unusable in many practical scenarios.

Image-to-animation is harder than text-to-animation and demands more of the model. It must master both motion and image understanding: it has to reproduce the picture faithfully, and it also has to understand the structure of the input image, such as which part is a leg and which is a hand, in order to move the image correctly.

My method is to add static images of a character’s design draft to the training samples, training the model to generate the static image alongside the animation.

This process effectively teaches the model the visual design of that character.
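As an illustration of this mixing (a sketch only; the dataset layout and the idea of repeating a still into a frozen clip are my assumptions about one way to implement it, not a spec of the released code), static character images can be interleaved with animation clips like this:

```python
# A minimal sketch of mixing static design images into i2v training data.
# The dataset layout and the "frozen clip" trick are my assumptions.
import random
from torch.utils.data import Dataset

class MixedI2VDataset(Dataset):
    """Yields (frames, caption), where frames has shape (T, C, H, W)."""

    def __init__(self, clips, stills, num_frames=16, still_ratio=0.2):
        self.clips = clips            # list of (frames_tensor, caption)
        self.stills = stills          # list of (image_tensor, caption)
        self.num_frames = num_frames
        self.still_ratio = still_ratio

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        if self.stills and random.random() < self.still_ratio:
            # Repeat a static design image into a "frozen" clip so the
            # model also learns the character's appearance.
            image, caption = random.choice(self.stills)
            frames = image.unsqueeze(0).repeat(self.num_frames, 1, 1, 1)
        else:
            frames, caption = self.clips[idx]
        return frames, caption
```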

By mixing static images and game actions for i2v training, we achieved good results:

*Added various Minecraft Steve pictures to the training to help the model grasp the concept of Steve*

*The model fuses images and actions to achieve Image2Animation*

The drawback of this method is that generating i2v animations for a new game character requires training on that character’s image first, so the turnaround time and cost are relatively high.

05 Open Source

The t2v and i2v models and the training data are all open source. You can get the code from the GitHub repo.

I couldn’t find the VC2 t2v training code anywhere, so I wrote the training code myself; it is also open-sourced in the GitHub repo.

The code that renders the Mixamo 3D models into 2D animations with Blender is fully open source as well.

I also created a public Replicate model, which you can use for free.

06 Other Methods

Another approach to i2v (and even t2v) that I think has a lot of potential is the one that is very common in the ComfyUI community: adding OpenPose data as an extra guiding condition, similar to Animate Anyone.

I experimented with this approach and found it very flexible; it requires no extra training to generate new actions.
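For reference, extracting an OpenPose skeleton from a frame looks roughly like the sketch below; controlnet_aux is one common annotator wrapper in the diffusers/ComfyUI ecosystem, not necessarily the exact tool I used:

```python
# A minimal sketch of extracting OpenPose skeletons as guidance maps.
# The file paths are illustrative; controlnet_aux wraps common annotators.
from controlnet_aux import OpenposeDetector
from PIL import Image

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

frame = Image.open("frames/frame_0001.png").convert("RGB")
pose_map = detector(frame)            # returns a rendered skeleton image
pose_map.save("poses/frame_0001.png")
```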

But I ran into a problem: for overly complex actions, such as those involving turning around, the model would still make mistakes.

Because the skeleton map rendered by OpenPose contains no orientation information, that information is sometimes missing from the input.

I also tried follow-up work that adds 3D condition information, Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance, but the body-shape mapping of the input image often had problems.

Another problem is that the accuracy of OpenPose and motion reconstruction is often not ideal. The likely reason is that game animation images are more abstract and differ greatly from the data that pose recognition models are trained on.

Games are also full of special abstract concepts: chibi-style characters with oversized heads, or pixel-art body parts made of just a few pixels that take imagination to read. And complex actions involve heavy occlusion of body parts. All of this poses great challenges for pose extraction and motion tracking.

When the input guiding conditions are not accurate enough, methods like Animate Anyone degrade.

I think one fix is to generate the pose conditions directly from the original 3D model, which guarantees no loss of accuracy.
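Here is a sketch of that idea (my own illustration, not code from the repo; the armature name and output format are assumptions). Since we already render from Blender, the armature’s joints can be projected through the render camera to get exact 2D keypoints:

```python
# A minimal sketch: project armature joints through the render camera,
# yielding exact 2D keypoints instead of running a pose estimator.
# The armature name and output format are assumptions.
import bpy
from bpy_extras.object_utils import world_to_camera_view

scene = bpy.context.scene
cam = scene.camera
armature = bpy.data.objects["Armature"]  # typical name for a Mixamo rig

keypoints_per_frame = []
for frame in range(scene.frame_start, scene.frame_end + 1):
    scene.frame_set(frame)
    keypoints = {}
    for bone in armature.pose.bones:
        world_pos = armature.matrix_world @ bone.head
        # Normalized camera coordinates: x, y in [0, 1], z is depth
        co = world_to_camera_view(scene, cam, world_pos)
        keypoints[bone.name] = (co.x, co.y, co.z)
    keypoints_per_frame.append(keypoints)
```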

In the long term I’m still very optimistic about this direction, because better flexibility and controllability should be more promising than the current approach of training the model on each action.

07 Business Opportunities

We also explored whether there is a market opportunity for AI 2D game animation. We reached out to nearly a hundred 2D animation practitioners via cold DMs for interviews, including Indian outsourcing companies specializing in animation, big studios, and even senior artists working on animated films in Hollywood.

The conclusion first: I believe 2D game animation, or more precisely 2D frame-by-frame animation, is a real problem with huge opportunities, but current AI technology is not mature enough to replace existing solutions.

Of course, I am not a game art practitioner; my interpretation is based only on interviews with the animation practitioners I could reach. I don’t know if you’ll agree, and I welcome comments and discussion.

1. Pain Points in the 2D Animation Industry and Existing Solutions

Frame-by-frame 2D animation is very hard to draw. Of the large number of practitioners we interviewed, very few can hand-draw 2D animation frame by frame. I therefore believe there is still a huge market opportunity in 2D game animation.

Some numbers we gathered: in 2D game projects, the ratio of programmers to artists to animators is about 1:1:1, and it takes about two weeks to hand-draw one minute of animation. Clearly, 2D animation production remains a very large cost.

Although 2D animation production is indeed a hard problem, the game industry already has workarounds. For many hard real-world problems, people find a substitute solution and route around the problem rather than facing it head-on.

We learned that there are currently several solutions in the industry:

A. Don’t make 2D games, only make 3D games, or use 3D to make 2D

The cost of developing high-quality 2D games keeps rising, while the level of industrialization and standardization remains relatively low, so many industrialized studios are making fewer and fewer 2D games.

Alternatively, many games with a 2D perspective, such as side-scrollers, are built with 3D engines.

According to the data I could find, 47% of developers use 3D engines, while 36% use 2D engines.

Compared with 2D-style animations made in 3D engines, I personally still find the hand-drawn 2D animations of many classic games more attractive, with a stronger artistic sense and less of an industrial feel.

B. Use Spine and other 2D skeletal animation engines

Rather than hand-drawing frame by frame, most of the animators we interviewed use Spine skeletal animation to produce 2D animations.

Spine has indeed greatly reduced the difficulty and cost of 2D game animation production. Many small-game and casual-game companies use Spine for all their animation work.

*Spine 2D skeletal animation*

Personally, I find that Spine animations always have a strong puppet-like feel, and complex actions involving turns remain quite challenging to implement.

Another reason for Spine’s popularity is that frame-by-frame animation stores an image for every frame and thus consumes a lot of memory. Mobile games today have ever-higher demands on loading speed, and Spine’s highly optimized resource usage is a big advantage.

Personally, I think this factor could be optimized away with existing technology. For example, H.264 and other video codecs already make compression and decoding very mature; if resource usage were the only problem, it shouldn’t block the adoption of frame-by-frame animation.
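To make that concrete (the frame naming, frame rate, and quality settings below are assumptions), packing rendered frames into an H.264 clip is a one-liner with ffmpeg:

```python
# A minimal sketch: pack per-frame PNGs into one H.264 clip with ffmpeg.
# Frame naming pattern and encoder settings are illustrative.
import subprocess

subprocess.run([
    "ffmpeg",
    "-framerate", "24",              # playback frame rate
    "-i", "renders/frame_%04d.png",  # numbered input frames
    "-c:v", "libx264",               # H.264 encoder
    "-pix_fmt", "yuv420p",           # widest player compatibility
    "-crf", "18",                    # near-lossless quality
    "animation.mp4",
], check=True)
```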

C. Pixel style and indie games

Pixel-style games are hard to implement with Spine because of their special style requirements, so most are still based on frame-by-frame animation.

In addition, many indie developers who pursue a unique art style to carve out a slice of the industrialized game market are happy to use frame-by-frame animation to achieve cooler artistic effects.

2. The Gap Between AI Animation and Industrial Application

Current AI animation models remain far from industrial application in the following respects:

A. High-impact, high-fluidity animations that conform to the 12 principles of animation

Current AI animation is still at the “correct” stage, not yet the “good” stage. And in today’s hyper-competitive game market, especially in the social media era, the way games spread raises the bar for art and animation style even higher.

“Good” is not enough. It needs to be highly stylized, highly unique, and able to go viral.

B. The highly artistic abstract concepts of game art and animation may be a challenge for the model

For the reasons above, many games chase a very unique, abstract concept in order to go viral, and such artistic abstractions probably make up only a small share of the base model’s training data.

Take the Hollow Knight character design below: the character has no nose or mouth, arms that seem both there and not there, and two things on its head that may or may not be horns. The model has to decide what pose this character should take for each action and how each body part should move, which is quite challenging.

*Hollow Knight*

C. Cool and expressive light and shadow effects

Today’s games push light and shadow effects to extremes for the sake of social media spread.

In my experiments, it was not easy for AI models to fully master these exaggerated light-and-shadow effects, match them correctly to the game action, and keep the rhythm accurate.

D. The model’s control over fine details

Many of the industrial studio artists we spoke with have very exacting requirements for every detail of an animation, such as how a sleeve should move when the arm is raised.

And control is still a weakness of today’s AI models. The growing ecosystem of ControlNet and LoRA models gives us many tools, but precise, point-and-hit control is still often out of reach.

That may fall short of what many perfectionist game art directors require.

Despite all these problems, I’m still optimistic about AI game animation in the long run. I believe many of these challenges are essentially data problems, and foundation models are rapidly maturing.

In the long run, as long as there is enough investment and enough data input for training, all of the above challenges should be solvable.

We will continue to follow the latest and coolest AI technologies and continue to share open-source work. Welcome to follow us and stay tuned!