Pyramid Flow is an AI video generation model that produces high-quality clips of up to 10 seconds. It was developed by researchers from Peking University, Beijing University of Posts and Telecommunications, and Kuaishou Technology, the company behind the AI video platform Kling AI.

Details of the Model

Pyramid Flow is built on a new technique called pyramidal flow matching, in which a single model generates video progressively through a series of stages. Most of these stages operate at low resolution, and only the final stage produces the video at full resolution. The proposed pyramidal flow cuts the token count to about a quarter of that used by traditional diffusion models, which makes training more efficient. The model can also compress and optimize generation at the different stages, which lets Pyramid Flow converge faster during training and process more samples per training batch. The concept is described in detail in the paper Pyramidal Flow Matching for Efficient Video Generative Modeling.
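To make the link between resolution and token count concrete, the following is a minimal sketch of the arithmetic only, not the authors' implementation. It assumes a 1280x768 frame, a hypothetical 16x16 patch size, three stages, and a halving of resolution along each axis per stage; none of these values come from the article, and the paper's own accounting of the overall saving may differ.

```python
def tokens_per_frame(height, width, patch=16):
    """Spatial tokens for one frame cut into patch x patch tiles (patch size is assumed)."""
    return (height // patch) * (width // patch)


def pyramid_token_counts(height, width, stages=3, patch=16):
    """Tokens per frame at each stage, coarsest first.

    Assumes each coarser stage halves the resolution along both axes.
    """
    counts = []
    for k in range(stages - 1, -1, -1):
        scale = 2 ** k
        counts.append(tokens_per_frame(height // scale, width // scale, patch))
    return counts


# Illustrative 1280x768 frame (assumed dimensions for a "768p" output).
counts = pyramid_token_counts(768, 1280)
print(counts)  # [240, 960, 3840]
```

Under these assumptions, each stage handles four times fewer tokens per frame than the next finer one, which is why running most of the generation at low resolution is cheaper than working at full resolution throughout.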

Training Data

The model is trained on open-source datasets and produces videos at 768p resolution and 24 frames per second, with lengths between 5 and 10 seconds. The training datasets include LAION-5B, a large dataset for multimodal AI research; CC-12M, a dataset of image-text pairs scraped from the web; SA-1B, which contains high-quality, non-blurry images; and WebVid-10M and OpenVid-1M, two video datasets widely used for text-to-video generation.

The researchers mention that they have curated approximately 10 million single-shot videos in total. However, the openness of these datasets poses challenges, such as issues of copyright infringement and the potential for generating illegal content.

During inference, the model can produce a 5-second, 384p video in just 56 seconds, and it performs on par with or better than other diffusion models. Runway's Gen-3 Alpha Turbo is still faster: it generates videos in under one minute, often in 10 to 20 seconds, setting a high standard for AI video generation speed. Even so, the open-source Pyramid Flow competes with subscription-based models such as Runway's Gen-3 Alpha, Luma's Dream Machine, Kling, and Hailuo.
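The reported figure can be put into perspective with a little arithmetic. The sketch below assumes the 384p clip is rendered at 24 frames per second, the rate the article gives for 768p output; the function name and its parameters are illustrative only.

```python
def generation_stats(clip_seconds, fps, wall_seconds):
    """Summarize generation speed for a clip of known length and frame rate."""
    frames = clip_seconds * fps
    return {
        "frames": frames,
        "frames_per_wall_second": frames / wall_seconds,
        "slowdown_vs_realtime": wall_seconds / clip_seconds,
    }


# 5-second clip, assumed 24 fps, generated in 56 seconds of wall-clock time.
stats = generation_stats(clip_seconds=5, fps=24, wall_seconds=56)
print(stats["frames"])                            # 120
print(round(stats["frames_per_wall_second"], 2))  # 2.14
print(stats["slowdown_vs_realtime"])              # 11.2
```

On that assumption, the model generates roughly two frames per second of compute time, or about eleven seconds of processing for every second of video.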

Pyramid Flow does have some limitations. It lacks some of the advanced fine-tuning capabilities found in models like Runway Gen-3 Alpha, which allow precise control over cinematic elements such as camera angles, keyframes, and human movements.

The model's raw code can be downloaded from Hugging Face and GitHub. The model can also be run in an inference shell, although users must download the model code and run it on their own machines to do so. Pyramid Flow is published under the MIT License, which permits a wide range of uses, including commercial applications, modification, and redistribution, as long as the copyright notice is preserved. All code and model weights will be made available for free through the official project pages.
