We propose a novel plan-then-populate framework centered on Macro-from-Micro Planning (MMPL) for scalable, high-quality long video generation. Experiments on standard benchmarks show that our method outperforms existing video generation models in both visual quality and temporal stability.
Macro-from-Micro Planning (MMPL): a long-video generation paradigm that mitigates temporal drift and color shift while enabling multi-GPU parallelization to generate longer videos.
Overall framework of Macro-from-Micro Planning. Our method operates on two planning levels: (1) Micro Plans, which predict a sequence of future frames within each segment to mitigate local error accumulation, and (2) a Macro Plan, formed as an Autoregressive Chain of Micro Plans, in which the planning frames of the first segment autoregressively generate those of subsequent segments, ensuring long-horizon temporal consistency. A minimal code sketch of this two-level procedure follows below.
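To make the plan-then-populate procedure concrete, here is a minimal sketch in Python. It is an illustration of the paradigm only, not the released implementation; `model.plan_micro` and `model.generate_segment` are hypothetical interfaces standing in for the planning and population networks.

```python
# Minimal plan-then-populate sketch (illustration only; `plan_micro` and
# `generate_segment` are hypothetical interfaces, not the released code).
import torch

def mmpl_generate(model, prompt_emb, num_segments, frames_per_segment, plan_size):
    # 1) Macro Plan: an autoregressive chain of Micro Plans, where each
    #    segment's planning frames condition the next segment's plan.
    micro_plans, prev_plan = [], None
    for _ in range(num_segments):
        plan = model.plan_micro(prompt_emb, context=prev_plan, n_frames=plan_size)
        micro_plans.append(plan)
        prev_plan = plan

    # 2) Populate: fill in each segment from its own Micro Plan. Since all
    #    plans already exist, this loop has no cross-segment dependency and
    #    can run in parallel (see the multi-GPU sketch below).
    segments = [
        model.generate_segment(prompt_emb, plan=p, n_frames=frames_per_segment)
        for p in micro_plans
    ]
    return torch.cat(segments, dim=0)  # (total_frames, C, H, W)
```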
Adaptive Multi-GPU Workload Scheduling for Balanced Execution and Fast Autoregressive Video Generation.
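Because the Macro Plan removes cross-segment dependencies, segment population can be distributed across devices. The sketch below uses a simple round-robin assignment to illustrate the idea; the adaptive scheduler in the figure above presumably weighs estimated per-segment workloads instead, and `generate_segment` is again a hypothetical per-segment decoder.

```python
# Round-robin multi-GPU population sketch (assumed uniform per-segment cost;
# an adaptive scheduler would balance estimated workloads instead).
from concurrent.futures import ThreadPoolExecutor
import torch

def populate_parallel(micro_plans, generate_segment, num_gpus):
    def worker(idx, plan):
        device = torch.device(f"cuda:{idx % num_gpus}")  # segment i -> GPU i % N
        with torch.no_grad():
            return idx, generate_segment(plan.to(device))

    # One thread per in-flight segment; GPU kernels overlap across devices.
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [pool.submit(worker, i, p) for i, p in enumerate(micro_plans)]
        results = dict(f.result() for f in futures)

    # Reassemble the video in temporal order.
    return torch.cat([results[i].cpu() for i in sorted(results)], dim=0)
```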
Our model generates high-quality 480P videos and supports streaming generation for extended durations. Below, we present 20-second videos (top), extended 30-second videos (middle), and 1-minute videos (bottom), all produced by our model without noticeable drift or color shift across time.
Our method delivers substantially superior performance on 30-second long video generation, surpassing MAGI, SkyReels, CausVid, and Self-Forcing in both visual quality and temporal consistency. It robustly mitigates frame drift and flickering while effectively addressing over-saturation and color imbalance, yielding more stable and photorealistic outputs.
Our framework is not restricted to the text-to-video (T2V) task; it can be seamlessly extended to image-to-video (I2V) generation without introducing any architectural modifications or additional image encoders. This flexibility derives from the unified autoregressive design, which only requires lightweight adjustments to the number and ordering of autoregressive steps.
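Under the unified autoregressive design, I2V can be sketched as seeding the first planning frame with the encoded input image and predicting one fewer autoregressive step. The `vae` encoder and `plan_micro` interface below are assumed names for illustration, not the actual API.

```python
# Hedged I2V sketch: insert the encoded input image as the first planning
# frame and predict one fewer autoregressive step. `vae` and `plan_micro`
# are assumed names for illustration, not the actual API.
import torch

def i2v_plan(model, vae, image, prompt_emb, plan_size):
    first = vae.encode(image).unsqueeze(0)   # latent of the given image, (1, C, H, W)
    rest = model.plan_micro(
        prompt_emb,
        context=first,                       # condition the chain on the image
        n_frames=plan_size - 1,              # one step is already fixed
    )
    return torch.cat([first, rest], dim=0)   # same Micro Plan shape as T2V
```

Text-to-video is recovered by dropping the image condition (`context=None`) and predicting all `plan_size` frames, which is why no architectural change or extra image encoder is needed.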
Our approach can be seamlessly integrated with self-forcing strategies without any architectural modifications. Specifically, it only requires adjusting the attention visibility range and the prediction order during both training and inference.
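One plausible reading of "adjusting the attention visibility range" is a block-causal mask in which planning-frame tokens stay globally visible while all other tokens attend causally within a local window. The sketch below is an assumption about the mask's structure, not the authors' exact formulation.

```python
# Assumed block-causal visibility mask: planning-frame tokens are globally
# visible; other tokens attend causally within a local window.
import torch

def visibility_mask(num_tokens, plan_token_ids, window):
    mask = torch.zeros(num_tokens, num_tokens, dtype=torch.bool)
    for q in range(num_tokens):
        lo = max(0, q - window + 1)
        mask[q, lo:q + 1] = True          # causal local window
        mask[q, plan_token_ids] = True    # planning frames are always visible
    return mask  # True = may attend; pass as an attention mask
```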
Although MMPL mitigates error accumulation in long video generation, the substantial temporal span of long videos means that a single text prompt often aligns only with the early content and fails to capture the full video semantics. As generation progresses, the limitations of a static prompt lead to repetitive or even collapsed content in later segments. The examples below illustrate content repetition and quality degradation caused by using a single static prompt.
@article{xiang2025macro,
  title={Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation},
  author={Xiang, Xunzhi and Chen, Yabo and Zhang, Guiyu and Wang, Zhongyu and Gao, Zhe and Xiang, Quanming and Shang, Gonghu and Liu, Junqi and Huang, Haibin and Gao, Yang and others},
  journal={arXiv preprint arXiv:2508.03334},
  year={2025}
}