LongCat-Video: Meituan's Open-Source 13.6B Video Generation Model — Long Video Is the Real Battleground
- Smars
- Open Source, Video Generation, AI Models
- 31 May, 2026
Why Long Video Generation Is Hard
You’ve probably tried several video generation tools by now — Kling, Runway, Pika, Wan2.1. Five-second clips all look decent. But try generating 30 seconds or even a minute of video, and the problems start.
Colors begin drifting after a few seconds — that blue sky gradually shifts to green. Facial details on people dissolve frame by frame. Objects in motion suddenly jump, breaking physical consistency. Most commonly, quality drops off a cliff in the later seconds, as if a different model took over.
This isn’t a bug in any one model. It’s a structural problem with the current video generation paradigm.
The mainstream approach is “concatenative long video”: generate the first 5-second clip, use the last few frames as conditioning input for the next clip, and repeat. The problem is that each clip is independently inferred. The model was never trained to “continue” video, so information breaks at the seams. It’s like having ten people write consecutive chapters of a novel without reading each other’s work — the joins will inevitably crack.
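To make the seam problem concrete, here is a toy sketch of the concatenative loop. Everything in it is hypothetical: `generate_clip` is a stand-in for a real model, a "frame" is reduced to a single number (think of it as the average sky hue), and the fixed `per_clip_error` is an assumed, deliberately simplified error model. The point is only the control flow: each clip sees nothing but the tail of the previous one, so whatever error a clip introduces is inherited by every clip after it.

```python
def generate_clip(cond_frames, num_frames, per_clip_error):
    """Hypothetical stand-in for a video model's clip generator.

    A "frame" is a single float. The stub continues from its last
    conditioning frame and adds a small error spread over the clip.
    """
    start = cond_frames[-1] if cond_frames else 0.0
    return [start + per_clip_error * (i + 1) / num_frames
            for i in range(num_frames)]


def concatenative_long_video(num_clips, frames_per_clip=5, overlap=2,
                             per_clip_error=0.1):
    """Chain clips: the tail of each clip conditions the next one."""
    video, cond = [], []
    for _ in range(num_clips):
        clip = generate_clip(cond, frames_per_clip, per_clip_error)
        video.extend(clip)
        # Only the last few frames cross the seam; earlier context is lost.
        cond = clip[-overlap:]
    return video


video = concatenative_long_video(num_clips=6)
print(f"{len(video)} frames, drift at the end: {video[-1]:.2f}")
```

In this toy, drift grows linearly with the number of clips because no step ever corrects against the original scene. Real models fail in messier ways, but the structure of the loop is the same.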
LongCat-Video’s core idea: make Video-Continuation a native pretraining task, so the model learns how to continue video during training, rather than hacking it together at inference time.
Project Overview
LongCat-Video is an open-source foundational video generation model from Meituan’s LongCat team — 13.6B parameters, built on the Diffusion Transformer (DiT) architecture, MIT licensed.
It unifies three tasks in a single model: Text-to-Video, Image-to-Video, and Video-Continuation. Video-Continuation is a native pretraining task, not an inference-time workaround.
All model weights are open-sourced, including the base model and two Avatar variants for audio-driven human video generation.
Core Technical Analysis
Unified Architecture: One Model, Three Tasks
The most notable design choice in LongCat-Video is task unification.
Many teams train separate models — one for T2V, one for I2V, one for continuation. At inference, you load different weights depending on the task. This is engineering-simple, but each model’s training data only covers a single task distribution, capping model capability.
LongCat-Video unifies all three tasks within a single DiT framework. The input conditions differentiate task types: text-only for T2V, text + first-frame image for I2V, text + preceding video for Video-Continuation. The model shares all parameters across tasks, and training signals from different tasks reinforce each other.
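One way to picture this is as a single entry point whose task is implied by how much visual conditioning it receives. The sketch below is an illustration only: `GenerationRequest` and `infer_task` are hypothetical names, not part of the LongCat-Video codebase, and routing purely by the number of conditioning frames is an assumption made for clarity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GenerationRequest:
    """Hypothetical request: a prompt plus zero or more conditioning frames."""
    prompt: str
    cond_frames: Sequence[object] = ()


def infer_task(request: GenerationRequest) -> str:
    """Derive the task from the conditioning (illustrative assumption)."""
    num_cond = len(request.cond_frames)
    if num_cond == 0:
        return "text_to_video"        # text only
    if num_cond == 1:
        return "image_to_video"       # text + first-frame image
    return "video_continuation"       # text + preceding video


print(infer_task(GenerationRequest("a cat surfing")))
print(infer_task(GenerationRequest("a cat surfing", cond_frames=["frame0"])))
print(infer_task(GenerationRequest("a cat surfing", cond_frames=["f0", "f1", "f2"])))
```

Because the same parameters serve all three branches, nothing needs to be swapped at inference time; only the conditioning changes.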
The benefit: the model sees more diverse data distributions. T2V training teaches “text description to visual scene mapping.” I2V teaches “first-frame consistency.” Video-Continuation teaches “temporal and style continuity from prior context.” Shared parameters allow knowledge transfer across tasks.
The cost is more complex training: you need to balance loss weights across tasks to prevent one from dominating gradients. Judging by the evaluation results, multi-task training hasn’t significantly compromised T2V performance, although I2V motion quality does show a gap.
Coarse-to-Fine Generation Strategy
A direct challenge in long video generation is compute. A 1-minute video at 720p, 30fps contains 1800 frames. Denoising frame-by-frame is prohibitively expensive.
LongCat-Video employs a “spatiotemporal coarse-to-fine” strategy:
- Temporal axis: First generate a video skeleton at a low framerate (e.g., only a few keyframes per second), then interpolate to fill in the intermediate frames
- Spatial axis: First generate a low-resolution version, then upsample to 720p
The core insight: video information density is uneven. Keyframes carry the core motion and semantics; intermediate frames are mostly smooth transitions. Low-resolution versions carry scene structure and layout; details can be added later. Getting the big structure right first, then adding details level by level, is more efficient than full-resolution generation in one pass.
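A rough back-of-the-envelope calculation shows why this helps. The sketch below counts "pixel-frames" (width × height × frames) as a crude proxy for how much content the expensive first stage must produce. The coarse settings used here (854×480 at 15 fps) are an assumption for illustration; the article does not state the exact coarse resolution or framerate.

```python
def pixel_frames(width, height, fps, seconds):
    """Total number of pixels across all frames of a clip."""
    return width * height * fps * seconds


SECONDS = 60
full = pixel_frames(1280, 720, 30, SECONDS)    # 720p, 30 fps target
coarse = pixel_frames(854, 480, 15, SECONDS)   # assumed coarse stage

print(f"target frames: {30 * SECONDS}")
print(f"coarse stage handles 1/{full / coarse:.1f} of the target pixel-frames")
```

Since attention cost grows faster than linearly with the number of tokens, the real savings from doing the structural work at the coarse level are likely larger than this linear ratio suggests.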
Combined with Block Sparse Attention, the model skips unimportant attention blocks at high resolution, further reducing computation. In practice, 720p 30fps video generates in minutes.
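The general idea behind block sparse attention can be sketched in a few lines of plain Python. This is a generic toy, not LongCat-Video's kernel: scoring blocks by mean-pooled queries and keys and keeping the top `keep` key blocks per query block is an assumed selection rule, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors, block_size):
    """Average consecutive vectors into one representative per block."""
    pooled = []
    for start in range(0, len(vectors), block_size):
        chunk = vectors[start:start + block_size]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled


def select_key_blocks(queries, keys, block_size, keep):
    """For each query block, pick the `keep` highest-scoring key blocks."""
    k_blocks = mean_pool(keys, block_size)
    selected = []
    for q_block in mean_pool(queries, block_size):
        scores = [dot(q_block, k_block) for k_block in k_blocks]
        top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(top[:keep]))
    return selected


def block_sparse_attention(queries, keys, values, block_size, keep):
    """Softmax attention restricted to the selected key blocks."""
    selected = select_key_blocks(queries, keys, block_size, keep)
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        allowed = [j for b in selected[i // block_size]
                   for j in range(b * block_size,
                                  min((b + 1) * block_size, len(keys)))]
        logits = [dot(q, keys[j]) * scale for j in allowed]
        peak = max(logits)                      # for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        outputs.append([sum(w * values[j][d] for w, j in zip(weights, allowed)) / total
                        for d in range(len(values[0]))])
    return outputs


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
print(select_key_blocks(tokens, tokens, block_size=2, keep=1))
print(block_sparse_attention(tokens, tokens, values, block_size=2, keep=1))
```

With `keep` equal to the number of key blocks this reduces to ordinary dense attention; shrinking `keep` trades a little accuracy for skipping whole blocks of score computation.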
Multi-Reward RLHF: GRPO
Video generation models trained with only pretraining and SFT tend to produce inconsistent quality — misaligned text understanding, unnatural motion, rough visual fidelity.
LongCat-Video uses Multi-reward GRPO (Group Relative Policy Optimization) for reinforcement learning alignment. The core idea is simultaneously optimizing multiple reward signals: text alignment, visual quality, motion quality, etc. GRPO’s advantage over traditional PPO is that it doesn’t require a separate value network, reducing training instability.
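The "group relative" part is easy to show in code. In the sketch below, several videos are sampled for the same prompt, each is scored by multiple reward signals, and each sample's advantage is its combined reward normalized against the group's mean and standard deviation. The group statistics play the role a learned value network would play in PPO. Combining rewards with a plain weighted sum is an assumption here; the article does not describe how LongCat-Video aggregates its reward signals, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def combine_rewards(sample_rewards, weights):
    """Weighted sum of per-signal rewards (assumed aggregation rule)."""
    return sum(weights[name] * value for name, value in sample_rewards.items())


def group_relative_advantages(group_rewards, weights, eps=1e-8):
    """Normalize each sample's total reward against its own group."""
    totals = [combine_rewards(r, weights) for r in group_rewards]
    mu, sigma = mean(totals), pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]


weights = {"text_alignment": 1.0, "visual_quality": 1.0, "motion_quality": 1.0}
group = [  # made-up scores for three samples of the same prompt
    {"text_alignment": 3.0, "visual_quality": 3.0, "motion_quality": 3.0},
    {"text_alignment": 4.0, "visual_quality": 3.5, "motion_quality": 3.5},
    {"text_alignment": 2.5, "visual_quality": 3.0, "motion_quality": 2.0},
]
advantages = group_relative_advantages(group, weights)
print([round(a, 2) for a in advantages])
```

Samples that beat their siblings get a positive advantage and are reinforced; the rest are pushed down. The scores above are illustrative, not measured values.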
From the evaluation data, RLHF brings substantial improvements. LongCat-Video achieves performance comparable to closed-source solutions on both internal and public benchmarks. In the T2V evaluation, the 13.6B dense model’s overall quality (3.38) is close to that of the 28B MoE model Wan 2.2-A14B (3.35).
Video-Continuation: Native Pretraining for Long Video
This is LongCat-Video’s biggest differentiator from other open-source video models.
The problem with concatenative long video is that each segment is independently inferred, lacking global temporal consistency. LongCat-Video makes Video-Continuation a pretraining task — the model learns to “generate subsequent content given preceding video” during training. This means:
- The model natively understands temporal correlation between segments, no inference-time patching needed
- Color consistency and style consistency are learned during training rather than patched in through post-processing
- Long video generation doesn’t suffer quality degradation — later segments match the quality of the first
The model supports generating minute-length videos without color drift or quality degradation.
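As a mental model, a continuation training example might be built roughly as follows: the first frames of a clip are kept clean as conditioning, the remaining frames are noised, and the loss is computed only on the noised part. This is a simplified sketch under stated assumptions, not the actual LongCat-Video training recipe, which the article does not describe. Frames are reduced to single floats, and `make_continuation_example` is a hypothetical helper.

```python
import random


def make_continuation_example(frames, num_cond, noise_scale, rng):
    """Split a clip into clean conditioning frames and noised target frames.

    Returns (model_input, target, loss_mask), where loss_mask is 1 only on
    positions the model is asked to denoise.
    """
    if not 0 < num_cond < len(frames):
        raise ValueError("num_cond must leave at least one target frame")
    cond = list(frames[:num_cond])       # preceding video, left untouched
    target = list(frames[num_cond:])     # content the model must produce
    noisy_target = [x + noise_scale * rng.gauss(0.0, 1.0) for x in target]
    model_input = cond + noisy_target
    loss_mask = [0] * num_cond + [1] * len(target)
    return model_input, target, loss_mask


rng = random.Random(0)
clip = [0.1 * t for t in range(8)]       # toy "video" of 8 frames
model_input, target, loss_mask = make_continuation_example(
    clip, num_cond=3, noise_scale=0.5, rng=rng)
print(loss_mask)
```

The contrast with the concatenative approach is that here the model is optimized, during training, to make its output agree with the clean prefix, rather than meeting that prefix for the first time at inference.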
LongCat-Video-Avatar: Audio-Driven Digital Humans
Beyond the base video generation model, the project also open-sources two Avatar variants for audio-driven human video generation.
Avatar 1.0 uses a Wav2Vec2 audio encoder, supporting single-person and multi-person audio input. It handles Audio-Text-to-Video, Audio-Image-to-Video, and continuation-based long video generation.
Avatar 1.5 is the latest upgrade with these key improvements:
- Audio encoder upgraded from Wav2Vec2 to Whisper-large-v3 for significantly better lip synchronization
- Step distillation compresses inference to 8 steps for faster generation
- INT8 quantization support reduces VRAM usage
- Better generalization — supports stylized domains including anime, animals, and complex real-world conditions
- Supports both single-stream and multi-stream audio inputs
Evaluation Results
Text-to-Video
| Metric | Veo3 | PixVerse-V5 | Wan 2.2-T2V-A14B | LongCat-Video |
|---|---|---|---|---|
| Accessibility | Closed | Closed | Open Source | Open Source |
| Architecture | - | - | MoE | Dense |
| Total Params | - | - | 28B | 13.6B |
| Activated Params | - | - | 14B | 13.6B |
| Text-Alignment | 3.99 | 3.81 | 3.70 | 3.76 |
| Visual Quality | 3.23 | 3.13 | 3.26 | 3.25 |
| Motion Quality | 3.86 | 3.81 | 3.78 | 3.74 |
| Overall Quality | 3.48 | 3.36 | 3.35 | 3.38 |
The 13.6B dense model achieves an overall quality of 3.38, slightly ahead of the 28B MoE model’s 3.35, which exceeds expectations for this parameter count. Visual quality at 3.25 is essentially on par with Wan 2.2 (3.26), while motion quality (3.74) trails it slightly (3.78).
Image-to-Video
| Metric | Seedance 1.0 | Hailuo-02 | Wan 2.2-I2V-A14B | LongCat-Video |
|---|---|---|---|---|
| Image-Alignment | 4.12 | 4.18 | 4.18 | 4.04 |
| Text-Alignment | 3.70 | 3.85 | 3.33 | 3.49 |
| Visual Quality | 3.22 | 3.18 | 3.23 | 3.27 |
| Motion Quality | 3.77 | 3.80 | 3.79 | 3.59 |
| Overall Quality | 3.35 | 3.27 | 3.26 | 3.17 |
On I2V, LongCat-Video leads in visual quality (3.27, the highest) and beats Wan 2.2 on text alignment (3.49 vs 3.33), but it trails the top solutions in motion quality and image alignment and has the lowest overall quality of the four (3.17). This is plausibly the trade-off of multi-task unification: resources allocated to T2V and continuation capabilities may come at some cost to I2V motion detail.
Quick Start
Installation
git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video
conda create -n longcat-video python=3.10
conda activate longcat-video
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install ninja psutil packaging flash_attn==2.7.4.post1
pip install -r requirements.txt
For Avatar support, additionally install:
conda install -c conda-forge librosa ffmpeg
pip install -r requirements_avatar.txt
Download Model Weights
pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
Text-to-Video
Single GPU:
torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
Multi-GPU:
torchrun --nproc_per_node=2 run_demo_text_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
Long Video Generation
torchrun run_demo_long_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
Interactive Video Generation
torchrun run_demo_interactive_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile
Avatar 1.5 Audio-Driven Generation
torchrun --nproc_per_node=2 run_demo_avatar_single_audio_to_video.py \
--context_parallel_size=2 \
--checkpoint_dir=./weights/LongCat-Video-Avatar-1.5 \
--stage_1=ai2v \
--input_json=assets/avatar/single_example_1.json \
--use_distill --model_type avatar-v1.5 --use_int8
Avatar 1.5 requires the --use_distill flag for step distillation. INT8 quantization is optional and reduces VRAM usage.
Web UI
streamlit run ./run_streamlit.py --server.fileWatcherType none --server.headless=false
When to Use It
- Long-form video production: Ad clips, product demos, short films — anything needing more than 10 seconds of temporally consistent video. Native continuation is LongCat-Video’s core advantage
- Digital human videos: The Avatar series supports audio-driven generation, suitable for virtual presenters, educational videos, customer service dialogues
- Video continuation and extension: When you have existing footage that needs to be extended in duration, LongCat-Video can seamlessly continue it
- Interactive video generation: Supports real-time interactive generation workflows, suitable for creative exploration and prototyping
Limitations and Caveats
- I2V motion quality gap: From evaluation data, I2V motion quality (3.59) is notably lower than top competitors (Hailuo-02 at 3.80). If your primary use case is image-to-video with strict motion naturalness requirements, benchmark it yourself first
- High VRAM requirements: The 13.6B-parameter model requires a multi-GPU setup or large-VRAM GPUs for inference. INT8 quantization is only supported on Avatar 1.5, not the base model
- Avatar 1.5 Audio CFG tuning: Lip sync quality is sensitive to the Audio CFG parameter. Recommended range is 3-5; you’ll need to tune per audio clip
- Avatar repeated action issues: Mitigable by adjusting --ref_img_index (0-24 for consistency, 30 to reduce repetition) and --mask_frame_range (larger reduces repetition but may introduce artifacts)
- Training code not open-sourced: Only inference code and weights are available; the training pipeline is not public, so fine-tuning isn’t possible yet
- Early-stage community: The project was open-sourced in October 2025; community contributions and ecosystem are still developing
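To put the VRAM caveat in numbers, the weights alone can be estimated from the parameter count. The sketch below assumes 2 bytes per parameter for BF16/FP16 storage and 1 byte for INT8; these are standard sizes, but the precision the released checkpoints actually use is an assumption here. It also ignores activations, the text encoder, and the VAE, so real usage will be higher.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS = 13.6e9  # parameter count of the base model, from the article

half_precision_gb = weight_memory_gb(PARAMS, 2)  # BF16 / FP16
int8_gb = weight_memory_gb(PARAMS, 1)            # INT8, for a model of this size

print(f"half precision: ~{half_precision_gb:.1f} GB")
print(f"INT8:           ~{int8_gb:.1f} GB")
```

At half precision the weights alone exceed the memory of a 24 GB consumer GPU, which is consistent with the multi-GPU commands shown in the Quick Start.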
Comparison with Alternatives
| Dimension | LongCat-Video | Wan 2.2 | CogVideoX |
|---|---|---|---|
| Parameters | 13.6B Dense | 28B MoE (14B activated) | 5B |
| Native long video | Yes | No | No |
| Unified T2V/I2V/Continuation | Yes | No | No |
| Avatar digital human | Yes (1.0 + 1.5) | No | No |
| License | MIT | Apache 2.0 | Apache 2.0 |
| T2V overall quality | 3.38 | 3.35 | - |
LongCat-Video’s differentiator isn’t crushing any single metric. It’s long video and task unification. If you need 5-second high-quality clips, Wan 2.2 may be more straightforward. If your use case demands minute-length video, continuation capability, or audio-driven digital humans, LongCat-Video is currently the most pragmatic open-source choice.
Conclusion
The next battleground for video generation models isn’t 5-second clip quality — that problem is largely solved. The real challenge is long video: how to make AI generate minutes of temporally consistent, quality-stable video.
LongCat-Video offers an answer through native Video-Continuation pretraining. It’s not the best model at any single task, but it’s the only open-source framework that treats long video as a first-class citizen.
If your video generation needs go beyond 5 seconds, it’s worth trying.