This page is currently still a WIP

Motivation

This is a reimplementation of the ByteDance & UCLA paper “LoL: Longer than Longer, Scaling Video Generation to Hour” [1]. The paper analyses the issue of sink collapse in recent autoregressive video generation methods such as LongLive [2] or RIFLEx [3]. To generate long videos, these methods apply sliding-window attention over the past frames; applied naively, however, this loses the context of the first video frames. They therefore keep so-called “sink frames” (typically 3) as attention sinks, which preserves the context of the original frames and stabilizes generation over multi-minute videos.
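As a rough illustration of the sink-plus-sliding-window idea described above, the sketch below lists which past frames stay visible when a new frame is generated. The function name `attended_frames` and the window size of 9 are hypothetical choices for this example, not values taken from LongLive, RIFLEx, or the LoL paper; only the default of 3 sink frames follows the text.

```python
def attended_frames(current: int, num_sink: int = 3, window: int = 9) -> list[int]:
    """Indices of past frames visible when generating frame `current`.

    Keeps the first `num_sink` frames (the sinks) plus the most recent
    `window` frames. The window size is an illustrative assumption.
    """
    # Sink frames: the first frames of the video, never evicted.
    sinks = list(range(min(num_sink, current)))
    # Sliding window: the most recent frames, excluding the sinks themselves.
    recent_start = max(num_sink, current - window)
    recent = list(range(recent_start, current))
    return sinks + recent


if __name__ == "__main__":
    for t in (2, 5, 100):
        print(t, attended_frames(t))
```

Without the `sinks` part, frame 100 would only see frames 91 to 99, so all information about the opening frames would be gone; with it, the context size stays bounded at `num_sink + window` frames.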

Method

The paper first diagnoses “sink collapse”, relating it to the frequencies of the RoPE [4] components, and then proposes RoPE jitter to mitigate it.

Analysis of Sink Collapse

The authors find that while methods using attention sinks can keep long-range context to the original video frames, they introduce a new failure mode called sink collapse: during video rollout, the periodic structure of RoPE leads to scene resets, cyclic or looping motion, and rewinds to the original anchor frames.
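To make the "periodic structure of RoPE" concrete, here is a minimal sketch of the standard 1D RoPE frequencies from RoFormer [4] and the offset after which each component's rotation repeats. It treats only a single temporal axis, uses a toy dimension of 8 and the common base of 10000 as assumptions, and is not the paper's own diagnostic; the helper names are hypothetical.

```python
import math


def rope_frequencies(dim: int, base: float = 10000.0) -> list[float]:
    """Angular frequency theta_i = base^(-2i/dim) for each rotated pair."""
    return [base ** (-(2 * i) / dim) for i in range(dim // 2)]


def rotation(offset: float, freq: float) -> tuple[float, float]:
    """(cos, sin) of the rotation a component applies at a relative offset."""
    angle = offset * freq
    return math.cos(angle), math.sin(angle)


def periods(freqs: list[float]) -> list[float]:
    """Relative offset after which each component returns to the same phase."""
    return [2 * math.pi / f for f in freqs]


if __name__ == "__main__":
    freqs = rope_frequencies(dim=8)
    for f, p in zip(freqs, periods(freqs)):
        print(f"freq={f:.4f}  period={p:.1f} positions")
```

Each component is exactly periodic in the relative offset between a query and a key, so once the distance to a fixed sink frame grows past a component's period, that component can no longer distinguish the current offset from a much earlier one.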

Proposed RoPE Jitter

Results

While RoPE jitter reduces sink collapse, as seen in the plots above, it leads to slightly worse prompt alignment: over our set of 10 generated videos per method, each 2 minutes long, the CLIP score drops from 30.80 to 30.63.

References

  1. Cui et al., LoL: Longer than Longer, Scaling Video Generation to Hour, arXiv 2601.16914 (2026).
  2. Yang et al., LongLive: Real-time Interactive Long Video Generation, arXiv 2509.22622 (2025).
  3. Zhao et al., RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers, arXiv 2502.15894 (2025).
  4. Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv 2104.09864 (2021).