This page is still a work in progress.
This is a reimplementation of the ByteDance & UCLA paper “LoL: Longer than Longer, Scaling Video Generation to Hour”. The paper analyses the issue of sink collapse in recent autoregressive video generation methods such as LongLive [2] or RIFLex [3]. To generate multi-minute videos, these methods apply sliding window attention over the past frames. However, a naive application of this loses the context of the first video frames. Therefore, so-called “sink frames” (typically 3), also known as attention sinks, are kept in the attention context to avoid losing the original frames, which stabilizes the context over videos that are multiple minutes long.
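The combination of sink frames and a sliding window can be sketched as a simple index computation. This is a minimal illustration, not the code of any of the cited methods: the function name `visible_frames` and the parameters `window` and `num_sink` are hypothetical, and real implementations work on token-level attention masks rather than on frame indices.

```python
def visible_frames(t, window, num_sink=3):
    """Return the indices of the frames that frame t may attend to.

    Assumption for illustration: the context is the first `num_sink`
    frames (the sink frames) plus the last `window` frames up to and
    including t.
    """
    # Sink frames: always the first frames of the video.
    sinks = range(min(num_sink, t + 1))
    # Sliding window: the most recent `window` frames, including t itself.
    recent = range(max(0, t - window + 1), t + 1)
    return sorted(set(sinks) | set(recent))


print(visible_frames(10, window=4))  # [0, 1, 2, 7, 8, 9, 10]
```

Without the sink frames, frame 10 would only see frames 7 to 10, and anything established in the opening frames would be out of reach.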
The paper first diagnoses and analyses “sink collapse”, relating it to the frequencies of the RoPE [4] components, and then proposes RoPE jitter to mitigate the issue.
The authors find that while methods using attention sinks can keep long-range context to the original video frames, they introduce a new failure mode called sink collapse: during video rollout, the periodic structure of RoPE leads to scene resets, cyclic or looping motion, and rewinds to the original anchor frames.
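The periodicity in question can be made concrete with a few lines of NumPy. The sketch below uses the standard RoPE frequency formula to show that each channel pair returns to the same rotation angle after a fixed number of positions. The function `jittered_frequencies` is only a guess at what a jitter could look like: it assumes a small random perturbation of the RoPE base per attention head, and the names `scale` and `seed` are hypothetical. The paper's exact formulation may differ.

```python
import numpy as np


def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies, one per pair of channels."""
    return base ** (-np.arange(0, dim, 2) / dim)


def rope_phase(position, freqs):
    """Rotation angle of each channel pair at `position`, wrapped to [0, 2*pi)."""
    return np.mod(position * freqs, 2 * np.pi)


def jittered_frequencies(dim, num_heads, scale=0.1, base=10000.0, seed=0):
    """Illustrative assumption: perturb the RoPE base independently per head."""
    rng = np.random.default_rng(seed)
    factors = 1.0 + scale * rng.uniform(-1.0, 1.0, size=num_heads)
    # Result has shape (num_heads, dim // 2).
    return np.stack([rope_frequencies(dim, base * f) for f in factors])


freqs = rope_frequencies(8)
periods = 2 * np.pi / freqs  # positions after which each channel pair repeats
print(periods)
print(jittered_frequencies(8, num_heads=4).shape)
```

With a shared set of frequencies, every head sees the same phase relationship between the sink frames and the current frame. Giving each head slightly different frequencies means their periods no longer line up at the same positions, which is the intuition behind the jitter in this sketch.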
While RoPE jitter reduces sink collapse, as seen in the plots above, it leads to slightly worse prompt alignment. For each method we generated a set of 10 videos of 2 minutes each; over these sets, the CLIP score drops from 30.80 to 30.63 when RoPE jitter is used.