Reimplementation of LoL: Longer than Longer, Scaling Video Generation to Hour

This page is currently still a WIP

Motivation

This is a reimplementation of the ByteDance & UCLA paper “LoL: Longer than Longer, Scaling Video Generation to Hour” [1]. The paper analyses sink collapse, a failure mode of recent autoregressive video generation methods such as LongLive [2] or RIFLEx [3]. To generate videos that run for multiple minutes, these methods apply sliding-window attention over the past frames. Applied naively, however, the window eventually drops the first frames of the video and their context is lost. So-called “sink frames” (typically the first 3) are therefore kept in the attention context permanently, which avoids losing the context of the original frames and stabilizes generation over multi-minute videos.
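As a rough illustration of the sink-plus-window context described above, the sketch below lists which frames a given frame may attend to. The helper name `visible_key_frames` and the window size of 9 frames are assumptions made for this example; the actual model works on a KV cache rather than on explicit index lists.

```python
def visible_key_frames(query_frame: int, num_sinks: int = 3, window: int = 9) -> list[int]:
    """Indices of the frames that `query_frame` may attend to (hypothetical helper).

    The first `num_sinks` frames are always visible. On top of that the query
    sees the most recent `window` frames, itself included (causal attention).
    """
    sinks = list(range(min(num_sinks, query_frame + 1)))
    start = max(num_sinks, query_frame - window + 1)
    local = list(range(start, query_frame + 1))
    return sinks + local


if __name__ == "__main__":
    print(visible_key_frames(5))    # early in the video: nothing has been dropped yet
    print(visible_key_frames(100))  # later: the sinks plus the 9 most recent frames
```

Without the sink entries, frame 100 would only see frames 92 to 100 and would have no direct access to the start of the video.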

Method

The paper first diagnoses “sink collapse” and traces it to the frequencies of the RoPE [4] components, then proposes RoPE jitter to mitigate it.

Analysis of Sink Collapse

The authors find that while methods using attention sinks can keep long-range context to the original video frames, they introduce a new failure mode called sink collapse: during video rollout, the periodic structure of RoPE leads to scene resets, cyclic/looping motion and rewinds to the original anchor frames.

We can observe the mechanism directly at the attention level. Capturing the softmax maps in an early DiT layer (layer 0, denoising step 3), we plot how the 3 newly generated query frames attend over the key frames (columns 0–2 are the sinks, columns 3–11 the local window). On normal blocks attention is spread over the local window (sink fraction ≈ 0.01). On the sink-collapse blocks the head re-routes a large share of its mass onto the sink columns (sink fraction up to ~0.4), so attention collapses onto the anchor frames.

Attention collapse onto sink frames (head 5)
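The sink fraction quoted above can be computed along the following lines. This is a minimal sketch, assuming the attention map has already been pooled from tokens to a [query frames, key frames] matrix; the function name `sink_fraction` and the toy matrices are illustrative and not taken from the original code.

```python
import numpy as np


def sink_fraction(attn: np.ndarray, num_sinks: int = 3) -> float:
    """Mean share of attention mass that the query frames put on the sink columns.

    attn: [num_query_frames, num_key_frames] non-negative weights, e.g. a
    softmax map pooled from tokens to frames (the pooling is an assumption).
    """
    attn = np.asarray(attn, dtype=np.float64)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # make every row sum to 1
    return float(attn[:, :num_sinks].sum(axis=-1).mean())


if __name__ == "__main__":
    healthy = np.full((3, 12), 1.0)
    healthy[:, :3] = 0.03    # almost no weight on the three sink columns
    collapsed = np.full((3, 12), 1.0)
    collapsed[:, :3] = 2.0   # the sinks pull a large share of the mass
    print(sink_fraction(healthy), sink_fraction(collapsed))
```

The toy values are chosen so that the two cases land near the 0.01 and 0.4 regimes described above; they are not measured data.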

For every newly generated latent frame we measure its L2 distance in latent space to the three sink (anchor) latents. In a healthy rollout this distance grows quickly after the sinks and then stays high: the video keeps moving away from its starting point. Under vanilla LongLive (standard RoPE) we instead observe sharp, periodic dips back towards the sinks (latent indices ≈ 134, 203, 335, 467). Each dip is a moment where the generation snaps back to the anchor frames, leading to a visible scene reset or rewind. Applying RoPE head-jitter removes these collapses: the blue curve stays high and the periodic dips disappear.

Latent L2 distance to sink frames: vanilla LongLive vs. RoPE head-jitter
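The distance curve can be reproduced roughly as follows. This is a sketch under stated assumptions: the distance to the three sinks is reduced by taking the minimum (the text does not say whether minimum or mean is used), and `find_dips` with its relative threshold is a hypothetical helper for flagging collapse candidates.

```python
import numpy as np


def distance_to_sinks(latents: np.ndarray, num_sinks: int = 3) -> np.ndarray:
    """L2 distance of every latent frame to its nearest sink latent.

    latents: [T, ...] array with one latent per frame; the first `num_sinks`
    frames are the sinks. Taking the minimum over the sinks is an assumption.
    """
    flat = np.asarray(latents, dtype=np.float64).reshape(len(latents), -1)
    per_sink = [np.linalg.norm(flat - flat[s], axis=1) for s in range(num_sinks)]
    return np.min(np.stack(per_sink, axis=1), axis=1)


def find_dips(dist: np.ndarray, num_sinks: int = 3, rel_threshold: float = 0.5) -> list[int]:
    """Frames after the sinks whose distance falls below a fraction of the median."""
    baseline = np.median(dist[num_sinks:])
    below = np.flatnonzero(dist < rel_threshold * baseline)
    return [int(t) for t in below if t >= num_sinks]


if __name__ == "__main__":
    latents = 10.0 * np.eye(40)   # toy rollout: every frame far from all others
    latents[20] = latents[1]      # frame 20 snaps back onto a sink frame
    print(find_dips(distance_to_sinks(latents)))
```

On real latents the threshold would need tuning, since the baseline distance depends on the prompt and on the latent scale.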

Proposed RoPE Jitter

The key insight from the analysis is that sink collapse is collective: it does not come from a single RoPE component or a single attention head, but from all heads phase-aligning to the sink frames at the same time. RIFLEx-style fixes that retune one intrinsic frequency therefore do not help. Instead, the paper proposes multi-head RoPE jitter: give each attention head a slightly different RoPE base frequency so the heads can no longer synchronize.

Concretely, for the temporal RoPE component with base \(\theta_0\), each head \(h\) gets its own jittered base

\[\hat{\theta}_h = \theta_0\,(1 + \sigma\,\epsilon_h), \qquad \epsilon_h \sim \mathcal{U}[-1, 1],\]

and rotates its queries/keys with the per-head frequencies \(\omega_h = [\hat{\theta}_h^{\nu_0}, \dots, \hat{\theta}_h^{\nu_{D/2-1}}]\). Because RoPE is periodic, this small per-head phase shift desynchronizes the heads: the phase peaks that used to line up now occur at different displacements per head, so the simultaneous, all-head overlap that produces sink collapse becomes very unlikely. The jitter scale \(\sigma\) trades off stability against fidelity. Following the paper we use \(\sigma = 0.8\), which removes the collapses while barely touching generation quality.
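The sampling rule above can be sketched in a few lines of NumPy. Several details here are assumptions rather than facts from the paper or the code: the exponents are taken to be the standard RoPE choice \(\nu_i = -2i/D\), the default \(\theta_0 = 10000\) is the common RoPE base, and `phase_kernel` is an illustrative helper that evaluates the position-dependent part of the RoPE score for a query and key with identical content and equal energy in every frequency pair.

```python
import numpy as np


def jittered_temporal_freqs(num_heads, dim, theta0=10000.0, sigma=0.8, seed=0):
    """Per-head jittered bases and frequencies for the temporal RoPE component.

    Returns theta_hat with shape [num_heads] and omega with shape
    [num_heads, dim // 2].
    """
    rng = np.random.default_rng(seed)
    eps = rng.uniform(-1.0, 1.0, size=num_heads)   # eps_h ~ U[-1, 1]
    theta_hat = theta0 * (1.0 + sigma * eps)       # jittered base per head
    nu = -2.0 * np.arange(dim // 2) / dim          # assumed exponents nu_i = -2i/D
    omega = theta_hat[:, None] ** nu[None, :]      # omega_h = theta_hat_h ** nu
    return theta_hat, omega


def phase_kernel(omega, delta):
    """Mean of cos(delta * omega) per head at temporal displacement `delta`."""
    return np.cos(delta * omega).mean(axis=-1)


if __name__ == "__main__":
    _, shared = jittered_temporal_freqs(num_heads=8, dim=16, sigma=0.0)
    _, jittered = jittered_temporal_freqs(num_heads=8, dim=16, sigma=0.8)
    print(phase_kernel(shared, 134))    # identical for every head
    print(phase_kernel(jittered, 134))  # one value per head
```

With \(\sigma = 0\) every head has the same response at every displacement, which is the synchronized situation the jitter is meant to break. Since \(\sigma < 1\), the jittered bases stay positive.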

Implementation

We build on the upstream LongLive [2] inference stack and add jitter as a drop-in replacement for the RoPE frequency table, so no attention kernels need to change:

Per-head frequency table. Standard RoPE uses a shared [T, c] complex frequency table. We instead build a [num_heads, T, c] table where only the temporal slice carries the per-head jittered bases \(\hat{\theta}_h\) (sampled once from a seeded generator); the two spatial slices keep the shared base. The existing rope_apply auto-detects the 2D (shared) vs. 3D (per-head) layout, so the change is localized.
In-place switch. A configure_rope(jitter=True, sigma, seed) call rebuilds model.freqs after the checkpoint loads, leaving the LoRA wrap, dtype cast and device move untouched. With \(\sigma = 0\) the table collapses back to the shared one, giving a bit-exact identity check against vanilla RoPE.

This keeps vanilla and jittered runs numerically comparable: the only difference between the two curves above is the per-head \(\hat{\theta}_h\).
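The table layout and the 2D/3D dispatch can be illustrated with a small NumPy stand-in. This is not the actual `rope_apply` or `configure_rope` code: `axis_freqs`, `build_freq_table` and `apply_rope` are hypothetical names, the split into one temporal and two spatial slices is simplified, and a single position index drives all three slices, whereas the real model indexes frame, height and width separately.

```python
import numpy as np


def axis_freqs(max_pos, n_pairs, base):
    """Complex RoPE rotations exp(i * pos * base**(-k / n_pairs)).

    A scalar `base` gives [max_pos, n_pairs]; a [num_heads] array of bases
    gives [num_heads, max_pos, n_pairs].
    """
    exponents = -np.arange(n_pairs) / n_pairs
    omega = np.asarray(base, dtype=np.float64)[..., None] ** exponents
    pos = np.arange(max_pos, dtype=np.float64)
    return np.exp(1j * pos[:, None] * omega[..., None, :])


def build_freq_table(max_pos, pairs_t, pairs_hw, num_heads=None,
                     theta0=10000.0, sigma=0.0, seed=0):
    """Shared [T, c] table if num_heads is None, else per-head [num_heads, T, c]."""
    spatial = axis_freqs(max_pos, pairs_hw, theta0)   # spatial slices keep the shared base
    if num_heads is None:
        temporal = axis_freqs(max_pos, pairs_t, theta0)
        return np.concatenate([temporal, spatial, spatial], axis=-1)
    eps = np.random.default_rng(seed).uniform(-1.0, 1.0, size=num_heads)
    temporal = axis_freqs(max_pos, pairs_t, theta0 * (1.0 + sigma * eps))
    spatial = np.broadcast_to(spatial, (num_heads,) + spatial.shape)
    return np.concatenate([temporal, spatial, spatial], axis=-1)


def apply_rope(x, freqs):
    """Rotate x [T, num_heads, 2 * c] with a shared (2D) or per-head (3D) table."""
    T, H, d = x.shape
    pairs = x.reshape(T, H, d // 2, 2)
    xc = pairs[..., 0] + 1j * pairs[..., 1]           # complex view, [T, H, c]
    if freqs.ndim == 2:                               # shared table [T, c]
        rot = freqs[:T, None, :]
    else:                                             # per-head table [H, T, c]
        rot = np.transpose(freqs[:, :T, :], (1, 0, 2))
    out = xc * rot
    return np.stack([out.real, out.imag], axis=-1).reshape(T, H, d)


if __name__ == "__main__":
    x = np.random.default_rng(1).normal(size=(16, 4, 16))
    shared = build_freq_table(16, pairs_t=4, pairs_hw=2)
    per_head = build_freq_table(16, pairs_t=4, pairs_hw=2, num_heads=4, sigma=0.0)
    # sigma = 0 should reproduce the shared table, up to floating-point tolerance
    print(np.allclose(apply_rope(x, shared), apply_rope(x, per_head)))
```

The sketch checks the \(\sigma = 0\) case with `np.allclose` rather than exact equality, because it makes no claim about bit-exactness of NumPy's broadcasting paths.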

Results

The latent-distance and attention plots above already show that jitter removes the periodic collapses. The effect is also visible in pixel space. Below we decode the same latent indices (the sink frame and the four collapse points) for the vanilla baseline and the \(\sigma = 0.8\) jitter run on a snowy-mountain prompt.

Vanilla LongLive — at every collapse latent the frame snaps back to (a near-copy of) the sink frame:

Vanilla LongLive: decoded frames at the sink and collapse latents, the scene rewinds to the sink frame

With RoPE head-jitter (\(\sigma = 0.8\)) — at the same latent indices the scene keeps evolving instead of rewinding:

RoPE head-jitter: decoded frames at the same latents; no rewinds, the video continues

The trade-off is a small drop in prompt alignment (as also reported in the paper): the CLIP score over our set of 10 prompts, 2 minutes each, drops from 30.80 (vanilla) to 30.63 (jitter).

References

[1] Cui et al., LoL: Longer than Longer, Scaling Video Generation to Hour, arXiv 2601.16914 (2026).
[2] Yang et al., LongLive: Real-time Interactive Long Video Generation, arXiv 2509.22622 (2025).
[3] Zhao et al., RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers, arXiv 2502.15894 (2025).
[4] Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, arXiv 2104.09864 (2021).