Kwai-Keye ships 30B video model with 256K context and agents
Kwai-Keye has released Keye-VL-2.0-30B-A3B, a new 30B-parameter multimodal large language model designed for long-video understanding and agent capabilities. The model uses a sparse attention architecture to process hour-long video contexts efficiently, and the release says it performs competitively against top open-source and closed-source models on a range of video understanding benchmarks. It also includes built-in agent abilities for tasks such as code generation, tool use, and web-grounded search.
Key Takeaways
- Keye-VL-2.0-30B-A3B is a 30B-class base model with built-in Code, Tool, and Search agent abilities.
- The model uses DSA sparse attention and targets 256K ultra-long context for hour-long video inputs.
- On TimeLens, it posted 58.4 mIoU on Charades-TimeLens, 58.5 on ActivityNet-TimeLens, and 70.1 on QVHighlights-TimeLens.
- On VideoMME V2, accuracy rose from 35.3% at 64 frames to 42.4% at 512 frames, with the non-linear reasoning score improving from 18.5 to 24.2.
- On LongVideoBench, Keye-VL-2.0-30B-A3B scored 74.1, and the release says it outperformed Qwen3.5-35B-A3B and Qwen3-VL-235B-A22B.
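To put the VideoMME V2 frame-scaling figures in context, the short Python sketch below computes the absolute and relative gains between the 64-frame and 512-frame runs. It uses only the numbers reported in the release. The dictionary keys and the `gain` helper are hypothetical names chosen for illustration, not part of any Keye tooling.

```python
# Figures as reported in the release (VideoMME V2, 64 frames vs. 512 frames).
# The key names are illustrative labels, not official metric identifiers.
REPORTED = {
    "accuracy_pct": (35.3, 42.4),
    "non_linear_reasoning": (18.5, 24.2),
}


def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


gains = {name: gain(*pair) for name, pair in REPORTED.items()}

for name, (abs_gain, rel_gain) in gains.items():
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

By this arithmetic, an eightfold increase in frames yields about 7.1 accuracy points (roughly 20% relative) and 5.7 points on the non-linear reasoning score (roughly 31% relative).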
Why It Matters
Keye-VL-2.0-30B-A3B matters because it packages long-video understanding and agent functions in a single 30B model, with the company claiming nearly lossless reasoning over 256K context. That puts a new open model into direct comparison with larger open-source systems and several closed-source baselines across video, coding, and agent benchmarks. The most concrete signal to watch is whether the model’s VideoMME V2 gains keep holding as frame count increases: the release reports that accuracy improved from 35.3% at 64 frames to 42.4% at 512 frames.
Read full article at huggingface.co