Meta survey maps the 2026 video intelligence stack
A Meta AI researcher provides a technical survey, framed from the perspective of April 2026, on the state of efficient video intelligence. The post outlines the key architectural patterns that have become standard, including universal vision encoders (EUPE), on-device segmentation and tracking (EdgeTAM), long-form video understanding via adaptive compression (LongVU, Tempo), and VLM-based depth perception. It covers the full stack from model design to deployment, highlighting a convergence on multi-teacher distillation, factorized attention, and aggressive quantization for cloud, edge, and on-device targets.
Key Takeaways
- EUPE distills DINOv2, DINOv3, SAM, SAM 2, SAM 3, CLIP, SigLIP, and SigLIP-SO400M through a proxy teacher, then compresses into students under 100M parameters.
- EdgeTAM runs at 16 FPS on iPhone 15 Pro Max and uses a 2D Spatial Perceiver to preserve tracking accuracy while compressing per-frame memory.
- LongVU pairs adaptive sampling with token pruning on 1 FPS input, reaching 60.6% on VideoMME and 65.4% on MLVU.
- Tempo allocates a per-segment token budget ranging from 0.5 to 16 tokens per frame and reports 52.3 on LVBench at an 8K visual token budget.
- ExecuTorch 1.0 GA now powers Meta on-device AI across Instagram, WhatsApp, Messenger, Facebook, Quest 3, and Ray-Ban Meta.
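
To make the per-segment token routing idea concrete, here is a minimal sketch in plain Python. It is not Tempo's actual method: the function name `route_token_budget`, the relevance scores, and the proportional water-filling rule are all assumptions for illustration. Only the 0.5 to 16 tokens-per-frame range and the notion of a fixed visual token budget come from the summary above.

```python
def route_token_budget(relevance, frames_per_segment, total_budget,
                       min_tpf=0.5, max_tpf=16.0):
    """Return a tokens-per-frame rate for each segment.

    Hypothetical rule: every segment gets the minimum rate, then the
    leftover budget is shared in proportion to relevance * frame count,
    with each segment capped at the maximum rate.
    """
    n = len(relevance)
    if n != len(frames_per_segment):
        raise ValueError("relevance and frames_per_segment must align")
    if any(f <= 0 for f in frames_per_segment):
        raise ValueError("every segment needs at least one frame")

    floor_cost = min_tpf * sum(frames_per_segment)
    if floor_cost > total_budget:
        raise ValueError("budget cannot cover the minimum rate")

    rates = [min_tpf] * n
    remaining = total_budget - floor_cost
    active = list(range(n))  # segments still below the cap

    while remaining > 1e-9 and active:
        weight = sum(relevance[i] * frames_per_segment[i] for i in active)
        if weight <= 0:
            break  # nothing left that deserves extra tokens
        spent = 0.0
        saturated = []
        for i in active:
            frames = frames_per_segment[i]
            share = remaining * relevance[i] * frames / weight
            new_rate = rates[i] + share / frames
            if new_rate >= max_tpf:
                # Cap the segment; unused tokens are redistributed next pass.
                spent += (max_tpf - rates[i]) * frames
                rates[i] = max_tpf
                saturated.append(i)
            else:
                spent += share
                rates[i] = new_rate
        remaining -= spent
        if not saturated:
            break  # the full remainder was handed out this pass
        active = [i for i in active if i not in saturated]
    return rates


# Three 100-frame segments; the middle one is judged most relevant.
rates = route_token_budget([0.1, 0.9, 0.0], [100, 100, 100], total_budget=1000)
```

In this sketch the irrelevant segment stays at the 0.5 floor while the most relevant one receives the largest rate, and the total never exceeds the budget. A real system would derive relevance from the query and the video content rather than take it as an input.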
Why It Matters
The near-term signal is that efficient video systems are converging on a repeatable stack of universal encoders, temporal compression, and aggressive quantization, rather than on one-off specialist models. That matters because the survey ties cloud, edge, and on-device deployment to different compute and latency envelopes, and names ExecuTorch, Core ML, QNN, and Jetson as runtime paths. The broader pattern is less about a single model win and more about which components survive in production video pipelines. Watch whether future systems keep combining proxy-teacher distillation, adaptive token routing, and streaming inference support in the same stack.
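
For readers unfamiliar with the quantization step in that stack, the sketch below shows symmetric per-tensor quantization in plain Python. It is a generic textbook illustration, not the scheme used by any model or runtime named here; the function names and the choice of bit widths are assumptions.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers using one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer levels."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)

# Worst-case reconstruction error grows as the bit width shrinks.
err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(weights, dequantize(q4, s4)))
```

Dropping from 8 to 4 bits halves storage again but leaves only 15 usable levels, which is why lower bit widths usually need finer-grained scales or calibration to stay accurate.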
Read full article at v-chandra.github.io