STORM MLLM Unifies Video Object Grounding and Tracking
Researchers have introduced STORM, an end-to-end multimodal large language model (MLLM) for referring multi-object tracking (RMOT) in videos, in work presented at CVPR 2026. The model unifies object grounding and tracking in a single framework and uses a task-composition learning strategy to improve data efficiency. According to the reported experiments, STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks.
Key Takeaways
- STORM achieves state-of-the-art performance on image grounding, single-object tracking, and RMOT benchmarks.
- The MLLM integrates grounding and tracking into a single framework, removing the need for external detectors.
- A task-composition learning strategy decomposes RMOT into image grounding and object tracking to leverage data-rich sub-tasks.
- STORM-Bench, a new RMOT dataset with 0.2M referring expressions and 73.7K tracked objects, was developed using a bottom-up annotation pipeline.
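The decomposition in the third takeaway can be illustrated with a toy sketch. This is not STORM's implementation: STORM performs both steps inside a single MLLM without external detectors, whereas the sketch below assumes per-frame detections are already given, uses a hypothetical label-matching rule for grounding, and swaps in a simple greedy IoU tracker. All names (`Detection`, `ground`, `track`, `refer_and_track`) are made up for illustration. It only shows how an RMOT result factors into "which objects match the expression" and "where those objects go over time".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    label: str  # toy stand-in for what a model would infer about the object
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ground(expression: str, frame: List[Detection]) -> List[Box]:
    """Sub-task 1, image grounding: boxes in one frame matching the expression.

    Assumption: a label counts as a match if it appears in the expression.
    """
    return [d.box for d in frame if d.label in expression]


def track(start: Box, frames: List[List[Detection]],
          min_iou: float = 0.3) -> List[Optional[Box]]:
    """Sub-task 2, object tracking: follow one box by greedy IoU matching."""
    trajectory: List[Optional[Box]] = []
    current = start
    for frame in frames:
        best = max(frame, key=lambda d: iou(current, d.box), default=None)
        if best is None or iou(current, best.box) < min_iou:
            trajectory.append(None)  # target lost in this frame
        else:
            current = best.box
            trajectory.append(current)
    return trajectory


def refer_and_track(expression: str,
                    video: List[List[Detection]]) -> List[List[Optional[Box]]]:
    """Compose the two sub-tasks: ground in the first frame, then track."""
    first, rest = video[0], video[1:]
    return [[box] + track(box, rest) for box in ground(expression, first)]


video = [
    [Detection("dog", (0.0, 0.0, 10.0, 10.0)), Detection("car", (50.0, 50.0, 70.0, 60.0))],
    [Detection("dog", (2.0, 0.0, 12.0, 10.0)), Detection("car", (55.0, 50.0, 75.0, 60.0))],
]
tracks = refer_and_track("the dog", video)
```

One simplification worth noting: grounding only in the first frame misses objects that enter the video later, which a real RMOT system has to handle.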
Why It Matters
An end-to-end MLLM for referring multi-object tracking handles grounding and tracking in one model instead of relying on a separate detector, which could reduce computational overhead and improve accuracy in video content analysis. A unified architecture could also simplify video AI applications, from automated content moderation to search within video libraries. Tracking objects from natural language queries with better data efficiency points toward more practical, deployable AI for streaming platforms. The signal to watch is whether commercial video processing products adopt this integrated approach.
Read full article at openaccess.thecvf.com