NeuroFlow claims 55.8x video inference speedup on SigLIP 2
Ynnk-Research has published NeuroFlow, a PyTorch implementation of EMA-Gated Temporal Sequence Compression for Vision Transformers. It aims to cut the cost of video inference, with a reported wall-clock speedup of up to 55.8x, by identifying and dropping redundant 'stationary asphalt' tokens before the encoder while maintaining embedding fidelity. The toolkit includes multiple architectures; Architecture C is a training-free option that reports 71.55% zero-shot top-1 accuracy at 84% token sparsity without modifying model weights.
Key Takeaways
- Architecture B reports a 55.80x wall-clock speedup at 1792p, reducing SigLIP 2 inference from 678 ms to 11.9 ms.
- Architecture C is training-free and posts 71.55% zero-shot top-1 accuracy at 84.0% token sparsity on SigLIP.
- The repo says Architecture C retains 92.4% of dense accuracy without modifying any weights.
- NeuroFlow’s gate uses an EMA of patch-level embeddings to skip stationary tokens before the encoder.
- The repository includes /core, /scripts, /paper, and /weights, with the 300 MB Architecture B weights archived on Hugging Face and Zenodo.
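The EMA gating idea from the takeaways can be illustrated with a minimal sketch. This is not NeuroFlow's code: the class name `EmaGate`, the `decay` and `threshold` parameters, the Euclidean distance test, and the choice to keep every token on the first frame are all assumptions made for illustration. It uses plain Python lists for readability, whereas a real implementation would presumably operate on batched tensors.

```python
import math


class EmaGate:
    """Hypothetical per-patch EMA gate (illustrative, not NeuroFlow's code)."""

    def __init__(self, decay=0.9, threshold=0.1):
        self.decay = decay          # assumed EMA smoothing factor
        self.threshold = threshold  # assumed Euclidean distance cutoff
        self.ema = None             # running average, one vector per patch

    def select(self, patches):
        """Return indices of patches that changed enough to be encoded."""
        if self.ema is None:
            # First frame: nothing to compare against, so keep every token.
            self.ema = [list(p) for p in patches]
            return list(range(len(patches)))
        keep = []
        for i, (p, m) in enumerate(zip(patches, self.ema)):
            # A patch far from its running average is treated as non-stationary.
            if math.dist(p, m) > self.threshold:
                keep.append(i)
            # Fold the new observation into the running average.
            self.ema[i] = [self.decay * a + (1 - self.decay) * b
                           for a, b in zip(m, p)]
        return keep


gate = EmaGate(decay=0.9, threshold=0.1)
frame0 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
frame1 = [[0.0, 0.0], [1.0, 1.0], [2.5, 2.0]]  # only patch 2 changed
first = gate.select(frame0)
second = gate.select(frame1)
print(first, second)
```

In this toy run the first frame passes all three tokens and the second frame passes only the patch that moved, so the encoder would see one token instead of three.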
Why It Matters
NeuroFlow targets the compute cost of video inference by removing redundant patch tokens before they reach the Vision Transformer encoder. The repo frames the bottleneck as a mismatch between O(N^2) self-attention and highly redundant video streams, and it offers a training-free path in Architecture C for teams that want sparsity without weight updates. The main data points to watch are the 55.80x speedup at 1792p and whether Architecture C's 71.55% zero-shot top-1 accuracy holds at the cited 84.0% token sparsity.
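To make the quadratic-cost argument concrete, here is a back-of-envelope calculation. It assumes the pairwise attention work scales with the square of the number of retained tokens and ignores every other cost (patch embedding, MLP layers, the gate itself), so it is not a model of the repo's reported wall-clock numbers. The function name `attention_cost_ratio` is hypothetical.

```python
def attention_cost_ratio(sparsity):
    """Fraction of dense pairwise-attention work left after dropping tokens.

    Assumes attention cost is proportional to (tokens kept) squared.
    """
    kept = 1.0 - sparsity
    return kept * kept


ratio = attention_cost_ratio(0.84)  # the token sparsity cited for Architecture C
print(f"remaining attention work: {ratio:.4f}")
print(f"reduction factor: {1 / ratio:.1f}x")
```

Under that assumption, keeping 16% of the tokens leaves about 2.6% of the pairwise attention work, which is why dropping tokens before the encoder can pay off more than linearly.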