NVIDIA launches 600M-parameter streaming ASR for English transcription
NVIDIA has released Nemotron-ASR-Streaming, a 600M-parameter English streaming Automatic Speech Recognition (ASR) model. It uses a Cache-Aware FastConformer-RNNT architecture to provide high-quality transcription with native punctuation and capitalization, and is designed for both low-latency streaming and high-throughput batch workloads.
Key Takeaways
- March 13, 2026 release: Nemotron-ASR-Streaming is available on Hugging Face, Build.nvidia.com, and NGC.
- The model has 600M parameters and uses a Cache-Aware FastConformer-RNNT architecture with a 24-layer encoder.
- NVIDIA lists four chunk sizes for inference: 80ms, 160ms, 560ms, and 1120ms.
- Training data includes about 250,000 hours of US English speech from the NVIDIA Riva ASR training set and Granary.
- On the Hugging Face Open ASR Leaderboard test sets, NVIDIA reports a 6.93% average word error rate (WER) at the 1120ms (1.12s) chunk size.
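For readers unfamiliar with the WER metric cited above, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative, and the sketch assumes plain whitespace tokenization; it does not reproduce whatever text normalization the leaderboard applies before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"  # one substitution, one deletion
    print(f"WER: {word_error_rate(ref, hyp):.2%}")
```

A 6.93% average WER therefore means roughly 7 word-level errors per 100 reference words, averaged across the benchmark's test sets.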
Why It Matters
Nemotron-ASR-Streaming gives developers a single English ASR model for both live voice workloads and batch transcription, with built-in punctuation and capitalization. The cache-aware design is aimed at reducing redundant overlap between streaming chunks, which NVIDIA says can improve throughput and lower GPU memory pressure compared with buffered approaches. The competitive signal is the model’s reported 6.93% average WER at the 1.12s chunk size, plus support for four operating points without retraining. Next to watch: how the model performs in real deployments through the hosted NVIDIA NIM API and the NeMo stack.
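To make the overlap argument concrete, the following back-of-the-envelope sketch compares how much audio an encoder would process under a simplified buffered scheme versus a cache-aware one. It is not NVIDIA's implementation: the function names, the 2000ms context window, and the 60-second clip are hypothetical values chosen for illustration. It assumes that a buffered approach re-encodes a fixed context window alongside every new chunk, while a cache-aware approach encodes each chunk once and reuses cached state for context.

```python
import math

# The four chunk sizes NVIDIA lists for inference, in milliseconds.
CHUNK_SIZES_MS = (80, 160, 560, 1120)


def buffered_audio_ms(total_ms: int, chunk_ms: int, context_ms: int) -> int:
    """Audio encoded when every step re-encodes context plus the new chunk.

    Simplification: the full context window is charged on every step,
    including the first ones where less history actually exists.
    """
    steps = math.ceil(total_ms / chunk_ms)
    return steps * (chunk_ms + context_ms)


def cache_aware_audio_ms(total_ms: int, chunk_ms: int) -> int:
    """Audio encoded when each chunk is processed once and context is cached."""
    steps = math.ceil(total_ms / chunk_ms)
    return steps * chunk_ms


if __name__ == "__main__":
    total_ms = 60_000    # hypothetical 60-second utterance
    context_ms = 2_000   # hypothetical buffered context window
    for chunk_ms in CHUNK_SIZES_MS:
        buffered = buffered_audio_ms(total_ms, chunk_ms, context_ms)
        cached = cache_aware_audio_ms(total_ms, chunk_ms)
        print(f"chunk={chunk_ms:>4}ms  buffered/cache-aware = {buffered / cached:.1f}x")
```

Under these assumptions the redundant work grows as the chunk shrinks, which is consistent with the claim that a cache-aware design matters most at the low-latency operating points.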
Read full article at huggingface.co