NVIDIA's Nemotron 3.5 ASR Offers Real-Time Speech-to-Text in 40 Languages

NVIDIA has released Nemotron 3.5 ASR, a 600M-parameter speech-to-text model that supports 40 languages in real time, offering low latency and high accuracy with built-in punctuation and capitalization. The model is open-weights, fine-tunable, and addresses common challenges in multilingual speech recognition for streaming video applications. It provides a detailed guide on how to fine-tune the model for specific languages or domains.

Key Takeaways

Nemotron 3.5 ASR transcribes 40 language-locales from a single 600M-parameter checkpoint with real-time performance.
The model incorporates punctuation and capitalization natively, eliminating the need for post-processing.
Its Cache-Aware FastConformer-RNNT architecture processes each audio frame once, providing low latency (down to 80ms) and high accuracy without recomputation.
Fine-tuning options allow for adapting the model to specific languages, domains, or accents, with demonstrated WER improvements of 31-32% for under-resourced languages like Greek and Bulgarian.
The model supports dynamic latency configuration via `att_context_size` at inference time, ranging from 80ms (ultra-low) to 1.12s (high accuracy).

Why It Matters

This release directly impacts streaming video applications requiring low-latency, accurate, and multilingual speech-to-text capabilities, such as live captions, voice agents, and call-center analytics. By offering a single, fine-tunable, open-weights model for 40 languages, NVIDIA reduces infrastructure complexity and costs associated with managing multiple APIs or models. The configurable latency and native punctuation capabilities also streamline development. Moving forward, watch for adoption rates and independent benchmarks of Nemotron 3.5 ASR in diverse production environments, especially how its fine-tuning capabilities are leveraged for long-tail languages and specialized domains.

Additional Context

NVIDIA's Nemotron 3.5 ASR launch on June 4, 2026, aligns with a broader strategy to provide open, efficient AI models for agentic systems, including the Nemotron 3 Ultra for reasoning and Nemotron 3.5 Content Safety for guardrails (Eigen AI, June 2026). The ASR model is noted as the lowest-latency STT model tested by Pipecat benchmarks, offering significant cost savings if self-hosted, with estimates as low as $0.01 per agent per hour compared to typical API costs of $0.10-$1.00 per hour (daily.co, June 2026). This self-hosting capability, combined with open-source tools for fine-tuning, empowers enterprises to maintain data processing within their security boundaries and optimize for specific use cases. The Nemotron models are moving to the OpenMDW-1.1 license, a permissive framework from the Linux Foundation, aiming to clarify terms and accelerate adoption of open AI models (NVIDIA Developer, June 2026). Partner integrations include Microsoft Foundry, Baseten, DeepInfra, and Together AI, indicating a push for broad accessibility and deployment across various cloud and inference service providers.

Read full article at huggingface.co

Get this in your inbox → Subscribe

Enjoy our coverage?

Add StreamingMeme as a preferred source on Google to see more of our streaming news at the top of your Search results.

Add as preferred source

X: vLLM v0.26.0 introduces tiered KV offloading and multimodal audio-video support

Content+Technology: Runway launches Media Router to automate generative video model selection

IT Brief UK: Fetch.ai and RedSquid TV launch first agentic AI television platform

NVIDIA's Nemotron 3.5 ASR Offers Real-Time Speech-to-Text in 40 Languages

Key Takeaways

Nemotron 3.5 ASR transcribes 40 language-locales from a single 600M-parameter checkpoint with real-time performance.
The model incorporates punctuation and capitalization natively, eliminating the need for post-processing.
Its Cache-Aware FastConformer-RNNT architecture processes each audio frame once, providing low latency (down to 80ms) and high accuracy without recomputation.
Fine-tuning options allow for adapting the model to specific languages, domains, or accents, with demonstrated WER improvements of 31-32% for under-resourced languages like Greek and Bulgarian.
The model supports dynamic latency configuration via `att_context_size` at inference time, ranging from 80ms (ultra-low) to 1.12s (high accuracy).

Why It Matters

Additional Context

Read full article at huggingface.co

NVIDIA's Nemotron 3.5 ASR Offers Real-Time Speech-to-Text in 40 Languages

Key Takeaways

Why It Matters

Additional Context

Enjoy our coverage?

Related Articles

NVIDIA's Nemotron 3.5 ASR Offers Real-Time Speech-to-Text in 40 Languages

Key Takeaways

Why It Matters

Additional Context

Enjoy our coverage?

Related Articles

Newest

Upcoming Events

Top Sources

Newest

Upcoming Events

Top Sources

Related Articles

vLLM v0.26.0 introduces tiered KV offloading and multimodal audio-video support

Runway launches Media Router to automate generative video model selection

Fetch.ai and RedSquid TV launch first agentic AI television platform