Agora Launches Real-Time Speech-to-Text Translation with Sub-Second Latency, AI Integration
Agora has launched a real-time speech-to-text translation solution that supports over 30 languages with ultra-low latency and AI integration. This beta service aims to break down language barriers for global communication in virtual events, education, and live shopping within the streaming industry. Features include advanced speech recognition, translation transcripts, LLM integration, and sub-second end-to-start latency.
Key Takeaways
- The beta service provides real-time speech-to-text translation across more than 30 languages.
- It features ultra-low latency, with sub-second end-to-start latency and an average end-to-end latency of under 3 seconds.
- The solution includes advanced speech recognition, translation transcripts, and LLM integration for enhanced functionality.
- Use cases span virtual events, education, live shopping, and telehealth, promoting global communication in streaming.
- Developers can manage multilingual interactions by translating up to two source languages into five target languages for audio, with more languages supported for text.
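The language limits in the last takeaway can be illustrated with a small pre-flight check. This is a minimal sketch only: `TranslationRequest`, `validate`, and the language codes are hypothetical names chosen for illustration and are not part of Agora's API. It also assumes the limits of two source and five target languages apply per request, which the announcement does not spell out.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Limits as described in the article; the real service may enforce them differently.
MAX_SOURCE_LANGUAGES = 2
MAX_TARGET_LANGUAGES = 5


@dataclass(frozen=True)
class TranslationRequest:
    """Hypothetical request shape for illustration, not Agora's actual API."""
    source_languages: Tuple[str, ...]
    target_languages: Tuple[str, ...]


def validate(request: TranslationRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems: List[str] = []
    if not request.source_languages:
        problems.append("at least one source language is required")
    if not request.target_languages:
        problems.append("at least one target language is required")
    if len(request.source_languages) > MAX_SOURCE_LANGUAGES:
        problems.append(
            f"too many source languages: {len(request.source_languages)} > {MAX_SOURCE_LANGUAGES}"
        )
    if len(request.target_languages) > MAX_TARGET_LANGUAGES:
        problems.append(
            f"too many target languages: {len(request.target_languages)} > {MAX_TARGET_LANGUAGES}"
        )
    return problems


if __name__ == "__main__":
    # A bilingual session translated into three target languages fits the limits.
    ok = TranslationRequest(("en-US", "es-ES"), ("fr-FR", "de-DE", "ja-JP"))
    print(validate(ok))
```

Checking the limits on the client before starting a session would let an application report a clear error rather than relying on whatever response the service returns for an oversized request.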
Why It Matters
Agora's real-time translation, with its sub-second latency, is a notable advance for global live streaming and interactive video platforms. By enabling immediate multilingual communication, it directly affects user engagement and content accessibility. The LLM integration also points to future uses in AI-driven content analysis and dynamic localization. Going forward, watch adoption rates in the targeted verticals and how the technology influences user retention in international markets, particularly for time-sensitive, interactive content.
Additional Context
The real-time translation market is seeing rapid innovation, with various approaches to achieving low-latency, high-accuracy multilingual communication. According to a ForaSoft analysis from May 2026, the market broadly splits into cascaded pipelines (ASR + MT + TTS) and end-to-end speech-to-speech (S2S) models. While S2S models like OpenAI Realtime and Meta SeamlessM4T v2 can achieve lower latency (230-500 ms), cascaded systems typically offer more control, such as the ability to insert glossaries, and a higher degree of auditability. G2 rankings in November 2025 noted Agora's position against competitors like IBM watsonx Orchestrate and Microsoft, highlighting the importance of pipeline customization and real-time inference in NLP platforms.
For specific enterprise needs, platforms like DeepL Voice (launched April 2024), KUDO AI Speech Translator, and Interprefy Aivia are prevalent, with varying benchmarks for accuracy, latency, and cost. For example, DeepL Voice is recognized for its strong text-translation quality and often has a lower per-minute cost than event-focused platforms like KUDO or Interprefy.
Meanwhile, for internal enterprise use, Microsoft Teams, Zoom, and Google Meet have expanded their native real-time captioning capabilities, with some offering limited voice translation. For instance, Zoom supports translated captions in over 40 languages, and Google Meet has incrementally rolled out a voice mode. Such features often rely on backend models like GPT-class systems, as noted by ForaSoft in May 2026.
Read full article at prod.agora.io
