Google cuts Gemini AI costs by up to 90% on cached tokens via context caching
Google's Gemini Enterprise Agent Platform has introduced implicit and explicit context caching for its Gemini models. This feature is designed to reduce costs and latency for AI requests containing repeated content, offering up to a 90% discount on cached tokens for certain models. This is particularly useful for scenarios such as chatbots and repetitive analysis of large video or document files.
Key Takeaways
- Gemini 2.5 and 3.5 models offer a 90% discount on cached tokens compared to standard input prices
- Implicit caching is enabled by default for all Cloud projects with a minimum threshold of 2,048 tokens
- Explicit caching through the API lets developers set a time-to-live (TTL) manually and control which specific data is cached
- System supports analysis of video, audio, and large document blobs up to 10MB per cached item
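To make the discount concrete, here is a minimal Python sketch of how the input cost of a single request might be estimated. The 90% discount and the 2,048-token threshold come from the takeaways above. The function name, the price of 1.00 per million input tokens, and the rule that a cached prefix below the threshold earns no discount are illustrative assumptions, not published pricing or documented billing behavior.

```python
def estimate_input_cost(total_tokens, cached_tokens, price_per_million,
                        cache_discount=0.90, min_cache_tokens=2048):
    """Estimate the input cost of one request when part of the prompt is a cache hit.

    Assumption (hypothetical): a cached prefix shorter than min_cache_tokens
    earns no discount, so the whole request is billed at the standard price.
    """
    if cached_tokens > total_tokens:
        raise ValueError("cached_tokens cannot exceed total_tokens")
    if cached_tokens < min_cache_tokens:
        cached_tokens = 0  # below the threshold: bill everything at full price
    fresh_tokens = total_tokens - cached_tokens
    # Cached tokens are billed at (1 - discount) of the standard input price.
    billable = fresh_tokens + cached_tokens * (1.0 - cache_discount)
    return billable * price_per_million / 1_000_000


# Hypothetical request: 100,000 input tokens, 90,000 of them served from cache.
with_cache = estimate_input_cost(100_000, 90_000, price_per_million=1.00)
without_cache = estimate_input_cost(100_000, 0, price_per_million=1.00)
print(f"with cache: {with_cache:.4f}, without cache: {without_cache:.4f}")
```

Under these placeholder numbers the request costs roughly 0.019 with the cache hit versus 0.10 without it. The saving is smaller than 90% because only the cached portion of the prompt is discounted.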
Why It Matters
The high cost of processing long-form video content remains a primary barrier for AI-driven metadata extraction and search. By discounting repeated context by up to 90%, Google is lowering the economic threshold for sophisticated video analysis workflows, such as frame-by-frame sports logging or legal review of raw footage. This move pressures competitors like AWS and OpenAI to offer similar architectural efficiency for multimodal workloads. For the streaming ecosystem, this facilitates deeper content discovery tools without the ballooning compute costs traditionally associated with high-token video inputs. Watch for whether Google introduces custom cache-sharing across different project IDs for enterprise media organizations.
Additional Context
Context caching targets the high overhead costs of the long-context windows that have become a competitive frontier for LLMs; it does not by itself address quality issues such as the 'lost in the middle' phenomenon. Per CNBC in May 2026, Google has aggressively expanded Gemini’s context window to handle up to 2 million tokens, yet developers have voiced concerns about the linear cost scaling of reprocessing the same background data on every request. This update follows a broader trend in which infrastructure providers move from general model availability to operational cost optimization. For example, per The Verge in late 2025, competitors have focused on 'prompt caching' to retain users who are moving away from brute-force token consumption.
Within the media and entertainment sector, the feature aligns with recent industry shifts toward AI-driven post-production. Per a June 2026 report from Variety, major studios are increasingly using multimodal models to automate the generation of descriptive metadata and localization scripts. Prior to caching, re-analyzing a 60-minute 4K video file for different targeted outputs (such as social media clips versus accessibility captions) required redundant and expensive token processing. Google’s discount of up to 90% directly targets these repetitive workflows.
Furthermore, the technical implementation of implicit caching mirrors recent updates found in open-source frameworks. Per TechCrunch in April 2026, demand has skyrocketed for 'agentic' workflows, in which an AI performs multiple sequential tasks on a single dataset. By making the cache-hit savings automatic for projects using Gemini 3.5 Flash and Flash-Lite, Google is attempting to lock in developers who require low-latency responses for consumer-facing video chatbots and interactive streaming experiences.
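The repeated-analysis scenario described above can be sketched as a rough amortization estimate in Python. Only the 90% discount comes from the article. The function name and the token counts are hypothetical, and the model assumes that the first pass pays full price for the shared context, that every later pass is a cache hit, and that any cache storage fee is ignored.

```python
def workflow_savings(context_tokens, prompt_tokens_per_pass, passes,
                     cache_discount=0.90):
    """Return the fraction of input-token cost saved across all passes.

    Assumptions (hypothetical): pass 1 pays full price for the shared context,
    passes 2..n hit the cache, and storage fees are not modeled.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")
    # Baseline: every pass reprocesses the full context at the standard price.
    uncached = passes * (context_tokens + prompt_tokens_per_pass)
    # With caching: the first pass is full price, later passes get the discount
    # on the shared context but still pay full price for their own prompt.
    first_pass = context_tokens + prompt_tokens_per_pass
    later_pass = context_tokens * (1.0 - cache_discount) + prompt_tokens_per_pass
    cached = first_pass + (passes - 1) * later_pass
    return 1.0 - cached / uncached


# Hypothetical workload: one large shared context, a short prompt per output.
for n in (1, 2, 5, 20):
    print(n, round(workflow_savings(1_000_000, 1_000, n), 3))
```

Under this simple model the saving is zero for a single pass and grows with each additional pass over the same content, approaching but never reaching the headline 90%.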
Read full article at docs.cloud.google.com