Databricks adds sub-second streaming to Spark Structured Streaming
Databricks has announced the general availability of Real-Time Mode (RTM) for Apache Spark Structured Streaming, a feature designed to provide sub-second, millisecond-level latency for data processing. The new mode processes data as it arrives rather than in microbatches, positioning it as an alternative to Apache Flink for use cases such as real-time ML feature computation and live personalization. RTM is available on Databricks Runtime 16.4 LTS and above but has specific cluster requirements, including disabled autoscaling and Photon.
Key Takeaways
- Real-Time Mode is generally available for Apache Spark Structured Streaming on Databricks.
- Databricks says RTM processes data as it arrives, instead of using Spark’s default microbatch model.
- The feature is available on Databricks Runtime 16.4 LTS and above.
- RTM requires classic compute, disabled autoscaling, disabled Photon, and no spot instances.
- The article says RTM can be used for fraud detection, real-time ML feature computation, live personalization, and IoT anomaly detection.
Why It Matters
RTM gives Spark users a way to push toward sub-second latency without running a separate engine. That matters because the article frames it as an alternative to maintaining a separate Apache Flink cluster, with the same Spark APIs and a single codebase for batch training and real-time inference. The tradeoff is operational: RTM is not supported on serverless or Lakeflow Spark Declarative Pipelines, and it requires fixed task slots, disabled autoscaling, and Photon off. Watch for which streaming workloads can absorb those constraints and move from microbatch to RTM.
Read full article at medium.com
