Simplified Answers – What is Spark Streaming?
- Spark Streaming is Spark's module for applications that benefit from acting on data as soon as it arrives from various sources, e.g. tracking page views in real time, training a machine learning model, or automatically detecting anomalies.
- Developers can use an API very similar to that of batch jobs, so the same API skills and code bases can be reused.
- Spark Streaming provides an abstraction called DStreams (discretized streams). A DStream is a sequence of data arriving over time. Internally, each DStream is represented as a sequence of RDDs, one per time step, hence the name "discretized".
- DStreams can be created from various input sources such as Flume, Kafka, HDFS, and S3. Once a DStream is built, it offers two types of operations: transformations, which yield new DStreams, and output operations, which write data to an external system.
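The DStream-as-a-sequence-of-RDDs model can be illustrated with a minimal plain-Python sketch. No Spark is required here: batches are modeled as ordinary lists, and the function names below are illustrative, not Spark APIs.

```python
# Conceptual sketch: a DStream modeled as a sequence of micro-batches.
# Each "batch" stands in for one RDD; this is NOT the Spark API.

def transform(dstream, fn):
    """A transformation yields a new stream by applying fn to every batch."""
    return [[fn(record) for record in batch] for batch in dstream]

def output_op(dstream, sink):
    """An output operation writes each batch to an external system (here, a list)."""
    for batch in dstream:
        sink.extend(batch)

page_views = [["/home", "/about"], ["/home"]]   # two micro-batches of page views
lengths = transform(page_views, len)            # transformation -> new stream
sink = []
output_op(lengths, sink)                        # output operation -> external sink
print(lengths)  # [[5, 6], [5]]
print(sink)     # [5, 6, 5]
```

The key point the sketch mirrors is that a transformation never mutates the input stream; it produces a new stream, batch by batch, while output operations are the only place data leaves the pipeline.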
- DStreams provide most of the same operations available on RDDs, and additionally provide time-related operations such as sliding windows.
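A sliding window can be sketched in plain Python as computing over the last few micro-batches at each step. For simplicity the window length below is expressed in batches rather than in time units, and this is a conceptual illustration, not the Spark windowing API.

```python
# Conceptual sketch of a sliding window: each step computes a result over the
# last `window_len` micro-batches, sliding forward one batch at a time.

def windowed_counts(batches, window_len):
    results = []
    for i in range(len(batches)):
        window = batches[max(0, i - window_len + 1): i + 1]
        results.append(sum(len(b) for b in window))  # count events in the window
    return results

batches = [[1, 2], [3], [4, 5, 6]]
print(windowed_counts(batches, window_len=2))  # [2, 3, 4]
```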
- Checkpointing is a vital mechanism in Spark Streaming: it periodically saves state and metadata to reliable storage so that an application can recover from failures and therefore operate 24/7.
- Spark Streaming programs are best run as standalone applications built with Maven or SBT (Simple Build Tool).
- StreamingContext is the main entry point for Spark Streaming functionality; it wraps the SparkContext used to process the data. A StreamingContext can be started only once, so it must be started only after all DStreams and output operations have been set up.
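The "set everything up first, then start exactly once" lifecycle can be sketched with a tiny plain-Python class. `MiniStreamingContext` is a hypothetical stand-in for illustration only, not the real StreamingContext API.

```python
# Conceptual sketch of the "configure first, start once" rule that
# StreamingContext enforces. Not the Spark API.

class MiniStreamingContext:
    def __init__(self):
        self.pipeline = []
        self.started = False

    def add_stage(self, fn):
        if self.started:
            raise RuntimeError("cannot add stages after start()")
        self.pipeline.append(fn)

    def start(self):
        if self.started:
            raise RuntimeError("context can be started only once")
        self.started = True

ctx = MiniStreamingContext()
ctx.add_stage(str.upper)   # set up all DStreams/output operations first...
ctx.start()                # ...then start exactly once
```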
- Spark Streaming uses a micro-batch architecture, in which the streaming computation is treated as a continuous series of batch computations on small batches of data.
- It receives data from various sources and groups it into small batches; new batches are created at regular time intervals.
- The size of these time intervals is determined by a parameter called the batch interval, which typically ranges from about 500 milliseconds to several seconds.
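Grouping incoming records into fixed batch intervals can be sketched in plain Python. The interval is in the same (arbitrary) time units as the record timestamps; this is a conceptual illustration, not how Spark's receivers are implemented.

```python
# Conceptual sketch: group timestamped records into fixed batch intervals.

def batch_by_interval(records, interval):
    """records: list of (timestamp, value) pairs; returns batch_index -> values."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return batches

events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
print(batch_by_interval(events, interval=0.5))
# {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```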
- Transformations on DStreams can be one of two types: stateless or stateful. In stateless transformations, the processing of each batch does not depend on the data of previous batches, e.g. map(), filter(), flatMap(). Stateful transformations use data or intermediate results from previous batches to compute the results of the current batch, e.g. sliding windows such as window(), and updateStateByKey().
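A stateful transformation in the spirit of updateStateByKey() can be sketched as a running count per key that is carried across batches. Plain Python again, not the Spark API.

```python
# Conceptual sketch of a stateful transformation: state (a running count per
# key) survives from batch to batch, unlike stateless map()/filter().

def update_state(batches):
    state = {}                        # key -> running count, kept across batches
    history = []
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
        history.append(dict(state))   # snapshot of the state after each batch
    return history

batches = [["a", "b", "a"], ["b", "c"]]
print(update_state(batches))
# [{'a': 2, 'b': 1}, {'a': 2, 'b': 2, 'c': 1}]
```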
- In general, the minimum batch size in Spark Streaming is 500 milliseconds.
- A common way to reduce the processing time of batches is to increase parallelism. There are three ways: increasing the number of receivers, explicitly repartitioning received data, and increasing parallelism in aggregations.
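The second and third techniques can be sketched together in plain Python: split a batch into partitions (mimicking an explicit repartition) and aggregate each partition independently before combining. In Spark the per-partition aggregations would run on different cores or executors; here they run sequentially for illustration.

```python
# Conceptual sketch: repartition a batch, aggregate per partition, then combine.
# In Spark the per-partition sums would run in parallel across executors.

def repartition(batch, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(batch):
        parts[i % num_partitions].append(record)  # round-robin redistribution
    return parts

def parallel_sum(batch, num_partitions):
    partials = [sum(p) for p in repartition(batch, num_partitions)]  # parallelizable
    return sum(partials)                                             # final combine

print(parallel_sum([1, 2, 3, 4, 5], num_partitions=3))  # 15
```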
- Enable Java's Concurrent Mark-Sweep (CMS) garbage collector to minimize unpredictably large pauses.
To conclude, Spark Streaming helps developers build scalable, fault-tolerant streaming applications easily by using real-time APIs.