Sunday 16 October 2016

Big Data Computational Models


Most distributed processing engines come with one or more of three different types of computational models, namely batch, micro-batch, and continuous flow operator.
  • Batch model: The batch model processes data at rest, taking a large amount of data at once, processing it, and then writing the output to a file system or data store.
  • Micro-batch model: Micro-batching combines aspects of both the batch and the continuous flow operator models. In this model, data is collected over a fixed time interval and then processed as a group. Micro-batching is essentially a "collect and then process" kind of computational model.
  • Continuous flow operator model: This model processes data as soon as it arrives, without any delay to collect it first.
To better understand the difference between the micro-batch and continuous flow models, assume you want to supply water to a tank on the roof of a building. There are two ways to do this: you can first collect the water in a basement tank and pump it up to the rooftop tank from there, or you can supply the water directly to the rooftop tank through the pipes. That is essentially the basic difference between the micro-batch and continuous flow operator models.
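The same idea can be sketched in a few lines of plain Scala (no framework involved, and the names are purely illustrative): micro-batching collects records for a while and processes them as a group, while a continuous flow operator handles each record the moment it arrives.

    object ModelsSketch {
      // a toy source of incoming records, standing in for a real stream
      def incoming: Iterator[Int] = Iterator.from(1).take(20)

      def main(args: Array[String]): Unit = {
        // micro-batch style: "collect and then process" - gather a small group, then handle it at once
        incoming.grouped(5).foreach(batch => println(s"processed batch of ${batch.size}: $batch"))

        // continuous flow style: every record is processed as soon as it arrives, no collection step
        incoming.foreach(record => println(s"processed record $record immediately"))
      }
    }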

Apache Spark vs Apache Flink

In the last few years, real-time processing has become one of the hot topics in the market and has gained a lot of value. The emerging tools that provide support for real-time processing are Apache Storm, Apache Spark and Apache Flink. In this blog I am going to compare different features of Spark and Flink.

Feature by feature Comparison of Apache Spark and Flink:

Exactly-once semantics:
Both Spark Streaming and Flink provide an exactly-once guarantee, which means that every record is processed exactly once, thereby eliminating duplicate processing of data.
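Under the hood, the exactly-once guarantee in both engines is tied to checkpointing. Below is a minimal setup sketch in Scala, assuming both the Spark Streaming and Flink dependencies are available and using a placeholder checkpoint directory; end-to-end exactly-once behaviour also depends on the sources and sinks involved.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    object ExactlyOnceSetup {
      def main(args: Array[String]): Unit = {
        // Spark Streaming: state recovery is driven by checkpointing
        val conf = new SparkConf().setAppName("exactly-once-demo").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(2))
        ssc.checkpoint("/tmp/spark-checkpoints")   // placeholder checkpoint directory

        // Flink: periodic checkpoints with an explicit exactly-once mode
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)   // checkpoint every 5 seconds
      }
    }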

High Throughput and Fault Tolerance overhead:
Both Spark Streaming and Flink provide very high throughput compared to other processing systems like Storm. The overhead of fault tolerance is also low in both processing engines.

Computational Model:
Where Spark Streaming and Flink differ is in their computational model. Spark has adopted a micro-batching model, whereas Flink has adopted a continuous flow, operator-based streaming model.
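A short Scala sketch of the difference, with a socket source on localhost:9999 used purely for illustration: in Spark Streaming the micro-batch interval is fixed when the StreamingContext is created and each transformation runs once per collected batch, whereas in Flink records flow through the operator pipeline one at a time.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.flink.streaming.api.scala._

    object ComputationModelSketch {
      def main(args: Array[String]): Unit = {
        // Spark Streaming: every 2 seconds the collected records form a batch and the pipeline runs on it
        val conf = new SparkConf().setAppName("micro-batch-demo").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(2))
        ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

        // Flink: records are pushed through the same kind of pipeline one by one as they arrive
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.socketTextStream("localhost", 9999)
          .flatMap(_.split(" ")).map((_, 1)).keyBy(0).sum(1).print()

        // ssc.start(); ssc.awaitTermination()   // would launch the Spark job
        // env.execute("continuous-flow-demo")    // would launch the Flink job
      }
    }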

Data Windowing/Batching: 
Spark supports only time-based window criteria, whereas Flink supports windows over time, record counts, sessions, data-driven windows, or any custom user-defined window criteria.
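In Spark Streaming the windowing operations (window, reduceByKeyAndWindow and friends) are all expressed as multiples of the batch interval. On the Flink side, here is a rough sketch of three common window types, again with a placeholder socket source:

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.{ProcessingTimeSessionWindows, TumblingProcessingTimeWindows}
    import org.apache.flink.streaming.api.windowing.time.Time

    object WindowingSketch {
      def main(args: Array[String]): Unit = {
        val env    = StreamExecutionEnvironment.getExecutionEnvironment
        val counts = env.socketTextStream("localhost", 9999).map(word => (word, 1))

        // time window: tumbling 10-second windows per key
        counts.keyBy(0).window(TumblingProcessingTimeWindows.of(Time.seconds(10))).sum(1).print()

        // count window: emit a result after every 100 records per key
        counts.keyBy(0).countWindow(100).sum(1).print()

        // session window: close the window after 5 minutes of inactivity per key
        counts.keyBy(0).window(ProcessingTimeSessionWindows.withGap(Time.minutes(5))).sum(1).print()

        env.execute("windowing-sketch")
      }
    }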

Stream Splitting: 
Flink has a direct API for splitting an input DataStream into multiple streams, whereas in Spark this is not directly possible.
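A minimal sketch of Flink's split/select API on a DataStream, using an illustrative even/odd split:

    import org.apache.flink.streaming.api.scala._

    object SplitSketch {
      def main(args: Array[String]): Unit = {
        val env     = StreamExecutionEnvironment.getExecutionEnvironment
        val numbers = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)

        // split tags each record with one or more output names...
        val tagged = numbers.split(n => if (n % 2 == 0) List("even") else List("odd"))

        // ...and select pulls out the sub-stream carrying a given tag
        tagged.select("even").print()
        tagged.select("odd").print()

        env.execute("split-sketch")
      }
    }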
 
Complex Event Processing:
Flink comes with a complex event processing (CEP) API and has support for event time and out-of-order events, while Spark does not.
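The sketch below focuses on the event-time and out-of-order side of this rather than on the Pattern/CEP API itself; the SensorEvent type and its timestamps are made up for illustration, and the watermarks allow events to arrive up to 5 seconds out of order.

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.windowing.time.Time

    // hypothetical event type carrying its own timestamp
    case class SensorEvent(id: String, timestampMillis: Long, temperature: Double)

    object EventTimeSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        val events = env.fromElements(
          SensorEvent("s1", 1000L, 21.5),
          SensorEvent("s1", 3000L, 22.0),
          SensorEvent("s1", 2000L, 25.0))   // arrives out of order

        // assign event-time timestamps and tolerate up to 5 seconds of out-of-orderness
        val withTimestamps = events.assignTimestampsAndWatermarks(
          new BoundedOutOfOrdernessTimestampExtractor[SensorEvent](Time.seconds(5)) {
            override def extractTimestamp(e: SensorEvent): Long = e.timestampMillis
          })

        // windows are evaluated on event time, so the late record still lands in the right window
        withTimestamps.keyBy(_.id).timeWindow(Time.seconds(10)).maxBy("temperature").print()
        env.execute("event-time-sketch")
      }
    }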

Memory Management: 
Spark provides configurable memory management, while Flink provides automatic memory management. Spark has also moved towards automated memory management (unified memory management) as of version 1.6.0.
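On the Spark side these are ordinary configuration properties, while Flink's automatic memory management needs no equivalent user-facing knobs. A small sketch of the unified memory management settings (the values shown are illustrative, close to the Spark 1.6 defaults):

    import org.apache.spark.SparkConf

    object MemoryConfigSketch {
      def main(args: Array[String]): Unit = {
        // Spark 1.6+ unified memory management: execution and storage share one region,
        // tuned through fractions instead of fixed per-purpose sizes
        val conf = new SparkConf()
          .setAppName("memory-config-demo")
          .set("spark.memory.fraction", "0.75")         // share of the heap for execution + storage
          .set("spark.memory.storageFraction", "0.5")   // part of that region protected for cached data

        // the pre-1.6 behaviour can still be re-enabled explicitly:
        // conf.set("spark.memory.useLegacyMode", "true").set("spark.storage.memoryFraction", "0.6")

        println(conf.toDebugString)
      }
    }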