on 06 December 18
Several months ago I blogged about RDDs (Resilient Distributed Datasets), the core abstraction that powers Apache Spark. Spark saw widespread adoption in Big Data projects because it could process raw data much faster than MapReduce. MapReduce programs are essentially batch jobs, and one of their main disadvantages is the time they take to process data. Apache Spark addressed that drawback, and it also brought the ability to process streaming data, which made it the tool of choice for a whole new set of use cases.
Internally, Spark handles streams by dividing the incoming records into small batches (micro-batches) that are held in memory and processed as a unit; the output stream contains the processed results. Because each micro-batch is still ultimately processed as a batch, there is a small lag before output appears. So although Spark handles streaming data well and represents a big leap forward from MapReduce, it is still not truly a real-time processing solution.
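The micro-batch idea can be illustrated with a toy sketch in plain Python (no Spark involved; the function name and batch size are made up for illustration): records are buffered into fixed-size batches and each batch is processed as a unit, which is why results arrive with a small lag.

```python
def micro_batch_process(records, batch_size, process_batch):
    """Group incoming records into fixed-size batches and process each batch as a unit."""
    results = []
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:      # batch is full: process it now
            results.extend(process_batch(batch))
            batch = []
    if batch:                             # flush the final partial batch
        results.extend(process_batch(batch))
    return results

# Example: square each record, one micro-batch at a time.
squared = micro_batch_process(range(10), batch_size=4,
                              process_batch=lambda b: [x * x for x in b])
```

No record is emitted until its whole batch is full (or the stream ends), which is the source of the lag described above.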
This is where Flink comes in. Flink was originally developed in Germany in 2009 and became an Apache open source project in 2014. Its logo is the squirrel in the graphic; "flink" is a German word that translates to "nimble" or "agile". Flink is the next big step forward in Big Data processing technology, and it is seeing very good adoption because it offers both batch processing and true record-by-record processing of data streams. The availability of a true stream processing solution opened the doors to a whole new set of use cases across several sectors of industry and government. In the financial sector, for example, stream processing is useful for analysing stock trade data and detecting fraud. It can also be applied to analysing online user behaviour in the digital world, where usage is non-stop and can be analysed within time windows or between events.
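To contrast with the micro-batch model, here is a toy sketch of record-at-a-time processing, again in plain Python rather than the actual Flink API: each event is handled the moment it arrives, with no batching delay. The fraud-style threshold and the event field names are invented for illustration.

```python
def process_stream(events, threshold=10_000):
    """Emit an alert immediately for every trade above the threshold."""
    for event in events:                  # one record at a time, no buffering
        if event["amount"] > threshold:
            yield {"alert": "large_trade", "trade_id": event["id"]}

trades = [
    {"id": 1, "amount": 500},
    {"id": 2, "amount": 25_000},          # exceeds threshold: alert fires
    {"id": 3, "amount": 9_999},
]
alerts = list(process_stream(trades))
```

Because each record is processed independently as it arrives, the alert for trade 2 is produced without waiting for any later records, which is the essential difference from the micro-batch sketch.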
Flink also offers iterative processing capabilities at the individual record level, which is useful in machine learning and artificial intelligence applications. One of the iteration operators available is called delta iteration: because it works only on the part of the data that changed in the previous step, it processes data faster than a plain (bulk) iteration. This is particularly useful in machine learning applications that process real-time data streams and need fast response times.
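The delta-iteration idea can be sketched in a few lines of plain Python (this is a conceptual toy, not Flink's API): instead of recomputing over the whole data set every round, each round processes only a "workset" of elements that changed in the previous round, iterating until nothing changes. Here it is shown with minimum-label propagation on a tiny made-up graph, the classic connected-components example.

```python
def delta_iterate(neighbors, labels):
    """Propagate the minimum label through the graph, revisiting only changed vertices."""
    workset = set(labels)                 # initially, every vertex counts as "changed"
    while workset:
        changed = set()
        for v in workset:
            for n in neighbors.get(v, ()):
                if labels[v] < labels[n]: # a smaller label propagates to the neighbour
                    labels[n] = labels[v]
                    changed.add(n)        # only n re-enters the next round's workset
        workset = changed                 # untouched vertices are never revisited
    return labels

# Two components: {1, 2, 3} and {4, 5}, each edge listed in both directions.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels = delta_iterate(graph, {v: v for v in graph})
```

Each round shrinks the workset to just the vertices whose labels actually changed, so stable parts of the data are never reprocessed; that is the source of the speed-up over a bulk iteration that rescans everything each round.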
Other areas in which Flink scores well are memory management and self-optimization. Flink manages much of its own memory within the JVM, reducing garbage-collection pressure without much developer involvement. On the optimization front, where Spark needs manual optimization work for data partitioning and caching, Flink does its own optimization, determining the best way to run a job given the type of input and the runtime environment. Combined, these capabilities often let Flink outperform Spark.
Like Spark, Flink integrates with Apache Zeppelin for interactive data analytics and visualization. Flink APIs are available for Java and Scala.
So which one is better, Spark or Flink? As usual, the answer is that it depends on the use case. Apache Spark is well suited to applications where micro-batch (RDD) processing is adequate and output arriving with a slight lag is acceptable. Flink, of course, supports real-time stream processing, but it also comes with batch processing capability, so it serves both purposes. This is one of the reasons it is being adopted quickly, and for now we can expect it to continue receiving solid support for further development from open source contributors.