Apache Flink: Another Leap in Big Data Processing
Published on 21 September 2016
Several months ago I blogged about RDDs (Resilient Distributed Datasets), the core concept that enables Apache Spark. Spark saw widespread adoption in Big Data projects because it could process raw data much faster than MapReduce. MapReduce programs are essentially batch jobs, and one of their main disadvantages is the time they take to process data. Apache Spark addressed that drawback: it brought the ability to process streaming data, which made it the tool of choice for a whole new set of use cases.
Internally, Spark divides incoming data records into small batches that are held in memory and then processed, again as a batch; the output stream of records contains the processed results. Because this processing is still ultimately done on a batch, there is a small lag before output is received. So although Spark handles streaming data well and represents a big leap forward from MapReduce, it is still not truly a real-time processing solution.
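To make the micro-batch model concrete, here is a minimal Spark Streaming sketch in Java. The host, port and one-second batch interval are arbitrary choices for illustration, not from any particular project. Every record arriving within a batch interval is collected into one small RDD and processed together, which is exactly where the output lag comes from.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class MicroBatchSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchSketch");
            // Each one-second slice of input becomes one small RDD (a micro-batch).
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            // The count runs once per batch, not once per record, so results
            // appear only after each batch interval closes.
            JavaDStream<Long> counts = lines.count();
            counts.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }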
This is where Flink comes in. Flink was originally developed in Germany, starting in 2009, and became an Apache open source project in 2014; its logo is a squirrel. "Flink" is a German word meaning "nimble" or "agile", and the technology is the next big step forward in Big Data processing, one that is seeing very good adoption rates. This is because Flink offers both batch processing and true record-by-record processing of data streams. The availability of a true stream processing solution opened the doors to a whole new set of use cases across several sectors of industry and government. For example, in the financial sector stream processing is useful in analysing stock trade data and detecting fraud. It can also be applied effectively to analysing online user behaviour in the digital world, where usage may be non-stop and can be analysed further within reference timeframes or between events.
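For contrast, here is a minimal sketch of the same idea with Flink's DataStream API, where each element is handled individually the moment it arrives. The trade format, threshold, host and port are invented for this sketch; no batch interval is involved.

    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RecordAtATimeSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // One line per trade, e.g. "IBM,155.32" (format invented for this sketch).
            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            DataStream<Tuple2<String, Double>> trades = lines.map(
                new MapFunction<String, Tuple2<String, Double>>() {
                    @Override
                    public Tuple2<String, Double> map(String line) {
                        String[] parts = line.split(",");
                        return Tuple2.of(parts[0], Double.parseDouble(parts[1]));
                    }
                });

            // Each trade is inspected individually as it arrives; there is
            // no waiting for a batch interval to close.
            trades.filter(new FilterFunction<Tuple2<String, Double>>() {
                @Override
                public boolean filter(Tuple2<String, Double> trade) {
                    return trade.f1 > 1000.0; // flag unusually large trades
                }
            }).print();

            env.execute("Record-at-a-time sketch");
        }
    }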
Flink also offers iterative processing capabilities at the individual record level, which is useful in machine learning and artificial intelligence applications. One of the iterative operations available is called delta iteration; because it works only on the part of the data that has changed, it processes faster than other iteration operations. This matters in machine learning applications that involve real-time stream processing and need fast response times.
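The sketch below shows the shape of a delta iteration using Flink's batch DataSet API, following the classic connected-components pattern from the Flink documentation. The toy vertex and edge data are made up; the point to notice is that only vertices whose component id actually changed re-enter the next round as the workset, while the rest of the solution set is left alone.

    import org.apache.flink.api.common.functions.FlatJoinFunction;
    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.aggregation.Aggregations;
    import org.apache.flink.api.java.operators.DeltaIteration;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public class DeltaIterationSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Toy data: each vertex starts in its own component; edges are
            // listed in both directions so component ids can propagate both ways.
            DataSet<Tuple2<Long, Long>> vertices = env.fromElements(
                Tuple2.of(1L, 1L), Tuple2.of(2L, 2L), Tuple2.of(3L, 3L), Tuple2.of(4L, 4L));
            DataSet<Tuple2<Long, Long>> edges = env.fromElements(
                Tuple2.of(1L, 2L), Tuple2.of(2L, 1L), Tuple2.of(2L, 3L),
                Tuple2.of(3L, 2L), Tuple2.of(3L, 4L), Tuple2.of(4L, 3L));

            // Solution set = current component per vertex; initial workset = all vertices.
            DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
                vertices.iterateDelta(vertices, 10, 0);

            // Offer each changed vertex's component id to its neighbours, keep the minimum.
            DataSet<Tuple2<Long, Long>> candidates = iteration.getWorkset()
                .join(edges).where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> join(Tuple2<Long, Long> vertex, Tuple2<Long, Long> edge) {
                        return Tuple2.of(edge.f1, vertex.f1);
                    }
                })
                .groupBy(0).aggregate(Aggregations.MIN, 1);

            // Keep a vertex only if its component id shrank: this "delta" both
            // updates the solution set and becomes the next workset.
            DataSet<Tuple2<Long, Long>> changes = candidates
                .join(iteration.getSolutionSet()).where(0).equalTo(0)
                .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
                    @Override
                    public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> current,
                                     Collector<Tuple2<Long, Long>> out) {
                        if (candidate.f1 < current.f1) {
                            out.collect(candidate);
                        }
                    }
                });

            iteration.closeWith(changes, changes).print();
        }
    }

Because unchanged vertices never re-enter the loop, later rounds touch ever smaller worksets, which is where the speed-up over a full (bulk) iteration comes from.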
Other areas in which Flink scores well are memory management and self-optimization. Flink manages its own memory within the JVM to a good extent, reducing garbage collection pressure without much developer involvement. On the optimization front, where Spark needs manual optimization strategies for data partitioning and caching, Flink does its own optimization, determining the best way to run a job given the type of input and the runtime environment. Combined, these capabilities can give Flink better performance than Spark on many workloads.
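One way to see the optimizer at work is to ask Flink for the execution plan it has chosen before running a job. A tiny sketch (the output path is a placeholder):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class PlanSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<String> words = env.fromElements("flink", "spark", "flink");
            // A sink must be defined before the plan can be produced.
            words.distinct().writeAsText("/tmp/distinct-words");

            // Flink's optimizer chooses shipping and execution strategies
            // (e.g. hash partitioning vs. broadcast) on its own; the JSON
            // dump shows what it picked for this job.
            System.out.println(env.getExecutionPlan());
        }
    }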
Like Spark, Flink integrates with Apache Zeppelin for interactive data analytics and visualization. Flink APIs are available for Java and Scala.
So which one is better, Spark or Flink? As usual, the answer is that it depends on the use case. Apache Spark is well suited to applications where micro-batch (RDD) processing is adequate and output can arrive with a small lag. Flink, of course, supports true real-time stream processing, and it also comes with batch processing capability, so it serves both purposes. That is one of the reasons it is being adopted so quickly, and for now we can expect it to continue receiving a fair amount of development support from open source contributors.
This blog is listed under Open Source, Development & Implementations, Data & Information Management and Server & Storage Management.
Comment: Wow, really nice information here about big data. HDFS will be used for storing the large data, right? After the announcements around Flink and Hive this was a really nice update. Thank you for posting this. (CCNA Training in Chennai)

Reply: CCNA Training from Chennai, thanks for your comment. Yes, Flink can ingest data from HDFS, which is very common. But the Flink Streaming API can also read from a streaming source such as RabbitMQ, Twitter or Kafka.
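For instance, a minimal Kafka source looks roughly like this. The topic name and connection settings are placeholders, and the consumer class is the Kafka 0.8-era connector that was current around the time of this post:

    import java.util.Properties;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaSourceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("zookeeper.connect", "localhost:2181"); // needed by the 0.8 consumer
            props.setProperty("group.id", "flink-demo");

            // Each Kafka record arrives as an individual stream element.
            DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer08<>("trades", new SimpleStringSchema(), props));

            stream.print();
            env.execute("Kafka source sketch");
        }
    }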
Comment: Will Flink work with PHP? Or is there any similar tool for PHP?

Reply: At present they are not compatible. But Python APIs have now been made available for Flink, so you could integrate PHP with Flink indirectly via Python. You could also integrate using Java, depending on the nature of the interaction required.