
Apache Flink: another Leap in Big Data Processing

Published on 21 September 16

Several months ago I blogged about RDDs (Resilient Distributed Datasets), the core abstraction that powers Apache Spark. Spark saw widespread adoption in Big Data projects because it could process raw data much faster than MapReduce. MapReduce programs are essentially batch jobs, and one of their main disadvantages is the time they take to process data. Apache Spark addressed that drawback and added the ability to process streaming data, which made it the tool of choice for a whole new set of use cases.
[Image: the Apache Flink squirrel logo]

On the inside, Spark divides incoming data records into small batches, holds them in memory, and then processes each batch as a unit; the output stream of records contains the processed results. Because the processing ultimately still happens batch by batch, there is a small lag before output appears. So although Spark handles streaming data well and represents a big leap forward from MapReduce, it is not a true real-time processing solution.
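
To make the micro-batch model concrete, here is a minimal Python sketch (plain Python to illustrate the idea, not the actual Spark API; the `micro_batch` function and batch size are purely illustrative). Output only appears once a full batch has accumulated, which is exactly where the lag described above comes from.

```python
def micro_batch(records, batch_size):
    """Simulate micro-batch stream processing: collect records into
    small batches, then process each batch as a single unit."""
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            # The whole batch is processed at once -- output for the
            # first record in the batch waited for the last to arrive.
            yield [x * 2 for x in batch]
            batch = []
    if batch:
        yield [x * 2 for x in batch]  # flush the final partial batch

results = list(micro_batch(range(10), batch_size=4))
# results == [[0, 2, 4, 6], [8, 10, 12, 14], [16, 18]]
```

Note that no output exists for record 0 until records 1 through 3 have also arrived; a true record-at-a-time engine would emit it immediately.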

This is where Flink comes in. Flink was originally developed in Germany in 2009 and became part of the Apache open source project in 2014; its logo is the squirrel in the graphic, and "Flink" is a German word meaning "nimble" or "agile". Flink is the next big step forward in Big Data processing technology and is seeing very good adoption rates because it offers both batch processing and true record-by-record processing of data streams. The availability of a true stream processing solution opens the door to a whole new set of use cases across industry and government. In the financial sector, for example, stream processing is useful for analysing stock trade data and detecting fraud. It can also be applied to analysing online user behaviour in the digital world, where usage may be non-stop and can be further analysed by time window or between events.
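
The fraud-detection use case mentioned above can be sketched as true record-by-record processing: each record is evaluated the moment it arrives, with no batching delay. This is plain Python to illustrate the model, not the Flink DataStream API, and the amount threshold is a stand-in for a real fraud rule.

```python
def process_stream(trades, threshold=10_000):
    """Evaluate each trade individually as it arrives -- the
    record-at-a-time model, in contrast to micro-batching."""
    for trade in trades:
        # Illustrative rule: flag unusually large trades immediately.
        flagged = trade["amount"] > threshold
        yield {**trade, "flagged": flagged}

stream = [{"id": 1, "amount": 500}, {"id": 2, "amount": 25_000}]
flags = [t["flagged"] for t in process_stream(stream)]
# flags == [False, True]
```

Because each record is emitted as soon as it is examined, the large trade is flagged without waiting for any batch boundary.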

Flink also offers iterative processing capabilities at the individual record level, which is useful in machine learning and artificial intelligence applications. One of the available iterative operations, called Delta Iteration, works only on the part of the data that has changed, so it runs faster than iterations that reprocess the entire dataset. This matters in machine learning applications that involve real-time data stream processing and need fast response times.
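
The idea behind Delta Iteration can be illustrated in a few lines of plain Python (a conceptual sketch, not Flink's actual API): a "workset" tracks the elements that changed in the previous round, and the next round recomputes only their neighbours rather than the full dataset. The example propagates minimum component ids through a small graph.

```python
def delta_iterate(solution, edges):
    """Delta-iteration sketch: keep a workset of changed elements and
    only recompute their neighbours each round, instead of touching
    every element every iteration."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    workset = set(solution)            # initially, everything counts as changed
    while workset:
        changed = set()
        for node in workset:
            for nb in neighbours.get(node, ()):
                if solution[node] < solution[nb]:
                    solution[nb] = solution[node]
                    changed.add(nb)    # only changed nodes enter the next round
        workset = changed              # iteration stops when nothing changes
    return solution

comp = delta_iterate({1: 1, 2: 2, 3: 3, 4: 4}, [(1, 2), (2, 3)])
# nodes 1-3 converge to component 1; node 4 stays in its own component
```

After the first few rounds the workset shrinks to just the frontier of changed nodes, which is why this style of iteration converges faster than repeatedly reprocessing the whole dataset.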

Other areas in which Flink scores well are memory management and self-optimization. Flink, to a good extent, manages its own memory and garbage collection within the JVM without needing much developer involvement. On the optimization front, where Spark requires manual optimization strategizing and development for data partitioning and caching, Flink does its own optimization, determining the best way to run a job given the type of input and the runtime environment. Combined, these capabilities let Flink show better performance than Spark on many workloads.

Like Spark, Flink integrates with Apache Zeppelin for interactive data analytics and visualization. Flink APIs are available for Java and Scala.

So which one is better, Spark or Flink? As usual, the answer depends on the use case. Apache Spark is well suited to applications where micro-batch (RDD) processing is adequate and a small lag in output is acceptable. Flink supports true real-time stream processing, and it also comes with batch processing capability, so it serves both purposes. This versatility is one of the reasons it is being adopted rather quickly, and for now we may expect it to continue receiving a fair amount of support and development from open source contributors.


Comments (22 September 16)

Q: Really nice information here about Big Data. HDFS will be used for storing the large data, right? (CCNA Training in Chennai)

A: Thanks for your comment. Yes, Flink can ingest data from HDFS, which is very common. But the Flink Streaming API can also consume data from a streaming source like RabbitMQ, Twitter or Kafka.

Q: Will Flink work with PHP? Or is there any similar tool for PHP?

A: At present they are not compatible. But APIs for Python have now been made available for Flink, so you could integrate PHP with Flink indirectly via Python. You could also integrate using Java, depending on the nature of the interaction required.
