on 09 October 18
The release of the Hadoop framework a few years ago made it practical to work with Big Data, enabling data scientists to derive new business insights from very large volumes of data drawn from multiple sources. Hadoop relies on the MapReduce programming model to take large sets of data and process them across distributed nodes in a fault-tolerant, parallel manner, reducing them to a structured form that is easier to query and work with.
While Hadoop solves the problem of working with Big Data, the MapReduce model operates in batch mode and must read from and write to disk between each map and reduce step, so producing output can take hours, depending on the nature of the reduction computation needed. With the rise of analytics and the growing need for machine learning algorithms that iterate over datasets in much shorter timeframes, Apache Spark emerged as an alternative to Hadoop and MapReduce. By performing computation in memory and avoiding much of the disk I/O, Spark can process data far faster, up to 100 times faster in some workloads.
The Spark framework does this by using a programming abstraction called the RDD, or Resilient Distributed Dataset. The RDD model was introduced in 2012 by a team at the University of California, Berkeley, and Spark was donated to the Apache Software Foundation the following year for further maintenance and enhancement. RDDs are the part of the framework that overcomes the processing-time problem inherent in Hadoop and MapReduce while still offering distributed processing across nodes and fault tolerance. An RDD works by extracting data from storage and partitioning it, on a logical basis, across the memory of multiple parallel computing nodes as objects. In memory, coarse-grained operations (such as group-by and sum) transform the partitioned dataset into progressively smaller sets of relevant data, step by step, until a result is reached on which an action, the required final computation, can be performed.
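The key property of transformations is that they are lazy: they only describe work, and nothing runs until an action forces the whole chain to evaluate. As a minimal sketch of that pattern in plain Python (generator expressions stand in for RDD transformations here; the data and names are invented for illustration):

```python
# A minimal sketch of the transformation/action pattern using plain
# Python generators, which, like RDD transformations, are lazy:
# nothing is computed until a terminal step (the "action") runs.

records = [12, 7, 3, 25, 18, 4]                # stand-in for a partitioned dataset

evens   = (r for r in records if r % 2 == 0)   # transformation: filter
doubled = (r * 2 for r in evens)               # transformation: map

# The action forces evaluation of the whole chain at once.
result = sum(doubled)
print(result)   # (12 + 18 + 4) doubled -> 68
```

In PySpark the equivalent chain would be `rdd.filter(...).map(...)` followed by an action such as `sum()` or `collect()`, with the difference that each step runs in parallel across partitions on the cluster.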
To illustrate this with an example, suppose a retail chain wants to engage with every customer who visited two particular stores and then tweeted about them. To do this it needs the Twitter handle of each of those customers. Given a feed of live and past tweets as a data source, one way of getting this list of handles would be to process the feed as follows.
- Read the feed records.
- Filter the feed to select only those records containing the retail chain's name.
- Filter the reduced selection of records to select those containing the name of the first desired location.
- Further filter the result to select those records that also contain the name of the second location.
- From the final reduced dataset, retrieve all the user handles and write them out as a serialized list, which is handed over to a utility that tweets a thank-you to each user.
In the above set of steps, the first several perform transformations that identify and isolate the relevant tweets, while the last one actually produces the list of usernames and engages with the users. The last step is the action; the preceding ones are the transformations, each of which produces a new RDD. The sequence of steps forms what is called a lineage. In the event of a failure, the lineage allows the required transformations to be re-computed only at the node(s) and partition(s) where the failure occurred.
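The steps above can be sketched as a chain of filters. This is plain Python over a small in-memory sample rather than real Spark code; in PySpark each step would be an `RDD.filter` or `map` transformation with `collect()` as the action, and the sample tweets, store names (`"Downtown"`, `"Riverside"`), and chain name (`"AcmeMart"`) are all invented for illustration:

```python
# Hypothetical sample of a tweet feed: (user_handle, tweet_text) pairs.
feed = [
    ("@ana",  "Loved the AcmeMart Downtown store and the Riverside one too!"),
    ("@ben",  "AcmeMart Downtown has great prices"),
    ("@cara", "Visited AcmeMart Riverside and Downtown today, great service"),
    ("@dan",  "Nothing to do with retail at all"),
]

# Transformations: each step narrows the previous result.
# (In PySpark: feed_rdd.filter(...).filter(...).filter(...),
#  each call producing a new RDD in the lineage.)
mentions_chain  = [t for t in feed if "AcmeMart" in t[1]]
mentions_first  = [t for t in mentions_chain if "Downtown" in t[1]]
mentions_second = [t for t in mentions_first if "Riverside" in t[1]]

# Action: materialize the list of handles to hand to the thank-you utility.
# (In PySpark: mentions_second.map(lambda t: t[0]).collect())
handles = [user for user, _ in mentions_second]
print(handles)   # ['@ana', '@cara']
```

Only the final step produces output; everything before it just narrows the dataset, which is exactly the transformation/action split the lineage records.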
Spark, thanks to the RDD model, therefore provides a good alternative to Hadoop and MapReduce in use cases requiring bulk, synchronous batch processing. It is rapidly making headway as a model of choice and is still evolving, thanks to the large number of developers who actively contribute to it. Its latest release was announced as recently as November 2015.