on 07 October 17
Apache Spark is a large-scale data processing engine under active development. It was introduced as a more viable alternative to MapReduce, improving both developer productivity and application performance. When Spark is combined with its related projects, such as Spark Streaming, Spark MLlib, and Spark SQL, the result is very impressive. Spark Streaming in particular adds the ability to continuously compute transformations over incoming data.
This article sheds some light on how a Spark application works in real life. Our focus is a simple Spark word count application in three languages that Spark supports: Scala, Java, and Python. The Scala and Java versions were originally written for Cloudera tutorials. The story is divided into three phases of Spark application development: writing the application, compiling and packaging the Scala and Java applications, and finally running the application.
Writing the Application-
While writing the application, our prime focus is to learn the distribution of letters in the most popular words in a corpus. The steps for writing the application are as follows-
• The first thing is to create a SparkConf and SparkContext. A Spark application corresponds to an instance of the SparkContext class. When you are running a shell, the SparkContext is created for you.
• Get a word frequency threshold and read in an input set of text documents.
• Count the number of times each word appears.
• Filter out every word that appears fewer times than the threshold.
• For the remaining words, count the number of times each letter occurs.
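The steps above can be sketched in plain Python on a single machine; the Spark versions distribute exactly these steps across a cluster. The function name, sample input, and threshold value here are illustrative, not taken from the tutorials:

```python
from collections import Counter

def letter_distribution(lines, threshold=2):
    # Count the number of times each word appears.
    word_counts = Counter(word for line in lines for word in line.split())
    # Filter out every word that appears fewer times than the threshold.
    frequent = [w for w, n in word_counts.items() if n >= threshold]
    # For the remaining words, count the number of times each letter occurs.
    return Counter(ch for word in frequent for ch in word)

# Illustrative input: each line holds the words of one "document".
docs = ["spark spark hadoop", "spark streaming"]
print(letter_distribution(docs, threshold=2))
```

Only "spark" meets the threshold in this toy input, so the result is the letter counts of that single word.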
In MapReduce, this would require two separate MapReduce applications, as well as persisting the intermediate data to HDFS between them. In Spark, the same application needs roughly 90 percent fewer lines of code.
Compiling and Packaging the Scala and Java Applications-
To keep things simple, use Maven to compile and package both the Java and Scala programs. Many other build tools can build a Spark application, but Maven is a practical choice for getting started. To compile and package the Scala version, you additionally need to include the Scala tools plugin.
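As a minimal sketch, the Scala build section of the pom.xml could look like the following, assuming the commonly used scala-maven-plugin; the plugin coordinates and version should be checked against your Maven repository:

```xml
<build>
  <plugins>
    <!-- Compiles the Scala sources before Maven packages the jar -->
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With this in place, a plain `mvn package` compiles the Scala sources and produces the application jar.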
Running the Application-
The input to the application is a large text file in which each line contains all the words of a document, stripped of punctuation. Put this input file in a directory in HDFS.
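Uploading the input might look like the following, using the standard Hadoop filesystem shell; the directory and file names are illustrative:

```shell
# Create a target directory in HDFS and upload the input file.
hdfs dfs -mkdir -p /user/hdfs/wordcount/input
hdfs dfs -put input.txt /user/hdfs/wordcount/input
```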
Now you are free to run each application by means of spark-submit-
For Scala, run the application as a local process with a threshold of 2. The Java version runs the same way in local mode with parallel threads, and the Python version runs on YARN, again with a threshold of 2.
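As a sketch, the three spark-submit invocations could look like this; the class names, jar names, script name, and input path are all illustrative assumptions, not the exact artifacts from the tutorials:

```shell
# Scala: run as a local process with a word frequency threshold of 2
spark-submit --class com.example.SparkWordCount --master local \
  sparkwordcount.jar /user/hdfs/wordcount/input 2

# Java: run locally with several parallel threads
spark-submit --class com.example.JavaWordCount --master "local[4]" \
  javawordcount.jar /user/hdfs/wordcount/input 2

# Python: run on YARN, also with a threshold of 2
spark-submit --master yarn --deploy-mode client \
  wordcount.py /user/hdfs/wordcount/input 2
```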