Every year, the amount of data we need to store and analyze grows significantly. To process data at this scale, we need a technology that can distribute computations across many machines and make them more efficient. Apache Spark is such a technology, allowing us to process big data in a fast, scalable way. If you're looking for a complete, comprehensive source on Apache Spark, then go for this Learning Path.
Packt’s Video Learning Paths are a series of individual video products put together in a logical and stepwise manner such that each video builds on the skills learned in the video before it.
The highlights of this Learning Path are:
- Explore the Apache Spark architecture and delve into its API and key features
- Write code that is maintainable and easy to test
- Get to know the Apache Spark Streaming API and create jobs that analyze data in near real time
Let’s take a quick look at your journey. This Learning Path introduces you to the various components of the Spark framework so you can efficiently process, analyze, and visualize data. You will learn Apache Spark programming fundamentals, such as the Resilient Distributed Dataset (RDD), and see which operations perform transformations and which perform actions on an RDD. You will then learn how to load and save data from various data sources, such as different file formats, NoSQL stores, and RDBMS databases. Moving ahead, you will explore advanced programming concepts such as managing key-value pairs and accumulators. You'll also discover how to create an effective Spark application and execute it on a Hadoop cluster to analyze data and gain insights that inform business decisions.
Moving ahead, you'll learn about data mining and data cleaning, wherein we will look at the input data structure and how input data is loaded. You'll then write actual jobs that analyze data. You'll learn how to handle unbounded, infinite streams of data. Furthermore, you'll look at common problems when processing event streams: sorting, watermarks, deduplication, and keeping state (for example, user sessions). Finally, you'll implement stream processing using Spark Streaming and analyze traffic on a web page in real time.
After completing this Learning Path, you will have a sound understanding of the Spark framework, which will help you in analyzing and processing big data.
About the Authors:
We have combined the best works of the following esteemed authors to ensure that your learning journey is smooth:
Nishant Garg has over 16 years of software architecture and development experience in various technologies, such as Java Enterprise Edition, SOA, Spring, Hadoop, Hive, Flume, Sqoop, Oozie, Spark, YARN, Impala, Kafka, Storm, Solr/Lucene, NoSQL databases (such as HBase, Cassandra, and MongoDB), and MPP databases (such as GreenPlum).
He received his MS in software systems from the Birla Institute of Technology and Science, Pilani, India, and is currently working as a senior technical architect for the Big Data R&D Labs with Impetus Infotech Pvt. Ltd. Nishant has undertaken many speaking engagements on big data technologies and is also the author of Learning Apache Kafka & HBase Essentials, Packt Publishing.
Tomasz Lelek is a software engineer, programming mostly in Java and Scala. He is a fan of microservices architecture and functional programming, and has dedicated considerable time and effort to becoming better every day. He recently dived into big data technologies such as Apache Spark and Hadoop. He has spoken at conferences in Poland, Confitura and JDD (Java Developers Day), as well as at the Krakow Scala User Group, and has also conducted a live coding session at the Geecon Conference.