Data Transformations with Apache Pig

Pluralsight

Course Summary

Pig is an open source engine for executing parallelized data transformations that run on Hadoop. This course shows you how Pig can help you work on incomplete data with an inconsistent schema, or perhaps no schema at all.

Course Description

Pig is open source software that is part of the Hadoop ecosystem of technologies. Pig is well suited to data that falls outside traditional data warehouses: it copes with missing, incomplete, and inconsistent data, including data with no schema. In this course, Data Transformations with Apache Pig, you'll learn how to transform data with Apache Pig. First, you'll start with the basics: installing Pig and getting started with the Grunt shell. Next, you'll discover how to load data into relations in Pig and store transformed results to files via the load and store commands. Then, you'll work on a real-world dataset, analyzing accidents in NYC using collision data from the City of New York. Finally, you'll explore advanced constructs such as the nested foreach, get a brief glimpse into the world of MapReduce, and see how easy it is to implement this construct in Pig. By the end of this course, you'll have a better understanding of data transformations with Apache Pig.

Course Syllabus

Course Overview
- 2m 5s

- Course Overview 2m 5s

Introducing Pig
- 20m 29s

- What You Need to Get Started 2m 29s
- Why Do We Need Data? 3m 1s
- Hive for Analytical Processing 2m 4s
- When Do We Use Apache Pig? 1m 50s
- Pig for Extract, Transform, and Load Operations 3m 47s
- Introducing Pig Latin 3m 19s
- Pig on Hadoop and Other Technologies 3m 56s

Using the GRUNT Shell
- 18m 22s

- Install and Set up Pig on Your Local Machine 4m 59s
- Pig Modes of Operation 3m 50s
- Basic Commands and Configuring Log Messages 4m 1s
- Running Pig Scripts in Batch Mode 2m 14s
- Behind the Scenes of Pig Commands 3m 17s

Loading Data into Relations
- 45m 27s

- The Structure of a Pig Script and the Concept of Relations 5m 15s
- Loading Data from Files and Directories 4m 3s
- Loading Data with Schema 3m 10s
- Storing Relations in Directories 3m 27s
- Case-sensitivity in Pig 1m 26s
- Scalar Data Types 3m 19s
- Complex Data Types: The Tuple 8m 54s
- Complex Data Types: The Bag 5m 3s
- Complex Data Types: The Map 5m 32s
- Working with Partial Schema Specification 5m 14s

Working with Basic Data Transformations
- 36m 26s

- Foreach-generate: Visualization 1m 55s
- Foreach-generate: Indexes and Column Names 3m 41s
- Foreach-generate: Complex Data Types 5m 37s
- Categories of Pig Functions 4m 21s
- Math, String, and Date-time Functions 5m 56s
- The Filter Operation 6m 12s
- Distinct, Limit, and Order By 4m 9s
- The Split Operation 4m 31s

Working with Advanced Data Transformations
- 48m 16s

- Download NYC Collision Data 7m 14s
- Visualize the Group by Operation 2m 37s
- The Group by Operation 4m 43s
- Aggregations on Grouped Data 4m 53s
- Join Operations on Relations 4m 58s
- Types of Joins 5m 12s
- Implement the Left Outer, Self, and Cross Joins 4m 6s
- The Union Operation 2m 48s
- The Union Onschema Operation 6m 44s
- The Flatten Function 4m 56s

Executing MapReduce Using Pig
- 24m 25s

- The Nested Foreach Operation 3m 10s
- Analyze NYC Collision Data Using the Nested Foreach 9m 48s
- An Overview of the MapReduce Programming Model 2m 51s
- Dataflow Through a MapReduce Operation 3m 47s
- MapReduce Operations in Pig Latin 4m 47s

Course Fee:

USD 29

Course Type:	Self-Study
Course Status:	Active
Workload:	1 - 4 hours / week

This course is listed under Open Source, Development & Implementations, Industry Specific Applications, Data & Information Management, Networks & IT Infrastructure and Operating Systems Community