HDPCD:Spark using Python (pyspark)

English | MP4 | AVC 1280×720 | AAC 48KHz 2ch | 11.5 Hours | 1.86 GB

Prepare for Hortonworks HDP Certified Developer - Spark, using Python as the programming language

The course covers the full syllabus of the HDPCD:Spark certification.

  • Python Fundamentals - Basic Python programming required using REPL
  • Getting Started with Spark - Different setup options, setup process
  • Core Spark - Transformations and Actions to process the data
  • Data Frames and Spark SQL - Leverage SQL skills on top of Data Frames created from Hive tables or RDD
  • One month of complimentary lab access
  • Exercises - A set of self-evaluated exercises to test skills for certification purposes
  • After the course, one will gain enough confidence to attempt the certification and pass it.

What Will I Learn?

  • Basics of Python and Spark, and the skills required to take the HDPCD:Spark certification using Python/pyspark with confidence

Table of Contents

1 Introduction
2 Using itversity platforms - Big Data Developer labs and forum

Python Fundamentals
3 Introduction and Setup Python Environment
4 Basic Programming Constructs
5 Functions in Python
6 Python Collections
7 Map Reduce operations on Python Collections
8 Setting up Data Sets for Basic IO Operations
9 Basic IO operations and processing data using Collections
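The map-reduce operations on Python collections covered above (lectures 7 and 9) can be sketched in plain Python. The records below are hypothetical sample data in a retail-style order_items layout, not the course's actual data set:

```python
from functools import reduce

# Hypothetical records: order_item_id,order_id,product_id,quantity,subtotal,price
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
]

# map: extract the subtotal (5th field) as a float from each record
subtotals = list(map(lambda rec: float(rec.split(",")[4]), order_items))

# filter: keep only subtotals of at least 200
large = list(filter(lambda amt: amt >= 200.0, subtotals))

# reduce: total revenue across all records
total_revenue = round(reduce(lambda x, y: x + y, subtotals), 2)

print(subtotals)      # [299.98, 199.99, 250.0]
print(large)          # [299.98, 250.0]
print(total_revenue)  # 749.97
```

The same map/filter/reduce shapes reappear later in the course as Spark transformations and actions, which is why the Python Fundamentals section drills them on plain collections first.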

Spark Getting Started
10 Setup Options
11 Setup using tarball
12 Setup using Hortonworks Sandbox
13 Using labs.itversity.com
14 Using Windows - PuTTY and WinSCP
15 Using Windows - Cygwin
16 HDFS - Quick Preview
17 YARN - Quick Preview
18 Setup Data Sets
19 Curriculum

Core Spark - Transformations and Actions
20 Introduction
21 Problem Statement and Environment
22 Initializing the job using pyspark
23 Resilient Distributed Datasets - Create
24 Resilient Distributed Datasets - Persist and Cache
25 Previewing the data using actions - first, take, count, collect
26 Filtering the Data - Get complete/closed orders
27 Accumulators - Get complete/closed orders with count
28 Converting to key value pairs - using map
29 Joining data sets - join and outer join with examples
30 Get Daily revenue per product id - using reduceByKey
31 Get Daily revenue and count per product id - using aggregateByKey
32 Execution Life Cycle
33 Broadcast Variables
34 Sorting the data
35 Saving the data to the file system
36 Final Solution - Get Daily Revenue per Product
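The key-value aggregation pattern lectures 28-31 build in pyspark (map to (key, value) pairs, then reduceByKey / aggregateByKey) can be sketched on a single machine with plain Python dictionaries. The records are hypothetical, and the helper functions below only mimic the semantics of the Spark API; real pyspark distributes this work across partitions:

```python
# Hypothetical (date, product_id, subtotal) records
records = [
    ("2014-01-01", 957, 299.98),
    ("2014-01-01", 957, 299.98),
    ("2014-01-01", 502, 50.00),
    ("2014-01-02", 502, 150.00),
]

# map: key by (date, product_id), value is the subtotal
pairs = [((d, p), amt) for d, p, amt in records]

def reduce_by_key(pairs, func):
    """Combine values sharing a key, like RDD.reduceByKey."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return out

# Daily revenue per product id (the reduceByKey step, lecture 30)
daily_revenue = reduce_by_key(pairs, lambda x, y: x + y)

def aggregate_by_key(pairs, zero, seq_op):
    """Like RDD.aggregateByKey: seq_op folds each value into an
    accumulator that starts at zero. (Spark also takes a comb_op
    to merge per-partition accumulators; single-machine, we skip it.)"""
    out = {}
    for k, v in pairs:
        out[k] = seq_op(out.get(k, zero), v)
    return out

# Daily revenue AND count per product id (the aggregateByKey step, lecture 31)
revenue_and_count = aggregate_by_key(
    pairs,
    zero=(0.0, 0),
    seq_op=lambda acc, v: (acc[0] + v, acc[1] + 1),
)
```

The accumulator in aggregateByKey can have a different type than the values (here a (revenue, count) tuple), which is exactly why the course reaches for it when a plain reduceByKey no longer fits.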

Spark SQL using pyspark
37 Introduction to Spark SQL and Objectives
38 Different interfaces to run SQL - Hive, Spark SQL
39 Create database and tables of text file format - orders and order_items
40 Create database and tables of ORC file format - orders and order_items
41 Running SQL/Hive Commands using pyspark
42 Functions - Getting Started
43 Functions - String Manipulation
44 Functions - Date Manipulation
45 Functions - Aggregate Functions in brief
46 Functions - case and nvl
47 Row level transformations
48 Joining data between multiple tables
49 Group by and aggregations
50 Sorting the data
51 Set operations - union and union all
52 Analytics functions - aggregations
53 Analytics functions - ranking
54 Windowing functions
55 Creating Data Frames and register as temp tables
56 Write Spark Application - Processing Data using Spark SQL
57 Write Spark Application - Saving Data Frame to Hive tables
58 Data Frame Operations
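The joins, group-bys, and aggregations that lectures 48-50 run through Spark SQL can be previewed with any SQL engine. The sketch below uses Python's built-in sqlite3 as a stand-in for Spark SQL, with hypothetical rows in the orders / order_items layout the course works with:

```python
import sqlite3

# In-memory database with hypothetical rows standing in for the
# course's orders and order_items tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_status TEXT);
CREATE TABLE order_items (order_item_id INTEGER, order_item_order_id INTEGER,
                          order_item_product_id INTEGER,
                          order_item_subtotal REAL);
INSERT INTO orders VALUES (1, '2014-01-01', 'COMPLETE');
INSERT INTO orders VALUES (2, '2014-01-01', 'CLOSED');
INSERT INTO orders VALUES (3, '2014-01-02', 'PENDING');
INSERT INTO order_items VALUES (1, 1, 957, 299.98);
INSERT INTO order_items VALUES (2, 2, 957, 299.98);
INSERT INTO order_items VALUES (3, 3, 502, 50.0);
""")

# Join, filter to complete/closed orders, group by date and product,
# and aggregate daily revenue -- the shape of the course's final query.
rows = conn.execute("""
    SELECT o.order_date, oi.order_item_product_id,
           ROUND(SUM(oi.order_item_subtotal), 2) AS daily_revenue
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_item_order_id
    WHERE o.order_status IN ('COMPLETE', 'CLOSED')
    GROUP BY o.order_date, oi.order_item_product_id
    ORDER BY o.order_date, daily_revenue DESC
""").fetchall()
```

In the course itself the same query runs through sqlContext.sql() against Hive tables or registered temp tables; only the engine changes, not the SQL.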

Exercises or Problem Statements with Solutions
59 Introduction about exercises
60 General Guidelines about Exercises or Problem Statements
61 General Guidelines - Initializing the Job