English | MP4 | AVC 1920×1080 | AAC 48 kHz 2ch | 2h 29m | 361 MB
Building and deploying data-intensive applications at scale using Python and Apache Spark
Apache Spark is an open-source distributed engine for querying and processing data. This tutorial starts with a brief overview of Spark and its stack, then presents effective, time-saving techniques for leveraging the power of Python in the Spark ecosystem. You will start by gaining a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark.
You'll learn techniques for collecting data and how to distinguish between the different approaches to processing it. Next, we provide an in-depth review of RDDs and contrast them with DataFrames, with examples of how to read data from files and from HDFS, and how to specify schemas through reflection or programmatically (in the case of DataFrames). We describe the concept of lazy execution and outline the transformations and actions available on RDDs and DataFrames.
Finally, we show you how to use SQL to interact with DataFrames. By the end of this tutorial, you will have learned how to process data using Spark DataFrames and mastered data collection techniques for distributed data processing.
Filled with hands-on examples, this course will help you understand RDDs and how to work with them: you will learn about RDD actions and Spark DataFrame transformations, and how to use DataFrames for big data processing.
What You Will Learn
- Learn about Apache Spark and the Spark 2.0 architecture
- Understand schemas for RDDs, lazy execution, and transformations
- Explore sorting and saving elements of RDDs
- Build and interact with Spark DataFrames using Spark SQL
- Create and explore various APIs to work with Spark DataFrames
- Learn how to change the schema of a DataFrame programmatically
- Explore how to aggregate, transform, and sort data with DataFrames
A Brief Primer on PySpark
1 The Course Overview
2 Brief Introduction to Spark
3 Apache Spark Stack
4 Spark Execution Process
5 Newest Capabilities of PySpark 2.0+
6 Cloning GitHub Repository
Resilient Distributed Datasets
7 Brief Introduction to RDDs
8 Creating RDDs
9 Schema of an RDD
10 Understanding Lazy Execution
11 Introducing Transformations – .map(…)
12 Introducing Transformations – .filter(…)
13 Introducing Transformations – .flatMap(…)
14 Introducing Transformations – .distinct(…)
15 Introducing Transformations – .sample(…)
16 Introducing Transformations – .join(…)
17 Introducing Transformations – .repartition(…)
Resilient Distributed Datasets and Actions
18 Introducing Actions – .take(…)
19 Introducing Actions – .collect(…)
20 Introducing Actions – .reduce(…) and .reduceByKey(…)
21 Introducing Actions – .count()
22 Introducing Actions – .foreach(…)
23 Introducing Actions – .aggregate(…) and .aggregateByKey(…)
24 Introducing Actions – .coalesce(…)
25 Introducing Actions – .combineByKey(…)
26 Introducing Actions – .histogram(…)
27 Introducing Actions – .sortBy(…)
28 Introducing Actions – Saving Data
29 Introducing Actions – Descriptive Statistics
DataFrames and Transformations
31 Creating DataFrames
32 Specifying Schema of a DataFrame
33 Interacting with DataFrames
34 The .agg(…) Transformation
35 The .sql(…) Transformation
36 Creating Temporary Tables
37 Joining Two DataFrames
38 Performing Statistical Transformations
39 The .distinct(…) Transformation
Data Processing with Spark DataFrames
40 Schema Changes
41 Filtering Data
42 Aggregating Data
43 Selecting Data
44 Transforming Data
45 Presenting Data
46 Sorting DataFrames
47 Saving DataFrames
48 Pitfalls of UDFs
49 Repartitioning Data