Modern Web Scraping with Python using Scrapy and Splash

English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 6 Hours | 2.49 GB

Become an expert in web scraping and web crawling using Python 3, Scrapy and Scrapy Splash

Web scraping has become one of the hottest topics nowadays. There are plenty of paid tools on the market, but they don’t show you how anything is done, and as a consumer you will always be limited to their functionality.

In this course you won’t be a consumer anymore: I’ll teach you how to build your own scraping tool (a spider) using Scrapy.

You will learn:

  • The fundamentals of Web Scraping
  • How to build a complete spider
  • The fundamentals of XPath
  • How to locate content/nodes from the DOM using XPath
  • How to store the data in JSON, CSV… and even in an external database (MongoDB)
  • How to write your own custom Pipeline
  • Fundamentals of Splash
  • How to scrape JavaScript websites using Scrapy Splash
  • The Crawling behavior
  • How to build a CrawlSpider
  • How to avoid getting banned while scraping websites
  • How to build a custom Middleware
  • Web Scraping best practices
  • How to scrape APIs
  • How to use Request Cookies
  • How to scrape infinite scroll websites
  • Host spiders on Heroku for free
  • Run spiders periodically with a custom script
  • Prevent storing duplicated data
  • Deploy Splash to Heroku
  • Write data to Excel files
  • Log in to websites using FormRequest
  • Download Files & Images using Scrapy
  • Use Proxies with Scrapy Spider
  • Use Crawlera with Scrapy & Splash
  • Use Proxies with CrawlSpider

Table of Contents

Introduction – UPDATED –
1 Intro to Web Scraping & Scrapy
2 Setting up the Development Environment – Linux Users
3 Setting up the Development Environment – Windows Users
4 Hello World Scrapy
5 Frequently Asked Questions (Common errors)
6 Where to find all the code !

XPath Selectors
7 XPath Terminology
8 XPath Syntax
9 XPath Axes
10 XPath Predicates

Build a Complete Spider from A to Z
11 Locating, Quotes, Authors and Tags
12 Update: Author is not loading ?
13 Scrapy XPath Selectors
14 Pagination
15 Feed Exporters
16 Items and Item Loader
17 Input and Output Processors
18 Output isn’t showing correctly
19 Final Touches

Writing a Custom Pipeline – Store the Data in MongoDb
20 MongoDb Terminology
21 Setting Up MongoDb on Linux
22 Setting Up MongoDb on Windows
23 Writing the MongoDb Pipeline (UPDATED)

Scraping Javascript Websites using Splash
24 Why using Splash
25 Setting up Splash on Linux
26 Writing Lua Scripts
27 Splash Request
28 Dealing with Pagination

The Crawl Spider
29 The Crawling Behaviour
30 The Crawl Spider Simplified
31 Setting up the Rules
32 Challenge Solution(Building the Parse Method)

Avoid Getting Banned
33 Techniques Used by Website Administrators to Prevent Web Scraping
34 Web Crawling/Scraping Best Practices
35 Custom Middleware (User Agent Rotator Middleware)

Scraping APIs(REST API) – Infinite Scroll Pagination
36 Introduction
37 REST API
38 Working With JSON Objects
39 The Airbnb JSON Object
40 Hidden XHR
41 Airbnb Spider
42 IMPORTANT NOTE
43 Infinite Scroll Pagination
44 Spider Arguments
45 Airbnb code UPDATE (Request Cookies) **NEW
46 Another way to scrape Airbnb restaurant detail page

Hosting spiders for free – Exclusive –
47 Deploy spiders to ScrapingHub cloud
48 Deploy spiders locally
49 Deploy spiders to Heroku
50 The MLab add-on
51 Execute spiders periodically
52 Prevent storing duplicated data
53 Deploy Splash to Heroku
54 Project source code

Writing data to Excel files
55 Introduction to XlsxWriter
56 Setting the Item class
57 Writing data to Excel files(Using a custom Pipeline)
58 Project source code
59 Challenge for those who are adventurous

Scrapy POST requests
60 Login to websites using FormRequest
61 XML Http Post Requests
62 Project source code
63 Code UPDATE XHR repeated data (Assignment)

The Media Pipeline
64 Media Pipelines
65 The Images Pipeline
66 Extending The Images Pipeline (Store images with custom names)
67 Files Pipeline (Article)
68 Challenge (Files Pipeline)
69 Project source code

Paid and Free proxies with Scrapy/Splash
70 Using Crawlera with Scrapy
71 Using Crawlera with Splash
72 Using Heroku as a Proxy (FREE)
73 Using FREE Proxies with the CrawlSpider
74 Challenge
75 Project source code

BONUS
76 Files Pipeline
77 Crawlera GIFT

