A beginner’s guide to web scraping with Python and Scrapy

Since their inception, websites have been used to share information. Whether it is a Wikipedia article, a YouTube channel, an Instagram account, or a Twitter handle, they are all packed with interesting data that is available to everyone with internet access and a web browser.

But what if we want to get specific data programmatically?

There are two ways to do that:

  1. Using an official API
  2. Web scraping

APIs (Application Programming Interfaces) were introduced to exchange data between different systems in a standard way. Often, however, website owners don’t provide an API. In that case, the only remaining option is to extract the data using web scraping.

Every web page is returned from the server as HTML, which means the data we want sits inside HTML elements. Because those elements are organized in a regular tree of tags, retrieving a specific piece of data is usually straightforward.
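
To make that idea concrete, here is a minimal sketch that pulls the text out of one HTML element using only Python's standard library. The page content and the TitleExtractor class are made up for illustration; the rest of this tutorial uses Scrapy instead, which does this work for us.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside <title>.
        if self._in_title:
            self.title += data


# A hypothetical page, standing in for what a server would return.
page = "<html><head><title>Live scores</title></head><body><p>Hello</p></body></html>"

parser = TitleExtractor()
parser.feed(page)
print(parser.title)  # Live scores
```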

This tutorial is a guide to web scraping with the Python programming language. First, I’ll walk you through some basic examples to get you familiar with web scraping. Later on, we’ll use that knowledge to extract football match data from Livescore.cz.

Getting Started

To get started, create a new Python 3 project and install Scrapy (a web scraping and web crawling library for Python). I’m using pipenv for this tutorial, but you can use pip and venv, or conda.

pipenv install scrapy

At this point you have Scrapy installed, but you still need to create a new web scraping project. Scrapy provides a command-line tool that does this work for us.

Let’s now create a new project named web_scraper using the Scrapy CLI.

If you are using pipenv like me, use:

pipenv run scrapy startproject web_scraper .

Otherwise, from your virtual environment, use:

scrapy startproject web_scraper .

This will create a basic project in the current directory with the following structure:
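
For reference, a project generated by this command typically looks like the tree below. The exact set of files may vary slightly between Scrapy versions, so treat it as a sketch rather than a guarantee.

```
.
├── scrapy.cfg
└── web_scraper
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
```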

Building our first Spider with XPath queries

We will start our web scraping tutorial with a very simple example: locating the logo of the Live Code Stream website inside the page’s HTML. The logo is plain text rather than an image, so we’ll simply extract that text.

The code

To get started, we need to create a new spider for this project. We can do that either by creating a new file or by using the CLI.

Since we already know the code we need, we will create a new Python file at this path: /web_scraper/spiders/live_code_stream.py

Here are the contents of this file.

Code explanation:

  • First of all, we imported the Scrapy library because we need its functionality to create a Python web spider. This spider will then be used to crawl the specified website and extract useful information from it.
  • We created a class and named it LiveCodeStreamSpider. It inherits from scrapy.Spider, which is why scrapy.Spider appears in parentheses in the class definition.
  • Now, an important step is to define a unique name for your spider using a variable called name. You cannot reuse the name of an existing spider, and no spider you create later can reuse this one. It must be unique throughout the project.
  • After that, we passed the website URL using the start_urls list.
  • Finally, we created a method called parse() that locates the logo inside the HTML code and extracts its text. Scrapy offers two built-in ways to find HTML elements inside the source code:
      • CSS
      • XPath

You can even use external libraries such as BeautifulSoup and lxml. For this example, though, we’ve used XPath.
A quick way to determine the XPath of any HTML element is to inspect it in Chrome DevTools. Right-click the element’s HTML code, hover over “Copy” in the context menu that appears, and then click the “Copy XPath” menu item.

Have a look at the screenshot below to understand it better.