Web Scraping with Scrapy: Python Tutorial

Flipnode on Jun 13 2023

Scrapy is a powerful Python framework for web crawling and web scraping. It gives developers a comprehensive, batteries-included package that handles the crawling plumbing, such as request scheduling, retries, and concurrency, so you can focus on extraction logic. It is specifically designed for large-scale web scraping projects and offers a complete toolkit for data extraction, processing, and storage in the desired format.

In this tutorial, we will guide you through the essential steps of web scraping using Scrapy:

  1. Installation and creation of a Scrapy project.
  2. Extracting product information.
  3. Handling pagination.
  4. Running Scrapy from within a Python script.

By following this tutorial, you will gain a solid understanding of how to effectively utilize Scrapy for web scraping purposes.

How to use Scrapy

In this section, we will guide you on setting up a Scrapy project for web scraping purposes. Creating a Scrapy project in Python involves a simple three-step process.

Step 1: Install Scrapy

To begin, open your Python command terminal and execute the following pip command:

pip install scrapy

The installation process may take a few minutes, depending on your internet connection speed.
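
If the pip command is not on your PATH, the equivalent python -m pip install scrapy works as well. You can confirm that the installation succeeded by checking the installed version:

scrapy version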

Step 2: Create a Scrapy Project

Once Scrapy is successfully installed, you can create a new Scrapy project using the following command:

scrapy startproject <project_name>

Replace <project_name> with the desired name for your project. For instance, running the command below will create a new Scrapy project named "scrapyproject":

scrapy startproject scrapyproject

This command will generate a folder named "scrapyproject" in your current directory and store all the necessary project files inside it.

Step 3: Generate a Spider

To create your first spider, navigate to the "scrapyproject" folder using the cd scrapyproject command. Then, generate a new spider using the following command:

scrapy genspider <spider_name> <url_domain>

Replace <spider_name> with the desired name for your spider and <url_domain> with the target URL domain for web scraping. For example, executing the command below will generate a spider named "books" with the target URL "books.toscrape.com":

scrapy genspider books books.toscrape.com
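
The generated books.py file starts out as a minimal template along these lines (the exact contents vary slightly between Scrapy versions):

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    def parse(self, response):
        pass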

By following these steps, you will have set up your Scrapy project and created a spider for your specific web scraping target.

Scrapy project structure

Each Scrapy project consists of the following files and folders (a typical layout is sketched after the list):

  • Spiders folder: This folder contains the spiders, which define how to extract data from specific websites. Each spider is tailored to target a specific website or a group of websites. Typically, a spider contains rules that govern site navigation and data extraction.
  • items.py: This file defines the items, which are objects representing the data that a spider aims to extract. Items are structured using Python classes and help organize the extracted data in a meaningful format.
  • middlewares.py: This file contains the spider and downloader middleware hooks that sit in the request/response cycle. You can also implement custom proxy middleware in this file.
  • pipelines.py: Once the spider finishes extracting the data, it needs to be processed and stored in a structured manner. Pipelines define a series of processing steps for the extracted data.
  • settings.py: This file contains various configuration settings that control the behavior of the Scrapy framework. For instance, you can set the user agent string, configure the download delay, limit concurrent requests, and manage middleware settings (a short illustrative example follows this list).
  • scrapy.cfg: A plain text file with a set of configuration directives for the project. These directives specify the project name, the location of the spider modules, and the settings module to use when running the web spiders.
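
Taken together, a freshly generated project named "scrapyproject" (with the "books" spider added) typically has the following layout; minor details may differ between Scrapy versions:

scrapyproject/
    scrapy.cfg
    scrapyproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            books.py

To illustrate the kind of options settings.py controls, here is a short example of overrides you might place there; the values are illustrative, not Scrapy's defaults:

BOT_NAME = 'scrapyproject'
USER_AGENT = 'scrapyproject (+https://example.com)'   # identify your crawler; the URL here is a placeholder
DOWNLOAD_DELAY = 1                                     # wait one second between requests
CONCURRENT_REQUESTS = 8                                # cap the number of parallel requests
ROBOTSTXT_OBEY = True                                  # respect the target site's robots.txt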

It is recommended to refer to Scrapy's documentation for further details on the basic project structure. Now, let's explore how to customize the Scrapy project to meet your specific web scraping requirements.

Customizing Scrapy spider

In this section, we will focus on customizing the Scrapy project to scrape book titles and prices from the books.toscrape.com demo bookstore. Before delving into the code, let's examine the structure of our target page.

Take note that the book titles are found in the title attribute of the <a> tag within an <h3> element. This <h3> element is enclosed within an <article> tag that has the product_pod class. Similarly, the book prices are located in a <p> tag with the price_color class.

The current page displays only the first 20 books out of a total of 1000 books. This implies that there are a total of 50 pages. To identify the CSS selector for the link to the next page, let's inspect the bottom of the same page.

The URL for the next page can be found in the href attribute of the <a> tag. This <a> tag is enclosed within an <li> tag with the next class. Remember this information, as we will utilize it in the following section.
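
Before writing the spider, you can sanity-check these selectors in Scrapy's interactive shell. The values below are what the first listing page returned at the time of writing; they may change if the site is updated:

scrapy shell "https://books.toscrape.com/"
>>> response.css('article.product_pod h3 > a::attr(title)').extract_first()
'A Light in the Attic'
>>> response.css('article.product_pod .price_color::text').extract_first()
'£51.77'
>>> response.css('li.next a::attr(href)').extract_first()
'catalogue/page-2.html'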

Scrapy spider customization in action

Open the books.py spider file in your preferred IDE. Replace the existing template script with the following code:

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        url = 'https://books.toscrape.com/'
        yield scrapy.Request(url=url, callback=self.parse_response)

    def parse_response(self, response):
        for selector in response.css('article.product_pod'):
            yield {
                'title': selector.css('h3 > a::attr(title)').extract_first(),
                'price': selector.css('.price_color::text').extract_first()
            }

        next_page_link = response.css('li.next a::attr(href)').extract_first()
        if next_page_link:
            yield response.follow(next_page_link, callback=self.parse_response)

The above script defines two generator methods: start_requests() and parse_response(). Scrapy runs start_requests() automatically whenever a crawl command is issued to this spider; it yields a request for the specified URL, and once the response is downloaded, Scrapy passes it to the parse_response() callback.

The parse_response() generator extracts the desired product information from the response object. It iterates over the product selectors and yields dictionaries containing the title and price. After yielding all 20 products in the current response, it uses the response.follow() method to retrieve the contents of the next page. The method calls back to parse_response() again to extract and yield products from the new page. This cycle continues until there is no next page link available.
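
As an aside, response.follow() accepts the relative URL taken from the href attribute directly and joins it with the current page's URL for you. The manual equivalent below is therefore unnecessary and shown only for comparison:

next_page_url = response.urljoin(next_page_link)                    # build the absolute URL yourself
yield scrapy.Request(next_page_url, callback=self.parse_response)   # same effect as response.follow()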

To execute the Scrapy project, run the following crawl command in the command terminal:

scrapy crawl books

You can also specify a file name using the -o option to write the output to a file. For example:

scrapy crawl books -o out.csv
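
Scrapy infers the export format from the output file's extension, so the same command can just as easily produce JSON or JSON Lines, for example:

scrapy crawl books -o out.json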

Running a Scrapy project from within the Python script

It is often convenient to run a Scrapy project from within a Python script. The following self-contained spider script issues the crawl programmatically and saves the output to the "books_data.csv" file.

import csv
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.signalmanager import dispatcher

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        URL = 'https://books.toscrape.com/'
        yield scrapy.Request(url=URL, callback=self.response_parser)

    def response_parser(self, response):
        for selector in response.css('article.product_pod'):
            yield {
                'title': selector.css('h3 > a::attr(title)').extract_first(),
                'price': selector.css('.price_color::text').extract_first()
            }

        next_page_link = response.css('li.next a::attr(href)').extract_first()
        if next_page_link:
            yield response.follow(next_page_link, callback=self.response_parser)

def book_spider_result():
    books_results = []

    def crawler_results(item):
        books_results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)
    crawler_process = CrawlerProcess()
    crawler_process.crawl(BooksSpider)
    crawler_process.start()
    return books_results

if __name__ == '__main__':
    books_data = book_spider_result()

    keys = books_data[0].keys()
    with open('books_data.csv', 'w', newline='') as output_file:
        writer = csv.DictWriter(output_file, keys)
        writer.writeheader()
        writer.writerows(books_data)

The start_requests() and response_parser() methods remain the same as in our previous code, apart from the callback's name. The if __name__ == '__main__': block serves as the entry point for direct execution: it calls the book_spider_result() function and waits for it to return the scraped data.

The book_spider_result() function works as follows:

  1. It connects the crawler_results() function to the item_scraped signal through the signal dispatcher. Scrapy emits item_scraped whenever the spider scrapes an item from the target.
  2. It creates a crawler process for BooksSpider and starts it.
  3. Each time BooksSpider finishes scraping an item, the item_scraped signal fires and crawler_results() appends that item to the books_results list.
  4. Once the crawler process has finished, book_spider_result() returns the books_results list.

In the __main__ function, the returned books_data is written to the "books_data.csv" file.
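
If all you need is the CSV file, a shorter alternative is to let Scrapy's feed exports write it for you instead of collecting items through signals. A minimal sketch, assuming the same BooksSpider class defined above and Scrapy 2.1 or later (which introduced the FEEDS setting):

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    crawler_process = CrawlerProcess(settings={
        'FEEDS': {'books_data.csv': {'format': 'csv'}},  # write scraped items straight to CSV
    })
    crawler_process.crawl(BooksSpider)
    crawler_process.start()

The signal-based version shown earlier is still useful when you want the scraped items available as Python objects for further processing before saving them.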

The book spider is now self-contained, making it straightforward to execute. Open a command terminal in the project's "spiders" folder and run python books.py; the __main__ block starts the crawl and writes the CSV file. Note that running scrapy runspider books.py would also crawl the site, but because the file is imported rather than executed directly, the __main__ block, and therefore the CSV export, would be skipped.

Conclusion

In conclusion, Scrapy is a powerful Python framework for web crawling and web scraping. This article covered the key steps of setting up a Scrapy project, customizing it for specific scraping needs, and executing it from the command terminal or within a Python script. With its intuitive structure, CSS selectors, and automated navigation capabilities, Scrapy simplifies the process of extracting and processing data from websites. Whether for small or large-scale projects, Scrapy is a valuable tool for developers seeking efficient web scraping solutions.
