How to Make Web Scraping Faster – Python Tutorial

Flipnode on May 30 2023


In today's competitive business landscape, staying ahead requires gathering public data quickly and efficiently, often from thousands or even millions of pages. Speed matters: the sooner the data arrives, the sooner you can act on it. However, several obstacles can slow down the retrieval of public information. So how can you make web scraping fast and efficient?

In this article, we explore several techniques for speeding up the collection of public data and provide sample code snippets that you can readily use in your own web scraping projects.

What slows down web scraping

The network delay is a prominent bottleneck in web scraping projects. It takes time to transmit a request to the web server, and once the request is received, there is another delay as the web server sends the response.

While browsing a website, this delay is barely noticeable because we deal with one page at a time. If sending a request and receiving the response takes about a second, browsing a handful of pages still feels fast. However, a web scraping script that has to request ten thousand pages accumulates those delays into roughly 10,000 seconds, or almost three hours spent simply waiting.
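As a quick back-of-the-envelope check (assuming roughly one second per request, as above):

# Rough estimate of cumulative waiting time, assuming ~1 second per request
pages = 10_000
seconds_per_request = 1
print(f"{pages * seconds_per_request / 3600:.1f} hours")  # ~2.8 hours of waiting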

The network delay is just one of the factors that can slow down the web scraping process. In addition to sending requests and receiving responses, your web scraping code also has to process the data it retrieves, and that work can run into I/O-bound or CPU-bound bottlenecks.

I/O bound

An I/O bottleneck refers to a problem associated with the input-output performance of a system and its peripheral devices like disk drives and internet interfaces. Any program that relies on the input-output system, such as reading and writing data, copying files, or downloading files, is considered I/O bound. The delays experienced in such programs are referred to as I/O bound delays.
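A quick way to see this in practice is to compare wall-clock time with CPU time for a single blocking request. This is a minimal sketch using books.toscrape.com as the example URL; most of the elapsed time is spent waiting, not computing:

import time
import requests

start_wall = time.perf_counter()
start_cpu = time.process_time()
requests.get("https://books.toscrape.com/")  # a single blocking request
print(f"wall time: {time.perf_counter() - start_wall:.2f}s")  # dominated by waiting
print(f"CPU time:  {time.process_time() - start_cpu:.2f}s")   # tiny in comparison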

CPU bound

Another scenario occurs when a program is CPU-bound. As the name implies, in this case, the speed at which the code is executed depends on the CPU, which is the central processing unit of a computing device. A faster CPU translates to faster code execution.

A classic example of a CPU-bound application is a task that involves a large number of calculations. For instance, High-Performance Computing (HPC) systems leverage the processing power of multiple processors within the CPU to achieve enhanced computing performance.
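By contrast, a CPU-bound task keeps the processor busy the entire time, so CPU time and wall-clock time are roughly equal. A toy illustration:

import time

start_wall = time.perf_counter()
start_cpu = time.process_time()
total = sum(i * i for i in range(10_000_000))  # pure computation, no I/O
print(f"wall time: {time.perf_counter() - start_wall:.2f}s")
print(f"CPU time:  {time.process_time() - start_cpu:.2f}s")  # roughly matches wall time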

Understanding the distinction between I/O-bound and CPU-bound code is crucial, since the approach to improving a program's performance largely depends on the type of bottleneck encountered.

How do you speed up web scraping in Python?

There are several approaches that can be employed to enhance the scraping speed:

  • Multiprocessing
  • Multithreading
  • Asyncio

Let's begin by examining an unoptimized version of the code so the differences are easy to see. If you prefer video, you can also watch our tutorial on this topic on our YouTube channel.

Web scraping without optimization

We will scrape 1000 books from books.toscrape.com, a dummy book store that is ideal for learning purposes.

Preparation

The first step is to extract all 1000 book links and store them in a CSV file. Please run the provided code to create the links.csv file. Make sure you have installed the requests and Beautiful Soup packages for the code to function correctly.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def fetch_links(url="https://books.toscrape.com/", links=None):
    # Collect every book link, following the "next" pagination link recursively.
    if links is None:
        links = []
    r = requests.get(url)
    print(r.url, flush=True)
    soup = BeautifulSoup(r.text, "html.parser")
    for link in soup.select("h3 a"):
        links.append(urljoin(url, link.get("href")))
    next_page = soup.select_one("li.next a")
    if next_page:
        return fetch_links(urljoin(url, next_page.get("href")), links)
    return links


def refresh_links():
    # Save all collected links to links.csv, one per line.
    links = fetch_links()
    with open("links.csv", "w") as f:
        for link in links:
            f.write(link + "\n")


refresh_links()

The fetch_links function retrieves all the links, and refresh_links() stores the output in a file. We omitted sending the user agent as this is a test site, but you can easily include it using the requests library.
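For example, a custom User-Agent can be sent through the headers argument of requests.get(); the header value below is only a placeholder:

import requests

# Illustrative only: the User-Agent string is a placeholder
headers = {"User-Agent": "my-scraper/1.0"}
r = requests.get("https://books.toscrape.com/", headers=headers)
print(r.status_code)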

Writing an unoptimized web scraper

Our focus is on optimizing the web scraping of 1000 pages using Python.

First, install the requests library using pip:

pip install requests

To keep things simple, we will use regular expressions to extract the title element of the page. Note the get_links function that loads the URLs we saved in the previous step.

import csv
import re
import time

import requests


def get_links():
    # Load the URLs saved in links.csv by the preparation step.
    links = []
    with open("links.csv", "r") as f:
        reader = csv.reader(f)
        for row in reader:
            links.append(row[0])
    return links


def get_response(session, url):
    # Fetch a page and return its <title> element.
    with session.get(url) as resp:
        print('.', end='', flush=True)
        text = resp.text
        exp = r'(<title>).*(</title>)'
        return re.search(exp, text, flags=re.DOTALL).group(0)


def main():
    start_time = time.time()
    with requests.Session() as session:
        results = []
        for url in get_links():
            result = get_response(session, url)
            results.append(result)
            print(result)
    print(f"{(time.time() - start_time):.2f} seconds")


main()

The unoptimized code took approximately 126 seconds to run.

Web scraping using multiprocessing

Multiprocessing involves utilizing multiple processor cores to improve performance. With the multiprocessing module in the Python standard library, we can write code that takes advantage of all available cores.

To begin, import Pool and cpu_count from the multiprocessing module:

from multiprocessing import Pool, cpu_count

There are a couple of changes required in the get_response and main functions:

def get_response(url):
    # Each worker process sends its own request; no shared session.
    resp = requests.get(url)
    print('.', end='', flush=True)
    text = resp.text
    exp = r'(<title>).*(</title>)'
    return re.search(exp, text, flags=re.DOTALL).group(0)


def main():
    start_time = time.time()
    links = get_links()
    coresNr = cpu_count()  # one worker per CPU core
    with Pool(coresNr) as p:
        results = p.map(get_response, links)
    for result in results:
        print(result)
    print(f"{(time.time() - start_time):.2f} seconds")


if __name__ == '__main__':
    main()

The crucial line of code is where we create a Pool. We use the cpu_count() function to dynamically determine the count of CPU cores. This ensures that the code runs on any machine without modification.

In this example, the execution time was approximately 49 seconds. It's a notable improvement compared to the unoptimized code, which took around 126 seconds. However, since our code is I/O bound, the improvement is not as significant as expected. Multiprocessing is more suitable for CPU-bound code. To further enhance the performance of our code, we can explore other methods.
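Since this workload is mostly waiting on the network, one thing worth experimenting with is a pool larger than the core count and an explicit chunksize for p.map(). This is only a sketch reusing get_response() and links from the code above, not a tuned recommendation:

# Sketch: for I/O-bound work, more processes than cores can sometimes help,
# at the cost of extra memory per process; chunksize reduces inter-process overhead.
with Pool(cpu_count() * 2) as p:
    results = p.map(get_response, links, chunksize=10)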

Web scraping using multithreading

Multithreading is an excellent choice for optimizing web scraping code. In simple terms, a thread represents a separate flow of execution. The operating system creates threads and rapidly switches CPU time among them, which creates the illusion of multitasking. You have little control over this switching, however, since the operating system decides when each thread runs.

To optimize our code using multithreading, we can utilize the concurrent.futures module in Python. However, it's important to note that managing threads can become challenging and error-prone as the code complexity increases.

To incorporate multithreading into our code, we only need a few modifications.

First, import ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor

Next, instead of creating a Pool, create a ThreadPoolExecutor:

with ThreadPoolExecutor(max_workers=100) as p:
    results = p.map(get_response, links)

Note that you need to specify the maximum number of workers (threads). The right number depends on how much of the workload is spent waiting on I/O; setting it too high can overwhelm your own machine or the target server, so increase it gradually. A reasonable starting point is sketched below.
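If you are unsure where to begin, the value ThreadPoolExecutor itself defaults to when max_workers is omitted (min(32, os.cpu_count() + 4) on Python 3.8+) is a sensible baseline that you can then raise for heavily I/O-bound scraping:

import os

# Python 3.8+ default when max_workers is omitted: min(32, os.cpu_count() + 4)
baseline_workers = min(32, (os.cpu_count() or 1) + 4)
print(f"baseline: {baseline_workers} workers")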

In this example, the script execution was completed in 7.02 seconds. For comparison, the unoptimized code took around 126 seconds. This represents a significant improvement in performance.

Asyncio for asynchronous programming

Asynchronous coding using the asyncio module allows for context switching controlled by the code itself, making coding easier and less prone to errors. It is particularly suitable for web scraping projects.

To utilize this approach, several changes are necessary. Firstly, the requests library is replaced with the aiohttp library for web scraping in Python. You need to install it separately:

python3 -m pip install aiohttp

Next, import the asyncio and aiohttp modules:

import aiohttp
import asyncio

The get_response() function needs to be transformed into a coroutine, and we'll use the same session for each execution. Optionally, you can include the user agent if required. Note the use of the async and await keywords:

async def get_response(session, url):
    async with session.get(url) as resp:
        text = await resp.text()
        exp = r'(<title>).*(</title>)'
        return re.search(exp, text, flags=re.DOTALL).group(0)
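If you do want to send a user agent here as well, aiohttp accepts a headers dictionary on the session. This is a small sketch; the fetch_with_headers() helper and the User-Agent value are illustrative placeholders:

import aiohttp

# Illustrative sketch: a session-wide User-Agent (placeholder value)
async def fetch_with_headers(url):
    headers = {"User-Agent": "my-scraper/1.0"}
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            return await resp.text()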

The main() function changes significantly as well. It is also defined as a coroutine, and we use aiohttp.ClientSession to create the session object. A task is created for each link with asyncio.create_task, and asyncio.gather then awaits all of them on the event loop:

async def main():
    start_time = time.time()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in get_links():
            tasks.append(asyncio.create_task(get_response(session, url)))
        results = await asyncio.gather(*tasks)
    for result in results:
        print(result)
    print(f"{(time.time() - start_time):.2f} seconds")

To run the main() coroutine, use asyncio.run(main()).
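Putting it together, the script's entry point can look like this:

if __name__ == "__main__":
    asyncio.run(main())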

In this example, the execution time was 15.61 seconds. The asyncio approach, as expected, demonstrated significant improvements compared to the unoptimized script. It does, however, require a different mindset. If you have experience with async-await in any programming language, adapting to this approach for web scraping should not be too challenging.

Conclusion

When businesses attempt to gather substantial amounts of public data, they often face the challenge of slow web scraping. This can result in spending excessive time collecting the required public information, ultimately missing out on the opportunity to analyze it and make informed decisions ahead of market competitors.

The purpose of this article was to explore the factors that contribute to decreased scraping speed and offer several effective methods to address this issue. We discussed web scraping approaches such as multiprocessing, multithreading, and asyncio, comparing their execution times to help you determine the most suitable approach for your specific use case. By implementing the appropriate strategy, you can enhance the efficiency of your web scraping endeavors and gain a competitive edge in the market.
