Web Scraping With RegEx

Flipnode on Jun 21 2023


The demand for digital content has witnessed an exponential surge, leading to intensified competition among websites. As a consequence, websites are constantly evolving and modifying their structures.

While these quick updates benefit general consumers, they pose significant challenges for businesses involved in public data collection. Web scraping relies on routines tailored to a specific website's structure, so frequent updates often break them. This is where RegEx (Regular Expressions) comes to the rescue, simplifying complex parts of the data acquisition and parsing process.

What is RegEx?

Regular Expressions, commonly known as RegEx, is a powerful method used to match specific patterns based on provided combinations. It serves as a useful tool for filtering and extracting desired output by defining and identifying patterns in text data.
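As a quick illustration, here is a minimal pattern match using Python's built-in re module (the same module the tutorial below relies on):

```python
import re

# Match a date pattern: four digits, a dash, two digits, a dash, two digits.
pattern = r"\d{4}-\d{2}-\d{2}"
text = "The report was published on 2023-06-21 and updated later."

match = re.search(pattern, text)
print(match.group())  # 2023-06-21
```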

How to use RegEx for web scraping?

Regular Expressions (RegEx) can validate virtually any combination of characters, including special characters such as line breaks. One practical advantage of RegEx is consistency: a single expression can be applied to input of any size, so pattern checks stay concise and predictable.

Another advantage of Regular Expressions is their universality, as they can be implemented in any programming language, providing flexibility and compatibility across different platforms and systems.

Overview of RegEx tokens

Token          | Matches
---------------|------------------------------------------------
^              | Start of a string
$              | End of a string
.              | Any character (except \n)
|              | Either the expression on its left or right (alternation)
\              | Escapes special characters
Char           | The literal character given
*              | 0 or more of the preceding token
?              | 0 or 1 of the preceding token
+              | 1 or more of the preceding token
{Digit}        | Exactly that many of the preceding token
{Digit,Digit}  | Between that range of the preceding token
\d             | Any digit
\s             | Any whitespace character
\w             | Any word character
\b             | Word boundary
\D             | Inverse of \d
\S             | Inverse of \s
\W             | Inverse of \w
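Several of these tokens can be combined into a single pattern. For example, the pattern below anchors the whole string with ^ and $, uses \w+ for one or more word characters, and \d{2,4} for two to four digits:

```python
import re

# ^ and $ anchor the whole string; \w+ matches one or more word
# characters; \d{2,4} matches between two and four digits.
pattern = r"^\w+-\d{2,4}$"

print(bool(re.match(pattern, "item-2023")))  # True
print(bool(re.match(pattern, "item-1")))     # False (only one digit)
```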

Collecting data using RegEx

In this tutorial, we will use RegEx to scrape book titles and prices from a dummy website built for scraping practice.

Project requirements:

  • The latest version of Python is required for this tutorial.
  • We will be using the Beautiful Soup 4 library to parse HTML.
  • The Requests library will be used to make HTTP requests.

Importing libraries

Let's start by creating a virtual environment for the project:

python3 -m venv scrapingdemo

Activate the newly created virtual environment. Here's an example for Linux:

source ./scrapingdemo/bin/activate

Now, install the necessary Python modules.

Requests is a library used to send requests to websites and retrieve their responses. To install Requests, use the following command:

pip install requests

Beautiful Soup is a module used for parsing and extracting data from HTML responses. To install Beautiful Soup, use the following command:

pip install beautifulsoup4

The 're' module is a built-in Python module used for working with regular expressions.

Next, create an empty Python file, for example, demo.py.

To import the required libraries, add the following lines to your demo.py file:

import requests
from bs4 import BeautifulSoup
import re

Now you're ready to start coding!

Sending the GET request

To send a request to the web page you want to scrape data from using the Requests library, follow these steps. Let's use the example of scraping data from the website "https://books.toscrape.com/":

import requests
page = requests.get('https://books.toscrape.com/')

By executing the above code, a GET request will be sent to the specified URL, and the response will be stored in the page variable. You can then proceed to extract the desired data from the response using techniques such as parsing with Beautiful Soup or applying regular expressions.
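Before parsing, it is worth confirming the request actually succeeded rather than parsing an error page. One way to do this is a small helper (fetch_page is our own name for this sketch, not part of the Requests API):

```python
import requests

def fetch_page(url):
    """Fetch a page and fail loudly on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raises for 4xx/5xx status codes
    return response

# page = fetch_page("https://books.toscrape.com/")
```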

Selecting data

To create a Beautiful Soup object and extract the desired data from the web page, follow these steps:

from bs4 import BeautifulSoup

# Create a Beautiful Soup object
soup = BeautifulSoup(page.content, 'html.parser')

# Find all elements with class 'product_pod'
content = soup.find_all(class_='product_pod')

# Convert the content to a string
content = str(content)

In the above code, the BeautifulSoup constructor is used to create a Beautiful Soup object by passing the page.content (the HTML content of the page) and specifying the parser type as 'html.parser'.

Then, the find_all() method is used to locate all elements with the class 'product_pod'. These elements contain the book titles and prices on the web page.

Finally, the content variable is converted to a string representation using the str() function. Now you can proceed to further process or extract the desired information from the content variable.
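To see these steps in isolation, you can run the same calls against a small inline fragment; the HTML below is an assumed, trimmed-down stand-in for one listing entry on the real page, not its full markup:

```python
from bs4 import BeautifulSoup

# An assumed, trimmed-down stand-in for one listing entry.
html = (
    '<article class="product_pod">'
    '<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" '
    'title="A Light in the Attic">A Light ...</a></h3>'
    '<p class="price_color">£51.77</p>'
    '</article>'
)

soup = BeautifulSoup(html, "html.parser")
content = soup.find_all(class_="product_pod")
print(len(content))  # 1 matching element
content = str(content)
```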

Processing the data using RegEx

To process the acquired data using Regular Expressions (RegEx), follow these steps:

Define the first expression to extract the book titles:

re_titles = r'title="(.*?)">'

This expression captures the data inside double quotes after the text title= in the format title="Titlename".

Define the second expression to extract the book prices:

re_prices = r'£(.*?)</p>'

This expression captures the text between the pound symbol £ and the closing </p> tag, i.e., the book prices.
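To check what the two patterns capture before running them against the whole page, you can test them on a small fragment; the snippet below is an assumed sample of the listing markup:

```python
import re

# An assumed, trimmed-down sample of one listing entry.
snippet = ('<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" '
           'title="A Light in the Attic">A Light in the ...</a></h3>'
           '<p class="price_color">£51.77</p>')

re_titles = r'title="(.*?)">'
re_prices = r'£(.*?)</p>'

print(re.findall(re_titles, snippet))  # ['A Light in the Attic']
print(re.findall(re_prices, snippet))  # ['51.77']
```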

Use the re.findall() function to find all substrings in the content variable that match the defined patterns and save them in the variables title_list and price_list:

title_list = re.findall(re_titles, content)
price_list = re.findall(re_prices, content)

By executing the above code, you will obtain the extracted book titles in the title_list variable and the corresponding prices in the price_list variable.

Saving the output

To save the output, you can modify the code as follows:

# Importing the required libraries.
import requests
from bs4 import BeautifulSoup
import re

# Requesting the HTML from the web page.
page = requests.get("https://books.toscrape.com/")

# Selecting the data.
soup = BeautifulSoup(page.content, "html.parser")
content = soup.find_all(class_="product_pod")
content = str(content)

# Processing the data using Regular Expressions.
re_titles = r'title="(.*?)">'
title_list = re.findall(re_titles, content)
re_prices = r'£(.*?)</p>'
price_list = re.findall(re_prices, content)

# Saving the output.
with open("output.txt", "w") as f:
    for title, price in zip(title_list, price_list):
        f.write(f"{title}\t{price}\n")

This code snippet saves the extracted book titles and prices in the "output.txt" file. It loops over the pairs of titles and prices using the zip() function and writes them to the file with a tab-separated format. Each title and price combination is written on a new line.
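If you prefer a structured format, the same pairs can be written as CSV with the standard library's csv module; the sample lists and the column names below are our own illustration, not output from the scraper:

```python
import csv

# Sample data standing in for the scraped results.
title_list = ["A Light in the Attic", "Tipping the Velvet"]
price_list = ["51.77", "53.74"]

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])          # header row (our own choice)
    writer.writerows(zip(title_list, price_list))
```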

Conclusion

This article provided an explanation of Regular Expressions, their usage, and the functionality of commonly used tokens. Additionally, it demonstrated an example of how Python and Regular Expressions can be utilized for scraping titles and prices from a web page.

