Web Scraping with Rust

Flipnode on Jun 13 2023

Rust is gaining popularity as a high-performance programming language, particularly for web scraping tasks where speed matters. Rust has a steeper learning curve than Python, but that doesn't make scraping with it impossible or excessively difficult. With the right guidance, you can overcome the initial hurdles and harness the power of Rust for web scraping.

In this tutorial, we will guide you through the process of building a Rust scraper that extracts product data from an e-commerce store. By following this practical example, you can quickly get started with Rust and begin your web scraping journey with confidence.

Installing and running Rust

This section walks you through installing Rust on Windows, macOS, and Linux.

Installing Rust on Windows

The installation steps are the following:

  1. Visit the official Rust website at https://www.rust-lang.org/tools/install.
  2. On the website, select the appropriate download for your operating system.
  3. If you are using Windows, click on the "Download RUSTUP-INIT (64-bit)" button.
  4. Before installing Rust, make sure to install the Visual Studio C++ Build tools.
  5. Once the Visual Studio C++ build tools are installed, run the rustup-init executable that you downloaded.
  6. The utility will open a command prompt window and, if the Visual Studio C++ build tools are still missing, prompt you to install them. Press 'y' to continue.
  7. Review the installation information on the next screen and press '1' to proceed with the installation.
  8. After the installation is complete, close the command prompt and open a new one to ensure that all environment variable changes take effect.
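
To confirm that the installation worked, you can run the following commands in the new command prompt; both should print a version number:

$ rustc --version
$ cargo --version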

Installing Rust on macOS and Linux

For macOS and Linux, we recommend using the rustup utility to install Rust rather than a package manager such as Homebrew. Follow these steps:

  1. Visit the official Rust website at https://www.rust-lang.org/tools/install.
  2. On the website, you will see the installation instructions for macOS and Linux.
  3. Copy the cURL command provided on the page to download and install the rustup utility (the command at the time of writing is shown after this list).
  4. Open a terminal on your macOS or Linux system and paste the cURL command.
  5. Press Enter to run the command and proceed with the installation.
  6. You will be presented with a confirmation screen. Review the information and press '1' to proceed with the installation.
  7. After the installation is complete, close the terminal and open a new one to ensure that all environment variable changes take effect.
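
For reference, at the time of writing the command shown on the page is the following (check the website for the current version before running it):

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh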

Rust scraper for scraping book data

To get started with web scraping in Rust, we will create a practical project using the popular website https://books.toscrape.com/. This dummy bookstore serves as an ideal platform for learning web scraping techniques.

Setup

Setting up the project is the initial step. Open your terminal or command prompt and follow these instructions:

Create a new Rust project by executing the following command in the terminal:

$ cargo new book_scraper

This command initializes a new project named "book_scraper" and generates the necessary files and folders, including Cargo.toml and the main.rs file in the src folder.

Open the project folder in a text editor or IDE of your choice. If you're using Visual Studio Code, consider installing the "rust-analyzer" extension for an enhanced coding experience.

In the Cargo.toml file, add the following lines under the [dependencies] section:

reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.13.0"

These lines define the dependencies we need for the project—reqwest and scraper. We will explore their usage later.
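
After this edit, the [dependencies] section of your Cargo.toml should look like this:

[dependencies]
reqwest = { version = "0.11", features = ["blocking"] }
scraper = "0.13.0"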

Return to the terminal and execute the following command to download the dependencies and compile the code:

$ cargo build

Upon successful compilation, you will see a message indicating that the build has finished.

Run the compiled code with the following command:

$ cargo run

The program will execute, and you will see the output in the terminal. In this case, the output should be "Hello, world!"

Note: The executable file is generated in the path ./target/debug/book_scraper. If you are using Windows, the file name will be .\target\debug\book_scraper.exe.

Making an HTTP request

To perform HTTP requests, such as GET or POST, in Rust, we rely on the reqwest library, which offers a convenient solution. The library provides two types of HTTP clients: an asynchronous client and a blocking client.

For the purpose of this tutorial, we will focus on using the blocking client to simplify the learning process. To ensure we have the necessary features enabled, we have specified in the Cargo.toml file that we require the blocking feature from the reqwest library.

Here's an example code snippet that demonstrates sending a GET request using the blocking client:

fn main() {
    let url = "https://books.toscrape.com/";
    let response = reqwest::blocking::get(url).expect("Could not load URL.");
    let body = response.text().unwrap();
    print!("{}", body);
}

In the code above, we first define the target URL in the url variable. Then, we send a GET request to the specified URL using the blocking HTTP client from the reqwest library. The response is stored in the response variable.

Next, we extract the HTML content from the response and store it in the body variable. Finally, we print the contents of the body variable.

After saving the main.rs file, navigate to the terminal and run the following command:

$ cargo run

This will execute the program, and the output will be the entire HTML content of the specified URL, which will be printed in the terminal.

Parsing HTML with Rust scraper

To build a web scraper in Rust, we need to utilize the scraper library. This library allows us to use CSS selectors to extract specific HTML elements.

If you haven't already done so, add the following line to your Cargo.toml file under dependencies:

scraper = "0.13.0"

Open the main.rs file and add the following line at the top:

use scraper::{Html, Selector};

This line imports the necessary modules from the scraper library. We can then parse the web page using the parse_document function:

let document = Html::parse_document(&body);

This line takes the raw HTML, extracted using the reqwest Rust library, and parses it into a document object, which is stored in the document variable.

The parsed HTML document can now be queried using CSS selectors to locate the desired HTML elements. We can break this process into three steps: locating products via CSS selectors, extracting product descriptions, and extracting product links.

To locate products, we need to identify the CSS selectors that contain information related to each product. In our example, the product is a book.

Open the target website in your browser and examine the HTML markup. Identify the CSS selector that selects a book. For example, you might find that the selector article.product_pod selects a book.
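
To give a rough idea of what the scraper sees, the markup for a single book looks roughly like this (a simplified, hypothetical excerpt; the live page contains additional elements and attributes):

<article class="product_pod">
    <h3>
        <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>
    </h3>
    <p class="price_color">£51.77</p>
</article>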

Next, add the following line in the main function:

let book_selector = Selector::parse("article.product_pod").unwrap();

Now, the selector is ready to be used. Within the main function, add the following lines:

for element in document.select(&book_selector) {
    // More code here
}

You can now apply additional CSS selectors inside the loop to extract information about each book.

To extract product descriptions, you need to identify the appropriate CSS selectors that target the desired elements containing the descriptions. You can create new selectors and use them inside the loop to extract the necessary information.

Similarly, to extract product links, identify the CSS selectors that select the elements containing the links and create a selector for them. Use this selector inside the loop to extract the links.

By applying CSS selectors and iterating over the selected elements, you can extract the desired information for each book.

Feel free to add more code inside the loop to extract additional information or perform any necessary operations on the scraped data.
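
Putting the steps so far together, a minimal sketch of main.rs at this point might look like this (the loop body is filled in over the following sections):

use scraper::{Html, Selector};

fn main() {
    // Download the page and read its HTML.
    let url = "https://books.toscrape.com/";
    let response = reqwest::blocking::get(url).expect("Could not load URL.");
    let body = response.text().expect("Could not read response body.");

    // Parse the HTML and select every book container.
    let document = Html::parse_document(&body);
    let book_selector = Selector::parse("article.product_pod").unwrap();

    for element in document.select(&book_selector) {
        // Per-book selectors will be applied here (see the following sections).
        let _ = element;
    }
}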

Extracting product description

By iterating over HTML elements that serve as containers for each product, we can easily create reusable web scraping code.

In this example, we will extract the product name and price.

First, create two selectors before the for loop:

let book_name_selector = Selector::parse("h3 a").unwrap();
let book_price_selector = Selector::parse(".price_color").unwrap();

Within the for loop, apply these selectors to each individual book:

for element in document.select(&book_selector) {
    let book_name_element = element.select(&book_name_selector).next().expect("Could not select book name.");
    let book_name = book_name_element.value().attr("title").expect("Could not find title attribute.");
    let price_element = element.select(&book_price_selector).next().expect("Could not find price");
    let price = price_element.text().collect::<String>();
    println!("{:?} - {:?}", book_name, price);
}

Note the following:

The book name is stored in the title attribute of the <a> element.

The price is present in the text of the element.

Save the files and execute the following command in your terminal:

$ cargo run

This will print the book names and prices on the terminal.
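
The output should look roughly like the following (the exact titles and prices depend on the live page):

"A Light in the Attic" - "£51.77"
"Tipping the Velvet" - "£53.74"
...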

Extracting product links

You can extract the product links in a similar manner. Create a selector outside the for loop as shown below:

let book_link_selector = Selector::parse("h3 a").unwrap();

Within the for loop, add the following lines:

let book_link_element = element.select(&book_link_selector).next().expect("Could not find book link element.");
let book_link = book_link_element.value().attr("href").expect("Could not find href attribute");
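
Note that the href values on this page are relative paths rather than full URLs. If you need absolute links, one optional approach (not required for the rest of this tutorial) is to join them with the base URL:

// url ends with a slash, so simple concatenation yields a valid absolute URL.
let absolute_link = format!("{}{}", url, book_link);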

Now, you have extracted all the necessary values. You can print them to the console or save them to a CSV for better organization.

Writing scraped data to a CSV file

To create a CSV file in your web scraping project, you can use the CSV Rust library. Here's how you can modify your code to achieve this:

Add the following line to your Cargo.toml under dependencies:

csv = "1.1"

Before the for loop, create a CSV writer as follows:

let mut wtr = csv::Writer::from_path("books.csv").expect("Could not create file.");

Optionally, write the headers before the for loop:

wtr.write_record(&["Book Name", "Price", "Link"]).expect("Could not write header.");

Within the for loop, write each record to the CSV file:

wtr.write_record([book_name, price.as_str(), book_link]).expect("Could not write record.");

Finally, flush the writer after the for loop to ensure all buffered data is written to the file:

wtr.flush().expect("Could not flush file.");

Here's the modified main.rs file:

use scraper::{Html, Selector};

fn main() {
    // Existing code omitted for brevity

    let mut wtr = csv::Writer::from_path("books.csv").expect("Could not create file.");
    wtr.write_record(&["Book Name", "Price", "Link"]).expect("Could not write header.");

    for element in document.select(&book_selector) {
        // Existing code omitted for brevity

        wtr.write_record([book_name, price.as_str(), book_link]).expect("Could not write record.");
    }

    wtr.flush().expect("Could not flush file.");

    println!("Done");
}

With these modifications, the program will write the scraped data to a CSV file named "books.csv" in the same directory.
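
For reference, here is one way the complete main.rs could look with all of the pieces above combined. This is a minimal version of the code built up in this tutorial; adjust the selectors and error handling to your needs:

use scraper::{Html, Selector};

fn main() {
    // Download the page HTML with the blocking reqwest client.
    let url = "https://books.toscrape.com/";
    let response = reqwest::blocking::get(url).expect("Could not load URL.");
    let body = response.text().expect("Could not read response body.");

    // Parse the HTML into a queryable document.
    let document = Html::parse_document(&body);

    // CSS selectors for the book container and its individual fields.
    let book_selector = Selector::parse("article.product_pod").unwrap();
    let book_name_selector = Selector::parse("h3 a").unwrap();
    let book_price_selector = Selector::parse(".price_color").unwrap();
    let book_link_selector = Selector::parse("h3 a").unwrap();

    // CSV writer with a header row.
    let mut wtr = csv::Writer::from_path("books.csv").expect("Could not create file.");
    wtr.write_record(&["Book Name", "Price", "Link"]).expect("Could not write header.");

    for element in document.select(&book_selector) {
        // Book name is stored in the title attribute of the <a> element.
        let book_name_element = element.select(&book_name_selector).next().expect("Could not select book name.");
        let book_name = book_name_element.value().attr("title").expect("Could not find title attribute.");

        // Price is the text content of the .price_color element.
        let price_element = element.select(&book_price_selector).next().expect("Could not find price");
        let price = price_element.text().collect::<String>();

        // Link is the href attribute of the same <a> element.
        let book_link_element = element.select(&book_link_selector).next().expect("Could not find book link element.");
        let book_link = book_link_element.value().attr("href").expect("Could not find href attribute");

        wtr.write_record([book_name, price.as_str(), book_link]).expect("Could not write record.");
    }

    wtr.flush().expect("Could not flush file.");
    println!("Done");
}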

Conclusion

In this article, we explored the process of creating a web scraper using Rust. We delved into the utilization of CSS selectors in web scraping, leveraging the powerful Rust Scraper library for this purpose.
