Automated Web Scraper With Python AutoScraper [Guide]

Flipnode on Jun 14 2023


If you're interested in automating the regular scraping of public web data, you're in the right spot. This tutorial will walk you through the process of automating web scraping using AutoScraper, one of the Python libraries dedicated to web scraping.

Before we begin, it might be helpful to explore a comprehensive guide on building an automated web scraper using different web scraping tools supported by Python.

Now, let's dive right into it.

Automated web scraping with Python AutoScraper library

AutoScraper is a lightweight and user-friendly web scraping library written in Python 3. It is designed to be accessible to beginners, requiring minimal knowledge of web scraping techniques.

With AutoScraper, you can provide the URL or HTML of any website and let it automatically learn the scraping rules. It intelligently matches and extracts data based on these rules, making the scraping process efficient and effortless.

To install AutoScraper, we can use the Python package index (PyPI) repository and execute the following pip command:

pip install autoscraper

Now, let's explore an example of using AutoScraper to scrape data from the Books to Scrape website. This website contains a vast collection of books across various categories.

Note: The example assumes that you have installed AutoScraper as per the instructions above.

Scraping books category URLs

To scrape the links to all the category pages from the provided URL, you can use the following code snippet:

from autoscraper import AutoScraper

url_to_scrape = "https://books.toscrape.com/"
wanted_list = ["https://books.toscrape.com/catalogue/category/books/travel_2/index.html"]

scraper = AutoScraper()
scraped_data = scraper.build(url=url_to_scrape, wanted_list=wanted_list)
print(scraped_data)

In the code, we import AutoScraper from the autoscraper library. Then, we specify the url_to_scrape variable with the URL from which we want to scrape the information.

The wanted_list variable is assigned with a sample data element that represents the desired information we want to scrape. In this case, we provide a single link to the Travel category page as a sample data element.

By creating an AutoScraper object using scraper = AutoScraper(), we can utilize various functions of the autoscraper library.

The build() method is used to scrape data similar to the wanted_list from the target URL.

After executing the Python script, the scraped_data list will contain all the category page links available on the Books to Scrape homepage. The output of the script will display the scraped data, including links to different category pages.

Scraping book information from a single webpage

In order to scrape specific data from a book page, we can utilize AutoScraper. Let's take a look at an example of how to train and build an AutoScraper model to extract the title and price of a book:

from autoscraper import AutoScraper

url_to_scrape = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
wanted_list = ["It's Only the Himalayas", "£45.17"]

info_scraper = AutoScraper()
info_scraper.build(url=url_to_scrape, wanted_list=wanted_list)

In the code snippet above, we provide the URL of the book page and a sample of the desired information (title and price) to the AutoScraper model. The build() method trains the model to scrape the required information based on the provided URL and wanted list.

Now, let's apply this info_scraper to a different book's URL and check if it returns the desired information:

another_book_url = 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'

scraped_data = info_scraper.get_result_similar(another_book_url)
print(scraped_data)

The code above applies the info_scraper model to another_book_url and prints the scraped data. Please note that the get_result_similar() method may return additional information along with the desired data, as it retrieves any information similar to the wanted list.

To obtain the exact desired information, we can use the get_result_exact() method:

another_book_url = 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'

scraped_data = info_scraper.get_result_exact(another_book_url)
print(scraped_data)

By employing the get_result_exact() method, we ensure the accurate retrieval of the book title and price in the defined order specified by the wanted list.
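Because the result comes back in the wanted-list order, you can unpack it directly into named variables. A small helper (my own, not part of AutoScraper) can then turn the scraped price string into a number for further processing:

```python
def parse_price(price: str) -> float:
    """Convert a scraped price string such as '£45.17' into a float."""
    # Strip a leading currency symbol before converting.
    return float(price.lstrip("£$€"))

# Shape of a two-item get_result_exact() result: [title, price].
title, price = ["It's Only the Himalayas", "£45.17"]
print(title, parse_price(price))
```

This keeps downstream steps (sorting, filtering by price) working on numbers instead of strings.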

Scraping all the books in a specific category

Now that we have learned how to extract similar and exact information from a specific webpage, including URLs, let's explore how to scrape data from all the books within a specific category. We can achieve this by using two scrapers: one for extracting the URLs of all the books in the category and another for scraping information from each individual book link.

Let's put this strategy into action with the following Python script:
from autoscraper import AutoScraper
import pandas as pd

# BooksUrlScraper section
TravelCategoryLink = 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html'
WantedList = ['https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html']

BooksUrlScraper = AutoScraper()
BooksUrlScraper.build(url=TravelCategoryLink, wanted_list=WantedList)

# BookInfoScraper section
BookPageUrl = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
WantedList = ["It's Only the Himalayas", "£45.17"]

BookInfoScraper = AutoScraper()
BookInfoScraper.build(url=BookPageUrl, wanted_list=WantedList)

# Scraping info of each book and storing it into an Excel file
BooksUrlList = BooksUrlScraper.get_result_similar(TravelCategoryLink)
BooksInfoList = []
for Url in BooksUrlList:
    book_info = BookInfoScraper.get_result_exact(Url)
    BooksInfoList.append(book_info)

df = pd.DataFrame(BooksInfoList, columns=["Book Title", "Price"])
df.to_excel("travel_books.xlsx", index=False)

In the script above, there are three main parts: two sections for building the scrapers and a third section to scrape data from all the books in the Travel category and save it as an Excel file.

First, we build BooksUrlScraper to extract all the similar book links on the Travel Category page. These links are stored in the BooksUrlList. Then, for each URL in BooksUrlList, we apply BookInfoScraper to extract the desired information and append it to the BooksInfoList. Finally, the BooksInfoList is converted into a data frame and exported as an Excel file for future use.

The output will include the book titles and prices, reflecting the accomplishment of our initial goal to scrape this information from all eleven books in the Travel category.

Now, armed with the knowledge of using multiple AutoScraper models in combination, you can adapt the script above to scrape books from various categories and save them in separate Excel files for each category.
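One way to sketch that adaptation (the function and file-naming helper are my own; the two scraper arguments are the AutoScraper models built in the script above):

```python
def category_filename(name: str) -> str:
    """Map a category name like 'Historical Fiction' to a safe Excel file name."""
    return name.strip().lower().replace(" ", "_") + ".xlsx"

def scrape_categories(url_scraper, info_scraper, categories):
    """For each category-name -> category-URL pair, collect the book links,
    extract title and price for each book, and save one Excel file per
    category. pandas is imported here so the filename helper above stays
    importable on its own."""
    import pandas as pd
    for name, link in categories.items():
        book_urls = url_scraper.get_result_similar(link)
        rows = [info_scraper.get_result_exact(url) for url in book_urls]
        pd.DataFrame(rows, columns=["Book Title", "Price"]).to_excel(
            category_filename(name), index=False
        )
```

Called with a dictionary such as {"Travel": TravelCategoryLink}, this writes travel.xlsx, and so on for every other category you add to the mapping.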

How to use AutoScraper with proxies

Using proxies is crucial in web scraping to mitigate risks such as IP blocking by the target website. AutoScraper provides support for using proxies, and the build function accepts request-related arguments through the request_args parameter.

Here's an example of how to use AutoScraper with proxy IPs:

from autoscraper import AutoScraper

UrlToScrap = "https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html"
WantedList = ["It's Only the Himalayas", "£45.17"]

proxy = {
    "http": "http://PROXY_ENDPOINT_HERE",
    "https": "http://PROXY_ENDPOINT_HERE",
}

InfoScraper = AutoScraper()
InfoScraper.build(url=UrlToScrap, wanted_list=WantedList, request_args={"proxies": proxy})

In the script above, PROXY_ENDPOINT_HERE should be replaced with the actual address of the proxy server in the correct format (e.g., http://host:port). By including the proxy information in the request_args dictionary, AutoScraper will make its requests through the specified proxy.

Ensure that you provide valid and working proxy endpoints in the proxy dictionary for the script to function correctly.
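To reduce copy-paste mistakes when filling in endpoints, a tiny helper (my own, not part of AutoScraper) can build the proxies mapping from a single endpoint string:

```python
def make_proxies(endpoint: str) -> dict:
    """Build the `proxies` mapping that requests (and therefore AutoScraper's
    request_args) expects, routing both HTTP and HTTPS through one endpoint."""
    # Accept bare host:port values by defaulting to the http:// scheme.
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "http://" + endpoint
    return {"http": endpoint, "https": endpoint}

print(make_proxies("127.0.0.1:8080"))
```

The result can be passed straight to build() via request_args={"proxies": make_proxies(...)}.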

Saving and loading an AutoScraper model

AutoScraper offers a convenient feature to save and load pre-trained scrapers. You can use the following script to save an InfoScraper object to a file:

InfoScraper.save('file_name')

Likewise, you can load a saved scraper using:

SavedScraper = AutoScraper()
SavedScraper.load('file_name')

By saving the scraper, you can reuse it later without needing to rebuild it from scratch. This is particularly useful when you want to apply the same scraping rules to different websites or when you want to save and share your scraping configuration with others.

Now that we have covered the process of building an automated web scraper using AutoScraper, let's proceed to the final part of this tutorial, which focuses on managing automation mechanisms.

Alternative options for web scraping automation

In this section, we will explore different options for scheduling Python scripts on macOS, Unix/Linux, and Windows operating systems.

Suppose you want your scraper to regularly visit the Travel category page and scrape any newly uploaded books. You can achieve this by scheduling the script we built earlier, which scrapes data from all the books on the Travel category page and saves it in an Excel file.

Here are several methods you can use to schedule a Python script:

  1. Schedule module in Python: The third-party "schedule" module provides a simple and intuitive way to define recurring tasks directly in Python, such as running your scraper every day at a fixed time.
  2. Crontab (cron table): On Unix-based operating systems like Linux and macOS, you can add your script to the crontab, a time-based job scheduler that runs your Python script at the intervals you specify.
  3. Systemd: Another option for Unix-based systems is to run your script as a daemon or background service managed by systemd, the system and service manager, using a service unit paired with a timer.
  4. Task Scheduler in Windows: If you are using Windows, the Task Scheduler lets you specify the schedule and other parameters for running your Python script automatically.

Please note that the crontab and systemd methods are specific to Unix-based operating systems, while the Task Scheduler is specific to Windows. Choose the method that is appropriate for your operating system to schedule your Python script accordingly.
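Whichever scheduler you pick, the underlying idea is the same: work out how long to wait until the next run. A minimal standard-library sketch (the helper name is mine, not part of any of the tools above) that a simple "sleep, scrape, repeat" loop could use:

```python
from datetime import datetime, timedelta

def seconds_until(hour: int, minute: int, now: datetime) -> float:
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Already past today's slot, so schedule for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()

# e.g. time.sleep(seconds_until(9, 0, datetime.now())) before each scrape
print(seconds_until(9, 0, datetime(2023, 6, 14, 8, 0)))  # → 3600.0
```

Dedicated schedulers add what this sketch lacks: surviving reboots, logging, and retrying failed runs.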


