Automating Web Scraping With Python and Cron
Flipnode on Jun 19 2023
When building an automated web scraper, the initial step typically involves writing a Python web scraper script. The subsequent step is automating the process, which offers various options, but one stands out as the simplest. Unix-like operating systems, including macOS and Linux, provide a built-in tool called cron, specifically designed for scheduling recurring tasks.
In this article, we will focus on teaching you how to schedule tasks using cron. As an example of automation, we have chosen a Python-based web scraper.
However, before diving into the configuration of cron, it is advisable to follow certain preparatory guidelines. By doing so, you can minimize the chances of encountering errors during the setup process.
Preparing the Python script
First and foremost, it is advisable to utilize a virtual environment. This ensures that the correct Python version and all required libraries are exclusively available for your Python web scraper, without affecting other users on the system.
Another good practice is to use absolute file paths. Cron does not start your script from the project directory, so relative paths that work in your terminal can fail with missing-file errors once the script runs on a schedule.
Lastly, incorporating logging is highly recommended as it provides a valuable log file for reference and troubleshooting in case of any issues.
To configure logging, a single call to logging.basicConfig() after importing the logging module is enough: pass it the path of the log file and the minimum level to record.
Once configured, you can write log messages to the file as follows:
logging.info("Informational message here")
For more detailed information on logging, refer to the official documentation.
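As a minimal sketch of the logging setup described above, the following configures a log file and writes two messages. The log location (a scraper.log file in the system's temporary directory), the message format, and the message texts are assumptions made for illustration; in your own scraper, point LOG_FILE at an absolute path of your choice.

```python
import logging
import os
import tempfile

# Hypothetical log location for this demo; use your own absolute path.
LOG_FILE = os.path.join(tempfile.gettempdir(), "scraper.log")

logging.basicConfig(
    filename=LOG_FILE,
    level=logging.INFO,  # record INFO and anything more severe
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,  # Python 3.8+: replace handlers configured earlier
)

logging.info("Scraper started")
logging.warning("Price element not found")
```

Because cron runs without a terminal, a log file like this is often the only place where you can see what the script did.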
To illustrate a realistic example, the following script showcases automated scraping similar to real-life scenarios:
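import requests
from bs4 import BeautifulSoup

# Product page whose price we want to track
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
response = requests.get(url)
response.encoding = "utf-8"
soup = BeautifulSoup(response.text, 'lxml')
# The price is the element with class price_color inside #content_inner
price = soup.select_one('#content_inner .price_color').text
# Append the price to the CSV file (absolute path, so cron can find it)
with open(r'/Users/upen/data.csv', 'a') as f:
    f.write(price + "\n")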
Whenever you execute this script, it will append the latest price as a new line to the CSV file.
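When the script runs on a schedule, it can help to record when each price was collected. The following is an optional variation, not part of the original script: a hypothetical append_price helper that writes a timestamp next to the price using the standard csv module. The output path and the sample price string are assumptions for the demo.

```python
import csv
import os
import tempfile
from datetime import datetime


def append_price(csv_path, price):
    """Append one row holding the current timestamp and the scraped price."""
    timestamp = datetime.now().isoformat(timespec="seconds")
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([timestamp, price])
    return timestamp


# Hypothetical output path and sample value for this demo
demo_csv = os.path.join(tempfile.gettempdir(), "prices_demo.csv")
append_price(demo_csv, "£51.77")
```

In the scraper above, you would call append_price with the absolute path of your CSV file and the scraped price variable instead of the sample value.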
What is cron, and how does it work
The cron utility is responsible for checking if any scheduled tasks need to be executed and running them accordingly.
A crucial component of cron is crontab, short for cron table. The crontab utility creates and manages the files, known as crontab files, that cron reads to find out which tasks to run and when.
In this article, we will be working directly with these files. If you're interested in learning how to write cron jobs in Python using the python-crontab library, you can refer to the respective documentation.
When using python-crontab, it's possible to configure cron directly, including the addition and removal of crontab jobs using Python. However, in our example, we will focus on working with crontab itself.
Understanding the crontab utility's functionality is the first step towards building an automated web scraping task.
To view the list of currently configured crontab tasks, use the following command with the -l switch:
crontab -l
To edit the crontab file, use the -e switch:
crontab -e
This command will open the default editor, usually vi. If you prefer a different editor like nano for easier editing, you can set it as the default editor using the following command:
export EDITOR=nano
Note that certain editors like Visual Studio Code won't work due to their handling of system-level files. It's recommended to stick with vi or nano for editing crontab files.
Each crontab entry follows the pattern:
<schedule> <command to run>
Each line represents a schedule and the corresponding task to be executed.
Editing the crontab file
To edit the crontab file, open the terminal and execute the following command:
crontab -e
This command will open the default crontab editor. On some Linux distributions, you may be prompted to choose the program you want to use to edit the file.
In the editor, add each task and its frequency on separate lines.
How to run cron job frequently
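Each entry in crontab begins with the schedule of the cron job. The schedule consists of five parts, in this order:
- Minute (0-59)
- Hour (in 24-hour format)
- Day of the month
- Month
- Day of the week (0-6, where 0 is Sunday)
In the simplest case, each part is either * (any value) or a specific number. Most cron implementations also accept lists (1,15), ranges (1-5), and step values (*/10).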
For example, if you want to run a task every hour, the schedule will be:
0 * * * *
In this case, the cron process runs every minute and compares the current system time with this entry. It will match only when the system time is at minute 0. The remaining fields are set to *, indicating they can accept any value.
With this schedule, the task will run at 4:00, 5:00, 6:00, and so on, effectively running every hour.
Here are a few more examples:
- To run a task at 10 am on the 1st day of every month:
0 10 1 * *
- To run a task at 2 pm (14:00) every Monday:
0 14 * * 1
There are websites like crontab.guru that can assist you in building and validating cron schedules.
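The matching rule described above can be sketched in a few lines of Python. This is a simplified illustration, not how cron is actually implemented: the matches and field_matches functions are hypothetical names, and only * and single numbers are handled (no lists, ranges, or steps).

```python
from datetime import datetime


def field_matches(field, value):
    """A field matches if it is '*' or equals the given number."""
    return field == "*" or int(field) == value


def matches(schedule, moment):
    """Return True if a five-field schedule matches the given datetime."""
    minute, hour, dom, month, dow = schedule.split()
    cron_dow = moment.isoweekday() % 7  # cron counts Sunday as 0
    time_ok = (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(month, moment.month)
    )
    dom_ok = field_matches(dom, moment.day)
    dow_ok = field_matches(dow, cron_dow)
    # When both day fields are restricted, cron runs the job if either matches
    if dom != "*" and dow != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok
    return time_ok and day_ok


# June 19, 2023 was a Monday
print(matches("0 * * * *", datetime(2023, 6, 19, 4, 0)))    # True
print(matches("0 * * * *", datetime(2023, 6, 19, 4, 1)))    # False
print(matches("0 14 * * 1", datetime(2023, 6, 19, 14, 0)))  # True
```

Cron performs a check of this kind once per minute for every entry, which is why a schedule can never be more precise than one minute.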
How to remove a Python crontab job
To remove all crontab jobs, simply open the terminal and use the following command. Note that it deletes the entire crontab without asking for confirmation:
crontab -r
If you wish to remove a specific crontab job, you can edit the crontab file using the command:
crontab -e
Once you are in edit mode, locate the line corresponding to the job you want to remove and delete it. Save the file after making the changes. The crontab will be updated with the modified contents, effectively deleting the specified cron job.
How to schedule Python script in crontab
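First, determine the command you want to execute. If you're not using a virtual environment, you can run your web scraping script with a command like the following, replacing the path with the absolute path to your own script:
python3 /Users/upen/shopping/script.py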
In some cases, your script has specific dependencies. If you're following the recommended practices, you have installed them in a virtual environment, and the commands to run the scraper look something like this:
cd /Users/upen/shopping
source venv/bin/activate
python3 script.py
Note that using the "source venv/bin/activate" command to activate your virtual environment is often unnecessary. For instance, "venv/bin/python3 script.py" already uses the python3 from the virtual environment. Calling the interpreter directly also avoids relying on source, which not every sh implementation provides.
Another suggestion is to create a shell script and include the aforementioned lines in that script to make it more manageable. If you do so, the command to run your scraper would be:
sh /Users/upen/shopping/run_scraper.sh
The second step is to create a schedule. Let's consider an example where the script needs to run every hour. The cron schedule would be as follows:
0 * * * *
Once you have determined the command and schedule, open the terminal and enter the command:
crontab -e
Then, add the following line to the crontab file, assuming you are using a shell script:
0 * * * * sh /Users/upen/shopping/run_scraper.sh
After saving the file, you may receive a system prompt stating that your system settings are being modified.
Common reasons why crontab Python script isn't running
On macOS, a common issue is the lack of permission for cron. To address this, follow these steps:
- Open System Preferences and click on Security & Privacy.
- Go to the Privacy tab and select Full Disk Access from the left sidebar.
- Add the path of the cron executable to the list. If you're unsure about the location of the cron executable, you can run the following command in the terminal:
which cron
Another frequent problem is the usage of the wrong version of Python (2 instead of 3, or vice versa). Older macOS releases and some Linux distributions come with both Python 2 and Python 3 installed. To resolve this, determine the complete path of the Python executable you wish to use. Follow these steps:
- Open the terminal.
- Run the following command:
which python3
Make note of the Python executable you want to use. If you're not utilizing virtual environments, you need to specify the complete path to that Python executable in the crontab entry.
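One way to confirm which interpreter cron actually uses is to have the script report it. The sketch below is an illustration rather than a required step: interpreter_info is a hypothetical helper that returns the executable path, the Python version, and whether a virtual environment is active.

```python
import sys


def interpreter_info():
    """Describe the interpreter that is running this script."""
    # Inside a virtual environment, sys.prefix differs from sys.base_prefix
    in_venv = sys.prefix != sys.base_prefix
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"python={sys.executable} version={version} venv={in_venv}"


print(interpreter_info())
```

If you write this string to your log file at the start of each run, a mismatch between the interpreter you tested with and the one cron picked up becomes easy to spot.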
Incorrect script path is another common cause of failure. As a best practice, when working with cron, always use absolute paths for your scripts.
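If you would rather not hard-code a full path for every file, one option is to build absolute paths from the location of the script itself. The following is a sketch under that assumption; the file names data.csv and scraper.log are placeholders.

```python
from pathlib import Path

# Directory that contains this script. __file__ is undefined in some
# interactive sessions, so fall back to the current working directory.
try:
    BASE_DIR = Path(__file__).resolve().parent
except NameError:
    BASE_DIR = Path.cwd()

# Placeholder file names, resolved to absolute paths next to the script
DATA_FILE = BASE_DIR / "data.csv"
LOG_FILE = BASE_DIR / "scraper.log"

print(DATA_FILE)
```

With paths built this way, the script finds its files regardless of the directory cron starts it from.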
Cron job vs SystemD vs Windows Task Scheduler vs AutoScraper
Cron is a utility designed for Unix-like operating systems such as macOS and Linux. Similar tools include Systemd (pronounced as "system-d") and Anacron, but these are specific to Linux and not available on Windows.
For Windows, the recommended tool to use is the Windows Task Scheduler.
AutoScraper, on the other hand, is an open-source Python library that can handle various scraping scenarios. It's important to note that this library is not intended as a replacement for cron. While AutoScraper can automate the web scraping process, you still need to write the Python script and utilize cron or an alternative tool to execute it automatically.
Now that we have covered cron, crontab, and cron jobs, you should have a clearer picture of how web scraping can be automated with them. Before automating your own web scraping projects, research which software and languages best suit your needs: both cron and Python have advantages and limitations compared to the alternatives.