How to Run Python Script as a Service (Windows & Linux)
Flipnode on Jun 19 2023
In the ever-evolving web environment, obtaining time-sensitive data such as e-commerce listings just once is rarely enough, since it quickly becomes outdated. To stay competitive, you need to run web scraping scripts regularly and repeatedly to keep the data fresh.
One convenient approach is to run the script as a background service, regardless of the operating system being used, whether it's Linux or Windows. This guide will outline the process in a few straightforward steps, making it easy to implement.
Preparing a Python script for Linux
In this article, we will cover the process of scraping information from a list of book URLs. The script will continuously loop over the URLs, refreshing the data each time.
First, we use the Requests module to make a request and retrieve the HTML content of a page:
import requests

urls = [
    'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
    'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
    'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
]

index = 0
while True:
    url = urls[index % len(urls)]
    index += 1
    print('Scraping url', url)
    response = requests.get(url)
Next, we parse the HTML content using the Beautiful Soup library:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
book_name = soup.select_one('.product_main').h1.text
rows = soup.select('.table.table-striped tr')
product_info = {row.th.text: row.td.text for row in rows}
We save the book information in JSON format to a data directory using the pathlib module:
import json
import re
from pathlib import Path

data_folder = Path('./data')
data_folder.mkdir(parents=True, exist_ok=True)

json_file_name = re.sub(r"[': ]", '-', book_name)
json_file_path = data_folder / f'{json_file_name}.json'
with open(json_file_path, 'w') as book_file:
    json.dump(product_info, book_file)
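To see what the re.sub sanitization actually produces, here is a small self-contained sketch that writes one record and reads it back. It uses a temporary directory in place of ./data and hard-coded stand-ins for the scraped book_name and product_info:

```python
import json
import re
import tempfile
from pathlib import Path

# Hard-coded stand-ins for the values the scraper would collect
book_name = 'Sapiens: A Brief History of Humankind'
product_info = {'Availability': 'In stock', 'Product Type': 'Books'}

data_folder = Path(tempfile.mkdtemp())

# Apostrophes, colons and spaces are awkward in filenames,
# so the pattern replaces each of them with a dash
json_file_name = re.sub(r"[': ]", '-', book_name)
print(json_file_name)  # Sapiens--A-Brief-History-of-Humankind

json_file_path = data_folder / f'{json_file_name}.json'
with open(json_file_path, 'w') as book_file:
    json.dump(product_info, book_file)

# Read it back to confirm the round-trip
with open(json_file_path) as book_file:
    print(json.load(book_file) == product_info)  # True
```

Note that the colon-plus-space in the title becomes two consecutive dashes; if you prefer a single separator, collapse runs with a `+` quantifier in the pattern.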
To handle shutdown requests from the operating system, we define a SignalHandler class:
import signal

class SignalHandler:
    shutdown_requested = False

    def __init__(self):
        signal.signal(signal.SIGINT, self.request_shutdown)
        signal.signal(signal.SIGTERM, self.request_shutdown)

    def request_shutdown(self, *args):
        print('Request to shutdown received, stopping')
        self.shutdown_requested = True

    def can_run(self):
        return not self.shutdown_requested
Finally, we modify the loop condition to check if a shutdown signal has been received:
signal_handler = SignalHandler()
while signal_handler.can_run():
    # scraping code from above goes here; the loop exits
    # cleanly once a shutdown signal has been received
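On a POSIX system you can sanity-check this wiring without systemd at all by delivering a signal to the current process and watching the loop condition flip (signal.raise_signal requires Python 3.8+):

```python
import signal

class SignalHandler:
    shutdown_requested = False

    def __init__(self):
        signal.signal(signal.SIGINT, self.request_shutdown)
        signal.signal(signal.SIGTERM, self.request_shutdown)

    def request_shutdown(self, *args):
        print('Request to shutdown received, stopping')
        self.shutdown_requested = True

    def can_run(self):
        return not self.shutdown_requested

handler = SignalHandler()
print(handler.can_run())   # True

# Simulate systemd stopping the service: it sends SIGTERM first
signal.raise_signal(signal.SIGTERM)
print(handler.can_run())   # False
```

SIGTERM is what systemd sends on systemctl stop; SIGINT covers Ctrl+C during interactive runs, so both paths go through the same clean-shutdown code.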
The script will continuously refresh the JSON files with newly collected book information until a shutdown signal is received.
Running a Linux daemon
If you're looking to run a Python script on startup in Linux, there are several methods available. Many Linux distributions provide built-in GUI tools for this purpose. Let's take Linux Mint, which uses the Cinnamon desktop environment, as an example. It offers a startup application utility that allows you to add your script with a startup delay.
However, if you need more control over the script, such as managing restarts, systemd is a powerful option. Systemd is a service manager that uses easily configurable files to manage user processes.
To use systemd, follow these steps:
1. Navigate to the /etc/systemd/system directory:
cd /etc/systemd/system
2. Create a file named book-scraper.service:
touch book-scraper.service
3. Open the book-scraper.service file with your favorite editor and add the following content:
[Unit]
Description=A script for scraping book information
After=syslog.target network.target
[Service]
WorkingDirectory=/home/flipnode/Scraper
ExecStart=/home/flipnode/Scraper/venv/bin/python3 scrape.py
Restart=always
RestartSec=120
[Install]
WantedBy=multi-user.target
Here's a brief explanation of the parameters used in the configuration file:
- After ensures that the Python script starts after the network is up.
- RestartSec specifies the sleep time before restarting the service.
- Restart describes what to do if the service exits, is killed, or reaches a timeout.
- WorkingDirectory sets the current working directory for the script.
- ExecStart specifies the command to execute.
4. Once you've saved the file, run the following command to reload the systemd daemon:
systemctl daemon-reload
5. Start your service:
systemctl start book-scraper
If you also want it to start automatically at boot, which is what the [Install] section is for, enable it:
systemctl enable book-scraper
6. To check the status of your service, use the following command:
systemctl status book-scraper
You should see output similar to this:
● book-scraper.service - A script for scraping book information
     Loaded: loaded (/etc/systemd/system/book-scraper.service; disabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-09-08 15:01:27 EEST; 16min ago
   Main PID: 60803 (python3)
      Tasks: 1 (limit: 18637)
     Memory: 21.3M
     CGroup: /system.slice/book-scraper.service
             └─60803 /home/flipnode/Scraper/venv/bin/python3 scrape.py
Sep 08 15:17:55 laptop python3[60803]: Scraping url https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html
Sep 08 15:17:55 laptop python3[60803]: Scraping url https://books.toscrape.com/catalogue/sharp-objects_997/index.html
Sep 08 15:17:55 laptop python3[60803]: Scraping url https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
You can use journalctl -S today -u book-scraper.service to view today's logs for your service, or add the -f flag to follow new log lines in real time.
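One journald gotcha worth guarding against: when stdout is not a terminal, Python block-buffers it, so print output from the service may not reach the journal for a long time. Run the interpreter with -u, set Environment=PYTHONUNBUFFERED=1 in the unit file, or flush explicitly, as in this small helper sketch:

```python
def log(*parts):
    # journald timestamps each line itself, so a plain flushed print
    # is all that is needed for the line to appear in the journal
    # immediately instead of sitting in Python's output buffer
    print(*parts, flush=True)

log('Scraping url', 'https://books.toscrape.com/catalogue/sharp-objects_997/index.html')
```

Any of the three approaches works; the unit-file environment variable is the least invasive since it needs no code changes.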
Congratulations! You can now control your service using systemd.
Running a Python script as a Windows service
Running a Python script as a Windows service requires some modifications. Let's start with the necessary script changes.
First, we need to update how the script is executed based on the number of command-line arguments it receives.
If the script is launched with no command-line arguments (sys.argv holds only the script name), the Windows Service Manager is starting it, so we run the service initialization code. If arguments such as install or start are passed, we hand them off to win32serviceutil.HandleCommandLine, which also prints helpful usage information:
import sys
import servicemanager
import win32serviceutil
import win32event
import win32service
import json
import re
from pathlib import Path
import requests
from bs4 import BeautifulSoup

class BookScraperService(win32serviceutil.ServiceFramework):
    _svc_name_ = 'BookScraperService'
    _svc_display_name_ = 'BookScraperService'
    _svc_description_ = 'Constantly updates book information'

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        self.event = win32event.CreateEvent(None, 0, 0, None)

    def GetAcceptedControls(self):
        result = win32serviceutil.ServiceFramework.GetAcceptedControls(self)
        result |= win32service.SERVICE_ACCEPT_PRESHUTDOWN
        return result

    def SvcDoRun(self):
        urls = [
            'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
            'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
            'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
        ]
        index = 0
        while True:
            # Sleep up to five seconds, but wake immediately
            # when SvcStop sets the event
            result = win32event.WaitForSingleObject(self.event, 5000)
            if result == win32event.WAIT_OBJECT_0:
                break
            url = urls[index % len(urls)]
            index += 1
            print('Scraping URL:', url)
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            book_name = soup.select_one('.product_main').h1.text
            rows = soup.select('.table.table-striped tr')
            product_info = {row.th.text: row.td.text for row in rows}
            data_folder = Path('C:\\Users\\User\\Scraper\\dist\\scrape\\data')
            data_folder.mkdir(parents=True, exist_ok=True)
            json_file_name = re.sub(r"[': ]", '-', book_name)
            json_file_path = data_folder / f'{json_file_name}.json'
            with open(json_file_path, 'w') as book_file:
                json.dump(product_info, book_file)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32event.SetEvent(self.event)

if __name__ == '__main__':
    if len(sys.argv) == 1:
        servicemanager.Initialize()
        servicemanager.PrepareToHostSingle(BookScraperService)
        servicemanager.StartServiceCtrlDispatcher()
    else:
        win32serviceutil.HandleCommandLine(BookScraperService)
The changes to the script allow it to be executed as a Windows service.
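The WaitForSingleObject(self.event, 5000) idiom in SvcDoRun, sleeping between iterations but waking instantly on a stop request, has a portable stdlib analogue in threading.Event. This sketch shows the same pattern outside the Windows service framework (it is an illustration of the idiom, not part of the service code):

```python
import threading
import time

stop_event = threading.Event()
cycles = []

def worker():
    index = 0
    # Event.wait(timeout) plays the role of WaitForSingleObject:
    # it returns False after the timeout (keep looping) and True
    # as soon as the event is set (stop right away).
    while not stop_event.wait(timeout=0.01):
        cycles.append(index)
        index += 1

t = threading.Thread(target=worker)
t.start()
time.sleep(0.1)            # let a few cycles run

stop_event.set()           # the counterpart of SetEvent in SvcStop
t.join(timeout=1)
print('worker stopped:', not t.is_alive())
print('cycles ran:', len(cycles) > 0)
```

The payoff in both versions is the same: the loop never sleeps through a stop request, so the service shuts down promptly instead of making the Service Manager wait out a full sleep interval.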
To run the script as a service, open a Windows terminal of your choice. Please note that if you're using PowerShell, you should include the .exe extension when running binaries to avoid unexpected errors.
To proceed, follow these steps after opening the terminal:
1. Change the directory to the location of your script within the virtual environment. For example:
cd C:\Users\User\Scraper
2. Next, install the pywin32 package (published on PyPI as pypiwin32), which provides the Python for Windows Extensions. Run the following commands to install the module and execute its post-install script:
.\venv\Scripts\pip install pypiwin32
.\venv\Scripts\python .\venv\Scripts\pywin32_postinstall.py -install
However, if you encounter the following error when attempting to install your script as a Windows service:
**** WARNING ****
The executable at "C:\Users\User\Scraper\venv\lib\site-packages\win32\PythonService.exe" is being used as a service.
This executable doesn't have pythonXX.dll and/or pywintypesXX.dll in the same
directory, and they can't be found in the System directory. This is likely to
fail when used in the context of a service.
The exact environment needed will depend on which user runs the service and
where Python is installed. If the service fails to run, this will be why.
NOTE: You should consider copying this executable to the directory where these
DLLs live - "C:\Users\User\Scraper\venv\lib\site-packages\win32" might be a good place.
The warning means that PythonService.exe cannot find the pythonXX.dll and pywintypesXX.dll it needs at runtime. Since the executable already sits in the win32 directory, the practical fix is to copy the missing DLLs next to it; pywintypesXX.dll, for instance, is installed under site-packages\pywin32_system32:
copy C:\Users\User\Scraper\venv\lib\site-packages\pywin32_system32\pywintypes*.dll C:\Users\User\Scraper\venv\lib\site-packages\win32
3. To resolve the "The service did not respond to the start or control request in a timely fashion" issue, you have two options. First, add the Python libraries and interpreter to the Windows path. Alternatively, bundle your script and its dependencies into a single executable using pyinstaller. Run the following command:
venv\Scripts\pyinstaller --hiddenimport win32timezone -F scrape.py
Note that the --hiddenimport win32timezone option is necessary since the win32timezone module is required but not explicitly imported.
4. Finally, install your script as a service and run it by invoking the previously built executable. Use the following commands:
PS C:\Users\User\Scraper> .\dist\scrape.exe install
Installing service BookScraper
Changing service configuration
Service updated
PS C:\Users\User\Scraper> .\dist\scrape.exe start
Starting service BookScraper
PS C:\Users\User\Scraper>
That's it! You can now open the Windows Services utility to see your new service running.
Making your life easier by using NSSM on Windows
Developing a Windows service using win32serviceutil can be a cumbersome process. However, you can simplify it by leveraging NSSM (Non-Sucking Service Manager). Here's the updated process:
1. Let's retain the code responsible for web scraping and discard the rest. Here's the simplified script:
import json
import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup

urls = [
    'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
    'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
    'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
]

index = 0
while True:
    url = urls[index % len(urls)]
    index += 1
    print('Scraping url', url)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    book_name = soup.select_one('.product_main').h1.text
    rows = soup.select('.table.table-striped tr')
    product_info = {row.th.text: row.td.text for row in rows}
    data_folder = Path('C:\\Users\\User\\Scraper\\data')
    data_folder.mkdir(parents=True, exist_ok=True)
    json_file_name = re.sub(r"[': ]", '-', book_name)
    json_file_path = data_folder / f'{json_file_name}.json'
    with open(json_file_path, 'w') as book_file:
        json.dump(product_info, book_file)
2. Create a binary using pyinstaller:
venv\Scripts\pyinstaller -F simple_scrape.py
3. Download NSSM from the official website and extract it to a folder of your choice. Add this folder to the PATH environment variable for convenience.
4. Run the terminal as an administrator.
5. Change the directory to the location of your script:
cd C:\Users\User\Scraper
6. Install the script using NSSM and start the service:
nssm.exe install SimpleScrape C:\Users\User\Scraper\dist\simple_scrape.exe
nssm.exe start SimpleScrape
Pro tip: If you encounter any issues, redirect the standard error output of your service to a file to investigate the problem:
nssm set SimpleScrape AppStderr C:\Users\User\Scraper\service-error.log
NSSM ensures that the service runs in the background, and if any issues arise, you will have visibility into the error logs.
Conclusion
No matter which operating system you're using, there are several solid options for setting up a Python script as a recurring web scraping service. Depending on your needs and preferences, you can choose the configurability of systemd, the native integration of Windows services, or the simplicity of NSSM; whichever you pick, the steps above will keep your scraper running, restarting on failure, and delivering fresh data.