How to Rotate Proxies in Python
Flipnode on Jun 09 2023
Proxies play a crucial role in web scraping scenarios such as market research, price monitoring, and brand protection, and rotating them is a fundamental practice. But why is it necessary?
This comprehensive guide will address the importance of rotating proxies during the scraping process. It will provide you with step-by-step instructions on how to rotate proxies using Python. In addition, you'll receive valuable professional tips and tricks on proxy rotation in the final section of the article. Let's dive in and explore the world of proxy rotation!
What is proxy rotation and why is it important?
Proxy rotation involves automatically assigning different IP addresses to each new web scraping session. This process can be based on factors such as a specific time frame, status code, or the number of requests made.
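For example, a simple number-of-requests policy switches to the next proxy after a fixed count of requests. Here is a minimal sketch of that idea (the proxy addresses are placeholders, not working proxies):
from itertools import cycle

PROXIES = ['http://proxy1.example:3128', 'http://proxy2.example:3128']
ROTATE_EVERY = 50  # switch proxies after this many requests

proxy_pool = cycle(PROXIES)
current_proxy = next(proxy_pool)

for request_number in range(1, 201):
    # ... send request number request_number through current_proxy here ...
    if request_number % ROTATE_EVERY == 0:
        current_proxy = next(proxy_pool)  # rotate to the next IP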
In the realm of web scraping, one common challenge is avoiding detection and blocking by target websites. Proxy rotation serves as a valuable solution to this issue. Websites are typically wary of bot activity and may view a large volume of requests originating from the same IP address as suspicious. However, by utilizing rotating proxy IP addresses, you can enhance your anonymity, simulate the behavior of multiple organic users, and effectively bypass most anti-scraping measures.
When it comes to rotating IP addresses, you have two primary options: utilizing a third-party rotator tool or constructing your own solution using Python. In this article, we will focus on the latter option, delving into the process of building your own IP rotation mechanism in Python.
Rotating proxies in Python: installing prerequisites
To begin, start by setting up a virtual environment by executing the following command:
$ virtualenv venv
This command will create a new "venv" folder containing an isolated Python interpreter along with pip and a few base packages (such as setuptools and wheel).
Next, activate the virtual environment by running the source command:
$ source venv/bin/activate
With the virtual environment activated, proceed to install the requests module by executing the following command:
$ pip install requests
Great! You have now successfully installed the requests module in your current virtual environment.
Next, create a new file named no_proxy.py and include the following script:
import requests
response = requests.get('https://ip.flipnode.io/ip')
print(response.text)
Save the file and run it from your terminal using the following command:
$ python no_proxy.py
Upon running the script, you will see the output displaying your current IP address (e.g., 128.90.50.100).
Our objective is to demonstrate how to conceal your IP address and rotate through different IP addresses to maintain anonymity and avoid being blocked. Let's proceed to the next steps to accomplish this.
Sending GET requests through a proxy
Now, let's start with the basics: how do we use a single proxy? To utilize a proxy server, you'll need the following information:
- Scheme (e.g., http)
- IP address
- Port (e.g., 3128)
- Username and password to connect to the proxy (optional)
Once you have all the necessary information, combine it in the following format:
SCHEME://USERNAME:PASSWORD@YOUR_PROXY_IP:YOUR_PROXY_PORT
Here are a few examples of proxy formats you may encounter:
- http://2.56.215.247:3128
- https://2.56.215.247:8091
- https://my-user:[email protected]:8044
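If you have these pieces as separate values, you can assemble the proxy URL in Python. Here is a minimal sketch using the placeholder credentials from the last example above:
# Assembling a proxy URL from its components.
# All values are placeholders, not a working proxy.
scheme = 'http'
username = 'my-user'
password = 'aegi1Ohz'
proxy_ip = '2.56.215.247'
proxy_port = 8044

proxy_url = f'{scheme}://{username}:{password}@{proxy_ip}:{proxy_port}'
print(proxy_url)  # http://my-user:aegi1Ohz@2.56.215.247:8044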
Note that you can specify multiple protocols and even define specific domains for which a different proxy will be used. For example:
scheme_proxy_map = {
    'http': PROXY1,
    'https': PROXY2,
    'https://example.org': PROXY3,
}
Finally, you can make the request by calling requests.get and passing in the variables defined earlier. Note that the exception classes must be imported from requests.exceptions and a timeout needs to be defined. The script catches connection errors and displays a message when a network issue occurs:
import requests
from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout

TIMEOUT_IN_SECONDS = 10

try:
    response = requests.get('https://ip.flipnode.io/ip', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
except (ProxyError, ReadTimeout, ConnectTimeout) as error:
    print('Unable to connect to the proxy:', error)
else:
    print(response.text)
The output of this script should display the IP of your proxy:
$ python single_proxy.py
2.56.215.247
Congratulations! You are now hidden behind a proxy when making requests through the Python script. Next, we will learn how to rotate a list of proxies instead of using a single one.
Rotating proxies using a proxy pool
In this section of the tutorial, we will utilize a list of proxies stored in a CSV file named proxies.csv. Each proxy server is listed on a separate line in the following format:
http://2.56.215.247:3128
https://88.198.24.108:8080
http://50.206.25.108:80
http://68.188.59.198:80
and so on for any additional proxy servers.
To begin, create a Python file and specify the filename of the CSV file along with the desired timeout for a single proxy to respond:
TIMEOUT_IN_SECONDS = 10
CSV_FILENAME = 'proxies.csv'
Next, implement the code that opens the CSV file, reads each line representing a proxy server into a csv_row variable, and constructs the scheme_proxy_map configuration required by the requests module:
import csv

with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        scheme_proxy_map = {
            'https': csv_row[0],
        }
To verify that everything is functioning correctly, we will use the same scraping code as before to access the website through each proxy:
with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        scheme_proxy_map = {
            'https': csv_row[0],
        }

        # Access the website via the proxy
        try:
            response = requests.get('https://ip.flipnode.io/ip', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
        except (ProxyError, ReadTimeout, ConnectTimeout):
            pass  # skip proxies that fail to respond
        else:
            print(response.text)
If you wish to scrape publicly available content using any functional proxy from the list, you can add a break statement after printing the response to stop iterating through the proxies in the CSV file:
        try:
            response = requests.get('https://ip.flipnode.io/ip', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
        except (ProxyError, ReadTimeout, ConnectTimeout):
            pass  # skip proxies that fail to respond
        else:
            print(response.text)
            break  # notice the break here: stop after the first working proxy
Now the only thing holding us back is speed: the proxies are tried one at a time, and each request waits for the previous one to finish.
How to rotate proxies using async
To rotate proxies using async, you will need to utilize the aiohttp module. You can install it by executing the following command in your command-line interface:
$ pip install aiohttp
After installing aiohttp, create a Python file and define the following variables:
CSV_FILENAME = 'proxies.csv'
URL_TO_CHECK = 'https://ip.flipnode.io/ip'
TIMEOUT_IN_SECONDS = 10
Next, define an async function that accepts two parameters: the URL to request and the proxy to use for accessing it. The function prints the response received; if any error occurs while accessing the URL via the proxy, it prints that instead:
# Imports needed for the full script
import asyncio
import csv
import aiohttp

async def check_proxy(url, proxy):
    try:
        session_timeout = aiohttp.ClientTimeout(total=None, sock_connect=TIMEOUT_IN_SECONDS, sock_read=TIMEOUT_IN_SECONDS)
        async with aiohttp.ClientSession(timeout=session_timeout) as session:
            async with session.get(url, proxy=proxy, timeout=TIMEOUT_IN_SECONDS) as resp:
                print(await resp.text())
    except Exception as error:
        # Comment out this line to only see valid proxies printed in the command line
        print('Proxy responded with an error:', error)
Next, define the main function that reads the CSV file and creates an asynchronous task to check each proxy listed in the file:
async def main():
    tasks = []
    with open(CSV_FILENAME) as open_file:
        reader = csv.reader(open_file)
        for csv_row in reader:
            task = asyncio.create_task(check_proxy(URL_TO_CHECK, csv_row[0]))
            tasks.append(task)
    await asyncio.gather(*tasks)
Run the main function and wait until all async tasks are completed:
asyncio.run(main())
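Save the script and run it from your terminal (the filename rotate_async.py is only a suggestion):
$ python rotate_async.py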
That's it! Your proxies are now checked concurrently rather than one at a time, giving you the best possible speed.
More tips on proxy rotation
Lastly, let's explore some important tips for proxy rotation to ensure a smooth web scraping process.
Avoid free proxy services
It's best to steer clear of free proxy IP addresses. While they may seem enticing, they often come with more drawbacks than benefits. Free proxies are typically slower due to high usage by multiple users, and their availability is not guaranteed. You may find that the proxies you used one day are no longer accessible the next. Additionally, free proxies often lack support for encrypted HTTPS connections, leading to security and privacy concerns.
Combine IP rotation with user-agent rotation
User-agents are strings in HTTP requests that provide information about the browser, operating system, software, and device type. When multiple requests originate from the same browser and operating system within a short timeframe, the target website can detect suspicious activity and block your access. In addition to rotating proxies, it's essential to rotate user agents to avoid detection and mitigate the risk of blocks.
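As a minimal sketch of the idea (the user-agent strings and the proxy below are placeholder examples; in practice you would maintain a larger, up-to-date list), you can pick a random user agent for each request and send it along with your rotated proxy:
import random
import requests

# Example user-agent strings; keep a larger, current list in practice.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0',
]

scheme_proxy_map = {'https': 'http://2.56.215.247:3128'}  # placeholder proxy

# Send a random user agent alongside the proxied request
headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://ip.flipnode.io/ip', headers=headers, proxies=scheme_proxy_map, timeout=10)
print(response.text)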
Choose a reliable premium proxy service
Instead of relying on free proxies, it is strongly recommended to opt for a reputable premium proxy provider. Premium providers offer numerous advantages, including enhanced data privacy, security, and faster speeds. Look for a provider that is transparent about their proxy sourcing practices and can provide proof of ethical acquisition of proxies.
By following these tips, you can improve the effectiveness of your proxy rotation strategy and ensure a more efficient and secure web scraping experience.
Conclusion
Proxy rotation plays a crucial role in the success of web scraping projects, and the good news is that creating a proxy rotator in Python is a straightforward task.