Web Scraping with Selenium and Python

Flipnode on Jun 07 2023

To grasp the fundamentals of data scraping with Python and gain a general understanding of web scraping, it is essential to explore different frameworks and request libraries. Familiarizing yourself with various HTTP methods, particularly GET and POST, can greatly simplify the web scraping process.

One widely recognized and frequently utilized tool for automating web browser interactions is Selenium. When combined with other technologies like Beautiful Soup, it enables a deeper understanding of the basics of web scraping.

But how does Selenium work? It drives a real web browser from your script, automating repetitive actions such as clicking, scrolling, and typing. Although Selenium is designed primarily for testing web applications, it is useful well beyond that scope.

In this guide on web scraping with Selenium, we will use Python 3.x as our programming language. Python is not only one of the most popular languages for scraping but also the one we work with most closely.

Setting up Selenium

To begin, install the Selenium package by running the following command in your terminal:

pip install selenium

Additionally, you'll need a browser driver (for example, geckodriver for Firefox or ChromeDriver for Chrome), which lets Selenium control the browser. Selenium 4.6 and later can fetch a matching driver automatically through Selenium Manager. If you choose to perform a manual installation, ensure that the driver is accessible via the PATH variable.

You can download the drivers for Firefox, Chrome, and Edge from each browser vendor's official download page.
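
After a manual installation, a quick check that the drivers are reachable via PATH can save some debugging. The sketch below assumes the conventional executable names (geckodriver, chromedriver, msedgedriver); the helper name locate_drivers is only illustrative.

```python
import shutil

# Conventional executable names for each browser's driver (assumed here;
# adjust them if your installation uses different names).
DRIVER_EXECUTABLES = {
    "Firefox": "geckodriver",
    "Chrome": "chromedriver",
    "Edge": "msedgedriver",
}


def locate_drivers(executables=DRIVER_EXECUTABLES):
    """Map each browser name to the driver path found on PATH, or None."""
    return {browser: shutil.which(name) for browser, name in executables.items()}


for browser, path in locate_drivers().items():
    print(f"{browser}: {path or 'not found on PATH'}")
```

A driver reported as not found either has not been installed or lives in a directory that PATH does not include.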

Quick starting Selenium

Let's start automating the process by launching your web browser:

To open a new browser window using Firefox:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://flipnode.io/')

This opens the browser in headful mode, with a visible window. To run the browser in headless mode, for example on a server, you can use the following code:

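from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

DRIVER_PATH = "/path/to/geckodriver"  # placeholder: path to your Firefox driver

options = Options()
options.add_argument("-headless")     # options.headless was removed in recent Selenium 4 releases
options.add_argument("--width=1920")  # Firefox takes width and height as separate arguments
options.add_argument("--height=1200")

# Selenium 4 takes the driver path through a Service object;
# the executable_path argument was removed in recent Selenium 4 releases.
driver = webdriver.Firefox(options=options, service=Service(executable_path=DRIVER_PATH))
driver.get("https://www.flipnode.io/")
print(driver.page_source)
driver.quit()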

Make sure to replace DRIVER_PATH with the actual path to the Firefox driver executable on your system. The print(driver.page_source) line prints the HTML source of the loaded web page, and driver.quit() closes the browser and ends the session once the script is done.

Data extraction with Selenium by locating elements

Finding elements in web pages can sometimes be challenging. Fortunately, Selenium offers two methods that you can use to extract data from one or multiple elements. These methods are:

find_element: Locates a single element.

find_elements: Locates multiple elements.

Let's take an example: locating the H1 tag on the flipnode.io homepage. Given the simplified HTML below, any of the Selenium calls that follow it will find the element:

<html>
    <head>
        ... something
    </head>
    <body>
        <h1 class="someclass" id="greatID"> Partner Up With Proxy Experts</h1>
    </body>
</html>

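from selenium.webdriver.common.by import By

# Each call locates the same <h1> element using a different strategy.
h1 = driver.find_element(By.TAG_NAME, 'h1')           # by tag name
h1 = driver.find_element(By.CLASS_NAME, 'someclass')  # by class attribute
h1 = driver.find_element(By.XPATH, '//h1')            # by XPath expression
h1 = driver.find_element(By.ID, 'greatID')            # by id attribute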

Using Developer Tools:

You can also use the find_elements method to obtain a list of elements. For example:

all_links = driver.find_elements(By.TAG_NAME, 'a')

This will retrieve all anchor tags on the page.
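
Once the anchors are collected, the usual next step is to read their text and href values. The following is a minimal sketch, not the only way to do it: extract_links is a hypothetical helper that accepts any objects exposing .text and .get_attribute(), which is the interface Selenium's WebElement provides.

```python
from urllib.parse import urljoin


def extract_links(anchors, base_url):
    """Return (text, absolute URL) pairs for anchors that carry an href."""
    links = []
    for anchor in anchors:
        href = anchor.get_attribute("href")
        if href:  # skip anchors without an href attribute
            links.append((anchor.text.strip(), urljoin(base_url, href)))
    return links


# Usage with a live session (assumes `driver` and `By` from the snippets above):
# all_links = driver.find_elements(By.TAG_NAME, 'a')
# for text, url in extract_links(all_links, driver.current_url):
#     print(text, url)
```

Browsers usually report href as an absolute URL already; urljoin is kept as a safeguard for relative values.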

However, some elements may not have an easily accessible ID or a simple class. In such cases, you can leverage XPath.

XPath

XPath is a query language for locating nodes in the DOM. An XPath expression can address a node with an absolute path from the root element or with a relative path. For example:

/: Selects a node from the root. /html/body/div[1] will find the first div directly inside body.
//: Selects matching nodes anywhere in the document, regardless of their location. //form[1] matches a form that is the first form child of its parent; use (//form)[1] to select the first form in the whole document.
[@attributename='value']: A predicate that selects only nodes whose attribute has the given value.

Example:

<html> 
 <body> 
   <div class="content-login"> 
     <form id="loginForm"> 
         <div> 
            <input type="text" name="email" value="Email Address:"> 
            <input type="password" name="password" value="Password:"> 
         </div> 
        <button type="submit">Submit</button> 
     </form> 
   </div> 
 </body> 
</html>

//input[@name='email'] matches every input element whose name attribute is "email"; passed to find_element, it returns the first one.
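
To experiment with predicates without launching a browser, the same idea can be tried with Python's standard-library xml.etree.ElementTree, which supports a limited subset of XPath. This is only a sketch: it assumes a well-formed (XML-style) copy of the login form above, and paths must be relative (starting with ".") because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the login form above: the input tags are
# self-closed because ElementTree parses XML, not HTML.
LOGIN_PAGE = """
<html>
  <body>
    <div class="content-login">
      <form id="loginForm">
        <div>
          <input type="text" name="email" value="Email Address:" />
          <input type="password" name="password" value="Password:" />
        </div>
        <button type="submit">Submit</button>
      </form>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(LOGIN_PAGE)

# ".//" searches all descendants; the predicate filters on an attribute value.
email_input = root.find(".//input[@name='email']")
print(email_input.get("type"))

# A predicate can also sit in the middle of a longer path.
inputs = root.findall(".//form[@id='loginForm']/div/input")
names = [element.get("name") for element in inputs]
print(names)
```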

WebElement:

A WebElement in Selenium represents a single HTML element on the page. Here are some commonly used attributes and actions:

element.text: Accesses the text content of an element.
element.click(): Performs a click action on the element.
element.get_attribute('class'): Accesses the value of an attribute.
element.send_keys('mypassword'): Enters text into an input field.
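
These actions are typically combined, for example to fill in and submit the login form shown in the XPath section. The sketch below is one possible arrangement; submit_login is a hypothetical helper that works on already-located elements, and the credentials are placeholders.

```python
def submit_login(email_input, password_input, submit_button, email, password):
    """Fill in a login form through WebElement-like objects and submit it."""
    for field, value in ((email_input, email), (password_input, password)):
        field.clear()           # remove any pre-filled value first
        field.send_keys(value)  # then type the new value
    submit_button.click()


# Usage with a live session (assumes `driver` and `By` from the snippets above):
# submit_login(
#     driver.find_element(By.NAME, "email"),
#     driver.find_element(By.NAME, "password"),
#     driver.find_element(By.XPATH, "//button[@type='submit']"),
#     "user@example.com",  # placeholder credentials
#     "mypassword",
# )
```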

Solutions for Slow Website Rendering:

Some websites heavily rely on JavaScript to render dynamic content, making them challenging to handle due to numerous AJAX calls. Here are a few ways to address this issue:

time.sleep(ARBITRARY_TIME): Add a pause to allow the elements to load.
WebDriverWait(): Explicitly wait for specific conditions before proceeding.

Example:

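from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    # Poll the DOM for up to 10 seconds until the element is present.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    # Close the browser whether or not the element appeared.
    driver.quit()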

This code waits up to 10 seconds for the element with the ID "mySuperId" to be present in the DOM and raises a TimeoutException if it never appears. For more detailed information, refer to the official Selenium documentation.

Executing JavaScript with Selenium

To execute JavaScript code using Selenium, we can use the execute_script method of the WebDriver instance. This method takes the JavaScript code as a string argument. Take a look at the following example:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.example.com")
driver.execute_script("alert('Hello World');")

In the above code, we create a WebDriver instance for a Firefox browser, navigate to the desired website, and then use execute_script to run a simple JavaScript snippet that displays an alert box with the text "Hello World" on the website.

The execute_script method also supports additional arguments that can be passed to the JavaScript code. For instance, if we want to click a button using JavaScript, we can achieve it with the following code:

from selenium.webdriver.common.by import By
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.example.com")
button = driver.find_element(By.TAG_NAME, "button")
driver.execute_script("arguments[0].click();", button)

In this example, we locate the button element using the tag name and then pass it as an argument to execute_script, which utilizes JavaScript to click the button. Note that we use arguments[0] within the JavaScript code to reference the first argument passed to execute_script.
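
execute_script can also send values back to Python: whatever the JavaScript snippet returns is converted to the closest Python type (a JavaScript object becomes a dict). The sketch below shows the idea; page_metrics and the three fields it collects are illustrative choices only.

```python
def page_metrics(driver):
    """Read a few values from the page via JavaScript and return them as a dict."""
    return driver.execute_script(
        """
        return {
            title: document.title,
            linkCount: document.querySelectorAll('a').length,
            pageHeight: document.body.scrollHeight
        };
        """
    )


# Usage with a live session (assumes `driver` from the snippets above):
# metrics = page_metrics(driver)
# print(metrics["title"], metrics["linkCount"], metrics["pageHeight"])
```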

Capturing Screenshots using Selenium:

Selenium WebDriver also provides a feature to capture screenshots of websites. These screenshots can be saved locally for further analysis. Consider the following example:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://books.toscrape.com")
driver.save_screenshot("screenshot.png")
driver.quit()  # end the session and close the browser

We use the save_screenshot() method to capture a screenshot of the page. The argument "screenshot.png" is the name of the image file, which is saved in the current working directory. Selenium always writes the image in PNG format regardless of the extension, so the file name should end in .png.
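
When capturing many pages, it helps to derive each file name from the URL and to check whether the save succeeded (save_screenshot returns False on an I/O error). The following is a sketch under those assumptions; screenshot_path, capture, and the "screenshots" folder name are all hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def screenshot_path(url, out_dir="screenshots"):
    """Build a .png path named after the URL's host and path."""
    parts = urlparse(url)
    slug = (parts.netloc + parts.path).strip("/").replace("/", "_") or "page"
    return Path(out_dir) / f"{slug}.png"


def capture(driver, url, out_dir="screenshots"):
    """Open url and save a screenshot; return the path, or None if saving failed."""
    target = screenshot_path(url, out_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    driver.get(url)
    saved = driver.save_screenshot(str(target))  # False signals an I/O error
    return target if saved else None


# Usage with a live session (assumes `driver` from the snippets above):
# print(capture(driver, "https://books.toscrape.com"))
```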

Scraping Multiple URLs using Selenium:

We can leverage Selenium to scrape multiple URLs in Python. This allows us to use the same WebDriver instance to browse multiple websites or web pages and collect data efficiently. Take a look at the following example:

urls = ["https://www.example.com/page/{}".format(i) for i in range(1, 11)]
for url in urls:
    driver.get(url)
    # perform scraping operations

In this example, we want to browse the first ten pages of a website. We utilize Python's list comprehension to create a list of URLs, and then we iterate over the list using a for loop. Inside the loop, we use Selenium's get() method to navigate to each URL and perform the desired scraping operations.
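
The loop can be wrapped in a small function that collects one result per page and pauses between requests. This is a sketch, not a definitive implementation: scrape_pages, its extract callback, and the one-second default delay are assumptions made for illustration.

```python
import time


def scrape_pages(driver, urls, extract, delay=1.0):
    """Visit each URL with one driver and collect extract(driver) per page."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # pause between requests (skipped before the first)
        driver.get(url)
        results[url] = extract(driver)
    return results


# Usage with a live session (assumes `driver` from the snippets above):
# urls = ["https://www.example.com/page/{}".format(i) for i in range(1, 11)]
# titles = scrape_pages(driver, urls, extract=lambda d: d.title)
```

Passing the extraction step in as a callback keeps the navigation logic reusable across different pages and data fields.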

Scrolling Down using Selenium:

To scroll down a website using Selenium and Python, we can leverage Selenium's JavaScript support and the execute_script method to execute JavaScript code that scrolls the page. Consider the following example:

driver.execute_script("window.scrollBy(0, 500);")

In this example, we use the scrollBy method, which takes two arguments representing the pixel values for horizontal and vertical scrolling. Here, we instruct Selenium to scroll the page 500 pixels down.
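
On pages that load more content as you scroll, a single scrollBy call is rarely enough. One common approach is to keep scrolling to the bottom until the page height stops growing. The sketch below assumes that approach; scroll_to_bottom and its pause and max_rounds parameters are hypothetical.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight;")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight;")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so this is the bottom
        last_height = new_height
    return max_rounds


# Usage with a live session (assumes `driver` from the snippets above):
# scroll_to_bottom(driver)
```

The max_rounds cap prevents an endless loop on pages that never stop loading new content.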

Selenium vs Puppeteer

Selenium's popularity, and its complexity, stem from its breadth: tests can be written in many programming languages, such as C#, Groovy, Java, Perl, PHP, Python, Ruby, Scala, and JavaScript. Additionally, Selenium supports multiple browsers, including Chrome, Firefox, Edge, Internet Explorer, Opera, and Safari.

However, when it comes to web scraping, using Selenium can be more complicated than necessary. It's important to remember that Selenium's primary purpose is functional testing, where it emulates human interaction in a browser. Therefore, Selenium requires three main components:

A driver specific to each browser.
Installation of each desired browser.
The corresponding package or library depending on the programming language being used.

In contrast, Puppeteer simplifies the process. The Puppeteer node package includes Chromium, eliminating the need for separate browser installations or drivers. This streamlined approach makes web scraping with Puppeteer more straightforward. Additionally, Puppeteer also supports the Chrome browser if that is your preferred choice.

Selenium vs. scraping tools

Selenium and scraping tools are two different approaches to web scraping, each with its own advantages and use cases.

Selenium is primarily designed for web automation and functional testing. It allows you to control a web browser programmatically, mimicking human interactions such as clicking, scrolling, and form submission. Selenium supports multiple programming languages and browsers, making it a versatile tool for automating browser tasks.

On the other hand, scraping tools are specifically built for data extraction from websites. These tools often provide simplified interfaces and features that are tailored for scraping tasks. They may offer convenient methods for parsing HTML, handling pagination, dealing with anti-scraping measures, and exporting data in various formats.

Here are some considerations when deciding between Selenium and scraping tools:

Complexity: Selenium can be more complex to set up and use, especially for beginners, due to its extensive features and integration requirements. Scraping tools, on the other hand, are usually designed to be user-friendly and provide straightforward scraping functionalities.
Browser Control: Selenium allows you to automate browser interactions and execute JavaScript on web pages, making it suitable for scraping dynamic websites that heavily rely on client-side rendering. Scraping tools may not offer the same level of control over browser actions but can still handle many scraping tasks efficiently.
Speed: Selenium automation involves launching a browser, which can be slower compared to scraping tools that directly parse HTML content. If speed is a critical factor, scraping tools may offer faster scraping capabilities.
Target Websites: Some websites may have anti-scraping measures in place, such as CAPTCHAs or JavaScript challenges. Selenium's ability to interact with web pages and execute JavaScript can help bypass these obstacles more effectively. Scraping tools may have built-in features to handle anti-scraping mechanisms, but their effectiveness may vary.
Flexibility: Selenium's support for multiple programming languages and browsers provides flexibility in choosing the environment that best suits your needs. Scraping tools often come with their own programming interfaces and may have limitations on language and browser compatibility.
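
To make the speed point concrete, here is what parsing HTML directly, with no browser involved, can look like. This is only a sketch using Python's standard-library html.parser on a made-up HTML string; dedicated scraping tools layer fetching, pagination, and export on top of this kind of parsing.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(html):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Made-up static HTML standing in for a downloaded page.
SAMPLE_HTML = (
    '<html><body><a href="/page/1">One</a>'
    '<a>No href</a><a href="/page/2">Two</a></body></html>'
)
print(collect_links(SAMPLE_HTML))  # prints ['/page/1', '/page/2']
```

Because no browser is started and no JavaScript runs, this is fast, but it only sees content present in the raw HTML.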

Conclusion

Web scraping with Selenium is an excellent starting point for learning the fundamentals. However, depending on your objectives, opting for pre-existing scraping tools can often be a more convenient choice. Developing your own scraper can be a time-consuming and resource-intensive process, which may not always justify the investment of time and effort.