Playwright Scraping Tutorial for 2023

Flipnode on Jun 20 2023

In recent years, the internet has experienced significant growth and impact, largely due to the advancement of technologies that enable the creation of more user-friendly applications. This growth has also seen an increase in automation across various stages of web application development and testing.

Having reliable tools for testing web applications is essential in this landscape. Libraries like Playwright have emerged to streamline processes by facilitating tasks such as opening web applications in browsers, performing user interactions like clicking elements and typing text, and even extracting public data from the web.

This article serves as a comprehensive guide to Playwright, detailing its capabilities and demonstrating how it can be utilized for automation and web scraping purposes.

What is Playwright?

Playwright is a powerful testing and automation framework designed to automate web browser interactions. With Playwright, you can write code that opens a web browser, leveraging all its capabilities. This includes navigating to URLs, entering text, clicking buttons, extracting data, and more. One of the most notable features of Playwright is its ability to work with multiple pages simultaneously, without being blocked or waiting for operations to finish.

Playwright supports popular browsers such as Google Chrome and Microsoft Edge (both Chromium-based) and Firefox. It also ships WebKit, the engine behind Safari, so Safari-like behavior can be tested without Safari itself. The framework excels in cross-browser web automation, allowing you to write code that runs across different browsers with minimal changes. Furthermore, Playwright offers bindings for several languages: JavaScript and TypeScript (on Node.js), Python, Java, and .NET. This means you can use your preferred language to write code that opens websites and interacts with them.

The documentation for Playwright is comprehensive, covering everything from getting started to detailed explanations of all the available classes and methods. Whether you are a beginner or an experienced user, the documentation provides the necessary resources to help you harness the full potential of Playwright.

Basic web scraping with Playwright

This section shows how to get started with Playwright using Node.js and Python. We also have a separate blog post on how to scrape Amazon with Python, which you might find useful.

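If you're using Node.js, start by creating a new project and installing the Playwright library. Newer Playwright releases may not download browser binaries during installation; if the script below reports missing browsers, run npx playwright install. Create the project and install the library with these two commands: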

npm init -y
npm install playwright

Here's a basic script that opens a dynamic page using Playwright:

const playwright = require("playwright");

(async () => {
  for (const browserType of ['chromium', 'firefox', 'webkit']) {
    const browser = await playwright[browserType].launch();
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto("https://amazon.com");
    await page.waitForTimeout(1000);
    await browser.close();
  }
})();

Let's examine the code. The first line imports the Playwright library. The script then loops over the three browser types, launching Chromium, Firefox, and WebKit in turn. For each browser it creates a context, opens a new page, and navigates to the Amazon web page using the page.goto() function. It then pauses for one second, which is mainly useful for watching the page when the browser runs in headed mode (launch() is headless by default). Finally, the browser is closed.

The same code can be easily written in Python. First, install the Playwright Python library using the pip command, and then install the necessary browsers using the install command:

python -m pip install playwright
playwright install

Please note that Playwright supports both synchronous and asynchronous variations. The following example uses the asynchronous API:

from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto('https://amazon.com')
        await page.wait_for_timeout(1000)
        await browser.close()

asyncio.run(main())

This code is similar to the Node.js code, with the main difference being the use of the asyncio library to run the asynchronous main() function. Because of headless=False, the browser object launches a headful (visible) Chromium instance; pass headless=True, or omit the argument, to run in headless mode. Additionally, the method names change from camelCase to snake_case, for example newPage() becomes new_page().
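For comparison, the synchronous variant mentioned above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the helper name get_title and the example.com URL are assumptions made for illustration, and the Playwright import sits inside main() only so that the helper can be reused without launching a browser.

```python
def get_title(p, url, headless=True):
    """Open `url` in Chromium and return the page title.

    `p` is the object yielded by `sync_playwright()`.
    """
    browser = p.chromium.launch(headless=headless)
    try:
        page = browser.new_page()
        page.goto(url)
        return page.title()
    finally:
        # Close the browser even if navigation fails.
        browser.close()


def main():
    # Imported here so get_title() can be reused without a browser installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Placeholder URL chosen for illustration; any public page works.
        print(get_title(p, "https://example.com"))
```

Calling main() runs the script. No asyncio or await is needed in the synchronous API, which makes it convenient for short scripts, while the asynchronous API is the better fit when several pages are handled at once.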

In Node.js, if you want to create multiple browser contexts or have finer control, you can create a context object and create multiple pages within that context. This will open pages in new tabs:

const context = await browser.newContext();
const page1 = await context.newPage();
const page2 = await context.newPage();

You may also need to handle page contexts in your code. It's possible to obtain the browser context to which a page belongs using the page.context() function.
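In Python, the same idea of several pages sharing one context can be sketched like this. The helper names fetch_title and fetch_titles and the placeholder URLs are assumptions for illustration; the sketch loads the pages concurrently with asyncio.gather, which is one way (not the only one) to work with multiple pages at once.

```python
import asyncio


async def fetch_title(context, url):
    """Open `url` in a new tab of `context` and return its title."""
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def fetch_titles(browser, urls):
    """Load all `urls` concurrently, each in its own tab of one context."""
    context = await browser.new_context()
    try:
        # gather() runs the coroutines concurrently and keeps the input order.
        return await asyncio.gather(*(fetch_title(context, u) for u in urls))
    finally:
        await context.close()


async def main():
    # Imported here so the helpers above can be reused without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # Placeholder URLs chosen for illustration.
        urls = ["https://example.com", "https://example.org"]
        print(await fetch_titles(browser, urls))
        await browser.close()

# Run with: asyncio.run(main())
```

Note that in the Python API the owning context is exposed as the property page.context rather than as a method call.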

Locating elements

To extract information from an element or perform a click action, the initial step is to locate the element using selectors. Playwright provides support for both CSS and XPath selectors.

To better understand this concept, let's consider a practical example. Open the following Amazon link: https://www.amazon.com/b?node=17938598011

On this page, you'll notice that all the items are categorized under "International Best Seller," which is represented by <div> elements with the class names "a-section" and "a-spacing-base."

Using the developer tools to locate HTML elements, you can select these <div> elements with a CSS selector such as:

.a-spacing-base

Similarly, the XPath selector for these elements would be:

//*[contains(@class, "a-spacing-base")]

Once you have selected the elements using these selectors, you can utilize the following common functions:

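$eval(selector, function): Selects the first element matching the selector, passes it to the provided function, and returns the function's result.
$$eval(selector, function): Similar to $eval, but collects all elements matching the selector and passes them to the function as a single array.
querySelector(selector): Returns the first element matching the selector. This is the standard DOM method, available inside $eval and $$eval callbacks; Playwright's own equivalents are page.$(selector) in Node.js and page.query_selector(selector) in Python.
querySelectorAll(selector): Returns all elements matching the selector. The Playwright equivalents are page.$$(selector) in Node.js and page.query_selector_all(selector) in Python.

The Playwright methods accept both CSS and XPath selectors, providing flexibility in element selection and manipulation. The DOM methods querySelector and querySelectorAll accept CSS selectors only.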
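To illustrate that Playwright accepts both selector types, here is a short Python sketch that counts the elements matched by the CSS and XPath selectors shown above. The helper name count_matches is an assumption for illustration, and the explicit xpath= prefix is used so that there is no ambiguity about how the selector is interpreted.

```python
import asyncio

CSS_SELECTOR = ".a-spacing-base"
# The explicit "xpath=" prefix tells Playwright how to read the selector.
XPATH_SELECTOR = 'xpath=//*[contains(@class, "a-spacing-base")]'


async def count_matches(page, selectors):
    """Return {selector: number of matching elements} for each selector."""
    counts = {}
    for selector in selectors:
        elements = await page.query_selector_all(selector)
        counts[selector] = len(elements)
    return counts


async def main():
    # Imported here so count_matches() can be reused without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.amazon.com/b?node=17938598011")
        print(await count_matches(page, [CSS_SELECTOR, XPATH_SELECTOR]))
        await browser.close()

# Run with: asyncio.run(main())
```

If both selectors target the same class, the two counts should normally agree, which makes this a quick sanity check when translating a selector from one syntax to the other.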

Scraping text

Continuing with the example of Amazon, after the page has been loaded, you can use a selector to extract all products using the $$eval function:

const products = await page.$$eval('.a-spacing-base', all_products => {
   // run a loop here
})

Now, you can extract the elements containing product data within a loop:

all_products.forEach(product => {
   const title = product.querySelector('span.a-size-base-plus').innerText
})

To read the text of each matched element, use the innerText property. Here's the complete code in Node.js:

const playwright = require("playwright");

(async () => {
  for (const browserType of ['chromium', 'firefox', 'webkit']) {
    // Replace USERNAME and PASSWORD with your proxy credentials.
    const launchOptions = {
      headless: false,
      proxy: {
        server: "http://pr.flipnode.io:7777",
        username: "USERNAME",
        password: "PASSWORD"
      }
    };
    const browser = await playwright[browserType].launch(launchOptions);
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('https://www.amazon.com/b?node=17938598011');
    // The callback runs inside the page and receives every matching element.
    const products = await page.$$eval('.a-spacing-base', all_products => {
      const data = [];
      all_products.forEach(product => {
        // Optional chaining yields null instead of throwing when a field is missing.
        const title = product.querySelector('span.a-size-base-plus')?.innerText ?? null;
        const price = product.querySelector('span.a-price')?.innerText ?? null;
        const rating = product.querySelector('span.a-icon-alt')?.innerText ?? null;
        data.push({ title, price, rating });
      });
      return data;
    });
    console.log(products);
    await browser.close();
  }
})();

The Python code will be slightly different. Python has a function eval_on_selector, which is similar to $eval in Node.js, but it requires JavaScript as the second parameter. In this case, it's better to write the entire code in Python and use query_selector and query_selector_all, which return an element and a list of elements, respectively.

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': 'http://pr.flipnode.io:7777',
                'username': 'USERNAME',
                'password': 'PASSWORD'
            },
            headless=False
        )

        page = await browser.new_page()
        await page.goto('https://www.amazon.com/b?node=17938598011')
        await page.wait_for_timeout(5000)

        all_products = await page.query_selector_all('.a-spacing-base')
        data = []
        for product in all_products:
            result = dict()
            # query_selector returns None when a block lacks the element.
            title_el = await product.query_selector('span.a-size-base-plus')
            result['title'] = await title_el.inner_text() if title_el else None
            price_el = await product.query_selector('span.a-price')
            result['price'] = await price_el.inner_text() if price_el else None
            rating_el = await product.query_selector('span.a-icon-alt')
            result['rating'] = await rating_el.inner_text() if rating_el else None
            data.append(result)
        print(data)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

The output of both the Node.js and Python code will be the same. You can find the complete code used in this article on our GitHub repository for your convenience.
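For completeness, the eval_on_selector_all route mentioned earlier can be sketched as well. It mirrors the Node.js $$eval call by sending a JavaScript function to the page, so all fields are extracted in a single round trip. The names EXTRACT_JS, drop_incomplete, and scrape_products are assumptions for illustration, and filtering out incomplete rows is a design choice of this sketch rather than something the scripts above do.

```python
import asyncio

# JavaScript executed in the page: it receives every element matching the
# selector and returns plain data. Missing fields become null (None in Python).
EXTRACT_JS = """products => products.map(product => ({
    title: product.querySelector('span.a-size-base-plus')?.innerText ?? null,
    price: product.querySelector('span.a-price')?.innerText ?? null,
    rating: product.querySelector('span.a-icon-alt')?.innerText ?? null,
}))"""

FIELDS = ("title", "price", "rating")


def drop_incomplete(rows):
    """Keep only rows in which every field was found and is non-empty."""
    return [row for row in rows if all(row.get(field) for field in FIELDS)]


async def scrape_products(page, selector=".a-spacing-base"):
    """Extract product data with one evaluation inside the browser."""
    rows = await page.eval_on_selector_all(selector, EXTRACT_JS)
    return drop_incomplete(rows)


async def main():
    # Imported here so the helpers above can be reused without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.amazon.com/b?node=17938598011")
        print(await scrape_products(page))
        await browser.close()

# Run with: asyncio.run(main())
```

The trade-off is that the extraction logic lives in a JavaScript string, which is harder to debug from Python, but it avoids one browser round trip per element.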

Scraping Images

Images play an important role in many domains, from e-commerce and social media to research and creative projects. When it comes to web scraping, the ability to extract images from websites adds a new dimension to your data collection efforts. This section explores how to use Playwright for scraping images, so that you can gather visual content for analysis, visualization, or any other purpose that aligns with your business or personal goals.

Understanding the Importance of Scraping Images

Images provide valuable insights: Images often convey information that text alone cannot. Extracting images allows you to capture visual cues, product photos, infographics, and more.
Enhancing data analysis: Visual data can enrich your analysis, allowing you to identify patterns, trends, or anomalies that might not be apparent through textual data alone.
Supporting creative projects: For designers, artists, or content creators, scraping images opens up possibilities for inspiration, reference materials, or even generating new artwork.

Fetching Images with Playwright

Navigating to image-containing web pages: Playwright provides powerful navigation capabilities to reach specific web pages that host the images you want to scrape.
Inspecting and locating image elements: Use Playwright's element selection methods to identify and target image elements within the HTML structure of a page.
Extracting image URLs: Once you have identified the image elements, you can retrieve the URLs pointing to the image files using Playwright's methods.
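The three steps above can be sketched in Python as follows. The helper names clean_image_urls and collect_image_urls are assumptions for illustration; the sketch reads the src property of every <img> element and keeps only distinct http(s) URLs, which is one reasonable filtering policy among several.

```python
import asyncio
from urllib.parse import urlparse


def clean_image_urls(urls):
    """Drop empty, inline (data:) and duplicate URLs, keeping first-seen order."""
    seen = set()
    cleaned = []
    for url in urls:
        if not url or urlparse(url).scheme not in ("http", "https"):
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


async def collect_image_urls(page):
    """Return the URL of every <img> element on an already loaded page."""
    # The DOM property `img.src` is already resolved against the page URL.
    raw = await page.eval_on_selector_all("img", "imgs => imgs.map(img => img.src)")
    return clean_image_urls(raw)


async def main():
    # Imported here so the helpers above can be reused without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        # Placeholder URL chosen for illustration.
        await page.goto("https://example.com")
        print(await collect_image_urls(page))
        await browser.close()

# Run with: asyncio.run(main())
```

Pages that lazy-load images may keep the real URL in another attribute until the image scrolls into view, so the selector or attribute may need adjusting per site.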

Downloading Images

Saving images locally: Playwright can fetch the image bytes, for example through its request API or by reading a response body, and your script can then write them to a local file in a directory of your choice.
Handling multiple images: Iterate through the list of image URLs and download each image individually or in batches, depending on your specific requirements.
Managing file naming and organization: Implement logic to assign meaningful names to downloaded images and organize them in a structured manner for easier retrieval and further processing.
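A minimal Python sketch of these download steps is shown below. It assumes you already have a list of image URLs and a Playwright page; the helper names filename_for and download_images, the numbering scheme, and the default images directory are all assumptions for illustration.

```python
import asyncio
import os
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url, index):
    """Build a stable file name such as '003_photo.jpg' from an image URL."""
    name = os.path.basename(urlparse(url).path) or "image"
    return f"{index:03d}_{name}"


async def download_images(page, urls, out_dir="images"):
    """Fetch each URL through the page's request context and save it to disk."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        response = await page.request.get(url)
        if not response.ok:
            continue  # skip images that could not be fetched
        path = target / filename_for(url, index)
        path.write_bytes(await response.body())
        saved.append(path)
    return saved

# With a live page: saved = await download_images(page, urls)
```

Prefixing each file with its position keeps names unique even when two URLs end in the same file name, and page.request shares cookies with the page's browser context, which helps on sites that require a session.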

Handling Image Processing and Analysis

Image manipulation: Depending on your use case, you may need to perform image processing tasks such as resizing, cropping, or applying filters. Playwright does not process images itself, so use libraries like OpenCV or Pillow on the downloaded files.
Integrating with machine learning frameworks: If you aim to apply machine learning techniques to scraped images, consider feeding the downloaded files into popular frameworks like TensorFlow or PyTorch.

Respecting Copyright and Legal Considerations

Ensure compliance with copyright laws: When scraping images, it's crucial to respect the intellectual property rights of image owners. Familiarize yourself with copyright laws and usage rights associated with the images you scrape.
Attribute and obtain permissions: If you plan to use scraped images for commercial purposes or share them publicly, it's essential to attribute the source and, if necessary, obtain proper permissions from the copyright holders.

With these steps, you can scrape and work with images using Playwright, adding visual data to your data-driven initiatives, creative projects, or research.

Playwright vs Puppeteer and Selenium

When it comes to web scraping and browser automation, Playwright is not the only player in the field. This section compares Playwright with two other popular frameworks: Puppeteer and Selenium. Understanding the similarities and differences between these tools will help you make an informed decision about which one suits your specific scraping needs.

Playwright: Power and Simplicity Combined

Built for modern web development: Playwright, developed by Microsoft, is a relatively new addition to the web scraping ecosystem. It provides a unified API to automate Chromium, Firefox, and WebKit, making it versatile and adaptable to various web development technologies.
Language support: Playwright supports multiple programming languages, including JavaScript, Python, Java, and .NET, making it accessible to developers from different backgrounds.
Powerful cross-browser capabilities: One of Playwright's standout features is its ability to handle cross-browser automation with consistent APIs. This means you can write scripts that work seamlessly across different browsers without needing to rewrite your code.
Enhanced automation capabilities: Playwright offers advanced automation features such as automatic waiting for page loads, network interception and modification, and fine-grained control over browser behavior.
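As an illustration of the network interception mentioned above, the following Python sketch blocks heavy resources so that pages load faster during text scraping. The names BLOCKED_RESOURCE_TYPES, block_heavy_resources, and open_without_images, as well as the choice of which resource types to block, are assumptions for illustration.

```python
import asyncio

# Resource types that are usually not needed when only text is scraped.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


async def block_heavy_resources(route):
    """Abort requests for heavy assets and let everything else through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()


async def open_without_images(browser, url):
    """Open `url` in a new page with the interception rule installed."""
    page = await browser.new_page()
    # The glob "**/*" matches every request URL made by the page.
    await page.route("**/*", block_heavy_resources)
    await page.goto(url)
    return page

# With a live browser: page = await open_without_images(browser, url)
```

The handler must be registered before page.goto() so that it also applies to requests made during the initial page load.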

Puppeteer: A Pioneer in Browser Automation

Developed by Google: Puppeteer, developed by the Chrome team at Google, was one of the first frameworks to provide a high-level API for headless browser automation. It primarily targets Chrome and Chromium and is used from JavaScript.
Chrome-centric automation: If your scraping needs revolve primarily around Chrome browser automation, Puppeteer provides a comprehensive set of features and excellent integration with Chrome's DevTools.
Well-documented and active community: Being around for a longer time, Puppeteer has a mature and well-documented ecosystem. The active community ensures continuous support, regular updates, and a wealth of resources for troubleshooting and learning.

Selenium: Widely Adopted and Cross-Platform

Industry standard for browser automation: Selenium has long been the go-to framework for browser automation and web scraping. It supports multiple browsers, including Chrome, Firefox, Safari, and Edge, making it a versatile choice.
Multi-language support: Selenium supports a range of programming languages, including Java, Python, C#, Ruby, and more, catering to developers with diverse language preferences.
Extensive browser compatibility: Selenium's broad browser compatibility makes it suitable for scraping scenarios where cross-browser testing or scraping from different browser environments is necessary.
Established ecosystem and integrations: Selenium's maturity is reflected in its extensive ecosystem, including plugins, integrations with popular testing frameworks, and ample community support.

Choosing the Right Framework for Your Needs

Consider your specific requirements: When deciding between Playwright, Puppeteer, and Selenium, evaluate factors such as browser compatibility, language support, automation capabilities, community support, and your familiarity with the programming language.
Scalability and future-proofing: Consider the scalability of your scraping project and the potential need for cross-browser compatibility or integration with other tools. Playwright's cross-browser capabilities make it a compelling choice for long-term flexibility.
Project complexity and learning curve: Puppeteer's simplicity and focused approach may be preferable for straightforward scraping tasks, while Selenium's extensive ecosystem and documentation make it a solid choice for more complex scenarios.

Here we have compared Playwright, Puppeteer, and Selenium, highlighting their key features, strengths, and considerations. By understanding the nuances of each framework, you can make an informed decision that aligns with your web scraping objectives.

Comparison of performance

Because the three tools differ significantly in supported programming languages and browsers, not every scenario can be compared directly.

The only combination that can be directly compared is when using JavaScript to automate Chromium. This is the only combination supported by all three tools.

Providing a detailed comparison is beyond the scope of this article. However, you can find more information about the performance of Puppeteer, Selenium, and Playwright in the linked article. The main takeaway is that Puppeteer is the fastest, followed by Playwright. It's worth noting that in some scenarios, Playwright has shown better performance. Selenium is the slowest among the three options.
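If you want to take your own measurements, a small timing harness is enough to get started. The sketch below is an assumption-laden illustration: the names time_call, summarize, and load_page_once and the example.com URL are hypothetical, and it measures only Playwright driven from Python, so it does not reproduce the cross-tool comparison referenced above.

```python
import statistics
import time


def time_call(action, repeats=5):
    """Run `action` `repeats` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report median, fastest and slowest run; a single run is too noisy."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


def load_page_once(url="https://example.com"):
    """One measured unit of work: launch Chromium, load `url`, close."""
    # Imported here so the timing helpers work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        browser.close()
```

A typical call is summarize(time_call(load_page_once)). Network latency dominates such measurements, so comparing tools fairly requires the same machine, the same target page, and several repetitions.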

It's important to remember that Playwright offers additional advantages, such as multi-browser support and compatibility with multiple programming languages.

If you prioritize fast cross-browser web automation or if you are unfamiliar with JavaScript, Playwright is likely the best fit of the three.

Conclusion

This article delved into the capabilities of Playwright as a web testing tool that is also well-suited for web scraping dynamic websites. With its asynchronous nature and cross-browser support, Playwright has emerged as a popular alternative to other tools. The article provided code examples in both Node.js and Python to showcase Playwright's versatility.

Playwright offers a range of functionalities, including navigating to URLs, entering text, clicking buttons, and extracting text, including text that is rendered dynamically by JavaScript. Other tools like Puppeteer and Selenium can perform similar tasks, but Playwright shines when it comes to working with multiple browsers or languages other than JavaScript/Node.js. If you require such capabilities, Playwright is an excellent choice.