Puppeteer on AWS Lambda

Flipnode on Jun 16 2023

Puppeteer is a powerful tool for web scraping and automation that lets developers control web browsers programmatically, and its versatility and ease of use have made it a popular choice. AWS Lambda, in turn, is a strong option for running it on scalable, serverless infrastructure. This article covers the benefits, challenges, and best practices of running Puppeteer on AWS Lambda for web scraping and browser automation workflows.

What is Puppeteer?

Puppeteer is a Node.js library developed by the Chrome team at Google. It provides a high-level API for controlling and automating web browsers, primarily Google Chrome and Chromium. With Puppeteer, developers can programmatically interact with web pages and perform actions such as clicking buttons, filling in forms, and navigating between pages. It can also capture screenshots, generate PDFs of web pages, and intercept network requests and responses.

One of the key advantages of Puppeteer is its ability to handle dynamic content and JavaScript-heavy websites. It executes JavaScript in the context of the page, enabling developers to scrape data from dynamically generated elements and perform complex interactions with web applications.

Puppeteer offers a comprehensive set of features, including page manipulation, data extraction, and web testing capabilities. Its intuitive and well-documented API makes it accessible to both beginners and experienced developers alike. With Puppeteer, you have the power to automate tasks that would otherwise require manual intervention, saving time and effort in the process.

Now, let's explore how Puppeteer can be seamlessly integrated with AWS Lambda, providing a scalable and cost-effective solution for web scraping and browser automation needs.

What is AWS Lambda?

AWS Lambda is a serverless compute service provided by Amazon Web Services (AWS). It allows you to run your code without provisioning or managing servers. With Lambda, you can focus on writing your application logic while AWS takes care of scaling, patching, and managing the underlying infrastructure.

Lambda follows an event-driven architecture, where your code is executed in response to events such as HTTP requests, database updates, file uploads, or scheduled triggers. Each Lambda function is independent and stateless, designed to perform a specific task. Lambda automatically scales with the incoming workload, so your code runs efficiently and cost-effectively.

One of the significant advantages of AWS Lambda is its pay-per-use pricing model. You only pay for the actual compute time consumed by your code, without any charges for idle resources. This makes it highly cost-effective, especially for applications with sporadic or unpredictable traffic patterns.

AWS Lambda supports multiple programming languages, including Node.js, Python, Java, C#, and more, allowing you to write your code in the language of your choice. It seamlessly integrates with other AWS services, enabling you to build sophisticated serverless architectures and leverage the rich ecosystem of AWS.

By combining the power of Puppeteer with AWS Lambda, you can unlock a serverless infrastructure for running Puppeteer-based web scraping and automation tasks. This fusion of technologies offers numerous benefits, including scalability, cost-efficiency, and simplified deployment and management. In the following sections, we will explore the considerations and best practices for using Puppeteer on AWS Lambda to maximize the potential of your web scraping workflows.

Problem #1 – Puppeteer is too big to push to Lambda

A zip file uploaded directly to AWS Lambda is limited to 50 MB, which a Puppeteer deployment package can exceed. A package loaded from an S3 bucket is not subject to this limit, although its unzipped contents must still fit within 250 MB. Refer to the AWS Lambda documentation on deployment package limits for more details.
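Both limits can be checked before deploying. The following is a minimal sketch, assuming the 50 MB and 250 MB figures quoted in this section; the function name chooseUploadPath and its return values are illustrative and not part of any AWS tooling.

```javascript
// Assumed limits, taken from the figures quoted in this section;
// verify them against the current AWS documentation.
const DIRECT_UPLOAD_LIMIT_BYTES = 50 * 1024 * 1024;
const UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024;

// Hypothetical helper: decide how a package of the given sizes can be deployed.
function chooseUploadPath(zippedBytes, unzippedBytes) {
  if (unzippedBytes > UNZIPPED_LIMIT_BYTES) {
    return 'too-large'; // no upload method helps; the package must shrink
  }
  return zippedBytes > DIRECT_UPLOAD_LIMIT_BYTES ? 's3' : 'direct';
}

console.log(chooseUploadPath(80 * 1024 * 1024, 200 * 1024 * 1024)); // prints "s3"
```

The zipped size can be read with fs.statSync('function.zip').size; the unzipped size has to be measured from the files that go into the archive.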

To get around the 50 MB direct-upload limit, follow these steps:

Create a bucket in Amazon S3.
Use an npm script to zip the function and upload the archive to the bucket.
Update the Lambda function code so that it points to the archive in the bucket.

Here is an example of the npm scripts (entries in the scripts section of package.json) that carry out this process:

{
  "zip": "npm run build && 7z a -r function.zip ./dist/* node_modules/",
  "sendToLambda": "npm run zip && aws s3 cp function.zip s3://chrome-aws && rm function.zip && aws lambda update-function-code --function-name puppeteer-examples --s3-bucket chrome-aws --s3-key function.zip"
}

In the scripts above, "zip" builds the project and packages the dist output and node_modules into "function.zip" (this assumes the 7z command is installed). "sendToLambda" then runs "zip", uploads the archive to the S3 bucket "chrome-aws" with the AWS CLI, removes the local file, and points the code of the Lambda function "puppeteer-examples" at the object stored in that bucket under the key "function.zip."
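The bucket and function names are hardcoded in the scripts above. As a rough sketch of how the same two AWS CLI commands could be generated for other names, the hypothetical helper below (buildDeployCommands is not part of any library) builds the command strings from parameters. Unlike the original script, it spells out the object key in the S3 destination.

```javascript
// Hypothetical helper: build the upload and update commands used by
// "sendToLambda" for any bucket, key, and function name.
function buildDeployCommands({ bucket, key, functionName, zipFile = 'function.zip' }) {
  return [
    // Upload the archive to the bucket under an explicit key.
    `aws s3 cp ${zipFile} s3://${bucket}/${key}`,
    // Point the Lambda function at the uploaded object.
    `aws lambda update-function-code --function-name ${functionName} --s3-bucket ${bucket} --s3-key ${key}`,
  ];
}

const commands = buildDeployCommands({
  bucket: 'chrome-aws',
  key: 'function.zip',
  functionName: 'puppeteer-examples',
});
console.log(commands.join(' && '));
```

The joined string could be passed to child_process.execSync, assuming the AWS CLI is installed and configured with credentials.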

Problem #2 – Puppeteer on AWS Lambda doesn’t work

The AWS Lambda execution environment, like many minimal Linux environments, does not ship with the shared libraries that Chrome needs, so a stock Puppeteer installation fails to launch its browser there.

The chrome-aws-lambda package, available on npm, solves this by shipping a Chromium binary built to run on Lambda. To use Puppeteer on AWS Lambda, install both chrome-aws-lambda and puppeteer-core (the Puppeteer library without its own browser download) in your function.

You can install these packages using the following command:

npm i --save chrome-aws-lambda puppeteer-core

When setting up your Puppeteer code to launch a browser, you can use the following code snippet as a reference:

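const chromium = require('chrome-aws-lambda');

// Run this inside an async function, such as the Lambda handler.
const browser = await chromium.puppeteer.launch({
  args: chromium.args, // Chromium flags recommended by the package
  defaultViewport: chromium.defaultViewport,
  executablePath: await chromium.executablePath, // path to the packaged Chromium binary
  headless: chromium.headless
});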

Make sure to include this code snippet in your Lambda function to ensure proper browser launching using Puppeteer with the Chrome AWS Lambda package.
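For context, here is one way the launch call might sit inside a complete handler. This is a sketch under assumptions rather than a definitive implementation: the createHandler factory and the event.url field are hypothetical, and the chromium module is passed in as an argument only so that the handler can be tried locally with a stand-in object.

```javascript
// Hypothetical factory: takes the chromium module (chrome-aws-lambda) as an
// argument and returns an async Lambda handler.
function createHandler(chromium) {
  return async function handler(event) {
    let browser = null;
    try {
      browser = await chromium.puppeteer.launch({
        args: chromium.args,
        defaultViewport: chromium.defaultViewport,
        executablePath: await chromium.executablePath,
        headless: chromium.headless,
      });
      const page = await browser.newPage();
      // event.url is an assumed input field, not something Lambda provides.
      await page.goto(event.url);
      return { title: await page.title() };
    } finally {
      // Close the browser whether or not the steps above succeeded.
      if (browser !== null) {
        await browser.close();
      }
    }
  };
}

// In the deployed function (requires the chrome-aws-lambda package):
// exports.handler = createHandler(require('chrome-aws-lambda'));
```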

Final note

When working with Puppeteer, keep the memory requirements of your script in mind: running a browser takes considerably more memory than a typical Lambda function. Allocate at least 512 MB of memory to your AWS Lambda function when using Puppeteer.
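A function can check its own memory setting at startup. The sketch below relies on the AWS_LAMBDA_FUNCTION_MEMORY_SIZE environment variable that the Lambda runtime sets; the hasEnoughMemory helper and the warning text are illustrative assumptions.

```javascript
// Minimum suggested in this article.
const MIN_MEMORY_MB = 512;

// Hypothetical helper: true when the configured memory meets the minimum.
// `env` defaults to process.env so the check can be tried with a plain object.
function hasEnoughMemory(env = process.env, minMb = MIN_MEMORY_MB) {
  const configuredMb = Number.parseInt(env.AWS_LAMBDA_FUNCTION_MEMORY_SIZE, 10);
  return Number.isFinite(configuredMb) && configuredMb >= minMb;
}

if (!hasEnoughMemory()) {
  console.warn(
    `Less than ${MIN_MEMORY_MB} MB configured (or not running on Lambda); the browser may run out of memory.`
  );
}
```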

Additionally, it's crucial to call await browser.close() at the end of your script, ideally in a finally block so that it runs even when an error is thrown. If the browser is left open, your function may keep running until it times out even though no commands are being executed, because the browser instance stays alive waiting for further instructions, and that idle time consumes resources you pay for.
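One way to make sure the browser is always closed is to wrap the work in a small helper. This is a minimal sketch; withBrowser is a hypothetical name, and the launch argument stands for any function that returns a Puppeteer browser.

```javascript
// Hypothetical helper: launch a browser, run `task` with it, and close the
// browser afterwards, even if `task` throws.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}
```

With the chrome-aws-lambda package, launch would typically be a function that calls chromium.puppeteer.launch with the options shown earlier, and task would receive the browser, open pages, and return a result.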