Guide to Using Google Sheets for Basic Web Scraping
Flipnode on Jun 14 2023
Scraping data from websites using Google Sheets is a user-friendly and code-free method. Unlike traditional web scraping techniques that require coding, Google Sheets provides a built-in function that simplifies the process. This built-in function acts as a basic web scraper, making it accessible even to users without coding knowledge.
In this guide, we will walk you through a practical example of how to scrape website data using Google Sheets. You will learn how to leverage the scraping capabilities of Google Sheets to effortlessly extract data from websites.
Importing XML and HTML
The IMPORTXML function is a versatile tool for importing data from different structured data formats. While commonly used for scraping XML documents, it can also extract information from HTML documents. In addition to XML and HTML, the IMPORTXML formula can handle other structured data types such as Comma Separated Values (CSV), Tab Separated Values (TSV), Really Simple Syndication (RSS), and Atom XML feeds.
How IMPORTXML works
The IMPORTXML function requires two parameters: the URL of the webpage to analyze and the XPath query.
If you're interested in viewing the official documentation, you can click here. However, this article will provide a comprehensive explanation of the IMPORTXML function and other related functions. Let's explore some examples to gain a better understanding of how the IMPORTXML function works.
To extract the title element from the webpage https://quotes.toscrape.com/, you can use the following formula:
=IMPORTXML("https://quotes.toscrape.com/","//title")
In this example, the first parameter represents the website URL, and the second parameter is the XPath query. It's important to enclose both parameters in quotation marks, unless they are referencing other cells.
You can create an XPath query directly from your browser. Simply open the webpage in your preferred browser, right-click on the desired element, and select "Inspect". This will open the developer tools with the HTML element highlighted. Right-click on the highlighted HTML element, choose "Copy", and then select "Copy XPath". This action will copy the XPath and store it in your clipboard.
Let's consider another example. If you want to extract the first quote from the webpage, the formula would be as follows:
=IMPORTXML("https://quotes.toscrape.com/","(//*[@class='text']/text())[1]")
If you find this XPath query unfamiliar or challenging, we recommend reading the XPath section on our blog to learn more about crafting XPath queries.
Alternatively, you can input the URL and XPath query in cells. For instance, if you enter the URL in cell A1 and the XPath in cell A2, your formula will reference these cells as follows:
=IMPORTXML(A1,A2)
Quick XPath introduction
IMPORTXML is exclusively designed to work with XPath queries. This section provides a brief introduction to XPath, which will be sufficient to begin using the IMPORTXML formula.
If you have an HTML structure where the body tag directly contains an h1 element, you can use the following XPath to locate that specific h1 element:
/html/body/h1
In this XPath expression, the forward slash at the beginning represents the document. The breakdown of the above XPath is as follows:
- Start at the root of the document
- Locate an html tag within the document
- Find a body tag within the html tag
- Retrieve an h1 tag
If you want to match all h1 tags, you can use a shorter XPath expression:
//h1
Using two forward slashes "//" signifies matching all descendants. This XPath expression will locate all h1 elements found anywhere in the document. To extract the text contained within the h1 element, you can employ the text() function:
//h1/text()
You can also extract the value of any attribute. For instance, if you wish to extract the value of the class attribute from h1 tags, the XPath would be as follows:
//h1/@class
As demonstrated, when working with attributes, you need to prefix the attribute name with the "@" character.
Let's now explore filtering. Suppose you only want to retrieve h1 elements with a class named "pdp_orange" applied. The following XPath accomplishes that:
//h1[@class="pdp_orange"]
From this example, we learn that square brackets are used for filtering purposes.
How to extract data from a website to Google Sheets
Follow the steps outlined below to extract data from a website and import it into Google Sheets without using add-ons:
- Start by finding the XPath to select the desired elements. Load the website in Google Chrome, right-click on the element you want to extract, and choose "Inspect."
The Developer Tools window will open with the Elements panel. To select a single element, right-click on the corresponding HTML element, select "Copy," and click on "Copy XPath."
For example, let's work with the website https://books.toscrape.com/ and extract all the book titles. This requires creating a custom XPath like this:
//h3/a/@title
While the Elements tab is open in the Developer Tools, press Ctrl+F or Command+F, and paste the XPath. You will see a preview of the 20 selected items.
- Open Google Sheets and create a new sheet. If you haven't logged into your Google account, you will need to do so. Depending on your interface, you may find a "New sheet" tile at the top or a floating "New Sheet" button in the lower right corner. You can also directly create a new sheet by visiting sheets.new.
- In two cells of the newly created sheet, enter the URL of the webpage and the XPath. By entering these values in cells, we can easily maintain the formula.
- Now, we can start extracting website data using Google Sheets. In a new cell, for example, A4, enter the following formula:
=IMPORTXML(B1, B2)
This formula effectively calls the function below, assuming the URL is in cell B1 and the XPath is in cell B2:
=IMPORTXML("https://books.toscrape.com/","//h3/a/@title")
Press Enter to execute the function. It will take a moment and retrieve all twenty book titles.
You will notice that a single formula efficiently extracts all the book titles. To extract book prices, the first step is to create the XPath for prices, which would be as follows:
//*[@class="price_color"]/text()
Enter this XPath in a cell, let's say B3. Then, enter the following formula in cell B4:
=IMPORTXML(B1, B3)
You will now see the fetched book prices.
Other related functions
In addition to IMPORTXML, Google Sheets provides several other functions that allow web scraping directly from the document without the need for add-ons. These native functions include:
- IMPORTHTML: This function is used to extract data from tables and lists on a webpage.
- IMPORTFEED: With the IMPORTFEED function, you can import data from RSS or Atom feeds.
- IMPORTDATA: The IMPORTDATA function enables scraping data from a website URL that contains data in CSV or TSV format.
These functions provide additional flexibility for web scraping within Google Sheets, allowing you to extract data from different sources and formats. In the following sections, we will delve into these functions in more detail.
Import a table from a website to Google Sheets
If the webpage you want to scrape contains data in a table format, the IMPORTHTML function in Google Sheets is the ideal choice. This function requires three parameters:
- URL: This is the complete URL of the webpage you wish to scrape, including the "http://" or "https://" part.
- Either "table" or "list": You can use the IMPORTHTML function to extract data from both tables and lists. Set this parameter to "table" if you want to scrape a table.
- The index of the table or list you want to scrape: Note that the index starts from 1. If the webpage contains multiple tables or lists, the number 1 may refer to different tables or lists depending on the second parameter.
Let's take an example using the webpage "https://en.wikipedia.org/wiki/List_of_highest-grossing_films". Follow these steps:
- Create a new sheet in Google Sheets and enter the URL in a cell, such as B1.
- In cell A3, enter the following formula:
=IMPORTHTML(B1, "table", 1)
This formula retrieves the first table from the webpage and imports it into the Google spreadsheet.
You can also extract a single column from the table using the INDEX function. The INDEX function requires the table reference as the mandatory parameter and row and column as optional parameters. For example, if you want to extract only the movie titles from column 3, use the following formula:
=INDEX(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_highest-grossing_films", "table", 1), , 3)
Here, we have omitted the optional "row" parameter to keep all the rows.
Now let's move on to another useful function, IMPORTFEED.
Import data from XML feeds to Google Sheets
RSS and Atom are two common XML formats used to generate website feeds. While it's possible to scrape data from these feeds using the IMPORTXML function and XPath queries, a specialized function is more convenient. Although there are third-party add-ons available, we can achieve the same result without them.
Enter the IMPORTFEED function, which only requires one argument: the URL of the feed. This function automatically fetches the data from the feed and presents it in Google Sheets. Let's use the New York Times Technology feeds as an example to demonstrate this function.
Follow these steps:
1.Create a new sheet in Google Sheets and enter the feed URL in cell B1:
https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml
2.In cell A2, enter the following formula:
=IMPORTFEED(B1)
3.Press Enter, and you'll see twenty feed items imported into your Google Sheets document.
By using the IMPORTFEED function, you can easily retrieve and display the feed data without the need for additional add-ons or complex XPath queries.
Customizing data imported by IMPORTFEED
The IMPORTFEED function provides several optional parameters:
- Query: You can use this parameter to specify the specific information you want to import from the feed.
- Headers: By default, imported data doesn't include column headers. If you want to include headers, set this parameter to TRUE.
- num_items: This parameter allows you to control the number of items to be imported from the feed. For example, if you want to import only five items, set this parameter to 5.
Let's update the function call with these parameters:
=IMPORTFEED(B1,,TRUE,5)
In this updated formula, we have left the query parameter blank, which imports all columns. However, only five items will be fetched along with column headers.
If you want to extract specific information from the feed, you can provide a list of possible items as the second parameter. For instance, to retrieve only the feed information, you can use the following formula:
=IMPORTFEED(B1,"feed")
This formula will return a single row with the feed information, such as "NYT > Technology" and "https://www.nytimes.com/section/technology".
To narrow down the extracted feed information further, you can specify the specific piece of information you want. For example, if you only need the feed title, modify the formula as follows:
=IMPORTFEED(B1,"feed title")
Other available options include feed description, feed author, and feed URL. If any of the specified information is not found in the feed, you will see an error.
If you want to extract a specific column, you can use the query format "items <type>". For instance, to retrieve only the titles, use the following formula:
=IMPORTFEED(B1,"items title")
Please note that you can't specify multiple columns in this manner. If you need to import multiple columns, you will need to make separate function calls for each column.
Importing Data from CSV to Google Sheets
If you have a website URL that includes a CSV file, you can utilize the IMPORTDATA function to retrieve the data. Follow these steps:
Create a new sheet and input the URL in cell B1. For instance:
https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment.csv
In cell A2, enter the following formula:
=IMPORTDATA(B1)
Upon pressing Enter, the function will fetch and display the data from the CSV file in your Google Sheets document.
Does the data stay fresh?
If you keep your Google Sheet open, the mentioned functions will automatically check for updated data approximately every hour. Additionally, if you delete a cell and add it again, the data will be refreshed. However, please note that the data will not be refreshed if you manually refresh the entire sheet or if you copy and paste a cell containing these functions.
Advantages and drawbacks of import functions
Using the import functions in Google Sheets offers several advantages:
- No coding required: One of the main advantages is that you don't need coding skills to use these functions. You can create a web scraper in Google Sheets without the need for additional add-ons.
- Automatic data freshness: The extracted data remains reasonably fresh as the functions periodically check for updates. This ensures that you have access to the most up-to-date information without manual intervention.
- Dynamic imports: The import functions can be used as regular formulas in Google Sheets, allowing you to create dynamic imports by referencing other cells. This flexibility enables you to customize and automate your data retrieval process.
However, there are also some limitations to consider:
- Lack of scalability: Google Sheets is not designed for handling large-scale data imports. If you need to import millions of records, you may encounter performance issues or limitations with Google Sheets' capacity.
- Limited header customization: The headers sent with the requests are standard Google headers, including the user-agent. This can sometimes result in websites blocking the requests, and there is no option to customize the headers to bypass such restrictions.
- No support for POST requests: The import functions in Google Sheets only support GET requests. If your web scraping task requires more complex interactions or involves sending POST requests, these functions may not be suitable.
It's important to consider these drawbacks when deciding whether to use the import functions in Google Sheets for your web scraping needs.
Common Errors
Here are some common errors that you may encounter when creating a Google spreadsheet for web scraping:
- Error: Array result was not expanded
This error occurs when the imported data is too large to fit into the selected range of cells. To resolve this error, you need to expand the range of cells to accommodate the complete data. - Error: Result too large
If you receive this error, it means that the imported data is too large to be handled by Google Sheets. This typically happens when using the IMPORTXML function with a large amount of data. To fix this, you can try refining your XPath query to retrieve a smaller set of data. - Errors related to volatile functions
Google Sheets restricts the use of volatile functions like NOW(), RAND(), or RANDBETWEEN() in the import functions. If you encounter an error mentioning these functions, it means you're trying to reference them directly or indirectly within the import function. To resolve this, copy and paste the values from these volatile functions and then reference the static values in your import function.
It's important to troubleshoot and resolve these errors to ensure your web scraping workflow functions smoothly.
Conclusion
Google Sheets provides a user-friendly approach to web scraping, allowing you to gather data without the need for coding. One of the key advantages is the ability to collect data directly into a spreadsheet. However, it's important to note that Google Sheets has certain limitations, and for more advanced web scraping tasks, you may need to explore the use of specialized web scraping tools or create custom web scrapers.