Guide to Scraping Data from Websites to Excel with Web Query
Flipnode on May 22 2023
Microsoft Excel provides a convenient feature called Web Query, which enables data extraction from websites directly to Excel. This article explores the various methods for utilizing Web Query to extract public web data effectively.
How web scraping in Excel works
How to scrape website data using Web Query Excel
In this tutorial, we will walk through the process of scraping data from the "books.toscrape.com" website and saving it to an Excel spreadsheet using Web Query. Before we begin, make sure you have the following:
- An active internet connection.
- Microsoft Office is installed on your computer to use Microsoft Excel. If you don't have Microsoft Office, you can download and install it from here.
After installing Microsoft Office, you can proceed with the following steps:
Step 1: Start by opening a new, empty spreadsheet.
Open Microsoft Excel and click on the "Data" tab in the menu.
Step 2: Choose the "From Web" option.
After clicking on "Data," a new menu will appear offering several data extraction options. Select the "From Web" option and click on it to open a new window.
Step 3: In the address bar of the new window, enter the URL of the web page and click on the "Go" button.
In the "New Web Query" window, enter the URL "https://books.toscrape.com" in the address bar and click the "Go" button.
Step 4: Navigate to the book page
Once you click the "Go" button, a website will appear in the mini web browser. You can interact with and navigate the website within this browser. Scroll down the page and click on a book link to open the corresponding book page.
Step 5: Select the desired table to scrape
As you scroll down the book page, you'll come across a table. Look for a small yellow arrow icon next to the table. Clicking on this icon will select the table. Once the table is selected, simply click the "Import" button located below it.
Step 6: Select the "Existing worksheet" option and click "OK"
After clicking the "Import" button, a small window will appear. Make sure to select the "Existing worksheet" option and click "OK" to import the data.
And there you have it! Web Query will initiate a background process to execute the web request and retrieve the website. Once the website is fetched, it will automatically parse the table and extract the data into the Excel columns. You will see the output similar to the example provided below.
You can cross-reference it with the website to ensure the accuracy of the table data. Each column and row will be linked to the web query, allowing Microsoft Excel to update the corresponding rows and columns whenever you manually or automatically refresh the data. The following section delves into various methods for refreshing and updating the data.
How to update and refresh data
There are two primary methods for refreshing the Web Query:
- Automatic Refresh:
In automatic mode, Excel will regularly retrieve the data in the background and update the sheet accordingly. You can customize the refresh frequency to suit your requirements, which will be explained later in this article.
- Manual Refresh:
Now, let's explore different ways to manually refresh the data. We'll start by deleting a few items from the extracted data to ensure that the refresh function operates as expected.
Notice that the price, tax, and availability data have been removed.
Option 1: Click "Refresh" from the top menu
In the top menu, under the "Data" submenu, there is a "Refresh" button. You can update the data by clicking this button or using the shortcut on your keyboard: "CTRL + ALT + F5". Once you perform the refresh, the missing price, tax, and availability values will be updated with the latest information.
Option 2: Refresh from the context menu
To test a different method of data refresh, let's remove a few fields once again. This time, we'll use the "Refresh" button from the context menu instead of the menu button. Right-click on a cell that is managed by Web Query and select "Refresh." Keep in mind that the "Refresh" button will only appear when you right-click on cells managed by Web Query, not on other cells.
Option 3: Edit & rerun the Query
To update the data using another method, let's remove a few fields again. After removing the data, right-click on the cell and select the "Edit Query" option.
It will open a new window called "Edit Web Query." By clicking "Import," the web query will initiate a background process to retrieve the latest updates from the website and refresh the existing data, replacing outdated information with new data.
This method is particularly useful when you need to update the scraper, such as modifying the website URL or adjusting the query to fetch a different table or page.
Now, let's explore how to automate the refresh/update process.
Configure automatic refresh from properties
The steps to configure the automatic refresh process from properties are straightforward. To begin, right-click on a cell linked to the Web Query to open the context menu. Then, choose "Data Range Properties" from the menu.
This will open the "External Data Range Properties" window
In this window, locate the "Refresh Controls" section. By adjusting the settings in this section, you can enable automatic data refresh. Firstly, select the "Enable background refresh" checkbox. Then, using the second checkbox, you can specify the frequency at which the data should be refreshed. For instance, if you set it to refresh every 5 minutes, Excel will automatically retrieve the data at 5-minute intervals, utilizing a background process to update the table automatically.
If you activate the third checkbox, "Refresh data when opening the file," Excel will retrieve the web data and update the sheet every time you open the spreadsheet.