Scraping Amazon Product Data: A Complete Guide
Flipnode on Jun 20 2023
In today's highly competitive e-commerce landscape, having access to comprehensive and up-to-date product data is crucial for making informed business decisions and gaining a competitive edge. Amazon, being the largest online marketplace, holds a treasure trove of valuable information. But how can you extract this data efficiently and effectively?
In this comprehensive guide, we will take you on a journey to become a proficient Amazon scraper. Whether you're an e-commerce entrepreneur, a data analyst, or a web scraping enthusiast, this step-by-step tutorial will equip you with the knowledge and skills needed to scrape product data from Amazon.
We will start from scratch, exploring the fundamental concepts of web scraping and gradually diving into advanced techniques tailored specifically for scraping Amazon. From handling anti-scraping measures to extracting product details, customer reviews, pricing information, and more, you will learn the ins and outs of building a robust and reliable Amazon scraper.
Join us as we unravel the secrets behind scraping Amazon product data and empower yourself with the tools to extract valuable insights and gain a competitive advantage in the ever-evolving world of e-commerce. Let's embark on this exciting journey and unlock the full potential of Amazon scraping together.
Setting up for scraping
To get started with web scraping Amazon, you'll need Python installed on your system. If you don't have Python 3.8 or above, you can download and install it from python.org.
Once Python is installed, create a folder to store your web scraping code for Amazon. It's a good practice to create a virtual environment for your project.
On macOS and Linux, use the following commands to create and activate the virtual environment:
$ python3 -m venv .env
$ source .env/bin/activate
On Windows, the commands will be slightly different:
d:\amazon> python -m venv .env
d:\amazon> .env\scripts\activate
Next, you'll need to install the required Python packages. There are two main steps involved: retrieving the HTML and parsing it to extract relevant data.
The Requests library is commonly used for making HTTP requests in Python. It simplifies the process of sending requests and receiving responses from web servers. However, it returns the HTML as a plain string, which makes it difficult to extract specific elements such as prices.
To overcome this limitation, we can use Beautiful Soup. It's a Python library specifically designed for web scraping, allowing you to extract information from HTML and XML files by searching for tags, attributes, or specific text.
You can install both libraries, together with the lxml parser that Beautiful Soup will use later in this guide, using the following command:
$ python3 -m pip install requests beautifulsoup4 lxml
For Windows users, replace python3 with python in the command.
Note that we're installing version 4 of the Beautiful Soup library.
Now let's try out the Requests scraping library. Create a new file named amazon.py and enter the following code:
import requests
url = 'https://www.amazon.com/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L'
response = requests.get(url)
print(response.text)
Save the file and run it from the terminal using the following command:
$ python3 amazon.py
However, in most cases, you won't be able to view the desired HTML. Amazon blocks such requests, and the response will show a message like:
"To discuss automated access to Amazon data please contact [email protected]."
Instead of the expected success status code (200), you'll receive an error code like 503. Amazon detects that the request is not coming from a browser and blocks it.
To overcome this, you can send headers along with your request to mimic a browser. One critical header is the User-Agent, which identifies the browser being used. You can find the User-Agent by inspecting network requests in your browser's developer tools.
Copy the User-Agent and create a dictionary for the headers. Here's an example with the User-Agent and Accept-Language headers:
custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'Accept-Language': 'en-GB,en;q=0.9',
}
You can pass this dictionary as the headers parameter in the get method:
response = requests.get(url, headers=custom_headers)
Executing the code with these changes will fetch the expected HTML with the product details.
It's worth noting that by sending a complete, browser-like set of headers, you can often avoid the need for JavaScript rendering altogether. However, if rendering is required, tools like Playwright or Selenium can be used.
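If rendering does turn out to be necessary, a minimal sketch with Playwright might look like the following. It assumes Playwright is installed (pip install playwright, then playwright install chromium); the fetch_rendered_html helper is illustrative only and reuses the custom_headers dictionary defined above:
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    # Load the page in a headless browser so JavaScript-generated content is rendered.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent=custom_headers['User-Agent'])
        page.goto(url)
        html = page.content()
        browser.close()
        return html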
Scraping Amazon product data
When scraping Amazon products, there are typically two types of pages to work with: the category page and the product details page.
For instance, you can open the URL https://www.amazon.com/b?node=12097479011 or perform a search for "Over-Ear Headphones" on Amazon. The resulting page that displays the search results is considered the category page.
The category page presents essential information such as the product title, product image, product rating, product price, and, most importantly, the URLs to individual product pages. However, if you require more detailed information like product descriptions, you will need to visit the product details page.
Let's examine the structure of the product details page.
Open a product URL like https://www.amazon.com/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L in Chrome or any modern browser. Right-click on the product title and select "Inspect" from the context menu. You will notice that the HTML markup for the product title is highlighted.
You will observe that the product title resides within a span tag with the id attribute set to "productTitle".
Similarly, if you right-click on the price and choose "Inspect," you will see the HTML markup for the price.
You can identify that the dollar component of the price is within a span tag with the class "a-price-whole," while the cents component resides in another span tag with the class "a-price-fraction."
Likewise, you can locate the rating, image, and description elements.
Once you have identified this information, you can add the following lines to the existing code we have written so far:
1. Send a GET request with custom headers
from bs4 import BeautifulSoup

response = requests.get(url, headers=custom_headers)
soup = BeautifulSoup(response.text, 'lxml')
Beautiful Soup offers two ways of selecting tags: the find methods and CSS selectors. Both can achieve the same results, but for this guide we will use CSS selectors, as they are widely supported across the web scraping tools used for extracting Amazon product data.
With the Soup object at our disposal, we can now proceed to query for specific information.
2. Locate and scrape product name
The product name or title can be found within a span element with the unique id "productTitle". Selecting elements using this id is straightforward.
Consider the following code snippet as an example:
title_element = soup.select_one('#productTitle')
By passing the CSS selector to the select_one method, we obtain an instance of the element.
To extract the information from the text, we can use the text attribute.
title = title_element.text
If you print it, you may notice some leading or trailing whitespace. To remove them, simply add the .strip() function call as shown below:
title = title_element.text.strip()
3. Locate and scrape product rating
To scrape Amazon product ratings, some additional steps are required.
First, let's define a selector for the rating:
rating_selector = '#acrPopover'
Next, the following statement will select the element that holds the rating information:
rating_element = soup.select_one(rating_selector)
Keep in mind that the rating value is stored within the title attribute:
rating_text = rating_element.attrs.get('title')
print(rating_text)
# prints '4.6 out of 5 stars'
Finally, we can utilize the replace method to extract the numeric rating:
rating = rating_text.replace('out of 5 stars', '')
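The extracted text still has surrounding whitespace. If you want the rating as a number, a small illustrative addition is to strip it and convert it to a float:
# Illustrative: assumes the text follows the 'X.X out of 5 stars' pattern shown above.
rating = float(rating_text.replace('out of 5 stars', '').strip())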
4. Locate and scrape product price
The product price can be found in two locations - below the product title and in the Buy Now box.
We have the option to scrape Amazon product prices from either of these elements.
Let's define a CSS selector for the price:
price_selector = '#price_inside_buybox'
We can then use this CSS selector with the select_one method of BeautifulSoup:
price_element = soup.select_one(price_selector)
Now, you can print the price:
print(price_element.text)
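Note that the buy box price element may be missing from the HTML for some products. In that case, one option is to fall back on the a-price-whole and a-price-fraction spans identified earlier. Here is a hedged sketch, assuming both spans are present on the page:
# Illustrative fallback: rebuild the price from its whole and fractional parts.
if price_element is None:
    whole = soup.select_one('span.a-price-whole')
    fraction = soup.select_one('span.a-price-fraction')
    if whole and fraction:
        price = f"{whole.text.strip().rstrip('.')}.{fraction.text.strip()}"
        print(price)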
5. Locate and scrape product image
Now let's scrape the default image. The CSS selector for the image is #landingImage. With this information, we can write the following code to extract the image URL from the src attribute:
image_element = soup.select_one('#landingImage')
image_url = image_element.attrs.get('src')
By executing these lines of code, you will obtain the URL of the default image.
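If you also want to save the image locally, a minimal sketch is to download it with Requests; the output filename here is arbitrary:
# Illustrative: download the default image and write it to disk.
image_response = requests.get(image_url, headers=custom_headers)
with open('product_image.jpg', 'wb') as f:
    f.write(image_response.content)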
6. Locate and scrape product description
The next step in scraping Amazon product information is the product description. The methodology remains unchanged: create a CSS selector and use the select_one method. The CSS selector for the description is as follows:
description_selector = '#productDescription'
description_element = soup.select_one(description_selector)
print(description_element.text)
To handle product listings, you need to begin with product listing or category pages. For instance, if you visit the category page for over-ear headphones at https://www.amazon.com/b?node=12097479011, you'll observe that all the products are enclosed within a div element with a special attribute [data-asin]. Inside that div, the product links are present within an h2 tag.
Taking this into consideration, the CSS selector would be:
product_selector = '[data-asin] h2 a'
To extract the href attribute from this selector and iterate over the links, remember that the links will be relative. You'll need to use the urljoin function to resolve them into absolute URLs. Here's an example code snippet:
from urllib.parse import urljoin

def parse_listing(listing_url):
    # ...
    link_elements = soup_search.select(product_selector)
    page_data = []
    for link in link_elements:
        full_url = urljoin(listing_url, link.attrs.get("href"))
        product_info = get_product_info(full_url)
        page_data.append(product_info)
When dealing with pagination, you can identify the link to the next page by searching for a link that contains the text "Next" using the CSS contains operator. If such a link exists, you can retrieve its href attribute, join it with the listing URL using urljoin, and proceed to the next page.
next_page_el = soup.select_one('a:contains("Next")')
if next_page_el:
    next_page_url = urljoin(listing_url, next_page_el.attrs.get('href'))
Remember to handle pagination within your scraping code to navigate through multiple pages of product listings effectively.
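If you prefer to avoid recursion, or want to cap how many pages are visited, an iterative variant is also possible. The sketch below reuses the custom_headers, get_product_info, and selectors introduced above; the max_pages cap is illustrative:
def parse_listing_iterative(start_url, max_pages=5):
    # Iterative variant of parse_listing with an illustrative page cap.
    page_data = []
    current_url = start_url
    for _ in range(max_pages):
        response = requests.get(current_url, headers=custom_headers)
        soup = BeautifulSoup(response.text, 'lxml')
        for link in soup.select('[data-asin] h2 a'):
            full_url = urljoin(current_url, link.attrs.get('href'))
            page_data.append(get_product_info(full_url))
        next_page_el = soup.select_one('a:contains("Next")')
        if not next_page_el:
            break
        current_url = urljoin(current_url, next_page_el.attrs.get('href'))
    return page_data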
7. Export scraped product data to a CSV file
The get_product_info function returns the scraped data as a dictionary. To consolidate all the scraped products, we can collect these dictionaries in a list.
def parse_listing(listing_url):
    # ...
    page_data = []
    for link in link_elements:
        # ...
        product_info = get_product_info(full_url)
        page_data.append(product_info)
To further utilize this page_data and create a Pandas DataFrame object, you can follow these steps:
import pandas as pd
# ...
df = pd.DataFrame(page_data)
df.to_csv('headphones.csv', index=False)
By creating a list page_data and appending the scraped product information to it, you can then convert it into a Pandas DataFrame object using pd.DataFrame(page_data). Finally, you can save the DataFrame as a CSV file named 'headphones.csv' using the to_csv() method, with index=False to exclude the index column in the CSV file.
Reviewing final script
After putting everything together, you get the following final script:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

custom_headers = {
    "accept-language": "en-GB,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
}


def get_product_info(url):
    # Fetch a product details page and extract the fields covered above.
    response = requests.get(url, headers=custom_headers)
    if response.status_code != 200:
        print("Error in getting webpage")
        exit(-1)

    soup = BeautifulSoup(response.text, "lxml")

    title_element = soup.select_one("#productTitle")
    title = title_element.text.strip() if title_element else None

    price_element = soup.select_one("#price_inside_buybox")
    price = price_element.text if price_element else None

    rating_element = soup.select_one("#acrPopover")
    rating_text = rating_element.attrs.get("title") if rating_element else None
    rating = rating_text.replace("out of 5 stars", "") if rating_text else None

    image_element = soup.select_one("#landingImage")
    image = image_element.attrs.get("src") if image_element else None

    description_element = soup.select_one("#productDescription")
    description = description_element.text.strip() if description_element else None

    return {
        "title": title,
        "price": price,
        "rating": rating,
        "image": image,
        "description": description,
        "url": url,
    }


def parse_listing(listing_url):
    # Scrape every product linked from a listing page, then follow the "Next" link.
    response = requests.get(listing_url, headers=custom_headers)
    soup_search = BeautifulSoup(response.text, "lxml")
    link_elements = soup_search.select("[data-asin] h2 a")
    page_data = []

    for link in link_elements:
        full_url = urljoin(listing_url, link.attrs.get("href"))
        print(f"Scraping product from {full_url[:100]}", flush=True)
        product_info = get_product_info(full_url)
        page_data.append(product_info)

    next_page_el = soup_search.select_one('a:contains("Next")')
    if next_page_el:
        next_page_url = next_page_el.attrs.get('href')
        next_page_url = urljoin(listing_url, next_page_url)
        print(f'Scraping next page: {next_page_url}', flush=True)
        page_data += parse_listing(next_page_url)

    return page_data


def main():
    search_url = "https://www.amazon.com/s?k=bose&rh=n%3A12097479011&ref=nb_sb_noss"
    data = parse_listing(search_url)
    df = pd.DataFrame(data)
    df.to_csv('amz.csv', index=False)


if __name__ == '__main__':
    main()
The script defines the custom headers, the get_product_info function for scraping individual product pages, and the parse_listing function for walking through listing pages. The main function runs the scraping process, converts the collected data into a Pandas DataFrame, and saves it as a CSV file named 'amz.csv', with index=False to exclude the index column.
Best practices
Scraping data from Amazon can be challenging without proxies or dedicated scraping tools due to various obstacles. Similar to other popular scraping targets, Amazon implements rate-limiting mechanisms, which can lead to IP address blocking if the established limits are exceeded. Additionally, Amazon employs bot-detection algorithms that scrutinize HTTP headers for any suspicious details. Moreover, you should be prepared to adapt continually to different page layouts and varying HTML structures.
Considering these factors, it is advisable to follow some common practices to minimize the risk of detection and blocking by Amazon. Here are some useful tips:
- Use a real User-Agent: It is crucial to make your User-Agent appear as realistic as possible. You can refer to a list of commonly used user agents to select an appropriate one.
- Set consistent fingerprint parameters: Websites often employ Transmission Control Protocol (TCP) and IP fingerprinting to identify bot activity. To avoid detection, ensure that your fingerprint parameters remain consistent throughout your scraping process.
- Vary your crawling pattern: To develop an effective crawling pattern, mimic the behavior of a regular user while exploring a page. Incorporate clicks, scrolls, mouse movements, and randomized pauses between requests so your activity resembles human interaction with the site; a minimal sketch combining randomized delays with rotating user agents follows this list.
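Here is a minimal sketch, assuming the Requests-based setup from this guide, of what rotating user agents and adding randomized delays could look like. The USER_AGENTS list and the polite_get helper are illustrative and not part of the final script:
import random
import time

import requests

# A small pool of realistic user agents; extend it with entries from a public user-agent list.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
]

def polite_get(url):
    # Pick a user agent at random and pause briefly so requests are spread out.
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-GB,en;q=0.9",
    }
    time.sleep(random.uniform(2, 6))
    return requests.get(url, headers=headers)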
Conclusion
You can scrape Amazon products by writing your own code with the Requests and Beautiful Soup libraries. While it may require some effort, this approach is effective. By applying techniques such as sending custom headers, rotating user agents, and rotating proxies, you can reduce the risk of bans and rate limiting imposed by Amazon and keep the scraping process running smoothly and without interruption.
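As an illustration of the proxy rotation idea, here is a minimal sketch; the proxy URLs are placeholders that you would replace with your own provider's endpoints:
import random
import requests

# Placeholder proxy endpoints; substitute real ones from your proxy provider.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def get_with_proxy(url, headers):
    # Route each request through a randomly chosen proxy.
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy})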