Didn’t read that post? All good, here is the TL;DR (too long; didn’t read) version:
Scraped Amazon product reviews data for some Sony Headphones I recently bought
Introduction to using Splash and Docker for web-scraping
Step-by-step implementation of popular web-scraping Python libraries: BeautifulSoup, requests, and Splash.
Getting started
Here’s a snippet of code that we can start with:
Running this will fetch us the product, title, date, rating, and the actual text for each review made on Amazon for these Sony Headphones.
Code refactoring
Before we mess around with the code to loop through multiple pages, it’s good practice to refactor a little. We will introduce a few functions to help with performance and readability.
Let’s start with converting our URL setup & HTML request to a function.
This function takes a URL as input and returns the parsed HTML of the response to our HTTP request. Let’s call this our soup.
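A minimal sketch of such a function, assuming a Splash instance is running in Docker on its default port (8050) and that the page is rendered through Splash's render.html endpoint. The `SPLASH_URL` constant and the separate `html_to_soup` helper are illustrative choices, not necessarily how the repo's code is laid out.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: Splash is running locally in Docker on its default port.
SPLASH_URL = "http://localhost:8050/render.html"


def html_to_soup(html):
    """Parse raw HTML text into a BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


def get_soup(url):
    """Ask Splash to render the page at `url`, then parse the returned HTML."""
    # `wait` gives the page's JavaScript a couple of seconds to finish loading.
    response = requests.get(SPLASH_URL, params={"url": url, "wait": 2}, timeout=60)
    response.raise_for_status()
    return html_to_soup(response.text)
```

Keeping the parsing step in its own helper makes it easy to try the parser on a saved HTML file without making a request.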
Next, we need to tell Python what data we want to look for in the soup.
Here, we try to scrape for those reviews in our soup that are tagged in the way we’ve specified, then append to our list.
If for whatever reason we cannot find what we are looking for, then we are asking Python to pass to the next review.
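As a rough sketch of that idea: the `data-hook` attribute values below are assumptions about how Amazon tags its review markup and should be checked against the live page with the element inspector. The `reviewlist` argument is simply the list we append to.

```python
from bs4 import BeautifulSoup


def get_reviews(soup, reviewlist):
    """Append one dict per review found in `soup` to `reviewlist`."""
    # Assumption: each review sits in a <div data-hook="review"> element.
    for item in soup.find_all("div", {"data-hook": "review"}):
        try:
            review = {
                "product": soup.title.text.strip(),
                "title": item.find("a", {"data-hook": "review-title"}).text.strip(),
                "date": item.find("span", {"data-hook": "review-date"}).text.strip(),
                "rating": float(
                    item.find("i", {"data-hook": "review-star-rating"})
                    .text.replace("out of 5 stars", "")
                    .strip()
                ),
                "body": item.find("span", {"data-hook": "review-body"}).text.strip(),
            }
        except (AttributeError, ValueError):
            # A tag was missing or the rating was not a number: skip this review.
            continue
        reviewlist.append(review)
    return reviewlist
```

Placing the `try` inside the loop means one malformed review is skipped without discarding the rest of the page.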
Looping through multiple pages
One of the easiest methods to scrape multiple pages is to modify the base URL to accept a page variable that increments as needed.
Try it for yourself! See how the URL changes as you click through multiple pages.
For Amazon product reviews, the only thing that seems to change is the number indicating which page it is.
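To illustrate, here is a small sketch of a URL template with a page variable. The product ID `EXAMPLEASIN` is a placeholder and the URL layout is simplified, so copy the real URL from your browser's address bar rather than relying on this one.

```python
# Assumption: a placeholder product ID and a simplified review URL layout.
BASE_URL = "https://www.amazon.com/product-reviews/EXAMPLEASIN"


def build_review_url(page):
    """Return the review-page URL for the given 1-based page number."""
    return f"{BASE_URL}?pageNumber={page}"


# Build the URLs for the first three pages and show how only the number changes.
urls = [build_review_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```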
Okay, now let’s put this to work in a function:
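A minimal sketch of such a loop. The function name `scrape_pages` and its parameters are hypothetical, and the fetching, parsing, and URL-building steps are passed in as arguments so the example runs offline with simple stand-ins.

```python
def scrape_pages(get_soup, get_reviews, build_url, max_pages):
    """Scrape pages 1..max_pages and return every review collected."""
    reviewlist = []
    for page in range(1, max_pages + 1):
        soup = get_soup(build_url(page))
        # Printed progress messages make it easier to follow a long run.
        print(f"Getting page: {page}")
        get_reviews(soup, reviewlist)
        print(f"Reviews collected so far: {len(reviewlist)}")
    return reviewlist


# Stand-ins so the sketch runs offline; swap in the real functions when scraping.
def fake_get_soup(url):
    return url  # pretend the "soup" is just the URL


def fake_get_reviews(soup, reviewlist):
    reviewlist.append({"source": soup})


reviews = scrape_pages(
    fake_get_soup, fake_get_reviews, lambda page: f"page-{page}", max_pages=3
)
```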
We can even add a stop condition. For this one, we can tell Python to look for a greyed-out “Next Page” button. To identify this element, use your browser’s element inspector.
Add this to the bottom of the function above.
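One way to express that check, as a sketch: the class names `a-disabled` and `a-last` are assumptions about how the greyed-out button is marked up, so confirm them with the element inspector before relying on them.

```python
from bs4 import BeautifulSoup


def is_last_page(soup):
    """Return True when the "Next Page" button is greyed out (disabled)."""
    # Assumption: the disabled button is an <li class="a-disabled a-last">.
    return soup.select_one("li.a-disabled.a-last") is not None


# At the bottom of the page loop, the check would look like:
#     if is_last_page(soup):
#         break
```

With this in place, the loop can be given a generous upper bound on the page count and will stop on its own once the last page is reached.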
Putting it all together
You can find the complete code here in my Amazon web-scraping GitHub repo. Look for the file called 2_amz_review_scraper.py.
You’ll notice that I used Splash & Docker to render and return the page. I found that I needed it to run my code.
In case you want to try running the scraper without Splash & Docker, simply change Line 9 to the following:
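As a rough sketch of the non-Splash route, assuming the page is requested directly with requests: the `fetch_html` name, the `get` parameter, and the header value are hypothetical, and since Splash was needed for the code to run in my case, a direct request may not return the full page.

```python
import requests

# Assumption: a browser-like User-Agent; the headers actually needed may vary.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_html(url, get=requests.get):
    """Fetch `url` directly, without Splash, and return the raw HTML text."""
    response = get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.text
```

The returned text can then be handed to BeautifulSoup exactly as before.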
Conclusion
To review, here is what we just did:
Modularized and refactored our single-page Amazon product reviews web-scraper
Added a for loop that returns Amazon product reviews for the range we specified
Added a stop condition that looks for a greyed-out “Next Page” button
Added printed comments to help us follow the code as it scrapes
Returned a CSV file with the data we specified called sony-headphones.csv
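For that last step, here is a minimal sketch of writing the collected reviews to a CSV file using Python's built-in csv module. This is a swapped-in approach for illustration, and the script in the repo may use a different library. The `save_reviews` name is hypothetical and the column names are assumed from the fields listed at the start of this post.

```python
import csv


def save_reviews(reviewlist, path):
    """Write a list of review dicts to a CSV file with a header row."""
    # Assumption: every review dict uses exactly these keys.
    fieldnames = ["product", "title", "date", "rating", "body"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(reviewlist)
```

Calling `save_reviews(reviewlist, "sony-headphones.csv")` at the end of the script would produce the file.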
Once again, credit to John Watson Rooney, who provided the methodology on GitHub from which this tutorial is adapted. I would encourage you to check out his content if web-scraping in Python interests you.
There will be one last post on this topic, this time taking the data that we’ve got and running a basic sentiment analysis in Python.