Introduction:
In this post, which can be read as a follow-up to our guide about web scraping without getting blocked, we will cover almost all of the tools to do web scraping in Python. We will go from the basic to advanced ones, covering the pros and cons of each. Of course, we won't be able to cover every aspect of every tool we discuss, but this post should give you a good idea of what each tool does and when to use one.
Note: When I talk about Python in this blog post, assume that I mean Python 3.
0. Web Fundamentals
The Internet is complex: there are many underlying technologies and concepts involved to view a simple web page in your browser. The goal of this article is not to go into excruciating detail on every single of those aspects, but to provide you with the most important parts for extracting data from the web with Python.
HyperText Transfer Protocol
HyperText Transfer Protocol (HTTP) uses a client/server model. An HTTP client (a browser, your Python program, cURL, libraries such as Requests...) opens a connection and sends a message ("I want to see that page: /product") to an HTTP server (Nginx, Apache...). Then the server answers with a response (the HTML code for example) and closes the connection.
HTTP is called a stateless protocol because each transaction (request/response) is independent. FTP, for example, is stateful because it maintains the connection.
Basically, when you type a website address in your browser, the HTTP request looks like this:
GET /product/ HTTP/1.1
Host: example.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch, br
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36
In the first line of this request, you can see the following:
- The HTTP method or verb. In our case GET, indicating that we would like to fetch data. There are quite a few other HTTP methods available as well (e.g. for uploading data), and a full list is available here.
- The path of the file, directory, or object we would like to interact with. In the case here, the directory product right beneath the root directory.
- The version of the HTTP protocol. In this tutorial, we will focus on HTTP 1.
- Multiple header fields: Connection, User-Agent... Here is an exhaustive list of HTTP headers.
Here are the most important header fields :
- Host: This header indicates the hostname for which you are sending the request. This header is particularly important for name-based virtual hosting, which is the standard in today's hosting world.
- User-Agent: This contains information about the client originating the request, including the OS. In this case, it is my web browser (Chrome) on macOS. This header is important because it is either used for statistics (how many users visit my website on mobile vs desktop) or to prevent violations by bots. Because these headers are sent by the clients, they can be modified ("Header Spoofing"). This is exactly what we will do with our scrapers - make our scrapers look like a regular web browser.
- Accept: This is a list of MIME types, which the client will accept as response from the server. There are lots of different content types and sub-types: text/plain, text/html, image/jpeg, application/json ...
- Cookie: This header field contains a list of name-value pairs (name1=value1;name2=value2). Cookies are one way websites can store data on your machine, either until a certain expiration date (standard cookies) or only temporarily until you close your browser (session cookies). Cookies are used for a number of different purposes, ranging from authentication information, to user preferences, to more nefarious things such as user tracking with personalised, unique user identifiers. However, they are a vital browser feature for the authentication just mentioned. When you submit a login form, the server will verify your credentials and, if you provided a valid login, issue a session cookie, which clearly identifies the user session for your particular user account. Your browser will receive that cookie and will pass it along with all subsequent requests.
- Referer: The referrer header (please note the typo) contains the URL from which the actual URL has been requested. This header is important because websites use this header to change their behavior based on where the user came from. For example, lots of news websites have a paying subscription and let you view only 10% of a post, but if the user comes from a news aggregator like Reddit, they let you view the full content. They use the referrer to check this. Sometimes we will have to spoof this header to get to the content we want to extract.
And the list goes on...you can find the full header list here.
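To make the header discussion concrete, here is a minimal sketch that attaches browser-like headers to a request using only the standard library. The target URL, the Referer value, and the helper name build_request are assumptions for illustration, and nothing is sent over the network.

```python
from urllib.request import Request

# User-Agent copied from the example request above; the Referer is an assumed value.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_3_1) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/100.0.4896.127 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "https://www.reddit.com/",
}


def build_request(url, extra_headers=None):
    """Return an unsent Request object carrying browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


req = build_request("https://example.com/product/")
print(req.get_method(), req.full_url)
for name, value in req.header_items():
    print(f"{name}: {value}")
```

The same dictionary of headers could be handed to any of the HTTP clients discussed later in this post.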
A server will respond with something like this:
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=utf-8
Content-Length: 3352
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" /> ...[HTML CODE]
On the first line, we have a new piece of information, the HTTP code 200 OK. A code of 200 means the request was properly handled. You can find a full list of all available codes on Wikipedia. Following the status line, you have the response headers, which serve the same purpose as the request headers we just discussed. After the response headers, you will have a blank line, followed by the actual data sent with this response.
Once your browser has received that response, it will parse the HTML code, fetch all embedded assets (JavaScript and CSS files, images, videos), and render the result into the main window.
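The layout of a response (status line, headers, blank line, body) can be illustrated with a small parser. This is only a sketch under simplifying assumptions: the function name parse_http_response is made up, the sample bytes are a shortened version of the response above, and real responses may use chunked encoding or compression that this code does not handle.

```python
def parse_http_response(raw: bytes):
    """Split a raw HTTP/1.x response into status code, reason, headers, and body."""
    # The first blank line separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line, e.g. "HTTP/1.1 200 OK" (the reason phrase may be missing).
    parts = lines[0].split(" ", 2)
    status_code = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, reason, headers, body


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: nginx/1.4.6 (Ubuntu)\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<!DOCTYPE html><html><head></head><body>Hello</body></html>"
)

status, reason, headers, body = parse_http_response(raw_response)
print(status, reason)
print(headers["content-type"])
print(body.decode("utf-8"))
```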
We will go through the different ways of performing HTTP requests with Python and extract the data we want from the responses.
1. Manually Opening a Socket and Sending the HTTP Request
Socket
The most basic way to perform an HTTP request in Python is to open a TCP socket and manually send the HTTP request.
import socket
HOST = 'www.google.com'  # Server hostname or IP address
PORT = 80  # The standard port for HTTP is 80, for HTTPS it is 443
client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (HOST, PORT)
client_socket.connect(server_address)
request_header = b'GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n'
client_socket.sendall(request_header)
# Collect the raw bytes until the server closes the connection
response = b''
while True:
    recv = client_socket.recv(1024)
    if not recv:
        break
    response += recv
# Decode once at the end so multi-byte characters split across chunks stay intact
print(response.decode('utf-8', errors='replace'))
client_socket.close()
Now that we have the HTTP response, the most basic way to extract data from it is to use regular expressions.
Regular Expressions
Regular expressions (or regex for short) are an extremely versatile tool for handling, parsing, and validating arbitrary text. A regular expression is essentially a string that defines a search pattern using a standard syntax. For example, you could quickly identify all phone numbers on a web page.
Combined with classic search and replace, regular expressions also allow you to perform string substitution on dynamic strings in a relatively straightforward fashion. The easiest example, in a web scraping context, may be to replace uppercase tags in a poorly formatted HTML document with the proper lowercase counterparts.
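As a rough sketch of that substitution use case, the snippet below lowercases tag names with re.sub and a replacement function. The pattern and the helper name lowercase_tags are illustrative assumptions; the pattern is deliberately simple and would also touch a stray "<" followed by letters inside text or scripts.

```python
import re

# Matches the name of an opening or closing tag, e.g. "<DIV" or "</P".
TAG_NAME = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tags(html: str) -> str:
    """Lowercase tag names only; attributes and text are left untouched."""
    return TAG_NAME.sub(lambda m: f"<{m.group(1)}{m.group(2).lower()}", html)


messy = '<DIV CLASS="Item"><P>Price : 19.99$</P></DIV>'
print(lowercase_tags(messy))
```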
You may be now wondering why it is important to understand regular expressions when doing web scraping in Python. That's a fair question, and after all, there are many different Python modules to parse HTML with XPath and CSS selectors.
In an ideal semantic world, data is easily machine-readable, and the information is embedded inside relevant HTML elements with meaningful attributes. But the real world is messy. You will often find huge amounts of text inside a <p> element. For example, if you want to extract specific data inside a large text (a price, a date, a name...), you will have to use regular expressions.
Note: Here is a great website to test your regex: https://regex101.com/. Also, here is an awesome blog to learn more about them. This post will only cover a small fraction of what you can do with regex.
Regular expressions can be useful when you have this kind of data:
<p>Price : 19.99$</p>
We could select this text node with an XPath expression and then use this kind of regex to extract the price:
^Price\s*:\s*(\d+\.\d{2})\$
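Applied in Python, that pattern could be used roughly as follows. The helper name extract_price and the conversion to float are assumptions made for this sketch.

```python
import re

PRICE_PATTERN = re.compile(r"^Price\s*:\s*(\d+\.\d{2})\$")


def extract_price(text):
    """Return the price as a float, or None when the text does not match."""
    match = PRICE_PATTERN.match(text.strip())
    return float(match.group(1)) if match else None


print(extract_price("Price : 19.99$"))
print(extract_price("Out of stock"))
```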
If you only have the HTML, it is a bit trickier, but not all that much more after all. You can simply specify in your expression the tag as well and then use a capturing group for the text.
import re
html_content = '<p>Price : 19.99$</p>'
# A raw string avoids invalid-escape warnings; "/" needs no escaping in Python
m = re.match(r'<p>(.+)</p>', html_content)
if m:
    print(m.group(1))
As you can see, manually sending the HTTP request with a socket and parsing the response with regular expressions can be done, but it's complicated, and there are higher-level APIs that can make this task easier.
2. urllib3 & LXML
Disclaimer: It is easy to get lost in the urllib universe in Python. Python 2's standard library contained urllib and urllib2. In Python 3, these were reorganized into a single urllib package with several submodules, while urllib3 is a separate third-party package that is not part of the standard library. This confusing situation will be the subject of another blog post. In this section, I've decided to only talk about urllib3 because it is widely used in the Python world, including by Pip and Requests.
Urllib3 is a high-level package that allows you to do pretty much whatever you want with an HTTP request. With urllib3, we could do what we did in the previous section with way fewer lines of code.
import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')
print(r.data)
As you can see, this is much more concise than the socket version. Not only that, the API is straightforward. Also, you can easily do many other things, like adding HTTP headers, using a proxy, POSTing forms ...
For example, had we decided to set some headers and use a proxy, we would only have to do the following (you can learn more about proxy servers at bestproxyreviews.com):
import urllib3
user_agent_header = urllib3.make_headers(user_agent="<USER AGENT>")
pool = urllib3.ProxyManager('<PROXY IP>', headers=user_agent_header)
r = pool.request('GET', 'https://www.google.com/')
See? There is exactly the same number of lines. However, there are some things that urllib3 does not handle very easily. For example, if we want to add a cookie, we have to manually create the corresponding headers and add them to the request.
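To illustrate the manual cookie handling, here is a small sketch that serializes cookies into a Cookie header. The helper name cookie_header and the cookie values are made up for the example, and the urllib3 call is only shown as a comment so that the snippet runs without network access.

```python
def cookie_header(cookies):
    """Serialize a dict of cookies into the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Hypothetical cookie values, e.g. taken from an earlier Set-Cookie response header.
headers = {
    "User-Agent": "<USER AGENT>",
    "Cookie": cookie_header({"session_id": "abc123", "lang": "en"}),
}
print(headers["Cookie"])

# With urllib3, this dict would then be passed along with the request:
#   http = urllib3.PoolManager()
#   r = http.request('GET', 'https://example.com/', headers=headers)
```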
There are also things that urllib3 can do that Requests can't: creation and management of a pool and proxy pool, as well as managing the retry strategy, for example.
To put it simply, urllib3 is between Requests and Socket in terms of abstraction, although it's way closer to Requests than Socket.
To be honest, if you're going to do web scraping using Python, you probably won't use urllib3 directly, especially if it is your first time.
Next, to parse the response, we are going to use the LXML package and XPath expressions.
XPath
XPath is a technology that uses path expressions to select nodes or node-sets in an XML document (or HTML document). If you are familiar with the concept of CSS selectors, then you can imagine it as something relatively similar.
As with the Document Object Model, XPath has been a W3C standard since 1999. Although XPath is not a programming language in itself, it allows you to write expressions that can directly access a specific node, or a specific node-set, without having to go through the entire HTML tree (or XML tree).
To extract data from an HTML document with XPath we need three things:
- an HTML document
- some XPath expressions
- an XPath engine that will run those expressions
To begin, we will use the HTML we got from urllib3. And now we would like to extract all of the links from the Google homepage. So, we will use one simple XPath expression, //a, and we will use LXML to run it. LXML is a fast and easy-to-use XML and HTML processing library that supports XPath.
Installation:
pip install lxml
Below is the code that comes just after the previous snippet:
from lxml import html
# We reuse the response from urllib3
data_string = r.data.decode('utf-8', errors='ignore')
# We instantiate a tree object from the HTML
tree = html.fromstring(data_string)
# We run the XPath against this HTML
# This returns a list of elements
links = tree.xpath('//a')
for link in links:
    # For each element we can easily get back the URL
    print(link.get('href'))
And the output should look like this:
https://books.google.fr/bkshp?hl=fr&tab=wp
https://www.google.fr/shopping?hl=fr&source=og&tab=wf
https://www.blogger.com/?tab=wj
https://photos.google.com/?tab=wq&pageId=none
http://video.google.fr/?hl=fr&tab=wv
https://docs.google.com/document/?usp=docs_alc
...
https://www.google.fr/intl/fr/about/products?tab=wh
Keep in mind that this example is really simple and doesn't show you how powerful XPath can be (Note: we could have also used //a/@href to point straight to the href attribute). If you want to learn more about XPath, you can read this helpful introduction. The LXML documentation is also well-written and is a good starting point.
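For a slightly richer taste of XPath without hitting the network, the sketch below runs a few expressions against a small, made-up HTML snippet. The markup and variable names are assumptions chosen for illustration.

```python
from lxml import html

# A small, made-up document standing in for a downloaded page.
SNIPPET = """
<html><body>
  <div id="nav">
    <a href="/about">About</a>
    <a href="https://example.com/blog" class="external">Blog</a>
  </div>
  <p>No link here</p>
  <a>Anchor without href</a>
</body></html>
"""

tree = html.fromstring(SNIPPET)

# Attribute values of every link that has an href
all_hrefs = tree.xpath('//a/@href')
# Text of the links that are direct children of the element with id "nav"
nav_texts = tree.xpath('//div[@id="nav"]/a/text()')
# A predicate on an attribute narrows the selection further
external = tree.xpath('//a[@class="external"]/@href')

print(all_hrefs)
print(nav_texts)
print(external)
```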
XPath expressions, like regular expressions, are powerful and one of the fastest ways to extract information from HTML. And like regular expressions, XPath can quickly become messy, hard to read, and hard to maintain.
If you'd like to learn more about XPath, do not hesitate to read my dedicated blog post about XPath applied to web scraping.
3. Requests & BeautifulSoup
Requests
Requests is the king of Python packages. With more than 11,000,000 downloads, it is the most widely used package for Python.
If you're building your first Python web scraper, we advise starting with Requests and BeautifulSoup.
Installation:
pip install requests
Making a request with - pun intended - Requests is easy:
import requests
r = requests.get('https://www.scrapingninja.co')
print(r.text)
With Requests, it is easy to perform POST requests, handle cookies, query parameters... You can also download images with Requests.
On the following page, you will learn to use Requests with proxies. This is almost mandatory for scraping the web at scale.
Authentication to Hacker News
Let's say you're building a Python scraper that automatically submits our blog post to Hacker News or any other forum, like Buffer. We would need to authenticate on those websites before posting our link. That's what we are going to do with Requests and BeautifulSoup!
Here is the Hacker News login form and the associated DOM:
There are three <input> tags with a name attribute (other input elements are not sent) on this form. The first one has the type hidden and the name "goto", and the two others are the username and password.
If you submit the form inside your Chrome browser, you will see that there is a lot going on: a redirect takes place and a cookie is set. This cookie will be sent by Chrome on each subsequent request in order for the server to know that you are authenticated.
Doing this with Requests is easy. It will handle redirects automatically for us, and handling cookies can be done with the Session object.
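The cookie handling of the Session object can be observed without contacting any server. In this sketch, the cookie name, its value, and the example.com URLs are made up, and the request is only prepared, never sent.

```python
import requests

session = requests.Session()

# Pretend the server set this cookie after a successful login (made-up values).
session.cookies.set("session_id", "abc123", domain="example.com", path="/")

# Build the next request without sending it, to inspect what would go out.
request = requests.Request("GET", "https://example.com/account")
prepared = session.prepare_request(request)

print(prepared.headers.get("Cookie"))
```

In a real login flow, the session stores the cookies from the server's response by itself, so no manual call to cookies.set is needed.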
BeautifulSoup
The next thing we will need is BeautifulSoup, which is a Python library that will help us parse the HTML returned by the server, to find out if we are logged in or not.
Installation:
pip install beautifulsoup4
So, all we have to do is POST these three inputs with our credentials to the /login endpoint and check for the presence of an element that is only displayed once logged in:
import requests
from bs4 import BeautifulSoup
BASE_URL = 'https://news.ycombinator.com'
USERNAME = ""
PASSWORD = ""
s = requests.Session()
data = {"goto": "news", "acct": USERNAME, "pw": PASSWORD}
r = s.post(f'{BASE_URL}/login', data=data)
soup = BeautifulSoup(r.text, 'html.parser')
if soup.find(id='logout') is not None:
    print('Successfully logged in')
else:
    print('Authentication Error')
Fantastic, with only a couple of lines of Python code, we have managed to log in to a site and to check if the login was successful. Now, on to the next challenge: getting all the links on the homepage.
By the way, Hacker News offers a powerful API, so we're doing this as an example, but you should use the API instead of scraping it!
The first thing we need to do is inspect Hacker News's home page to understand the structure and the different CSS classes that we will have to select:
As evident from the screenshot, all postings are part of a <tr> tag with the class athing. So, let's simply find all these tags. Yet again, we can do that with one line of code.
links = soup.find_all('tr', class_='athing')
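To try this selection without requesting the live site, you can run it against a simplified stand-in for the markup. The HTML below is made up to mimic the structure described above, and it also shows the equivalent CSS selector via select.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the structure described above.
SAMPLE = """
<table>
  <tr class="athing" id="101">
    <td><span class="rank">1.</span></td>
    <td></td>
    <td><a href="https://example.com/a">First story</a></td>
  </tr>
  <tr class="spacer"></tr>
  <tr class="athing" id="102">
    <td><span class="rank">2.</span></td>
    <td></td>
    <td><a href="https://example.com/b">Second story</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Two equivalent ways to select the rows: by tag and class, or by CSS selector.
rows_by_name = soup.find_all("tr", class_="athing")
rows_by_css = soup.select("tr.athing")

print(len(rows_by_name), len(rows_by_css))
print([row["id"] for row in rows_by_css])
```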
Then, for each link, we will extract its ID, title, URL, and rank:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.find_all('tr', class_='athing')
formatted_links = []
for link in links:
    # Each row holds the rank in its first cell and the title link in its third
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        "url": link.find_all('td')[2].a['href'],
        "rank": int(link.find_all('td')[0].span.text.replace('.', ''))
    }
    formatted_links.append(data)
print(formatted_links)
Great, with only a couple of lines of Python code, we have managed to load the Hacker News site and get the details of all the postings. But on our journey to big data, we do not only want to print data, we actually want to persist it. Let's try to make our Python scraper a bit more robust now!
💡 Like BeautifulSoup but need a dash of browser automation? Check out our guide on Getting started with MechanicalSoup
Storing our data in PostgreSQL
We chose a good ol' relational database for our example here - PostgreSQL!
For starters, we will need a functioning database instance. Check out www.postgresql.org/download for that, pick the appropriate package for your operating system, and follow its installation instructions. Once you have PostgreSQL installed, you'll need to set up a database (let's name it scrape_demo), and add a table for our Hacker News links to it (let's name that one hn_links) with the following schema.
CREATE TABLE "hn_links" (
    "id" INTEGER NOT NULL,
    "title" VARCHAR NOT NULL,
    "url" VARCHAR NOT NULL,
    "rank" INTEGER NOT NULL
);
💡 For managing the database, you can either use PostgreSQL's own command line client or one of the available graphical user interfaces.
All right, the database should be ready, and we can turn to our code again.
First thing, we need something that lets us talk to PostgreSQL and Psycopg is a truly great library for that. As always, you can quickly install it with pip.
pip install psycopg2
The rest is relatively easy and straightforward. We just need to get the connection
con = psycopg2.connect(host="127.0.0.1", port="5432", user="postgres", password="", database="scrape_demo")
That connection will allow us to get a database cursor
cur = con.cursor()
And once we have the cursor, we can use the execute method to actually run our SQL command.
cur.execute("INSERT INTO table [HERE-GOES-OUR-DATA]")
Perfect, we have stored everything in our database!
Hold your horses, please. Don't forget to commit your (implicit) database transaction. One more con.commit() (and a couple of closes) and we are really good to go.
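If you want to experiment with the connect, cursor, execute, and commit cycle before setting up PostgreSQL, the sketch below uses an in-memory SQLite database from the standard library as a stand-in. This is a swapped-in technique, not the PostgreSQL setup itself: the rows are made up, and sqlite3 uses "?" placeholders where psycopg2 uses "%s".

```python
import sqlite3

# In-memory SQLite database as a stand-in for the PostgreSQL instance.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
    CREATE TABLE "hn_links" (
        "id" INTEGER NOT NULL,
        "title" VARCHAR NOT NULL,
        "url" VARCHAR NOT NULL,
        "rank" INTEGER NOT NULL
    )
""")

# Made-up rows in the same shape as the scraped links.
rows = [
    (101, "First story", "https://example.com/a", 1),
    (102, "Second story", "https://example.com/b", 2),
]

# Values are passed separately from the SQL text instead of being formatted into it.
cur.executemany(
    'INSERT INTO "hn_links" ("id", "title", "url", "rank") VALUES (?, ?, ?, ?)',
    rows,
)
con.commit()

cur.execute('SELECT COUNT(*) FROM "hn_links"')
stored = cur.fetchone()[0]
print(stored)

cur.close()
con.close()
```

Passing the values as parameters, rather than building the SQL string by hand, lets the driver take care of quoting and helps avoid SQL injection through scraped content.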
And for the grand finale, here is the complete code with the scraping logic from before, this time storing everything in the database.
import psycopg2
import requests
from bs4 import BeautifulSoup
# Establish database connection
con = psycopg2.connect(host="127.0.0.1",
                       port="5432",
                       user="postgres",
                       password="",
                       database="scrape_demo")
# Get a database cursor
cur = con.cursor()
r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.find_all('tr', class_='athing')
for link in links:
    # The values are passed as parameters, so psycopg2 handles the quoting
    cur.execute("""
        INSERT INTO hn_links (id, title, url, rank)
        VALUES (%s, %s, %s, %s)
        """,
        (
            link['id'],
            link.find_all('td')[2].a.text,
            link.find_all('td')[2].a['href'],
            int(link.find_all('td')[0].span.text.replace('.', ''))
        )
    )
# Commit the data
con.commit()
# Close our database connections
cur.close()
con.close()
Summary
As you can see, Requests and BeautifulSoup are great libraries for extracting data and automating different actions, such as posting forms. If you want to run large-scale web scraping projects, you could still use Requests, but you would need to handle lots of parts yourself.
💡 Did you know about ScrapingBee's Data Extraction tools? Not only do they provide a complete no-code environment for your project, but they also scale with ease and handle all advanced features, such as JavaScript and proxy round-robin, out of the box. Check it out and the first 1,000 requests are always on us.
If you'd like to learn more about Python, BeautifulSoup, POST requests, and particularly CSS selectors, I'd highly recommend the following articles
As so often, there are, of course, plenty of opportunities for improvement:
- Finding a way to parallelize your code to make it faster
- Handling errors
- Filtering results
- Throttling your requests so you don't overload the server
Fortunately for us, tools exist that can handle those for us.
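Before turning to those tools, here is a hand-rolled sketch of what throttling, error handling, and filtering could look like. The function fetch_politely, its parameters, and the fake fetch function are assumptions for illustration; the fake function replaces real HTTP calls so the snippet runs offline.

```python
import time


def fetch_politely(urls, fetch, delay=1.0, retries=2):
    """Call fetch(url) for each URL, pausing between URLs and retrying failures.

    Returns a dict mapping each URL to its result, or to None if every attempt failed.
    """
    results = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # throttling: wait before moving on to the next URL
        results[url] = None
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception as error:  # error handling: report and retry
                print(f"{url}: attempt {attempt + 1} failed ({error})")
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs without network access."""
    if "broken" in url:
        raise ConnectionError("simulated failure")
    return f"<html>{url}</html>"


pages = fetch_politely(
    ["https://example.com/1", "https://example.com/broken"],
    fake_fetch,
    delay=0.01,
)
# Filtering: keep only the pages that were fetched successfully.
ok_pages = {url: body for url, body in pages.items() if body is not None}
print(list(ok_pages))
```

With Requests, the fetch argument could be something like lambda url: requests.get(url, timeout=10).text.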
GRequests
While the Requests package is easy-to-use, you might find it a bit slow if you have hundreds of pages to scrape. Out of the box, it will only allow you to send synchronous requests, meaning that if you have 25 URLs to scrape, you will have to do it one by one.
So if one page takes ten seconds to be fetched, it will take more than four minutes to fetch those 25 pages.
import requests
# An array with 25 urls
urls = [...]
for url in urls:
    result = requests.get(url)
The easiest way to speed up this process is to make several calls at the same time. This means that instead of sending every request sequentially, you can send requests in batches of five.
In that case, each batch will handle five URLs simultaneously, which means you'll scrape five URLs in 10 seconds, instead of 50, or the entire set of 25 URLs in 50 seconds instead of 250. Not bad for a time-saver 🥳.
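The arithmetic behind those numbers can be written down as a tiny helper. The function name total_seconds is made up, and the estimate assumes that every page takes the same time and that a batch finishes as soon as its slowest page is done.

```python
import math


def total_seconds(url_count, seconds_per_page, batch_size):
    """Estimate the total time when batch_size pages are fetched at once."""
    batches = math.ceil(url_count / batch_size)
    return batches * seconds_per_page


print(total_seconds(25, 10, batch_size=1))  # one by one
print(total_seconds(25, 10, batch_size=5))  # batches of five
```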
Usually, this is implemented using thread-based parallelism. Though, as always, threading can be tricky, especially for beginners. Fortunately, there is a version of the Requests package that does all the hard work for us, GRequests. It's based on Requests, but also incorporates gevent, an asynchronous Python API widely used for web applications. This library allows us to send multiple requests at the same time in an easy and elegant way.
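For comparison, this is roughly what the thread-based approach looks like with the standard library's ThreadPoolExecutor. It is a sketch under assumptions: fake_fetch is a made-up stand-in that sleeps instead of calling requests.get, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for requests.get(url): sleeps briefly instead of using the network."""
    time.sleep(0.05)
    return f"response for {url}"


urls = [f"https://example.com/page/{i}" for i in range(25)]

# max_workers=5 mirrors the "batches of five" idea: at most five calls run at once.
with ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map returns the results in the same order as the input URLs.
    results = list(executor.map(fake_fetch, urls))

print(len(results))
print(results[0])
```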
For starters, let's install GRequests.
pip install grequests
Now, here is how to send our 25 initial URLs in batches of 5:
import grequests
BATCH_LENGTH = 5
# An array with 25 urls
urls = [...]
# Our empty results array
results = []
while urls:
    # get our first batch of 5 URLs
    batch = urls[:BATCH_LENGTH]
    # create a set of unsent Requests
    rs = (grequests.get(url) for url in batch)
    # send them all at the same time
    batch_results = grequests.map(rs)
    # appending results to our main results array
    results += batch_results
    # removing fetched URLs from urls
    urls = urls[BATCH_LENGTH:]
print(results)
# [<Response [200]>, <Response [200]>, ..., <Response [200]>, <Response [200]>]
And that's it. GRequests is perfect for small scripts but less ideal for production code or high-scale web scraping. For that, we have Scrapy.
4. Web Crawling Frameworks
Scrapy
Scrapy is a powerful Python web scraping and web crawling framework. It provides lots of features to download web pages asynchronously and handle and persist their content in various ways. It provides support for concurrent requests, crawling (the process of going from link to link to find every URL in a website), sitemaps, and more.
Scrapy also has an interactive mode called the Scrapy Shell. With Scrapy Shell, you can test your scraping code quickly and make sure all your XPath expressions or CSS selectors work without a glitch. The downside of Scrapy is that the learning curve is steep. There is a lot to learn.
To follow up on our example about Hacker News, we are going to write a Scrapy Spider that scrapes the first 15 pages of results, and saves everything in a JSON file.
You can easily install Scrapy with pip:
pip install Scrapy
Then you can use the Scrapy CLI to generate the boilerplate code for our project:
scrapy startproject hacker_news_scraper
Inside hacker_news_scraper/spiders we will create a new Python file with our spider's code:
from bs4 import BeautifulSoup
import scrapy
class HnSpider(scrapy.Spider):
    name = "hacker-news"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = [f'https://news.ycombinator.com/news?p={i}' for i in range(1, 16)]
    def parse(self, response):
        # Called once per downloaded page; yields one item per posting
        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.find_all('tr', class_='athing')
        for link in links:
            yield {
                'id': link['id'],
                'title': link.find_all('td')[2].a.text,
                "url": link.find_all('td')[2].a['href'],
                "rank": int(link.td.span.text.replace('.', ''))
            }
There is a lot of convention in Scrapy. We first provide all the desired URLs in start_urls. Scrapy will then fetch each URL and call parse for each of them, where we will use our custom code to parse response.
We then need to fine-tune Scrapy a bit in order for our spider to behave nicely with the target website.
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
You should always turn this on. Based on the response times, this feature automatically adjusts the request rate and the number of concurrent requests and makes sure your spider is not flooding the website with requests. We wouldn't want that, would we?
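AutoThrottle can be combined with a few other politeness-related settings in the project's settings.py. The excerpt below is a sketch: the setting names are Scrapy's own, but the numeric values are illustrative assumptions rather than recommendations.

```python
# settings.py (excerpt): a sketch of politeness-related settings.
# The numeric values are illustrative assumptions, not recommendations.

# Respect the rules in the site's robots.txt
ROBOTSTXT_OBEY = True

# Base delay (in seconds) between requests to the same site
DOWNLOAD_DELAY = 1

# Upper bound on parallel requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# AutoThrottle adjusts the delay dynamically within these limits
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```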
You can run this code with the Scrapy CLI and with different output formats (CSV, JSON, XML...):
scrapy crawl hacker-news -o links.json
And that's it! You now have all your links in a nicely formatted JSON file.
There is a lot more to say about Scrapy. So, if you wish to learn more, please don't hesitate to check out our dedicated blog post about web scraping with Scrapy.
PySpider
PySpider is an alternative to Scrapy, albeit a bit outdated. Its last release is from 2018. However, it is still relevant because it does many things that Scrapy does not handle out of the box.
First, PySpider works well with JavaScript pages (SPAs and AJAX calls) because it comes with PhantomJS, a headless browsing library. In Scrapy, you would need to install middlewares to do this. On top of that, PySpider comes with a nice UI that makes it easy to monitor all of your crawling jobs.
However, you might still prefer to use Scrapy for a number of reasons:
- Much better documentation than PySpider with easy-to-understand guides
- A built-in HTTP cache system that can speed up your crawler
- Automatic HTTP authentication
- Support for 3XX redirections, as well as the HTML meta refresh tag
5. Headless browsing
Selenium & Chrome
Scrapy is great for large-scale web scraping tasks. However, it struggles with sites that rely heavily on JavaScript or are implemented as an SPA (Single Page Application), for example. Scrapy does not handle JavaScript on its own and will only get you the static HTML code.
In general, it can be challenging to scrape SPAs because there are often lots of AJAX calls and WebSocket connections involved. If performance is an issue, always check what exactly the JavaScript code is doing. This means manually inspecting all of the network calls with your browser inspector and replicating the AJAX calls containing the interesting data.
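Replicating an AJAX call usually boils down to rebuilding the URL seen in the network tab and parsing the JSON it returns. The sketch below assumes a hypothetical endpoint, made-up query parameters, and a made-up response body, and it does not perform any network request.

```python
import json
from urllib.parse import urlencode


def build_api_url(base_url, **params):
    """Build the URL of a JSON endpoint spotted in the browser's network tab."""
    return f"{base_url}?{urlencode(params)}"


def extract_titles(payload):
    """Pull the interesting fields out of the JSON body returned by the endpoint."""
    data = json.loads(payload)
    return [item["title"] for item in data["items"]]


# Hypothetical endpoint and response body, for illustration only.
url = build_api_url("https://example.com/api/products", page=2, per_page=50)
sample_body = '{"items": [{"id": 1, "title": "Laptop"}, {"id": 2, "title": "Phone"}]}'

print(url)
print(extract_titles(sample_body))
```

In practice, the body would come from an HTTP client such as Requests, often with the same headers and cookies the browser sent.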
Often, though, there are too many HTTP calls involved to get the data you want, and it can be easier to render the page in a headless browser. Another great use case for a headless browser is taking a screenshot of a page, and this is what we are going to do with the Hacker News homepage (we do like Hacker News, don't we?) with the help of Selenium.
Hey, I don't get it, when should I use Selenium or not?
Here are the three most common cases when you need Selenium:
- You're looking for information that appears a few seconds after the web page has loaded in a browser.
- The website you're trying to scrape uses a lot of JavaScript.
- The website you're trying to scrape has some JavaScript checks to block "classic" HTTP clients.
You can install the Selenium package with pip:
pip install selenium
You will also need ChromeDriver. On macOS you can use brew for that.
brew install chromedriver
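Then, we just have to import the WebDriver from the Selenium package, configure Chrome to run in headless mode, set a window size (otherwise it is really small), start Chrome, load the page, and finally get our beautiful screenshot:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
# Headless mode is set via a Chrome argument; recent Selenium 4 releases removed options.headless
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1200")
# Recent Selenium 4 releases take the driver path via a Service object instead of executable_path
service = Service(executable_path='/usr/local/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://news.ycombinator.com/")
driver.save_screenshot('hn_homepage.png')
driver.quit()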
True, being good netizens, we also quit() the WebDriver instance of course. Now, you should get a nice screenshot of the homepage:
Naturally, there's a lot more you can do with the Selenium API and Chrome. After all, it's a full-blown browser instance.
- Running JavaScript
- Filling forms
- Clicking on elements
- Extracting elements with CSS selectors / XPath expressions
Selenium and Chrome in headless mode are the ultimate combination to scrape anything you want. You can automate everything that you could do with your regular Chrome browser.
The big drawback is that Chrome needs lots of memory / CPU power. With some fine-tuning you can reduce the memory footprint to 300-400 MB per Chrome instance, but you still need 1 CPU core per instance.
Don't hesitate to check out our in-depth article about Selenium and Python.
If you need to run several instances concurrently, this will require a machine with an adequate hardware setup and enough memory to serve all your browser instances. If you'd like a more lightweight and carefree solution, check out ScrapingBee's site crawler SaaS platform, which does a lot of the heavy lifting for you.
RoboBrowser
RoboBrowser is a Python library that wraps Requests and BeautifulSoup into a single, easy-to-use package and allows you to write your own custom scripts to control its browsing workflow. It is a lightweight library, but it is not a headless browser and still has the same restrictions as Requests and BeautifulSoup, which we discussed earlier.
For example, if you want to login to Hacker-News, instead of manually crafting a request with Requests, you can write a script that will populate the form and click the login button:
# pip install RoboBrowser
from robobrowser import RoboBrowser
browser = RoboBrowser()
browser.open('https://news.ycombinator.com/login')
# Get the sign-in form
signin_form = browser.get_form(action='login')
# Fill it out (the field names match the "acct" and "pw" inputs of the login form)
signin_form['acct'].value = 'account'
signin_form['pw'].value = 'secret'
# Submit the form
browser.submit_form(signin_form)
As you can see, the code is written as if you were manually doing the task in a real browser, even though it is not a real headless browsing library.
RoboBrowser is cool because its lightweight approach allows you to easily parallelize it on your computer. However, because it's not using a real browser, it won't be able to deal with JavaScript like AJAX calls or Single Page Applications.
Unfortunately, its documentation is also lightweight, and I would not recommend it for newcomers or people not already used to the BeautifulSoup or Requests API.
6. Scraping Reddit data
Sometimes you don't even have to scrape the data using an HTTP client or a headless browser. You can directly use the API exposed by the target website. That's what we are going to try now with the Reddit API.
To access the API, we're going to use Praw, a great Python package that wraps the Reddit API.
To install it:
pip install praw
Then, you will need to get an API key. Go to https://www.reddit.com/prefs/apps.
Scroll to the bottom to create an application:
As outlined in the documentation of Praw, make sure to provide http://localhost:8080 as "redirect URL".
After clicking create app, the screen with the API details and credentials will load. You'll need the client ID, the secret, and the user agent for our example.
Now we are going to get the top 1,000 posts from /r/Entrepreneur and export them to a CSV file.
import praw
import csv
reddit = praw.Reddit(client_id='your_client_id', client_secret='this_is_a_secret', user_agent='top-1000-posts')
top_posts = reddit.subreddit('Entrepreneur').top(limit=1000)
with open('top_1000.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'score', 'num_comments', 'author']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    # One CSV row per post
    for post in top_posts:
        writer.writerow({
            'title': post.title,
            'score': post.score,
            'num_comments': post.num_comments,
            'author': post.author
        })
As you can see, the actual extraction part is only one single line of Python code: running top on subreddit and storing the posts in top_posts.
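The CSV-writing part can be tried without API credentials by feeding it stand-in objects. In this sketch, the Post namedtuple and its values are made up, the output goes to an in-memory buffer instead of a file, and the "[deleted]" fallback assumes that a post whose author account no longer exists has no author object.

```python
import csv
import io
from collections import namedtuple

# Stand-in for the post objects used above (only the fields written to the CSV).
Post = namedtuple("Post", ["title", "score", "num_comments", "author"])

posts = [
    Post("How I found my first customer", 1520, 240, "alice"),
    Post("Lessons from a failed startup", 980, 133, None),  # assumed: deleted account
]

buffer = io.StringIO()  # in-memory file, so nothing is written to disk
writer = csv.DictWriter(buffer, fieldnames=["title", "score", "num_comments", "author"])
writer.writeheader()
for post in posts:
    writer.writerow({
        "title": post.title,
        "score": post.score,
        "num_comments": post.num_comments,
        # str() makes the conversion to text explicit; a missing author gets a placeholder
        "author": str(post.author) if post.author else "[deleted]",
    })

print(buffer.getvalue())
```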
There are many other use cases for Praw. You can do all kinds of crazy things, like analyzing subreddits in real-time with sentiment analysis libraries, predicting the next $GME ...
💡 Want to take the hassle out of scraping? Learn how to screen scrape with no infrastructure maintenance via our scraping API
Conclusion
Here is a quick recap table of every technology we discussed in this blog post. Please, do not hesitate to let us know if you know some resources that you feel belong here.
Name | socket | urllib3 | requests | scrapy | selenium |
---|---|---|---|---|---|
Ease of use | - - - | + + | + + + | + + | + |
Flexibility | + + + | + + + | + + | + + + | + + + |
Speed of execution | + + + | + + | + + | + + + | + |
Common use case | Writing low-level programming interfaces | High-level applications that need fine control over HTTP (pip, aws client, requests, streaming) | Calling an API; simple applications (in terms of HTTP needs) | Crawling a large list of websites; filtering, extracting and loading scraped data | JS rendering; scraping SPAs; automated testing; programmatic screenshots |
Learn more | Official documentation; Great tutorial | Official documentation; PIP usage of urllib3, very interesting | Official documentation; Requests usage of urllib3 | Official documentation; Scrapy overview | Official documentation; Scraping SPA |
I hope you enjoyed this blog post! This was a quick introduction to the most used Python tools for web scraping. In the next posts we're going to go more in-depth on all the tools or topics, like XPath and CSS selectors.
If you want to learn more about HTTP clients in Python, we just released this guide about the best Python HTTP clients.
Happy Scraping!
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.