Web scraping is a common technique for harvesting data online: an HTTP client requests a page, and an HTML parser combs through the returned markup for the data you want. It helps programmers more easily get at the information they need for their projects.
There are a number of use cases for web scraping. It allows you to access data that might not be available from APIs, as well as data from several disparate sources. It can also help you aggregate and analyze product-related user opinions, and it can provide insights into market conditions such as pricing volatility or distribution issues. However, scraping that data or integrating it into your projects hasn’t always been easy.
Fortunately, web scraping has become more advanced and a number of programming languages support it, including C++. The ever-popular language for system programming also offers a number of features that make it useful for web scraping, such as speed, strict static typing, and language and standard-library features that include type inference, templates for generic programming, primitives for concurrency, and lambda functions.
In this tutorial, you’ll learn how to use C++ to implement web scraping with the libcurl and gumbo libraries. You can follow along on GitHub.
Prerequisites
For this tutorial, you’ll need the following:
- a basic understanding of HTTP
- C++ 11 or newer installed on your machine
- g++ 4.8.1 or newer
- the libcurl and gumbo C libraries
- a resource with data for scraping (you’ll use the Merriam-Webster website)
About Web Scraping
For every HTTP request made by a client (such as a browser), a server issues a response. Both requests and responses carry headers: request headers tell the server what kind of data the client is willing to receive, and response headers describe the data the server sends back.
For instance, say you made a request to Merriam-Webster’s website for the definitions of the word “esoteric,” using cURL as a client:
GET /dictionary/esoteric HTTP/2
Host: www.merriam-webster.com
user-agent: curl/7.68.0
accept: */*
The Merriam-Webster site would respond with headers to identify itself as the server, an HTTP response code to signify success (200), the format of the response data—HTML in this case—in the content-type header, caching directives, and additional CDN metadata. It might look like this:
HTTP/2 200
content-type: text/html; charset=UTF-8
date: Wed, 11 May 2022 11:16:20 GMT
server: Apache
cache-control: max-age=14400, public
pragma: cache
access-control-allow-origin: *
vary: Accept-Encoding
x-cache: Hit from cloudfront
via: 1.1 5af4fdb44166a881c2f1b1a2415ddaf2.cloudfront.net (CloudFront)
x-amz-cf-pop: NBO50-C1
x-amz-cf-id: HCbuiqXSALY6XbCvL8JhKErZFRBulZVhXAqusLqtfn-Jyq6ZoNHdrQ==
age: 5787
<!DOCTYPE html>
<html lang="en">
<head>
<!--rest of it goes here-->
You should get similar results after you build your scraper. One of the two libraries you’ll use in this tutorial is libcurl, which cURL is written on top of.
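To make the shape of that response concrete, here is a minimal sketch that splits a raw header block like the one above into a status code and a name-to-value map. It uses only the standard library. ParsedResponse and parse_headers are hypothetical names chosen for this illustration, not part of libcurl, and the sketch assumes one header per line with no repeated names.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for this sketch: the status code plus the headers.
struct ParsedResponse {
    int status = 0;
    std::map<std::string, std::string> headers;
};

// Parses a raw header block: a status line, then "name: value" lines,
// ending at the first empty line (the body follows and is ignored here).
ParsedResponse parse_headers(const std::string &raw)
{
    ParsedResponse parsed;
    std::istringstream in(raw);
    std::string line;

    // Status line: protocol version followed by the numeric code ("HTTP/2 200").
    if (std::getline(in, line)) {
        std::istringstream status_line(line);
        std::string version;
        status_line >> version >> parsed.status;
    }

    while (std::getline(in, line)) {
        // Header lines end in CRLF on the wire; drop the trailing '\r'.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        if (line.empty()) {
            break; // blank line: end of headers
        }
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) {
            continue; // not a "name: value" line; skip it
        }
        std::string name = line.substr(0, colon);
        std::string::size_type start = line.find_first_not_of(" \t", colon + 1);
        parsed.headers[name] = (start == std::string::npos) ? "" : line.substr(start);
    }
    return parsed;
}
```

Given the response shown earlier, parse_headers would report a status of 200 and store text/html; charset=UTF-8 under content-type.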
Building the Web Scraper
The scraper you’re going to build in C++ will source definitions of words from the Merriam-Webster site, while eliminating much of the typing associated with conventional word searches. Instead, you’ll reduce the process to a single set of keystrokes.
For this tutorial, you will be working in a directory labeled scraper and a single C++ file of the same name: scraper.cc.
Setting up the Libraries
The two C libraries you’re going to use, libcurl and gumbo, work here because C++ interacts well with C. While libcurl is an API that enables several URL and HTTP-predicated functions and powers the client of the same name used in the previous section, gumbo is a lightweight HTML5 parser with bindings in several C-compatible languages.
Using vcpkg
Developed by Microsoft, vcpkg is a cross-platform package manager for C/C++ projects. Follow this guide to set up vcpkg on your machine. You can install libcurl and gumbo by typing the following in your console:
$ vcpkg install curl
$ vcpkg install gumbo
If you are working in an IDE environment—specifically Visual Studio Code—next run the following snippet in the root directory of your project in order to integrate the packages:
$ vcpkg integrate install
To minimize errors in your installations, consider adding vcpkg to your PATH environment variable.
Using apt
If you’ve used Linux, you should be familiar with apt, which enables you to conveniently source and manage libraries installed on the platform. To install libcurl and gumbo with apt, type the following in your console:
$ sudo apt install libcurl4-openssl-dev libgumbo-dev
Installing the Libraries
If you would rather build the libraries from source than use a package manager, you can use the method shown below.
First, clone the curl repository and install it globally:
$ git clone https://github.com/curl/curl.git <directory>
$ cd <directory>
$ autoreconf -fi
$ ./configure
$ make && sudo make install
Next, clone the gumbo repository and install the package:
$ sudo apt install libtool
$ git clone https://github.com/google/gumbo-parser.git <directory>
$ cd <directory>
$ ./autogen.sh
$ ./configure
$ make && sudo make install
Coding the Scraper
The first step in coding the scraper is creating a facility for making an HTTP request. The artifact, a function named request, will allow the dictionary scraping tool to fetch markup from the Merriam-Webster site.
The request function in your scraper.cc file, shown in the code snippet below, sets a client name that identifies the scraper via the user-agent header and a callback that writes the server’s response markup into memory. Its sole parameter is the word whose definitions the scraper sources; it forms the last segment of the URL path.
// Signature of a libcurl write callback whose user data is a std::string.
typedef size_t (*curl_write)(char *, size_t, size_t, std::string *);

std::string request(std::string word)
{
    CURLcode res_code = CURLE_FAILED_INIT;
    std::string result;
    std::string url = "https://www.merriam-webster.com/dictionary/" + word;

    // Initialize libcurl's global state before creating a handle.
    curl_global_init(CURL_GLOBAL_ALL);
    CURL *curl = curl_easy_init();

    if (curl)
    {
        // Append every chunk of the response body to `result`.
        curl_easy_setopt(curl,
                         CURLOPT_WRITEFUNCTION,
                         static_cast<curl_write>([](char *contents, size_t size,
                                                    size_t nmemb, std::string *data) -> size_t {
                             size_t new_size = size * nmemb;
                             if (data == NULL)
                             {
                                 return 0;
                             }
                             data->append(contents, new_size);
                             return new_size;
                         }));
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &result);
        curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
        curl_easy_setopt(curl, CURLOPT_USERAGENT, "simple scraper");

        res_code = curl_easy_perform(curl);
        if (res_code != CURLE_OK)
        {
            // On failure, return libcurl's error message instead of markup.
            result = curl_easy_strerror(res_code);
        }
        // Release the handle on both the success and the failure path.
        curl_easy_cleanup(curl);
    }

    curl_global_cleanup();
    return result;
}
Remember to include the appropriate headers at the top of your .cc or .cpp file for the curl library and the C++ string library. Without them, the code above will not compile.
#include <curl/curl.h>
#include <string>
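libcurl may deliver a response body in several pieces, calling the write callback once per piece, which is why the lambda in request appends to the string rather than assigning to it. The sketch below imitates that behavior without any network access. append_chunk mirrors the lambda, and deliver_in_chunks is a hypothetical stand-in for the transfer, written only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Same shape as the callback passed to CURLOPT_WRITEFUNCTION in request():
// append one chunk of size * nmemb bytes to the string behind `data`.
size_t append_chunk(char *contents, size_t size, size_t nmemb, std::string *data)
{
    if (data == nullptr) {
        // Reporting fewer bytes than were passed in tells libcurl the write failed.
        return 0;
    }
    size_t new_size = size * nmemb;
    data->append(contents, new_size);
    return new_size;
}

// Hypothetical stand-in for a transfer: hands `body` to the callback in
// pieces of at most `chunk_size` bytes and returns what was accumulated.
std::string deliver_in_chunks(std::string body, size_t chunk_size)
{
    std::string result;
    if (chunk_size == 0) {
        return result; // nothing can be delivered in zero-byte pieces
    }
    for (size_t pos = 0; pos < body.size(); pos += chunk_size) {
        size_t len = std::min(chunk_size, body.size() - pos);
        append_chunk(&body[pos], 1, len, &result);
    }
    return result;
}
```

However the body is split, the accumulated string matches the original markup, which is the property the scraper relies on.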
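The next step, parsing the markup, requires four functions: scrape, find_definitions, extract_text, and str_replace. Each function calls the one listed after it, so either define them in reverse order in your file or declare them before their first use. Since gumbo is central to all markup parsing, add the appropriate library header as follows:
#include <gumbo.h>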
The scrape function feeds the markup from the request into find_definitions for selectively iterative DOM traversal. You’ll use the gumbo parser in this function, which returns a string containing a list of word definitions:
std::string scrape(std::string markup)
{
    std::string res = "";
    // Parse the markup into a tree, then walk it from the root.
    GumboOutput *output = gumbo_parse_with_options(&kGumboDefaultOptions, markup.data(), markup.length());
    res += find_definitions(output->root);
    // Free the parse tree once the definitions have been copied out.
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return res;
}
The find_definitions function below recursively harvests definitions from the span HTML elements with the unique class identifier "dtText". It extracts definition text via the extract_text function on each successful iteration from each HTML node in which that text is enclosed.
std::string find_definitions(GumboNode *node)
{
    std::string res = "";
    GumboAttribute *attr;

    // Only element nodes have attributes and children.
    if (node->type != GUMBO_NODE_ELEMENT)
    {
        return res;
    }

    // Collect the text of any element whose class attribute contains "dtText".
    if ((attr = gumbo_get_attribute(&node->v.element.attributes, "class")) &&
        strstr(attr->value, "dtText") != NULL)
    {
        res += extract_text(node);
        res += "\n";
    }

    // Recurse into the children to find further definitions.
    GumboVector *children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i)
    {
        res += find_definitions(static_cast<GumboNode *>(children->data[i]));
    }

    return res;
}
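One caveat: strstr performs a substring match, so it would also accept a class such as dtTextWrapper (a made-up name used here only for illustration). If you need an exact match, one option is to compare whole class names, as in the sketch below. has_class is a hypothetical helper, not part of gumbo.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true only when `token` is one of the
// whitespace-separated class names in `class_attr`.
bool has_class(const std::string &class_attr, const std::string &token)
{
    std::istringstream classes(class_attr);
    std::string name;
    // operator>> skips whitespace, so each iteration yields one class name.
    while (classes >> name) {
        if (name == token) {
            return true;
        }
    }
    return false;
}
```

In find_definitions, you could then test has_class(attr->value, "dtText") in place of the strstr call.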
Next, the extract_text function below extracts text from each node that is not a script or style tag. The function funnels the text to the str_replace routine, which replaces the colon that precedes each definition with the greater-than (>) symbol.
std::string extract_text(GumboNode *node)
{
    if (node->type == GUMBO_NODE_TEXT)
    {
        return std::string(node->v.text.text);
    }
    else if (node->type == GUMBO_NODE_ELEMENT &&
             node->v.element.tag != GUMBO_TAG_SCRIPT &&
             node->v.element.tag != GUMBO_TAG_STYLE)
    {
        std::string contents = "";
        GumboVector *children = &node->v.element.children;
        for (unsigned int i = 0; i < children->length; ++i)
        {
            std::string text = extract_text(static_cast<GumboNode *>(children->data[i]));
            if (i != 0 && !text.empty())
            {
                contents.append("");
            }
            contents.append(str_replace(":", ">", text));
        }
        return contents;
    }
    else
    {
        return "";
    }
}
The str_replace function (inspired by a PHP function of the same name) replaces every instance of a specified search string in a larger string with another string. It appears as follows:
std::string str_replace(std::string search, std::string replace, std::string &subject)
{
    // An empty search string would match everywhere and never terminate.
    if (search.empty())
    {
        return subject;
    }
    // Replace each occurrence in place, resuming the search after the inserted text.
    for (std::string::size_type pos{};
         subject.npos != (pos = subject.find(search.data(), pos, search.length()));
         pos += replace.length())
    {
        subject.replace(pos, search.length(), replace.data(), replace.length());
    }
    return subject;
}
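The str_replace function itself needs only the string library. However, find_definitions calls strstr, which is declared in cstring, and the strtolower function you’ll define next uses std::transform from the algorithm library, so include both:
#include <algorithm>
#include <cstring>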
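Note that str_replace replaces every occurrence of the search string, not only one at the start of the text, so a colon in the middle of a definition also becomes >. If you want to touch only a leading colon, a sketch along these lines could work. replace_leading is a hypothetical helper introduced for this illustration and is not part of the scraper built in this tutorial.

```cpp
#include <string>

// Hypothetical alternative to str_replace: replaces `search` only when it is
// the first thing in `subject` after any leading whitespace. Returns a copy
// and leaves `subject` untouched.
std::string replace_leading(const std::string &search,
                            const std::string &replace,
                            const std::string &subject)
{
    std::string::size_type start = subject.find_first_not_of(" \t\n");
    if (search.empty() || start == std::string::npos) {
        return subject; // nothing to search for, or only whitespace
    }
    if (subject.compare(start, search.length(), search) != 0) {
        return subject; // the text does not begin with `search`
    }
    std::string out = subject;
    out.replace(start, search.length(), replace);
    return out;
}
```

With this variant, a string such as ": a ratio of 1:2" keeps its inner colon and only the leading one changes.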
Next, you’ll make the scraper dynamic, so that it returns definitions for the word supplied as a command-line argument. To do this, you’ll define a function that converts the argument to its lowercase equivalent, which minimizes the likelihood of request errors from redirects, and you’ll restrict input to a single command-line argument.
Add the function to convert string inputs to their lowercase equivalents:
std::string strtolower(std::string str)
{
    std::transform(str.begin(), str.end(), str.begin(), ::tolower);
    return str;
}
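Passing ::tolower directly works for plain ASCII input, but calling tolower with a negative char value (possible for non-ASCII bytes on platforms where char is signed) is undefined behavior. A slightly more defensive variant, sketched below under the hypothetical name to_lower_ascii, converts each character to unsigned char first.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical variant of strtolower: converting to unsigned char before
// calling std::tolower keeps the argument in the range the function accepts.
std::string to_lower_ascii(std::string str)
{
    std::transform(str.begin(), str.end(), str.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return str;
}
```

It is a drop-in replacement for strtolower: to_lower_ascii("Esoteric") yields "esoteric", and characters without a lowercase form, such as hyphens, pass through unchanged.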
Next is the branching logic that selectively parses a single command-line argument:
if (argc != 2)
{
    std::cout << "Please provide a valid English word" << std::endl;
    exit(EXIT_FAILURE);
}
The primary function in your scraper should appear as shown below:
int main(int argc, char **argv)
{
    // Accept exactly one word on the command line.
    if (argc != 2)
    {
        std::cout << "Please provide a valid English word" << std::endl;
        exit(EXIT_FAILURE);
    }

    // Lowercase the word before building the request URL.
    std::string arg = strtolower(argv[1]);
    std::string res = request(arg);
    std::cout << scrape(res) << std::endl;

    return EXIT_SUCCESS;
}
You should include C++’s iostream library to ensure the Input/Output (IO) primitives used in the main function work as expected:
#include <iostream>
To run your scraper, compile it with g++ and pass it a word. Typing the following in your console should pull the six listed definitions of the word “esoteric”:
$ g++ scraper.cc -lcurl -lgumbo -std=c++11 -o scraper
$ ./scraper esoteric
You should see the definitions printed in your console, one per line.
If you would like to learn more about cURL you can check: How to follow redirect using cURL?, How to forward headers with cURL? or How to send a POST request using cURL?
Conclusion
As you saw in this tutorial, C++, which is normally used for system programming, also works well for web scraping: with libcurl handling the HTTP requests and gumbo parsing the returned HTML, a short program is enough. This added functionality can help you expand your knowledge of C++.
You’ll note that this example was relatively simple, and did not address how scraping would work for a more JavaScript-heavy website that renders its content dynamically. To scrape such a site, you could drive a headless browser through a C++ library for Selenium. This topic will be discussed in a future article.
To check your work on this tutorial, consult this GitHub gist.