R’s XML package is a powerful tool for generating datasets by “scraping” the text of HTML and XML documents. Converting webpage data into a dataframe you can work with in R is very simple for websites that format their data cleanly using HTML tables.
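A minimal sketch of that workflow (the URL and table index here are placeholders, not code from my project):
library(XML)

# readHTMLTable returns a list with one dataframe per HTML table on the page
url <- "http://www.example.com/medal-table.html"   # placeholder URL
tables <- readHTMLTable(url, stringsAsFactors = FALSE)

length(tables)       # how many tables were scraped
n <- 2               # index of the table you actually want
table <- tables[[n]]
head(table)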
The readHTMLTable function turns every HTML table found at the given URL into a separate dataframe in R, so you have to comb through the output to determine which table you want for analysis (in the example above, table is set to the nth table scraped from the webpage).
This is quite convenient for cases where the data of interest is unchanging, updated infrequently, or future updates don’t matter for the analysis you plan to run. Simply running the scrape once on your desktop suffices to gather the data you want. Consider these examples:
- Individual medalist and country medal counts for the 2000 Sydney Olympic Games.
- Wins-above-replacement (WAR) leaders in the AL and NL for each year from 1917 to 2016. (Baseball Reference)
- All-time movie box office rankings, gross receipts adjusted for inflation. (Box Office Mojo)
- Weekly Apple stock prices between Jan. 1, 2005 and Dec. 31, 2012. (Yahoo Finance)
However, a more robust web scraping solution is needed when the data is updated regularly, and we would like to capture all these updates. Consider two such cases that I’ve encountered:
- Updating daily data on home runs hit in MLB games during the ongoing season. Perhaps I would like to generate an email each morning detailing the home runs the New York Mets hit in the previous night’s game. (ESPN Home Run Tracker)
- Aggregating data on all temporary flight restrictions (TFRs) issued by the FAA. (FAA)
Updating the dataset manually is unrealistic, unreliable, and time-consuming, so we need to operate the web scraping scripts in an environment such that:
- the device has a persistent Internet connection, and
- the device can execute the web scrape and other functions automatically at specified intervals.
The specific problem I took on was running an hourly web scrape of the FAA’s active TFR list. The R and Shell code I run to execute these scrapes is available in a GitHub repo.
The hosting solution I found uses Amazon Web Services’ Elastic Compute Cloud (EC2) service. Under the AWS free tier, I can run an EC2 Linux instance configured to run the web scrapes in R each hour and kick the output data to an AWS S3 bucket. All of this can be done for free, or at very minimal cost.
If you have not set up an AWS account, you can create one and be eligible to take advantage of the AWS free tier for 12 months!
The rest of this post outlines the process of setting up the EC2 instance with R and the dependencies you will need to run the code in my tfr-data GitHub repo, or a web scraper of your own!
Create an AWS Elastic Compute Cloud (EC2) Instance
In the AWS EC2 console, you will walk through creating an EC2 instance. Here are my notes, numbered in correspondence with the steps of the interface.
- Amazon Linux, free tier eligible. I think there’s no right answer, but R is very easy to install and configure in Linux. If you have any experience with Linux or Mac OS Terminal, then you’ll feel at home in this setup.
- The t2.micro instance is the only one that is free-tier eligible. Unless you’re running a scrape that will be processing gigabytes upon gigabytes of data, this should be adequate.
- Select IAM role. This will be necessary if you wish to set up a role that can read and write your data output to an S3 bucket. If you do wish to integrate S3 (recommended), please skip down to the S3 setup section below to set this up right.
- There’s probably no need to add more storage to your instance, and I’m not sure how any changes will affect your free-tier eligibility.
- Create tags as you see fit.
- The security group controls the traffic that can navigate to your EC2 instance, which I expand on directly below.
Security Group
These settings will vary greatly across users, depending on whether they are hosting a web application or, like me, using a Linux instance to compile data. This AWS blog post discusses using RStudio and Shiny to create data visualizations that are accessible via web browser, so it’s a good reference for setting up your security group if you wish to go that route.
In my case, I only allow traffic via SSH, with the IP address set to that of my personal device. If you need to access the EC2 instance from multiple devices or locations (home and office), simply create new rules of type SSH to allow multiple IP addresses to connect.
Private-Public Key Pair (AWS Doc)
Before you launch your EC2 instance, you will generate a private-public key pair (downloaded with the .pem extension) that you will be required to have when connecting to your EC2 instance. With the key in hand, you connect to your launched instance like so:
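(The key path and public DNS below are placeholders; substitute your own.)
$ ssh -i ~/path/to/my-key-pair.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com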
The -i option specifies the path to the .pem key-pair file, which is necessary to verify your authority to access the server. The IPv4 public DNS can be copied-and-pasted from the EC2 console.
Once connected, you’ll see the following in your Terminal window:
__| __|_ )
_| ( / Amazon Linux AMI
___|\___|___|
Installing R and other Dependencies
Now that the instance is launched and configured, the next step is to install dependencies not preloaded onto the server. The technologies on which the XML package depends for its web scraping capabilities are usually preconfigured on Mac and Windows devices, but we’re not so lucky with the Linux AMI.
$ sudo yum install -y libcurl-devel
$ sudo yum install -y openssl-devel
$ sudo yum install -y libxml2-devel
$ sudo yum install -y R
Henceforth, simply typing the command R will launch the R application, and the interface in your Terminal window will look just like that of your desktop application. The same commands work, and you can write your R script just as you would in your Mac or Windows desktop environment.
You’ll need to manually install the XML, RCurl, and httr R packages, which are required for my FAA TFR scrapes and will likely be needed for yours. The need for XML is apparent. RCurl fetches the XML text from the web, which corrects a fatal error that appeared after the FAA’s webpage began using the https prefix. The httr package has a nice function for checking that a URL exists before you try scraping data from it with the XML package’s functions, which otherwise caused fatal errors for me. (Refer to the R script in the tfr-data repo for details.)
chooseCRANmirror()
# select mirror with integer response
install.packages("XML", dependencies=TRUE)
install.packages("RCurl", dependencies=TRUE)
install.packages("httr", dependencies=TRUE)At the end of the install process for httr, you’ll see a warning message like this:
Warning messages:
1: In install.packages("httr", dependencies = TRUE) :
installation of package ‘jpeg’ had non-zero exit status
2: In install.packages("httr", dependencies = TRUE) :
installation of package ‘png’ had non-zero exit status
3: In install.packages("httr", dependencies = TRUE) :
installation of package ‘readr’ had non-zero exit status
4: In install.packages("httr", dependencies = TRUE) :
installation of package ‘xml2’ had non-zero exit status
These can be disregarded for my use case, so I expect you won’t need to install the jpeg and png libraries to run your web scraping tool either.
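To give a rough sense of how these packages fit together once they are installed, here is a sketch of a scrape of the FAA TFR list (the URL is the FAA’s TFR list page; the parsing steps are illustrative, not the exact code in the tfr-data repo):
library(RCurl)
library(XML)
library(httr)

url <- "https://tfr.faa.gov/tfr2/list.html"   # FAA TFR list page

# httr lets you confirm the page is reachable before trying to parse it
if (!http_error(GET(url))) {
  # RCurl fetches the page over https, which the XML parsers cannot do on their own
  page_text <- getURL(url)
  doc <- htmlParse(page_text, asText = TRUE)
  tfr_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
}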
Now you can upload files, like your R code, to the server with scp (secure copy) in order to perform the scrape.
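For example (again, the key path, file name, and hostname are placeholders):
$ scp -i ~/path/to/my-key-pair.pem file.R ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:/home/ec2-user/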
Finally, to automate the script to run each hour, add a cron task with the command crontab -e. You may need to add sudo before R to prevent errors.
The task will run the R file every hour and send the output in an email. The emails can be annoying, but they make for an excellent debugging tool in case unexpected issues arise within the first few days of the script running. Setting a filter to move these hourly updates to a folder lessens the annoyance considerably. Simply remove the MAILTO line to stop the emails.
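For reference, a minimal crontab entry would look something like this (the email address is a placeholder; the full entry with the S3 syncs appears later in this post):
MAILTO="you@example.com"
0 * * * * sudo R CMD BATCH /home/ec2-user/file.R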
Create an S3 Bucket to Store and Publicly List Your Data
S3 is AWS’s “Simple Storage Service.” It too is free-tier eligible, and it is incredibly affordable after the limited-time offer expires. Using S3 has several advantages over storing the data only on your EC2 instance.
- Redundancy: Copying the data you’ve compiled to S3 will protect you in the event that the EC2 instance you’ve created is shut down and access to its files is lost.
- Security: You want to limit access to your EC2 instance as much as possible. Depending on your security group settings, you may be able to connect to your server by secure shell only if you hold the corresponding private key and your IP address has been allowed in the security group. Opening up the EC2 instance so that anonymous users can access your dataset greatly increases your “attack surface” for anyone who may want to access other items or take control of your instance.
- Transparency: Rather than relying on you to hand over your data, reviewers, colleagues, and readers can go download the latest copy of your dataset themselves from your S3 bucket, if they would like to run new tests or attempt to replicate your results.
Connecting EC2 and S3 with an IAM Role
When assigning an IAM role for your EC2 instance, you want a role that can read and write data to the S3 bucket you will use to store your data. You can accomplish this by attaching the AmazonS3FullAccess policy to the role, but that policy allows it to read and write to any S3 bucket under your AWS account.
A cleaner, less risky custom policy can be created by lightly editing the JSON code below to match your new role’s desired settings. This policy allows the role to list the contents of your-bucket and to add, edit, and delete items in your-bucket specifically.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket"
]
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::your-bucket/*"
]
}
]
}
Once the role is correctly configured, you just need to check that this role is assigned to your EC2 instance under IAM role.
You can now use the AWS command line tools (preinstalled on the Linux AMI) to sync the files on your EC2 instance with those in your S3 bucket. You can add these sync commands to the front and end of your cron task to automate the syncs between S3 and EC2.
0 * * * * aws s3 sync s3://your-bucket /home/ec2-user/s3-export && R CMD BATCH /home/ec2-user/file.R && aws s3 sync /home/ec2-user/s3-export s3://your-bucket
The first S3 sync takes all the folders and files in your-bucket and updates the files local to your EC2 instance at the specified path. The second S3 sync updates the files in your bucket to match the data on your EC2 instance, which should now reflect the most recent run of the web scraping script.
The && operators in the cron entry ensure the commands run in order, and they stop the chain if any one of them exits with an error.
Make Your Data Public
In the S3 Management Console, you can customize the permissions of your bucket to allow anyone to read the files (or specific files) in your bucket under the “Permissions” tab in the bucket’s settings. It’s important that we give anonymous users only read access to the bucket. If anyone can write new files to the bucket, overwrite existing files, delete items from the bucket, or manage its user and permission settings, then you are exposed to problems both malicious and accidental.
Further, the default read-only access you can give to everyone under “Access Control List” allows anyone to list the contents of your bucket by navigating to its base URL. This may be problematic, so you can set custom permissions under “Bucket Policy” that allow anyone to read or download the files of your choosing without exposing all the files in your bucket to the public eye.
{
"Version": "2008-10-17",
"Id": "http better policy",
"Statement": [
{
"Sid": "readonly policy",
"Effect": "Allow",
"Principal": "*",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::your-bucket/*"
}
]
}
Principal * means anyone. Note that you can change Resource to point to a specific file path or paths. With this configuration, anyone can read any file in your-bucket, but they cannot list all the contents of your-bucket. Bucket policies can be easily debugged by plugging URLs to your bucket into your browser to see if the desired results are achieved.
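For instance, assuming your-bucket holds a file named tfr-list.csv (a hypothetical name), you would test a URL along these lines in your browser (depending on your bucket’s region, the hostname may include a region suffix):
https://your-bucket.s3.amazonaws.com/tfr-list.csv
Under this policy, that request returns the file, while navigating to the bucket’s base URL returns an Access Denied error because public listing is not allowed.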
By Michael Kotrous