Tables are pretty common in web pages as data sources, and the most direct way to get the data is probably to copy and paste. This is OK when there are only two or three tables, but when we need to grab 5000 tables from 1000 web pages, we probably do not wish to do the task by hand. That is one of the reasons why we need programming – we want to be as lazy as possible. Who is willing to spend 2 hours copying and pasting? Just let the computer do the tedious job while we watch movies.
The R package XML is a handy tool for dealing with web pages (both XML and HTML). I'm actually a big fan of its author, Duncan Temple Lang, who has done a lot of work on the infrastructure of statistical computing (see the Omegahat project). Next I use the Stat579 homework of week 8 as an example to show how to read tables from web pages directly using R, i.e. no Excel, no Word, no copy & paste. The task is to grab 3 data tables from the web pages for 3 states, clean the data and do some graphics. Specifically, I'll take the page for Iowa as an example. See the R code below:
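The gist of the code is something like the sketch below. Note that the URL and the table index are placeholders I made up for illustration; the actual homework page and the position of the table in it may differ:

```r
library(XML)

## hypothetical URL standing in for the homework page on Iowa
url <- "http://www.example.com/iowa.html"

## parse the page and extract all <table> elements as data frames
tbls <- readHTMLTable(url, stringsAsFactors = FALSE)
length(tbls)  # how many tables were found on the page

## suppose the table we want happens to be the second one
d <- tbls[[2]]

## drop every character that is not a digit, then convert to numeric
## (the first column is assumed to hold the row labels, so we skip it)
d[-1] <- lapply(d[-1], function(x) as.numeric(gsub("[^0-9]", "", x)))
```

To find out which element of the list is the table you actually want, it usually suffices to look at `names(tbls)` or print a few of the candidates.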
The most important function is readHTMLTable(), which is a convenient wrapper that parses an HTML page and retrieves the table elements as data frames. The rest of the work is simply to figure out which table we need, and then remove the characters that are not numbers. This is done with the regular expression [^0-9]: inside square brackets, ^ negates the character class, so the pattern matches any character other than the digits 0 to 9. It is easy to extend this script to read other web pages too – just change the URL (e.g. using a loop).
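For example, here is how that cleaning step behaves on a made-up string:

```r
gsub("[^0-9]", "", "1,234,567 people")
## [1] "1234567"
```

Wrapping the result in as.numeric() then gives the number 1234567.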
To help understand the code above, I put some intermediate results below: