Featured
Articles on my recommendation, i.e. will not be a waste of time
The world has changed. You can feel it on GitHub. You can smell it on Google+. The knitr package, as an alternative tool to Sweave, has features that you have been longing for, and features that you might have never imagined. Thumb through the PDF manual to see some of them.
Currently this package is still a beta version, so I’m looking for feedback from early birds on:
- is the PDF documentation confusing in any place? e.g. you have no idea how to install the package because it was not mentioned in the manual;
- does the website look ugly in your browser? (I know it does with IE under Windows) I used a font from Google Font API, and it does not seem to be consistent across different web browsers/OS’es;
- what kind of difficulties did you have in switching from Sweave/pgfSweave/whatever-Sweave to knitr?
- do you like the idea of putting R code/output in a shaded frame in LaTeX? is the default shading (rgb(.97, .97, .97)) too dark or too light? how about the highlighting theme?
- have you ever tried to hack at Sweave? I’d love to listen to your stories;
- what else do you expect from knitr?
Feel free to file a bug report on the Issues page if you find any problems or have any suggestions. I appreciate your efforts in making this knitr package even neater!
I must admit that I have been tired of maintaining my R packages for a long time, and the main reason is that I feel really uncomfortable writing R documentation (Rd). The required structure of an R package mainly includes two directories, R and man: the former for the R source code (typically functions), and the latter for documentation. In the past I usually used package.skeleton() to generate a skeleton of the documentation and filled in the tags one by one. The main headache was frequently switching between the two files and typing raw Rd commands such as \title{} and \description{}.
People told me about all kinds of advantages of Emacs+ESS over the past few years, and I tried it more than ten times, but often ended up frustrated (so I installed and removed Emacs several times). My last attempt a few months ago finally succeeded, and I realized how easy it is to document R functions in Emacs with roxygen. See the 1-minute video below:
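For readers who cannot watch the video, here is a rough sketch of what roxygen-style documentation looks like: the comments sit right above the function in the same file, so there is no switching between R and man. The function standardize() is a made-up example, and the tags shown (@param, @return, @export, @examples) are the common ones from roxygen2; how the first two paragraphs map to \title{} and \description{} is my understanding of roxygen2 and may differ in other versions.

```r
#' Standardize a numeric vector
#'
#' Subtract the mean and divide by the standard deviation.
#'
#' @param x a numeric vector
#' @return a numeric vector of the same length as \code{x}
#' @export
#' @examples
#' standardize(c(1, 2, 3))
standardize = function(x) {
  # hypothetical helper, only here to have something to document
  (x - mean(x)) / sd(x)
}

print(standardize(c(1, 2, 3)))
```

The Rd file is then generated from these comments instead of being typed by hand.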
0. Summary
Take a look at the video in this entry if you don’t understand the title. In short,
- install LyX and R as well as a working LaTeX toolkit such as MikTeX or TeXLive or MacTeX;
- run source('http://gitorious.org/yihui/lyx-sweave/blobs/raw/master/lyx-sweave-config.R') in R under Windows or Ubuntu or Mac; I tried my best to automatically configure LaTeX, R and LyX;
- restart LyX as instructed, and you can enjoy pgfSweave in LyX now: either play with my demo (demo 1; demo 2 with bibliography; a beamer demo; an animation demo with PDF output), or DIY: create a new document, change the document class to article (Sweave noweb) from Document -> Settings, switch the environment to Scrap from the top-left drop list, start your Sweave code chunks like
<<test>>=
rnorm(10)
@
and click the PDF button to compile this document. Done. Take a look at this video if you feel confused.
This works for MikTeX under Windows (Server 2003 / Win7), and TeXLive 2009 under Ubuntu 10.10, MacTeX 2010 under Mac OS; R 2.12.0 or 2.11.1; LyX 1.6.x.
Tables are pretty common in web pages as data sources, and the most direct way to get these data is probably to copy and paste. This is OK if there are only two or three tables, but when we need to grab 5000 tables from 1000 web pages, we may not really wish to do the task by hand. That is one of the reasons why we need programming: we want to be as lazy as possible. Who is willing to spend 2 hours copying and pasting? Just let the computers do the tedious job and we can watch movies.
The R package XML is a handy tool to deal with web pages (both XML and HTML). I’m actually a big fan of its author, Duncan Temple Lang, who did a lot of work on the infrastructure of statistical computing (see the Omegahat project). Next I use the Stat579 homework of week 8 as an example to show how to read tables from web pages directly using R, i.e. no Excel, no Word, no copy & paste. The task is to grab 3 data tables from the web pages for 3 states, clean the data and do some graphics. Specifically, I’ll take the page for Iowa as an example. See the R code below:
if (!require(XML)) install.packages('XML')
library(XML)
x = readHTMLTable('http://www.disastercenter.com/crime/iacrime.htm')
## the 3rd element is what we want
x = x[[3]]
## names are in the first 2 rows
nms = as.vector(apply(x[1:2, ], 2, paste, collapse = ''))
## remove the first 2 rows because they are not data
x = x[-(1:2), ]
## assign the names to data
names(x) = nms
## then remove any characters which are not numbers (i.e. 0-9)
x = sapply(x, function(xx) as.numeric(gsub('[^0-9]', '', xx)))
The most important function is readHTMLTable(), which is a convenient wrapper to parse an HTML page and retrieve the table elements. The rest of the work is simply to figure out which table we need. Then we have to remove the characters which are not numbers. This is done by the regular expression [^0-9], in which ^ at the start of the brackets negates the set, so the pattern matches any character other than the ones listed (in this case, the digits from 0 to 9). It is easy to extend this script to read other web pages too: just change the URL (e.g. using a loop).
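To see the cleaning step in isolation, here is a small sketch that runs without any web access. The strings in raw are made up for illustration; they only mimic the kind of cells a scraped table might contain.

```r
# hypothetical raw cells, not taken from the real web page
raw = c("1,234", "56 789", "2009*", "n/a")
# [^0-9] matches every character that is NOT a digit; gsub() deletes them
digits = gsub("[^0-9]", "", raw)
# a cell with no digits at all becomes "", which as.numeric() turns into NA
clean = as.numeric(digits)
print(clean)
```

Note that this pattern also strips decimal points and minus signs, so it is only suitable for tables of non-negative integers; a different pattern would be needed otherwise.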
This blog post is mainly for Stat 579 students on the homework for week 7, since I received too many “gory” loops in the homework submissions and I think it would help a bit to write my thoughts on R loops for beginners. The immortal motto for newbies in programming is:
If you want to make an apple pie from scratch, you must first create the universe.
Carl Sagan
There have been endless wars over which programming language is better than the others, but in my view the choice is nothing but a balance between code performance and the amount of work for programmers. In an extreme sense, almost all languages give you the ability to create the universe, but you do not really have to if you just want to make an apple pie.
R was born after S, a language which was invented “to turn ideas into software, quickly and faithfully” and received the ACM Software System Award in 1998. Before the S language, statisticians often had to write “gory” low-level computing routines to do data analysis and statistical computation, including those “gory” loops, of course. For example, imagine what you have to do to compute the correlation coefficients in C.
R has wrapped a lot of common tasks written in lower-level programming languages (mainly C and Fortran) to make them easier to call and faster to compute (R’s explicit loops are generally slower than loops in low-level languages), which frees statisticians from paying too much attention to the gory details of computation. However, the consequence is that we have too many tools in our hands, of which we are often unaware. I have no quick solution to this problem: we have to learn more about the capability of R in many ways, e.g. reading the R-help mailing list, asking experts, doing daily work with R, reading the source code of R functions and playing with the examples in help pages, etc.
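As a small illustration of the point, the sketch below computes a correlation coefficient twice: once with an explicit loop that accumulates sums (roughly what one would have to write in C), and once with the built-in cor(). The data are simulated and the variable names are arbitrary.

```r
set.seed(1)
x = rnorm(100)
y = 0.5 * x + rnorm(100)
n = length(x)

# the "gory" way: accumulate the sums one element at a time
sx = 0; sy = 0; sxx = 0; syy = 0; sxy = 0
for (i in seq_len(n)) {
  sx = sx + x[i]
  sy = sy + y[i]
  sxx = sxx + x[i]^2
  syy = syy + y[i]^2
  sxy = sxy + x[i] * y[i]
}
r_loop = (n * sxy - sx * sy) / sqrt((n * sxx - sx^2) * (n * syy - sy^2))

# the R way: one call to a function that is already there
r_builtin = cor(x, y)

print(all.equal(r_loop, r_builtin))
```

Both give the same number up to floating-point error, but the second is one line and much harder to get wrong.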
It is not uncommon to see messy R code which is almost not human-readable like this:
# rotation of the word "Animation"
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
# redraw the plot again and again
plot(1,ann=FALSE,type="n",axes=FALSE)
# rotate; use rainbow() colors
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
# pause for a while
Sys.sleep(0.01)}
Clearly it is a pain to read unformatted R code, but on the other hand, it is natural for us to be lazy. I don’t care about adding spaces or indentation to my raw R code: I’ll concentrate on programming first and format my code later. The R package ‘formatR’ is intended to help us format our messy R code.
# formatR optionally depends on gWidgetsRGtk2
# please use the latest version of R (>=2.12.0)
install.packages('formatR')
library(formatR)
formatR()
## you will get an error if the package gWidgetsRGtk2 is not installed;
## then you need to install it
install.packages('gWidgetsRGtk2')
formatR('RGtk2')
Then you can either paste your code into the text box or click the “Open” button to open an existing R code file. Click the “Convert” button and you are done!
There are several options in the “Preferences” panel, e.g. you can specify whether to keep comments or blank lines, or specify the width of the formatted R code.
No matter how messy your code looks, formatR can make it tidy and structured as long as there are no syntax errors in your R code. If you prefer the command line interface, you may want to take a look at the function tidy.source() in this package.
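For the command line route, here is a minimal sketch. It assumes a recent version of formatR, in which the function is named tidy_source() rather than tidy.source(); the messy one-liner is made up, and the code is skipped if formatR is not installed.

```r
# a hypothetical messy one-liner to be reformatted
messy = 'for(i in 1:3){x=i^2;print(x)}'

if (requireNamespace("formatR", quietly = TRUE)) {
  # output = FALSE returns the result instead of printing it
  res = formatR::tidy_source(text = messy, output = FALSE)
  tidy = res$text.tidy
  cat(tidy, sep = "\n")
}
```

The tidied code is still valid R, so it can be written back to a file or evaluated as usual.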
Note that multi-byte characters (say, Chinese) are also supported in the GUI.
Amber Watkins gave me a suggestion on an animation for ratio estimation, and I think this is a good topic for my animation package. I’ve finished writing the initial version of the function sample.ratio() for this package, which will appear in version 1.1-2 in a couple of days.
As we know, the benefit of ratio estimation is that sampling skewness may be adjusted for, because the estimation of $\bar{Y}$ will make use of the information in the relationship of X and Y: $\hat{\bar{Y}}_R = \bar{X} \cdot \bar{y} / \bar{x}$. Here is a demo (we can see the ratio estimate, denoted by the red line, generally performs better than the simple sample mean $\bar{y}$):
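The same comparison can be sketched numerically. This is not the code of sample.ratio(), just a minimal simulation under my own assumptions: a made-up population in which Y is roughly proportional to X, a known population mean of X, and simple random samples of size 30.

```r
set.seed(2010)
# a hypothetical finite population where Y is roughly proportional to X
N = 1000
X = runif(N, 10, 100)
Y = 2 * X + rnorm(N, sd = 5)
X_bar = mean(X)  # assumed to be known in advance
Y_bar = mean(Y)  # the quantity we want to estimate

n = 30
one_draw = function() {
  idx = sample(N, n)  # simple random sample without replacement
  c(simple = mean(Y[idx]),
    ratio = X_bar * mean(Y[idx]) / mean(X[idx]))
}

# repeat the sampling and compare mean squared errors
est = replicate(500, one_draw())
mse = rowMeans((est - Y_bar)^2)
print(mse)
```

When X and Y are strongly related like this, the ratio estimate should have a much smaller mean squared error than the plain sample mean; with a weak relationship the advantage disappears.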
Since animation 1.0-9, we will be able to create a PDF document with an animation embedded in it; the function is saveLatex(), and its usage is similar to saveMovie() and saveSWF(): you pass an R expression for creating animations to this function, and this expression will be evaluated in the function; the image frames get recorded by a graphics device. In the end, a LaTeX document is written in a directory, and we can get a PDF document by running pdflatex on the document.
In fact, the key point is the LaTeX package named animate, which can insert image frames into a PDF document to generate an animation. The interface of the animations created by this package is quite similar to the HTML animation page produced by the R package animation; moreover, it also uses JavaScript (in PDF) to animate the image frames.
Today Romain Francois posted an interesting topic on the R-help list, and you can read his blog post for more details: celebrating R commit #50000. 50000 is certainly not a small number; we do owe the R core members a big “thank you” for their great efforts on this fantastic statistical language over the past 13 years. When I saw Romain’s data, I suddenly remembered a question I asked one of Prof Ripley’s students a couple of years ago: does Prof Ripley ever sleep? And he answered “No!”. No wonder we can see Prof Ripley so frequently on the R-help/devel mailing lists. If you have stayed on the R-help list long enough, you’ll surely know several facts, e.g. Martin Maechler will arrive in less than 3 minutes if you dare call an R package a “library”, and you will get “Ripleyed” if you are not careful enough in posting your R code.
> library(fortunes)
> fortune("Ripleyed")
And the fear of getting Ripleyed on the mailing list also makes me think, read,
and improve before submitting half baked questions to the list.
-- Eric Kort
R-help (January 2006)
As Sir Francis Bacon said, “Histories make men wise; poets witty; the mathematics subtile; natural philosophy deep; moral grave; logic and rhetoric able to contend.” And Windows stupid.
He should have added the last sentence if he were a Windows user in this age.
1. Avoid Using M$ Excel
A lot of R users often ask this question: “How to import MS Excel data into R?” Well, my suggestion is, avoid using M$ Excel if you are a statistician (or going to be a statistician) because you just cannot imagine how messy Excel data can be: some cells might be merged, some are colored, some texts are bold, several data tables can be put everywhere (e.g. cell(1,1) to (10,4), and (17,3) to (25,9)), stupid bar plots and pie charts are inserted in the sheets, silly statistical procedures that are wrong forever… If you don’t trust my words (yes, I’m a nobody), just read the examples here: Problems with Excel (collected by Prof Harrell).
I know there are reasons for you to continue using Excel. Your boss required you to do so; you don’t have time to learn more about various data formats; everybody is using Excel, and you don’t want to be so cool to use R; or if you finish your tasks too quickly and accurately, your boss will doubt whether you have really spent time on working, hence you will get less money paid (this is a REAL story for me – though I didn’t get less payment, I was indeed doubted when I used R); …


