Yihui Xie

Aug 22 2010

I’m really surprised that most beamer slides I’ve ever seen have Figure/Table captions like this:

Figure: blabla blabla

or

Table: blabla blabla

which should have been Figure 1 or Table 2.

Why are the caption numbers missing? This is because beamer does not produce these numbers by default. To enable numbered captions, you have to put this in the preamble of your LaTeX document:

\setbeamertemplate{caption}[numbered]

I’m a picky LaTeX user… I cannot stand captions without numbers.

Aug 14 2010
Update: now the long hints for function parameters can be broken into several shorter lines.

Auto-completion is a fancy feature in a text editor. Notepad++ does not support auto-completion for the R language out of the box, so I spent a couple of hours creating an XML file to support R:

Download: R.xml (938Kb)

Put it under ‘plugins/APIs’ in the installation directory of Notepad++ (you can see several other XML files there supporting different languages such as C), and make sure you have enabled auto-completion in Notepad++ (Settings --> Preferences --> Backup/Auto-completion). Open an R script and start typing a familiar function (e.g. paste()); you will see some candidates in a drop-down list like this:

Show parameters of R functions in Notepad++

Hit the Enter key if the function name selected in the list is correct for you, then type ‘(‘ and you will see hints for parameters:

Auto-completion in Notepad++ for R script

The file R.xml was actually generated from R; it contains almost all visible R objects in the base R packages as well as recommended packages like MASS. You can create an extended XML file (containing keywords from other packages) yourself: load the packages you need into your current session, then run:

source('http://yihui.name/en/wp-content/uploads/2010/08/Npp_R_Auto_Completion.r')
# R.xml will be generated under your current work directory: getwd()
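
To give a rough idea of what such a generator might do, here is a minimal sketch in base R. It is not the script linked above: the helper names (xml_escape, npp_keywords) are made up for illustration, and the XML layout (AutoComplete, Environment, KeyWord, Overload, Param) is an assumption based on the API files that ship with Notepad++.

```r
# Escape the characters that are special inside XML attribute values.
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

# Hypothetical helper: one <KeyWord> entry per visible object of an attached package.
npp_keywords <- function(pkg = "base") {
  env <- as.environment(paste0("package:", pkg))
  # radix sorting uses the C locale, so the order does not depend on the user's locale
  objs <- sort(ls(env), method = "radix")
  entries <- vapply(objs, function(nm) {
    obj <- get(nm, envir = env)
    if (!is.function(obj)) {
      return(sprintf('<KeyWord name="%s" />', xml_escape(nm)))
    }
    a <- args(obj)  # also works for most primitives; NULL for a few language constructs
    params <- if (is.function(a)) names(formals(a)) else character(0)
    paste0(
      sprintf('<KeyWord name="%s" func="yes"><Overload retVal="">', xml_escape(nm)),
      paste0(sprintf('<Param name="%s" />', xml_escape(params)), collapse = ""),
      "</Overload></KeyWord>"
    )
  }, character(1))
  unname(entries)
}

xml <- c(
  '<?xml version="1.0" encoding="UTF-8" ?>',
  "<NotepadPlus>",
  '<AutoComplete language="R">',
  # assumed settings: "(" starts a call, "," separates parameters, "." is a word character
  '<Environment ignoreCase="no" startFunc="(" stopFunc=")" paramSeparator="," additionalWordChar="." />',
  npp_keywords("base"),
  "</AutoComplete>",
  "</NotepadPlus>"
)
# written to a temporary directory here; copy it to plugins/APIs to try it out
writeLines(xml, file.path(tempdir(), "R.xml"))
```

Calling npp_keywords("MASS") after library(MASS) would add the keywords of that package in the same way.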

Jul 21 2010

As every useR knows, the useR! 2010 conference is being held at NIST in Gaithersburg these days. I have just finished my talk on the R package animation this afternoon. Here are my slides and R code for those who are interested:

Download: Slides (1.6M) and R code (3.6K). Note that you may need Acrobat Reader to watch the animations inside the slides.

Have fun, even if you are a PhD!

Apr 17 2010
A new paper on the α-convex hull appeared in the Journal of Statistical Software today (http://www.jstatsoft.org/v34/i05/paper). The α-convex hull is an interesting problem which caught my attention a long time ago, but I didn’t know a solution then. R has a function chull() which returns the indices of the points on the convex hull of a set of points. Now we can use the R package alphahull to compute the α-convex hull. For those who are not familiar with the α-convex hull, the animation below might be a good illustration of the difference between a convex hull and an α-convex hull. Note how the parameter α affects the shape of the hull:

alpha-convex hull with different alpha's
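
For comparison, the ordinary convex hull needs nothing beyond base R. The sketch below uses simulated points (an assumption, not the data from the animation) and only covers chull(); it does not use the alphahull package.

```r
set.seed(42)
pts <- cbind(x = rnorm(50), y = rnorm(50))  # simulated points, for illustration only

idx <- chull(pts)               # indices of the hull vertices, in clockwise order
hull <- pts[c(idx, idx[1]), ]   # repeat the first vertex to close the polygon

plot(pts, pch = 19)
lines(hull, col = "red")
```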

The above animation can be reproduced with the code below (uncomment the lines to create a GIF animation with the animation package):

Apr 15 2010

I came across this blog post just now: The Next Big Thing, and of course these words caught my attention:

[...] However, for me personally and for most users, both individual and organizational, the much greater cost of software is the time it takes to install it, maintain it, learn it and document it. On that, R is an epic fail.

I don’t really understand how (much more?) difficult it can be to install and maintain R. Usually it takes about one minute to install it from the binary (and SAS? SPSS? buy it, find a technician, install it, maintain it according to different licenses – single PC or server or other types, continue to pay only tens of thousands of dollars next year, …). As for learning, it depends. I don’t think it is too difficult for people who know statistics well, and as for the rest, do they really feel safe doing something they do not understand? As for the documentation, some people prefer simple documents and some prefer handbooks (SAS-style).

In all, I cannot see why R is an epic fail for the above reasons…

What? Data visualization?…

The R community must have been tired of comparing SAS with R. Please don’t tell Prof Frank Harrell about this post…

Apr 13 2010

It is not uncommon to see messy R code that is barely human-readable, like this:

 # rotation of the word "Animation"
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
 # redraw the plot again and again
plot(1,ann=FALSE,type="n",axes=FALSE)
# rotate; use rainbow() colors
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
# pause for a while
Sys.sleep(0.01)}

Obviously it is a pain to read unformatted R code, but on the other hand, it is natural for us to be lazy. I don’t care about adding spaces or indentation to my raw R code; I’ll concentrate on programming first and format my code later. The R package ‘formatR’ is intended to help us format our messy R code. Two lines of R code will show you the graphical interface of formatR:

# formatR depends on RGtk+, will be installed automatically
# please use the latest version of R (>=2.10.1)
install.packages('formatR')
library(formatR)
# or formatR()

Then you can either paste your code into the text box or click the “Open” button to open an existing R code file. Click the “Convert” button and you are done!

formatR: unformatted R code

formatR: tidy R code

There are several options in the “Preferences” panel, e.g. you can specify whether to keep comments or blank lines, or specify the width of the formatted R code.

No matter how messy your code looks, formatR can make it tidy and structured as long as there are no syntax errors in your R code. If you prefer the command line interface, you may want to take a look at the function tidy.source() in the animation package.
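
The sketch below is not formatR’s implementation; it only illustrates, with parse() and deparse() from base R, why syntactically valid code can be reformatted mechanically. Note that this naive approach drops comments, so the messy example above is used here without them.

```r
messy <- c(
  'for (i in 1:360) {',
  'plot(1,ann=FALSE,type="n",axes=FALSE)',
  'text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)',
  'Sys.sleep(0.01)}'
)

# parse() stops with an error if the code contains a syntax error
exprs <- parse(text = messy, keep.source = FALSE)

# deparse() turns each parsed expression back into consistently spaced, indented text
tidy <- unlist(lapply(exprs, deparse))
cat(tidy, sep = "\n")
```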

Currently there are problems with the encoding of multi-byte characters, and I have not figured out how to deal with them.

Apr 05 2010

Here is my personal list of rules of thumb for people who want to meet some R gurus (quickly) on the R-help mailing list (R-help@R-project.org):

  • If you want to meet Dr Bill Venables, just say something about Type III Sum of Squares (better if you also mention the “unbeatable” SAS);
  • If you want to meet Prof Douglas Bates, say something about LSMEANS (of course, with SAS) and P-values for the fixed effects in lmer() (or wait in the mixed-models group r-sig-mixed-models@r-project.org — he often shows up there);
  • If you want to meet Prof Frank Harrell Jr, say SAS is unbeatable (or efficient, golden-standard, high-quality graphics, whatever);
  • If you want to meet Dr Martin Mächler, say something like “I need help on a library called ***” (it is said that he would show up in 5 mins upon such mistakes, but I feel he is tired of correcting people who don’t know the difference between a “package” and a “library” now);
  • If you want to meet Prof Brian Ripley (the-professor-on-whom-the-sun-never-sets), well, I guess you can say anything, because he is so devoted to the mailing list that you can see him a.e., but you have to be careful enough not to be “Ripleyed”;

I’ve been reading the mailing list for about 2 years, so I may not know enough about all the gurus. Let me know if I missed anyone. The above list is not meant seriously; my real point is that I have learned a lot from their advice and arguments.

Apr 03 2010

We know the true distribution of the F statistic in linear models: it is a non-central F distribution, which reduces to a central F distribution under H0. Given the significance level α, the 1 – α quantile of the central F distribution serves as the critical value, and the power is the probability of (correctly) rejecting H0, computed under the non-central distribution. I created a simple demo to illustrate how the power changes as the other parameters vary, e.g. the degrees of freedom, the non-centrality parameter and α. Here is the video:

The Power of F Test

And for those who might be interested, here is the code (you need to install the gWidgets package first and I recommend the RGtk2 interface). Have fun:
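
As a rough, non-interactive sketch of the calculation behind the demo (this is not the gWidgets code, and the helper name f_power is made up for illustration), the power can be computed in base R with qf() and pf():

```r
# Hypothetical helper: power of the F test at significance level alpha.
f_power <- function(alpha, df1, df2, ncp) {
  crit <- qf(1 - alpha, df1, df2)                    # critical value from the central F (H0)
  pf(crit, df1, df2, ncp = ncp, lower.tail = FALSE)  # P(F > crit) under the non-central F
}

# power as a function of the non-centrality parameter, other parameters fixed
ncp <- c(0, 2, 5, 10, 20)
power <- f_power(alpha = 0.05, df1 = 3, df2 = 20, ncp = ncp)
round(power, 3)
```

With ncp = 0 the non-central distribution coincides with the central one, so the power equals α; it then increases with the non-centrality parameter.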

Mar 28 2010

When we want to call external programs in R under Windows, we often need to know the paths of these programs. For instance, we may want to know where ImageMagick is installed, as we need the convert (convert.exe) utility to convert images to other formats, or where OpenBUGS is installed because we need this path to use the function bugs(). Usually this problem does not exist under Linux, because the executables (or their symbolic links) are often put in the directories which are in the environment variable PATH (e.g. /usr/bin, /usr/local/bin).

However, we may be able to find the paths through the registry if the installer saves the path information there. The R function for this is readRegistry():

## ImageMagick:
## I used this trick in the function saveMovie (the animation package)
> readRegistry("SOFTWARE\\ImageMagick\\Current")
$BinPath
[1] "C:\\Program Files\\ImageMagick"
$CoderModulesPath
[1] "C:\\Program Files\\ImageMagick\\modules\\coders"
$ConfigurePath
[1] "C:\\Program Files\\ImageMagick\\config"
$FilterModulesPath
[1] "C:\\Program Files\\ImageMagick\\modules\\filters"
$LibPath
[1] "C:\\Program Files\\ImageMagick"
$QuantumDepth
[1] 16
$Version
[1] "6.3.8"

## OpenBUGS
> r = names(readRegistry("Software\\Microsoft\\Windows\\ShellNoRoam\\MUICache",
+    "HCU"))
> dirname(r[grep("OpenBUGS\\.exe", r)])
[1] "C:/Program Files/OpenBUGS"

There is no guarantee that this approach will work on every Windows platform, but I think this is better than explaining what the PATH variable is to some Windows users…
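
The two strategies can be combined in one small function. This is only a sketch: find_convert is a hypothetical helper (it is not how saveMovie is written), and it assumes the ImageMagick installer created the registry key shown above. On Windows the registry is tried first, because a plain PATH lookup may return the unrelated convert.exe that ships with Windows.

```r
# Hypothetical helper: locate ImageMagick's convert utility.
find_convert <- function() {
  if (.Platform$OS.type == "windows") {
    # readRegistry() only exists on Windows; the default hive is HKEY_LOCAL_MACHINE
    reg <- tryCatch(utils::readRegistry("SOFTWARE\\ImageMagick\\Current"),
                    error = function(e) NULL)
    if (!is.null(reg$BinPath)) {
      return(file.path(reg$BinPath, "convert.exe"))
    }
  }
  # elsewhere (or if the key is missing), search the directories listed in PATH
  path <- unname(Sys.which("convert"))
  if (nzchar(path)) path else NA_character_
}

find_convert()
```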

Mar 24 2010
Amber Watkins gave me a suggestion on an animation for ratio estimation, and I think this is a good topic for my animation package. I’ve finished writing the initial version of the function sample.ratio() for this package, which will appear in version 1.1-2 in a couple of days.

As we know, the benefit of ratio estimation is that the skewness of a particular sample may be adjusted for, because the estimate of the population mean \bar{Y} makes use of the relationship between X and Y: it is \bar{X} \cdot (\bar{y}/\bar{x}), where \bar{X} is the known population mean of X, and \bar{x} and \bar{y} are the sample means. Here is a demo (we can see the ratio estimate, denoted by the red line, generally performs better than \bar{y}):

An animation demo for the ratio estimation
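
The same comparison can be sketched numerically in base R. This is not the code of sample.ratio(); the population below is simulated under the assumption that Y is roughly proportional to X, which is the situation where ratio estimation helps.

```r
set.seed(1)
N <- 1000
X <- runif(N, 10, 50)           # auxiliary variable, known for the whole population
Y <- 2 * X + rnorm(N, sd = 5)   # study variable, roughly proportional to X
n <- 30                         # sample size

# draw many simple random samples and compute both estimates of mean(Y)
sims <- replicate(2000, {
  idx <- sample(N, n)
  c(mean  = mean(Y[idx]),
    ratio = mean(X) * mean(Y[idx]) / mean(X[idx]))
})

# root mean squared error of each estimator around the true population mean
rmse <- sqrt(rowMeans((sims - mean(Y))^2))
rmse
```

In this setup the ratio estimate has the smaller error, because most of the variation in \bar{y} from sample to sample is explained by the variation in \bar{x}.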

WWW.YIHUI.NAME XIE@YIHUI.NAME © 2007 - 2010 by Yihui Xie