Featured

Articles I recommend, i.e. they will not be a waste of your time

Apr 13, 2010

It is not uncommon to see messy R code that is barely human-readable, like this:

 # rotation of the word "Animation"
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
 # redraw the plot again and again
plot(1,ann=FALSE,type="n",axes=FALSE)
# rotate; use rainbow() colors
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
# pause for a while
Sys.sleep(0.01)}

Obviously it is a pain to read unformatted R code, but on the other hand, it is natural for us to be lazy. I don’t care about adding spaces or indentation to my raw R code; I concentrate on programming first and format my code later. The R package ‘formatR’ is intended to help us format our messy R code. Two lines of R code will show you the graphical interface of formatR:

# formatR depends on RGtk2, which will be installed automatically
# please use the latest version of R (>=2.10.1)
install.packages('formatR')
library(formatR)
# or formatR()

Then you can either paste your code into the text box or click the “Open” button to open an existing R code file. Click the “Convert” button and you are done!

formatR: unformatted R code

formatR: tidy R code

There are several options in the “Preferences” panel, e.g. you can specify whether to keep comments or blank lines, or specify the width of the formatted R code.

No matter how messy your code looks, formatR can make it tidy and structured as long as there are no syntax errors in it. If you prefer the command line interface, you may want to take a look at the function tidy.source() in the animation package.
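
The idea behind this kind of tidying can be sketched in base R alone: parse the code, then deparse it again. This is only an illustration under that assumption, not the actual implementation of formatR or tidy.source(). The naive round trip below drops comments, and parse() stops with an error on invalid syntax, which is why the input must be free of syntax errors.

```r
# Messy input: the loop from the top of this post, with the comments removed
messy <- 'for (i in 1:360) {
plot(1,ann=FALSE,type="n",axes=FALSE)
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
Sys.sleep(0.01)}'

# parse() turns the text into R expressions (and fails on syntax errors)
exprs <- parse(text = messy, keep.source = FALSE)

# deparse() regenerates source text with consistent spacing and indentation
tidy <- unlist(lapply(exprs, deparse, width.cutoff = 60L))
cat(tidy, sep = "\n")
```

The width.cutoff argument plays a role similar to the width setting in the “Preferences” panel: it is the line width at which deparse() tries to break long lines.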

Currently there are problems with the encoding of multi-byte characters, and I have not figured out how to deal with them.

Mar 24, 2010

Amber Watkins suggested an animation for ratio estimation, and I think this is a good topic for my animation package. I’ve finished writing the initial version of the function sample.ratio() for this package, which will appear in version 1.1-2 in a couple of days.

As we know, the benefit of ratio estimation is that it can adjust for a skewed sample, because the estimate of \bar{Y} makes use of the relationship between X and Y: \bar{X} \cdot (\bar{y}/\bar{x}), where \bar{X} is the known population mean of X, and \bar{x} and \bar{y} are the sample means. Here is a demo (we can see that the ratio estimate, denoted by the red line, generally performs better than \bar{y}):

An animation demo for the ratio estimation
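
The comparison can also be checked numerically with a small simulation. This is a rough sketch and not the code of sample.ratio(): the population below (a gamma-distributed X and a Y roughly proportional to X), the sample size and the number of replications are all assumptions made up for the illustration.

```r
set.seed(1)
N <- 1000
X <- rgamma(N, shape = 2, scale = 5)   # auxiliary variable, known for everyone
Y <- 3 * X + rnorm(N, sd = 4)          # study variable, roughly proportional to X
n <- 30                                # sample size

# draw many simple random samples and compute both estimates of mean(Y)
est <- replicate(2000, {
  idx <- sample(N, n)
  c(simple = mean(Y[idx]),
    ratio  = mean(X) * mean(Y[idx]) / mean(X[idx]))
})

# root mean squared error of each estimator against the true mean
rmse <- sqrt(rowMeans((est - mean(Y))^2))
print(rmse)
```

When a sample happens to contain too many large values of X, \bar{y} is pushed up, but so is \bar{x}, and the ratio \bar{y}/\bar{x} stays stable; that is the adjustment described above.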

Nov 11, 2009

Starting with animation 1.0-9, we can create a PDF document with an animation embedded in it. The function is saveLatex(), and its usage is similar to saveMovie() and saveSWF(): you pass an R expression that creates the animation to this function, the expression is evaluated inside the function, and the image frames are recorded by a graphics device. In the end, a LaTeX document is written to a directory, and we can get a PDF document by running pdflatex on it.

In fact, the key is the LaTeX package named animate, which inserts image frames into a PDF document to generate an animation. The interface of the animations created by this package is quite similar to the HTML animation page produced by the R package animation; moreover, it also uses JavaScript (in PDF) to animate the image frames.
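
The workflow can be imitated by hand in a few lines. The sketch below is not the code of saveLatex(); the directory, the base name Rplot, the file name animation.tex and the frame rate are assumptions chosen for the illustration. It records one PDF file per frame and writes a LaTeX file that uses \animategraphics from the animate package.

```r
out_dir <- file.path(tempdir(), "ani-demo")
dir.create(out_dir, showWarnings = FALSE)
n_frames <- 30

# onefile = FALSE with %d in the name gives Rplot1.pdf, Rplot2.pdf, ...
pdf(file.path(out_dir, "Rplot%d.pdf"), width = 5, height = 5, onefile = FALSE)
for (i in seq_len(n_frames)) {
  plot(1, ann = FALSE, type = "n", axes = FALSE)
  text(1, 1, "Animation", srt = 360 * i / n_frames, cex = 3)
}
dev.off()

# \animategraphics[options]{frame rate}{file base name}{first}{last}
tex <- c(
  "\\documentclass{article}",
  "\\usepackage{animate}",
  "\\begin{document}",
  sprintf("\\animategraphics[controls,width=.8\\linewidth]{10}{Rplot}{1}{%d}",
          n_frames),
  "\\end{document}"
)
writeLines(tex, file.path(out_dir, "animation.tex"))
```

Running pdflatex on animation.tex inside that directory should then produce the PDF; the animation only plays in PDF viewers that support JavaScript.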

Oct 10, 2009

Today Romain Francois posted an interesting topic on the R-help list, and you can read his blog post for more details: celebrating R commit #50000. 50000 is certainly not a small number; we do owe the R core members a big “thank you” for their great efforts on this fantastic statistical language over the past 13 years. When I saw Romain’s data, I suddenly remembered a question I asked one of Prof Ripley’s students a couple of years ago: does Prof Ripley ever sleep? And he answered “No!”. No wonder we see Prof Ripley so frequently on the R-help/devel mailing lists. If you have stayed on the R-help list long enough, you surely know several facts, e.g. Martin Maechler will arrive in less than 3 minutes if you dare call an R package a “library”, and you will get “Ripleyed” if you are not careful enough in posting your R code.

> library(fortunes)
> fortune("Ripleyed")

And the fear of getting Ripleyed on the mailing list also makes me think, read,
and improve before submitting half baked questions to the list.
 -- Eric Kort
 R-help (January 2006)

Sep 26, 2009

As Sir Francis Bacon said, “Histories make men wise; poets witty; the mathematics subtile[1]; natural philosophy deep; moral grave; logic and rhetoric able to contend.” And Windows stupid.

He would have added the last sentence had he been a Windows user in this age.

1. Avoid Using M$ Excel

A lot of R users ask this question: “How to import MS Excel data into R?” Well, my suggestion is to avoid using M$ Excel if you are a statistician (or going to be a statistician), because you just cannot imagine how messy Excel data can be: some cells might be merged, some are colored, some text is bold, several data tables can be put anywhere (e.g. cell (1,1) to (10,4), and (17,3) to (25,9)), stupid bar plots and pie charts are inserted in the sheets, silly statistical procedures stay wrong forever… If you don’t trust my words (yes, I’m a nobody), just read the examples here: Problems with Excel (collected by Prof Harrell).

I know there are reasons for you to continue using Excel. Your boss requires you to do so; you don’t have time to learn more about various data formats; everybody is using Excel, and you don’t want to be so cool as to use R; or if you finish your tasks too quickly and accurately, your boss will doubt whether you have really spent time working, hence you will get paid less (this is a REAL story for me: though I was not paid less, I was indeed doubted when I used R); …

Jun 10, 2009

A tag cloud is a bunch of words drawn in a graph with their sizes proportional to their frequencies; it’s widely used in blogs to visualize tags. We can spot important words quickly in a tag cloud, as they appear in a large font size. Tony N. Brown asked how to “graphically represent frequency of words in a speech” the other day on the R-help list, which is actually a tag cloud problem:

I recently saw a graph on television that displayed selected words/phrases in a speech scaled in size according to their frequency. So words/phrases that were often used appeared large and words that were rarely used appeared small. [...]

Marc Schwartz mentioned that Gorjanc Gregor did some work on this years ago using R (with grid graphics). The obstacle to creating a tag cloud in R, as Gorjanc wrote, lies in deciding the placement of the words, and it is much easier for other applications such as browsers to arrange the text. That’s true: there are already a lot of mature programs that deal with tag clouds. One of them is the wp-cumulus plugin for WordPress, which uses a Flash object to generate the tag cloud and gives the cloud a fantastic 3D rotation effect.

1. Arranging text labels with pointLabel()

Before introducing how to port the plugin into R, I’d like to introduce the R function pointLabel() in the maptools package, which can partially solve the problem of arranging text labels in a plot (using simulated annealing or a genetic algorithm). Here is a simulated example:

Simulated Tag Cloud with R function pointLabel() in maptools
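
To make the placement problem concrete, here is a naive base R sketch that does not use pointLabel(): word sizes are proportional to their frequencies, but the positions are simply random, so labels may overlap. The text string, the seed and the scaling of cex are made up for the illustration.

```r
# A made-up "speech"; any character string would do
txt <- "data model data plot model data code plot data test"

# word frequencies, most frequent first
freq <- sort(table(strsplit(txt, " ")[[1]]), decreasing = TRUE)

# random positions: this is exactly the step a label placement
# algorithm would replace with something smarter
set.seed(2009)
x <- runif(length(freq))
y <- runif(length(freq))

# font size grows with frequency
cex <- 1 + 3 * as.numeric(freq) / max(freq)

plot(x, y, type = "n", axes = FALSE, ann = FALSE,
     xlim = c(-0.2, 1.2), ylim = c(-0.2, 1.2))
text(x, y, names(freq), cex = cex, col = rainbow(length(freq)))
```

Computing the sizes is the easy part; avoiding the overlaps produced by the random positions is where an optimizer such as simulated annealing comes in.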

Sep 23, 2008

Here are the records of two coins, each tossed 200 times:

(A)
11110000000000100000101100000100000101001111001100
01111110010110110101101001111001100011011101100000
10001001111110100100001011001011101101110001010010
01100111111100011100101000101001110011100010100111

(B)
01110010010010100010011110010100010011010111001110
01111011010111101101001000111001101011010101101001
00101001110110100100001110101101101001110101100110
01110011110110001110011010111001110011110010100111

Which is the unfair one (or a false record)? In the animation below, x1 denotes the record of the first coin, while x2 is the record of the second coin. The plot in the middle shows 1000 simulations from the Binomial distribution with p = 0.5 and size = 1. A question equivalent to the hypothesis test is: which plot looks more like the simulation? Of course we should give a visual definition of “similarity” before the comparison. Imagine you are going to perform the test numerically: which statistic will you choose? For me, at least three options are available:

  1. Number of heads (or tails): if too many/few heads (or tails) show up, the coin might be unfair
  2. Maximum run length, i.e. the maximum number of successive 0’s or 1’s (e.g. coin A has a run of ten 0’s); don’t take it for granted that ten successive 0’s is a rare event in 200 tosses, because the probability is not 0.5^10; if the longest run is too long or too short, we may consider the coin unfair
  3. Number of changes from 0 to 1 or 1 to 0: if the coin changes too frequently from one side to the other side, it can be regarded as unusual too
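
The three statistics are easy to compute with rle(). The sketch below uses the two records given above; treating 1 as “heads” is an assumption, and the helper name coin_stats() and the number of simulations are arbitrary choices for the illustration.

```r
A <- paste0(
  "11110000000000100000101100000100000101001111001100",
  "01111110010110110101101001111001100011011101100000",
  "10001001111110100100001011001011101101110001010010",
  "01100111111100011100101000101001110011100010100111"
)
B <- paste0(
  "01110010010010100010011110010100010011010111001110",
  "01111011010111101101001000111001101011010101101001",
  "00101001110110100100001110101101101001110101100110",
  "01110011110110001110011010111001110011110010100111"
)

coin_stats <- function(s) {
  x <- as.integer(strsplit(s, "")[[1]])
  r <- rle(x)                           # runs of identical outcomes
  c(heads   = sum(x),                   # statistic 1 (assuming 1 = heads)
    max_run = max(r$lengths),           # statistic 2
    changes = length(r$lengths) - 1)    # statistic 3
}
stats <- rbind(A = coin_stats(A), B = coin_stats(B))
print(stats)

# How often does a fair coin show a run of 10 or more in 200 tosses?
set.seed(1)
sim_max_run <- replicate(5000, max(rle(rbinom(200, 1, 0.5))$lengths))
p_long_run <- mean(sim_max_run >= 10)
print(p_long_run)
```

The simulated proportion is far larger than 0.5^10, because a run of ten can start at many positions in 200 tosses and can consist of either 0’s or 1’s.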

Accordingly, we can present these statistics in a visual way. Plot the observed sequences and a simulated sequence as a reference, and compare observed graphs with the reference to see which one is unusual:

  1. How many points are at the top (equivalently, at the bottom)
  2. Length of the longest horizontal segment
  3. Density of vertical lines

Now watch the Flash animation below [Fullscreen Flash animation]:

Sep 06, 2008

The other day I sent a small assignment to a group of people so that they could “play” with statistics and become more interested in this subject. The data provided to them is:

Download the file here (104K)

The data-generating process was quite simple: first I generated 20000 random numbers (10000 rows, 2 columns) from N(0, 1), and then added 10000 rows of numbers which lie exactly on a circle; finally I provided the data in a randomized order so that people could not easily discover the pattern just from the numbers.

The question is, how to reveal the particular pattern in this “pile of sand”? Let’s look at the original plot:

The original scatter plot

What can we observe from this scatter plot? Perhaps nothing but “a pile of sand”. However, if we choose alternative ways to create the plot again, things will be completely different. Here are my approaches:
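
For readers who want to experiment, the data-generating process described above can be imitated as follows. The radius 0.5 and the seed are assumptions (the post does not state them), and the plotting variations at the end are merely common remedies for overplotting, not necessarily the approaches this post goes on to describe.

```r
set.seed(2008)
n <- 10000
r <- 0.5                                      # assumed radius of the circle

noise  <- matrix(rnorm(2 * n), ncol = 2)      # 10000 rows from N(0, 1)
theta  <- runif(n, 0, 2 * pi)
circle <- r * cbind(cos(theta), sin(theta))   # 10000 rows exactly on a circle

dat <- rbind(noise, circle)
dat <- dat[sample(nrow(dat)), ]               # shuffle the rows

op <- par(mfrow = c(1, 3), mar = c(2, 2, 2, 1))
plot(dat, main = "default symbols")
plot(dat, pch = ".", main = "tiny symbols")
plot(dat, pch = 20, col = rgb(0, 0, 0, alpha = 0.02),
     main = "transparent points")
par(op)
```

With tiny or semi-transparent symbols, regions where many points pile up look darker than their surroundings, which is what may let the ring stand out from the “sand”.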

May 11, 2008

I was unfortunately caught in the rain this evening and my glasses were blurred by the rain drops. When I came back to the office, took them off and was about to wipe off the drops, I found a good RNG (Random Number Generator) from nature:

People who are familiar with statistical computation must have learned how to generate random numbers following a U(0, 1) distribution. One of the most common generators is the linear congruential generator, and this is just what the rain drops on my glasses reminded me of. The plot below is made in R based on the linear congruential generator:
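
A linear congruential generator follows the recurrence x_{k+1} = (a x_k + b) mod m, and x_k / m is used as the U(0, 1) number. Here is a minimal sketch; the function name lcg() and the small parameters a = 65, b = 1, m = 2048 are assumptions chosen so that the structure is easy to see, and they are not necessarily the ones behind the plot in this post.

```r
lcg <- function(n, seed = 1, a = 65, b = 1, m = 2048) {
  x <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (a * state + b) %% m   # the linear congruential recurrence
    x[i] <- state
  }
  x / m                             # rescale to [0, 1)
}

u <- lcg(2048)

# successive pairs (u_k, u_{k+1}) fall on a few parallel lines
plot(u[-length(u)], u[-1], pch = 20, cex = 0.5,
     xlab = expression(u[k]), ylab = expression(u[k + 1]))
```

These parameters give the full period of 2048, so every value appears exactly once before the sequence repeats; the regular arrangement of the pairs is a well-known weakness of this kind of generator.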

May 09, 2008

Yesterday Fan J asked me to design a cover for the statistical journal of the postgraduates in our school, and I thought of the gradient descent algorithm, as the design should have something to do with statistics.

I have always insisted that students (as well as many researchers in the field of statistics) ignore the great importance of statistical computation. For example, every now and then I see presenters stop going further when they come to an issue related to computation; logistic regression is a very typical topic. Rarely have I seen authors mention the process of estimating the parameters in logistic regression. What’s more, some authors just tell us to find the roots of a series of partial derivative equations (as if they were quite easy to solve by hand), which is a conventional method in optimization, and I just cannot see where “IWLS” or “Fisher Scoring” is mentioned. Thus I designed this picture to emphasize the importance of statistical computation.
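
To show what is hidden behind “find the roots”, here is a minimal sketch of IWLS (iteratively reweighted least squares) for logistic regression, checked against glm(). The simulated data, the starting values and the stopping rule are assumptions made for the illustration.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
X <- cbind(1, x)                            # design matrix with intercept
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # simulated binary response

beta <- c(0, 0)                             # starting values
for (iter in 1:25) {
  eta <- drop(X %*% beta)                   # linear predictor
  mu  <- plogis(eta)                        # fitted probabilities
  w   <- mu * (1 - mu)                      # working weights
  z   <- eta + (y - mu) / w                 # working response
  # one weighted least squares step: solve (X'WX) beta = X'Wz
  beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}

fit <- glm(y ~ x, family = binomial)
print(rbind(iwls = unname(beta), glm = unname(coef(fit))))
```

For the logit link, each of these weighted least squares steps is the same as a Fisher scoring step, so the two names describe the same iteration here.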

WWW.YIHUI.NAME XIE@YIHUI.NAME © 2007 - 2010 by Yihui Xie