R Computation

Statistical computation with R

Mar 232010

Here are some (trivial) R tips in the course Stat 511. I’ll update this post till the semester is over.

  1. Formatting R Code

  2. I’ve submitted an R package named formatR to CRAN yesterday. This package should be easier than the code below, because there is a GUI to tidy your R code. Install with install.packages('formatR').

    Reading code is pain, but the well-formatted code might alleviate the pain a little bit. The function tidy.source() in the animation package can help us format our R code automatically. By default it will read your code in the clipboard, parse it and return the well-formatted code. You have options to keep or remove the comments/blank lines and set the width of the code, etc. Spaces and indent will be added automatically. This can save us time typing spaces and paying attention to indent.

    ## install.packages('animation') if it is not installed yet
    library(animation)
    ## copy some R code somewhere and type:
    tidy.source()
    ## or specify the path of your code file
    tidy.source(file.path(system.file(package = "graphics"), "demo", "image.R"))
    ## can also use a URL
    tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R')
    ## remove blank lines
    tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R',
               keep.blank.line = FALSE)
    ## remove comments
    tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R',
               keep.comment = FALSE)
    
Mar 122010

Stephanie asked in 511 today if we were able to get the random seed which was set by set.seed() but we were only given the random numbers (without knowing the seed). This kind of “hacker” questions sound interesting. One dirty solution should be the brute-force method, e.g:

# x: the random vector;
# FUN: the function that generates random numbers with the first argument
#      being the length of random numbers
# seed: candidate seeds to be tried one by one
# ...: other arguments to be passed to FUN
find.seed = function(x, FUN = rnorm, seed = 0:10000, ...) {
    res = NULL
    for (i in seed) {
        set.seed(i)
        rx = FUN(length(x), ...)
        # all() can be changed to all.equal() to obtain a rough solution
        #     allowing a little bit numeric errors
        if (all(x == rx)) {
            res = i
            break
        }
    }
    res
}
Aug 292009
Coin-flipping is a rather old topic in probability theory, so most of us think we know very well about it, however, the other day I saw a question about this old topic (in David Smith’s REvolution?) which was beyond me expectation: how many times do we need to flip the coin until we get a sequence of HTH and HTT respectively? (For example, for the sequence HHTH, the number for HTH to appear is 4, and in THTHTT, the number for HTT is 6.)

It seems that the two results are equivalent, as H and T occurs with equal probability 0.5, so we naturally believe the average numbers of steps to HTH and HTT are the same, but the fact is not as we imagined.

## smart guys use math formulae to solve the problem,
## but *lazy* guys like me use simulations with R
coin.seq = function(v) {
    x = NULL
    n = 0
    while (!identical(x, v)) {
        x = append(x[length(x) - 1:0], rbinom(1, 1, 0.5))
        n = n + 1
    }
    n
}
set.seed(919)
mean(htt <- replicate(1e+05, coin.seq(c(1, 0, 0))))
# [1] 8.00304
mean(hth <- replicate(1e+05, coin.seq(c(1, 0, 1))))
# [1] 10.0062

png("coin-htt-hth.png", height = 150, width = 500)
par(mar = c(3, 2.5, 0.1, 0), mgp = c(2, 0.8, 0))
boxplot(list(HTT = htt, HTH = hth), horizontal = T,
    xlab = "n", ylim = range(boxplot(list(HTT = htt, HTH = hth),
        plot = FALSE)$stats))
points(c(mean(htt), mean(hth)), 1:2, pch = 19)
dev.off()

The answer is counterintuitive, isn’t it?

Number of times needed to get HTT and HTH

Number of times needed to get HTT and HTH (bold segments are for median; dots denote mean)

Well, mathematicians certainly do not like my solution (I guess they even hate such an imprecise approach). I hope some smart guys can give me some hints on working out the probability distribution and hence the expectation.

May 112008

I was unfortunately caught in the rain this evening and my glasses were blurred by the rain drops. When I came back the office, took them off and was about to clear the beads, I just found a good RNG (Random Number Generator) from the nature:

People who are familiar with statistical computation must have learned how to generate random numbers following a U(0, 1) distribution. One of the most common generators is the linear congruential generator. This is just what I was reminded of by the rain drops on my glasses. The plot below is made in R based the linear congruential generator:

May 092008

Yesterday Fan J asked me for a cover design of the statistical journal of the postgraduates of our school, and I just remembered the gradient descent algorithm, as this design should have something to do with statistics.

I always insist that students (as well as many researchers in the field of statistics) have ignored the great importance of statistical computation. For example, every now and then I can see presenters prefer to stop going further when they come to an issue related to computations; logistic regression is just a very typical topic. Rarely have I seen any authors mentioned the process of estimating parameters in logistic regression, what’s more, some authors just tell us to find the roots of a series of partial derivative equations (as if they were quite easy to solve by hand), which is a conventional method in optimization — I just cannot see where “IWLS” or “Fisher Scoring” are mentioned. Thus I designed this picture to emphasize the importance of statistical computation.

WWW.YIHUI.NAME XIE@YIHUI.NAME © 2007 - 2010 by Yihui Xie