R Computation
Statistical computation with R
Here are some (trivial) R tips in the course Stat 511. I’ll update this post till the semester is over.
-
Formatting R Code
I’ve submitted an R package named
formatR to CRAN yesterday. This package should be easier than the code below, because there is a GUI to tidy your R code. Install with install.packages('formatR').
Reading code is pain, but the well-formatted code might alleviate the pain a little bit. The function tidy.source() in the animation package can help us format our R code automatically. By default it will read your code in the clipboard, parse it and return the well-formatted code. You have options to keep or remove the comments/blank lines and set the width of the code, etc. Spaces and indent will be added automatically. This can save us time typing spaces and paying attention to indent.
## install.packages('animation') if it is not installed yet
library(animation)
## copy some R code somewhere and type:
tidy.source()
## or specify the path of your code file
tidy.source(file.path(system.file(package = "graphics"), "demo", "image.R"))
## can also use a URL
tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R')
## remove blank lines
tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R',
keep.blank.line = FALSE)
## remove comments
tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R',
keep.comment = FALSE)
Stephanie asked in 511 today if we were able to get the random seed which was set by set.seed() but we were only given the random numbers (without knowing the seed). This kind of “hacker” questions sound interesting. One dirty solution should be the brute-force method, e.g:
# x: the random vector;
# FUN: the function that generates random numbers with the first argument
# being the length of random numbers
# seed: candidate seeds to be tried one by one
# ...: other arguments to be passed to FUN
find.seed = function(x, FUN = rnorm, seed = 0:10000, ...) {
res = NULL
for (i in seed) {
set.seed(i)
rx = FUN(length(x), ...)
# all() can be changed to all.equal() to obtain a rough solution
# allowing a little bit numeric errors
if (all(x == rx)) {
res = i
break
}
}
res
}
HTH and HTT respectively? (For example, for the sequence HHTH, the number for HTH to appear is 4, and in THTHTT, the number for HTT is 6.)
It seems that the two results are equivalent, as H and T occurs with equal probability 0.5, so we naturally believe the average numbers of steps to HTH and HTT are the same, but the fact is not as we imagined.
## smart guys use math formulae to solve the problem,
## but *lazy* guys like me use simulations with R
coin.seq = function(v) {
x = NULL
n = 0
while (!identical(x, v)) {
x = append(x[length(x) - 1:0], rbinom(1, 1, 0.5))
n = n + 1
}
n
}
set.seed(919)
mean(htt <- replicate(1e+05, coin.seq(c(1, 0, 0))))
# [1] 8.00304
mean(hth <- replicate(1e+05, coin.seq(c(1, 0, 1))))
# [1] 10.0062
png("coin-htt-hth.png", height = 150, width = 500)
par(mar = c(3, 2.5, 0.1, 0), mgp = c(2, 0.8, 0))
boxplot(list(HTT = htt, HTH = hth), horizontal = T,
xlab = "n", ylim = range(boxplot(list(HTT = htt, HTH = hth),
plot = FALSE)$stats))
points(c(mean(htt), mean(hth)), 1:2, pch = 19)
dev.off()
The answer is counterintuitive, isn’t it?
Well, mathematicians certainly do not like my solution (I guess they even hate such an imprecise approach). I hope some smart guys can give me some hints on working out the probability distribution and hence the expectation.
I was unfortunately caught in the rain this evening and my glasses were blurred by the rain drops. When I came back the office, took them off and was about to clear the beads, I just found a good RNG (Random Number Generator) from the nature:
People who are familiar with statistical computation must have learned how to generate random numbers following a U(0, 1) distribution. One of the most common generators is the linear congruential generator. This is just what I was reminded of by the rain drops on my glasses. The plot below is made in R based the linear congruential generator:
Yesterday Fan J asked me for a cover design of the statistical journal of the postgraduates of our school, and I just remembered the gradient descent algorithm, as this design should have something to do with statistics.
I always insist that students (as well as many researchers in the field of statistics) have ignored the great importance of statistical computation. For example, every now and then I can see presenters prefer to stop going further when they come to an issue related to computations; logistic regression is just a very typical topic. Rarely have I seen any authors mentioned the process of estimating parameters in logistic regression, what’s more, some authors just tell us to find the roots of a series of partial derivative equations (as if they were quite easy to solve by hand), which is a conventional method in optimization — I just cannot see where “IWLS” or “Fisher Scoring” are mentioned. Thus I designed this picture to emphasize the importance of statistical computation.



Recent Comments