This blog post is mainly for Stat 579 students working on the homework for week 7. I received too many “gory” loops in the submissions, and I think it would help a bit to write down my thoughts on R loops for beginners. The immortal motto for newbies in programming is:
If you want to make an apple pie from scratch, you must first create the universe.
There have been endless wars over which programming language is better than the others, but in my view it all comes down to a balance between code performance and the amount of work for the programmer. Taken to the extreme, almost every language gives you the ability to create the universe, but you do not really have to if you just want to make an apple pie.
R was born after S, a language that was invented to turn ideas into software quickly and faithfully, and that received the ACM Software System Award in 1998. Before the S language, statisticians often had to write “gory” low-level computing routines to do data analysis and statistical computation, including those “gory” loops, of course. For example, imagine what you would have to do to compute correlation coefficients in C.
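For contrast, here is a minimal sketch in R itself. The hand-written function mimics the bookkeeping a C routine would need (the name `cor_loop` and the sample data are made up for illustration), while base R's `cor()` does the same job in one call.

```r
# Pearson correlation the "low-level" way, with explicit loops.
# cor_loop is a hypothetical name used only for this illustration.
cor_loop <- function(x, y) {
  n <- length(x)
  sum_x <- 0
  sum_y <- 0
  for (i in seq_len(n)) {        # first pass: accumulate sums for the means
    sum_x <- sum_x + x[i]
    sum_y <- sum_y + y[i]
  }
  mean_x <- sum_x / n
  mean_y <- sum_y / n
  sxy <- 0
  sxx <- 0
  syy <- 0
  for (i in seq_len(n)) {        # second pass: cross-products and squares
    dx <- x[i] - mean_x
    dy <- y[i] - mean_y
    sxy <- sxy + dx * dy
    sxx <- sxx + dx^2
    syy <- syy + dy^2
  }
  sxy / sqrt(sxx * syy)
}

x <- c(1, 2, 3, 4, 5)            # made-up data
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

cor_loop(x, y)  # hand-rolled version
cor(x, y)       # built-in version, one call
```

The two calls should agree up to floating-point error; the difference is how much code we had to write and check.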
R wraps a lot of common tasks in routines written in lower-level programming languages (mainly C and Fortran), which makes them easier to call and faster to run (explicit loops in R are generally slower than loops in low-level languages). This frees statisticians from paying too much attention to the gory details of computation. The consequence, however, is that we have so many tools in our hands that we are often unaware of them. I have no quick solution to this problem; we have to learn more about the capabilities of R in many ways, e.g. reading the R-help mailing list, asking experts, doing daily work with R, reading the source code of R functions, and playing with the examples in the help pages.
Turning to this homework specifically, I saw that most submissions used long loops, which is absolutely OK, since that was what we learned in class, and it is important to know how to write loops. Some loops are inevitable, but some are not. The rule of thumb is: if your natural reaction to a problem is “why does R not have this common functionality?”, then a function for it probably does exist in R. For example, several students used this function to concatenate all elements of a vector into a single string:
But in fact you will get a neat solution if you take a closer look at the help page of
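As a minimal sketch of the contrast (the name `concat_loop` and the sample vector are illustrative and not taken from any submission), a loop-based concatenation can be set next to base R's `paste()` with its `collapse` argument, which is one function that does this job directly:

```r
# Loop-based concatenation: grow the result one element at a time.
# concat_loop is a hypothetical name used only for this illustration.
concat_loop <- function(x) {
  out <- ""
  for (el in x) {
    out <- paste(out, el, sep = "")
  }
  out
}

x <- c("a", "b", "c")

concat_loop(x)            # loop version
paste(x, collapse = "")   # single call, no explicit loop
```

Both calls return the same single string; the second needs no helper function at all.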
This is one of the thousands of stories in which we created the universe to make an apple pie without knowing there was a perfect apple pie machine. Sometimes the feeling that we have the power to create the universe is so strong that we do so even when we know the apple pie machine exists, e.g. here is a function to count the number of 0’s and 1’s in a vector:
The loop looks pretty much like one in a low-level language such as C or Fortran: we assign initial values to a recording variable, run the loop and collect the result. But frequency tables are so common in statistics that it is hard to imagine R leaving out such functionality, and indeed it has table(), as we see in the second-to-last line of the code above.
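A small sketch of the same idea, with a made-up vector and a hypothetical helper `count_loop` (not the function from the homework): the loop keeps two counters by hand, while `table()` builds the frequency table in one call.

```r
# Count 0's and 1's by hand, C-style.
# count_loop is a hypothetical name used only for this illustration.
count_loop <- function(x) {
  n0 <- 0
  n1 <- 0
  for (v in x) {
    if (v == 0) {
      n0 <- n0 + 1
    } else if (v == 1) {
      n1 <- n1 + 1
    }
  }
  c("0" = n0, "1" = n1)
}

x <- c(0, 1, 1, 0, 1)   # made-up data

count_loop(x)  # manual counters
table(x)       # frequency table in one call
```

If a value might be absent from the data, `table(factor(x, levels = c(0, 1)))` still reports it with a count of zero.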
Now I give my solutions as promised:
There are no explicit loops above. Instead, all loops are implicit, i.e. we let the more efficient low-level languages do the looping for R. This is called vectorization. We benefit a lot from vectorization: it not only means less coding work, but in general also brings a huge improvement in efficiency (speed). If we write the above functions with loops, they will look like this (for Q2 only):
A simple timing test shows that it is much slower than my first version:
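To illustrate the kind of comparison on a different, deliberately simple task (summing squares; this is not the Q2 code, and both function names are hypothetical), a sketch with `system.time()` could look like this:

```r
# Explicit loop: accumulate the sum of squares element by element.
sum_sq_loop <- function(x) {
  total <- 0
  for (v in x) {
    total <- total + v^2
  }
  total
}

# Vectorized: x^2 and sum() both loop in compiled code.
sum_sq_vec <- function(x) sum(x^2)

x <- runif(1e6)   # made-up data

system.time(r1 <- sum_sq_loop(x))
system.time(r2 <- sum_sq_vec(x))

all.equal(r1, r2)  # same answer up to floating-point error
```

The exact timings depend on the machine and the R version, so it is worth running both lines a few times before drawing conclusions.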
A few output examples:
So remember the pain of struggling with this homework; it is the same pain statisticians felt before the S language was invented. And begin to breathe the fresh air in the R empire with vectorization!