Yihui Xie http://yihui.name 2017-01-03T22:15:20+00:00 xie@yihui.name A Letter of Recommendation for Nan Xiao /en/2014/11/lor-nan-xiao/ 2014-11-18T00:00:00+00:00 Yihui Xie /en/2014/11/lor-nan-xiao

I hope my letter could boost this guy up like:

I’m not sure if I’m a good observer, but time and time again I feel some people are undervalued, or they were not given better opportunities to show their value. Not surprisingly, I know quite a few such people in the Chinese R/stats community, mainly because of the website Capital of Statistics (COS) that I founded a number of years ago.

I believe Nan Xiao is among these undervalued people, which is why I’m writing a public letter of recommendation for him to apply to a stats/biostats/bioinformatics program in the US. You can go to his website http://r2s.name to know more about him, and I’m not going to repeat his information here.

As someone who went through the same application process six years ago, I know it is difficult to get an offer unless you are from a top university in China in the eyes of the admission committee. By “top” (in terms of the major of statistics), it basically means Peking U, Tsinghua, USTC, Fudan, Beijing Normal U, and perhaps one or two other universities. My alma mater was Renmin U (in Beijing), which nobody knows, unfortunately. The statistics program at Renmin actually has got the highest ranking in China this year, and I’m not surprised at all. Perhaps Renmin does not offer the best math training to students in statistics, but I think its program is well balanced between application and theory. In recent years, they have been putting more emphasis on the math training to catch up with the “top” universities. Personally I do not think this is a good idea, but it seems to make the admission committees in the US more comfortable. Anyway, the first driving force of my admission to Iowa State U was probably my work on the animation package, which was also why I was acquainted with my PhD advisors Di and Heike before I applied to Iowa State.

To some degree, I was very fortunate since my research interest, statistical graphics, at that time was not the “mainstream” in statistics (it still is not), and it happened that there were two professors with the same research interest, so it was fairly easy to make the deal. Nan’s interests (machine learning/bioinformatics) are broader than mine, and I think he will face more competition as a consequence. Given his educational background from a university that is not widely known, I’m trying to make him more visible, although my influence might be very limited. I believe he will make a better contribution during his PhD training than I did, if his potential is put to good use.

I have known Nan for quite a few years. We have physically met only once during the 6th Chinese R conference last year, but I have been reading his forum posts in COS and blog posts since circa 2008. He is one of the best hackers that I know, with a very good sense of beauty. Apparently, hacking skills are becoming more and more important in this age of data (excuse me, but I do hate saying “big data” when “big” is meaningless). Let me enumerate some of my observations about him:

  • He knows the web well (scraping data, security issues, and so on). To his future advisor/department, this means he could be very helpful if you need to obtain data from the web, and he may be able to improve the department IT support, which often sucks in my experience.
  • He is a superb presenter. He has an outstanding presenting style, which you can see from his past talks (it does not matter if you do not understand Chinese). You may underestimate the importance of this, but please recall how much you (or is it just me?) wanted to fall asleep during the Joint Statistical Meetings, when everybody was using the same blue Beamer style, with page after page of equations.
  • My favorite illustration among his blog posts is this one: http://r2s.name/cn/r/ria.html
  • He has deep interests in data visualization, in particular, network visualization. Look at his list of papers on his website! Aren’t those graphs beautiful?
  • He has worked on the translation of three books into Chinese with other people. To translate a book, you certainly have to understand it. You probably should not have any doubts about how well he knows R, graphics, and data mining methods.
  • I do not formally collaborate with him very often, but you may want to look at the SVD example in his projects. He did it after I said “How about a Shiny app?”. I believe there have been many other SVD examples with a similar idea, but I was still impressed by how quickly he made it. If you are familiar with Shiny, you may also be impressed by his taste in design (I love the “Crouching Tiger Hidden Dragon” picture. Looks so cool!).
  • I know little about bioinformatics, chemoinformatics, or pharmacology, so I’m not going to comment on these specifics. There is one thing that I’m sure about, though, which is his eagerness for making substantial contributions to science that can eventually benefit some people. I trust his sincerity.

If you are looking for a PhD student in a stats-related program in 2015, please consider this guy. A job in the industry is also a possibility for him, so please also consider him if your company has a position for an awesome hacker. Email me if you have more questions, or forward my blog post to your colleagues/friends who might be interested.

library() vs require() in R /en/2014/07/library-vs-require/ 2014-07-26T00:00:00+00:00 Yihui Xie /en/2014/07/library-vs-require While I was sitting in a conference room at UseR! 2014, I started counting the number of times that require() was used in the presentations, and would rant about it after I counted to ten. With drums rolling, David won this little award (sorry, I did not really mean this to you).

After I tweeted about it, some useRs seemed to be unhappy and asked me why. Both require() and library() can load (strictly speaking, attach) an R package. Why should one not use require()? The answer is pretty simple. If you take a look at the source code of require (use the source, Luke, as Martin Mächler mentioned in his invited talk), you will see that require() basically means “try to load the package using library() and return a logical value indicating the success or failure”. In other words, library() loads a package, and require() tries to load a package. So when you want to load a package, do you load a package or try to load a package? It should be crystal clear.

One bad consequence of require() is that if you require('foo') at the beginning of an R script, and use a function bar() in the foo package on line 175, R will throw the error could not find function “bar” if foo was not installed. That is too late and sometimes difficult for other people to understand if they use your script but are not familiar with the foo package – they may ask, what is the bar function, and where is it from? When your code is going to fail, fail loudly, early, and with a relevant error message. require() does not signal an error, and library() does.
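The difference is easy to see in a short sketch. The package name below is a made-up placeholder that is assumed not to be installed; everything else is base R.

```r
# Hypothetical package name, assumed NOT to be installed on this machine.
pkg <- "aPackageThatSurelyDoesNotExist"

# require() only tries: it returns FALSE (with a warning) and the script goes on.
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(ok)

# library() fails right away with an error that names the missing package.
err <- tryCatch(
  library(pkg, character.only = TRUE),
  error = function(e) e
)
print(conditionMessage(err))
```

With require(), the failure only surfaces later, at the first call to a function from the missing package.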

Sometimes you do need require() to use a package conditionally (e.g. the sun is not going to explode without this package), in which case you may use an if statement, e.g.

if (require('foo')) {
  # use functions from the foo package here
} else {
  warning('You missed an awesome function')
}

That should be what require() was designed for, but it is common to see R code like this as well:

if (!require('foo')) {
  stop('The package foo was not installed')
}

Recall that:

  • library('foo') stops when foo was not installed
  • require() is basically try(library())

Then if (!require('foo')) stop() is basically “if you failed to try to load this package, please fail”. I do not quite understand why it is worth the detour, except when one wants an error message different from the one from library(); otherwise one can simply load and fail.
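For the one case where a different error message is wanted, here is a minimal sketch that keeps library() in charge of the loading. The helper name load_or_explain() and the wording of the message are made up for illustration; they are not part of base R.

```r
# Hypothetical helper (not part of base R): load a package, and if that fails,
# stop with a message that tells the user what to do next.
load_or_explain <- function(pkg) {
  tryCatch(
    library(pkg, character.only = TRUE),
    error = function(e) {
      stop("Package '", pkg, "' is needed by this script; try install.packages('",
           pkg, "')", call. = FALSE)
    }
  )
  invisible(TRUE)
}

load_or_explain("stats")  # ships with R, so this succeeds silently
```

The script still fails early and loudly when the package is missing, only with a friendlier message.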

There is one legitimate reason to use require(), though, and that is, “require is a verb and library is a noun!” I completely agree. require should have been a very nice name to choose for the purpose of loading a package, but unfortunately… you know.

If you take a look at the StackOverflow question on this, you will see a comment on “package vs library” was up-voted a lot of times. It used to make a lot of sense to me, but now I do not care as much as I did. There have been useRs (including me up to a certain point) desperately explaining the difference between the two terms package and library, but somehow I think R’s definition of a library is indeed unusual, and the function library() makes the situation worse. Now I’m totally fine if anyone calls my packages “libraries”, because I know what you mean.

Karthik Ram suggested this GIF to express “Ah a new library, but require? Noooooo”:

Since you have read the source code, Luke, you may have found that you can abuse require() a bit, for example:

> (require(c('MASS', 'nnet')))
c("Loading required package: c", "Loading required package: MASS",
  "Loading required package: nnet")
Failed with error:  'package' must be of length 1
In addition: Warning message:
In if (!loaded) { :
  the condition has length > 1 and only the first element will be used

> (require(c('MASS', 'nnet'), character.only = TRUE))
c("Loading required package: MASS", "Loading required package: nnet")
Failed with error:  'package' must be of length 1
In addition: Warning message:
In if (!loaded) { :
  the condition has length > 1 and only the first element will be used

> library(c('MASS', 'nnet'), character.only = TRUE)
Error in library(c("MASS", "nnet"), character.only = TRUE) : 
  'package' must be of length 1

So require() failed not because MASS and nnet did not exist, but because of a different error. As long as there is an error (no matter what it is), require() returns FALSE.

One thing off-topic while I’m talking about these two functions: the argument character.only = FALSE for library() and require() is a design mistake in my eyes. It seems the original author(s) wanted to be lazy to avoid typing the quotes around the package name, so library(foo) works like library("foo"). Once you show people they can be lazy, you can never pull them back. Apparently, the editors of JSS (Journal of Statistical Software) have been trying to promote the form library("foo") and discourage library(foo), but I do not think it makes much sense now or it will change anything. If it were in the 90’s, I’d wholeheartedly support it. It is simply way too late now. Yes, two extra quotation marks will kill many kittens on this planet. If you are familiar with *nix commands, this idea is not new – just think about tar -z -x -f, tar -zxf, and tar zxf.
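One practical cost of the unquoted form shows up when the package name is stored in a variable: then character.only = TRUE becomes mandatory. A small sketch, using two packages that ship with R so that it should run anywhere:

```r
# Package names stored as strings, e.g. read from a configuration file.
# Both packages ship with R, so this should run on any installation.
pkgs <- c("stats", "utils")

for (p in pkgs) {
  # Without character.only = TRUE, library(p) would look for a package named "p".
  library(p, character.only = TRUE)
}

attached <- paste0("package:", pkgs) %in% search()
print(attached)
```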

One last mildly annoying issue with require() is that it is noisy by default, because of the default quietly = FALSE, e.g.

> require('nnet')
Loading required package: nnet
> require('MASS', quietly = TRUE)

So when I tell you to load a package, you tell me you are loading a package, as if you had heard me. Oh thank you!

A Few Notes on UseR! 2014 /en/2014/07/a-few-notes-on-user2014/ 2014-07-26T00:00:00+00:00 Yihui Xie /en/2014/07/a-few-notes-on-user2014 It has been a month since the UseR! 2014 conference, and I’m probably the last one who writes about it. UseR! is my favorite conference because it is technical and not too big. I have completely lost interest in big and broad conferences like JSM (to me, it has become Joint Sightseeing Meetings). Karl has written two blog posts about UseR! (1-2, 3-4), and I’m going to add a few more observations here. An important disclaimer before I move on: Karl Broman is not responsible for videotaping at UseR! 2014 (neither am I), and he is not even on the conference committee. I accidentally mentioned the videos when replying to his tweet, which unfortunately seems to have caused confusion. Any questions about the conference, including the time to publish videos, should be directed to the official organizing committee.

The conference website is hosted on Github. Awesome. Speakers can add links to their slides through pull requests. Genius. It is a little sad, though, that each UseR conference has its own website and twitter handle. There should have been a single website, a single domain name, and a single twitter account managing all the R conferences each year. Fragmentation is just such a natural thing in programmers’ world.

The on-campus dorms were fantastic (oh when I wrote “dorm” I almost typed “dnorm”), which saved us time on transportation. Dining halls were on campus as well. Breakfast was perfect, although I could not stand eating sandwiches, hamburgers, or pizza for four days. Okay, let’s talk about the talks. You can find most of the slides on the conference website.

  • John Chambers: the three promising projects Rcpp/Rcpp11, RLLVM, and h2o. I do not know anything about h2o. I think Rcpp has been a great success, and I have my blind faith in Romain for Rcpp11. For RLLVM, I’m a little concerned: 15 stars, and 4 forks on Github. That is not a good sign. The bus factor is too low. Well, I have been admiring Duncan for a huge number of his amazing packages. Perhaps he can handle this one on his own as well.
  • Ramnath Vaidyanathan: all JavaScript libraries are belong to Ramnath! If you want to get a certain JS library integrated with R, tell him the night before, go to sleep, and you will see it in the R world the next morning. I’m only partially kidding :)
  • Gordon Woodhull: it was great to see the substantial progress in RCloud.
  • Jeroen Ooms: I came from the RWeb age, so you know how excited I was when I saw OpenCPU a couple of years ago. There were things I wished I could do for years but were too complicated before OpenCPU was launched. For example, the knitr issue #51 was probably my first experiment with OpenCPU, and I had a lot of fun with it.
  • Jonathan Godfrey: I did not attend his talk, but he attended my tutorial and talked to me a couple of times during the conference. That was the first time I had talked to a blind R user, and was surprised to know a few facts:
    • PDF is bad for blind people, and HTML is much better (think R package vignettes);
    • it is nearly impossible for them to read raster images, and SVG graphics can be better;
    • if there is an image in an HTML document, its alt attribute is very important;
    • LaTeX math expressions created by MathJax look excellent in the eyes of sighted people, but they seem to be hard to read by the blind (Jonathan mentioned to me afterwards that it may be possible to configure MathJax to make it readable but he is not sure how to do it; the math expressions on Wikipedia are displayed as images with the alt attribute, and he can read those math expressions);
  • Matt Dowle: his data.table story with Patrick Burns was pretty interesting. You can read his slides.
  • Aran Lunzer: LivelyR. It was not a talk. It was simply magic. I had absolutely no clue how they made it, even though I’m one of the (co-)authors of the packages that they used.
  • Dirk Eddelbuettel: Rcpp and Docker. I was really glad that he mentioned Docker. Travis CI seems to have attracted a lot of attention of R package authors after I was inspired by a reader of my blog and experimented with it last year. The major missing piece on Travis CI is R for R users. apt-get install r-base every time is a waste of time and resources. It will be nice if one can build one’s own virtual machine with all the necessary packages. This is pretty simple with Docker. However, I have not found a free service like Travis CI that allows the users to build/test software with custom Docker containers.
  • Martin Mächler: good practices in R programming. You can find slides on his homepage. This was an excellent talk. Precise and clear. I recommend everyone to read his slides. One minor and subjective issue is = vs <-. Not many people are with me, and once Alyssa Frazee’s post made me cry a little in the restroom (so excited to find another person using = in R).
  • Andy Chen: RLint. Programming styles? I guess a single programming style will never happen. I insist on using = instead of <- for assignment in R, except when I collaborate with the left arrow party. Roger Peng insists on 8 spaces for indentation. Excuse me? What is a programming style?
  • John Nash: it was a great pleasure to meet John in person for the first time. I love hearing old stories from senior people, such as Jeff Laake telling me early stories about ADMB and CRAN, and John Kimmel mentioning John Tukey and the development of interactive graphics at Bell Labs. John (Nash) showed me some Fortran, Pascal, and BASIC programs that were even older than me. Personally I have no interest in these languages, but it was interesting to know what was done before you were born, and some of the programs are still “alive” in R. Since he was so eager to run Fortran code in knitr documents, we sat together for a few minutes, he wrote a Fortran example, and I just added a quick and dirty Fortran engine in knitr.
  • I gave a talk titled “Knitr Ninja”, and a few people remembered sword(2) after that. I was extremely bored by myself after I had given so many talks on knitr, so I thought I should do a completely different talk that nobody had heard of, including my bosses at RStudio. It turned out that the RStudio viewer was pretty handy for presentations, and I could show Kakashi in it:

    Kakashi Lightning Blade

  • Katharine Mullen: JSS. Honestly I’m a little concerned about JSS, although I strongly believe it is an outstanding journal. I have only published one paper in it (the animation package), and the review process was too slow. Four months go by and you hear nothing back so you ask what’s up. Then four more months go by, and you get the first round of review. Sometimes I do not quite understand how free journals work, or what motivates the anonymous reviewers. I think this is a pretty hard problem, and I would propose to open up the journal to the wild just like open source software. Even 58 editors on board is still too few compared to the number of authors and submissions. I was extremely excited that Jan de Leeuw’s very first proposal of establishing such a journal was that the journal should be done in HTML!! And interactive, where possible!! That was the year 1995 (I was still in the fifth grade in elementary school learning fractions in a village). Twenty years later, I think the infrastructure is good enough (e.g. R Markdown, Shiny, Shiny Server) to go back to his original proposal. I would love to see papers in HTML instead of PDF. Typesetting with HTML is a whole lot easier and more attractive than LaTeX/PDF in my opinion, and there is a whole lot more interesting stuff to play with in HTML. With R Markdown v2, HTML and PDF are not mutually exclusive, although we will have to give up certain markups in LaTeX, but man, do you really need \proglang{} and \pkg{}?

Finally it is rumor time:

  • iris has been officially declared (? by whom? perhaps by the many sleepy faces in the audience) as the dataset porn in R, and the next candidate will be ggplot2::diamonds!
Markdown or LaTeX? /en/2013/10/markdown-or-latex/ 2013-10-19T00:00:00+00:00 Yihui Xie /en/2013/10/markdown-or-latex

What happens if you ask for too much power from Markdown?

R Markdown is one of the document formats that knitr supports, and it is probably the most popular one. I have been asked many times about the choice between Markdown and LaTeX, so I think I’d better wrap up my opinions in a blog post. These two languages (do you really call Markdown a language?) are kind of at the two extremes: Markdown is super easy to learn and type, but it is primarily targeted at HTML pages, and you do not have fine control over typesetting (really? really?), because you only have a very limited number of HTML tags in the output; LaTeX is relatively difficult to learn and type, but it allows you to do precise typesetting (you have control over anything, and that is probably why a lot of time can be wasted).

What is the problem?

What is the root problem? I think one word answers everything: page! Why do we need pages? Printing is the answer.

In my eyes, the biggest challenge for typesetting is to arrange elements properly with the restriction of pages. This restriction seems trivial, but it is really the root of all “evil”. Without having to put things on pages, life can be much easier in writing.

What is the root of this root problem in LaTeX? One concept: floating environments. If everything comes in a strictly linear fashion, writing will be just writing; typesetting should be no big deal. Because a graph cannot be broken over two pages, it is hard to find a place to put it. By default, it can float to unexpected places. The same problem can happen to tables (see the end of a previous post). You may have to add or delete some words to make sure they float to proper places. That is endless trouble in LaTeX.

There is no such problem in HTML/Markdown, because there is no page. You just keep writing, and everything appears linearly.

Can I have both HTML and PDF output?

There is no fault in being greedy, and it is natural to ask the question whether one can have both HTML and PDF output from a single source document. The answer is maybe yes: you can go from LaTeX to HTML, or from Markdown to LaTeX/PDF.

  • pandoc can convert Markdown to almost anything
  • many tools to convert LaTeX to HTML

But remember, Markdown was designed for HTML, and LaTeX was for PDF and related output formats. If you ask for more power from either language, the result is not likely to be ideal, otherwise one of them must die.

How to make the decision?

If your writing does not involve complicated typesetting and primarily consists of text (especially no floating environments), go with Markdown. I cannot think of a reason why you must use LaTeX to write a novel. See Hadley’s new book Advanced R programming for an excellent example of Markdown + knitr + other tools: the typesetting elements in this book are very simple – section headers, paragraphs, and code/output. That is pretty much it. Eventually it should be relatively easy to convert those Markdown files to LaTeX via Pandoc, and publish a PDF using the LaTeX class from Chapman & Hall.

For the rest of you, what I’d recommend is to think early and make a decision in the beginning; avoid having both HTML and PDF in mind. Ask yourself only one question: must I print the results nicely on paper? If the answer is yes, go with LaTeX; otherwise just choose whatever makes you comfortable. The book Text Analysis with R authored by Matthew Jockers is an example of LaTeX + knitr. Matt also asked me this question about Markdown vs LaTeX last week while he was here at Iowa State. For this particular book, I think Markdown is probably OK, although I’m not quite sure about a few environments in the book, such as the chapter abstracts.

It is not obvious whether we must print certain things. I think we are just too used to printing. For example, dear professors, must we print our homework? (apparently Jenny does not think so; I saw her grade homework on RPubs.com!) Or dear customers, must we submit reports in PDF? … In this era, you have laptops, iPad, Kindle, tablets and all kinds of electronic devices that can show rich media, why must you print everything (in black and white)?

For those who are still reading this post, let me finish with a side story: Matt, a LaTeX novice, taught himself LaTeX a few months ago, and he has finished the draft of a book with LaTeX! Why are you still hesitating about the choice of tools? Shouldn’t you just go ahead and get the * done? Although all roads lead to Rome, some people die at the starting line instead of on the roads.

Testing R Packages /en/2013/09/testing-r-packages/ 2013-09-30T00:00:00+00:00 Yihui Xie /en/2013/09/testing-r-packages This guy th3james claimed Testing Code Is Simple, and I agree. In the R world, this is not anything new. As far as I can see, there are three schools of R users with different testing techniques:

  1. tests are put under package/tests/, and a foo-test.Rout.save is generated from R CMD BATCH foo-test.R; testing is done by comparing foo-test.Rout from R CMD check with your foo-test.Rout.save; R notifies you when it sees text differences; this is typically used by R core and followers
  2. RUnit and its followers: formal ideas were borrowed from other languages and frameworks and it looks like there is a lot to learn before you can get started
  3. the testthat family: tests are expressed as expect_something() like a natural human language

At its core, testing is nothing but “tell me if something unexpected happened”. The usual way to tell you is to signal an error. In R, that means stop(). A very simple way to write a test for the function FUN() is:

if (!identical(FUN(arg1 = val1, arg2 = val2, ...), expected_value)) {
  stop('FUN() did not return the expected value!')
}

That is, when we pass the values val1 and val2 to the arguments arg1 and arg2, respectively, the function FUN() should return a value identical to our expected value, otherwise we signal an error. If R CMD check sees an error, it will stop and fail.
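Here is a concrete instance of this pattern, as a minimal sketch. The function add_one() is made up for illustration; the second check is wrapped in tryCatch() only so that the deliberately failing test does not abort the example.

```r
# Hypothetical function under test.
add_one <- function(x) x + 1

# The test: signal an error if the result is not what we expect.
if (!identical(add_one(1), 2)) {
  stop('add_one() did not return the expected value!')
}

# A failing expectation, wrapped in tryCatch() so the sketch keeps running.
# Note that identical() is strict: the integer 2L is not identical to the double 2.
result <- tryCatch(
  {
    if (!identical(add_one(1L), 2L)) stop('add_one(1L) is not the integer 2L')
    "passed"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```

The second check fails because 1L + 1 is a double, which is a reminder that identical() compares types as well as values.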

For me, I only want one thing for unit testing: I want the non-exported functions to be visible to me during testing; unit testing should have all “units” available, but R’s namespace has intentionally restricted the objects that are visible to the end users of a package, which is a Very Good Thing to end users. It is less convenient to the package author, since he/she will have to use the triple colon syntax such as foo:::hidden_fun() when testing the function hidden_fun().

I wrote a tiny package called testit after John Ramey dropped by my office one afternoon while I was doing an internship at Fred Hutchinson Cancer Research Center last year. I thought a while about the three testing approaches, and decided to write my own package because I did not like the first approach (text comparison), and I did not want to learn or remember the new vocabulary of RUnit or testthat. There is only one function for the testing purpose in this package: assert().

assert(
  "1 plus 1 is equal to 2",
  1 + 1 == 2
)

You can write multiple testing conditions, e.g.

assert(
  "1 plus 1 is equal to 2",
  1 + 1 == 2,
  identical(1 + 1, 2),
  (1 + 1 >= 2) && (1 + 1 <= 2), # mathematician's proof
  c(is.numeric(1 + 1), is.numeric(2))
)
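For readers who do not have testit installed, the idea can be approximated in a few lines of base R. This is only a sketch of the concept, not testit's actual implementation; the name my_assert() is made up.

```r
# Hypothetical stand-in for an assert()-like function: a description of the
# fact, followed by any number of logical conditions that must all be TRUE.
my_assert <- function(fact, ...) {
  conds <- list(...)
  for (cond in conds) {
    # Every element of every condition must be a non-missing TRUE.
    if (!is.logical(cond) || anyNA(cond) || !all(cond)) {
      stop("assertion failed: ", fact, call. = FALSE)
    }
  }
  invisible(TRUE)
}

my_assert(
  "1 plus 1 is equal to 2",
  1 + 1 == 2,
  c(is.numeric(1 + 1), is.numeric(2))
)
```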

There is another function test_pkg() to run all tests of a package using an empty environment with the package namespace as its parent environment, which means all objects in the package, exported or not, are directly available without ::: in the test scripts. See the CRAN page for a list of packages that use testit, for example, my highr package, where you can find some examples of tests.
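The environment trick can be illustrated with base R alone. The sketch below uses the stats package only because it ships with R; it shows the lookup mechanism described above and is not the code of test_pkg() itself.

```r
# "stats" ships with R; any installed package would do.
ns <- asNamespace("stats")

# Objects that live in the namespace but are not exported.
hidden <- setdiff(ls(ns), getNamespaceExports("stats"))
name <- hidden[1]

# An empty environment whose parent is the package namespace.
test_env <- new.env(parent = ns)

# Lookup starts in test_env and continues into the namespace ...
found_in_test_env <- exists(name, envir = test_env)
# ... and the object is reachable by name with no ':::' in the lookup.
obj <- get(name, envir = test_env)

print(name)
print(found_in_test_env)
```

Code evaluated in such an environment sees internal functions by their plain names, while anything the tests create stays in the empty child environment instead of polluting the namespace.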

While I do not like the text comparison approach, it does not mean it is not useful. Actually it is extremely useful when testing text document output. It is just a little awkward when testing function output. The text comparison approach plays an important role in the development of knitr: I have a Github repository knitr-examples, which serves as both an example repo and a testing repo. When I push new commits to Github, I use Travis CI to test the package, and there are two parts of the tests: one is to run R CMD check on the package, which uses testit to run the test R scripts, and the other is to re-compile all the examples, and do git diff to see if there are changes. I have more than 100 examples, which should have reasonable coverage of possible problems in the new changes in knitr. This way, I feel comfortable when I bring new features or make changes in knitr because I know they are unlikely to break old documents.
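The text comparison idea itself fits in a few lines. The sketch below stands in for "re-compile an example and diff it against the committed version"; render_example() and the temporary reference file are placeholders, not part of knitr or knitr-examples.

```r
# Hypothetical stand-in for "re-compile an example and capture its text output".
render_example <- function() capture.output(print(summary(c(1, 2, 3, 4))))

# First run: save the output as the reference file.
ref_file <- tempfile(fileext = ".txt")
writeLines(render_example(), ref_file)

# Later run: regenerate the output and compare with the saved reference.
new_output <- render_example()
unchanged <- identical(new_output, readLines(ref_file))

# Lines that differ, if any (a crude stand-in for git diff).
changed_lines <- setdiff(new_output, readLines(ref_file))
print(unchanged)
```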

If you are new to testing and only have 3 minutes, I’d strongly recommend you to read at least the first two sections of Hadley’s testthat article.

After Three Months I Cannot Reproduce My Own Book /en/2013/09/cannot-reproduce-my-own-book/ 2013-09-05T00:00:00+00:00 Yihui Xie /en/2013/09/cannot-reproduce-my-own-book

I thought I could easily jump to a high standard (reproducibility), but I failed.

Some of you may have noticed that the knitr book is finally out. Amazon is offering a good price at the moment, so if you are interested, you’d better hurry up.

I avoided the phrase “Reproducible Research” in the book title, because I did not want to take that responsibility, although it is related to reproducible research in some sense. The book was written with knitr v1.3 and R 3.0.1, as you can see from my sessionInfo() in the preface.
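Recording the software environment takes very little code. A minimal sketch, in which the package names are placeholders for whatever a document actually depends on:

```r
# Record the R version and the versions of the packages a document depends on.
# "stats" and "utils" are placeholders for the packages actually used.
deps <- c("stats", "utils")
env_record <- c(
  R = as.character(getRversion()),
  vapply(deps, function(p) as.character(packageVersion(p)), character(1))
)
print(env_record)

# sessionInfo() gives the fuller picture (platform, locale, attached packages).
info <- sessionInfo()
print(info$R.version$version.string)
```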

Three months later, several things have changed, and I could not reproduce the book, but that did not surprise me. I’ll explain the details later. Here I have extracted the first three chapters, and released the corresponding source files in the knitr-book repository on Github. You can also find the link to download the PDF there. This repository may be useful to those who plan to write a book using R.

What I could not reproduce were not really important. The major change in the recent knitr versions was the syntax highlighting commands, e.g. \hlcomment{} is \hlcom{} now, and the syntax highlighting has been improved by the highr package (sorry, Romain). This change brought a fair amount of changes when I look at git diff, but these are only cosmetic changes.

I tried my best to avoid writing anything that is likely to change in the future into the book, but as a naive programmer, I have to say sorry that I have broken two little features, although they may not really affect you:

  • the preferred way to stop knitr in case of errors is to set the chunk option error = FALSE instead of the package option stop_on_error, which has been deprecated (Section 6.2.4);
  • for external code chunks (Section 9.2), the preferred chunk delimiter is ## ---- instead of ## @knitr now;

Actually the backward-compatibility is still there, so they will not really break until a long time later.
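To make the second item concrete, here is what a script using the ## ---- delimiter might look like, together with a simplified base R sketch of how it can be split into labeled chunks. The script content and the regular expression are illustrative assumptions; knitr's own parsing accepts more variants than this.

```r
# A hypothetical external script, one element per line.
script <- c(
  "## ---- load-data",
  "x <- 1:10",
  "## ---- summarize",
  "m <- mean(x)"
)

# Lines that start a new chunk, and the labels that follow the delimiter.
is_delim <- grepl("^## ---- ", script)
labels <- sub("^## ---- ", "", script[is_delim])

# Group the remaining lines by the delimiter that precedes them.
chunks <- split(script[!is_delim], cumsum(is_delim)[!is_delim])
names(chunks) <- labels
print(chunks)
```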

With exactly the same software environment, I think I can reproduce the book, but that does not make much sense. Things are always evolving. Then there are two types of reproducible research:

  1. the “dead” reproducible research (reproduce in a very specific environment);
  2. the reproducible research that evolves and generalizes;

I think the latter is more valuable. Being reproducible alone is not the goal, because you may be reproducing either real findings or simply old mistakes. As Roger Peng wrote,

[…] reproducibility cannot really address the validity of a scientific claim as well as replication

Roger’s recent three blog posts on reproducible research are well worth reading. This blog post of mine is actually not quite relevant (no data analysis here), so I recommend that my readers move over there after they have checked out the knitr-book repository.

My first Bioconductor conference (2013) /en/2013/07/bioconductor-2013/ 2013-07-21T00:00:00+00:00 Yihui Xie /en/2013/07/bioconductor-2013 The BioC 2013 conference was held from July 17 to 19. I attended this conference for the first time, mainly because I’m working at the Fred Hutchinson Cancer Research Center this summer, and the conference venue was just downstairs! No flights, no hotels, no transportation, yeah.

Last time I wrote about my first ENAR experience, and let me tell you why the BioC conference organizers are smart in my eyes.

A badge that never flips

I do not need to explain this simple design – it just will not flip to the damn blank side:

The conference program book

The program book was only four pages of the schedule (titles and speakers). The abstracts are online. Trees saved.

Lightning talks

There were plenty of lightning talks. You can talk whatever you want.

Live coding

On the developer’s day, Martin Morgan presented some buggy R code to the audience (provided by Laurent Gatto), and asked us to debug it right there. Wow!

Everything is free after registration

The registration includes almost everything: lunch, beer, wine, coffee, fruits, snacks, and most importantly, Amazon Machine Images (AMI)!


This is a really shiny point of BioC! If you have ever tried to do a software tutorial, you probably know the pain of setting up the environment for your audience, because they use different operating systems, different versions of packages, and who knows what is going to happen after you are on your third slide. At a workshop last year, I had the experience of spending five minutes figuring out why a keyboard shortcut did not work for one Canadian lady in the audience, and it turned out she was using the French keyboard layout.

The BioC organizers solved this problem beautifully by installing the RStudio server on AMI. Every participant was sent a link to the Amazon virtual machine, and all they needed was a web browser and a wireless connection in the room. Everyone ran R in exactly the same environment.

Isn’t that smart?


I do not really know much about biology, although a few biological terms have been added to my vocabulary this summer. When a talk becomes biologically oriented, I have to give up.

Simon Urbanek talked about big data in R this year, which is unusual, as he mentioned himself. Normally he shows fancy graphics (e.g. iplots). I did not realize the significance of this R 3.0.0 news item until his talk:

It is now possible to write custom connection implementations outside core R using R_ext/Connections.h. Please note that the implementation of connections is still considered internal and may change in the future (see the above file for details).

Given this new feature, he implemented the HDFS connections and 0MQ-based connections in R single-handedly (well, that is always his style).

You probably have noticed the previous links are Github repositories. Yes! Some R core members really appreciate the value of social coding now! I’m sure Simon does. I’m aware of other R core members using Github quietly (DB, SF, MM, PM, DS, DTL, DM), but I do not really know their attitude toward it.

Joe Cheng’s Shiny talk was shiny as usual. Each time I attend one of his talks, he shows a brand new amazing demo. Joe is the only R programmer who makes me feel “the sky is the limit (of R)”. The audience were shocked when they saw a heatmap that they were so familiar with suddenly become interactive in a Shiny app! BTW, Joe has a special sense of humor when he talks about an area in which he is not an expert (statistics or biology).

RStudio 0.98 is going to be awesome. I’m not going to provide the links here, since it is not released yet. I’m sure you will find the preview version if you really want it.

Bragging rights

  • I met Robert Gentleman for the first time!
  • I dared to fall asleep during Martin Morgan’s tutorial! (sorry, Martin)
  • some Bioconductor web pages were built with knitr/R Markdown!

Next steps

Given Bioconductor’s open-mindedness to new technologies (GIT, Github, AMI, Shiny, …), let’s see if it is going to take over the world. Just kidding. But not completely kidding. I will keep the conversation going before I leave Seattle around mid-August, and hopefully get something done.

If you have any feature requests or suggestions for Bioconductor, I will be happy to serve as the “conductor” temporarily. I guess they should set up a blog at some point.

R Package Versioning /en/2013/06/r-package-versioning/ 2013-06-27T00:00:00+00:00 Yihui Xie /en/2013/06/r-package-versioning

This should be what it feels like to bump the major version of your software:

bump the major version

For me, the main reason for package versioning is to indicate the (slight or significant) differences among different versions of the same package; otherwise we could just keep on releasing version 1.0.

That seems to be a very obvious fact, so here are my own versioning rules, with some ideas borrowed from Semantic Versioning:

  1. a version number is of the form major.minor.patch (x.y.z), e.g., 0.1.7
  2. only the version x.y is released to CRAN
  3. x.y.z is always the development version, and each time a new feature or a bug fix or a change is introduced, bump the patch version, e.g., from 0.1.3 to 0.1.4
  4. when one feels it is time to release to CRAN, bump the minor version, e.g., from 0.1 to 0.2
  5. when a change is crazy enough that many users are presumably going to yell at you (see the illustration above), it is time to bump the major version, e.g., from 0.18 to 1.0
  6. the version 1.0 does not imply maturity; it is just because it is potentially very different from 0.x (such as API changes); same thing applies to 2.0 vs 1.0
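
As a rough sketch of rules 3 to 5, the bumps could be scripted as below. The function name bump_version and its patch/minor/major keywords are made up for illustration, and the script assumes that the first development version after releasing x.y is x.y.1, which the rules above do not spell out.

```shell
#!/bin/sh
# bump_version VERSION KIND: print the next version number.
# Hypothetical helper; KIND is one of patch, minor, major.
bump_version() {
  bv_version=$1
  bv_kind=$2
  bv_major=${bv_version%%.*}        # text before the first dot
  bv_rest=${bv_version#*.}          # text after the first dot
  bv_minor=${bv_rest%%.*}
  case $bv_rest in
    *.*) bv_patch=${bv_rest#*.} ;;  # x.y.z: a development version
    *)   bv_patch=0 ;;              # x.y: a release (assumption: patch starts at 0)
  esac
  case $bv_kind in
    patch) echo "$bv_major.$bv_minor.$((bv_patch + 1))" ;;  # rule 3
    minor) echo "$bv_major.$((bv_minor + 1))" ;;            # rule 4
    major) echo "$((bv_major + 1)).0" ;;                    # rule 5
    *) echo "unknown bump kind: $bv_kind" >&2; return 1 ;;
  esac
}

bump_version 0.1.3 patch   # prints 0.1.4
bump_version 0.1.4 minor   # prints 0.2
bump_version 0.18 major    # prints 1.0
```

Note that the minor and major bumps drop the patch number, in line with rule 2 (only x.y goes to CRAN).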

I learned rule #3 from Michael Lawrence (author of RGtk2) and I think it is a good idea. In particular, it is important for brave users who dare to install the development versions. When you ask them for their sessionInfo(), you will be aware of which stage they are at.

Rule #2 saves us a little bit of energy in the sense that we do not need to write or talk about the foo package 1.3.548, which is boring to type or say. Normally we say foo 1.3. As a person whose first language is not English, speaking the patch version does consume my brain memory and slows down my thinking while I’m talking. When I say it in Chinese, it sounds boring and unnecessarily geeky. Yes, I know I always have weird opinions.

You Do Not Need to Tell Me I Have A Typo in My Documentation /en/2013/06/fix-typo-in-documentation/ 2013-06-10T00:00:00+00:00 Yihui Xie /en/2013/06/fix-typo-in-documentation

help me with Github pull requests

So I just got yet another comment saying “you have a typo in your documentation”. While I do appreciate these kind reminders, I think fixing such typos might be a good exercise for those who want to try GIT and Github pull requests, which make it possible for you to contribute to open source and fix obvious problems with no questions being asked – just do it yourself, and send the changes to the original author(s) through Github.

The official documentation for Github pull requests is a little bit verbose for beginners. Basically, what you need to do for simple tasks is:

  1. click the Fork button and clone the repository in your own account;
  2. make the changes in your cloned version;
  3. push to your repository;
  4. click the Pull Request button to send a request to the original author;

For trivial changes, sometimes I accept them on my cell phone while I’m still in bed. No extra communication is needed.

Occasionally I see reports of these kinds of trivial documentation changes on the R-devel mailing list, and I believe that is just horribly inefficient. You could have done this quietly and quickly, and the developers could have merged the changes with a single mouse click. (Oh, okay, well, you know, SVN, mailing lists, …)

The knitr repository has two branches: master and gh-pages. The R package lives in the master branch, and the knitr website lives in the gh-pages branch. If you want to fix any problems in the website, just check out the gh-pages branch:

git checkout gh-pages

All pages were written in Markdown, so edit them with your favorite text editor. For example, as the above comment pointed out, I omitted a right parenthesis ) in _posts/2012-02-24-sweave.md, and you just add it, save the file, write a GIT commit message, push to your repository and send the pull request.
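
Once you have clicked the Fork button, the remaining command-line steps could look roughly like the sketch below. The helper name fix_typo, the YOUR_USER placeholder, and the use of sed in place of your favorite text editor are all assumptions made for illustration; sending the pull request itself still happens in the browser.

```shell
#!/bin/sh
# fix_typo FORK_URL FILE SED_EXPR MESSAGE
# Hypothetical helper: clone your fork into a temporary directory, edit FILE
# on the gh-pages branch with a sed expression (standing in for a text
# editor), commit the change, and push it back to the fork.
fix_typo() {
  ft_dir=$(mktemp -d) || return 1
  git clone "$1" "$ft_dir/repo" || return 1
  (
    cd "$ft_dir/repo" || exit 1
    git checkout gh-pages || exit 1
    # Apply the edit via a temporary file (portable across sed versions)
    sed "$3" "$2" > "$2.tmp" || exit 1
    mv "$2.tmp" "$2" || exit 1
    git add "$2" &&
      git commit -m "$4" &&
      git push origin gh-pages
  )
}

# Example call (YOUR_USER and the sed expression are placeholders; this
# needs network access and an existing fork):
# fix_typo https://github.com/YOUR_USER/knitr.git \
#   _posts/2012-02-24-sweave.md 's/OLD TEXT/OLD TEXT)/' \
#   'add a missing right parenthesis'
```

The subshell keeps the cd from leaking into your current session, and the function returns a non-zero status if any step fails (for example, when the sed expression changes nothing and there is nothing to commit).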

I know I can do this by myself in five seconds, and it takes me way more time to write this blog post, but I just want everybody to know how people with different skill levels can play their roles in software development.

Let’s see how many minutes it takes for the pull request to come after I publish this blog post. Hurry!! :)

A Few Tips for Writing an R Book /en/2013/06/tips-for-writing-an-r-book/ 2013-06-03T00:00:00+00:00 Yihui Xie /en/2013/06/tips-for-writing-an-r-book

I just finished fixing (hopefully all) the problems in the knitr book returned from the copy editor. David Smith kindly announced this book before I did. I do not have much to say about this book: almost everything in the book can be found in the online documentation, questions & answers and the source code. The point of buying this book is perhaps that you do not have time to read through all two thousand questions and answers online, and I did that for you.

the knitr book

This is my first book, and obviously there has been a lot for me to learn about writing a book. In retrospect, I want to share a few tips that I found useful (in particular, for those who plan to write for Chapman & Hall):

  1. although it sounds like shameless self-promotion, using knitr made it a lot easier to manage R code and its output for the book; for example, I could quickly adapt to R 3.0.1 from 2.15.3 after I came back from a vacation; if I were to write a second edition, I do not think I will have big trouble with my R code in the book (it is easy to make sure the output is up-to-date);
  2. I put my source documents under version control, which helped me watch the changes in the output closely; for example, I noticed the source code of the function fivenum() in base R was changed from R 2.15.3 to 3.0.0 thanks to GIT (R core have been updating base R everywhere!);
  3. (opinionated) some people might be very bored to hear this: use LyX instead of plain LaTeX… because you are writing, not coding; LaTeX code is not fun to read…
  4. for the LaTeX document class krantz.cls (by Chapman & Hall):
    • to solve the only stupid problem in LaTeX (i.e., floating environments float to silly places by default), use something like this:

        \renewcommand{\floatpagefraction}{0.75}

      I’m aware of the float package and the H option, and options like !tbp; I just do not want to force LaTeX to do anything – it may or may not be happy at some point.
    • put \usepackage{emptypage} in the preamble to make empty pages really empty, as required by the copy editor.
    • the document class krantz.cls does not work with the hyperref package, meaning that you cannot create bookmarks in the PDF; I have posted the solution here.
  5. for authors whose native language is not English like me, here is a summary of my problems in English:
    • when you want to use which, use that instead, unless there is a comma before it, or you really want to emphasize a very specific object; e.g.,

      “here is a package that is helpful” (correct)

      “here is a package which is helpful” (wrong)

      “we will introduce an extremely important technology next, which has revolutionized the life of poor statisticians”

    • it is “A, B, and C” instead of “A, B and C”
    • do not forget the comma in other places, either: “e.g.,”, “i.e.,”, “foo and bar, respectively”; actually, try to use the comma whenever possible to break long sentences into shorter pieces
  6. for the plots, use the cairo_pdf() device when possible; in knitr, this means the chunk option dev = 'cairo_pdf'; the reason for the choice of cairo_pdf() over the normal pdf() device is that it can embed fonts in the PDF plot files, otherwise the copy editor will require you to embed all the fonts in the final PDF file of the book; normally pdflatex will embed fonts, and if there are fonts that are not embedded, it is very likely that they are from R graphics;
  7. include as many figures as possible (I have 51 figures in this 200-page book), because this will make the number of pages grow faster (I’m evil) so that you will not feel frustrated, and the readers will not fall into the hell of endless text, just page after page;
  8. prepare an extra monitor for copyediting;
  9. learn a little bit about pdftk, because you may need it in the end, e.g., to replace one page with a blank page in the frontmatter;
  10. learn these copy editing symbols (thanks, Matt Shotwell);

One thing I did not really understand was why punctuation marks like commas and periods should go inside quotation marks, e.g.,

I have “foo” and “bar.”

This makes me feel weird. I’m more comfortable with

I have “foo” and “bar”.

There was also one thing that I did not catch with version control – one figure file went wrong and I did not realize it, because normally I do not put binary files under version control. Fortunately, I caught it with my own eyes. Karl Broman mentioned the same problem to me a while ago. I know there are tools for comparing images (ImageMagick, for example), and I was just too lazy to learn them.

I will be glad to hear about the experiences of other authors, and will try to update this post according to the comments.