Yihui Xie

Why Are Some Good Old Ideas Buried in the History of Statistical Graphics?

Yihui Xie / 2017-12-07

The grand master of base R, Karl Broman, recently came up with an ingenious idea for showing missing values in scatterplots. Someone replied on Twitter that he had never seen it before. My PhD adviser Di replied “never say never”, and pointed out that this idea had existed and been implemented for decades (e.g., in GGobi and MANET). It is a pity that these great software packages didn’t get the attention they deserve, and I started to wonder why. Here are my thoughts after ten minutes of thinking:

  1. Although statistical graphics and data visualization are getting more statisticians’ attention (perhaps due to the wave of so-called “data science”), they are still far from being able to beat traditional course topics and research directions in statistics, such as measure theory and probability theory. Perhaps “beat” is the wrong word; I mean they should at least be treated equally. Without a proper status in statistics, research on statistical graphics will be limited by a lack of (human and financial) resources, and the number of people willing to dig into the history of statistical graphics will be small. Then we have to wait for people like Karl to reinvent the history from time to time.

  2. The bus factor. I haven’t seen many graphics software packages that really blow my mind, but I have to say that GGobi is unique, creative, and impressive (of course, I say this not because the authors include my PhD advisers; I have already graduated). Decades ago it did things that other software packages still have not caught up with, e.g., the various types of tours, and linked brushing. It is a lot of fun to play with.

    When I went to Iowa State, I was expected to continue the work under a new framework (Qt), which was extremely promising. The new R package is called cranvas. I was very excited about it. Initially I spent month after month on interactive parallel coordinates plots, and later moved on to other types of plots, such as histograms, scatterplots, and maps.

    Then in late 2011, unfortunately, I finally ran out of patience with Sweave, and I have pretty much stopped working on interactive graphics since then. My focus accidentally shifted to knitr and the reproducible research world. While I’m still grateful that my advisers didn’t pull me back to work on statistical graphics, I think the cranvas package has lost its momentum since then. This is the bus factor problem: when a core developer quits, a software package dies. I think it was partly my fault. I didn’t have much experience with designing a large package like cranvas, and the source code I wrote was a total mess, which made it difficult for other students to pick up the work. If I were to go back and do a PhD with Di and Heike again with all the lessons learned over these years, cranvas would probably attract more contributors, survive longer, and become more popular. I still think being able to draw and brush a million graphical elements on the fly is amazing (which is what you can do with cranvas).

    GGobi and MANET have the same problem. As the original authors get busy and/or retire, there is not enough fresh blood carrying on the development: software is closely tied to individual developers.

  3. Too late to join the modern ways of “marketing”. If you are going to promote a software package today, the first thing you do is probably not to write a journal paper and wait a year or two for it to be accepted and distributed worldwide. You may first put it on GitHub (instead of a personal SVN repo that nobody knows about), build a website for it (with demos!), and advertise it on social media.

    I guess MANET might only exist on 3.5-inch floppy disks now (if it still exists at all). [1] What about GGobi? It was eventually put on GitHub, but… only 12 stars and 3 forks at this moment. That is not a good sign. The core developers are either busy or low-key (e.g., who is Michael Lawrence?). While I generally like low-key people, I believe marketing is necessary (as long as it is not abused).

I don’t have more time for this topic, but I think an interactive statistical graphics system that uses R as the back-end and a web browser (JavaScript-based) as the front-end is a promising way forward. It could be based on Shiny, or use an infrastructure similar to Shiny’s. Someday I’ll come back to this area again, to put some old wine in shiny (or Shiny) new bottles.

  [1] This is not true, but the point is that it is difficult to find.