Statistics

Numbers that can either reveal truth or tell lies

Apr 032010

We know the real distribution of the F statistic in linear models — it is a non-central F distribution. Under H0, we have a central F distribution. Given 1 – α, we can compute the probability of (correctly) rejecting H0. I created a simple demo to illustrate how the power changes as other parameters vary, e.g. the degrees of freedoms, the non-central parameter and alpha. Here is the video:

The Power of F Test

And for those who might be interested, here is the code (you need to install the gWidgets package first and I recommend the RGtk2 interface). Have fun:

Mar 242010
Amber Watkins gave me a suggestion on the animation for the ratio estimation, and I think this is a good topic for my animation package. I’ve finished writing the initial version of the function sample.ratio() for this package, which will appear in the version 1.1-2 a couple of days later.

As we know, the benefit of ratio estimation is that sampling skewness may be adjusted for, because the estimation of \bar{Y} will make use of the information in the relationship of X and Y: \bar{X} \cdot (\bar{y}/\bar{x}). Here is a demo (we can see the ratio estimate, denoted by the red line, generally performs better than \bar{y}):

An animation demo for the ratio estimation

An animation demo for the ratio estimation

Mar 232010

Here are some (trivial) R tips in the course Stat 511. I’ll update this post till the semester is over.

  1. Formatting R Code

  2. I’ve submitted an R package named formatR to CRAN yesterday. This package should be easier than the code below, because there is a GUI to tidy your R code. Install with install.packages('formatR').

    Reading code is pain, but the well-formatted code might alleviate the pain a little bit. The function tidy.source() in the animation package can help us format our R code automatically. By default it will read your code in the clipboard, parse it and return the well-formatted code. You have options to keep or remove the comments/blank lines and set the width of the code, etc. Spaces and indent will be added automatically. This can save us time typing spaces and paying attention to indent.

    ## install.packages('animation') if it is not installed yet
    library(animation)
    ## copy some R code somewhere and type:
    tidy.source()
    ## or specify the path of your code file
    tidy.source(file.path(system.file(package = "graphics"), "demo", "image.R"))
    ## can also use a URL
    tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R')
    ## remove blank lines
    tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R',
               keep.blank.line = FALSE)
    ## remove comments
    tidy.source('http://www.public.iastate.edu/~dnett/S511/twofactor.R',
               keep.comment = FALSE)
    
Aug 292009
Coin-flipping is a rather old topic in probability theory, so most of us think we know very well about it, however, the other day I saw a question about this old topic (in David Smith’s REvolution?) which was beyond me expectation: how many times do we need to flip the coin until we get a sequence of HTH and HTT respectively? (For example, for the sequence HHTH, the number for HTH to appear is 4, and in THTHTT, the number for HTT is 6.)

It seems that the two results are equivalent, as H and T occurs with equal probability 0.5, so we naturally believe the average numbers of steps to HTH and HTT are the same, but the fact is not as we imagined.

## smart guys use math formulae to solve the problem,
## but *lazy* guys like me use simulations with R
coin.seq = function(v) {
    x = NULL
    n = 0
    while (!identical(x, v)) {
        x = append(x[length(x) - 1:0], rbinom(1, 1, 0.5))
        n = n + 1
    }
    n
}
set.seed(919)
mean(htt <- replicate(1e+05, coin.seq(c(1, 0, 0))))
# [1] 8.00304
mean(hth <- replicate(1e+05, coin.seq(c(1, 0, 1))))
# [1] 10.0062

png("coin-htt-hth.png", height = 150, width = 500)
par(mar = c(3, 2.5, 0.1, 0), mgp = c(2, 0.8, 0))
boxplot(list(HTT = htt, HTH = hth), horizontal = T,
    xlab = "n", ylim = range(boxplot(list(HTT = htt, HTH = hth),
        plot = FALSE)$stats))
points(c(mean(htt), mean(hth)), 1:2, pch = 19)
dev.off()

The answer is counterintuitive, isn’t it?

Number of times needed to get HTT and HTH

Number of times needed to get HTT and HTH (bold segments are for median; dots denote mean)

Well, mathematicians certainly do not like my solution (I guess they even hate such an imprecise approach). I hope some smart guys can give me some hints on working out the probability distribution and hence the expectation.

Sep 182008

Is it clear enough to observe the interaction between z and x from these bubble plots?

No interaction

If there is no interaction between x and z, the size of bubbles will increase at the same rate while either x or z is fixed.

Positive Interaction

But when interaction exists, the rate of increasing will be different.

Jun 152008

There was something wrong with the wheel of my mouse these days: it could not control the scrolling of a page very well. Today I found a screwdriver and opened the mouse. Then I knew the reason: there was a lot of dirt beneath the cover!

However, obviously the dirt would not follow a B(n, p = 0.5) distribution. The two proportions of the dirt in front and back of the wheel were not equal — there was much more dirt in the back. Why?

The Glivenko-Cantelli Theorem tells us that the empirical distribution converges to the true distribution almost surely.

So what? The proportions of the dirt has revealed my habit of using the mouse, as I have used it for too many times. What I usually do is to scroll down a page, and rarely will I scroll up — that is the real distribution, and the dirt in the mouse has recorded the empirical distribution day by day. If I am serious enough, I can get the numbers of the weight and estimate the parameter p in the Binomial distribution. (Perhaps Michal will again suggest me do this. :grin: )

May 132008

I just found this page for the book "Handbook of Computational Statistics" by chance:

http://math.u-bourgogne.fr/monge/bibliotheque/ebooks/csa/start.html

$300 is too expensive for me… 

Jan 112008

Personally I don’t believe structural equation models (SEM) at all. I have been fighting against this modeling technique for quite a long time, but it seems that SEM is becoming more and more popular outside the discipline of statistics. It’s indeed strange. Yesterday I received an email asking me about the SEM, and I could not help being angry at this model to some extent. As a result, I listed some reasons for my objection to such a kind of modeling in my reply as follows:

Hi,

The most obvious disadvantage, I believe, just lies in the critical modeling technique it adopted at the beginning: constructing models to fit the sample COVARIANCE instead of the sample values themselves. I insist that such an idea has almost destroyed our valuable data. You know, the covariance is just one of the TOO MANY characteristics of the data, therefore if we only focus on this simple information (covariance), other information will be dropped (e.g. mean, kurtosis, skewness, …). And a very natural question is, what does the covariance stand for? Can it represent all of the original information? The answer is certainly NO!

The second disadvantage in my eyes is the complexity of this model. In statistics, rarely have I ever met models more complex than SEMs. Even the simplest SEMs will include tens of parameters, as there are several parameter matrices in a SEM. The direct consequence it brings is the computational complexity. You can easily calculate the minimum of f(x)=x^2+1, but do you know how to calculate the minimum of f(a,b,c,d,e,f,g,…)=a*b^4/(2*c+d^14*a)-f/g*c^d+…? Actually the target function for SEM to optimize is much more complex than this one! It involves with the multiplication, inverse and determinant of huge matrices… Just tell me, can you trust the software to correctly find the GLOBAL optimum for such a complex function? Personally I cannot, as I know there are too many “stories” behind this problem. It’s very likely that your software only tells you a LOCAL optimum WITHOUT warnings.

The third reason comes from a philosophy: you may regard SEM simply as a process of hypothesis testing. I think you surely know the null hypothesis. In the end, you can ONLY REJECT your model but you can NEVER accept it (or say, ah, my model is correct!). In other words, you can never find the truth, although there are many measures (Chi-square, GFI, …) to tell you how “good” you models are. The basic philosophy of hypothesis testing in statistics is that null hypothesis can only be rejected (because in most cases you can only know the risk of rejecting; you cannot know the risk of accepting it). If we are unable to accept a SEM, why bother to construct it? (If you declare this SEM is correct, other people can naturally doubt whether there are other alternative models.)

I spent about only half a year on SEM, so my words might be incorrect. My advice is never believe anyone (including me) unless you really understand it, especially the mathematical and statistical theories! I really hate those textbooks avoiding maths, because to some degree, they are just cheating the readers by hiding the most important part of a technique.

There are still a couple of reasons but I don’t have enough time to list all of them out here.

You may judge from my above words that I have been fighting against this model for quite a long time. Once a statistician said, “All models are wrong, but some models are useful”, and I’d like to say that SEM is surely not among these “some models”. I’d be very glad to hear from you about the “successful” applications of SEM if you have any cases in this aspect. I’ve inquired such cases of many people who wrote me emails asking about the SEM but I have not got any till today, which ensured me even more of my impression that SEM is a useless technique.

Regards,
Yihui

Jan 092008

Of course it is NOT the obligation for statistics (or statisticians) to draw conclusions.

Statistics helps us to understand facts (approximately, in most cases), and the result might be "insignificant" or "unstable", etc. I believe most people hate such results, as they cannot draw conclusions as usual.

A stranger sent me an email asking about an unstable result from his/her K-Means cluster analysis, and this guy wanted to know a "good" method to choose the initial centers so that the cluster membership would be stable, besides, he/she also wondered which result was "correct", as there were many results according to different initial centers.

My answer to such a kind of questions is just the title for this blog. No matter how eagerly you want a "beautiful" result, please make sure you understand statistics first. "Insignificant" or "unstable" results are ALSO results, although they look different from what you read in most papers or textbooks. "Unstable" cluster membership just indicated that individuals in your sample were actually NOT clustered from the viewpoint of K-Means cluster algorithm. Who can deny this is NOT also a conclusion?

Dec 282007

There are some scholars making research on “indices” this year, and a few of them are my teachers. The day before yesterday I saw there was a press conference about a development index of China, which triggered one of my old ideas on the function of statistics (analysis). I believe two conditions will make the work of statistics meaningless:

  1. If we do nothing but just rank our samples; i.e. which is the first, and which is the last…
  2. If our conclusions are “axioms“; i.e. anybody can know our conclusions without statistical analysis.

Unfortunately, what I read from the news seemed to satisfy these two conditions.

WWW.YIHUI.NAME XIE@YIHUI.NAME © 2007 - 2010 by Yihui Xie