The other day I sent a small assignment to a group of people in order that they could play with statistics and become more interested with this subject. The data-generating process was quite simple: first I generated 20000 random numbers (10000 rows, 2 columns) from
N(0, 1) and then add 10000 rows of numbers which lie exactly on a circle; at last I provided this data in a randomized order so people cannot easily discover the pattern just from the numbers.
The question is, how to reveal the particular pattern in this “pile of sand”? Let’s look at the original plot:
What can we observe from this scatter plot? Perhaps nothing but “a pile of sand”. However, if we choose alternative ways to create the plot again, things will be completely different. Here are my approaches:
1. Use Semi-transparent Colors
Actually there are 10000 points lying on the circle, so the critical problem is overlapping. In order to show the degree of overlapping, we can use semi-transparent colors, because the color will be more opaque if there are many points at the same place.
2. Set Axes Limits
If we look closer into the plot, the scene will also be different. For example, we only plot the data in the range
3. Plot with Smaller Point Symbols
Certainly, small symbols can prevent overlapping effectively in this case.
4. Draw A Subset of the Data
As the problem is that there are too many data points, why not draw a subset and try a scatter plot first? For example, here we have sampled 1000 rows of data and the plot is like this:
5. Estimate the 2D Density
The R package
KernSmooth has provided functions to estimate the 1D or 2D density.We can further examine the shape of this 2D density using the package
rgl. Here is an animation recorded to illustrate the 2D density.
The R code for the above plots & animation is as follows: