While this blog is principally about Image Analysis (turning images into numbers), Data Analysis (turning numbers into something meaningful) is also really important.
In this post I’m going to explain how to display your data in a beeswarm plot and why you might want to do this. Simple statistics are great but show me the data!
Trying a Sample (Mean)
It shouldn’t need saying but for completeness, let’s start with the basics:
I think I’m right in saying that there exists a ground truth for everything (although Werner Heisenberg would disagree, but we’re going to ignore quantum mechanics for the time being). For example, there is a value for the mean length of every dolphin on this planet at any given time. We can never really know that number; all we can do (should we want to) is attempt to estimate it. This is where sampling comes in. If we measure a sub-population of dolphins, we can make a few assumptions and extrapolate to every dolphin on the planet.
The problem is that the mean alone is not that helpful in describing a population. Talk amongst yourselves as I find some reference dolphins to sample.
Above you can see that in two sampling trips, I found that the mean dolphin length both times was 2.7m. If you look at the individual numbers you can see that the range of values is quite different:
In this case, we can describe the difference using the standard deviation or the variance (both of which come out to about 0.2 versus 1 for samples one and two respectively). But what if those values are also similar?
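To make that concrete, here is a minimal Python sketch using only the standard library. The lengths are hypothetical numbers invented for illustration (they are not the measurements from the sampling trips), chosen so that both samples share a mean of 2.7 m but differ in spread. Note that the variance is the square of the standard deviation, so the two only coincide numerically when the standard deviation is 1.

```python
import statistics

# Hypothetical dolphin lengths in metres, invented for illustration only.
sample_one = [2.5, 2.6, 2.7, 2.8, 2.9]
sample_two = [1.5, 2.1, 2.7, 3.3, 3.9]

for name, sample in (("Sample 1", sample_one), ("Sample 2", sample_two)):
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)       # sample standard deviation (n - 1)
    var = statistics.variance(sample)   # sample variance, i.e. sd squared
    print(f"{name}: mean={mean:.2f} m, sd={sd:.2f} m, variance={var:.2f} m^2")
```

The means match, while the standard deviation of the second sample is several times larger than that of the first.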
Enter the Anscombe Quartet
Simple statistics are great for describing lots of datasets but there will always be cases where they are insufficient. One of the best examples is the Anscombe Quartet:
All four datasets are identical (to at least two decimal places) when compared using simple statistics (mean of X, mean of Y, sample variance, correlation and linear regression). The differences are only really apparent when you graph them (as above).
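If you want to check this yourself, the sketch below computes those summary statistics for the quartet in plain Python. The values are the ones published by Anscombe (1973); the helper names (`pearson_r`, `fit_line`, `summarise`) are my own and purely illustrative.

```python
import statistics

# Anscombe's quartet. Sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def summarise(xs, ys):
    slope, intercept = fit_line(xs, ys)
    return {
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "var_x": statistics.variance(xs),
        "var_y": statistics.variance(ys),
        "r": pearson_r(xs, ys),
        "slope": slope,
        "intercept": intercept,
    }

summaries = {name: summarise(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The four printed rows agree to within rounding, which is exactly why the numbers alone cannot tell the datasets apart.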
Show me the data
I’ve heard over and over again that you should plot the individual data points if you have fewer than 6 but personally, I would set that number much (much!) higher. It’s a bit clumsy to try and plot single data points in Excel or even MATLAB. In these cases we can resort to a little bit of know-how and create our own beeswarm plot:
The thing that makes this non-obvious is getting the points aligned on the x-axis. We can deal with that really easily in Excel using the RAND() function:
For Sample 1, every cell in the first column has the same formula. RAND() returns a number between 0 and 1 and we add 1 to that (that is, =RAND()+1) to give us numbers between 1 and 2. For Sample 2 we have done the same thing but added 3 to RAND() to give numbers between 3 and 4. This leaves the whole of the ‘2’ range empty so that there’s some separation between the points. If you need to plot more datasets, just make an X column adding 5, 7, 9 etc. There’s your simple beeswarm!
Admittedly, with this few points, it really doesn’t come into its own, but when you have several hundred you can really start to appreciate the distribution of your data.
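The same trick carries over to code. Below is a rough Python equivalent of the spreadsheet columns, assuming a hypothetical helper called `jitter_x` and made-up measurements; `random.random()` plays the role of RAND(), returning a value from 0 up to (but not including) 1.

```python
import random

def jitter_x(values, offset, rng):
    """Pair each value with x = offset + uniform[0, 1), like RAND() + offset."""
    return [(offset + rng.random(), value) for value in values]

# Hypothetical measurements, invented for illustration only.
sample_one = [2.5, 2.6, 2.7, 2.8, 2.9]
sample_two = [1.5, 2.1, 2.7, 3.3, 3.9]

rng = random.Random(0)  # fixed seed so the layout is repeatable

# Odd offsets (1, 3, 5, ...) leave an empty unit-wide gap between groups.
points_one = jitter_x(sample_one, 1, rng)
points_two = jitter_x(sample_two, 3, rng)

for x, y in points_one + points_two:
    print(f"{x:.3f}\t{y}")
```

The resulting (x, y) pairs can be handed to whatever scatter plot routine you prefer.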
That’s not normal
One final tip: we’ve used RAND(), which provides a uniformly distributed random number (do you remember this from before?). If you want a more traditional beeswarm, use a normally distributed random number, which biases values towards the mean value:
Excel doesn’t have a single built-in function for this so (as before) we can use the Box-Muller method to turn uniform random numbers into normally distributed ones. In this case my equation for x-values was:
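As a hedged illustration of the same idea in Python (this is the textbook Box-Muller transform, not the spreadsheet formula itself), the sketch below draws normally distributed x-positions around the centre of a column. The helper names and the `centre` and `spread` values are assumptions of mine, picked so that most points stay within the 1 to 2 band used earlier.

```python
import math
import random
import statistics

def box_muller(rng):
    """One standard-normal draw from two uniform draws (Box-Muller)."""
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def normal_jitter_x(values, centre, spread, rng):
    """Pair each value with an x drawn from Normal(centre, spread)."""
    return [(centre + spread * box_muller(rng), value) for value in values]

rng = random.Random(42)   # fixed seed so the layout is repeatable
values = [2.7] * 10000    # placeholder y-values, invented for illustration

# centre=1.5 and spread=0.15 are assumed values, not taken from the post.
points = normal_jitter_x(values, centre=1.5, spread=0.15, rng=rng)
xs = [x for x, _ in points]

mean_x = statistics.mean(xs)
sd_x = statistics.pstdev(xs)
print(f"mean x = {mean_x:.3f}, sd x = {sd_x:.3f}")
```

Unlike the uniform version, a normal draw has no hard limits, so an occasional point can land outside the band; a smaller `spread` makes that rarer.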