Tag Archives: Statistics

My 10 Commandments of a Data Culture

Thou shalt have no data but ethical data.

Thou shalt protect the identity of thy subjects with all thy heart, soul, mind and body.

Thou shalt back-up.

Thou shalt honour thy data and tell its story, not thy own.

Thou shalt always visualise thy data before testing.

Thou shalt share thy results even if negative.

Thou shalt not torture thy data (but thou may interrogate it).

Thou shalt not bow down to P<0.05 nor claim significance unless it is clinically so.

Thou shalt not present skewed data as mean±SD.

Thou shalt not covet thy neighbour’s P value.

The legend of Chris Martin: Part II

Chris Martin was Not out a remarkable 50% of the time. That is, 52 times out of 104 innings. Is this a record? I didn't know the answer, so I sent off an email to the gurus at Cricinfo to see if they did. Michael Jones replied that "Yes it is" for batsmen with over 100 innings (see here)! Well done Chris! This raises the possibility of working out whether it was better for an incoming batsman to swing and hope to score a few runs before Chris was out, or whether they should just play normally. For this we must first consider what to do with the innings in which both Chris and the other batsman were Not out. In such circumstances the choice is either to include those innings on both sides or to exclude them. I've chosen to exclude them as I think this leaves the least room for bias.

Now let us apply my Rule #1 and visualise the data (see previous Chris Martin post).

Chris Martin’s Partnerships:
Data source: Cricinfo

Plot A is a histogram in which I have grouped each of the two sets of data (the partnership scores when Chris was Out and the scores when he was Not out) into bins. Each bin is 5 runs wide except for the first: the first bin runs from 0 to 2.5 (really to 2), the second from 2.5 to 7.5, and so on. What can be seen from this is that there appear to be more very low partnerships when Chris was Out than when the other batsman was Out. However, don't be fooled by histograms like this. Remember, the number of innings in which he was Out (52) is not the same as the number in which the other batsman was Out (49). This may distort the graph.

Plot B is better, but harder to read. Each black or red dot is a score. The coloured boxes show the range called the "interquartile range": 25% of the scores are below the box, and 25% are above. The line in the middle of the box is the median – 50% of scores are below it and 50% are above. The "whiskers" (lines above and below the box) show the range of scores.

Plot C is less often used in the medical literature (at least), but is really very useful. It plots cumulatively the percentage of scores below a particular score for each of the two sets of data. For example, we can read off the graph that about 27% of the partnership scores when Chris Martin was out were zero. If we look at the dashed line at 50% and where it intersects the blue line, we see that 50% of the scores when Chris Martin was out were 2 or below. This is a bit more informative than plot B.
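
The cumulative plot in C is easy to build yourself. Here is a minimal Python sketch using a small set of made-up partnership scores (not the real Cricinfo data):

```python
# Hypothetical partnership scores, for illustration only.
scores = [0, 0, 0, 2, 2, 5, 7, 12, 19, 33]

def pct_at_or_below(scores, value):
    """Percentage of scores less than or equal to `value`."""
    return 100.0 * sum(s <= value for s in scores) / len(scores)

# Reading off the "curve" at a few points:
print(pct_at_or_below(scores, 0))   # 30.0 - 30% of these partnerships were zero
print(pct_at_or_below(scores, 2))   # 50.0 - half the scores were 2 or below
```

Plotting `pct_at_or_below` for every score gives exactly the cumulative curve of plot C.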

What all the plots show is that the distribution of scores in both data sets is highly skewed. That is, there are many more scores at one end of plot A than the other, and the lines in plot C are not straight lines. This is very important because it tells us which tests we cannot use and how we should not present the data. Quite often, in papers I referee and papers I read, I see averages (means) presented for data like this. This is wrong. They are presented like:

Chris Out:  8.4±13.9

Chris Not Out: 10.8±11.8

The first number is the mean (ie add all the scores and divide by the number of innings). The second number, after the "plus-minus" symbol, is called the standard deviation. It is a measure of the spread of the numbers around the mean. In this case the standard deviation is large compared to the mean. Indeed, a standard deviation more than half the size of the mean is a bit of a giveaway that the distribution is highly skewed and that presenting the numbers this way is totally meaningless. We should be able to look at the mean and standard deviation and conclude that about 95% of the scores lie between two standard deviations below the mean and two above. However, two below (8.4 – 2×13.9) is a negative score! Not possible.

What should be presented is the median with interquartile range (ie the range between the values below which 25% and 75% of the scores fall).

Chris Out:  2.0 (0-12.8)

Chris Not Out: 8 (1-16.5)
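
To make the contrast concrete, here is a small Python sketch on invented skewed scores (not Chris's actual partnerships), showing both summaries and the half-the-mean rule of thumb:

```python
import statistics

# Invented, skewed "partnership" scores - not the actual data.
scores = [0, 0, 0, 1, 2, 2, 3, 8, 15, 47]

mean = statistics.mean(scores)      # 7.8
sd = statistics.stdev(scores)       # ~14.6, bigger than the mean itself
median = statistics.median(scores)  # 2.0
q1, _, q3 = statistics.quantiles(scores, n=4)  # interquartile range bounds

print(f"mean ± SD   : {mean:.1f} ± {sd:.1f}")   # misleading for skewed data
print(f"median (IQR): {median} ({q1}-{q3})")    # what should be reported

# Rule of thumb from the text: an SD more than half the mean hints at skew.
print(sd > mean / 2)  # True
```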

We are now ready to apply a statistical test found in most statistical packages to see if Chris being out or the other batsmen being out was better for the partnership. The test we apply is called the Mann-Whitney U test (or the Kruskal-Wallis test if we were comparing 3 or more data sets). Some people say this is comparing the medians – it is not; it is comparing the whole of the two data sets. If you don't believe me, see http://udel.edu/~mcdonald/statkruskalwallis.html.
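
In Python, for instance, the test is one line with SciPy. The scores below are made up for illustration, not the real partnership data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical partnership scores for the two situations.
out_scores = [0, 0, 1, 2, 2, 5, 8, 12, 30]
not_out_scores = [1, 3, 6, 8, 9, 14, 16, 21, 40]

u_stat, p = mannwhitneyu(out_scores, not_out_scores, alternative="two-sided")
print(f"U = {u_stat}, P = {p:.3f}")
```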

So, I apply the test and it gives me the number p=0.12.  What does this mean?  It means that if Chris Martin were to bat in another 104 innings, and another, and another etc, then 12% of the time we would see the difference (or greater) between the Outs and Not Out partnerships that we do actually see (see significantly p’d for more explanation of p).  12% for a statistician is quite large and so we would suggest that there is no overall difference in partnerships whether Chris Martin was Out or was Not Out.  Alas, Chris Martin’s playing days are over and we have the entire “population” of his scores to assess his batting prowess.  The kind of statistical test I’ve presented is only really useful when we are looking at a sample from a much greater population.  However, in the hope that Chris may make a return to Test cricket one day, then what is presented here should give pause for thought for the next batsman who goes out to bat with him… perhaps there is not a lot to gain by swinging wildly, and thereby increasing their chances of getting out; they are probably not improving the chances of the team.

The legend of Chris Martin: Part I

His innings may be over, but the legend lives on.  Chris Martin retired this week from international cricket. He was a legend with ball and he was a legend with bat, for quite different reasons.  His Test batting average of 2.36 was the worst ever of any international cricketer who batted in more than 15 innings.  But his average does not tell the whole story.  Indeed, the legend of Chris Martin’s batting is a long tale which will require several blog posts to tell.  We need to answer some important questions, “What was his best average?”, “Was it better for his partners to slog or should they have respected his abilities more?”  Along the way I hope that you will pick up on some techniques which will help you interpret those pesky statistics, or to present your own data.

Rule #1:  Always visualise your data

Chris Martin’s batting innings by innings.
Data source: CricInfo

The best place to begin any quest is with a graph. Here is a graph showing all 104 of Chris's innings in chronological order. On it are represented the scores when he was Out (red lines) and the scores when he was Not Out (blue lines). Funnily enough, he was out and not out exactly 52 times each. We can see immediately that the peak of his batting performance was a score of 12 Not Out which occurred approximately half-way through his career. His best form seems to be innings 30 to 34, where he went undefeated in 5 successive innings, scoring 17 runs. On the other hand he had several bad runs where he was Out for zero (red marks below the zero line). One of the interesting things is that his first 4 innings may have given a false impression of his batting prowess. In his first innings he scored 7, well above his eventual average of 2.36. In his 2nd and 4th innings he was 0 Not Out. In between he was 5 Not Out. This coincided with his peak average ever, 12 (orange triangles).

This allows us to note an important feature of statistics. Let us pretend for a moment that the average of 2.36 was "built-in" to Chris Martin from the beginning. This means that it was inevitable that after many innings he would end up with that average. But it is not inevitable that any one innings taken at random equals that mean. Importantly, with only a few samples (ie the first few innings) the average at that point can be a long way from the "real" average. This is a phenomenon caused by sampling from a larger population. It is why we have to be very cautious with conclusions drawn from a small sample. For example, if General Practitioners throughout the country see on average 5 new leukemia cases a year, but we sample only three General Practitioners from Christchurch who saw 8, 9 and 14, then we would be quite wrong to conclude that Christchurch has a higher average leukemia rate than other regions. We need a much larger sample from Christchurch to get a reasonable estimate of Christchurch's average. There are statistical techniques for deciding what proportion of General Practitioners should be sampled and what the uncertainty is in the average we arrive at. Graphs also help… we can see with Chris that after only 10% of his innings he is within 1 of his average and stays that way throughout the rest of his career (orange triangles).
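
A toy simulation makes the sampling point vivid: give a pretend batsman a fixed, "built-in" scoring distribution and watch his running average wander. All the numbers here are invented, not Chris's real scores.

```python
import random

random.seed(1)  # fixed seed so the demonstration is repeatable
score_distribution = [0, 0, 0, 1, 2, 4, 7, 12]                 # hypothetical
true_mean = sum(score_distribution) / len(score_distribution)  # 3.25

# Simulate 104 innings drawn at random from the "built-in" distribution.
innings = [random.choice(score_distribution) for _ in range(104)]
running_avg = [sum(innings[:n]) / n for n in range(1, 105)]

# Early averages can sit a long way from the true mean; later ones settle.
print(f"after 3 innings:   {running_avg[2]:.2f}")
print(f"after 104 innings: {running_avg[-1]:.2f} (true mean {true_mean})")
```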

That’s it for today.  More on the legend of Chris Martin in the weeks ahead.

Significantly p’d

I may be a pee scientist, but today is brought to you by the letter “P” not the product.  “P” is something all journalists, all lay readers of science articles, teachers, medical practitioners, and all scientists should know about.  Alas, in my experience many don’t and as a consequence “P” is abused. Hence this post.  Even more abused is the word “significant” often associated with P; more about that later.

P is short for probability.  Stop! – don't stop reading just because statistics was a bit boring at school; understanding may be the difference between saving lives and losing them.  If nothing so dramatic, it may save you from making a fool of yourself.

P is a probability.  It is normally reported as a fraction (eg 0.03) rather than a percentage (3%).  You will be familiar with it from tossing a coin.  You know there is a 50% or one half or 0.5 chance of obtaining a heads with any one toss.  If you work out all the possible combinations of two tosses then you will see that there are four possibilities, one of which is two heads in a row.  So the prior (to tossing) probability of two heads in a row is 1 out of 4, or P=0.25. You will see P in press releases from research institutes, blog posts, abstracts, and research articles, like this one from today:
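
You can check that coin-toss arithmetic by brute force, for example in Python:

```python
from itertools import product

# All equally likely outcomes of two coin tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))
p_two_heads = sum(o == ("H", "H") for o in outcomes) / len(outcomes)
print(p_two_heads)  # 0.25
```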

“..there was significant improvement in sexual desire among those on  testosterone (P=0.05)” [link]

So, P is easy, but interpreting P depends on the context.  This is hugely important.  What I am going to concentrate on is the typical medical study that is reported.  There is also a lesson for a classroom.

One kind of study reporting a P value is a trial where one group of patients is compared with another.  Usually one group of patients has received an intervention (eg a new drug) and the other receives regular treatment or a placebo (eg a sugar pill).  If the study is done properly a primary outcome should have been decided beforehand.  The primary outcome must measure something – perhaps the number of deaths in a one year period, or the mean change in concentration of a particular protein in the blood.  The primary outcome is how what is measured differs between the group getting the new intervention and the group not getting it.  Associated with it is a P value, eg:

“CoQ10 treated patients had significantly lower cardiovascular mortality (p=0.02)” [link]

To interpret the P we must first understand what the study was about and, in particular, understand the "null hypothesis."  The null hypothesis is simply the idea the study was trying to test (the hypothesis) expressed in a particular way.  In this case, the idea is that CoQ10 may reduce the risk of cardiovascular mortality.  Expressed as a null hypothesis we don't assume that it could only decrease the rate; we allow for the possibility that it may increase it as well (this does happen with some trials!).  So, we express the hypothesis in a neutral fashion.  Here that would be something like: the risk of cardiovascular death is the same in the population of patients who take CoQ10 as in the population which does not take CoQ10.  If we think about it for a minute, if the proportion of patients who died of a cardiovascular event was exactly the same in the two groups then the risk ratio (the CoQ10 group proportion divided by the non-CoQ10 group proportion) would be exactly 1.  The P value, then, answers the question:

If the null hypothesis were true (that is, if the risk of cardiovascular death really were the same in both groups), what is the probability (ie P) that the measured risk ratio would differ from 1 by as much as was observed, simply by chance?

The “by chance” is because when the patients were selected for the trial there is a chance that they don’t fairly represent the true population of every patient in the world (with whatever condition is being studied) either in their basic characteristics or their reaction to the treatment. Because not every patient in the population can be studied, a sample must be taken.  We hope that it is “random” and representative, but it is not always.  For teachers, you may like to do the lesson at the bottom of the page to explain this to children.  Back to our example, some numbers may help.

Suppose we have 1000 patients receiving Drug X and 2000 receiving a placebo.  If, say, 100 patients in the Drug X group die in 1 year, then the risk of dying in 1 year is 100/1000 or 0.1 (or 10%).  If in the placebo group 500 patients die in 1 year, then the risk is 500/2000 or 0.25 (25%).  The risk ratio is 0.1/0.25 = 0.4.  The difference between this and 1 is 0.6.  What is the probability that we arrived at 0.6 simply by chance?  I did the calculation and got a number of p<0.0001.  This means there is less than a 1 in 10,000 chance that this difference was arrived at by chance.  Another way of thinking of this is that if we did the study 10,000 times, and the null hypothesis were true, we'd expect to see the result we saw about one time.  What is crucial to realise is that the P value depends on the number of subjects in each group.  If instead of 1000 and 2000 we had 10 and 20, and instead of 100 and 500 deaths we had 1 and 5, then the risks and risk ratio would be the same, but the P value would be 0.63, which is very high (a 63% chance of observing the difference we observed).  Another way of thinking about this is: what is the probability that we will state there is a difference of at least the size we see, when there is really no difference at all?  If studies are reported without P values then at best take them with a grain of salt.  Better, ignore them totally.
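
The arithmetic above can be sketched in Python. I've used Fisher's exact test here as one reasonable choice; the post doesn't say which test produced its P values:

```python
from scipy.stats import fisher_exact

deaths_x, n_x = 100, 1000        # Drug X group
deaths_pl, n_pl = 500, 2000      # placebo group

risk_x = deaths_x / n_x          # 0.1
risk_pl = deaths_pl / n_pl       # 0.25
risk_ratio = risk_x / risk_pl    # 0.4

# 2x2 table: deaths vs survivors in each group.
table = [[deaths_x, n_x - deaths_x],
         [deaths_pl, n_pl - deaths_pl]]
_, p = fisher_exact(table)
print(risk_ratio, p)  # P is far below 0.0001, as in the text

# Same risks with tiny groups (1/10 vs 5/20 deaths): P balloons.
_, p_small = fisher_exact([[1, 9], [5, 15]])
print(p_small)
```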

It is also important to realise that within any one study that if they measure lots of things and compare them between two groups then simply because of random sampling (by chance) some of the P values will be low.  This leads me to my next point…

The myth of significance

You will often see the word “significant” used with respect to studies, for example:

“Researchers found there was a significant increase in brain activity while talking on a hands-free device compared with the control condition.” [Link]

This is a wrong interpretation:  “The increase in brain activity while talking on a hands-free device is important.” or  “The increase in brain activity while talking on a hands-free device is meaningful.”

“Significant” does not equal “Meaningful” in this context.  All it means is that the P value of the null hypothesis is less than 0.05.   If I had my way I’d ban the word significant.  It is simply a lazy habit of researchers to use this shorthand for p<0.05.  It has come about simply because someone somewhere started to do it (and call it “significance testing”) and the sheep have followed.  As I say to my students, “Simply state the P value, that has meaning.”*



For the teachers

Materials needed:

  • Coins
  • Paper
  • The ability to count and divide

Ask the children what the chances of getting a “Heads” are.  Have a discussion and try and get them to think that there are two possible outcomes each equally probable.

Get each child to toss their coin 4 times and get them to write down whether they got a head or tail each time.

Collate the number of heads in a table like this:

#heads             #children getting this number of heads

0                      ?

1                      ?

2                      ?

3                      ?

4                      ?

If your classroom size is 24 or larger then you may well have someone with 4 heads or 0 (4 tails).

Ask the children if they think this is amazing or accidental?

Then, get the children to continue tossing their coins until they get either 4 heads or 4 tails in a row.  Perhaps make it a competition to see how fast they can get there.  They need to continue to write down each head and tail.

You may then get them to add up all their heads and all their tails, and work out the proportion of heads (get them to divide the number of heads by the total number of tosses).  If you like, go one step further and collate all the data.  The proportion of heads should be approaching 0.5.

Discuss the idea that getting 4 heads or 4 tails in a row was simply due to chance (randomness).
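
If you'd like to preview the exercise before class, a quick simulation (in Python, with an invented random seed) shows how often those runs of four turn up by chance alone:

```python
import random

random.seed(7)  # fixed seed so the run is repeatable
# 24 "children" each toss a coin 4 times.
children = [[random.choice("HT") for _ in range(4)] for _ in range(24)]
head_counts = [flips.count("H") for flips in children]

all_heads_or_tails = sum(c in (0, 4) for c in head_counts)
print(f"{all_heads_or_tails} of 24 children got 4 heads or 4 tails")
# Each child has a 2/16 = 1/8 chance, so on average about 3 of 24 will.
```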

For more advanced classes, you may talk about statistics in medicine and in the media.  You may want to use some specific examples about one off trials that appeared to show a difference, but when repeated later it was found to be accidental.


*For the pedantic.  In a controlled trial the numbers in the trial are selected on the basis of pre-specifying a (hopefully) meaningful difference in the outcome between the case and control arms and a probability of Type I (alpha) and Type II (beta)  errors.  The alpha is often 0.05.  In this specific situation if the P<0.05 then it may be reasonable to talk about a significant difference because the alpha was pre-specified and used to calculate the number of participants in the study.

Cheesecake files: Ponting’s last innings

“Only as good as your last match” goes the cliché.  This is true for Ricky Ponting and here is why. I recently published an article1 (Open Access :)) on some new techniques being used in medical research which determine if making an additional measurement improves what we call “risk stratification.”  In other words – does measuring substance X help us to rule in or rule out if someone has a disease or not.  I got a bit bored with talking about “biomarkers” and medical stuff, so when it came to presenting this at the Australian and New Zealand Society of Nephrology’s annual conference I looked to answer the very important question: “Does Ricky Ponting’s last innings matter?”, or in Australian cricket jargon “Ponting, humph, he’s only as good as his last innings, mate.”

How did I do it?

  1. I chose Australia winning a one-day international when chasing runs as an outcome (Win or Loss).
  2. Using data available from Cricinfo I determined which of the following on its own predicts if Australia will win (ie which predicts the outcome better than just flipping a coin): (1) Who won the toss, (2) whether it is a day or night match, (3) whether it is a home or away match, (4) how many runs the opposition scored.
  3. As it turned out if Australia lost the toss they were more likely to win (!), and, not surprisingly, the fewer runs the opposition scored the more likely they were to win.  I then built a mathematical model.  All this means is that I came up with an equation where the inputs were the winning or losing of the toss and the number of runs and the output was the probability of winning.  This is called a “reference model.”
  4. I added to this model Ricky Ponting’s last innings score and recalculated the probability of Australia winning.
  5. I then could calculate some numbers which told me that by adding Ricky Ponting’s last innings to the model I improved the model’s ability to predict a win and to predict a loss.  Below is a graph which I came up with to illustrate this.  I call this a Risk Assessment Plot.

So, when the shrimp hit the barbie, the beers are in the esky, and your mate sends down a flipper you can smack him over the fence for you now know that when Ricky Ponting scored well in his last innings, Australia are more likely to win.

The middle bit is the Risk Assessment Plot. The dotted lines tell us about the reference model. The solid lines tell us about the reference model + Ricky Ponting. The further apart the red and blue lines are the better. The red lines are derived from when Australia won, the blue lines from when they lost. If you follow the black lines with arrows you can see that by adding Ricky Ponting’s last innings to the model, the predicted probability (risk) of a win increases when Australia went on to win (a perfect model would have all these predictions equal to 1). Similarly the predicted probability of a loss gets smaller when Australia did lose (ideally all these predictions would equal 0).
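
A toy version of what the plot summarises: after adding a new predictor, predicted win probabilities should rise for matches that were won and fall for matches that were lost. The probabilities below are invented, not the published model's output.

```python
ref_probs = [0.55, 0.60, 0.48, 0.52]    # reference model predictions
new_probs = [0.70, 0.66, 0.35, 0.40]    # reference + Ponting's last innings
won       = [True, True, False, False]  # actual outcomes

# Average rise in predicted risk among wins, and fall among losses.
mean_up = sum(n - r for n, r, w in zip(new_probs, ref_probs, won)
              if w) / won.count(True)
mean_down = sum(r - n for n, r, w in zip(new_probs, ref_probs, won)
                if not w) / won.count(False)

# Both positive => the added predictor improved risk stratification.
print(mean_up, mean_down)
```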

  1. Pickering JW, Endre ZH. New Metrics for Assessing Diagnostic Potential of Candidate Biomarkers. Clin J Am Soc Nephro 2012;7:1355–64.

Note to self – eat more chocolate

Apparently the pinnacle of one’s scientific career is to win a Nobel Prize.  Having not won a Nobel yet I will just need to accept by faith that it would indeed be the pinnacle of my career.  Thanks to a brilliant article in the New England Journal of Medicine this week I now have a new strategy to gain that elusive medal – eat more chocolate.  Dr Franz Messerli has nicely illustrated that the number of Nobel laureates per 10 million of population is correlated well with the chocolate consumption of the country of origin of the laureates (Messerli FH. Chocolate Consumption, Cognitive Function, and Nobel Laureates. N Engl J Med 2012;).  The correlation is strong with an r=0.79* (p<0.0001**) increasing to 0.86 with the removal of one outlier (Sweden). As the author wrote:

“..since chocolate consumption has been documented to improve cognitive function, it seems most likely that in a dose-dependent way, chocolate intake provides the abundant fertile ground needed for the sprouting of Nobel laureates. Obviously, these findings are hypothesis-generating only and will have to be tested in a prospective, randomized trial.”

On the outlier he wrote:

“The only possible outlier seems to be Sweden. Given its per capita chocolate consumption of 6.4 kg per year, we would predict that Sweden should have produced a total of about 14 Nobel laureates, yet we observe 32. Considering that in this instance the observed number exceeds the expected number by a factor of more than 2, one cannot quite escape the notion that either the Nobel Committee in Stockholm has some inherent patriotic bias when assessing the candidates for these awards or, perhaps, that the Swedes are particularly sensitive to chocolate, and even minuscule amounts greatly enhance their cognition.”

You may wonder why this was not published on 1 April .  Is this merely another example of “correlation doesn’t equal causation” and the tyranny of the p value (more on that in another post), or could there really be something in it? Have a read of the article and judge for yourself.

No good scientific report is worth its salt without a testimonial (here):

“I attribute essentially all my success to the very large amount of chocolate that I consume,” said Eric Cornell, an American physicist who shared the Nobel Prize in 2001.

“Personally I feel that milk chocolate makes you stupid,” he added. “Now dark chocolate is the way to go. It’s one thing if you want like a medicine or chemistry Nobel Prize, OK, but if you want a physics Nobel Prize it pretty much has got to be dark chocolate.”

*  if r=1 then the correlation is perfect; if r=0 then there is no correlation at all.  0.79 is impressive.

** this means that there is a 0.01% chance that the correlation observed was due to random chance.
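
For what it's worth, the correlation coefficient is a one-liner to compute. The numbers below are invented stand-ins, since the paper's country-level data isn't reproduced here:

```python
from scipy.stats import pearsonr

# Hypothetical per-capita chocolate consumption (kg/yr) and Nobel
# laureates per 10 million population - NOT the paper's data.
chocolate_kg = [2.0, 4.5, 6.0, 8.5, 10.0, 12.0]
laureates_per_10m = [1.5, 5.0, 9.0, 15.0, 22.0, 30.0]

r, p = pearsonr(chocolate_kg, laureates_per_10m)
print(f"r = {r:.2f}, P = {p:.4f}")
```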

Children killed by mothers – duh!

The media is in a tiz over a police report stating that 45% of children killed in family violence were killed by their mothers.  The Herald says “Children are far more likely to be killed by their mothers than any other category of offender,”; Newstalk ZB reports Bob McCoskrie as saying this is “startling”; Stuff ramp it up a bit and begin “Nearly 50 per cent of children who die as a result of family violence are killed by their mothers.”

Well, duh!  Don’t they know that children are far more likely to be living with their mothers than with any other adult?  The “any other category of offender” the Herald talks about includes Fathers, Stepfathers, Boyfriends, Grandmothers, and various others.  Given there were only 33 deaths looked at, this is a large number of groups to try and analyse.  For the numbers to have any meaning they need to be assessed using what are called “case controls.”  For example, if they looked at other factors, eg socio-economic, and then assessed the make-up of households in that group, they may find that children in that group live predominantly with their mothers and much less with fathers, stepfathers, boyfriends etc.  The numbers may only reflect who the children are living with, not some kind of “evil mother” syndrome.

What use, really, is such a report?  How will it help prevent further deaths?  I suspect the answers are “little” and “it won’t”, but this will not prevent endless hours of hand-wringing in the media.  Surely we can do better.

Medals per capita is biased

The Games are over, let the analysis begin.

We’ve had some fun with ranking countries’ performance at the London Olympics according to medals per million (medals per capita) or medals per 100 billion of Gross Domestic Product (see my tables below).  As I predicted a few posts back, the medals per capita title will be won by a country with a very low population and few medals – Grenada is the winner here.  It seems obvious when we think about it that a country with a population of just 100,000 (0.1 million) may end up with a very high medals-per-million score if it wins just 1 or 2 medals (even though that is still a difficult feat).  What is not so easy to see is that countries with very high populations have a “limit” on their performance that is very much lower.  With just ~900 medals on offer and a population of over 1340 million, China’s possible maximum medals-per-million score is just 0.67 (compared with Grenada’s 9000).  It is the breadth of this range of possible values that causes the bias in the ranking system.

I like to visualise data.  The two graphs below show the bias for the “Official Rankings” (you know, the ones that rank according to number of golds first, silvers second and bronzes third) and for the medals per capita.  The bias is obvious because the points on the graph are not scattered over the graphs without any discernible pattern. The “Official Rankings” are biased towards countries with greater populations; the medals per capita is biased towards countries with lesser populations.  Obviously, dividing by population does not remove the bias; it merely shifts it. Note that the scales on the “y” axis are what we call “log scales”.  This enables us to see all the data more easily (ie countries with 100,000 and 1.3 billion people can be displayed on one graph). What is not shown on the graph are the 122 countries ranked 80th equal who won no medals at all.
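
The "limit" argument is just division; here is a small sketch using the post's approximate figures:

```python
TOTAL_MEDALS = 900  # approximate number of medals on offer, as in the text

def max_medals_per_million(population_millions):
    """Best conceivable medals-per-million if one country won everything."""
    return TOTAL_MEDALS / population_millions

print(round(max_medals_per_million(1340), 2))  # China's ceiling: 0.67
print(max_medals_per_million(0.1))             # Grenada's ceiling: 9000.0
```

A ranking whose best possible score spans four orders of magnitude across countries is bound to be biased by population size.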

Later this week, once I am happy with my grant writing and get my head around some data I am trying to analyse I shall attempt to put together an equation which will better help us answer the important question of the day – “Which is the greatest olympic nation?”

Top graph: The Official Rankings versus population (note the log scale).
Bottom graph: The ranking of number of medals won per million population versus population

Grenada grabs gold, NZ relegated to Silver

As expected, a country with a small population has grabbed the top medal position: Grenada (population 104,000) grabbed a gold.  With a medals-per-million score of 9.6 they are only likely to be beaten by a country with an even smaller population.  Meanwhile, Jamaica has moved into 5th position with 5 medals, all in athletics.  If this were health stats, then these two situations would be examples of “outliers”: worthy of study in and of themselves, but having a distorting influence on the overall population statistics.  Also of interest is that Great Britain is perhaps reaching a plateau, while China continues to fall as sports they are not traditionally strong in dominate the second week of competition.

A bronze puts New Zealand in Gold medal position

The weekend success of a New Zealand rowing pair put them in gold medal position on the medals per capita table.  They have now sneaked ahead of Slovenia.  Denmark is in bronze medal position with Australia solid in fourth.  The big mover over the weekend was Great Britain, moving from 17th at the end of day 6 to 11th at the end of day 9.