
The legend of Chris Martin: Part II

Chris Martin was Not Out a remarkable 50% of the time: 52 times in 104 innings. Is this a record? I didn't know the answer, so I sent off an email to the gurus at Cricinfo to see if they did. Michael Jones replied that yes, it is, for batsmen with over 100 innings! Well done Chris!

What this raises is the question of whether it was better for an incoming batsman to swing and hope to score a few runs before Chris was out, or whether they should just play normally. To answer it we must first decide what to do with the innings in which both Chris and the other batsman were Not Out. In such circumstances the choice is either to include the innings on both sides or to exclude them. I've chosen to exclude them, as I think this leaves the least room for bias.

Now let us apply my Rule #1 and visualise the data (see previous Chris Martin post).

Chris Martin's partnerships. Data source: Cricinfo

Plot A is a histogram in which I have grouped each of the two sets of data (the partnership scores when Chris was Out and the scores when he was Not Out) into bins. Each bin is 5 runs wide except for the first: the first bin is from 0 to 2.5 (really 0 to 2), the second from 2.5 to 7.5, and so on. What can be seen from this is that there appear to be more very low partnerships when Chris was Out than when the other batsman was Out. However, don't be fooled by histograms like this. Remember, the number of innings in which he was Out (52) is not the same as the number in which the other batsman was Out (49). This may distort the graph.
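One way around the unequal group sizes is to plot percentages instead of raw counts. Here is a minimal sketch in Python of the binning described above, assuming whole-number scores. The function name and the scores are made up for illustration; they are not the real partnership data.

```python
from collections import Counter

def binned_percentages(scores, width=5):
    """Group whole-number scores into bins centred on multiples of `width`
    (0-2, 3-7, 8-12, ... for width 5) and return the percentage in each bin."""
    if not scores:
        raise ValueError("scores must not be empty")
    offset = width // 2  # shifts bin edges to 2.5, 7.5, 12.5, ...
    counts = Counter((score + offset) // width for score in scores)
    total = len(scores)
    return {bin_index: 100.0 * count / total
            for bin_index, count in sorted(counts.items())}

# Hypothetical scores, invented for this example only.
example_out = [0, 0, 1, 3, 7, 8]
example_not_out = [0, 4, 6, 9, 11]

print(binned_percentages(example_out))
print(binned_percentages(example_not_out))
```

Because each group is divided by its own number of innings, the two sets of bars can be compared directly even when the groups differ in size.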

Plot B is better, but harder to read. Each black or red dot is a score. The coloured boxes show the range called the “Interquartile range”. That is, 25% of the scores are below the box, and 25% are above. The line in the middle of the box is the median: 50% of scores are below it and 50% of scores are above. The “Whiskers” (lines above and below the box) show the range of scores.

Plot C is less often used in the medical literature (at least), but is really very useful. It plots cumulatively the percentage of scores at or below a particular score for each of the two sets of data. For example, we can read off the graph that about 27% of the partnership scores for when Chris Martin was out were zero. If we look at the dashed line at 50% and where it intersects the blue line, then we see that 50% of the scores for when Chris Martin was out were 2 or below. This is a bit more informative than plot B.
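For anyone wanting to build a plot like C, each point on the line is just "what percentage of scores are at or below this value". A small sketch in Python follows; the function name and the scores are hypothetical, not the actual partnership data.

```python
def ecdf_percent(scores, threshold):
    """Percentage of scores that are less than or equal to `threshold`."""
    if not scores:
        raise ValueError("scores must not be empty")
    at_or_below = sum(1 for score in scores if score <= threshold)
    return 100.0 * at_or_below / len(scores)

# Hypothetical partnership scores, invented for illustration only.
example_scores = [0, 0, 0, 1, 2, 2, 5, 9, 14, 40]

# Evaluate the curve at a few thresholds; a full plot would use every score.
for threshold in (0, 2, 10):
    print(threshold, ecdf_percent(example_scores, threshold))
```

Plotting these percentages against the thresholds, one line per data set, gives the cumulative plot.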

What all the plots show is that the distribution of scores in both data sets is highly skewed. That is, there are many more scores at one end of plot A than the other, or the lines in plot C are not straight lines. This is very important because it tells us what tests we cannot use and how we should not present data. Quite often, both in papers I referee and in papers I read, I see the averages (means) presented for data like this. This is wrong. They are presented like:

Chris Out:  8.4±13.9

Chris Not Out: 10.8±11.8

The first number is the mean (ie add all the scores and divide by the number of innings). The second number after the “plus-minus” symbol is called the standard deviation. It is a measure of the spread of the numbers around the mean. In this case the standard deviation is large compared to the mean. Indeed, anything more than half the size of the mean is a bit of a giveaway that the distribution is highly skewed and that presenting the numbers this way is totally meaningless. If the scores were spread symmetrically in the familiar bell shape, we should be able to look at the mean and standard deviation and conclude that about 95% of the scores are between two standard deviations below the mean and two above. However, two below (8.4 – 2*13.9 = –19.4) is a negative score! Not possible.
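The two-standard-deviation check is simple arithmetic, and it is worth doing whenever a mean±SD is quoted. A small Python sketch using the means and standard deviations given above (the function name is just for illustration):

```python
def two_sd_range(mean, sd):
    """Return the interval from two SDs below the mean to two SDs above."""
    return mean - 2 * sd, mean + 2 * sd

# Means and standard deviations as quoted in the post.
summaries = [("Chris Out", 8.4, 13.9), ("Chris Not Out", 10.8, 11.8)]

for label, mean, sd in summaries:
    low, high = two_sd_range(mean, sd)
    print(f"{label}: {low:.1f} to {high:.1f}")
```

Both lower limits come out negative (–19.4 and –12.8), which cannot happen for partnership scores.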

What should be presented is the median with the interquartile range (ie the range from the score that 25% of scores are below to the score that 75% of scores are below).

Chris Out:  2.0 (0-12.8)

Chris Not Out: 8 (1-16.5)
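Medians and quartiles are easy to get from Python's standard library. The sketch below uses hypothetical scores, not the real data. Note also that packages use slightly different rules for interpolating quartiles, so the "inclusive" method chosen here is an assumption and may not reproduce the figures above exactly.

```python
from statistics import mean, median, quantiles

def median_and_iqr(scores):
    """Return (median, lower quartile, upper quartile) of the scores."""
    lower, _middle, upper = quantiles(scores, n=4, method="inclusive")
    return median(scores), lower, upper

# Hypothetical, highly skewed scores invented for illustration only.
example_scores = [0, 0, 0, 1, 2, 2, 5, 9, 14, 40]

mid, lower, upper = median_and_iqr(example_scores)
print(f"mean   = {mean(example_scores):.1f}")
print(f"median = {mid} ({lower}-{upper})")
```

With skewed data like this the single large score drags the mean well above the median, which is why the median and interquartile range are the better summary.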

We are now ready to apply a statistical test found in most statistical packages to see if Chris being out or the other batsman being out was better for the partnership. The test we apply is called the Mann-Whitney U test (or the Kruskal-Wallis test if we were comparing 3 or more data sets). Some people say this is comparing the medians. It is not; it is comparing the whole of the two data sets. If you don't believe me, see http://udel.edu/~mcdonald/statkruskalwallis.html.

So, I apply the test and it gives me the number p=0.12. What does this mean? It means that if there were really no difference between the Out and Not Out partnerships, and Chris Martin were to bat in another 104 innings, and another, and another etc, then 12% of the time we would see a difference as large as (or larger than) the one we actually see (see significantly p'd for more explanation of p). 12% for a statistician is quite large, and so we would say there is no good evidence of an overall difference in partnerships whether Chris Martin was Out or was Not Out.

Alas, Chris Martin's playing days are over and we have the entire “population” of his scores to assess his batting prowess. The kind of statistical test I've presented is only really useful when we are looking at a sample from a much greater population. However, in the hope that Chris may make a return to Test cricket one day, what is presented here should give pause for thought to the next batsman who goes out to bat with him. Perhaps there is not a lot to gain by swinging wildly and thereby increasing their chances of getting out; they are probably not improving the chances of the team.
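For the curious, here is a rough sketch of what a Mann-Whitney U test does under the hood, written with only the Python standard library. It uses the large-sample normal approximation with a correction for tied scores and no continuity correction. That is an assumption of this sketch: statistical packages may use exact methods or other corrections, so their p values can differ a little. The function name and the scores are hypothetical, not the real partnership data.

```python
import math
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for x, approximate p value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool the scores, remembering which group (0 = x, 1 = y) each came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied scores occupy ranks i+1 .. j and all get the average rank.
        average_rank = (i + 1 + j) / 2
        tied = j - i
        tie_term += tied ** 3 - tied
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every score identical: no evidence of any difference
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p_value

# Hypothetical scores, invented for illustration only.
example_out = [0, 0, 0, 1, 2, 3, 5, 12, 20]
example_not_out = [0, 1, 4, 8, 8, 10, 16, 25]

u, p = mann_whitney_u(example_out, example_not_out)
print(f"U = {u}, p = {p:.3f}")
```

Notice that the test works on the ranks of all the scores pooled together, which is why it is a comparison of the whole of the two data sets and not just of their medians.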

The legend of Chris Martin: Part I

His innings may be over, but the legend lives on.  Chris Martin retired this week from international cricket. He was a legend with ball and he was a legend with bat, for quite different reasons.  His Test batting average of 2.36 was the worst ever of any international cricketer who batted in more than 15 innings.  But his average does not tell the whole story.  Indeed, the legend of Chris Martin’s batting is a long tale which will require several blog posts to tell.  We need to answer some important questions, “What was his best average?”, “Was it better for his partners to slog or should they have respected his abilities more?”  Along the way I hope that you will pick up on some techniques which will help you interpret those pesky statistics, or to present your own data.

Rule #1:  Always visualise your data

Chris Martin's batting, innings by innings. Data source: Cricinfo

The best place to begin any quest is with a graph. Here is a graph showing all 104 of Chris's innings in chronological order. On it are the scores when he was Out (red lines) and the scores when he was Not Out (blue lines). Funnily enough he was out and not out exactly 52 times each. We can see immediately that the peak of his batting performance was a score of 12 Not Out, which occurred approximately half-way through his career. His best form seems to be innings 30 to 34, where he went undefeated in 5 successive innings, scoring 17 runs. On the other hand he had several bad runs where he was Out for zero (red marks below the zero line).

One of the interesting things is that his first 4 innings may have given a false impression of his batting prowess. In his first innings he scored 7, well above his eventual average of 2.36. In his 2nd and 4th innings he was 0 Not Out. In between he was 5 Not Out. This coincided with his peak average ever, 12 (orange triangles).

This allows us to note an important feature of statistics. Let us pretend for a moment the average of 2.36 was “built-in” to Chris Martin from the beginning. This means that it was inevitable that after many innings he would end up with that average. But it is not inevitable that any one innings taken at random is equal to that mean. Importantly, with only a few samples (ie the first few innings) the average at that point can be a long way from the “real” average. This is a phenomenon caused by sampling from a larger population. It is why we have to be very cautious with conclusions drawn from a small sample. For example, if General Practitioners throughout the country see on average 5 new leukemia cases a year, but we sample only three General Practitioners from Christchurch who saw 8, 9 and 14, then we would be quite wrong to conclude that Christchurch has a higher average leukemia rate than other regions. We need a much larger sample from Christchurch to get a reasonable estimate of Christchurch's average. There are statistical techniques for deciding what proportion of General Practitioners should be sampled and what the uncertainty is in the average we arrive at. Graphs also help: we can see with Chris that after only 10% of his innings he is within 1 of his average and stays that way throughout the rest of his career (orange triangles).
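To see how that early average of 12 comes about, remember that a batting average is runs scored divided by the number of times dismissed, so Not Out innings add runs without adding a dismissal. The Python sketch below tracks the running average over the first four innings described above. It assumes he was Out in his first innings, which the text implies but does not state outright. The function name is just for illustration.

```python
def running_batting_average(innings):
    """Given (runs, was_out) pairs, return the batting average after each
    innings. The average is None until the first dismissal."""
    total_runs = 0
    dismissals = 0
    averages = []
    for runs, was_out in innings:
        total_runs += runs
        if was_out:
            dismissals += 1
        averages.append(total_runs / dismissals if dismissals else None)
    return averages

# First four innings as described in the post: 7 (assumed Out), 0*, 5*, 0*.
first_four = [(7, True), (0, False), (5, False), (0, False)]

print(running_batting_average(first_four))  # [7.0, 7.0, 12.0, 12.0]
```

Twelve runs for one dismissal gives an average of 12, more than five times his eventual 2.36, from just four innings.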

That’s it for today.  More on the legend of Chris Martin in the weeks ahead.

Cheesecake files: Ponting’s last innings

“Only as good as your last match” goes the cliché. This is true for Ricky Ponting and here is why. I recently published an article [1] (Open Access :)) on some new techniques being used in medical research which determine if making an additional measurement improves what we call “risk stratification.” In other words, does measuring substance X help us to rule in or rule out whether someone has a disease? I got a bit bored with talking about “biomarkers” and medical stuff, so when it came to presenting this at the Australian and New Zealand Society of Nephrology's annual conference I looked to answer the very important question: “Does Ricky Ponting's last innings matter?”, or in Australian cricket jargon “Ponting, humph, he's only as good as his last innings, mate.”

How did I do it?

  1. I chose Australia winning a one-day international when chasing runs as an outcome (Win or Loss).
  2. Using data available from Cricinfo I determined which of the following on its own predicts if Australia will win (ie which predicts the outcome better than just flipping a coin): (1) Who won the toss, (2) whether it is a day or night match, (3) whether it is a home or away match, (4) how many runs the opposition scored.
  3. As it turned out if Australia lost the toss they were more likely to win (!), and, not surprisingly, the fewer runs the opposition scored the more likely they were to win.  I then built a mathematical model.  All this means is that I came up with an equation where the inputs were the winning or losing of the toss and the number of runs and the output was the probability of winning.  This is called a “reference model.”
  4. I added to this model Ricky Ponting's last innings score and recalculated the probability of Australia winning.
  5. I then could calculate some numbers which told me that by adding Ricky Ponting’s last innings to the model I improved the model’s ability to predict a win and to predict a loss.  Below is a graph which I came up with to illustrate this.  I call this a Risk Assessment Plot.

So, when the shrimp hit the barbie, the beers are in the esky, and your mate sends down a flipper you can smack him over the fence for you now know that when Ricky Ponting scored well in his last innings, Australia are more likely to win.

The middle bit is the Risk Assessment Plot. The dotted lines tell us about the reference model. The solid lines tell us about the reference model + Ricky Ponting. The further apart the red and blue lines are the better. The red lines are derived from when Australia won, the blue lines from when they lost. If you follow the black lines with arrows you can see that by adding Ricky Ponting's last innings to the model, the predicted probability (risk) of a win increases when Australia went on to win (a perfect model would have all these predictions equal to 1). Similarly, the predicted probability of a win gets smaller when Australia did lose (ideally all these predictions would equal 0).
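The idea in the plot can be boiled down to a few numbers: how much the average predicted probability of a win goes up in the matches Australia won, and how much it goes down in the matches they lost. The difference between those two changes is commonly called the integrated discrimination improvement. The Python sketch below is only an illustration of that calculation. The function name and all the probabilities are invented, and this is not the actual model or data from the talk.

```python
from statistics import mean

def discrimination_improvement(outcomes, p_reference, p_new):
    """outcomes: 1 for a win, 0 for a loss.
    p_reference, p_new: predicted probability of a win from each model.
    Returns (change among wins, change among losses, overall improvement)."""
    wins = [i for i, outcome in enumerate(outcomes) if outcome == 1]
    losses = [i for i, outcome in enumerate(outcomes) if outcome == 0]
    if not wins or not losses:
        raise ValueError("need at least one win and one loss")
    change_wins = mean(p_new[i] for i in wins) - mean(p_reference[i] for i in wins)
    change_losses = mean(p_new[i] for i in losses) - mean(p_reference[i] for i in losses)
    # A better model pushes wins up and losses down, so subtract the two.
    return change_wins, change_losses, change_wins - change_losses

# Hypothetical matches and predictions, invented for illustration only.
outcomes = [1, 1, 0, 0]
p_reference = [0.6, 0.5, 0.5, 0.4]   # reference model
p_new = [0.8, 0.7, 0.3, 0.2]         # reference model plus one extra input

up, down, overall = discrimination_improvement(outcomes, p_reference, p_new)
print(f"wins: {up:+.2f}, losses: {down:+.2f}, overall: {overall:.2f}")
```

A positive change among wins and a negative change among losses both mean the extra input is helping.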

  1. Pickering JW, Endre ZH. New Metrics for Assessing Diagnostic Potential of Candidate Biomarkers. Clin J Am Soc Nephrol 2012;7:1355–64.