Exploratory Data Analysis with Play-by-Play Cricket Data
April 30, 2022
1 What is Cricket? What is the Data Like?
First, what is cricket? In one sentence, it’s a version of baseball, played in different parts of the world, and using older rules. We will focus on the Twenty20 format of cricket, which has only been played professionally since about 2005, and is growing quickly in terms of spectator and commercial interest.
In the Twenty20 format, each team is allowed one innings to score runs (yes, “innings”, not inning). That innings is over if either:
- 10 of the 11 players are dismissed (a dismissal is also called an “out” or “wicket”), or
- 120 “fair” balls are thrown. A group of 6 throws is called an “over”, making 20 overs per innings. Hence the name Twenty20.
A small number of throws per innings (typically fewer than 10) do not count against the limit of 120 because they were not thrown in the prescribed manner. A small number of throws also incur a one-run penalty. These two cases are similar to “balls” and “hit-by-pitches” in baseball, respectively.
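The two stopping conditions can be sketched in a few lines of Python. This is only an illustration: the names `innings_over`, `count_fair_balls`, and the `"fair"` key are hypothetical and do not come from the data set itself.

```python
MAX_WICKETS = 10      # 10 of the 11 players dismissed
MAX_FAIR_BALLS = 120  # 20 overs of 6 fair balls each


def count_fair_balls(throws):
    """Count the throws that count against the 120-ball limit.

    Each throw is assumed to be a dict with a boolean "fair" key
    (a hypothetical layout, not the actual play-by-play schema).
    """
    return sum(1 for throw in throws if throw["fair"])


def innings_over(wickets, fair_balls):
    """Return True once either stopping condition is met."""
    return wickets >= MAX_WICKETS or fair_balls >= MAX_FAIR_BALLS
```

Because only fair balls are counted, an innings that runs its full length contains 120 fair balls plus however many throws did not count, which is why the total is written as 120+.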
Each of these 120+ throws has a discrete result, which we simplified into six categories: Wicket, 0 Runs, 1 Run, 2-3 Runs, 4-5 Runs (Ground Rule Double), and 6 Runs (Home Run). Throws resulting in 3 runs or 5 runs were rare, so we treated them as a fixed proportion of the 2-3 Runs (~6.4%) and 4-5 Runs (~1.5%) categories, respectively. Likewise, throws not counting against the 120 limit, and those resulting in a penalty run, were treated as a fixed proportion of each of the six categories.
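As a rough sketch, the six-category simplification might look like the following in Python. The function and variable names are hypothetical, and two points are assumptions rather than statements from the analysis: that a wicket takes precedence over any runs scored on the same throw, and that the ~6.4% and ~1.5% figures are the shares of 3s and 5s within their respective categories.

```python
CATEGORIES = ["Wicket", "0 Runs", "1 Run", "2-3 Runs", "4-5 Runs", "6 Runs"]


def categorize(runs, wicket):
    """Map one throw to one of the six simplified categories."""
    if wicket:  # assumption: a dismissal overrides any runs scored
        return "Wicket"
    if runs == 0:
        return "0 Runs"
    if runs == 1:
        return "1 Run"
    if runs in (2, 3):
        return "2-3 Runs"
    if runs in (4, 5):
        return "4-5 Runs"
    if runs == 6:
        return "6 Runs"
    raise ValueError(f"unexpected number of runs: {runs}")


# Assumed reading of the fixed proportions: the share of the rarer
# outcome (3 or 5 runs) within each merged category.
P_THREE = 0.064
P_FIVE = 0.015

# Average runs represented by one result in each merged category.
EXPECTED_RUNS_2_3 = 2 * (1 - P_THREE) + 3 * P_THREE
EXPECTED_RUNS_4_5 = 4 * (1 - P_FIVE) + 5 * P_FIVE
```

Under that reading, a 2-3 Runs result is worth about 2.064 runs on average and a 4-5 Runs result about 4.015 runs.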