“Taca la bala” says the wizard: a trip into the World Cup 2014

Data Tales

We published this tweet three days ago, before the two semi-finals of the World Cup 2014. Our prediction was correct: against every (Brazilian) forecast, Brazil was humiliated by Germany, while Argentina defeated the Netherlands on penalties after an unexciting match. The final, thus, will be a true classic of football: Germany vs Argentina. How did we figure out the two winning teams? It was not a stroke of luck. It was more properly a “stroke of data”.

In 1970, our parents followed the “match of the century”, Italy-Germany 4-3, on a noisy black-and-white TV, tuned to the only public channel the Italian government provided at that time. After many technological improvements, in 2006 we switched to LCD full-color screens, and watched Zidane’s famous headbutt in high…


Posted in Soccer analytics

Are defences dominating this World Cup?

Attacking teams sell tickets, but defensive ones win games. So far, the 2014 World Cup is no exception, especially considering that after the fireworks of the group stage, the knockout clashes have produced many draws and a dearth of goals, with only five in the quarter-finals.

After all, the history of the World Cup teaches that the team with the best defence (but not the best attack) has won 42% of the tournaments, while the team with only the best attack has won just 21%. That is exactly the same share as the teams that had both the best attack and the best defence, and the rest of the winners had neither one nor the other. In theory, therefore, teams with a better defence are twice as likely to win the competition.
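The arithmetic behind "twice as likely" is easy to check. The following is a minimal sketch that uses only the shares quoted above; the variable names are mine and purely illustrative.

```python
# Shares of World Cup wins quoted in the text, in percent.
best_defence_only = 42
best_attack_only = 21
best_both = 21  # stated to be the same share as the best attack alone

# Whatever is left went to winners with neither the best defence nor attack.
neither = 100 - (best_defence_only + best_attack_only + best_both)

# How much more often the best defence won than the best attack.
ratio = best_defence_only / best_attack_only

print(f"winners with neither: {neither}%")
print(f"defence-to-attack ratio: {ratio:.1f}")
```

The remaining share for winners with neither the best defence nor the best attack follows directly from the three quoted percentages.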

To remind ourselves, we only need to analyse the quarter-finalists in Brazil. Of the four teams that conceded the fewest shots, all but France have gone through. Les Bleus ended their tournament as the team with the second-fewest shots on target conceded per match (SOTCON), 2.4, against Brazil’s 2: an excellent performance, but not enough to beat Germany. The progress of Deschamps’ squad was not helped by the fact that France was third from bottom for shots needed per goal conceded. Against France 4 shots were enough to produce a goal; fewer were needed only against Brazil (2.5) and the Netherlands (3.75). All of this is confirmed by France’s save percentage, the third-lowest among the teams that reached the quarter-finals (75%, with only Brazil on 60% and Holland on 73.3% doing worse), which contrasts with its SOTCON, the second best of the group. In other words, statistically France conceded few scoring opportunities, but these were good ones, easy to score from. The proof is Germany’s goal, with Varane failing to match Hummels’ strength: evidence of how an individual weakness can ruin the work of a group.

France’s statistical blip is most likely explained by this event. Statistically, however, the Costa Rica story is more difficult to tell. Here we have the best defensive record of all the teams in the quarter-finals (only 2 goals conceded), and the highest save percentage (91.7%) – with their goalkeeper Keylor Navas going home with a 90% record. Costa Rica was also the team that made the best use of the offside trap: 41 times in 5 games, with two masterful peaks, 11 against Italy and 13 against Holland. Italy’s Balotelli and his substitute Immobile were judged offside 6 times each, a record in the tournament until the quarter-finals, when van Persie topped them with 9. So why did Pinto’s men leave us? Because logic dictates that he had milked all of his team’s technical ability. Also, perhaps, because his team had the worst SOTCON. In other words, it is true that to score a goal against Costa Rica 12 shots were needed (basically double those of second-placed Germany, with 6.3), but it is also true that 34.29% of these shots were on target. And those were far too many for even Navas to save.

Aside from these two statistical blips, the four teams left in the competition are very close to the standard of a tournament that could be decided by the best defence. Brazil’s may not look impressive, as it has the lowest save percentage (60%), but, as mentioned, this is the team that has conceded the fewest shots on target per game and has the best SOTCON, an impressive 16.67%. This suggests that to beat Julio Cesar (at least with Thiago Silva on the pitch) a hell of a shot is required.

Germany is third from bottom in shots conceded per game and SOTCON, but has the highest save percentage after Costa Rica, an impressive 84.2%, which means that to score a goal against Neuer 6.3 shots on target have been necessary so far. Much the same is true for Argentina, which gives away 3.4 shots on target per game but has the third-best save percentage (82.4%). If anything, it is much more difficult to explain the semi-final place of Holland (at least from a defensive point of view). The Dutch have the second-worst SOTCON and the second-worst save percentage (73.3%). So far they have conceded a small number of shots per game (3), but one wonders whether that is going to continue against Messi and company.
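The "shots needed per goal" figures quoted in this post appear to follow directly from the save percentages: if a keeper saves a share s of the shots on target, one goal goes in for every 1 / (1 - s) shots on target. The sketch below assumes that relationship (it is my reading of the numbers, not a formula stated in the post) and uses only the save percentages quoted above.

```python
def shots_on_target_per_goal(save_pct: float) -> float:
    """Average number of shots on target faced for each goal conceded."""
    conceded_share = 1 - save_pct / 100  # fraction of shots on target that go in
    return 1 / conceded_share

# Save percentages quoted in the post for the quarter-finalists.
save_pct = {
    "Costa Rica": 91.7,
    "Germany": 84.2,
    "Argentina": 82.4,
    "France": 75.0,
    "Netherlands": 73.3,
    "Brazil": 60.0,
}

for team, pct in save_pct.items():
    print(f"{team:12s} {shots_on_target_per_goal(pct):5.2f}")
```

Under this assumption the quoted figures for Germany (6.3), France (4), the Netherlands (3.75), Brazil (2.5) and Costa Rica (about 12) are all reproduced, which suggests they are shots on target rather than all shots.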

Posted in Soccer analytics, Sports Analytics, Uncategorized

Analytic insights and dubious corners stats

Does Manchester City’s much publicised analytic insight on corners stand up to scrutiny?
There are reasons to doubt it!

Corners are one of the key moments in a match that grab fans’ attention. There is always a high expectation that a corner will result in a goal. However, the probability that a corner will lead directly (first touch) to a goal is very low [1]. Investigating the stats suggests that more goals come from the penalty area scramble that often follows corners, and any goal that is scored tends to come many touches of the ball after the original corner kick.

However, it is difficult to prove or disprove these hypotheses, in view of the lack of trustworthy data publicly available. Manchester City appear to have tried (they can afford to buy or collect the data), and their claim on corners has received much publicity as a key finding of their massive analytics effort (11+ analytics people). Shown below are some extracts of how this claim has been reported in the media.

“City had gone 22 games and not scored a goal from a corner. After this….. they scored 8 goals in 15 games.”
(http://thevideoanalyst.com/sports-analytics-conference-part-2/ – 22/11/2011)

“The data revolution keeps stumbling on new truths. At Manchester City, for instance, the analysts finally persuaded the club’s then manager, Roberto Mancini, that the most dangerous corner kick is the inswinger, the ball that swings towards goal. Mancini had long argued (strictly from intuition) that outswingers were best. Eventually he capitulated and, in the 2011-2012 season, when City won the English title, they scored 15 goals from corners, the most in the Premier League. The decisive goal, Vincent Kompany’s header against Manchester United, came from an in swinging corner.”
From review of the book …

“Wilson recalls one particular period when Manchester City hadn’t scored from corners in over 22 games, so his team decided to analyse over 400 goals that were scored from corners. They noticed that about 75 percent resulted from so-called in-swinging corners, the type where the ball curves towards the goal. “In the next 12 games of the next season we scored nine goals from corners,” Wilson says.
http://www.wired.co.uk/magazine/archive/2014/01/features/the-winning-formula (date)

Since the first time I came across these articles, this claim has struck me as a poor example of the kind of useful insight that analytics can deliver, and definitely not one that should get so much attention. I also did not understand the Man City analytics department’s fascination with goals scored from corners given their low impact on the game, a fact highlighted in blogs [1] by Chris Anderson (@soccerquant), although the year he analysed was a poor one for corners.

So, when I saw this Man City claim recently mentioned in Wired (see above), I could not stop myself reflecting on the phrase, “In the next 12 games of the next season we scored nine goals from corners,” Wilson says. Nine goals from corners in the next 12 games! That can’t be right! I am aware that few goals come from corners, so to score nine in twelve consecutive games struck me as a rather exceptional event. I decided to investigate.

Opta stats do not specify how corners are taken, so there is no way to find out from their data whether this stat was true. Anyway, I only had Opta summary data for the 2011-2012 EPL season, as provided by the now defunct MCFC project. So, using Opta data was out. However, I remembered that corner type (in-swinging or out-swinging) was specified in detailed match commentaries that I had collected from the web in the past: from 2007-08 to 2011-12, after which, for some reason and to my chagrin, I could no longer find them anywhere. (I am tempted to comment on the lack of free data on football beyond the few traditional match stats – but this is probably better left to another blog).

So, I extracted the relevant data from these commentaries, and produced the following charts:

Fig. 1 – All corners
Fig. 1 shows the number of inswinging and outswinging corners taken in the years specified. It is clearly visible that more in-swinging corners are taken: 60% more on average.

Fig. 2 – All goals from corners
Fig. 2 shows the goals scored from each type of corner. A weighted average of corners and goals shows that on average 50% more goals are scored from in-swinging corners than out-swinging ones. I should add that I was rather puzzled by the small number of goals scored from corners in 2010-2011 (a stat which perhaps merits further investigation), but after checking and re-checking, using also published data, I had to accept that this was the case.
Fig. 3 – Man City corners
Fig. 3 shows that the mix of corners taken by Man City does not follow the general pattern shown in Fig. 1 (more in-swinging than out-swinging are taken in all seasons), and instead changes from one season to the next. A significant change occurs in 2011-12, when three times more in-swinging than out-swinging corners are taken.
Fig. 4 – Man City goals from corners
Fig. 4 shows that Man City scores very few goals from corners, with the highest totals in 2009-10 and in 2011-12, when they scored 9 goals, all from in-swinging corners. The latter exploit is probably the most relevant statistic to keep in mind.
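To put the Wired figure of nine goals in twelve games into context, one can ask how likely such a run would be if goals from corners arrived at the 2011-12 rate of 9 per 38-game season. The sketch below is a rough illustration only: it assumes goals follow a Poisson process with a constant rate and looks at a single fixed 12-game window, neither of which is claimed in the post.

```python
import math

def poisson_tail(k_min: int, lam: float) -> float:
    """P(X >= k_min) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_min))
    return 1 - below

season_goals = 9    # goals from corners in 2011-12, as counted for Fig. 4
season_games = 38   # league games in an EPL season
window = 12         # games in the claimed run
claimed_goals = 9   # goals claimed in that run

lam = season_goals / season_games * window  # expected goals in a 12-game window
p = poisson_tail(claimed_goals, lam)

print(f"expected goals in {window} games: {lam:.2f}")
print(f"P(at least {claimed_goals} goals): {p:.4f}")
```

Under these assumptions the probability comes out well below 1%, which is consistent with the intuition that nine in twelve would be an exceptional event. A different rate or a search over all possible windows would change the number.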

So, now that we have the stats, let’s look at each of the three claims reported above, in order of time.

Claim 1
The first was (supposedly) made by Gavin Fleig, MC Head of Performance Analysis at the time (Nov 2011): “…City had gone 22 games and not scored a goal from a corner. After this…. they scored 8 goals in 15 games”. No timeline is given for this event, but since this claim was made at a conference in November 2011, and cannot refer to a distant past, we can safely assume it falls within the range of my data. So, by looking at the charts (Fig. 4), one can see that it could only have happened in the 2009-2010 season, the one preceding this claim.

During this season, according to my stats, Man City scored 7, not 8, goals all season (but keep in mind that I have counted only goals scored directly from corners), on the following match days: 4, 10, 13, 24 and 32 (3 goals). From this sequence we can see that they did not go 22 games without scoring from corners as claimed, but only 10 (match days 14-23). Moreover, MC did not score 8 goals in the following 12 games as claimed, but only 4, of which 3 came on day 32 in a 6-0 win against an already relegated Burnley, and so hardly merit being included in the count.
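The match-day sequence above can be checked mechanically. The sketch below assumes a 38-game league season and that the match days listed are consecutive league games; the function names are mine.

```python
# Match days on which Man City scored from a corner in 2009-10, with goal counts,
# as listed in the text above.
goals_by_day = {4: 1, 10: 1, 13: 1, 24: 1, 32: 3}
SEASON_GAMES = 38  # assumption: a standard EPL season

def longest_goalless_run(goals, n_games):
    """Return (length, first_day, last_day) of the longest run without a goal."""
    best = (0, None, None)
    start = 1
    for day in range(1, n_games + 2):  # n_games + 1 acts as an end-of-season sentinel
        if day in goals or day == n_games + 1:
            length = day - start
            if length > best[0]:
                best = (length, start, day - 1)
            start = day + 1
    return best

def goals_in_window(goals, first_day, n_games):
    """Goals scored in the n_games starting at first_day."""
    return sum(g for d, g in goals.items() if first_day <= d < first_day + n_games)

run = longest_goalless_run(goals_by_day, SEASON_GAMES)
total_goals = sum(goals_by_day.values())

print("total goals:", total_goals)
print("longest goalless run (length, first, last):", run)
print("goals in the 12 games after it:", goals_in_window(goals_by_day, run[2] + 1, 12))
print("goals in the 15 games after it:", goals_in_window(goals_by_day, run[2] + 1, 15))
```

Whether the window after the drought is taken as 12 games or as the 15 games quoted in the claim, the count is the same, because no further goals from corners are listed after match day 32.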

Of course Gavin’s 22 consecutive goalless games could also include games played at the end of the previous season, 2008-2009, when Man City scored only one goal from corners all season (a record?). I’ll leave readers to query that stat, but, as you’ll remember, we are still left with the second part of the claim, “…8 goals in 15 games”, which doesn’t tally.

Claim 2
Man City’s corner stats have achieved such iconic status as to merit a mention in a much publicised recent book on football analytics, “The numbers game: Why everything you know about football is wrong”. I haven’t got round to reading it yet, so I can only comment on what has been reported in a review, which states that in “…winning the 2011-2012 season, when City won the English title, they scored 15 goals from corners, the most in the Premier League.” Fifteen (15) goals is the official Opta figure. I only found nine (9), and, significantly, all coming from in-swinging corners.

However, it is likely that this Opta stat follows the rule that “goals created from this particular match situation are defined here as occurring within three touches of a corner”, mentioned in the blog post cited earlier by Chris Anderson, one of the authors of this book, titled “Why the Goal Value of Corners Is (Almost) Nil: Evidence from the EPL” [1]. His analysis should leave many fans wondering about the Man City analytics crowd’s fascination with corners.

Claim 3
Last but not least, we come to the claim, as reported in Wired, made by the top analytics man himself, Simon Wilson, Man City Strategic Performance Manager.

Wilson recalls one particular period when Manchester City hadn’t scored from corners in over 22 games, so his team decided to analyse over 400 goals that were scored from corners. They noticed that about 75 percent resulted from so-called in-swinging corners, the type where the ball curves towards the goal. “In the next 12 games of the next season we scored nine goals from corners“, writes Wired.

This article is very recent, dated 23rd January 2014, but the claim is similar to that made by Gavin Fleig in Nov 2011 (two years earlier!), and it must obviously refer to the same ‘fact’: the sequence of goalless games is the same, 22, and so is the run of games in which goals were scored, 12 (I’ll leave it to the statisticians among you to calculate the probability of this event being repeated). But the number of goals now jumps to 9, not 8 as claimed by Gavin Fleig. I have already commented on the accuracy of these stats in Claim 1, so I’ll just deal with Wilson’s other claims: that 400 goals from corners were analysed, and that 75% of these were scored from in-swinging corners.

First, though, I should point out that Wilson does not specify that the 9 goals were scored from in-swinging corners, although it is clear from his premise that this is what he means. But, as we have already seen, this does not tally with my figures, which show that in 2009-2010 only 3 goals came from this type of corner, while 4 came from out-swinging ones – hardly a strong case for favouring in-swinging corners. But perhaps Wilson (speaking last year, I presume) is mixing the stats of 2009-10 with those of 2011-12, when 9 goals were scored, all from in-swinging corners. As to the claim of 75% success for in-swinging corners, this does not tally with my stats, which show only a 61% advantage – not an insignificant difference.

Concluding remarks
So, here you have it. My stats appear to refute Man City’s much publicised corner claim (claims?). What next?

As I started thinking of how to wrap up my piece with some “Conclusions”, many thoughts came to mind, and I began to write. But then doubt and caution won the day, and I stopped. I wondered: how could a big club like Man City, with a claimed 11+ people working on analytics, make such an error? Surely among these there must be some with A-level stats and knowledge of Excel (even though neither is really necessary for doing simple stats).

I hesitated, and came to the decision to leave my Conclusion to a later blog. This would give the various people mentioned, as well as interested football analysts, a chance to set the record straight, and, perhaps, question my findings.

[1] http://www.soccerbythenumbers.com/2011/05/why-goal-value-of-corners-is-almost.html

Notes on data
As a professional data analyst, I am often riled by the lack of free data on many topics of wide public interest (of which soccer is perhaps the least). I believe that anyone making a public claim based on the analysis of data should be prepared to make this available to all who request it. In the digital age this is rather easy to do. The capacity for other people to repeat the analysis, and verify or refute a data-based claim, is of fundamental importance. Otherwise statistics will always inhabit that realm between science and non-science, and won’t be taken seriously.

In line with this belief, the data I used – not all, but the part specific to Man City corners – is available on request to fellow bona fide analysts. They must keep in mind, however, that this data may be subject to copyright by the original publisher, and cannot be distributed at will.

Posted in Soccer analytics, Sports Analytics, Uncategorized

Balls and Runs – an attempt at cricket analytics

Having a rest from football, and since the battle for the Ashes is on (England – Summer 2013), I have turned my attention to cricket. Australia’s bowlers have been criticised for their lack of success, especially in the 2nd Test at Lords. So here is my attempt at an analysis of their performance, and that of the English batsmen that faced them.

The data

I have taken the data from the ball-by-ball commentary of the first two Tests, Trent Bridge and Lords, published on the ESPN Cricinfo website. I had to do some extensive data cleaning and structuring to put it in the format I needed. A snapshot of the resulting table I used for the analysis is shown below.

Fig. 1

Runs x ball

Some notes of explanation.   There is a row for each ball played – the variables used should be pretty clear.  Also, to facilitate the analysis , I have:

  • Allocated extras (byes, no balls, etc.) as runs to the respective bowler/batsmen.  Extras  are such a small percentage of the total runs as not to influence the results of my analysis
  • Allocated zero (0) runs to an OUT ball

The analysis
The purpose of my analysis is to find significant differences (SDs) in performance, between Tests, Innings, Bowlers, Batsmen, with respect to  a chosen performance variable.

The chosen variables  are:
1. Runs per ball – runs scored for each ball (1, 2,… to 6)
2. Number of runs – runs scored  (0, 1, 2,… to 6)
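As an illustration of the two variables, here is a minimal sketch that tabulates both from a ball-by-ball table. The rows are hypothetical (they are not the Cricinfo data), and the column layout of test, ball in over and runs is an assumption about how such a table might be structured.

```python
from collections import Counter, defaultdict

# Hypothetical ball-by-ball rows: (test, ball_in_over, runs). Not real match data.
balls = [
    ("Trent Bridge", 1, 0), ("Trent Bridge", 2, 4), ("Trent Bridge", 3, 0),
    ("Trent Bridge", 4, 1), ("Trent Bridge", 5, 0), ("Trent Bridge", 6, 2),
    ("Lords", 1, 4), ("Lords", 2, 0), ("Lords", 3, 1),
    ("Lords", 4, 0), ("Lords", 5, 6), ("Lords", 6, 0),
]

def runs_share_by_ball(rows):
    """Variable 1: percentage of total runs scored off each ball of the over."""
    totals = defaultdict(int)
    for _test, ball, runs in rows:
        totals[ball] += runs
    grand_total = sum(totals.values())
    return {ball: 100 * r / grand_total for ball, r in sorted(totals.items())}

def runs_value_counts(rows):
    """Variable 2: how many balls yielded 0, 1, 2, ... runs."""
    return Counter(runs for _test, _ball, runs in rows)

share = runs_share_by_ball(balls)
counts = runs_value_counts(balls)

for ball, pct in share.items():
    print(f"ball {ball}: {pct:.1f}% of runs")
print("balls by runs scored:", dict(sorted(counts.items())))
```

Filtering the rows by Test, innings, bowler or batsman before calling either function gives the per-group breakdowns discussed in the sections that follow.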

1. Runs per ball analysis

The starting point of the analysis is the distribution of the number of runs scored from each ball in these two Tests (Total). So, in Fig. 2 below, starting at the top, we can see in the first node the order of the ball played (1st, 2nd, …) and the corresponding number of runs.

1.1  Runs x ball – Test and Innings
Fig. 2

I must admit to being rather surprised to find little difference between the runs scored from each ball in the two Tests combined (Total). However, this is not the case when we look at these Tests separately, where there is an SD in the distribution of runs per ball. For example, at Lords 19.27% of the total runs were scored from the 1st ball against 13.90% at Trent Bridge. I have highlighted the highest percentage score in each Test.

Going further down the tree,  I found a significant difference between the 1st and 2nd innings at Lords , and I have marked the highest scores.  There was no SD between innings  at Trent Bridge.

1.2 Runs x ball – Bowlers by Test
Fig. 3

Fig. 3 shows that there is an SD in the number of runs conceded by bowlers from each ball. The balls with the highest % of runs conceded are highlighted. At Trent Bridge, for Agar it is the 5th one, for Pattinson and Siddle the 2nd one, and so on. Pattinson stands out from his mates at Lords for conceding most runs from the 5th ball (23.97%), and being the most effective with his 4th one (7.53%).

 1.3 Runs x ball – Batsmen by Test
Fig. 4

England’s batsmen also show significant differences in the number of runs scored from different balls, as shown above in Fig. 4. Again the analysis has been done by Test. Note Bell’s preference for scoring most runs from the 5th ball. I’ll leave it to cricket fans to dig out other interesting facts.

2. Runs scored

The aim of this analysis is to find significant differences (SDs) in the number of runs conceded/scored from each ball. The starting point is a node (Fig 4) that shows the number of runs scored. So we have that 1,974 balls scored no runs (0), 290 balls scored 1 run, and so on.

2.1 Runs scored – Test, Innings
Fig. 5

The data tree above shows first that there is an SD (in the distribution of runs per ball) between the two Tests. Significantly more balls went unscored at Trent Bridge than at Lords (79.51% vs. 74.77%). There is also an SD between the innings, but only at Trent Bridge. The relevant figures are highlighted.
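The post does not say which test was used to decide that a difference is significant. One common choice for comparing distributions like these is Pearson's chi-square test of independence, sketched below in plain Python. The counts are hypothetical, chosen only to sit close to the percentages quoted above, and 3.841 is the standard 5% critical value for one degree of freedom.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts (dot balls, scoring balls); not the actual figures.
table = [
    [795, 205],  # Trent Bridge: 79.5% of balls not scored from
    [748, 252],  # Lords: 74.8% of balls not scored from
]
CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value at 5%, 1 degree of freedom

stat = chi_square(table)
print(f"chi-square = {stat:.2f}, significant at 5%: {stat > CRITICAL_5PCT_DF1}")
```

With real ball counts in place of the hypothetical ones, the same function extends to larger tables (for example runs 0 to 6 by Test), though the critical value then depends on the degrees of freedom.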

2.2. Runs scored – Bowlers by Test
Fig. 6

Fig. 6 shows a comparison of the Australian bowlers in the two Tests. Watson stands out at Trent Bridge for his economy, 93.04% no runs compared to his mates’ 78.15%. However, he bowled far fewer balls than them, which is something that could invalidate this comparison. Still, the objective of this post is to present facts, not to make a deep statement about performance. At Lords too, Aussie bowlers showed an SD in performance, as marked, with Smith perhaps the most expensive one.

2.3 Runs scored – Batsmen by Test
Fig. 7

In both Tests, England’s batsmen show an SD in the number of runs scored from each ball. In the first, they divide into two groups, with the one headed by Bairstow apparently doing better (more balls scored) than the Anderson group. At Lords, Bresnan stands out with his poor (as judged solely by this analysis) performance, 43 runs from 166 balls.

Closing notes

This is it! I just wanted to show an example of a different way to analyse cricket performance. I have done what came easiest to do. There are other results that I could get, for example comparing the individual performance of batsmen/bowlers across Tests and innings. Perhaps I’ll do this and other results next.

The main hurdle I have to overcome is getting the data and putting it in a format suitable for this type of analysis. The time and effort required for this is definitely a put-off. However, I hope to continue this basic analysis for the next Tests, and will do more if I have the time. Somehow I have a feeling that by enriching this data with additional variables, order and time of batting/bowling, for example, some interesting insights (patterns) may emerge that coaches/captains may find useful to explain/improve performance.

Posted in Sports Analytics, Uncategorized

Best and worst defensive performance – 2013 EPL

The idea for this analysis came from reading Paul Power’s February post, where he sets out to analyse Defensive Efficiency (Deff). To measure it he lists seven variables, and the reasons for his choice. These are:

1. Goals Conceded (GC)
2. Goals Conceded Difference (GC-D)
3. Total Shots Conceded (TSC)
4. Shots on Target Conceded (SoTC)
5. Shots on Target Conceded % (SoTC%)
6. Goals Conceded From Total Shots % (GCTS%)
7. Goals Conceded from Total Shots on Target % (GCSoT%)

In his next post he attempts a classification of teams at that time using 6. and 7. and draws some conclusions.

My take on it
I don’t agree with his choice of 2., which I think is only useful if one is trying to compare Home vs. Away Deff performance, not the overall one. I also have doubts about the contribution of 5. SoTC%; I’ll leave both out of my analysis.

My analysis has a similar aim, but I am going to split Deff into two components, Defensive effectiveness (Deff) and Defensive efficiency (Deff%), and analyse them separately. The first measures how effective a defence is in restricting shooting opportunities for opponents; the second its efficiency in preventing goals being scored.

Defensive effectiveness (Deff)
For this analysis I have taken the list of variables that follows.  As well as Totals for Goals, Shots and Shots on target, as done by Paul, I have added  Home and Away figures.  These would help me cluster teams that have a similar defensive profile. (PS As would have figures for Goals conceded In and Out of the Box, of course, if I had them)

HGC       Home Goals Conceded
AGC        Away Goals Conceded
TGC        Total Goals Conceded
HSC        Home Shots Conceded
ASC        Away Shots Conceded
HSTC     Home Shots on Target Conceded
ASTC     Away Shots on Target Conceded
TSC        Total Shots Conceded
TSTC     Total Shots on Target Conceded

The data comes from the 2013 EPL, as given by the website Football Data (link), and is shown in the table below:

The Deff analysis
To classify teams  according to these metrics I am going to use the Cluster Analysis method I have  used in my previous post .  The results are shown in the picture below.  The data has been normalised: high values are shown in red and low in green (best performers), as in the colour scale shown.

From left to right, the following picture shows:

  • The order in which teams have been ranked with respect to the performance parameters
  • A heat map that shows the normalized values of these parameters
  • A dendrogram that shows how teams have been clustered
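For readers unfamiliar with the method, the sketch below shows one simple form of cluster analysis: columns are normalised to z-scores and teams are merged bottom-up by average linkage. This is an illustration only. The post does not state which distance, linkage or normalisation was used, and the teams and figures here are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical (goals conceded, shots conceded) per team; not the real table.
teams = {"A": (37, 400), "B": (39, 410), "C": (57, 520), "D": (60, 530), "E": (73, 640)}

def zscores(rows):
    """Normalise each column to mean 0 and standard deviation 1."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

def distance(p, q):
    """Euclidean distance between two normalised profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def cluster(names, points, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(names))]

    def linkage(pair):
        a, b = pair
        return mean(distance(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(names[i] for i in c) for c in clusters]

names = list(teams)
groups = sorted(cluster(names, zscores(list(teams.values())), k=3))
print(groups)
```

The order in which clusters are merged is what a dendrogram draws; stopping at a chosen number of clusters, as here, corresponds to cutting the dendrogram at one level.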

The analysis splits the teams into five major clusters, with the teams with the best Deff, all in different shades of green (below-average values), at the top.

To my surprise, Arsenal, whose defensive performance was much criticised during much of the season, tops the list.  Also Man Utd, the champions, are not among the best defensively, and belong to the second best cluster of teams  that share a similar defensive profile.   Stoke is the surprise entry in this second cluster – not bad for a team who just avoided relegation to be in (just) with the champions.  But I guess this can be explained with the fact that they score and concede few shots and goals (sorry, I haven’t got time for a deeper analysis).

As for the bottom half of the table, nobody would be shocked to find Reading at the bottom, and alone, because they were so much worse than their relegated companions. But I guess not many would have expected Swansea to be just above them. I seem to recall too many heavy defeats though. I’ll leave you to reflect on the other teams’ positions.

The Deff% analysis
This analysis is aimed at classifying teams with respect to their Deff% and clustering teams with a similar profile. So, while in my first analysis I took absolute values, I now take ratios (%) of goals conceded to shots conceded.

  1. HGC% = HGC/home shots
  2. HGCT%=HGC/home shots on target
  3. AGC%=AGC/away shots
  4. AGTC%=AGC/away shots on target
  5. TGC%=TGC/total shots
  6. TGCT%=TGC/total shots on target
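The six ratios above can be computed directly from the home and away totals listed earlier. The sketch below uses hypothetical figures for a single team (not taken from the Football Data table) and assumes that "home shots" and "away shots" mean the shots conceded at home and away (HSC, ASC) and likewise for shots on target (HSTC, ASTC).

```python
# Hypothetical season totals for one team; not real data.
team = {
    "HGC": 20, "AGC": 25,      # goals conceded home / away
    "HSC": 200, "ASC": 250,    # shots conceded home / away
    "HSTC": 100, "ASTC": 125,  # shots on target conceded home / away
}

def pct(goals, shots):
    """Goals conceded as a percentage of shots conceded."""
    return 100 * goals / shots

def deff_pct(t):
    """Return the six Deff% ratios, keyed by the names used in the list above."""
    tgc = t["HGC"] + t["AGC"]      # total goals conceded
    tsc = t["HSC"] + t["ASC"]      # total shots conceded
    tstc = t["HSTC"] + t["ASTC"]   # total shots on target conceded
    return {
        "HGC%": pct(t["HGC"], t["HSC"]),
        "HGCT%": pct(t["HGC"], t["HSTC"]),
        "AGC%": pct(t["AGC"], t["ASC"]),
        "AGTC%": pct(t["AGC"], t["ASTC"]),
        "TGC%": pct(tgc, tsc),
        "TGCT%": pct(tgc, tstc),
    }

ratios = deff_pct(team)
for name, value in ratios.items():
    print(f"{name:6s} {value:5.1f}")
```

Lower values mean a more efficient defence: fewer of the shots conceded end up as goals.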

The results are shown in the figure below:

The heat map on the left shows a ‘sequencing’ of the teams in order of Deff%, with the best at the top. The one on the right shows an attempt to cluster and order teams at the same time.

There are some discrepancies between the two images, as the one on the left shows the ordering of clusters and not of individual teams (clustering is not an exact science). What is clear, though, is that Chelsea and West Ham have the best Deff%, and that Southampton, Wigan and Newcastle have the worst, in that order – but there is not much to choose between them. Could this be the main reason for the Toon’s steep fall from last year’s grace? I’ll leave you to ponder the other results, and will just comment on some that really stand out.

High-flying Tottenham (at least for most of the season) is just above the trio of relegated teams. This probably accounts for the missed Champions League spot, and is something AVB will need to address in order to improve performance next season. I am sure that he knows that, despite his (apparent) dislike of statistics.

Arsenal appear to be the odd team out in their cluster on account of their poor Home performance, which is worthy of a bottom-three placing, but they are obviously lifted by their outstanding Away one.

And, finally, it appears that it wasn’t because of their Deff% that QPR and Reading were relegated. It looks like they may have conceded some decisive goals in their fight for salvation.

Final analysis
So, which teams have the best and worst defensive record? A combined cluster analysis of the two metrics should in theory give the answer. But I had problems getting a meaningful and consistent classification, probably on account of the mix of values (Deff) and ratios (Deff%). So, I/we’ll have to do it by inspection of the combined heat maps below, and views are bound to differ:

Mine is that Arsenal does not deserve top spot because of its very poor Home record in conceding goals. Man City appears to have the best combined record, followed by Man Utd and Chelsea, the latter on account of its best Deff%.

As for the teams with the worst record, the picture is much less clear, and contrasting results make it rather confusing. Swansea is a case in point, one step from the bottom with regard to Deff, and in third position for Deff%. And then there’s Tottenham, fifth in the League table, and in the top cluster for Deff, but near the bottom in Deff% – where is its true place? Even worse is the dilemma facing anyone who wants to judge the relegated teams, with Reading having the worst Deff of all, and a Deff% near mid-table. And what about Sunderland? I’ll leave you to make up your own mind on this and the other teams.

Posted in Soccer analytics, Sports Analytics

Points table vs. Performance table – take your pick

Continuing my analysis of the MCFC Analytics project data for the 2012 EPL season, in this post I try to give an answer to a much debated argument among fans, namely:  does the number of points gained by a team at the end of the season, and therefore its position on the final points table, reflect accurately its performance on the pitch?

The data

My first task was to select appropriate parameters to measure a team’s performance (metrics). As the idea of this analysis was prompted by reading a post by the always excellent Dan Barnett, I decided to use his table of performance parameters (Data) referenced in his blog. But not all of these – I left some out, reasoning, perhaps wrongly, that they were not fit for my purpose. My final selection of metrics is shown in the table below (Fig.1).

The method of analysis

I had already decided that Cluster Analysis was the method that I would use to give an answer to the vexed question. And I think many data analysts would agree with me that this is the most appropriate.

There are many ways to do Cluster Analysis and, perhaps, I should go into some detail of what I did.  But I won’t…  for a number of reasons.  I shall only say that I used a few just to make sure of my results.  I am willing, however, to discuss my choice with those who will take the trouble to ask.   So, let’s get straight to the analysis.

A heat map representation of the data table above is shown in Fig.2. Values have been normalised by subtracting the average of each column. High values are shown in various shades of red, those near the average in black, and negative ones in shades of green.

So one can see that good performance values for the top teams (goals, shots, etc.) are in bright red (hot), while those of the bottom ones are mainly in shades of green. The reverse is true for bad performance indicators, such as goals/shots/etc. conceded.
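The normalisation and colour coding described above can be sketched in a few lines. The table below is hypothetical (two made-up metrics for three made-up teams), and the three-way colour split is a simplification of the graded shades used in the heat map.

```python
from statistics import mean

# Hypothetical metrics table: team -> (goals scored, goals conceded). Not real data.
table = {"Team A": (90, 30), "Team B": (60, 50), "Team C": (30, 70)}

def centre_columns(rows):
    """Subtract each column's average from every value in that column."""
    averages = [mean(col) for col in zip(*rows.values())]
    return {team: tuple(v - a for v, a in zip(vals, averages))
            for team, vals in rows.items()}

def shade(value):
    """Map a centred value to the colour family described in the text."""
    if value > 0:
        return "red"    # above the column average
    if value < 0:
        return "green"  # below the column average
    return "black"      # at the average

centred = centre_columns(table)
for team, vals in centred.items():
    print(team, vals, [shade(v) for v in vals])
```

A team that scores many goals shows red in the first column, while a team that concedes many shows red in the second, which is why the colour pattern reverses for the "conceded" indicators.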

Fig. 2
Colour scale
The analysis results

Fig. 3 shows the results of the analysis. From left to right we have:

  1. The new ranking of the teams with respect to the performance parameters used
  2. A heat map that shows the normalized values of these parameters
  3. A dendrogram that shows how teams have been clustered
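
For readers unfamiliar with how such a tree is built, here is a minimal sketch of one common variant: agglomerative clustering with average linkage and Euclidean distance. It is an assumption made for illustration, not necessarily the variant used for Fig. 3. The data are invented, and the names `points`, `average_linkage` and `agglomerate` are hypothetical.

```python
from itertools import combinations
from math import dist

# Hypothetical centred metrics per team (illustrative, not the post's data).
points = {
    "Team A": (2.0, 1.8),
    "Team B": (1.9, 2.0),
    "Team C": (0.1, -0.2),
    "Team D": (-0.1, 0.0),
    "Team E": (-2.0, -1.9),
}


def average_linkage(c1, c2):
    """Mean Euclidean distance over all cross-cluster pairs of teams."""
    total = sum(dist(points[a], points[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(names, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [(name,) for name in names]
    merges = []  # merge history: the information a dendrogram draws
    while len(clusters) > n_clusters:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(*pair))
        merges.append((c1, c2, average_linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return clusters, merges


clusters, merges = agglomerate(list(points), n_clusters=3)
for left, right, height in merges:
    print(f"merged {left} + {right} at distance {height:.3f}")
print(clusters)
```

Each recorded merge corresponds to one junction in a dendrogram, with the merge distance as its height. Teams joined at a low height have very similar performance profiles.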


Fig. 3: Team cluster heat map

The first thing we notice is that this performance ranking differs from the final points ranking: some teams have been promoted and others demoted, which was to be expected. We can also notice that the first four teams and the bottom three retain their points table positions. So, at the positions that matter, the points system looks like a fair method of classification: these are the positions that dictate which teams are rewarded with places in the main European competitions and which are punished with relegation.

We can also notice the following:

  1. Man City and Man Utd have very similar performance, form a cluster of their own, and have the same ranks as in the final EPL league table, 1st and 2nd.
  2. They are followed by a cluster of four teams: Arsenal, Spurs, Chelsea and Liverpool. Here we see the first major changes from the points ranking, as Chelsea and Liverpool have taken the positions of Newcastle and Everton respectively, who have dropped to the next, lower-performing cluster.
  3. For Newcastle, this downgrading confirms other analyses and comments in various posts, not least that of Dan Barnett, already mentioned. Pointwise, Newcastle overperformed: the Toon’s performance on the pitch does not tally with their final league position.
  4. Stoke and Aston Villa have very similar performance and make up a separate cluster, with the latter also being the biggest gainer, +4 positions in this analysis.
  5. In the next cluster we find WBA, down four positions from its points table rank.
  6. Four teams are in the last cluster, that of the worst performers. This includes all the relegated teams, but also Norwich, who make the biggest drop of all (-5 positions), from twelfth to seventeenth.

A summary comparison of these two classifications, points (Table Rnk) and performance (Perf. rnk), is given in the table below (Fig. 4):


Fig. 4: Team rank

Concluding remarks

As expected, the points table does not reflect accurately the performance of ALL the teams during the season.  It appears, though, to reflect the performance of the best and worst performing teams (as measured by the metrics considered in our analysis).  And since the points table rewards and punishes  only these teams, it can be said that its results are fair.   Some managers, however, especially those who have been criticised and even sacked for ‘bad performance’, may feel vindicated by our analysis.  Others, of course, might question our results.

Cluster analysis is probably the analytical method best suited to answering the question posed at the start. In contrast with methods that can classify teams with respect to a single metric only, it can do so with respect to many at once, which is what is needed to answer the question correctly. I believe this is the first time that this method has been used for such a purpose. And I am looking forward to the end of the current season to apply it to this year’s results.

I guess that many would not be entirely happy with the metrics I have used. As I said, I took the lazy option and used those provided by another, perhaps more knowledgeable, analyst. I would consider, however, performing further analysis by adding or removing any suggested metric (given a reasoned argument for the change), as long as I am provided with the relevant data.

Posted in Soccer analytics, Sports Analytics, Uncategorized

Benitez vs. Di Matteo – A decision tree analysis

There is only one story at the moment: the continuing controversy over Rafa Benitez at Chelsea. Yet despite the fans’ unrelenting abuse of the interim manager, an analysis of the team’s performance shows that Chelsea are in fact performing slightly better under the new man than under Di Matteo.

I started thinking of doing this analysis before Rafa blew up at Middlesbrough. And, obviously, the media frenzy that followed forced my hand. So, I am going to compare Rafa’s performance with that of his predecessor at Chelsea, Di Matteo. And I am going to analyse only Premiership matches; the main reason is that I don’t have readily available performance data for other matches.

My comparison covers Chelsea’s first 26 Premiership matches of this season, that is, those before their last match away at Man City. Of these, 12 were played under Di Matteo and the last 14 under Benitez. I’ll be using the only three metrics I have: Points gained, Shots, and Shots on target.

For the analysis I am using a method called a decision tree, which is probably the best way to compare the performance of two individuals statistically, and which is also rather easy for everyone to read and understand. The performances are also compared using three ‘conditional’ parameters: Home/Away, Wins/Draws/Losses, and opposition team ranking (top, middle, bottom).

The results show that there are some differences, some small, some large, in the three performance parameters examined. We can see that there has been a slight improvement in performance under Benitez with regard to Shots (16.07 vs. 13.83) and Shots on target (9.50 vs. 8.00), but Di Matteo has a slight Points advantage. However, none of these differences is statistically significant! Overall, the performance of the two managers has been pretty much the same.
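
To make “statistically significant” a little more concrete, here is a minimal sketch of one simple way to test whether two sets of per-match figures differ: a two-sided permutation test on the difference in means. This is a swapped-in technique for illustration, not the decision tree used for the results in this post. The shot counts are invented (only the sample sizes of 12 and 14 matches mirror the comparison), and the function name `permutation_p_value` is hypothetical.

```python
import random

# Hypothetical shots-per-match samples (illustrative, not Chelsea's real data).
manager_a = [12, 15, 11, 14, 16, 13, 12, 15, 14, 13, 16, 15]          # 12 matches
manager_b = [18, 14, 17, 15, 19, 13, 16, 17, 15, 18, 14, 16, 17, 16]  # 14 matches


def average(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def permutation_p_value(x, y, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(average(x) - average(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign matches to the two groups
        diff = abs(average(pooled[:len(x)]) - average(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


p = permutation_p_value(manager_a, manager_b)
print(f"difference in means: {average(manager_b) - average(manager_a):.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) would suggest that the gap between the two managers is unlikely to be due to chance alone. With only 12 and 14 matches per group, moderate differences can easily fail that test.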

The only significant difference to come out of my analysis is between Home and Away shots (17.15 vs. 12.92), but probably Chelsea’s fans had already spotted this one.  Anyway, see the results below.

Overall comparison


Venue: Home vs. Away

Results: Wins/Draws/Losses


Opposition team rank: top, mid, bottom


So with little to split the two managers, maybe the fans’ ire should be directed more towards the directors’ box than the dugout. What Benitez has said in the past may have upset the fans, but when it comes to what happens on the pitch, there is little to choose between him and the venerated Di Matteo.

Posted in Soccer analytics, Sports Analytics