**1. Introduction**

This blog is a continuation of my last one on shooting in the MLS (Star shooters of the MLS – April 1, 2017). Here I shall be looking at how *shot location* affects *shot outcome*, as measured by the metric SG (goals/shots), or conversion rate.

As soccer fans, we know intuitively that not all shots have the same probability of being converted into a goal, and that shot location plays an important part in this outcome. So, the questions I am trying to answer are: which locations give a higher or lower probability of scoring, and how many shots are taken from each location?

**2. Methodology**

As in the previous analysis, I shall be using the Decision Trees (https://en.wikipedia.org/wiki/Decision_tree_learning) method. Before I can start, I need to add a new variable to the data, which I call Zones_XY, and assign values to it. To create Zones_XY I divide the final third of the pitch into a grid of 50 Zones (5 along x by 10 along y). The result is that each shot is now associated with a Zone location specified by the variable Zone_XY. This takes the values A1, A2, A3, … and so on.
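For readers who want to try the zoning step themselves, here is a minimal sketch of how a shot's x,y location could be turned into a Zone label. The coordinate system (0 to 100 on both axes, attacking goal at x = 100) and the labelling convention (letter for the x band, number for the y band) are my assumptions for illustration; the function name `zone_xy` is hypothetical too.

```python
import string

# Assumed coordinate system (not stated in the post): x and y both run 0-100,
# with the attacking goal at x = 100, so the final third starts at x = 200/3.
X_MIN, X_MAX = 200 / 3, 100.0
Y_MIN, Y_MAX = 0.0, 100.0
N_X, N_Y = 5, 10  # 5 bands along x, 10 along y -> 50 zones


def zone_xy(x, y):
    """Return a label such as 'A1' for a shot at (x, y), or None outside the final third."""
    if not (X_MIN <= x <= X_MAX and Y_MIN <= y <= Y_MAX):
        return None
    # Scale to a band index; min() keeps shots on the far edge in the last band.
    ix = min(int((x - X_MIN) / (X_MAX - X_MIN) * N_X), N_X - 1)
    iy = min(int((y - Y_MIN) / (Y_MAX - Y_MIN) * N_Y), N_Y - 1)
    # Assumed labelling: letter for the x band, number for the y band.
    return f"{string.ascii_uppercase[ix]}{iy + 1}"


# Label the centre of every grid cell to confirm the grid yields 50 distinct zones.
all_labels = {
    zone_xy(X_MIN + (i + 0.5) * (X_MAX - X_MIN) / N_X, (j + 0.5) * (Y_MAX - Y_MIN) / N_Y)
    for i in range(N_X)
    for j in range(N_Y)
}
print(len(all_labels))
print(zone_xy(99.0, 50.0))
```

Shots outside the final third return `None` here; whether such shots are dropped or kept in a separate category is a choice the sketch leaves open.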

The purpose of my analysis is to classify these Zones by shot conversion rate (SG) and cluster *similar* ones together. Instead of considering all shots, I am going to analyse only shots resulting from Regular play.

**3. The analysis**

As in my preceding blog, I analyse the variable Result, which takes the value 1 or 0 depending on whether the outcome of the shot is a goal or not. In contrast with my previous effort, I define Result as a continuous (numeric) variable, and therefore here I am using Decision Trees (DT) to perform a regression type of analysis.
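To see why treating a 0/1 Result as numeric works, note that the mean of Result in any node is simply goals divided by shots, that is, the SG of that node. The sketch below uses invented toy data and a single binary split chosen by squared error, which is the usual regression-tree criterion. It is only an illustration of the idea, not the multi-way clustering my DT software produces.

```python
from collections import defaultdict

# Toy shots, invented for illustration: (zone, Result) with Result 1 = goal.
shots = (
    [("A1", 0)] * 9 + [("A1", 1)] * 1
    + [("B2", 0)] * 6 + [("B2", 1)] * 4
    + [("C3", 0)] * 10
)


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the node mean (regression-tree impurity)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


by_zone = defaultdict(list)
for zone, result in shots:
    by_zone[zone].append(result)

# With a 0/1 target, the node mean is the conversion rate SG = goals / shots.
sg = {zone: mean(results) for zone, results in by_zone.items()}

# For squared error, the best binary split of a categorical predictor can be
# found by ordering the categories by mean and trying each cut point.
ordered = sorted(sg, key=sg.get)
best = None
for cut in range(1, len(ordered)):
    left = [r for z in ordered[:cut] for r in by_zone[z]]
    right = [r for z in ordered[cut:] for r in by_zone[z]]
    total = sse(left) + sse(right)
    if best is None or total < best[0]:
        best = (total, ordered[:cut], ordered[cut:])

print(sg)
print(best[1], best[2])
```

In this toy example the two low-SG zones end up grouped together and the high-SG zone is split off on its own, which is the kind of clustering of *similar* Zones I am after.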

The results are shown in the graph below (3.1). The top node is my starting point; it shows that the average conversion rate (SG) for all shots (8,495) is 0.106. I then split this total by its PatternofPlay components and obtain the Regular play node with the shots I want to analyse; then it is just a matter of running the algorithm, which creates the tree shown below.

**3.1 Conversion rate (SG) by Location (Zone_XY) – Regular play**

**Legend**: Shots (6,095) from Regular play have an SG of 0.096, which means that we can expect on average a goal to be scored roughly every 10 shots. The analysis finds, as expected, that the Zones_XY variable I have created is significant in explaining differences in SG, and creates seven Zone_XY clusters, each with an SG that is significantly different from the others (95% confidence level). These are shown ordered from left to right, and vary from Zones with near zero to 0.27 SG. If we map these results onto a football pitch we obtain the following picture.

**Map 3.1**

**Legend**: The map on the left shows the x,y location of shots (yellow) and goals (red). On the right, the same map is shown divided into my Zones_XY, with the colored ones mirroring the results of the DT analysis. We can see that most Zones (green and gray) have zero or negligible probability that a shot taken from them will result in a goal. There are however 15 Zones (green to red) where we can expect a better outcome, with SG varying from 0.01 to 0.27.

**4. Digging deeper – Teams**

DT is a great tool for exploratory analysis as it makes it easy to drill down into the data and find answers to the obvious questions a football analyst or fan may want to ask. For example, I can find out whether the (overall) *shooting profile* I have just discovered applies to all Teams in the MLS, or whether there are differences among them that are statistically significant.

For this analysis, I need to create a new variable Avg_SG, which maps each shot to the SG value of the Zone it was taken from, as computed in the previous analysis. The result is a *categorical* variable with seven categories (the 15 Zones share 7 different SG values) which I can now analyse using DT. The result is shown in the graph below, and summarised for easier readability in the table that follows it.

**Graph 4.1 Shot conversion profile – Teams**

**Legend**: The top node of this tree shows the Avg_SG variable I have just created and its six categories; this is the overall *shot conversion profile* (Regular play). We can see, for example, that the largest share of shots (29.6% of the total, or 1,680) is taken from Avg_04, that is, from the Zones which share an SG of 0.04. We can read the other categories (Avg_01, Avg_04, … etc.) in the same way.
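The Avg_SG variable and the overall profile read off the top node can be sketched in a few lines. The zone-to-SG lookup and the shot list below are invented for illustration. The label format for values other than 0.01 and 0.04 (for example `Avg_27`) is my guess at the naming pattern, and dropping shots from Zones outside the scoring clusters is likewise an assumption of the sketch.

```python
from collections import Counter

# Hypothetical zone -> SG lookup; the real values come from the tree in section 3.
zone_sg = {"E5": 0.27, "E6": 0.27, "D5": 0.04, "D6": 0.04, "C5": 0.01}


def avg_sg_label(zone):
    """Map a zone to a category label such as 'Avg_04' (SG of 0.04)."""
    sg = zone_sg.get(zone)
    if sg is None:
        return None  # assumption: zones outside the scoring clusters are left out
    return f"Avg_{round(sg * 100):02d}"


# Invented shots, identified only by the zone they were taken from.
shot_zones = ["D5", "D6", "E5", "C5", "D5", "A1"]

labels = [avg_sg_label(z) for z in shot_zones]
counts = Counter(label for label in labels if label is not None)
total = sum(counts.values())

# The profile is the share of shots falling in each Avg_SG category.
profile = {label: n / total for label, n in sorted(counts.items())}
print(profile)
```

The shares in `profile` add up to 1, which is how the figures in the top node should be read.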

Running the DT algorithm then creates five nodes (clusters), each one with a group of teams that share a *shot conversion profile* significantly different from the others. By *profile* here I mean the six values (vector) taken by the Avg_SG variable. For example, those of the Teams in the tagged node are: 0.079, 0.270, 0.184, 0.192, 0.131, 0.143. The table below (4.1) shows the *shot conversion profile* of each cluster of teams, expressed in % for easier readability and comparison between them.

**Table 4.1 Shot conversion profile – Teams**

**Legend**: All teams take most shots from Zones with an SG of 0.04 (Avg_04). Nearly half of {FC Dallas,…} shots come from the lowest SG Zones. In contrast {Columbus, …} have the best shooting record from the high SG Zones. I’ll leave it to readers (MLS fans in particular) to discover other interesting facts in the results shown.
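For those curious about what "significantly different profiles" can mean in practice, one common approach is a Pearson chi-square test on the table of shot counts by team and Avg_SG category. I am not claiming this is exactly what my DT software does internally; the sketch below, with two hypothetical teams and invented counts, only shows the idea.

```python
# Invented shot counts per Avg_SG category for two hypothetical teams.
categories = ["Avg_01", "Avg_04", "Avg_27"]
counts = {
    "Team X": [50, 30, 20],
    "Team Y": [30, 30, 40],
}


def profile(row):
    """Share of a team's shots in each category (the profile vector)."""
    total = sum(row)
    return [n / total for n in row]


def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat


stat = chi_square(list(counts.values()))

# Critical value of chi-square with (2 - 1) * (3 - 1) = 2 degrees of freedom
# at the 95% confidence level.
CRITICAL_95_DF2 = 5.991
significantly_different = stat > CRITICAL_95_DF2

for team, row in counts.items():
    print(team, profile(row))
print(stat, significantly_different)
```

Teams whose profiles do not differ significantly would be merged into the same cluster, and the process repeated until the remaining groups are all distinct from one another.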

**5. Final notes**

While for this analysis I focused on shot *Location*, the DT algorithm tells me that shot *Direction* (the *trajectory* of the shot to the goal face) is ranked before Zones_XY as a predictor of SG. So perhaps a better predictor of shot conversion would take both shot location and direction into account.

As one would expect, shooters also have ‘preferred’ shooting zones, and thus different profiles. DT found 14 of them for the 58 players considered – far too many to be included in this blog.

I was going to compare these results with those obtained by others using the expG metrics, and draw some conclusions. Unfortunately, I realised that this effort would take too much of my time and was better left to a later blog.