Predicting NFL Pass Completion

Dec 11, 2020 · 12 min read

Dylan Nikol and Alex McGraw

Abstract

In this post, we will discuss our submission to the 2021 NFL Big Data Bowl. The Big Data Bowl is a data science competition on Kaggle hosted by the NFL. In years past, the competition has focused on predicting expected points and rushing play outcomes. This year, the focus is on evaluating passing plays from the 2018 season using Next Gen Stats. Next Gen Stats capture players’ coordinates, orientation, direction, and trajectory via sensors in their padding. For our submission, we performed data analysis and visualization, and built models to classify passing plays as either Complete or Not Complete (Not Complete includes incomplete passes, sacks, and interceptions). Our models use play characteristics and the target receiver’s coordinates on a play-by-play basis to classify a passing play at the time the football is snapped. These predictions can be used for real-time analysis by both teams and sports betting companies.

Introduction

Goal: Develop a model to classify play completion at the beginning of a play.

Application: To aid in understanding the effect of target receiver location and route on the completion of a play.

Methods: We evaluated the following classification methods: logistic regression, random forest, XGBoost, support vector machine, and multi-layer perceptron neural network.

Analytics in Football: As of 2020, sports analytics is a billion dollar industry, and it is expected to grow to $4.3 billion by 2025 [1]. Although the majority of techniques currently used in the field are purely statistical, there is growing interest in the machine learning community. Prediction models can have a powerful impact in sports, especially in the NFL. Teams can use continuous prediction models to evaluate plays and players both in real time and in post-game analysis. This can aid teams in making better informed decisions and in identifying weaknesses in other teams. Additionally, sports betting companies can offer live betting with more attractive spreads, and fans can gain a better understanding of plays. Finally, an implication that arguably supersedes those previously discussed is the potential impact of analytics on injury prevention. A 2017 study led by Dr. Ann McKee, an expert in neurodegenerative disease, found that 110 of the 111 analyzed brains of former NFL players had chronic traumatic encephalopathy (CTE) [2]. CTE is a degenerative brain disease associated with impaired judgment, aggression, memory loss, and depression. The likely cause of CTE is repetitive head injury. With fine-tuned motion statistics, it is now possible to learn from the trajectories, route paths, and orientations that result in head collisions, take preventative actions, and make football a safer sport. ACL tears are another common injury; they result from players making quick turns or pivots in positions that overstretch the ligaments in their knees. We computed the rate of change in orientation of each player’s vector components throughout each play. These calculations can help develop personalized injury prevention systems for players.

Inspiration for this analysis and code used:

  1. @jdruzzi’s Big Data Bowl 2021 submission: Shadow Cornerback + Coverage Analysis
  2. @statsbymichaellopez’s notebook NFL tracking: wrangling, Voronoi, and sonars
  3. @asonty’s ngs_highlights GitHub

Note: In this post, we will discuss our current submission to the competition. At the end of the post, we will discuss the model we are currently building: predicting target receiver trajectory. The inspiration for this analysis and the roadmap we are following are outlined in this AWS Machine Learning blog post.

Data

All of the data we used was provided to us upon entry to the competition. The data includes game, player, play, and tracking data. In total there are 57 unique features. Here is a brief description of the datasets:

Game data: a csv file containing the teams playing in each game

Players data: a csv file containing player information for all players in the tracking data

Plays data: a csv file containing passing play information from each game

Tracking data: 17 csv files containing player and football tracking data, one for each week of the regular season

Target data: a supplementary csv file containing information on the target receiver for each passing play

A full description of the data and each variable can be found here. Target data can be found here.

Pre-Processing & Exploration

There were 253 games in the 2018 regular season and 19,239 plays in the datasets. To bring the data to life we made plots showing different time steps throughout a play. These plots are from the last play of a game between the Miami Dolphins and the New England Patriots in Week 14.

Figure 1 is a snapshot of the players’ positioning on the field at the time the ball was snapped. The ball is shown as a white circle with no number inside of it.

Figure 1
Figure 2

Figure 2 shows the players’ positioning around 3.5 seconds into the play. Here we have included the velocity vectors of each player.

Figure 3

Figure 3 is the same frame as the previous plot (3.5 seconds in), except we’ve removed the velocity vectors, and instead included a Voronoi analysis. This is a popular technique in analyzing European football games, and essentially it’s a way of graphically representing control of the field.

So imagine a bubble expanding around each player on the field; the bubbles expand until they collide with another bubble, at which point they form a line. These lines define each player’s area of control. The Voronoi diagram has some pretty cool properties: each point on a line separating two players is equidistant from both players, and at any corner where three lines meet, all three players are equidistant from that point.
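
For readers who want to try this, here is a minimal sketch of a Voronoi tessellation with SciPy; the player coordinates below are randomly generated stand-ins for the x/y columns of a single tracking frame, not the actual play data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# Hypothetical (x, y) positions for the 22 players in one tracking frame.
# In the real data these would come from the x/y columns for a single frameId.
rng = np.random.default_rng(0)
points = np.column_stack([rng.uniform(20, 60, 22),    # x: yard line
                          rng.uniform(0, 53.3, 22)])  # y: field width in yards

vor = Voronoi(points)   # compute the tessellation
voronoi_plot_2d(vor)    # each cell is the region of the field closest to one player
plt.xlim(0, 120)
plt.ylim(0, 53.3)
plt.show()
```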

Finally, to fully grasp the data, we made an animation using @asonty’s ngs_highlights GitHub.

Note: The yellow highlights show the fastest player on each team throughout the play.

This ended up being one of the craziest plays of the season: a double lateral play to win the game (and defeat Tom Brady) as the clock expired.

Figure 4

Filtering

We removed the 1,568 plays with penalties and were left with 17,653 plays to analyze. We used Joe Andruzzi’s notebook to calculate distance metrics for players on the field. The metrics include distance to the football, distance to the quarterback, distance to the closest teammate, and distance to the closest opponent. From here we filtered the data so we were left with only the target receivers’ distance metrics at the time of the ball snap. We converted Pass Result (a string of one of four values: complete, incomplete, interception, or sack) into a binary label (complete = 1, else = 0). Figure 4 is an example of a single row.
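
As an illustration of the distance metrics, here is a rough sketch of how distance to the football at the snap could be computed; the DataFrame name `tracking` and the exact use of the gameId, playId, event, x, y, and displayName columns are assumptions, and the actual calculations follow Joe Andruzzi’s notebook:

```python
import numpy as np
import pandas as pd

# Assumes a tracking DataFrame with gameId, playId, frameId, x, y, and
# displayName columns (the ball's rows have displayName == "Football").
def distance_to_football(frame: pd.DataFrame) -> pd.DataFrame:
    """Add each player's Euclidean distance to the ball for one frame."""
    ball = frame.loc[frame["displayName"] == "Football", ["x", "y"]].iloc[0]
    frame = frame.copy()
    frame["dist_to_football"] = np.hypot(frame["x"] - ball["x"],
                                         frame["y"] - ball["y"])
    return frame

# Keep only the snap frame, then compute the metric play by play.
snap = tracking[tracking["event"] == "ball_snap"]
snap = snap.groupby(["gameId", "playId"], group_keys=False).apply(distance_to_football)
```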

We standardized features like Height that were formatted in both “inches” and “feet-inches.” Finally, we turned Route and Position into dummy variables and filtered out plays where the target receiver route was undefined. After this process, we had 17,639 plays with 54 features.
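
A hedged sketch of this encoding step is below; the play-level DataFrame `df`, the “C” code for complete passes, and the exact column names are assumptions based on the competition’s data dictionary rather than our exact code:

```python
import pandas as pd

def height_to_inches(h: str) -> int:
    """Handle heights recorded as either '73' (inches) or '6-1' (feet-inches)."""
    h = str(h)
    if "-" in h:
        feet, inches = h.split("-")
        return int(feet) * 12 + int(inches)
    return int(h)

df["height_in"] = df["height"].apply(height_to_inches)
df["complete"] = (df["passResult"] == "C").astype(int)  # assumed: "C" = complete pass
df = df[df["route"].notna()]                            # drop plays with undefined routes
df = pd.get_dummies(df, columns=["route", "position"])  # dummy-encode Route and Position
```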

Here is a correlation matrix of the features, excluding string and identification variables.

Correlation Matrix
Figure 5

We then performed an analysis on different routes run by target receivers. Figure 5 shows the percentages of each route run by target receivers. We see that Hitch routes are the most common target receiver route, followed by Out, Go, and Cross routes.

Figure 6

Figure 6 shows the percentages of completed passes given a route. Screen routes had the highest completion rate, followed by Flat routes.

Figure 7

Figure 7 shows the posterior probabilities of routes, given a completed pass. Hitch routes have the highest posterior probability, followed by Out and Cross routes.

Train Test Split

We first removed the identification and string variables and then split the data into a train and test set, using a test size of 25%. We further split the training data into a train and validation set, using a validation set size of 25%. We had 9,921 observations in the training set, 3,308 observations in the validation set, and 4,410 observations in the test set.
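
With scikit-learn, the nested split looks roughly like this, where X and y stand in for the feature matrix and the binary completion label (the random_state value is arbitrary):

```python
from sklearn.model_selection import train_test_split

# 75/25 train/test split, then a further 75/25 split of the training
# portion into train/validation sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)
```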

Modeling

Logistic Regression

The first model we used to fit the data was logistic regression. We started with logistic regression because it’s a relatively simple model that has had great success in binary classification. The logistic function, also known as the sigmoid function, is an S-shaped curve that maps any real number to a value between 0 and 1.

logistic function: σ(x) = 1 / (1 + e^(−x))

e = the base of the natural logarithm

x = input value

Similar to linear regression, in logistic regression, input values (x) are linearly combined using weights or coefficients (β). The difference between linear regression and logistic regression is that in the latter, the dependent variable is binary (0 or 1). An advantage of using logistic regression is that it is simple, computationally inexpensive, and powerful. It is often used as a benchmark model, and feature coefficients can suggest feature importance. A drawback of logistic regression is that it is not suited for nonlinear problems, where it can be easily outperformed by more complex algorithms. It is also sensitive to outliers and collinearity. In our logistic regression, and in all subsequent models, 0 represents the negative class, Not Complete, and 1 represents the positive class, Complete.
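
A minimal scikit-learn version of this baseline might look like the following; the scaling step and the max_iter value are our own illustrative choices rather than the exact configuration we used:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling helps the solver converge and makes the learned coefficients
# roughly comparable as a crude measure of feature importance.
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)
print("Validation accuracy:", logit.score(X_val, y_val))
```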

Random Forest

(Source: Google Images)

Perhaps one of the most popular machine learning classification models is random forest [5]. Random forest combines the predictions of many individual decision trees to make class predictions. The concept that drives the logic behind random forest is that a group (forest) of uncorrelated models (trees) can together outperform a singular model (tree). This is known as the “wisdom of the crowd.”

The random forest algorithm is a “bagging” method, meaning that it draws random bootstrapped samples for the training set. Random forest’s advantage over traditional bagging methods is that it also draws random subsets of features to train each tree. This helps keep the trees relatively uncorrelated with one another and reduces the variance of the combined model.
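
As a sketch, a scikit-learn random forest that bootstraps samples and subsamples features at every split looks like this (the number of trees is an assumed value):

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is fit on a bootstrap sample, and only a random subset of
# features is considered at every split (sqrt of the feature count by default).
rf = RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
print("Validation accuracy:", rf.score(X_val, y_val))
```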

XGBoost

eXtreme Gradient Boosting (XGBoost) has become a go-to model in the data science toolkit because it is extremely powerful and easy to implement. Similar to random forest, XGBoost is an ensemble method, meaning it employs multiple models (trees), but unlike random forest, XGBoost leverages gradient boosting. Instead of training each tree in isolation, with gradient boosting, trees are trained in succession, and each new tree aims to correct the errors of the previous trees. This process repeats for a set number of rounds or until additional trees no longer improve the model.

We ran a randomized search to tune the hyperparameters of our model, then gradually increased gamma, which acts as a regularization term that makes the algorithm more conservative about adding new splits.
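
The search space below is purely illustrative, but it shows the general shape of a randomized search over XGBoost hyperparameters with gamma included as the regularization knob:

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Hypothetical search space; the grid we actually used may differ.
param_dist = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.8, 1.0],
    "gamma": [0, 0.5, 1, 2],  # larger gamma = stronger regularization on splits
}

search = RandomizedSearchCV(XGBClassifier(), param_distributions=param_dist,
                            n_iter=25, scoring="roc_auc", cv=3, random_state=42)
search.fit(X_train, y_train)
xgb = search.best_estimator_
```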

Support Vector Machine

(Source: https://towardsdatascience.com/support-vector-machine-vs-logistic-regression-94cc2975433f)

The goal of a support vector machine (SVM) is to find the optimal decision boundary or “hyperplane” that distinctly classifies the data. The optimal hyperplane is the one that achieves the maximum margin between the data points of each class. The benefit of employing an SVM is that it performs well in high-dimensional feature spaces and is relatively memory efficient, since the decision boundary depends only on the support vectors.
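
A minimal sketch with scikit-learn’s SVC is shown below; the RBF kernel and the scaling step are assumptions, and probability=True is only there so the model can produce scores for the ROC curve later:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# SVMs are sensitive to feature scale, so standardize first.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
svm.fit(X_train, y_train)
print("Validation accuracy:", svm.score(X_val, y_val))
```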

Multi-Layer Perceptron

(Source: https://www.researchgate.net/figure/A-hypothetical-example-of-Multilayer-Perceptron-Network_fig4_303875065)

Multi-Layer Perceptron (MLP) is the original deep learning model and is still utilized today. We chose this model because it is the least complex deep learning model, yet it still performs well in classification tasks. Using this model was also a test to see whether a simple deep learning model would outperform more complex machine learning models.

To implement the MLP, we had to drop 15 rows with null values. To prevent overfitting, we set early stopping to True, which terminates training once the validation score stops improving.
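
A sketch of an MLPClassifier with early stopping enabled is below; the hidden layer sizes are assumed, and the mask simply mirrors the handful of rows with missing values that we dropped:

```python
from sklearn.neural_network import MLPClassifier

# Early stopping holds out a fraction of the training data and halts
# training once the validation score stops improving for n_iter_no_change epochs.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), early_stopping=True,
                    n_iter_no_change=10, max_iter=500, random_state=42)

mask = X_train.notna().all(axis=1)  # MLPClassifier cannot handle NaNs
mlp.fit(X_train[mask], y_train[mask])
```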

Results

A Receiver Operating Characteristic curve, or ROC curve, is a great way to visualize the performance of classification models. The ROC curve plots the true positive rate against the false positive rate across classification thresholds, and the area under the curve (AUC) measures how well a model can distinguish between the classes.

From the ROC curve, we can see AUC scores for the top four classifiers, along with the baseline. The top-left corner of the chart is a perfectly performing model, always predicting the correct class. The bottom left corner is a model that always predicts the negative class, or “not complete” in our case, and never predicts any false positives. The top right corner would be a model that always predicts the positive class, or “complete” in our case, and never predicts any false negatives.

The ROC curve suggests that the marginal differences in performance between the models are negligible. The MLP Classifier and XGBoost Classifier perform slightly better than the Support Vector Classifier and the Random Forest Classifier. We did not include logistic regression in the ROC curve because it made the plot too cluttered, but its curve was just below the Random Forest Classifier’s.
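
For reference, ROC curves and AUC scores like the ones in the chart can be produced along these lines, where xgb, mlp, svm, and rf refer to the hypothetical fitted models from the sketches above (assuming the test set has no missing values):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

fig, ax = plt.subplots()
for name, model in [("XGBoost", xgb), ("MLP", mlp), ("SVM", svm), ("Random Forest", rf)]:
    scores = model.predict_proba(X_test)[:, 1]       # predicted P(Complete)
    fpr, tpr, _ = roc_curve(y_test, scores)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

ax.plot([0, 1], [0, 1], "--", label="Baseline (AUC = 0.5)")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
plt.show()
```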

The table on the left shows the AUC score, the error rate, and the F1 score for each model. The error rate is calculated as 1 minus the accuracy of the model. Accuracy is defined as the number of correct predictions over all predictions. The F1 score is the harmonic mean of precision and recall (sensitivity). The advantage of using an F1 score is that you can evaluate a classification model with a single metric.
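
The table’s metrics can be reproduced roughly as follows, again using the hypothetical model variables from the earlier sketches:

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

for name, model in [("XGBoost", xgb), ("MLP", mlp), ("SVM", svm), ("Random Forest", rf)]:
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_test, probs):.3f}, "
          f"error rate = {1 - accuracy_score(y_test, preds):.3f}, "
          f"F1 = {f1_score(y_test, preds):.3f}")
```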

From these values, we can draw the same conclusion the ROC curve suggests: XGBoost is the best-performing model, while the rest are not far behind.

Best Performing Model: XGBoost

XGBoost Confusion Matrix

The XGBoost model’s accuracy was 11.8% better than the baseline. Two good metrics for evaluating classification models are sensitivity (aka recall) and specificity (aka selectivity). Sensitivity measures, out of all actual positives, how many the model predicted as positive. Sensitivity is calculated as:

TP / (TP + FN)

Specificity measures, out of all actual negatives, how many the model predicted as negative. Specificity is calculated as:

TN / (TN + FP)

The model is very sensitive, with a sensitivity of 92%; however, the specificity is only 41%. This tells us that our model is great at predicting Complete passes, but it is not great at predicting Not Complete passes. One way that we can combat this is by increasing the positive class threshold. This would decrease the number of times the model incorrectly classifies Not Complete passes as Complete.
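
The sketch below shows how sensitivity and specificity shift as the positive-class threshold is raised (xgb is the hypothetical fitted XGBoost model from the earlier sketches, and the thresholds are arbitrary examples):

```python
from sklearn.metrics import confusion_matrix

probs = xgb.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.6, 0.7):                 # raise the positive-class cutoff
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"threshold = {threshold}: sensitivity = {sensitivity:.2f}, "
          f"specificity = {specificity:.2f}")
```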

Conclusion

Predicting pass completions is not an easy feat. There are a lot of moving parts in a play, many of which appear to be quite erratic. The improvement from a simple logistic regression to the XGBoost model was negligible. We believe the model was hindered from the start; other than height and weight, we didn’t have many player-specific metrics to develop the model. Also, because football plays are so complex, the starting location of the target receiver doesn’t tell us much about whether the pass will be completed.

Our next steps are to delve into time-series predictions and predict receiver trajectories based on their previous trajectory and the trajectories of other players on the field. This idea was inspired by the AWS Machine Learning blog post produced in conjunction with NFL Next Gen Stats. Predictions of where a receiver will be can help teams better evaluate receivers and defensive backs. We plan on using the AWS researchers’ framework of a 1-D Convolutional Neural Network (1-D CNN) and a Long Short-Term Memory (LSTM) network to make predictions.

NFL Big Data Bowl 2021: https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview

Github: https://github.com/d-r-n/APM-Term-Project

References

1 https://www.researchandmarkets.com/reports/4900339/global-sports-analytics-market-2019-2025

2 http://www.bu.edu/articles/2017/cte-former-nfl-players/

3 https://iq.opengenus.org/advantages-and-disadvantages-of-logistic-regression/

4 https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222

5 https://towardsdatascience.com/understanding-random-forest-58381e0602d2

6 https://towardsdatascience.com/a-beginners-guide-to-xgboost-87f5d4c30ed7

7 https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47

8 https://scikit-learn.org/stable/modules/neural_networks_supervised.html

9 https://aws.amazon.com/blogs/machine-learning/predicting-defender-trajectories-in-nfls-next-gen-stats/
