A Closer Look at the NFL Draft

By: Siddharaj Vaghela and Sai Pesari

Group Contribution Statement: Both members of the group got together on video calls and worked through the coding and writing.

Introduction

One of the biggest events of the year in the National Football League (NFL) is the annual draft. In the NFL Draft, teams choose top collegiate players in the nation to join their team. There are 7 rounds in the draft, with 32 picks per round plus compensatory picks in the later rounds. More information on the NFL Draft can be found here: https://operations.nfl.com/the-players/the-nfl-draft/the-rules-of-the-draft/

The wide receiver position is one of the most important, and it is one that teams are most eager to fill. In this project, we analyze the NFL draft from the wide receiver perspective. We examine the association between a wide receiver's combine performance, draft position, and rookie-year performance. We also build our own model to predict the round in which a receiver was drafted, and then evaluate the accuracy of that model.

Here is a link to a Wikipedia page explaining what the NFL combine is: https://en.wikipedia.org/wiki/NFL_Scouting_Combine

Data Curation, Parsing, and Management

We started by reading in the CSV file that contains player data. Each row in the dataset is an individual player, and the attributes include the round and pick at which the player was drafted, height, weight, 40-yard dash, shuttle run, and various other combine statistics. The dataset as a whole contains draft data for players from 2000-2018. We will focus on wide receivers who were drafted (undrafted players are excluded) from 2008 to 2017.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import statsmodels.api as sm
from statsmodels.formula.api import ols
In [2]:
df = pd.read_csv("combine_data_since_2000_PROCESSED_2018-04-26.csv")
# Build every condition from df itself so the boolean mask lines up with its index
# (the original year filter was built from df but applied to df_wr, which raised a reindexing warning)
wr_filter = (df["Pos"] == "WR") & (df["Year"] >= 2008) & df["Round"].notna()
df_wr = df[wr_filter]
df_wr
Out[2]:
Player Pos Ht Wt Forty Vertical BenchReps BroadJump Cone Shuttle Year Pfr_ID AV Team Round Pick
2609 Adrian Arrington WR 75 203 4.55 NaN NaN NaN NaN NaN 2008 ArriAd00 1.0 New Orleans Saints 7.0 237.0
2610 Donnie Avery WR 71 192 4.43 NaN 16.0 NaN NaN NaN 2008 AverDo00 9.0 St. Louis Rams 2.0 33.0
2621 Earl Bennett WR 71 209 4.48 26.0 15.0 110.0 7.15 4.22 2008 BennEa00 12.0 Chicago Bears 3.0 70.0
2650 Keenan Burton WR 72 201 4.44 38.5 10.0 125.0 6.77 4.20 2008 BurtKe00 3.0 St. Louis Rams 4.0 128.0
2652 Andre Caldwell WR 72 204 4.35 33.0 NaN 124.0 6.75 4.11 2008 CaldAn00 8.0 Cincinnati Bengals 3.0 97.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5833 Ryan Switzer WR 68 181 4.51 32.0 11.0 116.0 6.77 4.00 2017 SwitRy00 1.0 Dallas Cowboys 4.0 133.0
5838 Trent Taylor WR 68 181 4.63 33.0 13.0 117.0 6.74 4.01 2017 TaylTr02 3.0 San Francisco 49ers 5.0 177.0
5839 Taywan Taylor WR 71 203 4.50 33.5 13.0 132.0 6.57 4.21 2017 TaylTa00 2.0 Tennessee Titans 3.0 72.0
5864 Dede Westbrook WR 72 178 NaN NaN NaN NaN NaN NaN 2017 WestDe00 3.0 Jacksonville Jaguars 4.0 110.0
5871 Mike Williams-04 WR 76 218 NaN 32.5 15.0 121.0 NaN NaN 2017 WillMi07 1.0 Los Angeles Chargers 1.0 7.0

280 rows × 16 columns
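
Several combine columns in the table above contain NaN where a player has no recorded result for a drill. A quick way to see how much is missing per column is sketched below on a small made-up frame. The `toy` frame and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the shape of the combine data (not real players)
toy = pd.DataFrame({
    "Forty": [4.55, 4.43, np.nan, 4.44],
    "Vertical": [np.nan, np.nan, 26.0, 38.5],
    "Cone": [np.nan, np.nan, 7.15, 6.77],
})

# Count missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count rows that have every drill recorded
complete_rows = len(toy.dropna())
print(complete_rows)
```

Running the same two calls on `df_wr` would show how many receivers remain once incomplete rows are dropped, which matters later when the regression model needs several drills at once.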

We now have pre-draft data about drafted wide receivers from 2008-2017. From here, we can gather data from the rookie season of each of these players (the season immediately after they were drafted). This data will let us analyze correlations between draft position and immediate performance in the NFL. To get it, we scrape a website called Pro Football Reference. The data comes as one table per season, so we need to scrape 10 seasons of it (2008-2017). We wrote a for loop that scrapes each season's table from Pro Football Reference and keeps only the wide receiver rows.

In [3]:
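import requests
from io import StringIO
from bs4 import BeautifulSoup

# Browser-like User-Agent sent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}

season_tables = []
for year in range(2008,2018):
    url = "https://www.pro-football-reference.com/years/" + str(year) + "/receiving.htm"
    r = requests.get(url, headers=headers)
    parse = BeautifulSoup(r.content, "html.parser")
    # The first table on the page holds the receiving stats
    dat = parse.find("table")
    # read_html returns a list of DataFrames; StringIO keeps newer pandas versions from warning
    dat_dat = pd.read_html(StringIO(str(dat)))[0]
    dat_dat["Season"] = year
    # Position labels appear in both upper and lower case
    dat_dat = dat_dat.loc[(dat_dat['Pos']=="WR") | (dat_dat['Pos'] == "wr")]
    season_tables.append(dat_dat)

# DataFrame.append was removed in pandas 2.0, so the seasons are combined with pd.concat
wr_data = pd.concat(season_tables)

wr_data.head()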
Out[3]:
Rk Player Tm Age Pos G GS Tgt Rec Ctch% Yds Y/R TD 1D Lng Y/Tgt R/G Y/G Fmb Season
0 1 Andre Johnson *+ HOU 27 WR 16 16 171 115 67.3% 1575 13.7 8 79 65 9.2 7.2 98.4 1 2008
1 2 Wes Welker* NWE 27 WR 16 14 149 111 74.5% 1165 10.5 3 57 64 7.8 6.9 72.8 1 2008
2 3 Brandon Marshall* DEN 24 WR 15 15 181 104 57.5% 1265 12.2 6 67 47 7.0 6.9 84.3 4 2008
3 4 Larry Fitzgerald*+ ARI 25 WR 16 16 154 96 62.3% 1431 14.9 12 66 78 9.3 6.0 89.4 1 2008
5 6 T.J. Houshmandzadeh CIN 31 WR 15 15 137 92 67.2% 904 9.8 4 51 46 6.6 6.1 60.3 0 2008
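
The loop above requests one page per season. A minimal sketch of building those URLs up front and pausing between requests is shown below. The helper names `season_urls` and `fetch_all`, the 3-second delay, and the short User-Agent string are assumptions made for illustration, and the sketch uses the standard library's `urllib` in place of `requests`.

```python
import time
import urllib.request

BASE = "https://www.pro-football-reference.com/years/{year}/receiving.htm"

def season_urls(first_year, last_year):
    """Return one receiving-stats URL per season, inclusive of both ends."""
    return [BASE.format(year=y) for y in range(first_year, last_year + 1)]

def fetch_all(urls, delay_seconds=3.0, user_agent="Mozilla/5.0"):
    """Download each URL in turn, sleeping between requests (delay is an assumed value)."""
    pages = {}
    for url in urls:
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req) as resp:
            pages[url] = resp.read()
        time.sleep(delay_seconds)
    return pages

# 2008 through 2017 inclusive is ten seasons
urls = season_urls(2008, 2017)
print(len(urls))
print(urls[0])
```

Only the URL list is built here; calling `fetch_all(urls)` would perform the actual downloads.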

Now we'll join the tables so we can correlate rookie season stats with draft position.

In [4]:
pd.set_option('display.max_columns', 500)
rookie_stats = pd.merge(wr_data, df_wr, on='Player')
year2_filter = rookie_stats["Season"] == rookie_stats["Year"]
rookie_stats = rookie_stats[year2_filter]
rookie_stats
Out[4]:
Rk Player Tm Age Pos_x G GS Tgt Rec Ctch% Yds Y/R TD 1D Lng Y/Tgt R/G Y/G Fmb Season Pos_y Ht Wt Forty Vertical BenchReps BroadJump Cone Shuttle Year Pfr_ID AV Team Round Pick
0 7 Eddie Royal DEN 22 WR 15 15 129 91 70.5% 980 10.8 5 43 93 7.6 6.1 65.3 2 2008 WR 70 184 4.39 36.0 24.0 124.0 7.07 4.34 2008 RoyaEd00 19.0 Denver Broncos 2.0 42.0
6 35 DeSean Jackson PHI 22 WR 16 15 120 62 51.7% 912 14.7 2 43 60 7.6 3.9 57.0 4 2008 WR 70 169 4.35 NaN NaN 120.0 NaN NaN 2008 JackDe00 34.0 Philadelphia Eagles 2.0 49.0
13 56 Donnie Avery STL 24 WR 15 12 102 53 52.0% 674 12.7 3 29 69 6.6 3.5 44.9 0 2008 WR 71 192 4.43 NaN 16.0 NaN NaN NaN 2008 AverDo00 9.0 St. Louis Rams 2.0 33.0
17 39 Austin Collie IND 24 wr 16 5 89 60 67.4% 676 11.3 7 37 39 7.6 3.8 42.3 0 2009 WR 73 200 4.53 34.0 17.0 120.0 6.78 4.24 2009 CollAu00 18.0 Indianapolis Colts 4.0 127.0
24 46 Jeremy Maclin PHI 21 WR 15 13 91 56 61.5% 773 13.8 4 31 56 8.5 3.7 51.5 0 2009 WR 72 198 4.43 NaN NaN NaN NaN NaN 2009 MaclJe00 23.0 Philadelphia Eagles 1.0 19.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
358 50 JuJu Smith-Schuster PIT 21 wr 14 7 79 58 73.4% 917 15.8 7 37 97 11.6 4.1 65.5 0 2017 WR 73 215 4.54 32.5 15.0 120.0 NaN NaN 2017 SmitJu00 10.0 Pittsburgh Steelers 2.0 62.0
361 118 Corey Davis TEN 22 WR 11 9 65 34 52.3% 375 11.0 0 17 37 5.8 3.1 34.1 1 2017 WR 75 209 NaN NaN NaN NaN NaN NaN 2017 DaviCo03 3.0 Tennessee Titans 1.0 5.0
363 150 Kenny Golladay DET 24 wr 11 5 48 28 58.3% 477 17.0 3 18 54 9.9 2.5 43.4 0 2017 WR 76 218 4.50 35.5 18.0 120.0 7.00 4.15 2017 GollKe00 4.0 Detroit Lions 3.0 96.0
365 159 Dede Westbrook JAX 24 wr 7 5 51 27 52.9% 339 12.6 1 15 29 6.6 3.9 48.4 1 2017 WR 72 178 NaN NaN NaN NaN NaN NaN 2017 WestDe00 3.0 Jacksonville Jaguars 4.0 110.0
368 325 Josh Malone CIN 21 WR 11 7 17 6 35.3% 63 10.5 1 3 25 3.7 0.5 5.7 0 2017 WR 75 208 4.40 30.5 10.0 121.0 7.05 4.19 2017 MaloJo00 1.0 Cincinnati Bengals 4.0 128.0

63 rows × 35 columns
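
The merge above matches on the exact Player string, and some scraped names in Out[3] carry trailing markers such as * and +. If a rookie's row carried such a marker, it would silently drop out of the join. The sketch below shows one way to strip the markers before merging. The helper `clean_player_names` and the placeholder names and numbers are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def clean_player_names(names: pd.Series) -> pd.Series:
    """Drop '*' and '+' markers and surrounding whitespace from player names."""
    return names.str.replace(r"[*+]", "", regex=True).str.strip()

# Placeholder names and numbers, for illustration only
scraped = pd.DataFrame({"Player": ["Player A *+", "Player B*", "Player C"],
                        "Yds": [900, 700, 500]})
combine = pd.DataFrame({"Player": ["Player A", "Player B", "Player C"],
                        "Round": [1.0, 2.0, 4.0]})

# Matching on the raw names only finds the player without a marker
raw_matches = len(pd.merge(scraped, combine, on="Player"))

# After cleaning, every player is matched
scraped["Player"] = clean_player_names(scraped["Player"])
clean_matches = len(pd.merge(scraped, combine, on="Player"))

print(raw_matches, clean_matches)
```

Comparing the row count before and after cleaning on the real tables would show whether any rookies are being lost this way.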

Exploratory Data Analysis

There are many ways to measure performance for wide receivers. For more information on how performance is measured in the NFL for wide receivers, feel free to click on the following link!

https://www.footballoutsiders.com/stats/nfl/wr/2019

Two of the most common measures of a wide receiver’s success are receiving yards and touchdowns, which is what we will use to measure a receiver’s production. First, let’s look at the relationship between receiving yards per game and the round in which a receiver was selected. The graph below shows the average yards per game in a wide receiver’s rookie year plotted against the round that receiver was taken in, aggregated over 2008-2017. There does seem to be a trend: on average, the later the round a wide receiver was taken in, the fewer yards per game he averaged in his first season. According to the graph, the average yards per game goes down with each round deeper into the draft, except for round 5, which seems to be an outlier.

One plausible explanation is that players taken earlier in the draft are perceived to have more skill and potential, which earns them more playing time during their rookie season and better production if that evaluation holds up. Late-round players may not have the same skill, and sometimes lack the coaches' trust to the point where they get very little playing time in their first year.

In [5]:
# Position within the round: pick 1 -> 1, pick 32 -> 32, pick 33 -> 1.
# This assumes 32 picks per round, so it is only approximate in rounds with compensatory picks.
rookie_stats["Pick_Per_Round"] = (rookie_stats["Pick"] - 1) % 32 + 1
rookie_stats = rookie_stats.astype({"Y/G": float, "GS": float, "TD": float})
rookie_stats = rookie_stats.rename(columns={"Y/G": "Yards_Per_Game", "GS": "Games_Started"})

# Average rookie-year yards per game for each draft round
ypr = rookie_stats.groupby("Round")["Yards_Per_Game"].mean().reset_index()
In [6]:
ypr.plot.scatter(x = 'Round', y = 'Yards_Per_Game', figsize=(12, 10))
sns.regplot(x = 'Round', y = 'Yards_Per_Game', data = ypr)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f94179d6640>
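
Each point in the plot above is a per-round mean, and with 63 rookies in total some rounds may contribute only a handful of players. A single strong season can then move a point noticeably, which might be what is happening in round 5. The sketch below reports the group size next to each mean. The `toy` values are made up for illustration.

```python
import pandas as pd

# Made-up values, not taken from the scraped data
toy = pd.DataFrame({
    "Round": [1, 1, 1, 2, 2, 5],
    "Yards_Per_Game": [60.0, 40.0, 50.0, 45.0, 35.0, 55.0],
})

# Mean yards per game and number of players in each round
summary = toy.groupby("Round")["Yards_Per_Game"].agg(["mean", "count"])
print(summary)
```

Applying the same `agg(["mean", "count"])` call to `rookie_stats` would show how many rookies sit behind each point in the plot.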

Next, we compare how players within the same round performed. Below are 6 plots, one per round for rounds 1 through 6. Each plots yards per game against the player’s pick position within the round. The fitted lines are usually close to horizontal, meaning that position within a round mattered little. For instance, a player picked 15th in the 1st round won’t necessarily do far better than the last player picked in that round.

In [7]:
# One scatter plot with a fitted line for each of rounds 1 through 6
for rnd in range(1, 7):
    round_df = rookie_stats[rookie_stats['Round'] == rnd]
    round_df.plot.scatter(x = 'Pick_Per_Round', y = 'Yards_Per_Game', figsize=(12, 10))
    sns.regplot(x = 'Pick_Per_Round', y = 'Yards_Per_Game', data = round_df)
    plt.title("Round " + str(rnd))
    plt.show()
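
Judging whether a fitted line is horizontal by eye can be backed up with the slope itself. The sketch below fits a straight line within each round using `np.polyfit` and collects the slopes. The `toy` frame is made-up data with deliberately flat trends, so the numbers say nothing about the real rookies.

```python
import numpy as np
import pandas as pd

# Made-up data with roughly flat trends inside each round
toy = pd.DataFrame({
    "Round": [1, 1, 1, 1, 2, 2, 2, 2],
    "Pick_Per_Round": [5, 10, 20, 30, 3, 12, 22, 31],
    "Yards_Per_Game": [50.0, 52.0, 49.0, 51.0, 40.0, 38.0, 41.0, 39.0],
})

slopes = {}
for rnd, group in toy.groupby("Round"):
    # np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
    slope, intercept = np.polyfit(group["Pick_Per_Round"], group["Yards_Per_Game"], deg=1)
    slopes[int(rnd)] = float(slope)

print(slopes)
```

A slope near zero means yards per game barely changes with position inside the round. With only a few players per round the slope estimate is noisy, so it is best read alongside the group size.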

In this next graph, we plot the average number of games started during the rookie season against the overall pick at which the player was drafted. We can see that earlier-drafted players usually start more games than players drafted later. This once again makes sense because players taken earlier are expected to be immediate contributors.

In [8]:
# Average games started at each overall pick
gpr = rookie_stats.groupby("Pick")["Games_Started"].mean().reset_index()

gpr.plot.scatter(x = 'Pick', y = 'Games_Started', figsize=(12, 10))
sns.regplot(x = 'Pick', y = 'Games_Started', data = gpr)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9417544c70>

This next graph is a histogram of the draft round for the top 50 rookie receivers in our data, ranked by touchdowns. Surprisingly, more of these receivers came from the second round (21) than from the first (13). After that the counts drop off sharply, with 5, 6, 4, and 1 receivers from rounds 3 through 6, so the general downward trend still holds.

In [9]:
rookie_stats = rookie_stats.sort_values(by='TD', ascending=False)
top_50_td = rookie_stats.head(50)
plt.hist(top_50_td["Round"])
Out[9]:
(array([13.,  0., 21.,  0.,  5.,  0.,  6.,  0.,  4.,  1.]),
 array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ]),
 <a list of 10 Patch objects>)
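
The histogram counts players per round; it does not add up their touchdowns. The two quantities can be read off directly without decoding bin edges, as in the sketch below. The `toy_top` frame is made-up data for illustration.

```python
import pandas as pd

# Made-up rounds and touchdown totals, for illustration only
toy_top = pd.DataFrame({
    "Round": [1.0, 2.0, 2.0, 3.0, 1.0, 2.0],
    "TD": [8, 7, 6, 5, 4, 3],
})

# Number of players from each round
players_per_round = toy_top["Round"].value_counts().sort_index()
print(players_per_round)

# Total touchdowns scored by the players from each round
td_per_round = toy_top.groupby("Round")["TD"].sum()
print(td_per_round)
```

Running both on `top_50_td` would separate the question of how many top scorers came from each round from the question of how many touchdowns those players combined for.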

Machine Learning and Hypothesis Testing

Our hypothesis is that there is a relationship between the round in which a receiver was drafted and how many yards per game he had during his rookie season. To test the relationship between yards per game and draft round, we conduct an F-test. The null hypothesis is that there is no linear relationship between the two.

In [10]:
ypg_lm=ols('Yards_Per_Game~Round', data=rookie_stats).fit() # Round is treated as numeric here; write C(Round) to make it categorical
print(sm.stats.anova_lm(ypg_lm, typ=2))
                sum_sq    df         F    PR(>F)
Round      1214.698696   1.0  5.036503  0.028458
Residual  14711.918446  61.0       NaN       NaN

After conducting the F-test, we see that the F statistic is 5.036503 and the p-value is 0.028458. The p-value is less than the significance level of 0.05, so we reject the null hypothesis of no linear relationship. This is evidence that yards per game and draft round are linearly associated in our sample of rookies, although it does not by itself show that draft round causes the difference in production.

Another hypothesis is that there is a relationship between where a player was drafted and how many touchdowns he scored. To test the relationship between number of touchdowns and draft round, we conduct another F-test.

In [11]:
td_lm=ols('TD~Round', data=rookie_stats).fit() # Round is treated as numeric here; write C(Round) to make it categorical
print(sm.stats.anova_lm(td_lm, typ=2))
              sum_sq    df         F    PR(>F)
Round       9.441440   1.0  1.475706  0.229129
Residual  390.272846  61.0       NaN       NaN

After conducting the F-test, we see that the F statistic is 1.475706 and the p-value is 0.229129. The p-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis. We do not have enough evidence to conclude that there is a linear relationship between touchdowns and draft round.
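
The F statistics in the two ANOVA tables can be reproduced by hand from the sums of squares, and the same numbers give the share of variance that round explains. The sketch below does that arithmetic. The helper name `f_and_r_squared` is an assumption for illustration, and the R-squared formula relies on each model having a single predictor.

```python
def f_and_r_squared(ss_model, df_model, ss_resid, df_resid):
    """Return the F statistic and R^2 implied by a one-predictor ANOVA table."""
    # F is the ratio of the model mean square to the residual mean square
    f_stat = (ss_model / df_model) / (ss_resid / df_resid)
    # R^2 is the model sum of squares as a share of the total
    r_squared = ss_model / (ss_model + ss_resid)
    return f_stat, r_squared

# Sums of squares and degrees of freedom copied from the two ANOVA tables above
ypg_f, ypg_r2 = f_and_r_squared(1214.698696, 1.0, 14711.918446, 61.0)
td_f, td_r2 = f_and_r_squared(9.441440, 1.0, 390.272846, 61.0)

print(round(ypg_f, 3), round(ypg_r2, 3))
print(round(td_f, 3), round(td_r2, 3))
```

The F values match the tables, and the R-squared values come out to about 7.6% for yards per game and about 2.4% for touchdowns. Even where the relationship is statistically significant, draft round accounts for only a small part of the variation in rookie production.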

In this next part, we use combine results to build a regression model that predicts the round in which a wide receiver will be drafted. First, we create a dataframe of players who have a recorded height and completed four of the major combine drills: the forty-yard dash, vertical jump, broad jump, and three-cone drill.

In [13]:
# Keep only receivers with every model input recorded
model_inputs = ['Forty', 'Vertical', 'BroadJump', 'Ht', 'Cone']
complete_drills = df_wr.dropna(subset=model_inputs)

Then we'll create a model that will attempt to relate the forty time, vertical, broad jump, height, and cone drill to the round that the player was drafted in.

In [14]:
ml_model=ols('Round~Forty+Vertical+BroadJump+Ht+Cone', data=complete_drills).fit()
resid_df = pd.DataFrame()
resid_df["resid"] = ml_model.resid
resid_df["fitted"] = ml_model.fittedvalues

sns.residplot(x = 'fitted', y = 'resid', data = resid_df)
plt.title("Residual error versus fitted")
plt.show()

The residuals are centered around 0 across the range of fitted values, which suggests that the linear form of the model has not been severely violated. This plot does not test normality, however, and because draft round only takes the values 1 through 7, some banding in the residuals is to be expected.

We can now compute the draft round predicted by the model for each player and plot it against the round in which the player was actually drafted.

In [15]:
# The fitted values are the intercept plus each coefficient times its drill result
complete_drills["expected_pos_draft"] = ml_model.fittedvalues
complete_drills.plot.scatter(x = 'Round', y = 'expected_pos_draft')
sns.regplot(x = 'Round', y = 'expected_pos_draft', data = complete_drills)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f943508c430>

Looking at this graph, we can see that our regression model is not nearly accurate enough. For every actual round a player was drafted in, the predictions range across the spectrum. The trend line does not fit the data points well, and there is no conclusive evidence that our model is accurate. Based on this, we conclude that, in the dataset we used, there is no conclusive evidence that combine results are a strong predictor of draft round for wide receivers.

Overall, through this project, we went through the data science pipeline and applied it to the NFL Draft. We analyzed data to draw conclusions about various aspects of the draft (specifically with regard to wide receivers) and developed our own regression model to attempt to predict the rounds in which wide receivers are drafted. After analyzing the results of our model, we came to the conclusion that the NFL draft is extremely hard to predict.

Others have tried to analyze this in the past. Their work is linked below.
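
The visual judgment above could be backed with numbers by snapping each continuous prediction to the nearest round and scoring it against the actual round. The sketch below is one way to do that under stated assumptions: the helper `round_prediction_scores` is hypothetical, the `actual` and `fitted` values are made up, and scoring on the same players the model was fit to would overstate accuracy on new players.

```python
import numpy as np

def round_prediction_scores(actual_round, predicted_value):
    """Return (share of exact round matches, mean absolute error in rounds)."""
    actual = np.asarray(actual_round, dtype=float)
    # Snap continuous predictions to the nearest valid round (1 through 7)
    predicted = np.clip(np.rint(predicted_value), 1, 7)
    exact = float(np.mean(predicted == actual))
    mae = float(np.mean(np.abs(predicted - actual)))
    return exact, mae

# Made-up actual rounds and model outputs, for illustration only
actual = [1, 2, 4, 6, 7]
fitted = [3.2, 2.4, 3.9, 4.1, 4.6]

exact, mae = round_prediction_scores(actual, fitted)
print(exact, mae)
```

With the real data, `complete_drills["Round"]` and `complete_drills["expected_pos_draft"]` would take the place of the two lists.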

https://chance.amstat.org/2016/11/draft-and-nfl-performance/

https://seanjtaylor.github.io/learning-the-draft/

Thank you for your time
