Group Contribution Statement: Both members of the group got together on video calls and worked through the coding and writing.
One of the biggest events of the year in the National Football League (NFL) is the annual draft. In the NFL Draft, teams choose top collegiate players in the nation to join their team. There are 7 rounds in the draft with 32 picks per round. More information on the NFL Draft can be found here. https://operations.nfl.com/the-players/the-nfl-draft/the-rules-of-the-draft/
The wide receiver position is one of the most important positions, and it is one that teams are especially eager to fill. In this project, we analyze the NFL Draft from the wide receiver perspective. We examine the association between a wide receiver's combine performance, draft position, and rookie-year performance. We also build our own model to predict the round in which a receiver was drafted and then evaluate the accuracy of that model.
Here is a link to a Wikipedia page explaining what the NFL combine is: https://en.wikipedia.org/wiki/NFL_Scouting_Combine
We started by reading in the CSV file that contains player data. Each row (entity) is an individual player, and the attributes include the round and pick at which the player was drafted, height, weight, 40-yard dash, shuttle run, and various other combine statistics. The dataset as a whole contains draft data for every player from 2000-2018. We focus on wide receivers who were drafted (undrafted players are excluded) in the years 2008 to 2017.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.read_csv("combine_data_since_2000_PROCESSED_2018-04-26.csv")
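# Keep wide receivers only
df_wr = df[df["Pos"] == "WR"]
# Build the year mask from df_wr itself so its index matches the rows being filtered,
# and cap it at 2017 to match the 2008-2017 window described above
year_filter = df_wr["Year"].between(2008, 2017)
df_wr = df_wr[year_filter]
# Drop undrafted players (no draft round recorded)
df_wr = df_wr[df_wr['Round'].notna()]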
df_wr
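The filtering steps above can be sketched on a tiny hand-made table. The rows below are invented for illustration only (they are not from the combine file); the column names `Pos`, `Year`, and `Round` match the ones used above. Building every condition from the same frame keeps the boolean mask aligned with the rows being filtered.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the real combine file.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "Pos": ["WR", "QB", "WR", "WR", "WR"],
    "Year": [2007, 2010, 2010, 2018, 2015],
    "Round": [1.0, 1.0, None, 2.0, 3.0],
})

# Combine all three conditions into one mask built from the same frame.
mask = (
    (toy["Pos"] == "WR")
    & toy["Year"].between(2008, 2017)  # inclusive on both ends
    & toy["Round"].notna()             # drop undrafted players
)
toy_wr = toy[mask]
print(toy_wr)
```

Only player E survives: A is outside the year window on the early side, B is not a wide receiver, C has no draft round, and D is outside the window on the late side.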
We now have pre-draft data about drafted wide receivers from 2008-2017. From here, we can gather data from the rookie season of each of these players (the season immediately after they were drafted). This data will enable us to analyze correlations between draft position and immediate performance in the NFL. To get this data, we need to scrape it from a website called Pro Football Reference. The data is in the form of a table, and we need to scrape 10 seasons' worth of it (2008-2017). We wrote a for loop that scrapes each season's table from the Pro Football Reference website and keeps only what we need, which is the wide receiver data.
import requests
from io import StringIO
from bs4 import BeautifulSoup
# Browser-like User-Agent sent with every request
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
season_frames = []
for year in range(2008,2018):
    url = "https://www.pro-football-reference.com/years/" + str(year) + "/receiving.htm"
    r = requests.get(url, headers=headers)
    parse = BeautifulSoup(r.content, "html.parser")
    # Take the first table on the page and let pandas parse its HTML
    dat = parse.find("table")
    dat_dat = pd.read_html(StringIO(str(dat)))[0]
    dat_dat["Season"] = year
    # Keep wide receivers only
    dat_dat = dat_dat.loc[(dat_dat['Pos']=="WR") | (dat_dat['Pos'] == "wr")]
    season_frames.append(dat_dat)
# DataFrame.append was removed in pandas 2.0, so the seasons are combined with pd.concat
wr_data = pd.concat(season_frames)
wr_data.head()
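Scraped tables often need light cleanup before they can be joined with another source. As a minimal sketch, assuming the scraped `Player` values may carry trailing markers such as `*` or `+` (an assumption about the page's formatting, not something verified here), a hypothetical helper `clean_player_name` could normalize them:

```python
import re

def clean_player_name(name: str) -> str:
    """Strip trailing '*' / '+' markers and surrounding whitespace.

    Hypothetical helper for illustration; the markers are an assumption
    about how the scraped page decorates player names.
    """
    return re.sub(r"[*+]+$", "", name.strip()).strip()

# Made-up names, used only to show the behavior.
print(clean_player_name("Example Player*+"))
print(clean_player_name("  Another Name "))
```

If such markers are present, the helper could be applied with `wr_data["Player"] = wr_data["Player"].map(clean_player_name)` before any join on player name.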
Now we'll join the tables so we can correlate rookie season stats with draft position.
pd.set_option('display.max_columns', 500)
rookie_stats = pd.merge(wr_data, df_wr, on='Player')
year2_filter = rookie_stats["Season"] == rookie_stats["Year"]
rookie_stats = rookie_stats[year2_filter]
rookie_stats
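The join-then-filter logic above can be sketched with two small invented tables. The point of the sketch is the behavior of an inner join on name followed by the `Season == Year` filter: players without a matching row on both sides drop out, and so do players who recorded no stats in their draft year. Note also that a join on name alone would pair up two different players who happen to share a name.

```python
import pandas as pd

# Invented season stats: the same player appears once per season played.
seasons = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Season": [2010, 2011, 2012],
    "Yds": [500, 900, 300],
})
# Invented draft table: one row per drafted player.
draft = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Year": [2010, 2011, 2013],
    "Round": [1, 4, 7],
})

merged = pd.merge(seasons, draft, on="Player")         # inner join on name
rookies = merged[merged["Season"] == merged["Year"]]   # keep the draft-year season
print(rookies)
```

Here only player A's 2010 season remains: player B has no stats row for the 2011 draft year, and player C has no stats rows at all.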
There are many ways to measure performance for wide receivers. For more information on how performance is measured in the NFL for wide receivers, feel free to click on the following link!
https://www.footballoutsiders.com/stats/nfl/wr/2019
Some of the most common measures of a wide receiver’s success are receiving yards and touchdowns, which are what we use to measure a receiver’s production. First, let’s look at the relationship between receiving yards per game and the round in which a receiver was selected. The graph below shows the average yards per game in a wide receiver’s rookie year plotted against the round in which that receiver was taken, aggregated over 2008-2017. As we can see below, there does seem to be a trend: on average, the later the round a wide receiver was taken in, the fewer yards per game he averaged in his first season. According to the graph, the average yards per game goes down with every round deeper into the draft, except for round 5, which appears to be an outlier.
One plausible explanation for this trend is that players taken earlier in the draft are perceived to have more skill and potential, which results in more playing time during their rookie season and better performance if those skills hold up. Additionally, late-round players may not have the same skill, and they sometimes lack the trust of coaches to the point where they get very little playing time in their first year.
# Slot within the round, assuming 32 picks per round (an approximation).
# Subtracting 1 before the modulo maps picks 32, 64, ... to slot 32 instead of 0.
rookie_stats["Pick_Per_Round"] = (rookie_stats["Pick"] - 1) % 32 + 1
rookie_stats = rookie_stats.astype({"Y/G": float, "GS": float, "TD": float})
rookie_stats = rookie_stats.rename(columns={"Y/G": "Yards_Per_Game", "GS": "Games_Started"})
# Average rookie-year yards per game for each draft round
ypr = rookie_stats.groupby("Round")["Yards_Per_Game"].mean().reset_index()
ypr.plot.scatter(x = 'Round', y = 'Yards_Per_Game', figsize=(12, 10))
sns.regplot(x = 'Round', y = 'Yards_Per_Game', data = ypr)
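A per-round average can hide how many players sit behind each point, which matters when judging whether one round is an outlier. The sketch below, on invented numbers, reports the group size next to the mean; the column names match the ones used above, but the values are made up.

```python
import pandas as pd

# Invented numbers for illustration only.
toy = pd.DataFrame({
    "Round": [1, 1, 2, 2, 2, 3],
    "Yards_Per_Game": [60.0, 40.0, 30.0, 20.0, 10.0, 5.0],
})

summary = (
    toy.groupby("Round")["Yards_Per_Game"]
    .agg(["mean", "count"])   # mean per round plus how many players back it
    .reset_index()
)
print(summary)
```

The same `.agg(["mean", "count"])` call could be applied to `rookie_stats` to see how many rookies contribute to each round's average.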
Next, we compare how different players within the same round performed. Below are 6 plots, one for each of rounds 1 through 6. In each plot, we have plotted average yards per game against the pick at which a player was drafted within the round. The fitted line is usually close to horizontal, which suggests that position within a round mattered little. This means, for instance, that a player picked 15th in the 1st round won’t necessarily do far better than the last player picked in the same round.
# One scatter plot with a fitted trend line per round (rounds 1 through 6)
for rnd in range(1, 7):
    round_df = rookie_stats[rookie_stats['Round'] == rnd]
    ax = round_df.plot.scatter(x = 'Pick_Per_Round', y = 'Yards_Per_Game', figsize=(12, 10))
    sns.regplot(x = 'Pick_Per_Round', y = 'Yards_Per_Game', data = round_df, ax = ax)
    ax.set_title("Round " + str(rnd))
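One way to put a number on how flat each per-round plot is would be the correlation between pick within the round and yards per game, computed separately for each round. This is a sketch on invented data, not a result from the scraped tables; a correlation near 0 corresponds to a roughly horizontal trend line.

```python
import pandas as pd

# Invented numbers for illustration only.
toy = pd.DataFrame({
    "Round": [1, 1, 1, 2, 2, 2],
    "Pick_Per_Round": [1, 2, 3, 1, 2, 3],
    "Yards_Per_Game": [50.0, 40.0, 30.0, 20.0, 25.0, 20.0],
})

# Pearson correlation between pick slot and production, one value per round.
corr_by_round = {
    rnd: grp["Pick_Per_Round"].corr(grp["Yards_Per_Game"])
    for rnd, grp in toy.groupby("Round")
}
print(corr_by_round)
```

In the toy data, round 1 shows a perfectly decreasing pattern while round 2 shows no linear pattern at all, which is the "horizontal" case described above.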
In this next graph, we plot the average games started during a rookie season against the overall pick at which a player was drafted. We can see that earlier-drafted players usually start more games than players drafted later. This once again makes sense because players taken earlier are expected to be immediate contributors.
# Average rookie-year games started at each overall pick
gpr = rookie_stats.groupby("Pick")["Games_Started"].mean().reset_index()
gpr.plot.scatter(x = 'Pick', y = 'Games_Started', figsize=(12, 10))
sns.regplot(x = 'Pick', y = 'Games_Started', data = gpr)
This next graph is a histogram that shows how many of the top 50 rookie receivers (by touchdowns) came from the 1st, 2nd, 3rd, etc. round. Surprisingly, more of the top 50 came from the second round than from the first round. However, we can see that the general downward trend still holds true.
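rookie_stats = rookie_stats.sort_values(by='TD', ascending=False)
top_50_td = rookie_stats.head(50)
# One bin per round (edges at 0.5, 1.5, ..., 7.5) so each bar lines up with a single round
plt.hist(top_50_td["Round"], bins=np.arange(0.5, 8.5, 1))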
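The counts behind a histogram like this can also be read off directly as a table. The sketch below uses invented rows and takes the top 4 instead of the top 50 to keep the example small.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Player": list("ABCDEF"),
    "Round": [1, 2, 2, 3, 1, 2],
    "TD": [8, 7, 6, 5, 2, 1],
})

top_n = toy.sort_values("TD", ascending=False).head(4)
counts = top_n["Round"].value_counts().sort_index()  # players per round, ordered by round
print(counts)
```

Note that this counts players per round; it does not add up their touchdowns.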
Our hypothesis was that there is a relationship between the round in which a receiver was drafted and how many yards per game he had during his rookie season. To test the relationship between yards per game and round picked in the draft, we conduct an F-test.
ypg_lm=ols('Yards_Per_Game~Round', data=rookie_stats).fit() # Round is treated as numeric here; use C(Round) to treat it as categorical
print(sm.stats.anova_lm(ypg_lm, typ=2))
After conducting the F-test, we see that the F-test statistic is 5.036503 and the P-value is 0.028458. The P-value is less than the significance level of 0.05, so we reject the null hypothesis of no linear relationship. This is evidence that yards per game and round picked are linearly associated. It shows an association between draft round and rookie production, not proof that the round itself causes the difference.
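For a model with a single numeric predictor, the F statistic is the explained variation per degree of freedom divided by the unexplained variation per degree of freedom. The sketch below computes it by hand with NumPy on invented data; `simple_regression_f` is a hypothetical helper name, and the numbers are unrelated to the results reported above.

```python
import numpy as np

def simple_regression_f(x, y):
    """Return the F statistic for the slope in y = a + b*x (one predictor)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)        # least-squares line
    fitted = intercept + slope * x
    ss_model = np.sum((fitted - y.mean()) ** 2)   # explained variation, 1 df
    ss_resid = np.sum((y - fitted) ** 2)          # unexplained variation, n - 2 df
    return (ss_model / 1) / (ss_resid / (n - 2))

# Invented data for illustration only.
f_stat = simple_regression_f([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f_stat)
```

A large F statistic relative to the F distribution with 1 and n - 2 degrees of freedom corresponds to a small P-value.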
Another hypothesis was that there is a relationship between where a player was drafted and how many touchdowns he scored. To test the relationship between number of touchdowns and round picked in the draft, we conduct another F-test.
td_lm=ols('TD~Round', data=rookie_stats).fit() # Round is treated as numeric here; use C(Round) to treat it as categorical
print(sm.stats.anova_lm(td_lm, typ=2))
After conducting the F-test, we see that the F-test statistic is 1.475706 and the P-value is 0.229129. The P-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis. In other words, we do not have sufficient evidence of a linear relationship between number of touchdowns and round picked.
In this next part, we use combine results to create a regression model to predict the round in which a wide receiver will be drafted. First, we create a dataframe of players who have recorded values for all 5 of the measurements we use: forty time, vertical, broad jump, height, and cone drill.
# Keep only players with a recorded value for every predictor used in the model
model_cols = ['Forty', 'Vertical', 'BroadJump', 'Ht', 'Cone']
complete_drills = df_wr.dropna(subset=model_cols)
Then we'll create a model that will attempt to relate the forty time, vertical, broad jump, height, and cone drill to the round that the player was drafted in.
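Written out, the model described above has the following form, where the coefficients are estimated by ordinary least squares and the last term is the error. The symbol names are ours, chosen for illustration:

```latex
\text{Round} = \beta_0 + \beta_1\,\text{Forty} + \beta_2\,\text{Vertical} + \beta_3\,\text{BroadJump} + \beta_4\,\text{Ht} + \beta_5\,\text{Cone} + \varepsilon
```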
ml_model=ols('Round~Forty+Vertical+BroadJump+Ht+Cone', data=complete_drills).fit()
resid_df = pd.DataFrame()
resid_df["resid"] = ml_model.resid
resid_df["fitted"] = ml_model.fittedvalues
sns.residplot(x = 'fitted', y = 'resid', data = resid_df)
plt.title("Residual error versus fitted")
plt.show()
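The residual points are centered around 0, which is consistent with the linearity assumption not being severely violated. Note that a residual-versus-fitted plot speaks to linearity and constant variance and does not check normality; a Q-Q plot of the residuals would be the usual check for that.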
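One caveat when reading a residual plot: whenever the model includes an intercept, least-squares residuals average to zero by construction, even when a straight line is a poor fit. The sketch below uses synthetic data (not the combine data) with a deliberately curved relationship to illustrate this, so the shape of the residual cloud is more informative than its center.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
# Deliberately nonlinear relationship, fitted with a straight line anyway.
y = x ** 2 + rng.normal(0, 1, size=50)

X = np.column_stack([np.ones_like(x), x])      # intercept column + predictor
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit
resid = y - X @ coef

# The mean residual is zero up to floating-point error despite the poor fit.
print(abs(resid.mean()) < 1e-8)
```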
We can now compute the expected draft round from the model, which will be plotted against the actual round in which each player was drafted.
# Predicted round from the fitted model: intercept plus each coefficient times its predictor
complete_drills["expected_pos_draft"] = ml_model.predict(complete_drills)
complete_drills.plot.scatter(x = 'Round', y = 'expected_pos_draft')
sns.regplot(x = 'Round', y = 'expected_pos_draft', data = complete_drills)
Looking at this graph, we can see that our regression model is not nearly accurate enough. For every actual round in which a player was drafted, the model produces a wide spread of projections. The trend line does not fit the actual data points well, and we see no conclusive evidence that our model is accurate. Based on this, we conclude that in the dataset we used, together with the tables we scraped, there is no conclusive evidence that combine results are a significant predictor of draft round.
Overall, through this project, we were able to go through the data science pipeline and apply it to the NFL Draft. We analyzed data in order to draw conclusions about various aspects of the draft (specifically with regard to wide receivers), and we developed our own regression model to attempt to predict the rounds in which wide receivers are drafted. After analyzing the results of our model, we conclude that the draft round of a wide receiver is hard to predict from combine results alone.
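One way to put numbers on the mismatch between predicted and actual rounds is to report how often the rounded prediction hits the exact round, along with the average miss in rounds. The helper below is a hypothetical sketch (`round_prediction_scores` is not part of the analysis above), shown on invented values.

```python
import numpy as np

def round_prediction_scores(actual, predicted):
    """Return (exact-round accuracy, mean absolute error in rounds)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Snap continuous predictions to a valid round between 1 and 7.
    snapped = np.clip(np.rint(predicted), 1, 7)
    accuracy = float(np.mean(snapped == actual))
    mae = float(np.mean(np.abs(predicted - actual)))
    return accuracy, mae

# Invented values for illustration only.
acc, mae = round_prediction_scores([1, 3, 5, 7], [1.4, 3.6, 4.8, 5.0])
print(acc, mae)
```

With the objects defined earlier, it could be called as `round_prediction_scores(complete_drills["Round"], complete_drills["expected_pos_draft"])`.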
Others have analyzed this topic in the past; links to their work are below.
https://chance.amstat.org/2016/11/draft-and-nfl-performance/
https://seanjtaylor.github.io/learning-the-draft/
Thank you for your time