Introduction

For those of us who are football fans, debating how long a notable player will “stay good” is a source of great fun and discussion. Patrick Mahomes is at the top of his game right now. But for how many more years will he remain the incredible player he is today? Five more years? Ten? Barring injuries, his career ought to follow some natural progression as he ages. We know that football players' skills tend to fade slowly as they approach their 40s, with many retiring in their late 30s, but beyond that very vague understanding, it often seems to me that any more detailed predictions are based simply on speculation and conjecture. The goal of this project is to see whether data mining techniques, applied to all the historical NFL player data that exists, can produce a predictive model that better addresses the question of how long a particular player will “stay good.”

This project picks up where a former one left off. It will begin with the current draft of a final project that was submitted for the course CSPB 4502 – Data Mining. The first step was finding all the appropriate data online. While none of it was available in ready-to-use downloadable data sets, the info is publicly available in excellent form from the present day back to 1970 on the website https://www.nfl.com/stats/player-stats/. Older NFL data does exist, but it is much spottier in quality, with many more gaps of missing or corrupted data. That info is found on this website: https://www.footballdb.com/statistics/nfl/player-stats/. This second website's data stretches back to 1940, long before the Super Bowl era.

The former project scraped all the data from those two websites using an automated bot. This was quite a difficult feat (!), as the former site involves both pop-up ads and a "next page" button whose location varies and that must be clicked. The two data streams were then merged, and intensive data cleaning had to be performed. A great deal of data analysis and, finally, data mining followed, resulting in the final project submission.

In its current form, the predictive machine lets a user choose any NFL player (past or present) and examine their career. For historical players, it lets the user choose how many years of their career to use as training data, compares that player's career to those of all other players who have ever played that position, and uses the mode of those to generate a prediction. Here's an example. We'll have the machine predict the future of the legendary Rams / Cardinals quarterback Kurt Warner, given the first 6 years that he played as training data:

Now we'll have the same machine spit out its best prediction for the future of young 49ers quarterback Brock Purdy's career:

This current version of the predictive engine is lovely and is an absolute ton of fun to play with. That being said, there is some very serious room for improvement. Here are the top 10 areas that most need to be addressed:

(1) How good are the predictions? The biggest flaw of this whole project right now is that there is no way of knowing how accurate or wildly far off the predictions are! This makes the machine, ultimately, no more practical to use than a Magic 8 Ball. Extensive study of historical NFL players is needed to determine how close the machine's predictions are to how the rest of those players' careers actually went. This "how close" needs to be a quantitative number. Roughly speaking, one could describe it as "What percent of the time did the machine get it right?"

(2) Which of the many possible methods should be used to condense the NFL data on a given player for a given year into a single grade score (A, B, C)? There are many highly technical ways to do this, and they differ considerably from one another.

(3) If either players or individual stats are to be compared to other players of the same position in the same year, how should the project handle years when, for instance, more players threw a football? The project should not be skewed by how many backups threw a single pass that year. My current best idea for this is to use a scale that compares all the players to the 7th best player of that position per season. This forms a kind of steady median and lets the work be skewed neither by a single exceptional season from one individual at the top nor by the number of backups throwing a single pass that season.

(4) How best to compare and evaluate running quarterbacks vs pocket passers? This is an incredibly complicated area and one that will require vast amounts of work. My current best idea for addressing this is to put each player stat (per season) on a scale from 0 to 1, relative to the other players of that same season. These scaled stats will then be weighted with various coefficients indicating how important each stat is. For example, for a running back, maybe the number of touchdowns is multiplied by 2, the number of fumbles is multiplied by -4, and the number of plays of 30+ yards is multiplied by 0.3. All of these weighted scores will then be combined into one final score for that player for that season.

(5) Major research question: What is the best choice for the above coefficients? My current best idea for this is to take known, trusted lists of the "Best quarterbacks in 2022" kind and use machine learning to try many different versions of the coefficients until I find one that ranks the various players as closely as possible to the known, trusted lists.

(6) What should be done with missing or corrupted data? The data from 2025 is spotless and precise, across all stats. The data from 1940 is nearly laughably incomplete. How can the project incorporate data streams of such widely varying quality?

(7) Right now, the predictive engine's biggest flaw is that it seems to predict almost all players to retire before they actually historically did. Football is a tough and injury-filled sport. The machine is skewed towards retirement by how many players actually do retire fairly quickly. It predicts all older players to pretty much immediately retire. It predicts running backs to retire sooner than they did. While it is accurate on spans of looking ahead by 2 or 3 seasons, it's dreadfully inaccurate at longer distances into the future. Perhaps the machine needs to first predict an approximate number of seasons it expects a given player to play for and only then predict what it thinks will happen during those seasons.

(8) Major research question: Given all the above, how can we use machine learning to make the machine progressively more precise and accurate? Each new iteration or rebuild will get an experimental accuracy score. Which methods will prove more or less accurate?

(9) Major research question: How close to perfect accuracy can the machine get? This is the core research question that will be addressed this semester.

(10) Lastly, once all that is done, we will have a fully functioning data mining / machine learning engine designed to answer the original question that was posed at my dinner table years ago: "For how many more years will Josh Allen play well for?" At the end of this entire project, this machine ought to be able to accurately and precisely answer that question.
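
To make areas (3) and (4) above a little more concrete, here is a minimal Python sketch of one way the per-season scoring could work. It is only an illustration under stated assumptions, not the project's actual code: the function names, the row layout, and the stat keys (touchdowns, fumbles, plays_30_plus) are hypothetical, the weights are just the running back example from area (4), and capping each scaled stat at 1.0 is my own assumption for keeping the 0-to-1 range.

```python
from collections import defaultdict

# Hypothetical weights, copied from the running back example in area (4).
WEIGHTS = {"touchdowns": 2.0, "fumbles": -4.0, "plays_30_plus": 0.3}
BASELINE_RANK = 7  # the "7th best player" idea from area (3)


def baseline_value(values, rank=BASELINE_RANK):
    """Return the rank-th largest value, or the smallest if fewer players exist."""
    ordered = sorted(values, reverse=True)
    return ordered[min(rank, len(ordered)) - 1]


def season_scores(rows, weights=WEIGHTS):
    """Score each player-season.

    rows: list of dicts with 'player', 'season', and one key per stat in weights.
    Returns a dict mapping (player, season) to a single weighted score.
    """
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row)

    scores = {}
    for season, players in by_season.items():
        # One baseline per stat per season: the 7th largest value that year.
        baselines = {
            stat: baseline_value([p[stat] for p in players]) for stat in weights
        }
        for p in players:
            total = 0.0
            for stat, weight in weights.items():
                base = baselines[stat]
                # Assumption: cap at 1.0 so every stat stays on a 0-to-1 scale.
                scaled = min(p[stat] / base, 1.0) if base > 0 else 0.0
                total += weight * scaled
            scores[(p["player"], season)] = total
    return scores
```

Because each stat is divided by the 7th largest value for that season, adding more backups with tiny numbers does not move the scale. The cap also means the top seven players at a stat all receive the same scaled value for it; dropping the cap would keep them separated, at the cost of leaving the 0-to-1 range.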

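The "what percent of the time did the machine get it right?" number from area (1) could be computed along the following lines. Again, this is a hedged sketch rather than the project's implementation: the function name, the input layout, and the "retired" label are assumptions made for illustration, and it counts only exact grade matches. Results are grouped by how many seasons ahead the prediction looks, since area (7) suggests accuracy drops at longer distances.

```python
from collections import defaultdict


def accuracy_by_horizon(predicted, actual):
    """Fraction of exact grade matches, grouped by seasons ahead.

    predicted, actual: dicts mapping a player name to a list of grades,
    where index 0 is the first season after the training window.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for player, pred_grades in predicted.items():
        real_grades = actual.get(player, [])
        # Pad the shorter list with "retired" (a hypothetical label) so that
        # predicting retirement too early or too late counts as a miss.
        length = max(len(pred_grades), len(real_grades))
        for i in range(length):
            p = pred_grades[i] if i < len(pred_grades) else "retired"
            a = real_grades[i] if i < len(real_grades) else "retired"
            totals[i + 1] += 1
            hits[i + 1] += int(p == a)
    return {h: hits[h] / totals[h] for h in sorted(totals)}


# Made-up grades, purely to show the call shape.
predicted = {"Player A": ["A", "B", "retired"], "Player B": ["B", "B"]}
actual = {"Player A": ["A", "B", "C"], "Player B": ["B", "C", "C"]}
result = accuracy_by_horizon(predicted, actual)
print(result)  # {1: 1.0, 2: 0.5, 3: 0.0}
```

Exact matching is the strictest possible choice. A softer variant could give partial credit when the predicted grade is one letter off, which may be a fairer way to compare successive rebuilds of the machine.
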
DataPrep_EDA