Machine learning models that predict the height of basketball players from playstyle and performance statistics. The data covers the previous 20 seasons and was collected with the `nba_api` package.
- Models: Gradient boosted regressors and classifiers, random forests
- Data analysis: Clustering, preparing clean data for better performance.
The results show that two regressors (one gradient boosted, one random forest) can predict the height of a player within 1 inch about 50% of the time. Moreover, on the data the latter model predicted inside the (-7, 5) inch error range for every player, within +-2 inches for 72.9% of the players, and within +-3 inches for 86.4% of the players.
Preprocessing, analysis, prediction: inside the `ml.ipynb` notebook.
Data collection, some cleaning: described in `fetch_players.ipynb`.
Data is stored in the `data` folder, the combined data for training in the `data_combined.csv` file.
The pipeline picture:
A summary of the best models:
Model | Accuracy Score | +/-1 inch % | +/-2 inch % | +/-3 inch % | F1 Score | Error Range |
---|---|---|---|---|---|---|
Random forest classifier | 17.6% | 45.3% | 70.3% | 83.7% | 0.155 | -8 , 6 |
Ensembled (manually biased) forest | 16.2% | 49.0% | 71.3% | 85.4% | 0.149 | -8 , 5 |
Best RF regressor | - | 50.0% | 72.9% | 86.4% | - | -7 , 5 |
Best gradient boosted regressor | - | 49.9% | 69.1% | 84.5% | - | -8 , 6 |
- I recommend simply downloading (cloning) the whole repository.
After downloading the repository, go to this directory in your command line shell, choose or create a `Python`/`conda`/`virtualenv` environment, and run `pip install -r requirements.txt`. (The dependencies are listed in both the `requirements.in` and the `requirements.txt` files.)
Then run the cells of the `ml.ipynb` notebook in a Jupyter environment.
If you want to fetch the data yourself (it is already fetched and stored in the `data` folder), you additionally need to run `pip install nba_api` to use the package in the `fetch_players.ipynb` notebook.
- Collected using the `nba_api` package, from the NBA stats website.
The NBA website has a statistics subpage (and other subpages for other data) containing team statistics and player statistics, including biography (height, weight, age etc.) and career stats. The website provides multiple APIs, but conveniently the `nba_api` package offers a single Python interface to them. The data is collected using this package, as shown in the `fetch_players.ipynb` notebook.
The separate career (and filtered career) and biographical datasets are stored in their respective `csv` files.
To train the models properly, the data is combined, processed, and saved (in a general form suitable for most models) in the `data_combined.csv` file. This procedure is described in the `ml.ipynb` notebook. (Example steps are normalizing statistics by minutes played to get more meaningful features, and removing players who played less than 3 matches' worth of time in a season.)
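The two example steps can be sketched with pandas as below. This is only an illustration, not the notebook's code: the 144-minute threshold (3 matches of 48 regulation minutes), the helper name `add_per_minute_features`, and the choice of raw columns are assumptions, and the rows are made up.

```python
import pandas as pd

# Assumption: "3 matches' worth of time" is read as 3 x 48 regulation minutes.
MIN_MINUTES = 3 * 48

# Raw season totals to express per minute (names chosen to match the *_PM columns).
PER_MINUTE_COLUMNS = ["FGA", "FG3A", "FTM", "BLK", "PF"]


def add_per_minute_features(seasons: pd.DataFrame) -> pd.DataFrame:
    """Drop low-minute player-seasons and replace totals with <STAT>_PM columns."""
    kept = seasons[seasons["MIN"] >= MIN_MINUTES].copy()
    for column in PER_MINUTE_COLUMNS:
        kept[f"{column}_PM"] = kept[column] / kept["MIN"]
    return kept.drop(columns=PER_MINUTE_COLUMNS)


# Made-up rows for illustration only, not taken from the dataset.
example = pd.DataFrame(
    {
        "PLAYER_ID": [1, 2, 3],
        "MIN": [200, 100, 1000],
        "FGA": [80, 50, 400],
        "FG3A": [20, 10, 100],
        "FTM": [10, 5, 50],
        "BLK": [0, 1, 25],
        "PF": [20, 10, 60],
    }
)
processed = add_per_minute_features(example)
print(processed)
```

Here the second player-season (100 minutes) is dropped, and the remaining rows carry per-minute rates instead of season totals.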
Typical instances of the final dataframe are:
PLAYER_ID | SEASON_ID | AGE | PLAYER_HEIGHT_INCHES | PLAYER_WEIGHT | OREB_PCT | DREB_PCT | TS_PCT | MIN | FGA_PM | FG_PCT | FG3A_PM | FG3_PCT | FTM_PM | BLK_PM | PF_PM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1630639 | 2022-23 | 22 | 78 | 179 | 0.046 | 0.152 | 0.589 | 217 | 0.40553 | 0.662667 | 0.230415 | 0.266667 | 0.018433 | 0.0 | 0.101382 |
1631260 | 2022-23 | 23 | 77 | 190 | 0.016 | 0.105 | 0.607 | 345 | 0.362319 | 0.424 | 0.304348 | 0.419 | 0.011594 | 0.0 | 0.089855 |
203932 | 2021-22 | 27 | 80 | 235 | 0.086 | 0.136 | 0.617 | 2055 | 0.370316 | 0.564 | 0.084185 | 0.347 | 0.092944 | 0.024818 | 0.062774 |
We predict `PLAYER_HEIGHT_INCHES` from the other attributes. As the table shows, the features are mainly rebounding, shooting, block, and foul statistics. The key predictors are the rebound percentages and personal fouls: higher values of these generally predict a greater height.
The attributes are explained in the Explanation of attributes section below, and in the `ml.ipynb` notebook.
- Random forests, gradient boosted models, and specially ensembled forests were chosen, with a single decision tree for comparison. Training and test data are split so that every player appears in only one of the two datasets.
Because the data instances (players) are few and noisy (players play very differently regardless of height) while attributes are plentiful, simpler models such as random forests are good choices. For improved results, gradient boosted models are also trained. However, these models do not have enough data or computing power (e.g. for hyperparameter tuning) to perform as well as they could, and typically performed comparably only when initialized with the random forest models.
(Random forests cannot capture indirect relationships between the target and the features, but gradient boosted models can learn these. For example, age clearly doesn't correlate with height. However, a player's performance develops strongly with age, and we predict height from performance, so the age attribute could indeed be useful indirectly.)
The models:
- Decision tree: simple model, hyperparameter tuned - just for comparison
- Random forest classifier: hyperparameter tuned
- Ensemble forests: A simple random forest-like model (most common choice, "argmax") and a model that averages the predictions of the trees. (The latter is more biased, hence less common, but does better in most important metrics here.)
- Random forest regressor: using the same hyperparameters as the classifier
- Gradient boosted classifiers: Both hyperparameter tuned, and not tuned. Among the untuned models, the strongest predictor was initialized with the random forest classifier.
- Gradient boosted regressors: initialized with the random forest regressor.
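As a rough sketch of what "initialized with the random forest regressor" can look like in scikit-learn: `GradientBoostingRegressor` accepts an `init` estimator whose predictions serve as the starting point that the boosting stages then correct. Whether the notebook uses exactly this mechanism is an assumption; the data below is synthetic and the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in data (not the NBA dataset): the target depends on feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 78 + 3 * X[:, 0] + rng.normal(scale=1.0, size=300)

# Placeholder hyperparameters; the tuned values live in the notebook.
forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)

# The forest is fitted first, and the boosting stages then fit the residuals
# left over by the forest's predictions.
boosted = GradientBoostingRegressor(
    init=forest, n_estimators=50, learning_rate=0.05, random_state=0
)
boosted.fit(X, y)

predictions = boosted.predict(X[:5])
print(predictions)
```

After fitting, `boosted.init_` holds the fitted forest, so the boosted model can only be as weak as the forest on the training data and is refined from there.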
Training and test data are split so that every player appears in only one of the datasets (players have multiple instances because of multiple seasons, etc.).
To show the benefit of this split, two models were also trained on shuffled data. The results showed that it is better to train the models with no player appearing in both the train and test datasets.
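One way to obtain such a player-wise split is scikit-learn's `GroupShuffleSplit` with the player ID as the group label. This is a minimal sketch on made-up IDs and heights; whether the notebook uses this helper or a manual split is an assumption.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up player-seasons: each ID appears once per season played.
player_ids = np.array([101, 101, 101, 202, 202, 303, 404, 404, 505, 505])
X = np.arange(len(player_ids)).reshape(-1, 1)  # placeholder features
y = np.array([78, 78, 78, 80, 80, 75, 83, 83, 77, 77])  # heights in inches

# All rows of a given player go either to train or to test, never to both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=player_ids))

train_players = set(player_ids[train_idx])
test_players = set(player_ids[test_idx])
print("train players:", sorted(train_players))
print("test players:", sorted(test_players))
```

With a plain shuffled split, seasons of the same player could land on both sides, which is the "shuffled data" setting the comparison refers to.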
There are multiple metrics to compare the models.
- Accuracy: The percentage of correct predictions. (Only for classifiers.)
- +-1 inch accuracy, +-2, +-3: The percentage of predictions that are within 1, 2, or 3 inches of the true height, respectively.
- F1 score: The harmonic mean of precision and recall. (Only for classifiers.)
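The +-N inch accuracies and the error range reported in the tables can be computed along these lines. This is a sketch under assumptions: the function names are hypothetical, the tolerance is taken as inclusive, and the error is taken as predicted minus true height. The heights are made up.

```python
import numpy as np


def within_inches(y_true, y_pred, tolerance):
    """Fraction of predictions within `tolerance` inches of the true height."""
    errors = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.mean(np.abs(errors) <= tolerance))


def error_range(y_true, y_pred):
    """Smallest and largest signed error (predicted minus true), in inches."""
    errors = np.asarray(y_pred) - np.asarray(y_true)
    return float(errors.min()), float(errors.max())


# Made-up heights for illustration.
y_true = [78, 80, 75, 83]
y_pred = [79, 77, 75, 85]

print(within_inches(y_true, y_pred, 1))  # 0.5
print(within_inches(y_true, y_pred, 2))  # 0.75
print(within_inches(y_true, y_pred, 3))  # 1.0
print(error_range(y_true, y_pred))       # (-3.0, 2.0)
```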
According to the random forest classifier SHAP values, this is the importance (impact, Shapley value) ranking of the attributes:
As we can see, both age and minutes played matter the least, considerably less than any other attribute - indicating that they likely do not matter at all.
However, random forests can only model direct relationships, not indirect ones, and we know that age (and minutes played) correlates with performance, which we use to predict height. Hence these attributes could be more important for gradient boosted models than shown here, and such models may use them to make better predictions.
The rebound attributes are the most important, followed by personal fouls; these stand out among the other attributes. This supports our initial hypothesis, formed during data exploration and analysis:
Rebounds increase steadily with height, a clear indication that the rebound statistics are good predictors of height.
Model | Data | Accuracy Score | +/-1 inch % | +/-2 inch % | +/-3 inch % | F1 Score | Error Range |
---|---|---|---|---|---|---|---|
Decision tree | All | 15.7% | 42.8% | 67.7% | 81.3% | 0.154 | -8 , 7 |
Random forest classifier | All | 17.6% | 45.3% | 70.3% | 83.7% | 0.155 | -8 , 6 |
Ensemble forest 1 | All | 15.5% | 44.9% | 70.2% | 83.0% | 0.137 | -8 , 6 |
Ensemble forest averaging | All | 16.2% | 49.0% | 71.3% | 85.4% | 0.149 | -8 , 5 |
Random forest regressor | All | - | 50.0% | 72.9% | 86.4% | - | -7 , 5 |
Gradient boosted classifier, no tuning, previous parameters | All | 17.1% | 44.5% | 64.9% | 78.9% | 0.169 | -14 , 9 |
Gradient boosted classifier, with hyperparameter tuning | All | 15.2% | 41.6% | 65.7% | 79.3% | 0.131 | -8 , 12 |
Gradient boosted classifier, with tuning, initialized from RF | All | 15.8% | 43.5% | 67.4% | 81.3% | 0.148 | -8 , 6 |
Gradient boosted regressor (from the RF model) | All | - | 49.9% | 69.1% | 84.5% | - | -8 , 6 |
Gradient boosted classifier, with shuffled data | Shuffled | 8.8% | 28.1% | 43.4% | 59.8% | 0.087 | -14 , 11 |
Gradient boosted regressor, with shuffled data | Shuffled | - | 38.6% | 67.1% | 82.8% | - | -12 , 6 |
The gradient boosted methods likely underperformed because of the lack of data, and because the scikit-learn implementation is not state of the art (unlike, for example, XGBoost).
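Classifiers: Accuracy and F1 score are used only for evaluating the classifiers. The random forest classifier has the best accuracy with 17.6%; the only other model above 17% is the untuned gradient boosted classifier that reuses its parameters (17.1%). By F1 score, the table puts that gradient boosted classifier first (0.169), ahead of the random forest classifier (0.155).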
Among the classifiers, however, the averaging ensemble forest had the best scores in all other categories (+-N inch ranges, error range), and came close to the best regressor in all categories.
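The difference between the "argmax" ensemble and the averaging ensemble can be illustrated on a small matrix of per-tree predictions. This is a simplified sketch with made-up numbers and hypothetical function names; how the notebook actually combines the trees (for example, whether it rounds the average) is an assumption.

```python
import numpy as np

# Made-up predictions: one row per tree, one column per player (heights in inches).
tree_predictions = np.array(
    [
        [76, 80, 78],
        [76, 82, 79],
        [79, 82, 83],
        [80, 84, 83],
    ]
)


def majority_vote(predictions):
    """'Argmax' ensemble: the height predicted by the most trees wins."""
    result = []
    for column in predictions.T:
        values, counts = np.unique(column, return_counts=True)
        result.append(values[np.argmax(counts)])
    return np.array(result)


def rounded_average(predictions):
    """Averaging ensemble: mean of the tree predictions, rounded to whole inches."""
    return np.rint(predictions.mean(axis=0)).astype(int)


print(majority_vote(tree_predictions))    # [76 82 83]
print(rounded_average(tree_predictions))  # [78 82 81]
```

The averaged prediction is pulled toward the middle of the trees' votes, which tends to reduce large misses at the cost of fewer exact hits. This matches the pattern in the table, where the averaging forest has a better +-N inch accuracy and error range than the "argmax" ensemble.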
Regressors: The random forest regressor works best: it is the only model reaching 50% accuracy in the +-1 inch range. It also has the best score in all other categories: 72.9% +-2 inch accuracy, 86.4% +-3 inch accuracy, and a (-7, 5) inch error range.
Among the gradient boosted regressors, only the one initialized from the random forest performs comparably, with 49.9% accuracy in the +-1 inch range.
This means that the random forest regressor can predict the height of a player within 1 inch half of the time. This is a strong result for such a noisy problem.
We also see that it is better to train the models so that no player appears in both the train and test datasets.
Outdated summary pdf file for overall explanation.
Link on my personal website: Projects, NBA Height Prediction (to be added).
Attribute | Meaning | Extra info |
---|---|---|
MIN | minutes | |
FG, FG3, FT | Field goal (2 point goals), field goal 3 pointers, free throws ("penalty throws", 1 point) | M: Made (scored), A: attempted, PCT: percentage |
REB | Rebounds | OREB/DREB: Offensive/defensive rebound |
AST | Assists | AST_PCT: Assist percentage |
STL | Steals | |
BLK | Blocks | |
TOV | Turnovers | Negative statistic |
PF | Personal fouls | Negative statistic |
PTS | Points | |
--- | -------------------------------------------------------------------- | -------------------------------------------------------------------- |
DRAFT | Each year before the season starts, 60(?) college newcomers can be drafted by teams | More "promising" players are drafted in earlier rounds |
GP, GS | Games played, games started | |
NET_RATING | Offensive rating - defensive rating | For a player: a measure of how many points the team scores versus allows while he is playing. Can be negative. |
USG_PCT | Usage percentage | Estimate of percentage of team plays |
TS_PCT | True shooting percentage | Measures how well a player shoots |
Here is my assessment of the attributes, one by one (made before the analysis and model building):
- Position: There is no attribute for position. If there were, it could play a significant role in the statistics and could "suggest" the player's physical attributes, so it would be a very valuable predictor. It is absent because the NBA does not provide this information, as it is not trivially defined. Positions have changed a lot over the years: nowadays there seem to be mostly only 3 positions, whereas there used to be 5 distinguishable ones, and a player's position can moreover change from game to game and team to team. To obtain such data, there are predictors that predict or calculate the positions of players, usually per match. For that, I would have to fetch data from all matches and look at where each player plays most commonly. With this API, that is not really feasible: it would take too many requests and too much time, so we work without this attribute. (Positions: Point guard, Shooting guard, Small forward, Power forward, Center)
- Age: Hard to tell. As noted above, it doesn't correlate with height, but it correlates with performance, from which we predict height. This may matter especially for clustering. For the "attempts" statistics, however, I think age should not really matter.
- FG, FG3, FT: I would drop FT_PCT, as it is pure "throwing skill". FTA is not relevant, but FTM may be. For field goals, my thought was that "made" is more important than "attempted", as it suggests success, and the made/attempted ratio would indicate even better how successful the player is.
But one of the key things I wanted to know is whether "smaller" players tend to play differently from other players: do they shoot more from far away because they struggle to penetrate from close range, or is it exactly the opposite, and they use their agility to score from close range? For that, attempts per minute may be a better predictor. I decided to first use attempts alongside the made/attempted ratio; we may get results such as short players being less successful at scoring 2-pointers.
- REB: I would either drop OREB and DREB and keep REB(_PM) together with position, or keep only OREB and DREB. Since we don't have a position attribute, I keep OREB and DREB as indicators of "position". (As the results show, these turned out to be some of the most important attributes.) BLK: Keep, because some short players have 0 blocks even after many matches. TOV: Drop, I guess it is not really related to height. PF: Maybe relevant, so I keep it at first. STL: Not sure, but in the data its values span a very narrow range, so it is probably not a good predictor and I drop it.
- AST: Drop all. It is more of a skill/team stat. (Although might have some significance, like in soccer the small, fast, agile wingers cross the ball to the big, strong striker who can head it in.)
- PTS: Not needed, we have other points stats
- DRAFT: I'd skip. I thought it can find extreme cases like Giannis being 6'11, but he was only 15th pick.
- NET_RATING: Skip. USG: Undecided, TS: I'd keep.
- Anything else: skip.