NBA Height Prediction from Performance

Machine learning models used to predict the height of basketball players, from playstyle and performance statistics. Data collected using the nba_api package, of the previous 20 seasons.

Models: Gradient boosted regressors and classifiers, random forests
Data analysis: Clustering, preparing clean data for better performance.

The results show, that two regressors (one gradient boosted, and a random forest regressor) can predict the height of a player within 1 inch 50% of the time. Moreover, on the data the latter model predicted inside the (-7, 5) inch error range for every player; predicting inside +-2 inches for 72.9% of the players, and +-3 inches for 86.4% of the players.

Preprocessing, analysis, prediction:: inside the ml.ipynb notebook
Data collection, some cleaning: described in fetch_players.ipynb.
Data is stored in the data folder, the combined data for training in the data_combined.csv file.

The pipeline picture:

A summary of the best models:

Model	Accuracy Score	+/-1 inch %	+/-2 inch %	+/-3 inch %	F1 Score	Error Range
Random forest classifier	17.6%	45.3%	70.3%	83.7%	0.155	-8 , 6
Ensembled (manually biased) forest	16.2%	49.0%	71.3%	85.4%	0.149	-8 , 5
Best RF regressor	-	50.0%	72.9%	86.4%	-	-7 , 5
Best gradient boosted regressor	-	49.9%	69.1%	84.5%	-	-8 , 6

How..?

How to run?

I recommend to just download (clone) the whole repository.

After downloading the repository, in your command line shell go to this directory, choose/create a Python/conda/virtualenv environment and run pip install -r requirements.txt.(The dependencies are listed both in the requirements.in and the requirements.txt files.)
Then just run the ml.ipynb notebook cells in some Jupyter environment.

If you want to fetch the data yourself (which is already fetched and stored in the data folder), you additionally need to run pip install nba_api to use the package, in the fetch_players.ipynb notebook.

How is the data collected and processed?

Collected using the nba_api package, from the NBA stats website.

The NBA website has a statistics subpage (and other subpages for other data) containing data about team statistics and player statistics, including biography (height, weight, age etc.) and career stats. There are multiple APIs provided by the website, but conviniently the nba_api package provides one Python interface to these APIs. The data is collected using this package, as shown in the fetch_players.ipynb notebook.
The separate career (and filtered career) and biographical datasets are stored in the respective csv files.

To rightly train models, data is combined, processed and saved (in the general form for most models) in the data_combined.csv file. This procedure is described in the ml.ipynb notebook. (An example step is to normalize statistics by minutes played to get more meaningful features, or removing players who played less than 3 matches time in a season.)

Typical instances of the final dataframe are:

PLAYER_ID	SEASON_ID	AGE	PLAYER_HEIGHT_INCHES	PLAYER_WEIGHT	OREB_PCT	DREB_PCT	TS_PCT	MIN	FGA_PM	FG_PCT	FG3A_PM	FG3_PCT	FTM_PM	BLK_PM	PF_PM
1630639	2022-23	22	78	179	0.046	0.152	0.589	217	0.40553	0.662667	0.230415	0.266667	0.018433	0.0	0.101382
1631260	2022-23	23	77	190	0.016	0.105	0.607	345	0.362319	0.424	0.304348	0.419	0.011594	0.0	0.089855
203932	2021-22	27	80	235	0.086	0.136	0.617	2055	0.370316	0.564	0.084185	0.347	0.092944	0.024818	0.062774

We predict the PLAYER_HEIGHT_INCHES from the other attributes. As one can see from the table, aside defensive and shooting statistics there are assist statistics as well. The key predictors are the rebounds, and personal fouls - a higher value of these generally predicts higher height value.

The explanation of the attributes is given in the Explanation of attributes section below, and in the ml.ipynb notebook.

How were the models chosen and trained?

Random forests, gradient boosted models, and specially ensembling forests were chosen, and a tree for comparison. Training and test data is split in a way, that every player appears in only one of the datasets.

As instances of data (players) are rare, noisy (players play very different regardless of height) but attributes are common, more basic models are good choices such as random forests. For improved results, gradient boosted models are also trained. These models however don't have enough data and computing power (e.g. hyperparameter tuning) to perform as well as they could, and typically performed only comparably when initialized with the random forest models.
(Random forests cannot enable indirect relationships between the target and the features, but gradient boosted models can learn these. For example, age clearly doesn't correlate with height. However, a player's performance strongly develops with age, and we predict height from performance - therefore the age attribute could indeed indirectly be useful.)

The models:

Decision tree: simple model, hyperparameter tuned - just for comparison
Random forest classifier: hyperparameter tuned
Ensemble forests: A simple random forest-like model (most common choice, "argmax") and a model that averages the predictions of the trees. (The latter is more biased, hence less common, but does better in most important metrics here.)
Random forest regressor: using the same hyperparameters as the classifier
Gradient boosted classifiers: Both hyperparameter tuned, and not tuned. Among the untuned models, the strongest predictor was initialized with the random forest classifier.
Gradient boosted regressors: initialized with the random forest regressor.

Training and test data is split in a way, that every player appears in only one of the datasets (players have multiple instances, because of multiple seasons, etc.).
To show the strength of this organization of instances, two models were also trained on shuffled data. The results showed it is better to train the models no player appearing in both train and test datasets.

How do the models compare?

There are multiple metrics to compare the models.

Accuracy: The percentage of correct predictions. (Only for classifiers.)
+-1 inch accuracy, +-2, +-3: The percentage of predictions that are within 1 inches of the true height.
F1 score: The harmonic mean of precision and recall. (Only for classifiers.)

How do attributes correlate with height? What do the results show?

According to the random forest classifier SHAP values, this is the importance (impact, Shapley value) ranking of the attributes:

As we can see, both age and minutes played matter the least, considerably less than any other attribute - indicating that they likely do not matter at all.
However, random forests can only model direct relationships, not indirect ones - and we know that age (and minutes) correlates with performance, which we use to predict height. Hence these attributes for gradient boosted models could be more important than showcased here, and may be utilized better for better predictions.

The rebound attributes are most important, followed by the personal fouls, these stand out among the other attributes. This proved our initial hypothesis, formed from the data exploration and analysis:

There is a steady increase in rebounds with height, these were clear indicators that the rebound statistics are good predictors of height.

Results table

Model	Data	Accuracy Score	+/-1 inch %	+/-2 inch %	+/-3 inch %	F1 Score	Error Range
Decision tree	All	15.7%	42.8%	67.7%	81.3%	0.154	-8 , 7
Random forest classifier	All	17.6%	45.3%	70.3%	83.7%	0.155	-8 , 6
Ensemble forest 1	All	15.5%	44.9%	70.2%	83.0%	0.137	-8 , 6
Ensemble forest averaging	All	16.2%	49.0%	71.3%	85.4%	0.149	-8 , 5
Random forest regressor	All	-	50.0%	72.9%	86.4%	-	-7 , 5
Gradient boosted classifier, no tuning, previous parameters	All	17.1%	44.5	64.9%	78.9%	0.169	-14 , 9
Gradient boosted classifier, with hyperparameter tuning	All	15.2%	41.6%	65.7%	79.3%	0.131	-8 , 12
Gradient boosted classifier, with tuning, initialized from RF	All	15.8%	43.5%	67.4%	81.3%	0.148	-8 , 6
Gradient boosted regressor (from the RF model)	All	-	49.9%	69.1%	84.5%	-	-8 , 6
Gradient boosted classifier, with shuffled data	Shuffled	8.8%	28.1%	43.4%	59.8%	0.087	-14 , 11
Gradient boosted regressor, with shuffled data	Shuffled	-	38.6%	67.1%	82.8%	-	-12 , 6

The gradient boosted methods underperformed likely because of lack of data, and the scikit-learn implementation not being state of the art (like XGBoost).
Classifiers: In accuracy, only are used for evaluation, and the random forest classifier does the best with 17.6% (and had the highest F-score), only the gradient boosted classifier initialized from its parameters manages to go above 17% accuracy.
The averaging ensemble forest method however had the best scores in all other categories (+-N inch ranges, error range), and came close to the best regressor method in all categories.

Regressors: We find that random forest regressor works, as the only model hitting 50% accuracy in the +-1 inch range. It also hits the highest score in all other categories: 72.9% +-2 inch accuracy, 86.4% +-3 inch accuracy, and -7,5 error range.
Among the gradient boosted regressors, only the regressor initialized from the random forest does comparably, with 49.9% accuracy in +-1 inch range.

This means, that the random forest regressor can predict the height of a player within 1 inch half of the time. This is a strong result for such a noisy problem.

We also see that it is better to train the model with players not appearing in both train and test datasets.

Outdated summary pdf file for overall explanation.

Link on my personal website: Projects, NBA Height Prediction (to be added).

Attributes: Explanation & importance

Attribute	Meaning	Extra info
MIN	minutes
FG, FG3, FT	Field goal (2 point goals), field goal 3 pointers, free throws ("penalty throws", 1 point)-	M: Made (scored), A: attempted, PCT: percentage
REB	Rebounds	OREB/DREB: Offensive/defensive rebound
AST	Assists	AST_PCT: Assist percentage
STL	Steals
BLK	Blocks
TOV	Turnovers	Negative statistic
PF	Personal fouls	Negative statistic
PTS	Points
---	--------------------------------------------------------------------	--------------------------------------------------------------------
DRAFT	Each year before the season starts, 60(?) college newcomers can be drafted by teams	More "promising" players are drafted in earlier rounds
GP, GS	Games played, games started
NET_RATING	Offensive rating - defensive rating	For a player: measure for how many goals a team scores with him vs. allow in. Can be negative.
USG_PCT	Usage percentage	Estimate of percentage of team plays
TS_PCT	True shooting percentage	Measures how well a player shoots-

What attributes are most important?

Here is my explanation of rating attributes one by one (pre-analysis and model building):

Position: There is no attribute for position. Otherwise, it could play a significant role in statistics, and could "suggest" the player's physical attributes - it'd be a very valuable predictor. The absence reason is because the NBA does not provide this information as it is not trivially defined. Positions changed a lot throughout the years, nowadays there seems to be mostly 3 positions only whereas it used to be 5 distinguishable positions, moreover it can change game to game and team to team. To obtain such data, there are some predictors that predict the position of players, or calculate them, usually on matches. For that, I'd have to fetch it from all matches and look at where they play most commonly. In the case of this API, this is not really feasible. It would take too much requests and time, so we work without this attribute. (Positions: Point guard, Shooting guard, Small forward, Power forward, Center)
Age: Hard to tell. As I said, it doesn't correlate with height, but it correlates with performance on which we predict height. Especially for clustering. I think for "attempts" stats, age should not really matter however.
FG, FG3, FT: I would drop FT_PCT, it's "throwing skill". FTA is not relevant, but FTM may be. For field goals, my thought was "made" is more important than attempt as it suggests success, and made/attempt ratio would suggest even better how successful the player is.
But one of the key ideas I wanted to know is whether "smaller" players tend to play different than other players: do they throw more from far away, as they struggle to penetrate from close, or it's exactly the opposite: they use their agility to score from close? For that, attempts per minute may be a better predictor. I decided on first using attempts alongside made/attempt ratio, we may get some results like short players are less successful in scoring 2-pointers.
REB: I would either drop OREB, DREB, and keep REB(_PM) with position, or keep only OREB+DREB. Since we don't have a position attribute, I keep OREB and DREB as indicators on "position". (As we see from the results, these turned out to be some of the most important attributes). BLK: keep, because some short players even have 0 blocks after many matches. TOV: drop, I guess not really relevant with height, PF: maybe relevant I'll keep it at first. STL: Not sure, but from the data it spans really thin, probably not a good predictor so I drop.
AST: Drop all. It is more of a skill/team stat. (Although might have some significance, like in soccer the small, fast, agile wingers cross the ball to the big, strong striker who can head it in.)
PTS: Not needed, we have other points stats
DRAFT: I'd skip. I thought it can find extreme cases like Giannis being 6'11, but he was only 15th pick.
NET_RATING: Skip. USG: Undecided, TS: I'd keep.
Anything else: skip.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

NBA Height Prediction from Performance

How..?

How to run?

How is the data collected and processed?

How were the models chosen and trained?

How do the models compare?

How do attributes correlate with height? What do the results show?

Results table

Attributes: Explanation & importance

What attributes are most important?

Files

README.md

Latest commit

History

README.md

File metadata and controls

NBA Height Prediction from Performance

How..?

How to run?

How is the data collected and processed?

How were the models chosen and trained?

How do the models compare?

How do attributes correlate with height? What do the results show?

Results table

Attributes: Explanation & importance

What attributes are most important?