Fix: Prevent memory from ballooning during post-training evaluation #3756
The previous implementation of this function modified the target and prediction tensors in place, which is incorrect. It didn't matter at training time because the modified tensors are exactly what we need to calculate metrics.
However, at prediction time we aggregate the prediction tensors, and the in-place modification gives them the wrong shape/size during aggregation. Specifically, in this code block: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/models/predictor.py#L256
- Line 256 returns a dictionary with the right tensor shapes.
- Line 257 transforms the dictionary structure and preserves the right tensor shapes.
- Line 258 updates the metrics. This is where the issue arises: we realign the tensors to a common shape and return the modified tensors, but the realignment modifies the tensors inside the target and prediction dictionaries in place. When the targets are long, the tensor shapes go from:
- `output_predictions`: `(batch_size, max_new_tokens)`
- `output_probabilities`: `(batch_size, max_new_tokens, vocab_size)`

to:

- `output_predictions`: `(batch_size, max_tokens_in_target)`
- `output_probabilities`: `(batch_size, max_tokens_in_target, vocab_size)`

and `max_tokens_in_target` can be >> `max_new_tokens`. For example, `max_tokens_in_target` could be 300 for the output feature, but you may only want to generate the first 64 tokens, so you set `max_new_tokens` to 64.

This was happening beneath the surface, which is why it wasn't caught until now. It didn't matter at training time, but it definitely affects CPU memory at evaluation time: we accumulate the predictions from each batch and then concatenate the accumulated predictions into a single prediction tensor, so the result can be many multiples larger than the size we actually want (just `max_new_tokens` per row for evaluation).
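
To make the memory impact concrete, here is a minimal, self-contained sketch (the batch count, vocabulary size, and other numbers below are illustrative assumptions, not values from Ludwig) of how concatenating the inflated per-batch probability tensors multiplies the accumulated footprint:

```python
import torch

batch_size, num_batches, vocab_size = 4, 10, 1000
max_new_tokens, max_tokens_in_target = 64, 300

def accumulated_megabytes(seq_len: int) -> float:
    # Mimic the evaluation loop: collect one probabilities tensor per batch,
    # then concatenate everything into a single prediction tensor.
    batches = [
        torch.zeros(batch_size, seq_len, vocab_size) for _ in range(num_batches)
    ]
    merged = torch.cat(batches, dim=0)
    return merged.element_size() * merged.nelement() / 1e6

print(f"shapes as generated:   {accumulated_megabytes(max_new_tokens):.1f} MB")
print(f"after in-place resize: {accumulated_megabytes(max_tokens_in_target):.1f} MB")
# The second figure is max_tokens_in_target / max_new_tokens (~4.7x) larger here,
# and the gap grows with vocab_size, dataset size, and target length.
```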
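
The fix is to realign copies rather than the caller's tensors. A minimal sketch of the idea, assuming a simplified dictionary layout and a hypothetical helper name (the actual Ludwig function and signature differ):

```python
import torch.nn.functional as F

def realign_for_metrics(targets: dict, predictions: dict, pad_value: int = 0):
    """Pad predictions/targets to a common sequence length for metric computation.

    Works on clones so the caller's tensors keep their original shapes and can
    be safely accumulated and concatenated later.
    """
    # Rebuild the dicts with cloned tensors instead of mutating them in place.
    targets = {name: t.clone() for name, t in targets.items()}
    predictions = {name: t.clone() for name, t in predictions.items()}

    target_len = targets["output"].size(1)
    pred_len = predictions["output_predictions"].size(1)
    diff = target_len - pred_len
    if diff > 0:
        # Right-pad the predictions up to the (longer) target length.
        predictions["output_predictions"] = F.pad(
            predictions["output_predictions"], (0, diff), value=pad_value
        )
    elif diff < 0:
        # Right-pad the targets up to the (longer) prediction length.
        targets["output"] = F.pad(targets["output"], (0, -diff), value=pad_value)
    return targets, predictions
```

Because the clones, not the originals, are padded, the aggregated `output_predictions` stays at `(batch_size, max_new_tokens)` regardless of target length.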