
Fix: Prevent memory from ballooning during post-training evaluation #3756

Merged
arnavgarg1 merged 5 commits into master from fix_memory_issue on Oct 25, 2023

Conversation

arnavgarg1 (Contributor) commented Oct 25, 2023

The previous implementation of this function modified the target and prediction tensors in place, which is incorrect. It didn't matter at training time because the modified tensors are exactly what we need to calculate metrics.

However, at prediction time we aggregate the prediction tensors, and the in-place modification gives them the wrong shape/size during aggregation. Specifically, in this code block: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/models/predictor.py#L256

  • Line 256 returns a dictionary with the right tensor shapes

  • Line 257 transforms the dictionary structure and preserves the right tensor shapes

  • Line 258 updates the metrics. The issue is that we realign the tensors to a common shape for metric computation and then return the modified tensors, but that realignment modifies the tensors inside the target and prediction dictionaries in place. When the targets are long, the tensor shapes go from:

    • output_predictions: (batch_size, max_new_tokens)
    • output_probabilities: (batch_size, max_new_tokens, vocab_size)
      To
    • output_predictions: (batch_size, max_tokens_in_target)
    • output_probabilities: (batch_size, max_tokens_in_target, vocab_size)

And max_tokens_in_target can be >> max_new_tokens. For example, max_tokens_in_target could be 300 for the output feature, but you may only want to generate the first 64 tokens, so you set max_new_tokens to 64.
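
To make the size difference concrete, here is a rough back-of-the-envelope calculation; the row count and vocabulary size below are made-up numbers for illustration, not values from this PR:

```python
# Hypothetical sizes purely for illustration -- only max_new_tokens=64 and
# max_tokens_in_target=300 come from the example above.
num_rows = 1_000           # total evaluation rows (assumed)
vocab_size = 32_000        # assumed vocabulary size
bytes_per_float = 4        # float32 probabilities

max_new_tokens = 64
max_tokens_in_target = 300

intended_gb = num_rows * max_new_tokens * vocab_size * bytes_per_float / 1e9
ballooned_gb = num_rows * max_tokens_in_target * vocab_size * bytes_per_float / 1e9

# The accumulated probabilities tensor grows linearly with the sequence length,
# so realigning to max_tokens_in_target inflates it ~4.7x in this example.
print(f"intended: {intended_gb:.1f} GB, after in-place realignment: {ballooned_gb:.1f} GB")
# intended: 8.2 GB, after in-place realignment: 38.4 GB
```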

This was happening beneath the surface, which is why it wasn't caught until now. It doesn't matter at training time, but it has a real effect on CPU memory at evaluation time: we accumulate the predictions from every batch and then concatenate them into a single prediction tensor, so the result can be many times larger than the size we actually want (just max_new_tokens per row for evaluation).
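
For reference, the shape of the fix is roughly this: pad copies of the tensors to a common length for metric computation instead of mutating the tensors stored in the targets/predictions dictionaries. The sketch below only illustrates the idea; the function and argument names are assumptions, not Ludwig's actual API or the exact change in this PR:

```python
# Illustrative sketch only: names here are assumptions, not Ludwig's actual helpers.
import torch
import torch.nn.functional as F


def realigned_copies(targets: torch.Tensor, predictions: torch.Tensor, pad_value: int = 0):
    """Return right-padded copies of (batch, seq_len) tensors aligned to a common length.

    The point of the fix: the caller's tensors (and therefore the predictions
    dictionary they live in) are left untouched, so aggregated predictions keep
    their original (batch_size, max_new_tokens) shape.
    """
    common_len = max(targets.size(1), predictions.size(1))
    # clone() before padding guarantees copy semantics even when no padding is needed.
    padded_targets = F.pad(targets.clone(), (0, common_len - targets.size(1)), value=pad_value)
    padded_predictions = F.pad(predictions.clone(), (0, common_len - predictions.size(1)), value=pad_value)
    return padded_targets, padded_predictions
```

The metrics still see length-aligned tensors, while the aggregation path keeps its (batch_size, max_new_tokens) tensors, so the post-evaluation concatenation stays at its intended size.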

@arnavgarg1 changed the title from "Fix: Prevent memory ballooning during evaluate" to "Fix: Prevent memory from ballooning during post-training evaluation" on Oct 25, 2023
github-actions bot commented Oct 25, 2023

Unit Test Results

  • 6 files ±0, 6 suites ±0 — 21m 8s ⏱️ (-30s)
  • 12 tests ±0: 9 ✔️ ±0, 3 💤 ±0, 0 ±0
  • 60 runs ±0: 42 ✔️ ±0, 18 💤 ±0, 0 ±0

Results for commit c7070e1. ± Comparison against base commit 8b423f1.

♻️ This comment has been updated with latest results.

@arnavgarg1 marked this pull request as ready for review October 25, 2023 19:05
@arnavgarg1 merged commit a5628f3 into master Oct 25, 2023
17 checks passed
@arnavgarg1 deleted the fix_memory_issue branch October 25, 2023 21:00