Fix: Prevent memory from ballooning during post-training evaluation #3756
The previous implementation of this function modified the target and prediction tensors in place, which is incorrect. It didn't matter at training time because the modified tensors are exactly what we need to calculate metrics.
However, at prediction time we aggregate the prediction tensors, and the in-place modification gives them the wrong shape/size during aggregation. Specifically, in this code block: https://github.com/ludwig-ai/ludwig/blob/master/ludwig/models/predictor.py#L256
- Line 256 returns a dictionary with the right tensor shapes.
- Line 257 transforms the dictionary structure and preserves the right tensor shapes.
- Line 258 updates the metrics. This is where the issue arises: we realign the tensors to a common shape and return the modified tensors, but the realignment modifies the tensors inside the target and prediction dictionaries in place. When the targets are long, the tensor shapes go from:
- `output_predictions`: `(batch_size, max_new_tokens)`
- `output_probabilities`: `(batch_size, max_new_tokens, vocab_size)`

to:

- `output_predictions`: `(batch_size, max_tokens_in_target)`
- `output_probabilities`: `(batch_size, max_tokens_in_target, vocab_size)`

and `max_tokens_in_target` can be >> `max_new_tokens`. For example, `max_tokens_in_target` could be 300 for the output feature, but you may only want to generate the first 64 tokens, so you set `max_new_tokens` to 64.

This was happening beneath the surface, which is why it wasn't caught until now. It didn't matter at training time, but it definitely affects CPU memory at evaluation time: we accumulate the predictions from each batch and then concatenate the accumulated predictions into a single prediction tensor, so the result can be many multiples larger than the size we actually want (just `max_new_tokens` per row for evaluation).
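
To make the memory impact concrete, here is a minimal, self-contained sketch (the batch count, vocabulary size, and other numbers below are illustrative assumptions, not values from Ludwig) of how concatenating the inflated per-batch probability tensors multiplies the accumulated footprint:

```python
import torch

batch_size, num_batches, vocab_size = 4, 10, 1000
max_new_tokens, max_tokens_in_target = 64, 300

def accumulated_megabytes(seq_len: int) -> float:
    # Mimic the evaluation loop: collect one probabilities tensor per batch,
    # then concatenate everything into a single prediction tensor.
    batches = [
        torch.zeros(batch_size, seq_len, vocab_size) for _ in range(num_batches)
    ]
    merged = torch.cat(batches, dim=0)
    return merged.element_size() * merged.nelement() / 1e6

print(f"shapes as generated:   {accumulated_megabytes(max_new_tokens):.1f} MB")
print(f"after in-place resize: {accumulated_megabytes(max_tokens_in_target):.1f} MB")
# The second figure is max_tokens_in_target / max_new_tokens (~4.7x) larger here,
# and the gap grows with vocab_size, dataset size, and target length.
```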
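
The fix is to realign copies rather than the caller's tensors. A minimal sketch of the idea, assuming a simplified dictionary layout and a hypothetical helper name (the actual Ludwig function and signature differ):

```python
import torch.nn.functional as F

def realign_for_metrics(targets: dict, predictions: dict, pad_value: int = 0):
    """Pad predictions/targets to a common sequence length for metric computation.

    Works on clones so the caller's tensors keep their original shapes and can
    be safely accumulated and concatenated later.
    """
    # Rebuild the dicts with cloned tensors instead of mutating them in place.
    targets = {name: t.clone() for name, t in targets.items()}
    predictions = {name: t.clone() for name, t in predictions.items()}

    target_len = targets["output"].size(1)
    pred_len = predictions["output_predictions"].size(1)
    diff = target_len - pred_len
    if diff > 0:
        # Right-pad the predictions up to the (longer) target length.
        predictions["output_predictions"] = F.pad(
            predictions["output_predictions"], (0, diff), value=pad_value
        )
    elif diff < 0:
        # Right-pad the targets up to the (longer) prediction length.
        targets["output"] = F.pad(targets["output"], (0, -diff), value=pad_value)
    return targets, predictions
```

Because the clones, not the originals, are padded, the aggregated `output_predictions` stays at `(batch_size, max_new_tokens)` regardless of target length.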