Support lazy tensor allocation #193
Closed
Conversation
This pull request was exported from Phabricator. Differential Revision: D66487952

shengfukevin added a commit to shengfukevin/param that referenced this pull request on Dec 19, 2024
Summary: Support lazy tensor allocation. Differential Revision: D66487952
Force-pushed from 0991279 to 0f5e647
shengfukevin added a commit to shengfukevin/param that referenced this pull request on Dec 26, 2024
Summary: Support lazy tensor allocation. Differential Revision: D66487952
Force-pushed from 0f5e647 to 9a43a57
Summary: Support lazy tensor allocation. Reviewed By: sanrise. Differential Revision: D66487952
Force-pushed from 9a43a57 to f9c45d6
This pull request has been merged in be232eb.
Labels
CLA Signed
This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
fb-exported
Merged
Summary:
Support lazy tensor allocation
The current algorithm for allocating tensors in et_replay is to find the tensors that cannot be generated while the ops are replayed, pre-allocate them before replay starts, and keep them alive across iterations.
However, this algorithm leads to OOM when replaying the Llama4 70B model. This PR introduces TensorAllcationMode:
    class TensorAllcationMode(Enum):
        """
        Enum to represent the tensor allocation mode
        """

        # Allocate input tensors that can not be generated when replaying the trace
        # at the beginning and reuse them for all iterations.
        PRE_ALLOCATE = 1

        # Allocate tensors on the fly and free them after they are out of scope.
        LAZY_ALLOCATE = 2
In LAZY_ALLOCATE mode, tensors are kept in tensor_storage_map and tensor_registry, while replay_tensor_id_to_last_node_id_map and tensor_storage_id_to_last_node_id_map track the last node id that accesses each tensor and each tensor storage. Once replay passes that last node, the tensor or tensor storage is deleted.
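The last-node bookkeeping above can be sketched as follows. This is a minimal illustration, not the actual et_replay code: the node and tensor structures, and the `build_last_use_map`, `allocate`, and `execute` names, are hypothetical stand-ins for the real replay_tensor_id_to_last_node_id_map / tensor_storage_id_to_last_node_id_map machinery.

```python
def build_last_use_map(nodes):
    """Map each tensor id to the id of the last node that touches it."""
    last_use = {}
    for node in nodes:
        for t_id in node["inputs"] + node["outputs"]:
            last_use[t_id] = node["id"]  # later nodes overwrite earlier ones
    return last_use

def replay(nodes, allocate, execute):
    """Replay nodes, allocating input tensors lazily and freeing each tensor
    as soon as the last node that accesses it has run."""
    last_use = build_last_use_map(nodes)
    tensor_registry = {}  # tensor id -> live tensor
    for node in nodes:
        # LAZY_ALLOCATE: materialize inputs only when first needed.
        for t_id in node["inputs"]:
            if t_id not in tensor_registry:
                tensor_registry[t_id] = allocate(t_id)
        execute(node, tensor_registry)
        # Free tensors whose last accessing node has now run.
        for t_id in list(tensor_registry):
            if last_use.get(t_id) == node["id"]:
                del tensor_registry[t_id]
    return tensor_registry
```

With this scheme a tensor's lifetime is bounded by its last accessor, so peak memory tracks the working set of the trace rather than the full set of pre-allocated inputs.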
The diff also introduces a new option, --device-memory-threshold. With LAZY_ALLOCATE, all tensors are freed when the ratio of allocated device memory to total device memory exceeds this threshold. This keeps replay running, at the cost of extra freeing and re-allocation. Llama4 7B does not need this option when the ET is captured with unique storage ids (https://www.internalfb.com/diff/D66849516)
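The threshold check reduces to a ratio test. The sketch below is an assumed helper, not the actual implementation; in a CUDA replay loop the byte counts would come from torch.cuda.memory_allocated() and torch.cuda.get_device_properties(device).total_memory, which are passed in here so the logic stays self-contained.

```python
def over_memory_threshold(allocated_bytes, total_bytes, threshold):
    """Return True when allocated/total device memory exceeds the
    --device-memory-threshold ratio."""
    return allocated_bytes / total_bytes > threshold

def maybe_flush(tensor_registry, allocated_bytes, total_bytes, threshold=0.8):
    """Free every cached tensor once the threshold is crossed, so replay can
    continue at the cost of re-allocating tensors later (hypothetical helper;
    on CUDA one would also call torch.cuda.empty_cache() afterwards)."""
    if over_memory_threshold(allocated_bytes, total_bytes, threshold):
        tensor_registry.clear()
        return True
    return False
```

Flushing everything is coarse but safe: any tensor still needed is simply re-allocated on its next access, which is the freeing/allocating overhead the summary mentions.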
This fixed the OOM issue in Llama4 70B.
Differential Revision: D66487952