Support lazy tensor allocation #193

Closed

Conversation

shengfukevin
Contributor

Summary:
Support lazy tensor allocation

The current algorithm for allocating tensors in et_replay is to find the tensors that cannot be generated while replaying the ops, pre-allocate them before the replay starts, and keep them across all iterations.

However, this algorithm leads to OOM when replaying the Llama4 70B model. This PR introduces TensorAllcationMode:

class TensorAllcationMode(Enum):
    """
    Enum to represent the tensor allocation mode
    """

    # Allocate input tensors that can not be generated when replaying the trace
    # at the beginning and reuse them for all iterations.
    PRE_ALLOCATE = 1

    # Allocate tensors on the fly and free them after they are out of scope
    LAZY_ALLOCATE = 2

In LAZY_ALLOCATE mode, tensors are kept in tensor_storage_map and tensor_registry, while replay_tensor_id_to_last_node_id_map and tensor_storage_id_to_last_node_id_map track the last node id that accesses each tensor and each tensor storage. Once the replay passes that last node, the tensor or tensor storage is deleted accordingly.
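
A rough sketch of this bookkeeping is shown below. The map names follow the description above, but the node structure, run() method, and allocate_tensor helper are simplified assumptions for illustration only, not the actual et_replay implementation:

def build_last_use_maps(nodes):
    """Map each tensor id and storage id to the last node id that touches it."""
    tensor_last_use = {}   # replay_tensor_id_to_last_node_id_map
    storage_last_use = {}  # tensor_storage_id_to_last_node_id_map
    for node in nodes:
        for tensor_id, storage_id in node.tensor_refs:
            tensor_last_use[tensor_id] = node.id
            storage_last_use[storage_id] = node.id
    return tensor_last_use, storage_last_use

def replay_lazy(nodes, allocate_tensor):
    tensor_last_use, storage_last_use = build_last_use_maps(nodes)
    tensor_registry = {}     # tensor_id -> tensor
    tensor_storage_map = {}  # storage_id -> backing storage
    for node in nodes:
        # Allocate on the fly: only materialize the tensors this node needs.
        for tensor_id, storage_id in node.tensor_refs:
            if tensor_id not in tensor_registry:
                tensor_registry[tensor_id] = allocate_tensor(
                    tensor_id, storage_id, tensor_storage_map
                )
        node.run(tensor_registry)
        # Free tensors and storages whose last accessing node has now been replayed.
        for tensor_id, storage_id in node.tensor_refs:
            if tensor_last_use[tensor_id] == node.id:
                tensor_registry.pop(tensor_id, None)
            if storage_last_use[storage_id] == node.id:
                tensor_storage_map.pop(storage_id, None)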

The diff also introduces another option, --device-memory-threshold. With LAZY_ALLOCATE, this option frees all tensors when the ratio of allocated device memory to total device memory exceeds device-memory-threshold. This keeps the replay running, at the cost of extra memory freeing and re-allocation. Llama4 7B does not need this option when the ET is captured with unique storage ids (https://www.internalfb.com/diff/D66849516).
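
The threshold check itself could be approximated as in the sketch below, using standard torch.cuda memory queries; the helper name and call site are assumptions, not the real et_replay option handling:

import torch

def maybe_free_all(tensor_registry, tensor_storage_map, device_memory_threshold):
    """Free every lazily allocated tensor once device memory use crosses the threshold."""
    allocated = torch.cuda.memory_allocated()
    total = torch.cuda.get_device_properties(torch.cuda.current_device()).total_memory
    if allocated / total > device_memory_threshold:
        tensor_registry.clear()
        tensor_storage_map.clear()
        # Return cached blocks to the device so the ratio actually drops; later
        # nodes re-allocate what they need, which is the overhead mentioned above.
        torch.cuda.empty_cache()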

This fixes the OOM issue in Llama4 70B.

Differential Revision: D66487952

@facebook-github-bot added the CLA Signed label on Dec 18, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

shengfukevin added a commit to shengfukevin/param that referenced this pull request Dec 19, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

shengfukevin added a commit to shengfukevin/param that referenced this pull request Dec 26, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

Reviewed By: sanrise

Differential Revision: D66487952
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

@facebook-github-bot
Contributor

This pull request has been merged in be232eb.

Labels: CLA Signed, fb-exported, Merged