Support lazy tensor allocation #193

Closed

Conversation

shengfukevin
Contributor

Summary:
Support lazy tensor allocation

The current algorithm for allocating tensors in et_replay is to find the tensors that cannot be generated while replaying the ops, pre-allocate them before the replay starts, and keep them across all iterations.

However, this algorithm leads to OOM when replaying the Llama4 70B model. This PR introduces TensorAllcationMode:

class TensorAllcationMode(Enum):
    """
    Enum to represent the tensor allocation mode
    """

    # Allocate input tensors that can not be generated when replaying the trace
    # at the beginning and reuse them for all iterations.
    PRE_ALLOCATE = 1

    # Allocate tensors on the fly and free them after they are out of scope
    LAZY_ALLOCATE = 2

In LAZY_ALLOCATE mode, tensors are kept in tensor_storage_map and tensor_registry, while replay_tensor_id_to_last_node_id_map and tensor_storage_id_to_last_node_id_map track the last node id that accesses each tensor and each tensor storage. Once the replay passes that last node, the tensor or tensor storage is deleted accordingly.
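
A rough sketch of this bookkeeping is shown below. The map names follow the description above, but the node structure, run() method, and allocate_tensor helper are simplified assumptions for illustration only, not the actual et_replay implementation:

def build_last_use_maps(nodes):
    """Map each tensor id and storage id to the last node id that touches it."""
    tensor_last_use = {}   # replay_tensor_id_to_last_node_id_map
    storage_last_use = {}  # tensor_storage_id_to_last_node_id_map
    for node in nodes:
        for tensor_id, storage_id in node.tensor_refs:
            tensor_last_use[tensor_id] = node.id
            storage_last_use[storage_id] = node.id
    return tensor_last_use, storage_last_use

def replay_lazy(nodes, allocate_tensor):
    tensor_last_use, storage_last_use = build_last_use_maps(nodes)
    tensor_registry = {}     # tensor_id -> tensor
    tensor_storage_map = {}  # storage_id -> backing storage
    for node in nodes:
        # Allocate on the fly: only materialize the tensors this node needs.
        for tensor_id, storage_id in node.tensor_refs:
            if tensor_id not in tensor_registry:
                tensor_registry[tensor_id] = allocate_tensor(
                    tensor_id, storage_id, tensor_storage_map
                )
        node.run(tensor_registry)
        # Free tensors and storages whose last accessing node has now been replayed.
        for tensor_id, storage_id in node.tensor_refs:
            if tensor_last_use[tensor_id] == node.id:
                tensor_registry.pop(tensor_id, None)
            if storage_last_use[storage_id] == node.id:
                tensor_storage_map.pop(storage_id, None)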

The diff also introduces another option, --device-memory-threshold. With LAZY_ALLOCATE, this option frees all tensors when the ratio of allocated device memory to total device memory exceeds device-memory-threshold. This keeps the replay running, at the cost of extra memory freeing and re-allocation. Llama4 7B does not need this option when the ET is captured with unique storage ids (https://www.internalfb.com/diff/D66849516).
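
The threshold check itself could be approximated as in the sketch below, using standard torch.cuda memory queries; the helper name and call site are assumptions, not the real et_replay option handling:

import torch

def maybe_free_all(tensor_registry, tensor_storage_map, device_memory_threshold):
    """Free every lazily allocated tensor once device memory use crosses the threshold."""
    allocated = torch.cuda.memory_allocated()
    total = torch.cuda.get_device_properties(torch.cuda.current_device()).total_memory
    if allocated / total > device_memory_threshold:
        tensor_registry.clear()
        tensor_storage_map.clear()
        # Return cached blocks to the device so the ratio actually drops; later
        # nodes re-allocate what they need, which is the overhead mentioned above.
        torch.cuda.empty_cache()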

This fixes the OOM issue in Llama4 70B.

Differential Revision: D66487952

@facebook-github-bot added the CLA Signed label on Dec 18, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

shengfukevin added a commit to shengfukevin/param that referenced this pull request Dec 19, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

shengfukevin added a commit to shengfukevin/param that referenced this pull request Dec 26, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

Reviewed By: sanrise

Differential Revision: D66487952
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D66487952

@facebook-github-bot
Contributor

This pull request has been merged in be232eb.

Labels: CLA Signed, fb-exported, Merged