aac.amd: MI210 - roberta-large with sequence length 8192 and batch_size 1 fails #46

Open
michaelfeil opened this issue Feb 28, 2024 · 0 comments

Problem Description

MI210 - roberta-large fails with sequence length 8192 and batch_size 1

The failure occurs on both torch 2.2.0 and the torch 2.3.0-20240222 nightly.

Operating System

OS: NAME="Ubuntu" VERSION="22.04.3 LTS (Jammy Jellyfish)"

CPU

model name : AMD EPYC 7763 64-Core Processor

GPU

AMD Instinct MI210

ROCm Version

ROCm 5.7.1

ROCm Component

HIP

Steps to Reproduce

Running the following modeling code with batch_size 1-8 and ctx length 4096 works, but I get a memory dump with batch_size 8 and ctx length 8192 on MI210. A minimal driver sketch follows the modeling code below.

from transformers.models.xlm_roberta.modeling_xlm_roberta import (
    XLMRobertaModel,
    XLMRobertaForMaskedLM,
    XLMRobertaForSequenceClassification,
    XLMRobertaClassificationHead,
    XLMRobertaForMultipleChoice,
    XLMRobertaForTokenClassification,
    XLMRobertaForQuestionAnswering,
)
from flash_attn import flash_attn_func
import torch.nn as nn
import torch
from typing import Optional, Tuple


# Copied from transformers.models.roberta.modeling_roberta.RobertaSelfAttention with Roberta->FlashRoberta
class FlashRobertaSelfAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout_rate = config.attention_probs_dropout_prob

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        # Reshape to (batch, seq_len, num_heads, head_dim) without permuting;
        # this is the layout flash_attn_func expects.
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        return x.view(new_x_shape)
    
    @staticmethod
    @torch.cuda.amp.custom_fwd(cast_inputs=torch.bfloat16)
    def _flash(query_layer, key_layer, value_layer, dropout_p,
               softmax_scale=None, causal=False, return_attn_probs=True):
        return flash_attn_func(query_layer, key_layer, value_layer,
                               dropout_p=dropout_p,
                               softmax_scale=softmax_scale,
                               causal=causal,
                               return_attn_probs=return_attn_probs)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))
        query_layer = self.transpose_for_scores(self.query(hidden_states))
 
        orig_dtype = hidden_states.dtype
        if not torch.is_grad_enabled():
            # flash_attn only supports fp16/bf16 inputs, so cast manually for inference
            key_layer = key_layer.to(torch.bfloat16)
            value_layer = value_layer.to(torch.bfloat16)
            query_layer = query_layer.to(torch.bfloat16)

        # Flash Attention: with return_attn_probs=True, flash_attn_func returns
        # (output, softmax_lse, S_dmask)
        context_layer, _, attention_probs = self._flash(
            query_layer, key_layer, value_layer,
            dropout_p=self.dropout_rate,
            softmax_scale=None,
            causal=False,
            return_attn_probs=True,
        )

        # Merge heads
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)
        
        if not torch.is_grad_enabled():
            context_layer = context_layer.to(orig_dtype)
        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        
        return outputs


class FlashXLMRobertaModel(XLMRobertaModel):
    """

    XLMRobertaModel with its self-attention layers swapped out for FlashRobertaSelfAttention,
    which computes attention with flash_attn as described in
    *FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning* by Tri Dao.

    .. _*FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning*: https://tridao.me/publications/flash2/flash2.pdf

    """

    _keys_to_ignore_on_load_missing = [r"position_ids"]

    # Copied from transformers.models.roberta.modeling_roberta.RobertaModel.__init__ with Roberta->FlashRoberta
    def __init__(self, config, add_pooling_layer=True):
        super().__init__(config)

        # Replace legacy RobertaSelfAttention with FlashRobertaSelfAttention
        for attention_layer in self.encoder.layer:
            attention_layer.attention.self = FlashRobertaSelfAttention(config)

        # Initialize weights and apply final processing
        self.post_init()


class FlashRobertaForMaskedLM(XLMRobertaForMaskedLM):
    _keys_to_ignore_on_save = [r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
    _keys_to_ignore_on_load_missing = [r"position_ids", r"lm_head.decoder.weight", r"lm_head.decoder.bias"]
    _keys_to_ignore_on_load_unexpected = [r"pooler"]

    def __init__(self, config):
        super().__init__(config)

        # Replace legacy RobertaModel with FlashXLMRobertaModel
        self.roberta = FlashXLMRobertaModel(config, add_pooling_layer=False)

        # Initialize weights and apply final processing
        self.post_init()


class FlashRobertaForSequenceClassification(XLMRobertaForSequenceClassification):
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.config = config

        self.roberta = FlashXLMRobertaModel(config, add_pooling_layer=False)
        self.classifier = XLMRobertaClassificationHead(config)

        # Initialize weights and apply final processing
        self.post_init()


class FlashRobertaForMultipleChoice(XLMRobertaForMultipleChoice):
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def __init__(self, config):
        super().__init__(config)

        self.roberta = FlashXLMRobertaModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, 1)

        # Initialize weights and apply final processing
        self.post_init()


class FlashRobertaForTokenClassification(XLMRobertaForTokenClassification):
    _keys_to_ignore_on_load_unexpected = [r"pooler"]
    _keys_to_ignore_on_load_missing = [r"position_ids"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.roberta = FlashXLMRobertaModel(config, add_pooling_layer=False)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()
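
For reference, a minimal driver sketch of how the failing configuration can be invoked. This is not the exact script behind the report: the randomly initialized config (sized like roberta-large and extended to 8192+ positions), the random token ids, and the bf16 single-GPU setup are illustrative assumptions.

from transformers import XLMRobertaConfig

# Hypothetical repro driver: random weights and random input ids, bf16 inference on one GPU.
config = XLMRobertaConfig(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=8192 + 2,  # leave room for the padding-offset position ids
)
model = FlashRobertaForMaskedLM(config).to("cuda", dtype=torch.bfloat16).eval()

batch_size, seq_len = 1, 8192  # the failing configuration from the title
input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len), device="cuda")

with torch.no_grad():
    out = model(input_ids=input_ids)
print(out.logits.shape)  # expected: torch.Size([1, 8192, vocab_size])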

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

$ /opt/rocm/bin/rocminfo --support
ROCk module is loaded

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: AMD EPYC 7763 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7763 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2450
BDFID: 0
Internal Node ID: 0
Compute Unit: 64
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 263883024(0xfba8910) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 263883024(0xfba8910) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 263883024(0xfba8910) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: AMD EPYC 7763 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7763 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2450
BDFID: 0
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 263885972(0xfba9494) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 263885972(0xfba9494) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 263885972(0xfba9494) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 3


Name: gfx90a
Uuid: GPU-8c5cf33e5f4ce93f
Marketing Name:
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 768
Internal Node ID: 2
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 78
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Float Round Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32

Additional Information
