MultiHeadAttention Module #167
Inside the MultiHeadAttention module in chapter 3, can we convert the following code:
to
Does it change anything when calculating the final context vectors using contiguous()?
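The code snippets from the original post did not carry over. As a rough sketch of the kind of change being asked about, assuming it concerns the head-splitting reshape in MultiHeadAttention (the names and concrete dimensions below are illustrative, not the book's exact code):

```python
import torch

# Hypothetical shapes for illustration only; names mirror the chapter's conventions
b, num_tokens, num_heads, head_dim = 2, 4, 3, 5
x = torch.randn(b, num_tokens, num_heads * head_dim)

# Variant A (assumed to be the book's approach): split heads, then swap the token and head axes
keys_a = x.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

# Variant B (assumed to be the question's proposal): reshape straight to the target shape
keys_b = x.view(b, num_heads, num_tokens, head_dim)

print(keys_a.shape, keys_b.shape)    # both torch.Size([2, 3, 4, 5])
print(torch.equal(keys_a, keys_b))   # False: same shape, different element layout
```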
That's a great question. At first glance, it definitely looks like these should be relatively similar. However, reshaping (viewing) and transposing are not the same. E.g., you can try reshaping a small tensor with view and compare the result to transposing it.
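The snippets referenced in the reply were not preserved either; a minimal example in the same spirit, assuming a small 2x3 integer tensor, would be:

```python
import torch

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])

# Reshape (view): elements are re-read in row-major order
print(a.view(3, 2))
# tensor([[1, 2],
#         [3, 4],
#         [5, 6]])

# Transpose: rows and columns are swapped
print(a.transpose(0, 1))
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])
```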
Both will be 3x2 tensors, but the contents are arranged differently.

One might think the LLM might still pretrain fine. However, when you try to load the pretrained weights from OpenAI in chapter 5 or chapter 6, you will see that the outputs of the LLM will be garbled.

As an experiment, you can change the lines here (LLMs-from-scratch/ch06/01_main-chapter-code/previous_chapters.py, lines 85 to 94 in commit 451a629) and then run the chapter 6 code. You will see that the response will be nonsensical after the modification (below I am showing the original text so you can see which code I mean):