Help Understanding Beam Search Scores in Hugging Face (LLaMA + LoRA) #35618

Open · 2 of 4 tasks
pratcooper opened this issue Jan 10, 2025 · 0 comments

pratcooper commented Jan 10, 2025

System Info

Hello Hugging Face community,

I’m working with a LLaMA-based model that has a LoRA (Low-Rank Adaptation) adapter applied, and I’m using beam search in Transformers. I’m trying to debug how the final beam scores are computed, because the step-by-step log probabilities I print out are far more negative than the final “sequence score” reported by Hugging Face.

Below is a sample of my debug output for 4 beams, each showing:

Generated Sequence (token IDs, excluding the prompt/input).
Generated Text (decoded).
Step-by-Step Analysis: Each newly generated token’s log probability.
HF Cumulative Sequence Score (final beam score from generation_output.sequences_scores).
Debug Info (lengths, how many log-prob steps were used vs. available).

=== HuggingFace Beam Analysis (Generated Tokens Only) ===
Input sequence length: 148

--- Beam 1 ---
Generated Sequence (IDs): [32, 3202, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
Generated Text: AUP

Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-28.383789
Step 3: Token='' (ID=128001), LogProb=-32.667973

Final Scores:
HF Cumulative Sequence Score: -0.247081

--- Beam 2 ---
Generated Sequence (IDs): [51154, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
Generated Text: Others

Step-by-Step Analysis:
Step 1: Token='Others' (ID=51154), LogProb=-0.647490
Step 2: Token='' (ID=128001), LogProb=-29.399292

Final Scores:
HF Cumulative Sequence Score: -0.323745

--- Beam 3 ---
Generated Sequence (IDs): [32, 3202, 320, 6546, 1428, 11, 10984, 49541, 13, 15388, 3298, 8, 128001]
Generated Text: AUP (CSAM, Encourg. Illegal Act)

Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-20.869020
Step 3: Token=' (' (ID=320), LogProb=-9.416358
Step 4: Token='CS' (ID=6546), LogProb=-19.269587
Step 5: Token='AM' (ID=1428), LogProb=-23.486216
Step 6: Token=',' (ID=11), LogProb=-10.883574
Step 7: Token=' Enc' (ID=10984), LogProb=-0.144973
Step 8: Token='ourg' (ID=49541), LogProb=-0.001301
Step 9: Token='.' (ID=13), LogProb=-0.001659
Step 10: Token=' Illegal' (ID=15388), LogProb=-20.425816
Step 11: Token=' Act' (ID=3298), LogProb=-14.907486
Step 12: Token=')' (ID=8), LogProb=-0.150186
Step 13: Token='' (ID=128001), LogProb=-17.213655

Final Scores:
HF Cumulative Sequence Score: -1.447294

--- Beam 4 ---
Generated Sequence (IDs): [32, 3202, 320, 6546, 1428, 11, 10984, 49541, 13, 15388, 3298, 6266, 128001]
Generated Text: AUP (CSAM, Encourg. Illegal Act.)

Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-28.162111
Step 3: Token=' (' (ID=320), LogProb=-10.757921
Step 4: Token='CS' (ID=6546), LogProb=-6.859391
Step 5: Token='AM' (ID=1428), LogProb=-20.384962
Step 6: Token=',' (ID=11), LogProb=-15.148496
Step 7: Token=' Enc' (ID=10984), LogProb=-0.298849
Step 8: Token='ourg' (ID=49541), LogProb=-18.535187
Step 9: Token='.' (ID=13), LogProb=-0.006747
Step 10: Token=' Illegal' (ID=15388), LogProb=-14.434349
Step 11: Token=' Act' (ID=3298), LogProb=-12.582914
Step 12: Token='.)' (ID=6266), LogProb=-12.790556
Step 13: Token='' (ID=128001), LogProb=-20.104782

Final Scores:
HF Cumulative Sequence Score: -1.464120

The Question

How does Hugging Face’s beam search compute the final scores (e.g., −0.247081, −0.323745, −1.447294, −1.464120) given the very negative individual log probabilities?

For example, for the first beam I expected a cumulative score of (-0.741240 - 28.383789 - 32.667973) / 3 = -20.597667, i.e., the summed token log-probs divided by the generated length of 3 (with the default length_penalty of 1.0, this is plain length normalization). However, the final sequences_scores from HF differ significantly from any straightforward summation of the listed token log-probs, even when accounting for a length_penalty.

Can someone help clarify how these scores are calculated?
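In case it helps frame the question, my reading of the docs is that sequences_scores should be reconstructable as sum(per-token log-probs) / (generated_length ** length_penalty). Below is a minimal cross-check sketch using the documented compute_transition_scores helper; it assumes a transformers version recent enough to return generation_output.beam_indices for beam search (beam_indices undoes the per-step beam reordering, so the returned scores line up with the final beams):

import numpy as np

transition_scores = model.compute_transition_scores(
    generation_output.sequences,
    generation_output.scores,
    beam_indices=generation_output.beam_indices,
    normalize_logits=False,
)

scores_np = transition_scores.cpu().numpy()
# Positions after <eos> carry a transition score of 0, so counting the
# strictly negative entries recovers each hypothesis's generated length.
output_length = np.sum(scores_np < 0, axis=1)
reconstructed = np.sum(scores_np, axis=1) / (output_length ** 1.0)  # default length_penalty
print(reconstructed)  # expected to match generation_output.sequences_scores

If that reconstruction is the intended one, I would still like to understand why my own per-step log-softmax over generation_output.scores (see the decode code below) gives such different numbers.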

Who can help?

@gante @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

GENERATION CODE:

import torch
from transformers import AutoTokenizer, GenerationConfig, LlamaForCausalLM
from peft import PeftModel

model_name = "./Llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map='auto',
)

adaptor_path = './model_spec/checkpoints/checkpoint-200'
model = PeftModel.from_pretrained(
    model,
    adaptor_path,
    torch_dtype=torch.float16,
)

model.eval()

message = "Lady Sold Children's Clothes That She Don't Send!"
input_raw = "Message: {message}"
input = input_raw.format(message=message)
instruction = "Does this customer-reported message indicate an AUP violation from the following categories? \n[A, B, C]\nIf yes, respond 'AUP'; if not, respond 'Others'."
prompt_template = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
prompt = prompt_template  # the f-string above has already interpolated instruction/input

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to('cuda')
generation_config = GenerationConfig(
    temperature=0,
    top_p=1,
    top_k=-1,
    num_beams=4,             # number of beams for beam search
    num_return_sequences=4,  # return all beams
)
generate_params = {
    "input_ids": input_ids,
    "generation_config": generation_config,
    "return_dict_in_generate": True,
    "output_scores": True,
    "max_new_tokens": 128,
}

with torch.no_grad():
    generation_output = model.generate(**generate_params)
s = generation_output.sequences[0]
output = tokenizer.decode(s, skip_special_tokens=True)
result = output.split('assistant')[1].strip()

DECODE CODE:

import torch
import torch.nn.functional as F

def analyze_beams(
    generation_output,
    tokenizer,
    input_ids,
    end_of_text_id=128001,
    length_penalty=1.0,
    ignore_after_first_eos=False,
):
    """
    Analyzes final beams from a Hugging Face generation output.

    1) Excludes the original input tokens, focusing only on newly generated tokens.
    2) Prints step-by-step tokens (ID & text) + log-probs.
    3) Applies an optional length penalty for the final "calculated score."
    4) Optionally stops counting tokens after the first <eos> if ignore_after_first_eos=True.

    :param generation_output: Object with attributes:
        - sequences: final beam sequences (tensor shape [num_beams, total_seq_len])
        - sequences_scores: final HF beam scores
        - scores: list of per-step logits ([num_steps], each shape [num_beams, vocab_size])
    :param tokenizer: A Hugging Face tokenizer to decode tokens into text.
    :param input_ids: The original input_ids (so we know how many tokens to skip).
    :param end_of_text_id: The <eos> or <end_of_text> token ID (default=128001).
    :param length_penalty: Exponent for length normalization.
    :param ignore_after_first_eos: If True, ignore any tokens after the first <eos>.
    """

    # 1) Determine how many input tokens to skip
    input_length = len(input_ids[0])  # input_ids has shape [batch_size, seq_len]
    print("\n=== HuggingFace Beam Analysis (Generated Tokens Only) ===")
    print(f"Input sequence length: {input_length}")

    # 2) Convert generation_output.scores into shape [num_beams, steps, vocab_size]
    logits = torch.stack(generation_output.scores, dim=1)  # [num_beams, steps, vocab_size]
    log_probs = F.log_softmax(logits, dim=-1)              # [num_beams, steps, vocab_size]

    beam_sequences = generation_output.sequences
    beam_scores = generation_output.sequences_scores

    num_beams = beam_sequences.shape[0]
    steps_available = log_probs.shape[1]
    vocab_size = log_probs.shape[2]

    # 3) Analyze each beam
    for beam_idx in range(num_beams):
        print(f"\n--- Beam {beam_idx + 1} ---")

        # Slice out only the newly generated portion (excluding input)
        full_sequence = beam_sequences[beam_idx]
        generated_sequence = full_sequence[input_length:]  # this is the "generated" part

        # Decode text
        generated_text = tokenizer.decode(generated_sequence, skip_special_tokens=True)

        print(f"Generated Sequence (IDs): {generated_sequence.tolist()}")
        print(f"Generated Text: {generated_text}")

        print("\nStep-by-Step Analysis:")
        beam_score_sum = 0.0
        used_step_count = 0

        # Iterate over each newly generated token
        for step_idx, token_id in enumerate(generated_sequence):
            if step_idx >= steps_available:
                # We've run out of log_probs steps
                break

            # Retrieve this beam's distribution at this step, shape [vocab_size]
            token_log_probs = log_probs[beam_idx, step_idx]

            # The log-prob of the chosen token_id
            token_logp = token_log_probs[token_id].item()

            # Accumulate the beam score
            beam_score_sum += token_logp
            used_step_count += 1

            # Print step info
            token_text = tokenizer.decode([token_id], skip_special_tokens=True)
            print(
                f"Step {step_idx + 1}: "
                f"Token='{token_text}' (ID={token_id}), LogProb={token_logp:.6f}"
            )

            # If ignoring repeated <eos>, break after the first <eos> token
            if ignore_after_first_eos and token_id == end_of_text_id:
                break

        # 4) Apply the length penalty
        # If all tokens were used, used_step_count is the length; otherwise we truncated early
        final_len = used_step_count if used_step_count > 0 else 1
        calculated_score = beam_score_sum / (final_len ** length_penalty)

        # 5) Print results
        print("\nFinal Scores:")
        hf_score = beam_scores[beam_idx].item()  # Hugging Face's final beam score
        print(f"  HF Cumulative Sequence Score:  {hf_score:.6f}")
        print(f"  Calculated Score:              {calculated_score:.6f}")

        print("\nDebug Info:")
        print(f"  Full sequence length:       {len(full_sequence)} (including input)")
        print(f"  Generated sequence length:  {len(generated_sequence)}")
        print(f"  Steps of log_probs used:    {used_step_count}")
        print(f"  Steps of log_probs avail:   {steps_available}")
        print(f"  Vocab size:                 {vocab_size}")

Expected behavior

Expected a cumulative score of (-0.741240 - 28.383789 - 32.667973) / 3 = -20.597667, i.e., the summed token log-probs divided by the generated length of 3 under the default length_penalty of 1.0.
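A quick sanity check of that expected arithmetic, using the Beam 1 numbers copied from the debug output above:

step_logprobs = [-0.741240, -28.383789, -32.667973]
length_penalty = 1.0  # transformers default
expected = sum(step_logprobs) / (len(step_logprobs) ** length_penalty)
print(expected)  # about -20.597667, nowhere near the reported -0.247081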

pratcooper added the bug label Jan 10, 2025