Help Understanding Beam Search Scores in Hugging Face (LLaMA + LoRA) #35618

Open · 2 of 4 tasks
pratcooper opened this issue Jan 10, 2025 · 0 comments

pratcooper commented Jan 10, 2025

System Info

Hello Hugging Face community,

I’m working with a LLaMA-based model that has a LoRA (Low-Rank Adaptation) adapter applied, and I’m using beam search in Transformers. I’m trying to debug how the final beam scores are computed, because the step-by-step log probabilities I print out are far more negative than the final “sequence score” reported by Hugging Face.

Below is a sample of my debug output for 4 beams, each showing:

Generated Sequence (token IDs, excluding the prompt/input).
Generated Text (decoded).
Step-by-Step Analysis: Each newly generated token’s log probability.
HF Cumulative Sequence Score (final beam score from generation_output.sequences_scores).
Debug Info (lengths, how many log-prob steps were used vs. available).

=== HuggingFace Beam Analysis (Generated Tokens Only) ===
Input sequence length: 148

--- Beam 1 ---
Generated Sequence (IDs): [32, 3202, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
Generated Text: AUP

Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-28.383789
Step 3: Token='' (ID=128001), LogProb=-32.667973

Final Scores:
HF Cumulative Sequence Score: -0.247081

--- Beam 2 ---
Generated Sequence (IDs): [51154, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001, 128001]
Generated Text: Others

Step-by-Step Analysis:
Step 1: Token='Others' (ID=51154), LogProb=-0.647490
Step 2: Token='' (ID=128001), LogProb=-29.399292

Final Scores:
HF Cumulative Sequence Score: -0.323745

--- Beam 3 ---
Generated Sequence (IDs): [32, 3202, 320, 6546, 1428, 11, 10984, 49541, 13, 15388, 3298, 8, 128001]
Generated Text: AUP (CSAM, Encourg. Illegal Act)

Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-20.869020
Step 3: Token=' (' (ID=320), LogProb=-9.416358
Step 4: Token='CS' (ID=6546), LogProb=-19.269587
Step 5: Token='AM' (ID=1428), LogProb=-23.486216
Step 6: Token=',' (ID=11), LogProb=-10.883574
Step 7: Token=' Enc' (ID=10984), LogProb=-0.144973
Step 8: Token='ourg' (ID=49541), LogProb=-0.001301
Step 9: Token='.' (ID=13), LogProb=-0.001659
Step 10: Token=' Illegal' (ID=15388), LogProb=-20.425816
Step 11: Token=' Act' (ID=3298), LogProb=-14.907486
Step 12: Token=')' (ID=8), LogProb=-0.150186
Step 13: Token='' (ID=128001), LogProb=-17.213655

Final Scores:
HF Cumulative Sequence Score: -1.447294

--- Beam 4 ---
Generated Sequence (IDs): [32, 3202, 320, 6546, 1428, 11, 10984, 49541, 13, 15388, 3298, 6266, 128001]
Generated Text: AUP (CSAM, Encourg. Illegal Act.)

Step-by-Step Analysis:
Step 1: Token='A' (ID=32), LogProb=-0.741240
Step 2: Token='UP' (ID=3202), LogProb=-28.162111
Step 3: Token=' (' (ID=320), LogProb=-10.757921
Step 4: Token='CS' (ID=6546), LogProb=-6.859391
Step 5: Token='AM' (ID=1428), LogProb=-20.384962
Step 6: Token=',' (ID=11), LogProb=-15.148496
Step 7: Token=' Enc' (ID=10984), LogProb=-0.298849
Step 8: Token='ourg' (ID=49541), LogProb=-18.535187
Step 9: Token='.' (ID=13), LogProb=-0.006747
Step 10: Token=' Illegal' (ID=15388), LogProb=-14.434349
Step 11: Token=' Act' (ID=3298), LogProb=-12.582914
Step 12: Token='.)' (ID=6266), LogProb=-12.790556
Step 13: Token='' (ID=128001), LogProb=-20.104782

Final Scores:
HF Cumulative Sequence Score: -1.464120

The Question

How does Hugging Face’s beam search compute the final scores (e.g., −0.247081, −0.323745, −1.447294, −1.464120) given the very negative individual log probabilities?

For example, for the first beam I expected a cumulative score of (-0.741240 - 28.383789 - 32.667973) / 3 = -20.597667, i.e., the summed token log-probs divided by the generated length of 3 (with the default length_penalty of 1.0, this is plain length normalization). However, the final sequences_scores from HF differ significantly from any straightforward summation of the listed token log-probs, even when accounting for a length_penalty.

Can someone help clarify how these scores are calculated?
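In case it helps frame the question, my reading of the docs is that sequences_scores should be reconstructable as sum(per-token log-probs) / (generated_length ** length_penalty). Below is a minimal cross-check sketch using the documented compute_transition_scores helper; it assumes a transformers version recent enough to return generation_output.beam_indices for beam search (beam_indices undoes the per-step beam reordering, so the returned scores line up with the final beams):

import numpy as np

transition_scores = model.compute_transition_scores(
    generation_output.sequences,
    generation_output.scores,
    beam_indices=generation_output.beam_indices,
    normalize_logits=False,
)

scores_np = transition_scores.cpu().numpy()
# Positions after <eos> carry a transition score of 0, so counting the
# strictly negative entries recovers each hypothesis's generated length.
output_length = np.sum(scores_np < 0, axis=1)
reconstructed = np.sum(scores_np, axis=1) / (output_length ** 1.0)  # default length_penalty
print(reconstructed)  # expected to match generation_output.sequences_scores

If that reconstruction is the intended one, I would still like to understand why my own per-step log-softmax over generation_output.scores (see the decode code below) gives such different numbers.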

Who can help?

@gante @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

GENERATION CODE:

import torch
from transformers import AutoTokenizer, GenerationConfig, LlamaForCausalLM
from peft import PeftModel

model_name = "./Llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = LlamaForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map='auto',
)

adaptor_path = './model_spec/checkpoints/checkpoint-200'
model = PeftModel.from_pretrained(
    model,
    adaptor_path,
    torch_dtype=torch.float16,
)

model.eval()

message = "Lady Sold Children's Clothes That She Don't Send!"
input_raw = "Message: {message}"
input = input_raw.format(message=message)
instruction = "Does this customer-reported message indicate an AUP violation from the following categories? \n[A, B, C]\nIf yes, respond 'AUP'; if not, respond 'Others'."
prompt_template = f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{instruction}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
prompt = prompt_template  # the f-string above has already interpolated instruction/input

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to('cuda')
generation_config = GenerationConfig(
    temperature=0,
    top_p=1,
    top_k=-1,
    num_beams=4,             # number of beams for beam search
    num_return_sequences=4,  # return all beams
)
generate_params = {
    "input_ids": input_ids,
    "generation_config": generation_config,
    "return_dict_in_generate": True,
    "output_scores": True,
    "max_new_tokens": 128,
}

with torch.no_grad():
    generation_output = model.generate(**generate_params)
s = generation_output.sequences[0]
output = tokenizer.decode(s, skip_special_tokens=True)
result = output.split('assistant')[1].strip()

DECODE CODE:

import torch
import torch.nn.functional as F

def analyze_beams(
    generation_output,
    tokenizer,
    input_ids,
    end_of_text_id=128001,
    length_penalty=1.0,
    ignore_after_first_eos=False,
):
    """
    Analyzes final beams from a Hugging Face generation output.

    1) Excludes the original input tokens, focusing only on newly generated tokens.
    2) Prints step-by-step tokens (ID & text) + log-probs.
    3) Applies an optional length penalty for the final "calculated score."
    4) Optionally stops counting tokens after the first <eos> if ignore_after_first_eos=True.

    :param generation_output: Object with attributes:
        - sequences: final beam sequences (tensor shape [num_beams, total_seq_len])
        - sequences_scores: final HF beam scores
        - scores: list of per-step logits ([num_steps], each shape [num_beams, vocab_size])
    :param tokenizer: A Hugging Face tokenizer to decode tokens into text.
    :param input_ids: The original input_ids (so we know how many tokens to skip).
    :param end_of_text_id: The <eos> or <end_of_text> token ID (default=128001).
    :param length_penalty: Exponent for length normalization.
    :param ignore_after_first_eos: If True, ignore any tokens after the first <eos>.
    """

    # 1) Determine how many input tokens to skip
    input_length = len(input_ids[0])  # input_ids has shape [batch_size, seq_len]
    print("\n=== HuggingFace Beam Analysis (Generated Tokens Only) ===")
    print(f"Input sequence length: {input_length}")

    # 2) Convert generation_output.scores into shape [num_beams, steps, vocab_size]
    logits = torch.stack(generation_output.scores, dim=1)  # [num_beams, steps, vocab_size]
    log_probs = F.log_softmax(logits, dim=-1)              # [num_beams, steps, vocab_size]

    beam_sequences = generation_output.sequences
    beam_scores = generation_output.sequences_scores

    num_beams = beam_sequences.shape[0]
    steps_available = log_probs.shape[1]
    vocab_size = log_probs.shape[2]

    # 3) Analyze each beam
    for beam_idx in range(num_beams):
        print(f"\n--- Beam {beam_idx + 1} ---")

        # Slice out only the newly generated portion (excluding input)
        full_sequence = beam_sequences[beam_idx]
        generated_sequence = full_sequence[input_length:]  # this is the "generated" part

        # Decode text
        generated_text = tokenizer.decode(generated_sequence, skip_special_tokens=True)

        print(f"Generated Sequence (IDs): {generated_sequence.tolist()}")
        print(f"Generated Text: {generated_text}")

        print("\nStep-by-Step Analysis:")
        beam_score_sum = 0.0
        used_step_count = 0

        # Iterate over each newly generated token
        for step_idx, token_id in enumerate(generated_sequence):
            if step_idx >= steps_available:
                # We've run out of log_probs steps
                break

            # Retrieve this beam's distribution at this step, shape [vocab_size]
            token_log_probs = log_probs[beam_idx, step_idx]

            # The log-prob of the chosen token_id
            token_logp = token_log_probs[token_id].item()

            # Accumulate the beam score
            beam_score_sum += token_logp
            used_step_count += 1

            # Print step info
            token_text = tokenizer.decode([token_id], skip_special_tokens=True)
            print(
                f"Step {step_idx + 1}: "
                f"Token='{token_text}' (ID={token_id}), LogProb={token_logp:.6f}"
            )

            # If ignoring repeated <eos>, break after the first <eos> token
            if ignore_after_first_eos and token_id == end_of_text_id:
                break

        # 4) Apply the length penalty
        # If all tokens were used, used_step_count is the length; otherwise we truncated early
        final_len = used_step_count if used_step_count > 0 else 1
        calculated_score = beam_score_sum / (final_len ** length_penalty)

        # 5) Print results
        print("\nFinal Scores:")
        hf_score = beam_scores[beam_idx].item()  # Hugging Face's final beam score
        print(f"  HF Cumulative Sequence Score:  {hf_score:.6f}")
        print(f"  Calculated Score:              {calculated_score:.6f}")

        print("\nDebug Info:")
        print(f"  Full sequence length:       {len(full_sequence)} (including input)")
        print(f"  Generated sequence length:  {len(generated_sequence)}")
        print(f"  Steps of log_probs used:    {used_step_count}")
        print(f"  Steps of log_probs avail:   {steps_available}")
        print(f"  Vocab size:                 {vocab_size}")

Expected behavior

Expected a cumulative score of (-0.741240 - 28.383789 - 32.667973) / 3 = -20.597667, i.e., the summed token log-probs divided by the generated length of 3 under the default length_penalty of 1.0.
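A quick sanity check of that expected arithmetic, using the Beam 1 numbers copied from the debug output above:

step_logprobs = [-0.741240, -28.383789, -32.667973]
length_penalty = 1.0  # transformers default
expected = sum(step_logprobs) / (len(step_logprobs) ** length_penalty)
print(expected)  # about -20.597667, nowhere near the reported -0.247081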

pratcooper added the bug label Jan 10, 2025