LM training data attribution question for individual sentences #39

Open
kanishkamisra opened this issue Jan 1, 2025 · 0 comments

@kanishkamisra

Hi - thank you for making this great library! I am trying to use it to find the training data implicated in differences between minimal-pair sentences of the form:

the keys to the cabinet are on the table
the keys to the cabinet is on the table

where I just want to look at what drives the choice between "is" and "are". This clearly requires changes to the wikitext example, where the eval/dev set is grouped into fixed-length chunks rather than one sentence per row. Could I simply pad all my queries to some fixed max length and then proceed as normal, or is there something else I should do?

I tried the padding idea but got some odd matmul dimension errors. I was testing with just 4 examples in my dev set, one sentence per row (a minimal sketch of the dataset construction follows the list):

the toys on the table are
the toys on the table is
i think the toy on the table is
i think the toy on the table are
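
For concreteness, the dev set is built roughly like this (a minimal sketch; I'm eliding the actual text loading, and wrapping it in a DatasetDict so that the test_dataset["test"] indexing below works):

from datasets import Dataset, DatasetDict

# One sentence per row, under a "text" column.
sentences = [
    "the toys on the table are",
    "the toys on the table is",
    "i think the toy on the table is",
    "i think the toy on the table are",
]
test_dataset = DatasetDict({"test": Dataset.from_dict({"text": sentences})})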

and then I tokenize and add labels:

def tokenize_function(examples):
    # Pad every query to the same fixed length (128 tokens) so that
    # all rows in the dev set have identical shape.
    return tokenizer(examples["text"], padding="max_length", max_length=128)

def add_labels(examples):
    # For causal LM loss, labels are a copy of the input ids.
    examples["labels"] = examples["input_ids"].copy()
    return examples

tokenized_test_dataset = test_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=None,
    remove_columns=test_dataset["test"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
    batch_size=4
)

tokenized_test_dataset = tokenized_test_dataset.map(
    add_labels,
    batched=True,
    num_proc=None,
    load_from_cache_file=True,
    batch_size=4
)

When I then run the pairwise score computation, this is the error I get:

RuntimeError: The size of tensor a (4) must match the size of tensor b (512) at non-singleton dimension 1
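
One thing I wasn't sure about: whether the pad positions should be masked out of the labels (the usual convention for padded causal-LM loss is to set them to -100). A variant of add_labels that does this would look like the sketch below; I don't know whether it is related to the error, but I'm including it for completeness:

def add_labels_masked(examples):
    # Copy input ids, but replace pad positions (attention_mask == 0)
    # with -100 so they are ignored by the cross-entropy loss.
    examples["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples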

Any assistance would be much appreciated - please let me know if I should share more details!
