LM training data attribution question for individual sentences #39

Open
kanishkamisra opened this issue Jan 1, 2025 · 0 comments

@kanishkamisra

Hi - thank you for making this great library! I am trying to use it to find the training data implicated in differences between minimal-pair sentences of the form:

the keys to the cabinet are on the table
the keys to the cabinet is on the table

where I just want to look at what drives the choice between "is" and "are". This clearly requires changes to the wikitext example, where the eval/dev set is grouped into fixed-length chunks rather than one sentence per row. Could I simply pad all my queries to some fixed max length and then proceed as normal, or is there something else I should do?

I tried the padding idea but got some odd matmul dimension errors. I was testing with just 4 examples in my dev set, one sentence per row (a minimal sketch of the dataset construction follows the list):

the toys on the table are
the toys on the table is
i think the toy on the table is
i think the toy on the table are
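
For concreteness, the dev set is built roughly like this (a minimal sketch; I'm eliding the actual text loading, and wrapping it in a DatasetDict so that the test_dataset["test"] indexing below works):

from datasets import Dataset, DatasetDict

# One sentence per row, under a "text" column.
sentences = [
    "the toys on the table are",
    "the toys on the table is",
    "i think the toy on the table is",
    "i think the toy on the table are",
]
test_dataset = DatasetDict({"test": Dataset.from_dict({"text": sentences})})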

and then I tokenize and add labels:

def tokenize_function(examples):
    # Pad every query to the same fixed length (128 tokens) so that
    # all rows in the dev set have identical shape.
    return tokenizer(examples["text"], padding="max_length", max_length=128)

def add_labels(examples):
    # For causal LM loss, labels are a copy of the input ids.
    examples["labels"] = examples["input_ids"].copy()
    return examples

tokenized_test_dataset = test_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=None,
    remove_columns=test_dataset["test"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
    batch_size=4
)

tokenized_test_dataset = tokenized_test_dataset.map(
    add_labels,
    batched=True,
    num_proc=None,
    load_from_cache_file=True,
    batch_size=4
)

When I then run the pairwise score computation, this is the error I get:

RuntimeError: The size of tensor a (4) must match the size of tensor b (512) at non-singleton dimension 1
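
One thing I wasn't sure about: whether the pad positions should be masked out of the labels (the usual convention for padded causal-LM loss is to set them to -100). A variant of add_labels that does this would look like the sketch below; I don't know whether it is related to the error, but I'm including it for completeness:

def add_labels_masked(examples):
    # Copy input ids, but replace pad positions (attention_mask == 0)
    # with -100 so they are ignored by the cross-entropy loss.
    examples["labels"] = [
        [tok if mask == 1 else -100 for tok, mask in zip(ids, attn)]
        for ids, attn in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples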

Any assistance would be much appreciated - please let me know if I should share more details!
