Failed to load data in trl 0.7.8/0.7.9. #1216
Comments
Thanks for the repro @xkszltl, will try to repro and fix the issue and make another patch release.
Hi @xkszltl
`remove_unused_columns=True` won't work at all (regardless of the version); that's why it's `False` initially.

```
Traceback (most recent call last):
File "./test.py", line 58, in main
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 315, in train
output = super().train(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1821, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 451, in __iter__
current_batch = next(dataloader_iter)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = self.dataset.__getitems__(possibly_batched_index)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2804, in __getitems__
batch = self.__getitem__(keys)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2800, in __getitem__
return self._getitem(key)
File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2784, in _getitem
pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 583, in query_table
_check_valid_index_key(key, size)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 536, in _check_valid_index_key
_check_valid_index_key(int(max(key)), size=size)
File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 23887 is out of bounds for size 0 |
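For context, here is a minimal sketch of one plausible mechanism behind the `out of bounds for size 0` error (an assumption about the cause, not the actual trl code path): when `remove_unused_columns=True` and none of the dataset's columns match the model's forward signature, the Trainer drops every column, and a `datasets` table with no columns reports zero rows, so any index lookup fails with exactly this message.

```python
import datasets

# Hypothetical illustration of the "size 0" failure mode: dropping every
# column (as the Trainer does for columns it considers unused) leaves a
# dataset that reports zero rows.
ds = datasets.Dataset.from_dict({"text": ["foo", "bar"]})
print(len(ds))  # 2

ds = ds.remove_columns(ds.column_names)
print(len(ds))  # 0 -- a table with no columns has no addressable rows

ds[0]  # IndexError: Invalid key: 0 is out of bounds for size 0
```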
I'm currently pinning to 0.7.7.
hi @xkszltl

```python
import datasets
import peft
import transformers
import trl

model_dir = "HuggingFaceM4/tiny-random-LlamaForCausalLM"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = transformers.AutoModelForCausalLM.from_pretrained(model_dir)

ds_train = datasets.load_dataset("imdb", split="train[:10]")

trainer = trl.SFTTrainer(
    model=model,
    args=transformers.TrainingArguments(
        output_dir="output",
        max_steps=1,
        remove_unused_columns=True,
    ),
    peft_config=peft.LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=8,
        bias="none",
        task_type="CAUSAL_LM",  # peft task type for causal language modeling
    ),
    train_dataset=ds_train,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=8,
)
trainer.train()
```
Are you trying with master or a release?
@xkszltl on master currently
@xkszltl can you try and let me know how it goes?
Only tried on the released wheel, so that may be the reason.
I see, ok! If you want, you can build from that branch: `pip install -U git+https://github.com/huggingface/trl.git@fix-breaking-change`
Still repros on the branch, and I'm using a different dataset this time, not just imdb. |
In case version matters:
@xkszltl I am using the same library versions as you and was not able to repro. Did you run this script: #1216 (comment)?
```
# CUDA_VISIBLE_DEVICES=0 ./try.py
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.1.0) or chardet (4.0.0) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████| 771/771 [00:00<00:00, 5.36MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████████████████████████| 500k/500k [00:00<00:00, 789kB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 4.61MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████| 552/552 [00:00<00:00, 4.44MB/s]
config.json: 100%|██████████████████████████████████████████████████████████████████████████| 466/466 [00:00<00:00, 3.74MB/s]
pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████| 2.07M/2.07M [00:00<00:00, 10.1MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████| 138/138 [00:00<00:00, 1.06MB/s]
Downloading readme: 100%|███████████████████████████████████████████████████████████████| 7.81k/7.81k [00:00<00:00, 38.4MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████| 21.0M/21.0M [00:03<00:00, 6.42MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████| 20.5M/20.5M [00:03<00:00, 6.45MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████████| 42.0M/42.0M [00:05<00:00, 7.14MB/s]
Generating train split: 100%|███████████████████████████████████████████████| 25000/25000 [00:00<00:00, 192120.43 examples/s]
Generating test split: 100%|████████████████████████████████████████████████| 25000/25000 [00:00<00:00, 205666.45 examples/s]
Generating unsupervised split: 100%|████████████████████████████████████████| 50000/50000 [00:00<00:00, 221609.66 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 1023.63 examples/s]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
  0%|                                                                                                   | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./try.py", line 38, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py", line 330, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 451, in __iter__
    current_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2805, in __getitems__
    batch = self.__getitem__(keys)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2801, in __getitem__
    return self._getitem(key)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2785, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 583, in query_table
    _check_valid_index_key(key, size)
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 536, in _check_valid_index_key
    _check_valid_index_key(int(max(key)), size=size)
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 9 is out of bounds for size 0
  0%|                                                                                                   | 0/1 [00:00<?, ?it/s]
```
This is the output from that script. |
And I've seen others talking about something similar: |
(Original issue description)

This is a new regression introduced in trl 0.7.8 (and 0.7.9); 0.7.7 is fine.
We run into `ValueError: too many dimensions 'str'` when loading data into the trainer. Here's a simple LLAMA2+LoRA fine-tuning on the IMDB dataset as a minimal repro:
0.7.7 works:
0.7.8 fails:
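For reference, a minimal sketch of where a `ValueError: too many dimensions 'str'` typically comes from (an assumed mechanism, not verified against the trl 0.7.8 code path): PyTorch cannot build tensors from strings, so if the raw `text` column reaches the data collator instead of tokenized ids, tensor construction fails with exactly this message.

```python
import torch

# Hypothetical illustration: a batch of raw strings, as it would look if the
# untokenized "text" column reached the collator.
batch = [{"text": "I loved this movie"}, {"text": "Terrible film"}]

# Building a tensor from a list of strings raises:
# ValueError: too many dimensions 'str'
torch.tensor([example["text"] for example in batch])
```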