Some of the training data (specifically, the GPT-2-generated datasets) contains texts of length 0. This causes training (and would cause inference) to error out. Is this expected? Please see the error message below:
Loading data/webtext.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 49837.49it/s]
Loading data/webtext.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48536.65it/s]
Loading data/webtext.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48406.80it/s]
Loading data/xl-1542M.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 46902.35it/s]
Loading data/xl-1542M.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45678.24it/s]
Loading data/xl-1542M.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45654.67it/s]
Epoch 1: 10%|█ | 2098/20834 [22:20<3:19:33, 1.56it/s, acc=0.856, loss=0.297]
Traceback (most recent call last):
File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/local/home/openai_code/detector/train.py", line 324, in <module>
run(**vars(args))
File "/local/home/openai_code/detector/train.py", line 255, in run
train_metrics = train(model, optimizer, device, train_loader, f'Epoch {epoch}')
File "/local/home/openai_code/detector/train.py", line 108, in train
for texts, masks, labels in loop:
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
for obj in iterable:
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/local/home/openai_code/detector/dataset.py", line 60, in __getitem__
tokens = self.tokenizer.encode(text)
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1427, in encode
**kwargs,
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1569, in encode_plus
first_ids = get_input_ids(text)
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1541, in get_input_ids
tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1265, in tokenize
text = self.prepare_for_tokenization(text, **kwargs)
File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_roberta.py", line 239, in prepare_for_tokenization
if add_prefix_space and not text[0].isspace():
IndexError: string index out of range
The following datasets contain entries of length 0:
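For reference, the empty entries can be located with a quick scan over the downloaded .jsonl files. This is only a rough sketch, assuming each line is a JSON object with the text stored under a "text" key; the filenames are taken from the loading log above, so extend the list for other model sizes as needed:

```python
import json

# Rough sketch: count zero-length "text" fields in each downloaded .jsonl file.
# Filenames taken from the loading log above; adjust paths / add other datasets as needed.
files = [
    "data/webtext.train.jsonl", "data/webtext.valid.jsonl", "data/webtext.test.jsonl",
    "data/xl-1542M.train.jsonl", "data/xl-1542M.valid.jsonl", "data/xl-1542M.test.jsonl",
]

for path in files:
    with open(path) as f:
        empty = sum(1 for line in f if not json.loads(line)["text"])
    print(f"{path}: {empty} zero-length texts")
```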
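One possible workaround (a local filter, not a fix in the repo itself) is to drop zero-length texts when loading each .jsonl file, which avoids the text[0] IndexError in prepare_for_tokenization. A minimal sketch of such a loader-side filter, not the detector's own dataset.py code:

```python
import json

def load_texts(path):
    """Load texts from a .jsonl file, dropping zero-length entries.

    Sketch of a possible workaround, not the detector's own loader:
    an empty string otherwise reaches tokenizer.encode() and triggers
    the text[0] IndexError shown in the traceback above.
    """
    texts = []
    with open(path) as f:
        for line in f:
            text = json.loads(line)["text"]
            if text:  # skip length-0 texts
                texts.append(text)
    return texts
```

That said, filtering is only a stopgap; it would still be good to know whether these zero-length entries are expected to be in the released datasets.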
veenapaddy changed the title from "Training code fails on 0 length inputs (which are in several datasets)" to "Training code fails on 0 length inputs (which are in several datasets included by the author/used in the report)" on Apr 24, 2023.