
Training code fails on 0 length inputs (which are in several datasets included by the author/used in the report) #51

Open
veenapaddy opened this issue Apr 24, 2023 · 0 comments


Some of the training data (specifically, the GPT-2-generated datasets) contains texts of length 0. This causes training to crash, and would cause the same crash at inference. Is this expected? Please see the error message below:

Loading data/webtext.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 49837.49it/s]
Loading data/webtext.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48536.65it/s]
Loading data/webtext.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 48406.80it/s]
Loading data/xl-1542M.train.jsonl: 100%|██████████| 250000/250000 [00:05<00:00, 46902.35it/s]
Loading data/xl-1542M.test.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45678.24it/s]
Loading data/xl-1542M.valid.jsonl: 100%|██████████| 5000/5000 [00:00<00:00, 45654.67it/s]
Epoch 1:  10%|█         | 2098/20834 [22:20<3:19:33,  1.56it/s, acc=0.856, loss=0.297]
Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/local/home/openai_code/detector/train.py", line 324, in <module>
    run(**vars(args))
  File "/local/home/openai_code/detector/train.py", line 255, in run
    train_metrics = train(model, optimizer, device, train_loader, f'Epoch {epoch}')
  File "/local/home/openai_code/detector/train.py", line 108, in train
    for texts, masks, labels in loop:
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/home/openai_code/detector/dataset.py", line 60, in __getitem__
    tokens = self.tokenizer.encode(text)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1427, in encode
    **kwargs,
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1569, in encode_plus
    first_ids = get_input_ids(text)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1541, in get_input_ids
    tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_utils.py", line 1265, in tokenize
    text = self.prepare_for_tokenization(text, **kwargs)
  File "/local/home/openai_code/venv/lib64/python3.7/site-packages/transformers/tokenization_roberta.py", line 239, in prepare_for_tokenization
    if add_prefix_space and not text[0].isspace():
IndexError: string index out of range
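The failing line in `tokenization_roberta.py` indexes `text[0]` before checking the length, so any empty string raises before `.isspace()` is ever evaluated. A minimal reproduction, independent of transformers:

```python
# Minimal reproduction of the failure mode: indexing the first character
# of an empty string raises IndexError before .isspace() is called.
text = ""
add_prefix_space = True
try:
    if add_prefix_space and not text[0].isspace():
        pass
except IndexError as exc:
    print(f"IndexError: {exc}")  # prints: IndexError: string index out of range
```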

The following datasets contain zero-length entries:

./data/large-762M.train.jsonl
./data/large-762M.valid.jsonl
./data/medium-345M.train.jsonl
./data/small-117M100.valid.jsonl
./data/small-117M.test.jsonl
./data/small-117M.train.jsonl
./data/small-117M.valid.jsonl
./data/xl-1542M.train.jsonl
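A possible workaround, sketched below under my own assumptions (the `load_texts` helper and the filtering step are illustrative, not the repository's actual loader), is to drop zero-length entries while reading each JSONL split, so the tokenizer never receives an empty string:

```python
import json

def load_texts(path):
    """Yield the non-empty 'text' fields from a JSONL dataset file.

    Zero-length texts are skipped, since passing "" to tokenizer.encode
    crashes inside RobertaTokenizer.prepare_for_tokenization
    (IndexError: string index out of range).
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]
            if len(text) > 0:  # drop the entries that trigger the crash
                yield text
```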
veenapaddy changed the title from "Training code fails on 0 length inputs (which are in several datasets)" to "Training code fails on 0 length inputs (which are in several datasets included by the author/used in the report)" on Apr 24, 2023.