Improved training raw string chunking logic #3476

Open · wants to merge 6 commits into main

Conversation

@dandm1 commented Aug 6, 2023

Some small fixes to the processing of raw text inputs:

  • Switched the newline-favouring code to work directly on tokens, rather than tokenising, encoding and re-tokenising.
  • Improved the split-chunks logic (a rough sketch follows this list) to:
    • prevent creating a chunk that contains only overlap from the previous block,
    • prevent dropping the first characters after a hard cut when a newline occurs soon after the cut (e.g. titles),
    • adjust the split location of the last chunk, where possible, to avoid the need for padding,
    • directly output dictionaries, rather than requiring a subsequent call to encode.
  • Also added a model check that outputs a warning when adding EOS tokens if the EOS and BOS tokens are the same in the model definition.
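
A minimal sketch (not the PR's actual code) of the split-and-overlap behaviour described above, using assumed names such as split_chunks, chunk_len and overlap_len:

```python
def split_chunks(tokens, chunk_len, overlap_len):
    """Split a flat list of token IDs into chunks of at most chunk_len,
    with each chunk after the first re-using the last overlap_len tokens
    of the previous one."""
    assert overlap_len < chunk_len
    chunks = []
    step = chunk_len - overlap_len
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_len]
        # Don't emit a trailing chunk that would consist only of overlap
        # already present at the end of the previous chunk.
        if start > 0 and len(chunk) <= overlap_len:
            break
        chunks.append(chunk)
    # If the final chunk came out short, shift its start back so it is
    # full-length, trading a little extra overlap for less padding.
    if len(chunks) > 1 and len(chunks[-1]) < chunk_len and len(tokens) >= chunk_len:
        chunks[-1] = tokens[-chunk_len:]
    return chunks
```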

@dandm1 changed the title from "Improved raw string chunking logic" to "Improved training raw string chunking logic" on Aug 8, 2023
@oobabooga (Owner)

Can you give one example of chunks generated before and after this change for the same dataset? As a simple sanity check.

@dandm1 (Author) commented Aug 9, 2023

Sure, here is an example: an input file, plus two train_dataset_sample outputs, one from the current main branch and the other with the PR applied. I've run these with the default settings, except for setting a 30-character small-chunks filter (which I don't think did anything).

The issues being fixed are most pronounced on poetry, but would apply equally to prose. Looking at these again, it looks like the main branch version is also dropping the last line before each hard cut.

train_dataset_sample_main_branch.txt
train_dataset_sample_fixed.txt
input.txt

@oobabooga (Owner)

Thanks, that was helpful. The changes overall seem to be in the right direction. One thing I have noticed is that by keeping the default settings and using a raw text file as input, the chunks will consistently start with \n characters, while in the main branch this doesn't happen. Isn't that undesirable?

@dandm1 (Author) commented Aug 10, 2023

I think a lot of that is because the source file has four \n characters at many of the hard splits, and the hard cut is defined as three. It might make sense to trim these after the split.

@oobabooga (Owner) commented Aug 10, 2023

This happened on a dataset of my own, which was a random arXiv paper that I copied and pasted. It doesn't have any \n\n\n\n.

I noted that this behavior stopped happening after setting "Prefer Newline Cut Length" to 0 in this PR's branch, so I don't know if those newlines at the beginning of each chunk are intended or not when this parameter is set.

@dandm1 (Author) commented Aug 10, 2023

Yeah, you are right; it looks like it was struggling with the fact that encode('\n') actually returns three tokens. I've updated it now to deal with the newline tokens being a list, and cuts are looking a lot cleaner on newlines. I've also added logic to strip newlines off the start and end of a hard split, if they are present.
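
A rough illustration (assumed helper names, not the code in this PR) of treating the encoded newline as a list of token IDs and trimming newline tokens from the ends of a hard-cut block:

```python
def newline_token_ids(tokenizer):
    # With some tokenizers (e.g. SentencePiece/Llama-style), encode('\n')
    # returns more than one ID, so the result is kept as a list.
    return tokenizer.encode('\n', add_special_tokens=False)

def strip_boundary_newlines(tokens, newline_ids):
    """Drop newline tokens from the start and end of a hard-split block."""
    newline_set = set(newline_ids)
    start, end = 0, len(tokens)
    while start < end and tokens[start] in newline_set:
        start += 1
    while end > start and tokens[end - 1] in newline_set:
        end -= 1
    return tokens[start:end]
```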

@dandm1 (Author) commented Aug 10, 2023

Updated example training dataset
train_dataset_sample.txt

@FartyPants (Contributor)

It would be great if someone could also check the sentence chunking from my PR in comparison:
#3382
I currently only have access to a Lenovo potato.

@oobabooga (Owner) commented Aug 27, 2023

I have been trying to understand the chunking method better (I didn't write it; it was contributed in a PR), and I feel like the current training implementation is more complex than it should be. More importantly, it seems to use parameters that are not used anywhere else:

  • Overlap Length
  • Prefer Newline Cut Length
  • Hard Cut String
  • Ignore small blocks

For raw text files, I would prefer to simply encode the entire file, making sure to add a BOS at the beginning and an EOS at the very end, and then do a simple loop over the encoded IDs, taking Cutoff Length tokens at a time (is that a mainstream parameter?). The job of adding <s> and </s> mid-text is up to the user, and "Hard Cut String" should not exist. I am not confident enough to simply do that, though, because I have zero experience with training and maybe there is something conceptually wrong about it.
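
For comparison, the simpler scheme described above might look roughly like this (illustrative only; cutoff_len and the helper name are assumptions):

```python
def simple_raw_text_chunks(text, tokenizer, cutoff_len):
    ids = tokenizer.encode(text, add_special_tokens=False)
    # BOS at the very beginning, EOS at the very end, nothing mid-text.
    ids = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
    # Plain non-overlapping windows of cutoff_len tokens.
    return [ids[i:i + cutoff_len] for i in range(0, len(ids), cutoff_len)]
```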

@dandm1 (Author) commented Aug 27, 2023

There may be a place for simplified training, but there is definitely a need to be able to control how a raw text string is converted into blocks for training purposes. I claim no great expertise either, but I would think it risks confusing the model if it is trained on too many samples that consist of the end of one story and the start of another. We definitely got noticeably worse results training with datasets that contained a lot of padding tokens. I believe that overlap and hard cuts are necessary to produce sensibly structured blocks; you could argue about the other ones, I guess, but they definitely allow tidier string sections to be created.

Honestly, there are more features that would be useful in raw dataset creation, like the sentence-based cutoffs that @FartyPants wrote, multi-file handling, sample shuffling, and the ability to pack small sections together to avoid padding. Maybe do these basic tidy-up changes now, and a longer-term plan could be to make dataset preparation available as an extension point?
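
One possible shape (purely illustrative, not part of this PR) for the packing idea mentioned above: greedily merge short token blocks until they would exceed the chunk length, instead of padding each short block on its own.

```python
def pack_sections(sections, chunk_len, sep_ids):
    """Greedily concatenate short token-ID sections, joined by sep_ids,
    so that each packed block stays within chunk_len."""
    packed, current = [], []
    for sec in sections:
        candidate = current + (sep_ids if current else []) + sec
        if current and len(candidate) > chunk_len:
            packed.append(current)
            current = sec[:chunk_len]  # oversized sections are simply clipped here
        else:
            current = candidate[:chunk_len]
    if current:
        packed.append(current)
    return packed
```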

@FartyPants (Contributor)

I decided to move the whole enhanced training into its own Training PRO extension.
