How to run pre-training? #155

Open
BlessedTatonka opened this issue Dec 23, 2024 · 7 comments

@BlessedTatonka

Thank you for the great work! I have a question regarding pre-training. Could you please clarify which YAML configuration file should be used to reproduce a pre-training setup similar to ModernBERT's, but for a different language? I noticed that the yamls folder doesn't seem to contain a file for this purpose. The only related script I found is generate_eval_config.py, which, if I understand correctly, generates a YAML configuration using ModernBERT's training parameters. Is my understanding correct, or am I missing something?

@yzimmermann

Was wondering about this, too!

@chaofan520

same question

@GithubX-F

same question.
Are there any examples provided?

@NohTow
Collaborator

NohTow commented Jan 9, 2025

Hello,
Sorry for the delayed response.
We plan to write proper guides for running the pretraining in the next few days; we have been a bit short on time lately.
In the meantime, I have dropped the configs for the first step of pretraining (warmup + stable phases) here if you want to give it a shot before we clean everything up. The ones for context extension and decay will be added shortly.
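In case it helps while everything is being cleaned up, here is a minimal sketch of the Composer-style YAML layout these configs follow. The keys and values below are illustrative assumptions, not the exact contents of the linked configs:

```yaml
# Illustrative sketch of a warmup+stable pretraining config (Composer-style).
# All keys and values are assumptions; refer to the linked configs for the real ones.
data_local: ./data/c4            # local path to an MDS-format data folder
tokenizer_name: answerdotai/ModernBERT-base
max_seq_len: 1024                # pre-context-extension sequence length

model:
  name: flex_bert                # assumed model key used by this repo

train_loader:
  name: text
  dataset:
    streaming: false             # see the note on streaming below

scheduler:
  name: warmup_stable_decay      # trapezoidal schedule: warmup + stable phases here
  t_warmup: 3000ba               # warmup duration in batches (illustrative)
```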

Note that the data path should point to a data folder in the MDS format; you have an example with the C4 dataset here.
Also note that, to use streaming: True, you might need to decompress the data using this script. Disabling streaming makes pretraining faster and avoids an issue with uneven GPU memory allocation (see #85).
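For the conversion itself, a hedged sketch using MDSWriter from mosaicml-streaming (the single "text" column is an assumption; match whatever columns the repo's dataloader actually expects):

```python
# Sketch: convert a Hugging Face text dataset to MDS shards.
# Assumes a single "text" column; adjust to match the repo's dataloader.
from datasets import load_dataset
from streaming import MDSWriter  # pip install mosaicml-streaming

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# compression=None writes uncompressed shards, so no separate decompression
# step is needed when running with streaming disabled.
with MDSWriter(out="data/c4/train", columns={"text": "str"}, compression=None) as writer:
    for sample in dataset:
        writer.write({"text": sample["text"]})
```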

Again, sorry for the delay and hopefully we'll have better documentation soon.

@BramVanroy

@NohTow Hi there! Any update on a proper step-by-step guide for pretraining?

@NohTow
Collaborator

NohTow commented Feb 4, 2025

Hello,

Until we update the READMEs and merge the configs, the above comment is the closest thing to a step-by-step guide.
I agree that this is not optimal for now, and again I apologize for the delay, but could you specify what information you are lacking w.r.t. the comment so we can add it to the readme?
Thanks!

Edit: actually, I forgot, but #183, which adds a bit of documentation to the main readme, has been merged. So besides merging the configs, is there anything you are missing?

@ebrarkiziloglu

Hi @NohTow, following your guide, I am encountering an issue in pre-training. Could you help with #199?
