How to run pre-training? #155

Open
BlessedTatonka opened this issue Dec 23, 2024 · 7 comments

@BlessedTatonka

Thank you for the great work! I have a question regarding pre-training. Could you please clarify which YAML configuration file should be used to reproduce a pre-training setup similar to ModernBERT's, but for a different language? I noticed that the yamls folder doesn't seem to contain a file for this purpose. The only related script I found is generate_eval_config.py, which, if I understand correctly, generates a YAML configuration using ModernBERT's training parameters. Is my understanding correct, or am I missing something?

@yzimmermann

Was wondering about this, too!

@chaofan520

same question

@GithubX-F

same question.
Are there any examples provided?

@NohTow
Collaborator

NohTow commented Jan 9, 2025

Hello,
Sorry for the delayed response.
We plan to write proper guides for running the pretraining in the next few days; we have been a bit short on time lately.
In the meantime, I have dropped the configs for the first step of pretraining (warmup + stable phases) here if you want to give it a shot before we clean everything up. The ones for context extension and decay will be added shortly.
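In case it helps while everything is being cleaned up, here is a minimal sketch of the Composer-style YAML layout these configs follow. The keys and values below are illustrative assumptions, not the exact contents of the linked configs:

```yaml
# Illustrative sketch of a warmup+stable pretraining config (Composer-style).
# All keys and values are assumptions; refer to the linked configs for the real ones.
data_local: ./data/c4            # local path to an MDS-format data folder
tokenizer_name: answerdotai/ModernBERT-base
max_seq_len: 1024                # pre-context-extension sequence length

model:
  name: flex_bert                # assumed model key used by this repo

train_loader:
  name: text
  dataset:
    streaming: false             # see the note on streaming below

scheduler:
  name: warmup_stable_decay      # trapezoidal schedule: warmup + stable phases here
  t_warmup: 3000ba               # warmup duration in batches (illustrative)
```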

Note that the data path should point to a data folder in the MDS format; you have an example with the C4 dataset here.
Also note that, to use streaming: True, you might need to decompress the data using this script. Disabling streaming makes pretraining faster and avoids an issue with uneven GPU memory allocation (see #85).
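For the conversion itself, a hedged sketch using MDSWriter from mosaicml-streaming (the single "text" column is an assumption; match whatever columns the repo's dataloader actually expects):

```python
# Sketch: convert a Hugging Face text dataset to MDS shards.
# Assumes a single "text" column; adjust to match the repo's dataloader.
from datasets import load_dataset
from streaming import MDSWriter  # pip install mosaicml-streaming

dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# compression=None writes uncompressed shards, so no separate decompression
# step is needed when running with streaming disabled.
with MDSWriter(out="data/c4/train", columns={"text": "str"}, compression=None) as writer:
    for sample in dataset:
        writer.write({"text": sample["text"]})
```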

Again, sorry for the delay and hopefully we'll have better documentation soon.

@BramVanroy

@NohTow Hi there! Any update on a proper step-by-step guide for pretraining?

@NohTow
Collaborator

NohTow commented Feb 4, 2025

Hello,

Until we update the READMEs and merge the configs, the above comment is the closest thing to a step-by-step guide.
I agree that this is not optimal for now, and again I apologize for the delay, but could you specify what information you are lacking w.r.t. the comment so we can add it to the readme?
Thanks!

Edit: actually, I forgot, but #183, which adds a bit of documentation to the main readme, has been merged. So besides merging the configs, is there anything you are missing?

@ebrarkiziloglu

Hi @NohTow, following your guide, I am encountering an issue in pre-training. Could you help with #199?
