
Questions about the Training Process of the Latest b28c512nbt Model #1015

Open

Sword-Nomad opened this issue Jan 22, 2025 · 3 comments

@Sword-Nomad commented Jan 22, 2025

I would like to understand more about the training process of the KataGo models, specifically regarding the training method of the latest b28c512nbt model.

According to the KataGo paper, the training process appears to involve progressively increasing the network size, starting from smaller architectures like b6c96, then moving to b10c128, b15c192, and finally to b28c512nbt. This progressive approach allows the model to gradually learn more complex features while avoiding the computational cost and difficulty of training a very large network from the beginning.
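
To get a rough sense of why training the final size from scratch would be costly, the architecture names can be read (in KataGo's usual naming convention) as "bX" residual blocks with "cY" channels, which allows a quick back-of-envelope comparison of trunk sizes. The sketch below ignores the "nbt" nested-bottleneck structure and the network heads entirely, so the numbers are only order-of-magnitude illustrations, not actual parameter counts of the released models.

```python
# Back-of-envelope trunk parameter counts, reading "bXcY" as X residual
# blocks with Y channels and assuming a plain block with two 3x3 convs.
# This ignores the "nbt" nested-bottleneck structure and the policy/value
# heads, so treat the results as order-of-magnitude illustrations only.
ARCHS = {"b6c96": (6, 96), "b10c128": (10, 128),
         "b15c192": (15, 192), "b28c512nbt": (28, 512)}

for name, (blocks, channels) in ARCHS.items():
    params = blocks * 2 * 3 * 3 * channels * channels
    print(f"{name:>11}: ~{params / 1e6:.1f}M trunk parameters")
```

Even at this crude level, the final architecture's trunk is more than a hundred times larger than b6c96's, which is the motivation for not paying full-size self-play cost from the very beginning.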

However, I am uncertain if this is the actual training strategy used by KataGo. Therefore, I would like to confirm the following points:

1. Training Strategy: Was the b28c512nbt model of KataGo indeed trained by progressively increasing the network size from smaller networks? Or was it initialized and trained directly as the final large network?
2. Training Process: If KataGo adopted the progressive network size increase strategy, do the changes in network size, training duration, and data volume at each stage align with the information provided in the Approximate Elo Ratings Graph?

Thank you very much for your assistance!

[Image: Approximate Elo Ratings graph of the KataGo run]

@lightvector (Owner)

Yes. The b28c512nbt network did not even exist at the beginning of the run; the architecture hadn't been invented yet. It was brought in later by training it on data from other networks and then switching over to it.

The graph you posted is the training history of the run. You can see clearly from it that the network size was progressively changed; that's why the graph starts with b6c96 and then changes to other networks. Is there something that makes you not trust the graph? It's the official graph, and if you already know about it, it answers your own question.
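
For readers less familiar with this kind of pipeline, here is a minimal sketch of the step described above: a freshly initialized, larger architecture is first trained purely on self-play data that earlier networks generated, and only afterwards switched into self-play. The model, loss, and data loader below are toy stand-ins, not KataGo's actual training code (the real training scripts live in the repository's python/ directory).

```python
# Toy sketch of "train the new architecture on existing data, then switch".
# Nothing here is KataGo's real training code.
import torch
import torch.nn as nn

def build_network(channels: int) -> nn.Module:
    # Stand-in for constructing a larger architecture (e.g. b28c512nbt).
    return nn.Sequential(nn.Conv2d(18, channels, 3, padding=1),
                         nn.ReLU(),
                         nn.Conv2d(channels, 1, 1))

def bootstrap_larger_net(old_data, channels: int = 512, steps: int = 1000) -> nn.Module:
    """Train a freshly initialized larger net only on rows produced by
    earlier, smaller networks; no new self-play happens in this phase."""
    net = build_network(channels)
    opt = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)
    for step, (inputs, targets) in zip(range(steps), old_data):
        loss = nn.functional.mse_loss(net(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # After this offline phase the new net is evaluated, and if it is at
    # least as strong, self-play data generation switches over to it.
    return net
```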

@lightvector (Owner)

Note that one thing that might indeed be confusing is one of the settings of the graph: "Upload time" is just the time the network was first created and uploaded to the website for use in distributed self-play by contributors. However, the earliest half of the networks were all uploaded at once, because they were all trained offline, prior to crowdsourcing the data generation for KataGo, and the upload time is just when the website itself recorded them being uploaded.

See the "g170" run at https://katagoarchive.org/index.html - the current ongoing crowdsourced "kata1" run is a continuation of the "g170" run, which was the offline run whose networks were all uploaded at once at the start. Those archives have the accurate data dumps, and quite possibly still even the accurate file modification times for the individual training data files, if you unzip them.

Other than that, yes, you can trust the dates and such of the files recorded on the training site; those were the dates that things were uploaded and when networks were switched to for self-play.
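
To illustrate the point about file modification times, the timestamps recorded for the individual training data files can be listed without extracting the archive. The filename below is a placeholder for whichever g170 data dump you downloaded from katagoarchive.org, and this assumes the dump is a .zip as the comment above suggests.

```python
# Inspect the recorded modification times of individual training data
# files inside a downloaded g170 data dump, without extracting it.
# "g170-dump.zip" is a placeholder filename for whichever archive you fetched.
import zipfile

with zipfile.ZipFile("g170-dump.zip") as zf:
    for info in zf.infolist():
        year, month, day, hour, minute, second = info.date_time
        print(f"{info.filename}: {year:04d}-{month:02d}-{day:02d} "
              f"{hour:02d}:{minute:02d}:{second:02d}")
```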

@Sword-Nomad (Author)

Thank you, David. Regarding your point that "it was brought in later by being trained on data of other networks and then switched to," I have a few more questions that I would like to clarify:

  1. Data Reuse Strategy
    When transitioning from smaller networks (e.g., b6c96) to larger networks (e.g., b10c128), which strategy does KataGo employ:

    • Direct Reuse of Historical Data: Directly using the self-play historical data generated by the smaller network (b6c96) for training.
    • Rapid Generation of New Data: Using the trained smaller network (b6c96) to quickly generate a batch of new self-play data, which is then used to train the larger network (b10c128).
  2. Scope of Data Reuse
    Assuming the goal is to train b28c512nbt and multiple networks have already generated self-play data (e.g., b6c96, b10c128, b15c192, b18c384nbt), which strategy does KataGo employ (a generic sliding-window scheme that would blur this distinction is sketched after this list):

    • Full Historical Data Reuse: Using self-play data from all previous networks (b6c96, b10c128, b15c192, b18c384nbt, etc.) to train b28c512nbt.
    • Single-Step Data Reuse: Only using the self-play data from the most recent network (e.g., b18c384nbt) for training.
  3. Data Mixing Strategy
    When training larger networks, which strategy does KataGo employ:

    • Phased Training: First training with self-play data from smaller networks, then generating new self-play data with the larger network and continuing training.
    • Mixed Training: Combining self-play data from smaller networks with new self-play data generated by the larger network for simultaneous training.
  4. Data Starting Point and Training Range
    From the graph, it appears that the training data range for b18c384nbt is approximately from 3.05G to 4.3G, while for b28c512nbt, it is from 4.25G to 4.6G. Based on this, I would like to confirm the following details:

    • Significance of Data Starting Point: Does the fact that b28c512nbt starts training from 4.25G mean it reuses self-play data generated by other networks (e.g., b18c384nbt) before 4.25G?
    • Training Range Process: Is the data range from 4.25G to 4.6G entirely generated by b28c512nbt through self-play and used for its training? If so, does this imply that b28c512nbt reached training convergence at 4.6G?
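
To make the alternatives in questions 2-4 concrete, here is a sketch of a generic sliding-window data-selection rule of the sort used in AlphaZero-style runs. The window fraction and row counts are made-up placeholders; whether KataGo's shuffling step actually behaves like this is exactly what is being asked, so treat it as illustration only.

```python
# Generic sliding-window data selection, purely illustrative.
# "Rows" are self-play training rows indexed by the cumulative count at
# which they were generated (the x-axis of the Elo graph).
def select_training_window(total_rows_generated: int,
                           window_fraction: float = 0.25,
                           min_window: int = 250_000):
    """Return the (start, end) range of self-play rows to sample from.

    With a window like this, a newly introduced architecture trained at
    ~4.3G total rows would see data reaching back well before its own
    self-play began, i.e. rows produced by earlier, smaller networks.
    """
    window = max(min_window, int(total_rows_generated * window_fraction))
    start = max(0, total_rows_generated - window)
    return start, total_rows_generated

# Example: at 4.3e9 rows, a 25% window reaches back to ~3.2e9 rows,
# which would include data generated by the previous network(s).
print(select_training_window(4_300_000_000))
```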

[Image: graph showing the training data ranges for b18c384nbt and b28c512nbt]

Thank you very much for your assistance!
