
Questions about the Training Process of the Latest b28c512nbt Model #1015

Open

Sword-Nomad opened this issue Jan 22, 2025 · 3 comments

@Sword-Nomad commented Jan 22, 2025

I would like to understand more about the training process of the KataGo models, specifically regarding the training method of the latest b28c512nbt model.

According to the KataGo paper, the training process appears to involve progressively increasing the network size, starting from smaller architectures like b6c96, then moving to b10c128, b15c192, and finally to b28c512nbt. This progressive approach allows the model to gradually learn more complex features while avoiding the computational cost and difficulty of training a very large network from the beginning.
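
To get a rough sense of why training the final size from scratch would be costly, the architecture names can be read (in KataGo's usual naming convention) as "bX" residual blocks with "cY" channels, which allows a quick back-of-envelope comparison of trunk sizes. The sketch below ignores the "nbt" nested-bottleneck structure and the network heads entirely, so the numbers are only order-of-magnitude illustrations, not actual parameter counts of the released models.

```python
# Back-of-envelope trunk parameter counts, reading "bXcY" as X residual
# blocks with Y channels and assuming a plain block with two 3x3 convs.
# This ignores the "nbt" nested-bottleneck structure and the policy/value
# heads, so treat the results as order-of-magnitude illustrations only.
ARCHS = {"b6c96": (6, 96), "b10c128": (10, 128),
         "b15c192": (15, 192), "b28c512nbt": (28, 512)}

for name, (blocks, channels) in ARCHS.items():
    params = blocks * 2 * 3 * 3 * channels * channels
    print(f"{name:>11}: ~{params / 1e6:.1f}M trunk parameters")
```

Even at this crude level, the final architecture's trunk is more than a hundred times larger than b6c96's, which is the motivation for not paying full-size self-play cost from the very beginning.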

However, I am uncertain if this is the actual training strategy used by KataGo. Therefore, I would like to confirm the following points:

1. Training Strategy: Was the b28c512nbt model of KataGo indeed trained by progressively increasing the network size from smaller networks? Or was it initialized and trained directly as the final large network?
2. Training Process: If KataGo adopted the progressive network size increase strategy, do the changes in network size, training duration, and data volume at each stage align with the information provided in the Approximate Elo Ratings Graph?

Thank you very much for your assistance!

[Image: Approximate Elo Ratings graph of the KataGo run]

@lightvector (Owner)

Yes. The b28c512nbt network did not even exist at the beginning of the run; the architecture hadn't been invented yet. It was brought in later by training it on data from other networks and then switching over to it.

The graph you posted is the training history of the run. You can see clearly from it that the network size was progressively changed; that's why the graph starts with b6c96 and then changes to other networks. Is there something that makes you not trust the graph? It's the official graph, and if you already know about it, it answers your own question.
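
For readers less familiar with this kind of pipeline, here is a minimal sketch of the step described above: a freshly initialized, larger architecture is first trained purely on self-play data that earlier networks generated, and only afterwards switched into self-play. The model, loss, and data loader below are toy stand-ins, not KataGo's actual training code (the real training scripts live in the repository's python/ directory).

```python
# Toy sketch of "train the new architecture on existing data, then switch".
# Nothing here is KataGo's real training code.
import torch
import torch.nn as nn

def build_network(channels: int) -> nn.Module:
    # Stand-in for constructing a larger architecture (e.g. b28c512nbt).
    return nn.Sequential(nn.Conv2d(18, channels, 3, padding=1),
                         nn.ReLU(),
                         nn.Conv2d(channels, 1, 1))

def bootstrap_larger_net(old_data, channels: int = 512, steps: int = 1000) -> nn.Module:
    """Train a freshly initialized larger net only on rows produced by
    earlier, smaller networks; no new self-play happens in this phase."""
    net = build_network(channels)
    opt = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)
    for step, (inputs, targets) in zip(range(steps), old_data):
        loss = nn.functional.mse_loss(net(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # After this offline phase the new net is evaluated, and if it is at
    # least as strong, self-play data generation switches over to it.
    return net
```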

@lightvector (Owner)

Note that one thing that might indeed be confusing is one of the settings of the graph: "Upload time" is just the time the network was first created and uploaded to the website for use in distributed self-play by contributors. However, the earliest half of the networks were all uploaded at once, because they were all trained offline, prior to crowdsourcing the data generation for KataGo, and the upload time is just when the website itself recorded them being uploaded.

See the "g170" run at https://katagoarchive.org/index.html - the current ongoing crowdsourced "kata1" run is a continuation of the "g170" run, which was the offline run whose networks were all uploaded at once at the start. Those archives have the accurate data dumps, and quite possibly still even the accurate file modification times for the individual training data files, if you unzip them.

Other than that, yes, you can trust the dates and such of the files recorded on the training site; those were the dates that things were uploaded and when networks were switched to for self-play.
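
To illustrate the point about file modification times, the timestamps recorded for the individual training data files can be listed without extracting the archive. The filename below is a placeholder for whichever g170 data dump you downloaded from katagoarchive.org, and this assumes the dump is a .zip as the comment above suggests.

```python
# Inspect the recorded modification times of individual training data
# files inside a downloaded g170 data dump, without extracting it.
# "g170-dump.zip" is a placeholder filename for whichever archive you fetched.
import zipfile

with zipfile.ZipFile("g170-dump.zip") as zf:
    for info in zf.infolist():
        year, month, day, hour, minute, second = info.date_time
        print(f"{info.filename}: {year:04d}-{month:02d}-{day:02d} "
              f"{hour:02d}:{minute:02d}:{second:02d}")
```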

@Sword-Nomad (Author)

Thank you, David. Regarding your point that "it was brought in later by being trained on data of other networks and then switched to," I have a few more questions that I would like to clarify:

  1. Data Reuse Strategy
    When transitioning from smaller networks (e.g., b6c96) to larger networks (e.g., b10c128), which strategy does KataGo employ:

    • Direct Reuse of Historical Data: Directly using the self-play historical data generated by the smaller network (b6c96) for training.
    • Rapid Generation of New Data: Using the trained smaller network (b6c96) to quickly generate a batch of new self-play data, which is then used to train the larger network (b10c128).
  2. Scope of Data Reuse
    Assuming the goal is to train b28c512nbt and multiple networks have already generated self-play data (e.g., b6c96, b10c128, b15c192, b18c384nbt), which strategy does KataGo employ (a generic sliding-window scheme that would blur this distinction is sketched after this list):

    • Full Historical Data Reuse: Using self-play data from all previous networks (b6c96, b10c128, b15c192, b18c384nbt, etc.) to train b28c512nbt.
    • Single-Step Data Reuse: Only using the self-play data from the most recent network (e.g., b18c384nbt) for training.
  3. Data Mixing Strategy
    When training larger networks, which strategy does KataGo employ:

    • Phased Training: First training with self-play data from smaller networks, then generating new self-play data with the larger network and continuing training.
    • Mixed Training: Combining self-play data from smaller networks with new self-play data generated by the larger network for simultaneous training.
  4. Data Starting Point and Training Range
    From the graph, it appears that the training data range for b18c384nbt is approximately from 3.05G to 4.3G, while for b28c512nbt, it is from 4.25G to 4.6G. Based on this, I would like to confirm the following details:

    • Significance of Data Starting Point: Does the fact that b28c512nbt starts training from 4.25G mean it reuses self-play data generated by other networks (e.g., b18c384nbt) before 4.25G?
    • Training Range Process: Is the data range from 4.25G to 4.6G entirely generated by b28c512nbt through self-play and used for its training? If so, does this imply that b28c512nbt reached training convergence at 4.6G?
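
To make the alternatives in questions 2-4 concrete, here is a sketch of a generic sliding-window data-selection rule of the sort used in AlphaZero-style runs. The window fraction and row counts are made-up placeholders; whether KataGo's shuffling step actually behaves like this is exactly what is being asked, so treat it as illustration only.

```python
# Generic sliding-window data selection, purely illustrative.
# "Rows" are self-play training rows indexed by the cumulative count at
# which they were generated (the x-axis of the Elo graph).
def select_training_window(total_rows_generated: int,
                           window_fraction: float = 0.25,
                           min_window: int = 250_000):
    """Return the (start, end) range of self-play rows to sample from.

    With a window like this, a newly introduced architecture trained at
    ~4.3G total rows would see data reaching back well before its own
    self-play began, i.e. rows produced by earlier, smaller networks.
    """
    window = max(min_window, int(total_rows_generated * window_fraction))
    start = max(0, total_rows_generated - window)
    return start, total_rows_generated

# Example: at 4.3e9 rows, a 25% window reaches back to ~3.2e9 rows,
# which would include data generated by the previous network(s).
print(select_training_window(4_300_000_000))
```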

[Image: graph showing the training data ranges for b18c384nbt and b28c512nbt]

Thank you very much for your assistance!
