Do you consider model accuracy issues, and how can we get the whole training time of the AI training process? #70

Open
horser1 opened this issue Jan 13, 2025 · 4 comments

Comments

@horser1

horser1 commented Jan 13, 2025

Hello, thanks for your excellent work on large-scale AI training simulation. I'm curious whether you're considering model accuracy issues. Do the parameters you provide for modification have an impact on model accuracy?
You said it can "evaluate the time consumption of AI tasks". As far as I know, Astra-sim can only get the time of a single batch, not the whole training process (because that also depends on the number of epochs, and so on). So I'm also curious how you think about this problem.

@Huoyuan100861
Collaborator

  1. The model's performance and accuracy can vary with different parameters. For instance, if you split your model's parallelism to an extreme extent, it could lead to very small matrix multiplication dimensions, significantly reducing computational efficiency and causing high fluctuations in training time.
  2. Isn't the entire training process just a series of N global batch iterations? N is determined by the size of your training dataset and can be calculated through simple arithmetic (see the sketch below). We welcome you to contribute a pull request to SimAI to improve this aspect.
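
A minimal sketch of that arithmetic, assuming illustrative values for the dataset size and global batch size (these names are assumptions for the example, not SimAI parameters):

```python
# Back-of-the-envelope arithmetic for N, the number of global batch
# iterations per epoch. dataset_size and global_batch_size are
# illustrative assumptions, not SimAI parameters.
dataset_size = 1_000_000         # training samples (assumed)
global_batch_size = 2048         # samples per global batch (assumed)

# Ceiling division, so a final partial batch still counts as one iteration.
N = -(-dataset_size // global_batch_size)
print(f"N = {N} iterations per epoch")   # N = 489
```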

@horser1
Author

horser1 commented Jan 15, 2025

Thanks for your answers. Yes, the total time can indeed be obtained by a simple calculation from the time of a single batch:
whole_time = single_batch_time * N * num_epochs
But some parameters may affect the convergence speed, leading to a higher number of epochs. After modifying some parameters, the number of epochs may change, so how do you decide the number of epochs?
Besides, I wonder whether you have considered the effect of modifying the parameters on the accuracy of the model. If modifying the parameters brings an increase in E2E performance but results in much worse model accuracy, then that optimization seems pointless.
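
As a rough sketch, that formula as runnable code (every input here is an assumed value for illustration; only the per-batch time would come from a simulator like SimAI, and the epoch count depends on convergence, which is exactly the open question above):

```python
# Total training time from a simulated per-iteration time.
# All values are assumptions for illustration.
single_batch_time = 2.5          # seconds per global batch iteration (assumed)
N = 489                          # global batch iterations per epoch (assumed)
num_epochs = 3                   # assumed; depends on convergence behavior

whole_time = single_batch_time * N * num_epochs
print(f"estimated total training time: {whole_time / 3600:.2f} hours")
```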

@Huoyuan100861
Collaborator

Hmm, SimAI is a simulator and does not actually train models on data, so it does not track model convergence. However, some community members are researching how to predict training convergence metrics for large models, and that work may be integrated into SimAI in the future. If you're interested in this area, feel free to reach out for further discussion.

@horser1
Author

horser1 commented Feb 25, 2025

Thanks for your explanation. I'm trying to find the research on how to predict training convergence metrics for large models, but I haven't found any work on it. Could you point me to some related work or papers?
Thank you very much! 🚀
