Do you consider model accuracy issues, and how can we get the whole training time of the AI training process? #70

Open
horser1 opened this issue Jan 13, 2025 · 4 comments

Comments

@horser1

horser1 commented Jan 13, 2025

Hello, thanks for your excellent work on large-scale AI training simulation. I'm curious whether you're considering model accuracy issues. Do the parameters you provide for modification have an impact on model accuracy?
You said it can "evaluate the time consumption of AI tasks". As far as I know, Astra-sim can only get the time of a single batch, not the whole training process (because that also depends on the number of epochs, and so on). So I'm also curious how you think about this problem.

@Huoyuan100861
Collaborator

  1. The model's performance and accuracy can vary with different parameters. For instance, if you split your model's parallelism to an extreme extent, it could lead to very small matrix multiplication dimensions, significantly reducing computational efficiency and causing high fluctuations in training time.
  2. Isn't the entire training process just a series of N global batch iterations? N is determined by the size of your training dataset and can be calculated through simple arithmetic (see the sketch below). We welcome you to contribute a pull request to SimAI to improve this aspect.
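
A minimal sketch of that arithmetic, assuming illustrative values for the dataset size and global batch size (these names are assumptions for the example, not SimAI parameters):

```python
# Back-of-the-envelope arithmetic for N, the number of global batch
# iterations per epoch. dataset_size and global_batch_size are
# illustrative assumptions, not SimAI parameters.
dataset_size = 1_000_000         # training samples (assumed)
global_batch_size = 2048         # samples per global batch (assumed)

# Ceiling division, so a final partial batch still counts as one iteration.
N = -(-dataset_size // global_batch_size)
print(f"N = {N} iterations per epoch")   # N = 489
```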

@horser1
Author

horser1 commented Jan 15, 2025

Thanks for your answers. Yes, the total time can indeed be obtained by a simple calculation from the time of a single batch:
whole_time = single_batch_time * N * num_epochs
But some parameters may affect the convergence speed, leading to a higher number of epochs. After modifying some parameters, the number of epochs may change, so how do you decide the number of epochs?
Besides, I wonder whether you have considered the effect of modifying the parameters on the accuracy of the model. If modifying the parameters brings an increase in E2E performance but results in much worse model accuracy, then that optimization seems pointless.
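
As a rough sketch, that formula as runnable code (every input here is an assumed value for illustration; only the per-batch time would come from a simulator like SimAI, and the epoch count depends on convergence, which is exactly the open question above):

```python
# Total training time from a simulated per-iteration time.
# All values are assumptions for illustration.
single_batch_time = 2.5          # seconds per global batch iteration (assumed)
N = 489                          # global batch iterations per epoch (assumed)
num_epochs = 3                   # assumed; depends on convergence behavior

whole_time = single_batch_time * N * num_epochs
print(f"estimated total training time: {whole_time / 3600:.2f} hours")
```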

@Huoyuan100861
Collaborator

Hmm, SimAI is a simulator and does not actually train models on data, so it does not track model convergence. However, some community members are researching how to predict training convergence metrics for large models, and that work may be integrated into SimAI in the future. If you're interested in this area, feel free to reach out for further discussion.

@horser1
Author

horser1 commented Feb 25, 2025

Thanks for your explanation. I'm trying to find the research on how to predict training convergence metrics for large models, but I haven't found any work on it. Could you point me to some related work or papers?
Thank you very much! 🚀
