Clarification of Non-Determinism for Serial Steps in Models #36
Comments
What about other sources of non-determinism, e.g., silent data corruption of hardware state or stochastic rounding? Will these be permitted? Silent data corruption due to high-energy particles is challenging to eliminate from all hardware structures. I suppose stochastic rounding is permitted as long as it can be deterministically reproduced? How are the sources of non-determinism expected to be validated? The example of eventually consistent gradients being not permitted would then imply that mechanisms like deep gradient compression are not permitted? Isn't it possible for eventually consistent gradients to be deterministic?
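For illustration, here is a minimal sketch of what deterministically reproducible stochastic rounding could look like; the function name and the use of a fixed NumPy seed are my own assumptions for the example, not anything prescribed by MLPerf:

```python
import numpy as np

def stochastic_round(values, rng):
    """Round each value up or down with probability proportional to its
    distance from the two neighbouring integers."""
    floor = np.floor(values)
    frac = values - floor
    # Draw uniform samples; round up when the sample falls below the fraction.
    return floor + (rng.random(values.shape) < frac)

# Seeding the generator makes the "random" rounding bit-for-bit reproducible,
# which is the sense in which stochastic rounding could still be deterministic.
rng = np.random.default_rng(seed=1234)
x = np.array([0.3, 1.7, 2.5])
print(stochastic_round(x, rng))  # identical output on every run with this seed
```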
SWG Recommendation:
It seems like prohibiting stale gradients or eventual consistency will penalize several vendors (particularly those that have limited storage). For example, Graphcore has been fairly clear that they believe recomputing gradients from snapshots is a valid option to reduce the memory footprint.
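For context on what recomputing from snapshots can look like in practice, here is a minimal sketch using activation checkpointing in PyTorch. This is only my own illustration of the general memory-for-compute trade-off, not Graphcore's implementation or anything the MLPerf rules mandate:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of layers; normally every intermediate activation is kept
# alive until the backward pass.
layers = [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
          for _ in range(16)]
model = torch.nn.Sequential(*layers)

x = torch.randn(32, 256, requires_grad=True)

# Keep only a handful of "snapshot" activations (one per segment) and
# recompute the rest during the backward pass, trading compute for memory.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```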
Allow me to clarify one thing: this recommendation refers to the "Closed" division of MLPerf (as opposed to the open division). If there are considerations here you think may have been missed, please don't hesitate to reach out. It's not our intention to penalize any vendor; please raise this issue on the MLPerf mailing list (https://groups.google.com/forum/#!forum/mlperf) to continue discussion on this topic.
SWG: We will reach out for more comments and re-evaluate this to provide more clarity and incorporate more input. Please reach out if you'd like to be a part of this process.
Victor -
I would like to be part of the process.
Gary Lauterbach
[email protected]<mailto:[email protected]>
CTO and Co-Founder
Cerebras Systems
Victor, the distinction between closed and open is more important than you may realize. In the initial discussion of MLPerf, the open division was characterized as being good for research, for speculative techniques and experiments that are outside of common practice. The closed division was characterized as being good for 'apples to apples' comparisons of real hardware and systems. Prohibiting certain techniques (e.g., batch size = 1, eventual consistency) is an implicit statement by MLPerf that those techniques are experimental and not commercially relevant. That's a problem for any company interested in such techniques, because it amounts to MLPerf saying they are merely experimental.
Hi all, we should strive to admit a range of architectures to closed, but we still need to draw the line somewhere and say "this is a different class of thing." Some questions that might help in considering this change:
Reconsider this as part of both ends of batch-size scaling.
SWG: Action item for Gary (@ad6fp): what would be a conservative small-batch-size threshold to allow this? Do you have data to show this is necessary below that batch size?
SWG: Something along these lines may be required to support small batch sizes. We are working on enabling both large and small batch sizes. Our understanding is that this is not needed for this submission cycle; it can be revisited during a future submission cycle.
MLPerf generally prohibits the introduction of non-determinism to achieve speedups. There is a grey area here, though. For example, I generally believe that MLPerf allows non-determinism when it comes from threading and interleaving (or from different batch sizes) when scaling up models, or where such non-determinism is already inherent in shuffling.
I want to clarify that, to my understanding, it is explicitly not allowed (at least for v0.5) to introduce lazy/eventual consistency where the reference implementation uses strong consistency. For example, using stale or eventually consistent gradients (or models) is prohibited unless the reference implementation does the same.
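To make the distinction concrete, here is a hedged sketch contrasting the two behaviours. It is not taken from any MLPerf reference implementation; the plain SGD step and the delayed-application loop are my own illustration of what "stale gradients" means here:

```python
import numpy as np

def sgd_step(weights, grad, lr=0.1):
    """Plain SGD update used in both variants below."""
    return weights - lr * grad

def strongly_consistent_update(weights, batches, grad_fn):
    """Every step applies a gradient computed against the *current* weights,
    the behaviour assumed for a strongly consistent reference."""
    for batch in batches:
        weights = sgd_step(weights, grad_fn(weights, batch))
    return weights

def stale_gradient_update(weights, batches, grad_fn, delay=1):
    """Each step applies a gradient computed `delay` steps earlier, i.e.
    against stale weights. This is the pattern the recommendation prohibits
    unless the reference implementation itself does the same."""
    pending = []
    for batch in batches:
        pending.append(grad_fn(weights, batch))
        if len(pending) > delay:
            weights = sgd_step(weights, pending.pop(0))
    for grad in pending:  # flush the remaining delayed gradients
        weights = sgd_step(weights, grad)
    return weights
```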