Clarification of Non-Determinism for Serial Steps in Models #36
Comments
What about other sources of non-determinism, e.g., silent data corruption of hardware state or stochastic rounding? Will these be permitted? Silent data corruption due to high-energy particles is challenging to eliminate from all hardware structures. I suppose stochastic rounding is permitted as long as it can be deterministically reproduced? How are the sources of non-determinism expected to be validated? The example of eventually consistent gradients being not permitted would then imply that mechanisms like deep gradient compression are not permitted? Isn't it possible for eventually consistent gradients to be deterministic?
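For illustration, here is a minimal sketch of what deterministically reproducible stochastic rounding could look like; the function name and the use of a fixed NumPy seed are my own assumptions for the example, not anything prescribed by MLPerf:

```python
import numpy as np

def stochastic_round(values, rng):
    """Round each value up or down with probability proportional to its
    distance from the two neighbouring integers."""
    floor = np.floor(values)
    frac = values - floor
    # Draw uniform samples; round up when the sample falls below the fraction.
    return floor + (rng.random(values.shape) < frac)

# Seeding the generator makes the "random" rounding bit-for-bit reproducible,
# which is the sense in which stochastic rounding could still be deterministic.
rng = np.random.default_rng(seed=1234)
x = np.array([0.3, 1.7, 2.5])
print(stochastic_round(x, rng))  # identical output on every run with this seed
```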
SWG Recommendation:
It seems like prohibiting stale gradients or eventual consistency will penalize several vendors (particularly those that have limited storage). For example, Graphcore has been fairly clear that they believe recomputing gradients from snapshots is a valid option to reduce the memory footprint.
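For context on what recomputing from snapshots can look like in practice, here is a minimal sketch using activation checkpointing in PyTorch. This is only my own illustration of the general memory-for-compute trade-off, not Graphcore's implementation or anything the MLPerf rules mandate:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of layers; normally every intermediate activation is kept
# alive until the backward pass.
layers = [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
          for _ in range(16)]
model = torch.nn.Sequential(*layers)

x = torch.randn(32, 256, requires_grad=True)

# Keep only a handful of "snapshot" activations (one per segment) and
# recompute the rest during the backward pass, trading compute for memory.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```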
Allow me to clarify one thing: this recommendation refers to the "Closed" division of MLPerf (as opposed to the open division). If there are considerations here you think may have been missed, please don't hesitate to reach out. It's not our intention to penalize any vendor; please raise this issue on the MLPerf mailing list (https://groups.google.com/forum/#!forum/mlperf) to continue discussion on this topic.
SWG: We will reach out for more comments and re-evaluate this to provide more clarity and incorporate more input. Please reach out if you'd like to be a part of this process.
Victor -
I would like to be part of the process.
Gary Lauterbach
[email protected]<mailto:[email protected]>
CTO and Co-Founder
Cerebras Systems
Victor, the distinction between closed and open is more important than you may realize. In the initial discussion of MLPerf, the open division was characterized as being good for research, for speculative techniques and experiments that are outside of common practice. The closed division was characterized as being good for 'apples to apples' comparisons of real hardware and systems. Prohibiting certain techniques (e.g., batch size = 1, eventual consistency) is an implicit statement by MLPerf that those techniques are experimental and not commercially relevant. That's a problem for any company interested in such techniques, because it amounts to MLPerf saying they are merely experimental.
Hi all, we should strive to admit a range of architectures to closed, but we still need to draw the line somewhere and say "this is a different class of thing." Some questions that might help in considering this change:
Reconsider this as part of both ends of batch-size scaling.
SWG: Action item for Gary (@ad6fp): what would be a conservative small-batch-size threshold to allow this? Do you have data to show this is necessary below that batch size?
SWG: Something along these lines may be required to support small batch sizes. We are working on enabling both large and small batch sizes. Our understanding is that this is not needed for this submission cycle; it can be revisited during a future submission cycle.
MLPerf generally prohibits the introduction of non-determinism to achieve speedups. There is a grey area here, though. For example, I generally believe that MLPerf allows non-determinism when it comes from threading and interleaving (or from different batch sizes) when scaling up models, or where such non-determinism is already inherent in shuffling.
I want to clarify that, to my understanding, it is explicitly not allowed (at least for v0.5) to introduce lazy/eventual consistency where the reference implementation uses strong consistency. For example, using stale or eventually consistent gradients (or models) is prohibited unless the reference implementation does the same.
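To make the distinction concrete, here is a hedged sketch contrasting the two behaviours. It is not taken from any MLPerf reference implementation; the plain SGD step and the delayed-application loop are my own illustration of what "stale gradients" means here:

```python
import numpy as np

def sgd_step(weights, grad, lr=0.1):
    """Plain SGD update used in both variants below."""
    return weights - lr * grad

def strongly_consistent_update(weights, batches, grad_fn):
    """Every step applies a gradient computed against the *current* weights,
    the behaviour assumed for a strongly consistent reference."""
    for batch in batches:
        weights = sgd_step(weights, grad_fn(weights, batch))
    return weights

def stale_gradient_update(weights, batches, grad_fn, delay=1):
    """Each step applies a gradient computed `delay` steps earlier, i.e.
    against stale weights. This is the pattern the recommendation prohibits
    unless the reference implementation itself does the same."""
    pending = []
    for batch in batches:
        pending.append(grad_fn(weights, batch))
        if len(pending) > delay:
            weights = sgd_step(weights, pending.pop(0))
    for grad in pending:  # flush the remaining delayed gradients
        weights = sgd_step(weights, grad)
    return weights
```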