
WPT tests tracker #265

Closed
BruceDai opened this issue May 5, 2022 · 21 comments

Comments

@BruceDai
Contributor

BruceDai commented May 5, 2022

Thanks to @fdwr's great review efforts and @Honry's approvals and help, all of our WPT WebNN test PRs have now landed, after the previous blocker of syncing the updated WebNN IDL interfaces (web-platform-tests/wpt#36908) was resolved by my PR fixing the CI failure.

There are now 432 WPT WebNN operation tests covering 40 ops in total for the first-wave models, after the convTranspose2d tests (web-platform-tests/wpt#38100) landed.
We can run these tests on https://wpt.live/webnn/, e.g.:

Bruce is continuing to add tests for the remaining ops (#338), working closely with @fdwr.

WPT WebNN Tests:

1. WebNN API IDL Tests:

2. WebNN API JavaScript Tests (testharness.js) for operations tests:

@anssiko
Member

anssiko commented May 5, 2022

@BruceDai, thanks for your contributions to conformance testing. I added webnn-baseline to today's agenda, including discussion of ULP tolerances, to unblock your work on this (I'm not expecting a presentation, just discussion). The webnn-baseline work is identified as a CR requirement, so it is high priority.

@wchao1115 @huningxin your feedback is welcome in this issue to unblock this proposed work. Since we have a busy agenda today, we may need to defer to GH discussion.

@BruceDai
Contributor Author

BruceDai commented Jun 28, 2022

Sorry for the late status report. From testing ULP distances between the actual outputs of WebNN operations and the expected data/baseline from WebNN-Baseline on several different hardware devices, using the WebNN-Native DML and OpenVINO backends, we observed that most ULP distances stay small with normal input data, while some special input data produces large ULP distances. I'd like to propose the following majority-case ULP tolerances to the WG.

@wchao1115 Please also take a look, and I hope you can share your previous DML operation ULP tolerances, thanks.

Op | Proposed ULP tolerance
batchNormalization | 5
clamp | 0
concat | 0
conv2d | 2
add | 1
sub | 1
mul | 1
div | 2
max | 0
min | 0
pow | 3
abs | 0
ceil | 0
cos | 2
exp | 2
floor | 0
log | 3
neg | 0
sin | 2
tan | 4
gemm | 1
leakyRelu | 1
matmul | 1
averagepool2d | 2
maxpool2d | 0
relu | 0
reduceMax | 0
reduceMean | 0
reduceMin | 0
reduceProduct | 0
reduceSum | 0
reshape | 0
sigmoid | 2
slice | 0
softmax | 1
split | 0
squeeze | 0
tanh | 2
transpose | 0

I've first submitted a PR (web-platform-tests/wpt#34287) adding tests for 8 operations (clamp / concat / relu / reshape / slice / split / squeeze / transpose) whose actual outputs have a 0 ULP distance from the expected data/baseline.
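For reference, the ULP distance used for these comparisons can be computed from the float32 bit patterns roughly like this (a minimal illustrative sketch in JavaScript; the helper names are made up, not the actual test code):

```js
// Map a float32 value to a monotonically ordered integer so that adjacent
// representable floats differ by exactly 1 on this scale.
function float32ToOrderedInt(value) {
  const f32 = new Float32Array([value]);        // round to float32 first
  const bits = new Uint32Array(f32.buffer)[0];  // raw IEEE-754 bit pattern
  return bits < 0x80000000 ? bits : 0x80000000 - bits;  // flip ordering of negatives
}

// ULP distance between the device output and the expected baseline value.
function ulpDistance(actual, expected) {
  return Math.abs(float32ToOrderedInt(actual) - float32ToOrderedInt(expected));
}

// Example: check every output element against a per-op tolerance from the table above.
function assertUlpTolerance(actualArray, expectedArray, tolerance) {
  for (let i = 0; i < expectedArray.length; ++i) {
    if (ulpDistance(actualArray[i], expectedArray[i]) > tolerance) {
      throw new Error(`element ${i} exceeds ${tolerance} ULP`);
    }
  }
}
```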

@huningxin
Contributor

As this is related to wpt, which is a CR blocker (#240), I propose labeling this issue with "cr". @anssiko

@anssiko added the cr label Jun 30, 2022
@BruceDai
Contributor Author

Link to #288

@anssiko
Member

anssiko commented Sep 9, 2022

[Piggy-backing on this issue with a more generic w-p-t question.]

@BruceDai, could you give us an update on where we are in terms of test coverage for WebNN API w-p-t tests?

Our plan is to migrate the mocha tests to wpt/webnn to satisfy CR readiness criteria tracked in #240.

Looking at the relevant wpt PRs it looks like the migration is in progress.

Do you foresee other blockers besides ULP tolerances discussed in this issue? Thanks for your contributions to w-p-t!

@BruceDai
Contributor Author

BruceDai commented Sep 14, 2022

Hi @anssiko, sorry for the late response due to the holidays.

The current WebNN API spec defines 56 operations. WebNN-Baseline has already implemented the 42 first-wave ops, and WebNN-Polyfill has implemented most of them (50/56, including the 42 first-wave ops). I'm starting to add operation-level tests, beginning with the 8/42 first-wave ops listed above. Here's a table of the implemented tests, please have a look, thanks.

Operations \ tests | WebNN-Baseline | WebNN-Polyfill | WPT | Note (Is first-wave operation?)
batchNormalization | √ | √ | × | Yes
clamp | √ | √ | √ | Yes
concat | √ | √ | √ | Yes
conv2d | √ | √ | × | Yes
convTranspose2d | √(*) | √ | × | Yes
add | √ | √ | × | Yes
sub | √ | √ | × | Yes
mul | √ | √ | × | Yes
div | √ | √ | × | Yes
max | √ | √ | × | Yes
min | √ | √ | × | Yes
pow | √ | √ | × | Yes
abs | √ | √ | × | Yes
ceil | √ | √ | × | Yes
cos | √ | √ | × | Yes
exp | √ | √ | × | Yes
floor | √ | √ | × | Yes
log | √ | √ | × | Yes
neg | √ | √ | × | Yes
sin | √ | √ | × | Yes
tan | √ | √ | × | Yes
gemm | √ | √ | × | Yes
gru | √ | √ | × | Yes
gruCell | √ | √ | × | Yes
hardSigmoid | × | × | × | No
hardSwish | × | √ | × | No
instanceNormalization | × | √ | × | No
leakyRelu | √ | √ | × | Yes
matmul | √ | √ | × | Yes
linear | × | × | × | No
pad | × | √ | × | No
averagepool2d | √ | √ | × | Yes
maxpool2d | √ | √ | × | Yes
l2Pool2d | × | √ | × | No
reduceL1 | × | √ | × | No
reduceL2 | × | √ | × | No
reduceLogSum | × | × | × | No
reduceLogSumExp | × | √ | × | No
reduceMax | √ | √ | × | Yes
reduceMean | √ | √ | × | Yes
reduceMin | √ | √ | × | Yes
reduceProduct | √ | √ | × | Yes
reduceSum | √ | √ | × | Yes
reduceSumSquare | × | × | × | No
relu | √ | √ | √ | Yes
resample2d | × | √ | × | No
reshape | √ | √ | √ | Yes
sigmoid | √ | √ | × | Yes
slice | √ | √ | √ | Yes
softmax | √ | √ | × | Yes
softplus | × | × | × | No
softsign | × | × | × | No
split | √ | √ | √ | Yes
squeeze | √ | √ | √ | Yes
tanh | √ | √ | × | Yes
transpose | √ | √ | √ | Yes

Note:

  • Rows with √ in the WPT column mean that we've already added this operation's tests into WPT with a submitted PR.
  • Rows with × in the WPT column and Yes in the last column mean that we have locally migrated this first-wave op's tests from WebNN-Polyfill to the WPT WebNN tests, pending submission on the ULP tolerance decision.
  • (*) convTranspose2d was split from conv2d, so WebNN-Baseline can implicitly support convTranspose2d by invoking conv2d with some options. I'll submit a PR adding a convTranspose2d implementation and updating the relevant tests so that WebNN-Baseline clearly supports convTranspose2d.

In my opinion, there isn't any other blocker besides the ULP tolerances, which we're working on.

I plan to add the first-wave operation tests into the WPT project first, then add tests for the other operations that are still being implemented in WebNN-Polyfill and WebNN-Baseline. Any suggestions are welcome, thanks.

@anssiko
Member

anssiko commented Sep 14, 2022

@BruceDai thank you for this update, your plan sounds good to me. Your wpt contributions play an important role in the CR readiness. Please bring any further blockers to the attention of the WG so we can help you address them in a timely manner.

@anssiko
Member

anssiko commented Sep 15, 2022

@BruceDai I'll make this a meta issue for WPT tests tracking and rename the issue to reflect that.

Please link the relevant issues and PRs into this meta issue to keep the WG informed of the progress (not everyone is watching the huge wpt repo). We'll review your test plan #265 (comment) on our upcoming call. Thank you!

@anssiko changed the title from "Define ULP (unit of least precision) tolerances for Conformance testing of WebNN API" to "WPT tests tracker" Sep 15, 2022
@wchao1115
Collaborator

@BruceDai We're close to producing an initial list of recommended ULP tolerances for the ops you're listing here. There will be some more explanation as to why we recommend a certain tolerance value for certain ops in the list.

+= @fdwr.

@fdwr
Collaborator

fdwr commented Sep 23, 2022

Hi BruceDai, here's the initial list...

  • Operators can be grouped into different categories, and the tolerances in the same category are generally similar:
    • Data movement: slice, pad, concat, split, reshape, squeeze, unsqueeze, transpose, gather, scatter, padding, depthToSpace, spaceToDepth, topK, oneHot...
    • Data generation: diagonalMatrix, fillValueSequence...
    • Exact math: abs, neg, clamp, ceil, floor, min, max, relu, reduceMin, reduceMax, maxpoolNd...
    • Simple math: add, subtract, multiply, divide, linear, leakyRelu, hardSigmoid, hardSwish...
    • Complex math: exp, log, pow, softsign, softmax, softplus, sigmoid, sqrt...
    • Trigonometric functions: sin, sinh, cos, cosh, tan, tanh...
    • Lossy accumulation: convNd, convTransposeNd, gemm/matmul, batchNormalization, instanceNormalization, layerNormalization, reduceSum, reduceSumSquare, reduceMean, reduceProduct, reduceL1, reduceL2, reduceLogSum, reduceLogSumExp, resampleNd, averagePoolNd, l2PoolNd...
    • Very complex iterative: gru, gruCell, lstm, rnn...
  • Note there are numerical gotchas to beware of, including subtraction of nearly equal numbers (catastrophic cancellation), division by very small numbers (which magnify earlier errors), adding very large and very small numbers (where the large numbers eat the small numbers completely), and asymptotes of trigonometric and nonlinear functions (bad things happen with 0/0 :b). These gotchas make it impossible to pick a single tolerance that works universally, and so they're best avoided with a little control of your input data. I'm not saying you should avoid using random data (that's still fine), but consider the range you generate it within. Otherwise you're not really testing operator behavior conformance, but rather you're just testing the rounding precision of the device (which also matters, but you don't want these to make your tests brittle). So for example, with the linear operator, you can still randomly generate the input, scale, and bias parameters, but ensure scale and bias have consistent signs (both positive or both negative, or else subtraction of nearly equal numbers will eventually bite you in some random permutation). For tangent, avoid querying too close to the repeating asymptotes of 1/4τ and 3/4τ.
  • For the lossy accumulation operators, the potential error grows with the number of input elements being sampled per output element ("IEPOE" below), whether along a reduction axis like reduceSum and gemm or a sliding window like conv and averagePool, and so the upper limit for error depends on the parameters, not just a single hard-coded tolerance value. Beware you might witness a very low error running some of these operators and think the precision of the underlying computation is very good, but this is a lie, a false comfort due to round-to-nearest-even's wonderful tendency to balance out error. You could sum 100 random numbers and get an actual value only a few ULP off from the expected value in the common case, but you will eventually encounter some outliers that are pretty far off, because the error variance is still wider and the worst case is broader (broader than, say, summing 10 numbers). Expectedly, the number of lossy math operations also contributes, not just the number of inputs, and the values below are not as tight as they could be in practice, but it's about setting a reasonable upper limit.
  • Signals in the analog world have error, some of which varies independently of the strength of the signal (like Gaussian noise in audio or video), bounded by an absolute tolerance (ATOL), and some of which varies proportionally to the magnitude of the signal, bounded by a percentage/relative tolerance (RTOL) of the expected value. Similarly, graphing the error of software math functions will in some cases show error centered within some range around the expected value (like with sine and cos, which are often implemented via lookup tables with linear interpolation) and in other cases show error proportional to the magnitude of the input (like with convolution and multiplication). In computers, rather than use relative percentages (RTOL), we can instead use the bitwise delta between values to measure the unit in the last place (ULP, which you are already familiar with). For ATOL, it's just actual <= expected + atol && actual >= expected - atol.
  • Neither ULP nor ATOL alone is sufficient to cover all the cases, as you'll have legitimate points (not asymptotes) on functions like log at x=1 and atan at x=0 which cause issues for ULP because of the division by nearly-zero numbers. So you pick the right metric for the operator.
  • The tolerance can vary based on data type, float16 vs float32 (even if few of the columns below exhibit that). Additionally there are some processors that have more aberrant data types (not standard IEEE) like 12.12 fixed point math or float19of32 which only uses 19 bits and zeros in the bottom 13 bits.
  • GPUs may flush subnormals to zero whereas CPUs preserve them. If you just compare the CPU result to the GPU result without zeroing subnormals first, you'll get a huge ULP difference; but you only want to zero the CPU result when the CPU result is a denormal and the GPU result is zero, because if you always zero denormals you'll get mismatches in the other direction, since GPUs don't zero them for float16 (see the sketch below).
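A rough sketch of the subnormal handling described in the last bullet (illustrative only, not the actual comparison code): only treat the reference value as zero when it is a subnormal and the device result is exactly zero, so devices that preserve subnormals are still compared against the true value.

```js
const FLOAT32_MIN_NORMAL = 1.17549435e-38;  // smallest positive normal float32

function isSubnormal(x) {
  return x !== 0 && Math.abs(x) < FLOAT32_MIN_NORMAL;
}

// Align subnormal behavior before comparing: if the device (e.g. a GPU that
// flushes to zero) returned 0 where the reference is a subnormal, compare
// against 0 instead; otherwise keep the reference unchanged so devices that
// preserve subnormals (e.g. GPUs for float16) are not reported as mismatches.
function alignReferenceForSubnormals(referenceValue, deviceValue) {
  if (isSubnormal(referenceValue) && deviceValue === 0) {
    return 0;
  }
  return referenceValue;
}
```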
Op | Old proposed ULP tolerance | float16 | float32 | Notes
batchNormalization | 5 | 6 ULP | 6 ULP | (a - mean) * scale / sqrt(variance + epsilon) + bias
clamp | 0 | 0 | 0 | if a > high then high elif a < low then low else a
concat | 0 | 0 | 0 |
conv2d | 2 | IEPOE*2 ULP | IEPOE*2 ULP | number of reduced input elements multiplied by filter and summed (a sliding dot product like pooling). So (Filter.Sizes.W * Filter.Sizes.H * (Input.Sizes.C / GroupCount)) * 2. // * FilterSize.D too if 3D
add | 1 | 1 ULP | 1 ULP |
sub | 1 | 1 ULP | 1 ULP |
mul | 1 | 1 ULP | 1 ULP |
div | 2 | 2 ULP | 2 ULP | implementations may instead use x * (1/y), and so 1 for reciprocal and 1 for multiply
max | 0 | 0 | 0 |
min | 0 | 0 | 0 |
pow | 3 | 2 ULP | 32 ULP | may expand to expₑ(b * log(a))
abs | 0 | 0 | 0 |
ceil | 0 | 0 | 0 |
cos | 2 | 1/512 ATOL or 1 ULP | 1/1024 ATOL |
div | 2 | 2 ULP | 2 ULP |
exp | 2 | 1 ULP | 32 ULP | ULP is typically very small (0 to 2), but negative values can yield larger deltas (e.g. exp(-36.7462921143) yields ULP± 27 on my machine). float16 is actually computed using float32 (so 1 ULP for final roundoff).
floor | 0 | 0 | 0 |
log | 3 | 1/1024 ATOL or 2 ULP | 1/1024 ATOL or 2 ULP |
neg | 0 | 0 | 0 |
sin | 2 | 1/512 ATOL or 1 ULP | 1/1024 ATOL | a little looser than GPU specs
tan | 4 | 1/512 ATOL or 1 ULP | 1/1024 ATOL |
gemm | 1 | IEPOE*2+3 ULP | IEPOE*2+3 ULP | (dot(a[i, …], b[.., j]) * alpha) + (beta * C). If no optional C input and alpha/beta are identity, use matmul tolerance
leakyRelu | 1 | 1 ULP | 1 ULP | if a >= 0 then a else a * alpha
matmul | 1 | IEPOE*2 ULP | IEPOE*2 ULP | dot(a[i, …], b[.., j])
averagepool2d | 2 | IEPOE+2 ULP | IEPOE+2 ULP | number of reduced element additions and a final division
maxpool2d | 0 | 0 | 0 |
relu | 0 | 0 | 0 | max(a, 0)
reduceMax | 0 | 0 | 0 |
reduceMean | 0 | IEPOE+2 ULP | IEPOE+2 ULP | number of reduced element additions and a final division
reduceMin | 0 | 0 | 0 |
reduceProduct | 0 | IEPOE ULP | IEPOE ULP | number of reduced multiplications
reduceSum | 0 | IEPOE ULP | IEPOE ULP | number of reduced additions
reshape | 0 | 0 | 0 |
sigmoid | 2 | 3 | 32+2 | 1 / (1 + expₑ(-a)); float16's exp is done as float32 (leaving a few ULP for roundoff)
slice | 0 | 0 | 0 |
softmax | 1 | IEPOE*3+3 ULP | IEPOE*3+3 ULP | expₑ(a - reducemax(A, axes)) / reducesum(expₑ(A - reducemax(A, axes)), axis); // equivalent expₑ(a) / sum(expₑ(A))
split | 0 | 0 | 0 |
squeeze | 0 | 0 | 0 |
tan | na | 1/512 ATOL or 1 ULP | 1/1024 ATOL | may expand to sin(radians) / cos(radians)
tanh | 2 | 1/512 ATOL or 1 ULP | 1/1024 ATOL |
transpose | 0 | 0 | 0 |
  • ATOL - absolute tolerance (expected within [actual - atol, actual + atol])
  • RTOL - relative tolerance *not used, only mentioned for completeness (expected within [actual * (1-RTOL), actual * (1+RTOL)])
  • ULP - unit in the last place (expected.asRawBits within [actual.asRawBits - ulp, actual.asRawBits + ulp])
  • IEPOE - input elements per output element (depends on individual operator): e.g.
    • GEMM = a.sizes.width (or b.sizes.height)
    • Conv2D = filter.sizes.w * filter.sizes.h * (input.sizes.c / groupCount)
    • Reduction = input sizes multiplied for each active axis
    • Pooling = window size
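Putting the definitions together, a tolerance derived from them could be computed roughly as follows (an illustrative sketch; the function names, parameter names, and example values are made up):

```js
// Absolute-tolerance check: expected must lie within [actual - atol, actual + atol].
function withinAtol(actual, expected, atol) {
  return Math.abs(actual - expected) <= atol;
}

// IEPOE for a grouped 2D convolution, per the definition above:
// filter width * filter height * (input channels / group count).
function conv2dIepoe(filterWidth, filterHeight, inputChannels, groupCount) {
  return filterWidth * filterHeight * (inputChannels / groupCount);
}

// IEPOE for gemm/matmul is the reduced dimension (a.sizes.width, i.e. b.sizes.height).
function gemmIepoe(reducedDimension) {
  return reducedDimension;
}

// Per the table: conv2d tolerance = IEPOE * 2 ULP, gemm tolerance = IEPOE * 2 + 3 ULP.
const conv2dUlpTolerance = conv2dIepoe(3, 3, 4, 1) * 2;  // 3 * 3 * 4 * 2 = 72 ULP
const gemmUlpTolerance = gemmIepoe(256) * 2 + 3;         // 256 * 2 + 3 = 515 ULP
```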

Let me know if you have any questions. 🧐

(UPDATE: More continued here: #338 (comment))

@wchao1115
Collaborator

wchao1115 commented Sep 23, 2022

Big thanks to @fdwr for your contribution. @BruceDai Please note that the proposed tolerances are all relative to an ideal baseline. On our WebML call earlier in the week, I believe we agreed that the WPT tests must be relative to a framework-agnostic reference implementation of WebNN.

I think we'll need a new repo under the webmachinelearning GitHub organization specifically to host the reference implementation for our WPT tests. @anssiko and @huningxin, do you have any objection to that? This is something we can help with too.

@huningxin
Contributor

Thanks much @fdwr , that's a significant contribution!

@wchao1115, I agree we should host the reference implementation that generates the ideal baseline results. I think that's the reason we created the webnn-baseline repo and implemented the first-wave ops. These ops are implemented in JavaScript with double-precision calculation and follow the straightforward algorithms, such as conv2d.
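For illustration, such a straightforward double-precision reference looks roughly like this (a simplified sketch, not the actual webnn-baseline code); JavaScript numbers are IEEE-754 float64, so the math below runs in double precision and a WPT test would round the baseline down to float32 (or float16) only when comparing:

```js
// Naive double-precision reduceSum over a flat array (all axes reduced).
function reduceSumAll(values) {
  let sum = 0;
  for (const v of values) {
    sum += v;  // float64 accumulation
  }
  return sum;
}

// Element-wise sigmoid: 1 / (1 + exp(-x)), computed in float64.
function sigmoid(values) {
  return values.map((x) => 1 / (1 + Math.exp(-x)));
}

// Example: round the float64 baseline to float32 before comparing against device output.
const expectedFloat32 = new Float32Array(sigmoid([-1, 0, 0.5, 2]));
```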

@BruceDai
Contributor Author

Thanks much @fdwr and @wchao1115 !

  • What's the definition of "IEPOE"? It would be very helpful for implementing "IEPOE" in JavaScript for the WPT tests if there's an algorithm for it.
  • What are the concrete ATOL values for float32 and float16? You mentioned RTOL, but actual <= expected + atol && actual >= expected - atol doesn't include RTOL; should it be actual <= rtol * expected + atol && actual >= rtol * expected - atol? If so, what's the concrete value for RTOL? (See the sketch after this list.)
  • Regarding zeroing subnormals, we have hit this case before; I'm going to add a check for whether the result number is a subnormal and use zero (0.0) instead.
  • About the exp op, I had some observations in "Some thoughts of defining ULP tolerance of exp op" #288; it seems that a fixed ULP tolerance value doesn't apply to the exp op. @fdwr PTAL, thanks.
  • In the WebNN API spec, three ops (batchNormalization / conv2d / convTranspose2d) have a fused activation option; what's the ULP tolerance for these ops when the fused activation option is used? For example, with conv2d fusing a sigmoid activation in the float32 case, should it still follow conv2d's IEPOE*2 ULP tolerance or sigmoid's 3 ULP?
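For reference, the combined form asked about in the second bullet is essentially the standard allclose-style check (a sketch only; the concrete atol/rtol values are exactly what the question is asking for):

```js
// Combined absolute/relative tolerance: |actual - expected| <= atol + rtol * |expected|
function isClose(actual, expected, atol, rtol) {
  return Math.abs(actual - expected) <= atol + rtol * Math.abs(expected);
}
```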

@wchao1115
Collaborator

@BruceDai It might be more time-efficient to arrange a short 15-minute presentation at the next WG call to walk through and do Q&A on this topic. @anssiko what do you think?

@anssiko
Member

anssiko commented Sep 28, 2022

@wchao1115 I'll put @fdwr on the agenda for our next 6 Oct call, working title "Recommended tolerances for WPT tests".

@BruceDai
Contributor Author

Thanks @fdwr!
I updated the last PR (web-platform-tests/wpt#34287) following the precision-metrics suggestions above: I updated the existing data-movement op float32 tests, which use the ULP metric, and added tanh op float32 tests using the ATOL metric and gemm op float32 tests using the IEPOE-based metric. This PR is under review.
The other float32 tests for the remaining first-wave ops in web-platform-tests/wpt#36202 are being updated with new test data (float64 inputs + float32 baseline) and precision metrics.

@BruceDai
Contributor Author

BruceDai commented Nov 3, 2022

Feng discussed with @fdwr moving the test data into separate JSON files, which will make the tests easier to maintain later. PR web-platform-tests/wpt#36782 has now been submitted for review; other tests will be added soon.
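For illustration, a data file separated out this way might look roughly like the following (a hypothetical layout, not necessarily the format used in that PR):

```json
{
  "tests": [
    {
      "name": "relu float32 1D tensor",
      "inputs": {
        "x": { "shape": [4], "data": [-1.25, 0.0, 2.5, -3.75], "type": "float32" }
      },
      "expected": { "shape": [4], "data": [0.0, 0.0, 2.5, 0.0], "type": "float32" }
    }
  ]
}
```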

@anssiko
Member

anssiko commented Nov 24, 2022

@BruceDai thanks for your continued work on WebNN WPT. Can you help answer the following questions:

I'm trying to identify opportunities to broaden our WPT contributor base. I'm aware of participants who are eager to get our remaining CR tasks completed and may be able to help in various capacities.

@BruceDai
Contributor Author

The open issues cover the rest of the ops that are unimplemented in WebNN-Baseline; once they're fixed, we can leverage these pure JavaScript implementations to get baseline test data for contributing op tests to wpt.

  • Any specific open issues you'd like to bring to the next WG meeting for discussion?

Currently I have none open about tests; I'm still focusing on refining and adding first-wave operation tests to wpt.

Since I've been refining the test JSON files of the wpt WebNN test PRs according to feedback, the open WebNN-Baseline PRs are also being updated; once finished, I'll ask @huningxin and @fdwr to help review.
Experts and engineers are welcome to join for implementation and review.

I'm trying to identify opportunities to broaden our WPT contributor base. I'm aware of participants who are eager to get our remaining CR tasks completed and may be able to help in various capacities.

Thanks @anssiko. Looking forward to more contributors; I hope we complete the CR tasks ASAP :)

@BruceDai
Contributor Author

BruceDai commented Feb 3, 2023

@anssiko I updated the top comment, please take a look, thanks.

BTW, may we close this issue and track the remaining work in #338? Thanks.

@anssiko
Member

anssiko commented Feb 3, 2023

@BruceDai @fdwr, others, with your continued contributions we are able to not just meet but exceed the test coverage expectations for the Candidate Recommendation maturity level. Thanks for your contributions and congratulations on reaching this major wpt milestone! This is pioneering work for wpt due to the domain-specific requirements of this API.

I'll close this tracker now and we'll continue tracking the remaining work in #338, focusing on the two remaining ops.

@anssiko closed this as completed Feb 3, 2023