
RFC 122: Remove browser specific failures graph #122

Open · wants to merge 2 commits into master

Conversation

@jgraham (Contributor) commented Sep 13, 2022

@DanielRyanSmith (Contributor) commented

I agree that there is likely a better way to leverage these metrics, and it seems like an outright improvement to developer utility if the graph is replaced with links that display queries of all BSFs for a given browser.

@foolip (Member) commented Sep 14, 2022

I think we should remove the graph from the top of /results, but I don't think we should remove it entirely. We have triaged Chrome-only failures to keep our BSF number under 500, and I see it might be time to do that again. And based on PRs from @gsnedders to the metrics code, I assume they've looked at it too.

/insights already has "Anomalies" which allows getting to views for browser-specific failures, like this one:
https://wpt.fyi/results/?label=master&label=experimental&product=chrome&product=firefox&product=safari&view=subtest&q=%28chrome%3A%21pass%26chrome%3A%21ok%29%20%28firefox%3Apass%7Cfirefox%3Aok%29%20%28safari%3Apass%7Csafari%3Aok%29

(Although it's buggy, I filed web-platform-tests/wpt.fyi#2964.)
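
URL-decoded, the q parameter in that link is:

```
(chrome:!pass&chrome:!ok) (firefox:pass|firefox:ok) (safari:pass|safari:ok)
```

i.e. results where Chrome's status is neither PASS nor OK while Firefox's and Safari's status is PASS or OK.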

If I can make a wishlist, it would be:

  • Put the graph under /insights
  • Make it possible to click points in the graphs to get the corresponding list of tests (even if the weighting isn't the same)
  • Ensure the "Anomalies" widget gives the test list for the most recent results when comparing Chrome vs. Firefox vs. Safari

@jgraham (Contributor, Author) commented Sep 14, 2022

My view is that if people want to use the concept of browser-specific failures as an internal tool for understanding areas of interop difficulty, that's good, and I fully support that. But I don't think we have widespread agreement on its use as a public-facing metric, and the reasoning in the RFC suggests that the lack of curation makes the numbers difficult to interpret.

If specific vendors want a number to look at, I think it's reasonable to make that number an internal metric instead. That has the additional advantage of allowing some customisation, e.g. filtering the inputs to exclude tests that aren't considered a priority or problem for whatever reason, or dividing the score into team-specific metrics rather than having one top-level number. That isn't something we can do with a purely shared metric.

@gsnedders (Member) commented

While I've certainly looked at the metric, it's far from the only data derived from WPT results that I've looked at. I think I otherwise agree with @jgraham here.

@gsnedders (Member) commented

To be clear, as the RFC says, there are a variety of biases with this metric, and some of these get quite extreme:

Looking at the Safari data, /html/canvas/offscreen accounts for 32.48% of Safari's current score, and /css/css-ui/compute-kind-widget-generated for 7.56%.

I don't personally believe 40.04% of Safari's "incompatibility" or "web developer pain" (or however we want to define the goal of the BSF metric) is down to those two features.

If we look at the graph over the past year with those two directories removed, we see a very different graph:

[Graph: the browser-specific-failure graph with a slowly decreasing Safari over the first six months, stabilising afterwards]

@foolip (Member) commented Sep 16, 2022

@gsnedders thanks, that clearly demonstrates the outsized impact of test suites with lots of individual tests. For comparison/posterity, here's the current BSF graph on wpt.fyi:

[Image: the current BSF graph on wpt.fyi]

A few options for improving the metric:

  • Simply excluding some directories, like /referrer-policy and /html/canvas/offscreen
  • Giving each directory equal weight
  • Equal weight as a starting point, but with manually adjusted weights for some huge directories like /html (a rough sketch of this follows the list)
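
To make the directory-weighting options concrete, here is a rough sketch in Python (this is not the actual wpt.fyi metrics code; the per-test scores, the directory grouping, and the manual weight overrides are all assumptions for the example):

```python
from collections import defaultdict

# bsf_per_test: hypothetical mapping from a test path to its browser-specific-
# failure score (e.g. the fraction of its subtests failing only in one browser).
def directory_weighted_bsf(bsf_per_test, manual_weights=None):
    manual_weights = manual_weights or {}

    # Group per-test scores by their top-level directory.
    by_dir = defaultdict(list)
    for test_path, score in bsf_per_test.items():
        top_dir = "/" + test_path.strip("/").split("/")[0]
        by_dir[top_dir].append(score)

    # Each directory contributes its average score (equal weight per directory),
    # optionally scaled by a manually chosen weight for huge directories.
    total = 0.0
    for directory, scores in by_dir.items():
        weight = manual_weights.get(directory, 1.0)
        total += weight * (sum(scores) / len(scores))
    return total

# Example: damp the huge /html directory relative to the others.
# print(directory_weighted_bsf(scores_from_a_run, {"/html": 0.5}))
```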

I disagree with deleting the graph outright, but would be happy with both moving it to /insights and tweaking it.

@jgraham (Contributor, Author) commented Sep 16, 2022

I think a proposal for a new interop metric, even if based on BSF, would clearly be something for the Interop team to consider.

@past (Member) commented Sep 16, 2022

Improving the BSF metric seems like a worthwhile goal, either through ideas like the ones Sam and Philip propose or through a reimagined Interop metric based on BSF, as James suggests. I would encourage the Interop team to explore that path.

However, since we don't have that yet, removing the metric entirely would be a step backwards. In Chromium we do pay attention to the overall score and invest considerably in improving interoperability over time. Hiding that number in favor of team-specific metrics will regress that effort. It will reduce visibility of Chromium interoperability issues at the organizational level and will pass the burden to individual teams with different priorities.

From my perspective, removing things that are currently in use without a suitable replacement is wrong. But perhaps moving the graph to /insights as an interim step before we have an improved metric would be a reasonable compromise.

@karlcow commented Sep 20, 2022

Not a fully matured idea: if the graph is a kind of barometer of web technology support across browsers, would it make sense to include only things that are supported uniformly (per standards positions) by the 3 browsers represented in the graph?

@foolip (Member) commented Sep 20, 2022

@karlcow I've also toyed with the idea of allowing filtering by spec status or implementation status, and I think that would be valuable. I think at least the following filters would be worth trying out:

  • Filter by standards org or working group
  • Filter by spec status (standards org-specific, mostly relevant for W3C)
  • Filter by implementation status of feature in BCD or manually maintained statuses (to exclude expected failures; a rough sketch follows the list)
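
As a very rough sketch of the BCD-based filter (the directory-to-feature mapping is invented for the example, and none of this is existing wpt.fyi code; it just reads the @mdn/browser-compat-data JSON):

```python
# Hypothetical mapping from wpt directories to BCD feature paths; a real
# version would have to be curated by hand or derived from spec metadata.
WPT_DIR_TO_BCD_FEATURE = {
    "/storage-access-api": "api.Document.requestStorageAccess",
}

def feature_is_implemented(bcd, feature_path, browser):
    """Walk the BCD JSON tree and check whether the browser has ever shipped
    the feature (version_added is neither false nor null)."""
    node = bcd
    for key in feature_path.split("."):
        node = node[key]
    support = node["__compat"]["support"][browser]
    # BCD support entries can be a single object or a list of version ranges.
    if isinstance(support, list):
        support = support[0]
    return bool(support.get("version_added"))

def should_count_directory(bcd, wpt_dir, browser):
    feature = WPT_DIR_TO_BCD_FEATURE.get(wpt_dir)
    if feature is None:
        return True  # No mapping: keep the directory in the metric.
    # Exclude directories for features the browser hasn't implemented at all.
    return feature_is_implemented(bcd, feature, browser)

# Usage (assuming a local copy of the BCD data):
#   import json
#   with open("browser-compat-data.json") as f:
#       bcd = json.load(f)
#   should_count_directory(bcd, "/storage-access-api", "chrome")
```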

I would not describe the current graph as a barometer of web technology support across browsers. Rather, the idea is to surface browser-specific failures, problems that occur in just one of the 3 tested browsers, which would ideally trend towards zero. A barometer of cross-browser support should instead be growing as the size of the interoperable web platform grows. It's an old presentation by now, but I looked at that in The Interop Update, where I teamed up with @miketaylr.

If we work on filtering and weighting we'll have to see which defaults then make the most sense, but I think it's important to be able to see Chrome-only failures over time including features Chrome hasn't implemented at all, such as https://wpt.fyi/results/storage-access-api, MathML (until recently) or fastSeek().

@karlcow commented Sep 21, 2022

@past

> It will reduce visibility of Chromium interoperability issues at the organizational level and will pass the burden to individual teams with different priorities.

Who is the audience (or audiences) for the graph?

And, depending on that, what are the useful views for each specific audience?

@past (Member) commented Oct 3, 2022

The audience is senior leaders who are making sure Chromium remains interoperable and competitive with other browser engines over time. The current view of overall browser specific failures is still useful in that task.

@jgraham (Contributor, Author) commented Oct 4, 2022

Whilst I'm happy that Chrome's leadership are finding the graph useful, its usefulness as a metric is not a consensus position among browser vendors, and therefore it seems more appropriate to host it in a Chromium-specific location.

@foolip (Member) commented Oct 4, 2022

@jgraham how do you see this RFC interacting with #120? Per that RFC the interop team will take ownership of this.

And I now see that RFC should be considered passed, given approvals and several weeks passing. I'll hold off merging for a bit though.

@gsnedders (Member) commented Oct 4, 2022

And, as I think the above slightly-modified graph shows, the experience of WebKit leadership has been that understanding the graph is very difficult. There's no intuitive way to discover that those two directories account for such a disproportionate share of the metric's weight.

If you look at a view of WPT such as this, which shows Safari has fixed over 10k browser-specific failures (2102 tests, 10512 subtests) over the past year, it seems reasonable to ask "why has the score continued to creep upwards, with no notable improvement at any point?"

On the face of it, there are a number of potential explanations:

  1. The tests which we've fixed have had little to no impact on the metric,
  2. Tests which fail only in Safari have been added at a rate greater than that of our fixes,
  3. Other browsers are fixing two-browser failures, making them lone-browser failures.

Of these:

  1. is a hard hypothesis to test short of adding lots of debug info to the scripts that generate the BSF metric
  2. roughly maps to this query: 977 tests (3436 subtests)
  3. roughly maps to this query: 1927 tests (4630 subtests)

Even from all these, it's hard to understand how we end up at the graph currently on the homepage.

[Edited very slightly later to actually use properly aligned runs]

@past (Member) commented Oct 20, 2022

While still supporting improvements to the graph, I will say that adding up the numbers in your three bullets above seems to reasonably explain the lack of impact of the improvements you made.

@gsnedders (Member) commented

> While still supporting improvements to the graph, I will say that adding up the numbers in your three bullets above seems to reasonably explain the lack of impact of the improvements you made.

The subtest numbers do, yes. But that's a complete coincidence, given the "normalisation" to tests.

If you look at the actual directory-level diff, it becomes very apparent that the overwhelming majority of the change is in /html/canvas/offscreen. And if you look at that directory, you'll see there are only 1805 tests (1910 subtests), which account for almost the entire lack of overall progression.

Again, the problem is to a large degree that all tests are weighted the same.
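
To put rough numbers on that weighting effect (assuming, as I understand the metric, each test contributes the fraction of its subtests that are browser-specific failures, so a single-subtest test contributes a full point; the 20-subtest figure below is purely illustrative):

```python
# /html/canvas/offscreen: ~1805 tests covering ~1910 subtests, i.e. almost all
# single-subtest, so each failing test adds close to a full point to the score.
offscreen_points = 1805 * (1 / 1)   # ≈ 1805 points from ~1910 failing subtests

# By contrast, fixing 10512 subtests spread across tests that each have, say,
# 20 subtests removes far fewer points from the score.
fixed_points = 10512 * (1 / 20)     # ≈ 526 points removed

print(offscreen_points, fixed_points)
```

So a large number of fixed subtests can be almost invisible in the graph if they sit in multi-subtest tests, while a directory of single-subtest tests dominates it.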

@gsnedders (Member) commented

> @jgraham how do you see this RFC interacting with #120? Per that RFC the interop team will take ownership of this.
>
> And I now see that RFC should be considered passed, given approvals and several weeks passing. I'll hold off merging for a bit though.

For anyone confused, I believe we (the WPT Core Team) decided to defer this RFC until the Interop Team had time to consider it.

@foolip (Member) commented Feb 4, 2023

We never resolved (merged) #120 but indeed that seems like the best way to handle this.
