Principle: Write only one algorithm to accomplish a task. #562

Open
wants to merge 1 commit into base: main

Conversation

jyasskin
Contributor

@jyasskin jyasskin commented Mar 7, 2025

This explains why and when "polyglot" formats are a bad idea.

Fixes #239.

There's some overlap between this and the preceding section, Resolving tension between interoperability and implementability. Do y'all think it's ok, or are there bits we could refactor together?

I'd also like to give an example of parsing divergence yielding security bugs, but I didn't have any readily available. Ideas?


Preview | Diff

@jyasskin jyasskin requested review from hober and csarven March 7, 2025 00:52
@@ -3488,6 +3505,52 @@ While the best path forward may be to choose not to specify the feature,
there is the risk that some implementations
may ship the feature as a nonstandard API.

<h3 id="multiple-algorithms">Write only one algorithm to accomplish a task</h3>
Member

Maybe "goal" instead of "task"? This immediately made me think of the event loop.


When specifying how to accomplish a task, write a single algorithm to do it,
instead of letting implementers pick between multiple algorithms.
It is very difficult to ensure that
Contributor

Suggested change
It is very difficult to ensure that
It is very difficult to ensure that

two different algorithms produce the same results in all cases,
and doing so is rarely worth the cost.

Multiple algorithms seem particularly tempting when defining
Contributor

I don’t think that you need this paragraph as long as an example mentions a file format.

using either the [[HTML#the-xhtml-syntax|XHTML parsing]]
or [[HTML#syntax|HTML parsing]] algorithm.
Authors who tried to use this syntax tended to produce documents
that actually only worked with one of the two parsers.
Contributor

Suggested change
that actually only worked with one of the two parsers.
that only worked with one of the two parsers.
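To make that divergence concrete, here is a minimal sketch (browser environment assumed; the markup is a hypothetical polyglot-style document, not taken from the spec). The same bytes handed to the XML parser and to the HTML parser via DOMParser yield trees that disagree about whether the <p> element even exists:

```ts
const bytes =
  '<html xmlns="http://www.w3.org/1999/xhtml"><head>' +
  '<script src="app.js"/></head>' +
  '<body><p>hello</p></body></html>';

// XML parsing: <script/> is a genuinely empty element, so the <p> survives.
const asXml = new DOMParser().parseFromString(bytes, "application/xhtml+xml");
console.log(asXml.querySelectorAll("p").length); // expected: 1

// HTML parsing: <script> is never a self-closing element, so everything after
// it is consumed as script text (until a </script> or end of input) and the
// <p> never becomes an element at all.
const asHtml = new DOMParser().parseFromString(bytes, "text/html");
console.log(asHtml.querySelectorAll("p").length); // expected: 0
```

Content silently moving into or out of a script element like this is also the flavour of divergence the PR description asks about as a potential source of security bugs.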


Note: While [[rfc6838#section-6|structured suffixes]] define that
a document can be parsed in two different ways,
they do not violate this rule because the results have different data models.
Contributor

Isn’t the real difference here that the suffix parsing produces an intermediate result?

I suspect that this is insufficient still, because it doesn’t really get at why suffix parsers exist. That is still somewhat contested, but my view is that intermediate results can rarely be processed meaningfully, so they are limited to use in diagnostic tools and the like.

@msporny msporny left a comment

Hmm, the principle seems too blunt to be useful. Some high-level thoughts to start; I'm still trying to think about what text would be useful:

  • Polyglot, as a term, is wrong -- this isn't about a single system interpreting a data serialization using different algorithms; it's about an ecosystem interpreting the same data serialization using different algorithms (which is useful, more on that below).
  • Yes, there are cases where this resulted in bad outcomes -- XHTML/HTML is a good example.
  • The comparison between VCDM and SD-JWT-VC is totally wrong; they're two totally different data models, using two totally different serializations, using two totally different algorithms -- and there are a number of us who think that whole thing is a massive standardization failure, so using that as an example of the right way to do something is not what we want to do. The only thing they have in common is the phrase "Verifiable Credential", and even that is being objected to by some of us.
  • The multiple suffixes thing is also contested -- in the IETF MEDIAMAN WG, we couldn't find broad-scale usage of suffix-based processing; what @martinthomson is saying is important here. I'll add that suffix-based processing is also not a clear example of why this principle is good or bad.

Fundamentally, the principle seems misguided. Yes, at some level one data format and one algorithm is a good thing. However, what a traditional web crawler gets out of a web page is different from what a browser parsing that page works with, which is different again from what a frontier AI model gets out of it. The algorithms each of them uses are quite different and useful, and this principle seems to be arguing against that.

I think the only solid ground here is the XHTML/HTML example. You're going to get pushback on the other items if they continue to be mentioned in the way the current PR is written.

I'll try to think of some constructive text, but wanted to get some preliminary thoughts down in an effort to help shape the PR into something more easily defensible.

@filip26 filip26 commented Mar 7, 2025

Algorithms + Data Structures = Programs (Niklaus Wirth).

It’s rational to avoid having two algorithms performing the same function, especially when considering costs like time and space complexity, and to recommend the one that best fits the criteria. However, if this change is based on the assumption:

use either JSON or JSON-LD to parse bytes into their data models.

then there is a misunderstanding of the basics of algorithm design. JSON and JSON-LD have different data models, as noted. They bring different data structures to the equation at the top, which means different algorithms are needed to operate on them.

From this perspective, calling for one algorithm to operate on different data structures does not make sense.

My recommendation would be to use a different argument when advocating for a single algorithm - considering factors such as time complexity, space complexity, etc.
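To put the "different data models" point in code form, here is a sketch that assumes the third-party jsonld npm package (jsonld.js); the document is hypothetical. The same bytes parse into a generic JSON tree under JSON.parse, while the JSON-LD expansion algorithm maps them into a different data model:

```ts
import * as jsonld from "jsonld";

const bytes =
  '{"@context": {"name": "http://schema.org/name"}, "name": "Alice"}';

// Plain JSON parsing: a generic object whose keys are the literal strings
// "@context" and "name".
const asJson = JSON.parse(bytes);

// JSON-LD expansion: keys are resolved against the context into IRIs,
// producing (roughly) [{ "http://schema.org/name": [{ "@value": "Alice" }] }].
// (Run in an ES module, since this uses top-level await.)
const asJsonLd = await jsonld.expand(JSON.parse(bytes));
```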

@martinthomson
Contributor

I don't agree with Manu about this being misguided. The point here is that the same HTML document is not seeking to express multiple distinct sets of semantics depending on how it is processed: there is just one HTML with one interpretation, and one data model that both the producer of the content and the consumer of the content can agree on. If they disagree, that is likely due to one or the other being wrong.

This is because there is just a single specification for HTML and a single way to interpret HTML content according to that specification.

Obviously, what someone does once they have received HTML might differ, but those differences do not relate to how the HTML itself was interpreted, only to how the content at the next layer (that is, the words and images and stuff like that) is interpreted. Sure, a human and an AI model will seek to do different things with the information they are presented with, but the interpretation is singular.

Where CID struggles a little is that there are two paths to the same interpretation. It manages that by giving implementations a choice and promising that the outcome will be the same either way. It's bad, because now there is a third place where a bug can result in a different interpretation (producer, consumer, and now spec), but it's not fundamentally a polyglot in the sense that there are multiple divergent interpretations possible.

The core message is that having divergent paths is undesirable. And yes, that means saying that seeking to have a pure JSON vs a JSON-LD interpretation of the same content is a bad idea, because divergence in data models means that there is no single interpretation of the content on which all potential recipients might agree.

to assign properties to particular objects than JSON does,
these specifications had to add extra rules to both kinds of parsers
in order to ensure that each input document had exactly one possible interpretation.
[[vc-data-model-2.0 inline]] and [[draft-ietf-oauth-sd-jwt-vc inline]] fixed the problem
Contributor

Manu is right that these are completely different (and that they likely represent standardization failure, though the question of where the failure occurred might be contested). In a sense, it is OK that they are completely different (that they are in competition is potentially bad if they address the same use cases, but there is no risk that one might be mistaken for the other).

I think that it would serve this example better to focus only on the CID case.

@gkellogg

This issue really gets at the heart of a basic divide at W3C: one side that is browser-centric vs. one that is data-centric. In fact, JSON-LD does parse JSON (and YAML and CBOR) into a common INFRA-based data structure (called the Internal Representation) over which various algorithms operate to perform different transformations, including interpreting it as RDF. This is the core reason behind JSON-LD, which has become extremely widely used on the Web (in large part due to schema.org).

HTML is also often processed differently, typically by interpreting the resulting DOM. This might be done to extract Microdata/RDFa, to interpret the contents of script elements, or to perform extensive re-formatting through ReSpec or Bikeshed. Search engines interpret the DOM for their own uses, so a general principle would seem to settle on a data representation that different applications can use to suit their different use cases.

In the case of Verifiable Credentials, the basic failure would seem to be a lack of agreement on how to work with the data that is represented in the JSON. This is an area the TAG can help with for future specs, rather than getting into a reductionist view that polyglot formats are fundamentally a bad idea.
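A small sketch of that parse-once, process-many pattern (browser DOM APIs; the markup and the consumers are hypothetical): one parse of the bytes, with several applications working over the same resulting tree.

```ts
const markup =
  '<article itemscope itemtype="https://schema.org/Person">' +
  '<h1 itemprop="name">Alice</h1>' +
  '<script type="application/ld+json">{"@type": "Person", "name": "Alice"}</script>' +
  '</article>';

// One interpretation of the HTML itself...
const dom = new DOMParser().parseFromString(markup, "text/html");

// ...and several different consumers of the resulting DOM.
// A rendering-flavoured consumer: the visible heading text.
const heading = dom.querySelector("h1")?.textContent;

// A search-engine-flavoured consumer: microdata properties from the same tree.
const personName = dom.querySelector('[itemprop="name"]')?.textContent;

// A tooling-flavoured consumer: embedded JSON-LD read out of script elements.
const embedded = Array.from(
  dom.querySelectorAll('script[type="application/ld+json"]'),
  (s) => JSON.parse(s.textContent ?? "null")
);
```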

Successfully merging this pull request may close these issues.

New principle: Discourage polyglot formats