Principle: Write only one algorithm to accomplish a task. #562

Open
wants to merge 1 commit into base: main

Conversation

jyasskin
Contributor

@jyasskin jyasskin commented Mar 7, 2025

This explains why and when "polyglot" formats are a bad idea.

Fixes #239.

There's some overlap between this and the preceding section, Resolving tension between interoperability and implementability. Do y'all think it's ok, or are there bits we could refactor together?

I'd also like to give an example of parsing divergence yielding security bugs, but I didn't have any readily available. Ideas?


Preview | Diff

@jyasskin jyasskin requested review from hober and csarven March 7, 2025 00:52
@@ -3488,6 +3505,52 @@ While the best path forward may be to choose not to specify the feature,
there is the risk that some implementations
may ship the feature as a nonstandard API.

<h3 id="multiple-algorithms">Write only one algorithm to accomplish a task</h3>
Member

Maybe "goal" instead of "task"? This immediately made me think of the event loop.


When specifying how to accomplish a task, write a single algorithm to do it,
instead of letting implementers pick between multiple algorithms.
It is very difficult to ensure that
Contributor

Suggested change
It is very difficult to ensure that
It is very difficult to ensure that

two different algorithms produce the same results in all cases,
and doing so is rarely worth the cost.

Multiple algorithms seem particularly tempting when defining
Contributor

I don’t think that you need this paragraph as long as an example mentions a file format.

using either the [[HTML#the-xhtml-syntax|XHTML parsing]]
or [[HTML#syntax|HTML parsing]] algorithm.
Authors who tried to use this syntax tended to produce documents
that actually only worked with one of the two parsers.
Contributor

Suggested change
that actually only worked with one of the two parsers.
that only worked with one of the two parsers.
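To make that divergence concrete, here is a minimal sketch (browser environment assumed; the markup is a hypothetical polyglot-style document, not taken from the spec). The same bytes handed to the XML parser and to the HTML parser via DOMParser yield trees that disagree about whether the <p> element even exists:

```ts
const bytes =
  '<html xmlns="http://www.w3.org/1999/xhtml"><head>' +
  '<script src="app.js"/></head>' +
  '<body><p>hello</p></body></html>';

// XML parsing: <script/> is a genuinely empty element, so the <p> survives.
const asXml = new DOMParser().parseFromString(bytes, "application/xhtml+xml");
console.log(asXml.querySelectorAll("p").length); // expected: 1

// HTML parsing: <script> is never a self-closing element, so everything after
// it is consumed as script text (until a </script> or end of input) and the
// <p> never becomes an element at all.
const asHtml = new DOMParser().parseFromString(bytes, "text/html");
console.log(asHtml.querySelectorAll("p").length); // expected: 0
```

Content silently moving into or out of a script element like this is also the flavour of divergence the PR description asks about as a potential source of security bugs.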


Note: While [[rfc6838#section-6|structured suffixes]] define that
a document can be parsed in two different ways,
they do not violate this rule because the results have different data models.
Contributor

Isn’t the real difference here that the suffix parsing produces an intermediate result?

I suspect that this is insufficient still, because it doesn’t really get at why suffix parsers exist. That is still somewhat contested, but my view is that intermediate results can rarely be processed meaningfully, so they are limited to use in diagnostic tools and the like.

@msporny msporny left a comment

Hmm, the principle seems too blunt to be useful. Some high-level thoughts to start; I'm still trying to think about what text would be useful:

  • Polyglot, as a term, is wrong -- this isn't about a single system interpreting a data serialization using different algorithms; it's about an ecosystem interpreting the same data serialization using different algorithms (which is useful, more on that below).
  • Yes, there are cases where this resulted in bad outcomes -- XHTML/HTML is a good example.
  • The comparison between VCDM and SD-JWT-VC is totally wrong; they're two totally different data models, using two totally different serializations, using two totally different algorithms -- and there are a number of us who think that whole thing is a massive standardization failure, so using that as an example of the right way to do something is not what we want to do. The only thing they have in common is the phrase "Verifiable Credential", and even that is being objected to by some of us.
  • The multiple suffixes thing is also contested -- in the IETF MEDIAMAN WG, we couldn't find broad-scale usage of suffix-based processing; what @martinthomson is saying is important here. I'll add that suffix-based processing is also not a clear example of why this principle is good or bad.

Fundamentally, the principle seems misguided. Yes, at some level one data format and one algorithm is a good thing. However, what a traditional web crawler gets out of a web page is different from what a browser parsing that page works with, which is different again from what a frontier AI model gets out of it. The algorithms each of them uses are quite different and useful, and this principle seems to be arguing against that.

I think the only solid ground here is the XHTML/HTML example. You're going to get pushback on the other items if they continue to be mentioned in the way the current PR is written.

I'll try to think of some constructive text, but wanted to get some preliminary thoughts down in an effort to help shape the PR into something more easily defensible.

@filip26 filip26 commented Mar 7, 2025

Algorithms + Data Structures = Programs (Niklaus Wirth).

It’s rational to avoid having two algorithms performing the same function, especially when considering costs like time and space complexity, and to recommend the one that best fits the criteria. However, if this change is based on the assumption:

use either JSON or JSON-LD to parse bytes into their data models.

then there is a misunderstanding of the basics of algorithm design. JSON and JSON-LD have different data models, as noted. They bring different data structures to the equation at the top, which means different algorithms are needed to operate on them.

From this perspective, calling for one algorithm to operate on different data structures does not make sense.

My recommendation would be to use a different argument when advocating for a single algorithm - considering factors such as time complexity, space complexity, etc.
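To put the "different data models" point in code form, here is a sketch that assumes the third-party jsonld npm package (jsonld.js); the document is hypothetical. The same bytes parse into a generic JSON tree under JSON.parse, while the JSON-LD expansion algorithm maps them into a different data model:

```ts
import * as jsonld from "jsonld";

const bytes =
  '{"@context": {"name": "http://schema.org/name"}, "name": "Alice"}';

// Plain JSON parsing: a generic object whose keys are the literal strings
// "@context" and "name".
const asJson = JSON.parse(bytes);

// JSON-LD expansion: keys are resolved against the context into IRIs,
// producing (roughly) [{ "http://schema.org/name": [{ "@value": "Alice" }] }].
// (Run in an ES module, since this uses top-level await.)
const asJsonLd = await jsonld.expand(JSON.parse(bytes));
```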

@martinthomson
Contributor

I don't agree with Manu about this being misguided. The point here is that the same HTML document is not seeking to express multiple distinct sets of semantics depending on how it is processed: there is just one HTML with one interpretation, and one data model that both the producer of the content and the consumer of the content can agree on. If they disagree, that is likely due to one or the other being wrong.

This is because there is just a single specification for HTML and a single way to interpret HTML content according to that specification.

Obviously, what someone does once they have received HTML might differ, but those differences do not relate to how the HTML itself was interpreted, only to how the content at the next layer (that is, the words and images and stuff like that) is interpreted. Sure, a human and an AI model will seek to do different things with the information they are presented with, but the interpretation is singular.

Where CID struggles a little is that there are two paths to the same interpretation. It manages that by giving implementations a choice and promising that the outcome will be the same either way. It's bad, because now there is a third place where a bug can result in a different interpretation (producer, consumer, and now spec), but it's not fundamentally a polyglot in the sense that there are multiple divergent interpretations possible.

The core message is that having divergent paths is undesirable. And yes, that means saying that seeking to have a pure JSON vs a JSON-LD interpretation of the same content is a bad idea, because divergence in data models means that there is no single interpretation of the content on which all potential recipients might agree.

to assign properties to particular objects than JSON does,
these specifications had to add extra rules to both kinds of parsers
in order to ensure that each input document had exactly one possible interpretation.
[[vc-data-model-2.0 inline]] and [[draft-ietf-oauth-sd-jwt-vc inline]] fixed the problem
Contributor

Manu is right that these are completely different (and that they likely represent standardization failure, though the question of where the failure occurred might be contested). In a sense, it is OK that they are completely different (that they are in competition is potentially bad if they address the same use cases, but there is no risk that one might be mistaken for the other).

I think that it would serve this example better to focus only on the CID case.

@gkellogg

This issue really gets at the heart of a basic divide at W3C: one side that is browser-centric vs. one that is data-centric. In fact, JSON-LD does parse JSON (and YAML and CBOR) into a common INFRA-based data structure (called the Internal Representation) over which various algorithms operate to perform different transformations, including interpreting it as RDF. This is the core reason behind JSON-LD, which has become extremely widely used on the Web (in large part due to schema.org).

HTML is also often processed differently, typically by interpreting the resulting DOM. This might be done to extract Microdata/RDFa, to interpret the contents of script elements, or to perform extensive re-formatting through ReSpec or Bikeshed. Search engines interpret the DOM for their own uses, so a general principle would seem to settle on a data representation that different applications can use to suit their different use cases.

In the case of Verifiable Credentials, the basic failure would seem to be a lack of agreement on how to work with the data that is represented in the JSON. This is an area the TAG can help with for future specs, rather than getting into a reductionist view that polyglot formats are fundamentally a bad idea.
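A small sketch of that parse-once, process-many pattern (browser DOM APIs; the markup and the consumers are hypothetical): one parse of the bytes, with several applications working over the same resulting tree.

```ts
const markup =
  '<article itemscope itemtype="https://schema.org/Person">' +
  '<h1 itemprop="name">Alice</h1>' +
  '<script type="application/ld+json">{"@type": "Person", "name": "Alice"}</script>' +
  '</article>';

// One interpretation of the HTML itself...
const dom = new DOMParser().parseFromString(markup, "text/html");

// ...and several different consumers of the resulting DOM.
// A rendering-flavoured consumer: the visible heading text.
const heading = dom.querySelector("h1")?.textContent;

// A search-engine-flavoured consumer: microdata properties from the same tree.
const personName = dom.querySelector('[itemprop="name"]')?.textContent;

// A tooling-flavoured consumer: embedded JSON-LD read out of script elements.
const embedded = Array.from(
  dom.querySelectorAll('script[type="application/ld+json"]'),
  (s) => JSON.parse(s.textContent ?? "null")
);
```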

Successfully merging this pull request may close these issues.

New principle: Discourage polyglot formats