Dealing with non-standard slots #328

Closed
gouttegd opened this issue Nov 5, 2023 · 41 comments · Fixed by #375

@gouttegd
Contributor

gouttegd commented Nov 5, 2023

Among the issues that I believe should be decided along the road to 1.0 is: How should implementations deal with the possible presence of non-standard slots (slots not described anywhere in the specification and/or the schema) in a mapping set?

That is, considering this mapping set (note the presence of a non-standard foo slot in the set metadata and of non-standard bar and baz slots in the mapping metadata):

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#foo: "The yellow fox jumped over the lazy dog"
subject_id      predicate_id    object_id       mapping_justification   bar     baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    A       D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    B       E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    C       F

what should implementations do?

The current behaviour of both sssom-py and sssom-java is to ignore the non-standard slots (silently, in the case of sssom-java; sssom-py emits a warning) and to accept the rest of the set. That is, running this set through (for example)

sssom parse -I tsv sample.sssom.tsv -o output.sssom.tsv

will produce

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
subject_id      predicate_id    object_id       mapping_justification
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration

I think this behaviour is sane and should be explicitly prescribed in the specification (“Implementations SHOULD ignore any non-standard slot not described in this specification; they MAY print a warning upon encountering any such slot.”).
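For illustration, the "ignore, and optionally warn" behaviour could be sketched in Python as follows. This is only a sketch under stated assumptions, not how sssom-py or sssom-java are actually implemented: the STANDARD_SLOTS set is a deliberately tiny, hypothetical subset of the real schema, and drop_non_standard_columns is a made-up helper name.

```python
import csv
import io
import warnings

# Hypothetical, deliberately tiny subset of the standard mapping slots;
# a real implementation would take the full list from the SSSOM schema.
STANDARD_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}


def drop_non_standard_columns(tsv_text):
    """Return the TSV text with every non-standard column removed."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header = rows[0]
    keep = [i for i, name in enumerate(header) if name in STANDARD_SLOTS]
    for name in header:
        if name not in STANDARD_SLOTS:
            # Under the proposed wording, this warning is optional (MAY).
            warnings.warn(f"ignoring non-standard slot: {name}")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row[i] for i in keep])
    return out.getvalue()


sample = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\tbar\tbaz\n"
    "FBbt:00000001\tsemapv:crossSpeciesExactMatch\tUBERON:0000468\t"
    "semapv:ManualMappingCuration\tA\tD\n"
)

# Record the warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = drop_non_standard_columns(sample)
```

Here cleaned contains only the four standard columns, and caught holds one warning per discarded column.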

A related issue is: If a SSSOM producer does want to include additional metadata for which there are no suitable standard slots, and wants that additional metadata to be preserved by the existing tooling (instead of being ignored and discarded as above), what should be the proper way to do that?

Currently, the official way is through the (poorly specified) other slot, which accepts a “list of key value pairs for properties not part of the SSSOM spec” and “can be used to encode additional provenance data”. So, according to the current spec, the example above should have been:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#other:
#  - "foo=The yellow fox jumped over the lazy dog"
subject_id      predicate_id    object_id       mapping_justification   other
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    bar=A|baz=D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    bar=B|baz=E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    bar=C|baz=F
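With this encoding, any consumer has to split the pairs again after reading the TSV. A minimal Python sketch of that extra step, assuming `|` separates the pairs and `=` separates key from value exactly as in the example above (the spec does not describe escaping, so none is handled here, and parse_other is a hypothetical helper name):

```python
def parse_other(cell):
    """Split a mapping-level `other` cell such as "bar=A|baz=D" into a dict."""
    pairs = {}
    for item in cell.split("|"):
        if not item:
            # Skip empty fragments, e.g. when the cell itself is empty.
            continue
        # partition() splits on the first "=" only, so values may contain "=".
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


extras = parse_other("bar=A|baz=D")
```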

This method of encoding additional metadata doesn’t look great, which is probably why there is a proposal to replace it, in which additional slots would be explicitly defined in an extension_definition slot:

#extension_definition:
#  - key: bar
#    value: <type of value expected in bar>
#  - key: baz
#    value: <type of value expected in baz>

I think that whatever method we think is best to encode additional metadata, it should be decided in time for 1.0.

I’d like to propose a simpler method than the one drafted in #263. I like the idea of explicitly declaring the additional metadata but I don’t see the point of specifying the type of value. Non-standard metadata would be tied to a particular use-case and to particular users, so it should be up to them to know what to expect in any non-standard field that they need.

Instead, I propose that non-standard slots should simply be declared as a list of slot names:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#extra_set_slots:
#  - foo
#extra_slots:
#  - bar
#  - baz
#foo: "The yellow fox jumped over the lazy dog"
subject_id      predicate_id    object_id       mapping_justification   bar     baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    A       D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    B       E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    C       F

The effect of such a declaration would be that the declared slots should be preserved exactly as they are, instead of being ignored. That is, the mapping set above should be “round-trippable” through sssom-py parse, because the foo, bar, and baz slots, being declared, should not be discarded.
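To make the intended effect concrete, here is a small Python sketch of the column selection a parser would perform under this proposal. STANDARD_SLOTS is a hypothetical, truncated stand-in for the real schema, and columns_to_keep is an illustrative name, not an existing sssom-py function.

```python
# Hypothetical, truncated list of standard mapping slots.
STANDARD_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}


def columns_to_keep(header, declared_extra_slots):
    """Return the column names to preserve, in their original order."""
    declared = set(declared_extra_slots)
    return [name for name in header if name in STANDARD_SLOTS or name in declared]


header = ["subject_id", "predicate_id", "object_id", "mapping_justification", "bar", "baz"]

# Both extra columns declared: the set round-trips unchanged.
kept_all = columns_to_keep(header, ["bar", "baz"])
# Only bar declared: baz is discarded as before.
kept_one = columns_to_keep(header, ["bar"])
# Nothing declared: same result as the current behaviour.
kept_none = columns_to_keep(header, [])
```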

Thoughts?

@gouttegd
Contributor Author

gouttegd commented Nov 5, 2023

Of note: if we decide to keep the current way of encoding additional metadata in the other field, its formal definition in the schema should be amended to make it multi-valued. Currently it is not, which conflicts with the description that makes it a list. (That’s one reason why I described this slot as “poorly specified”.)

@matentzn
Collaborator

matentzn commented Nov 6, 2023

Important discussion thanks for rekindling. I like your solution!

What about (we discussed this before):

  • bar ends up being adopted by sssom (today the column does not get validated, tomorrow it does)
  • bar was already a built-in entity (validation failure OR trying to validate?)
  • should bar be taken into account by version conversion?
  • related to that: wouldn't this be better: custom:bar instead of bar? This avoids all three above.

@gouttegd
Contributor Author

gouttegd commented Nov 6, 2023

Yes, we’ve discussed it in #263. And so you know what I propose to avoid those issues: requesting that non-standard fields, when someone needs to use them, must be prefixed by ext_ or similar.

That’s the same idea as custom:bar, except that it avoids making the name look like a CURIE. Since it is not a CURIE, I would find that needlessly confusing.

So the entire idea would be:

  • Non-standard slots that are not declared SHOULD be ignored and discarded.
  • Non-standard slots that are declared SHOULD be preserved as they are.
  • Declared or not, non-standard slots SHOULD be prefixed with ext_.

If someone has used a non-prefixed name (e.g. bar) and that name later becomes the name of a new standard slot:

  • if bar has not been declared as an extension slot: since bar is now a standard slot, the parser should treat it as now indicated by the specification (the parser has no way to know that this was a non-standard slot);
  • if bar has been declared as an extension slot: several options:
    • reject the set outright;
    • treat bar as the standard slot and ignore the extension declaration;
    • treat bar as the non-standard slot, overriding the specification.

I would personally favour rejecting the set outright, since there is a clear conflict between the spec and the extension declaration.
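As a sketch only, the "reject the set outright" option could look like this in Python. The STANDARD_SLOTS set is again a hypothetical subset of the schema, and ExtensionSlotClash and check_declared_slots are names invented for this illustration.

```python
# Hypothetical subset of the standard slots, including the future mapping_id
# used as an example of a name that later becomes standard.
STANDARD_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification", "mapping_id"}
RESERVED_PREFIX = "ext_"


class ExtensionSlotClash(ValueError):
    """Raised when a declared extension slot has the name of a standard slot."""


def check_declared_slots(declared):
    """Reject clashing declarations; return the names lacking the ext_ prefix."""
    clashes = sorted(set(declared) & STANDARD_SLOTS)
    if clashes:
        raise ExtensionSlotClash(
            "declared extension slots clash with standard slots: " + ", ".join(clashes)
        )
    # These names are accepted, but they are not protected against clashes
    # with future versions of the specification.
    return [name for name in declared if not name.startswith(RESERVED_PREFIX)]


unprotected = check_declared_slots(["ext_bar", "biomappings_mapping_id"])

try:
    check_declared_slots(["mapping_id"])
    rejected = False
except ExtensionSlotClash:
    rejected = True
```

A caller could turn the returned list into a warning, since using the prefix is only a recommendation (SHOULD).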

@matentzn
Collaborator

matentzn commented Nov 6, 2023

Declared or not, non-standard slots SHOULD be prefixed with ext_.

I think making this "ext" thing a requirement will make your solution less valuable for some people. We could say: if it is declared as an extension slot it is, by definition, not a standard slot. Even if it sounds the same. It still irks me that this precludes any way to serialise it neatly into RDF. That was nicer in the more complex solution.

treat bar as the standard slot and ignore the extension declaration;

I don't think this should happen. I prefer the other two options. But:

treat bar as the non-standard slot, overriding the specification.

Why do you say "overwrite"? If bar is an optional slot, can't a declared extension slot bar be simply an entirely different thing, as if it was declared in a different namespace?

@gouttegd
Contributor Author

gouttegd commented Nov 6, 2023

I think making this "ext" thing a requirement

To clarify, it’s not a “requirement” – as the spec’s authors, we have no way to require producers of SSSOM to do things one way or another. If people want to use non-standard slots without prefixing with ext_, they will do it whether we like it or not.

The idea is that if they do not use the ext_ prefix, then they have no guarantee that whatever names they are using will not clash with possible future versions of the specification. The only special thing with the ext_ prefix, really, is that we (the authors of the SSSOM spec) pinky-promise that we will never, in any future version, create a standard slot with a name starting with ext_, so people can use such names in full confidence that there will never be a clash with not-yet-existing official slots.

So people who don’t like the idea of prefixing their custom slots with ext_ can very well not do it. It’s just that the onus is then on them to be careful about which names they are choosing. If they choose names specific enough, even without an ext_ prefix the likelihood of a clash with a future standard slot is low.

For example, if biomappings published mappings with a hypothetical biomappings_mapping_id extra slot, that would probably be safe enough, because I can see no reason why any future version of the standard would ever define a slot with such a name. An extra slot with the name mapping_id, on the other hand, would be very risky, because it’s a very “generic” name and I can very well imagine that we could add a slot with that name in the future.

We could say: if it is declared as an extension slot it is, by definition, not a standard slot.

I think this would introduce needless confusion. That would mean that nobody could be sure of what a slot means, even if it is described in the standard, without first checking the list of declared extension slots.

If bar is an optional slot, can't a declared extension slot bar be simply an entirely different thing

As I said: needless confusion. We have a standard that assigns an agreed-upon meaning to a handful of slot names. Allowing people to override that and to say, “in my set, the slot foo means whatever I intend it to mean, regardless of what the specification says”, negates the very principle of a standard exchange format.

Again: under my proposal, people would be free to use whatever slot names they want. As long as the name (1) is declared and (2) does not clash with a standard name, implementations will be required to preserve such extra slots. The recommendation to prefix the non-standard name with ext_ is just advice, and is the only officially guaranteed way of ensuring the name will not clash with future versions of the specification. Because if there is a clash, the standard-specified meaning should always prevail.

@gouttegd
Contributor Author

gouttegd commented Nov 7, 2023

It’s just that the onus is then on them to be careful about which names they are choosing.

Another clarification: The above does not mean that we (again, the authors of the spec) have to be unhelpful with people who would end up in the situation of having one of their non-standard slot names on the verge of becoming a standard one, if we know about it.

For example, let’s say that at some point (after SSSOM 1.0 is out), we are considering adding a unique identifier to each mapping (I know that idea has already been raised). We create a ticket to discuss the idea, and after some discussion we decide that indeed it would be a good thing, and we propose mapping_id as a new slot to hold such an identifier.

Then someone intervenes in the discussion to point out that they have already been using mapping_id as a non-standard slot name in their own mapping sets for quite a while. Unfortunately, the way they have been using their own mapping_id slot does not quite match the way we were thinking the standard version should be used, for whatever reason. For example, they are using randomly generated identifiers, whereas we think that mapping_id should contain serially generated IDs, or whatever. The point is, they ask us if we could consider using another name for the proposed future standard slot, in order to avoid creating a clash with what they have been doing.

And in that situation, I would be completely open to the idea of renaming our future slot (which would not exist yet, so it wouldn’t be a big deal). I would certainly mention to those folks that it would have been better if they had followed our advice and named their non-standard slot ext_mapping_id, but there would be no need to be a dickhead about it.

@matentzn
Collaborator

matentzn commented Nov 7, 2023

Ok, sold on your treatment of ext_.

I think this would introduce needless confusion. That would mean that nobody could be sure of what a slot means, even if it is described in the standard, without first checking the list of declared extension slots.

And

We have a standard that assigns an agreed-upon meaning to a handful of slot names. Allowing people to override that and to say, “in my set, the slot foo means whatever I intend it to mean, regardless of what the specification says”, negates the very principle of a standard exchange format.

While I can see your point clearly here, I am not sure it is as "clear cut" as that.

:contributor and :contributor mean something entirely different depending on which base-namespace is defined. It's a normal mechanism in xml and various RDF serialisations. I totally get you want to protect users from such base namespace declarations, and probably you are right, just not as doubtlessly. We could even inject ext_ in the serialisation if a clash occurs in a future version 😂 and would be killed.

Ok, let's go for now with your proposal of dealing with slot clashes a bit less "confusingly".

@matentzn
Collaborator

matentzn commented Nov 7, 2023

With a similar logic you could simply dispense with declarations altogether and say: if a slot is not standard (not in the spec), it is a non-standard extension. I don't think we should do that; but the argument could come up.

@gouttegd
Contributor Author

gouttegd commented Nov 7, 2023

:contributor and :contributor mean something entirely different depending on which base-namespace is defined.

Yeah, except that the TSV format has no notion of namespace. Column names are in a flat namespace.

I see your point and I agree that having “namespaced” column names, similar to XML names, would be nice. But nobody would expect that from a TSV file, and none of the standard tools available to manipulate TSV files (such as Pandas) has support for that. I don’t think it is worth the trouble. Let’s keep the Simple Standard for Sharing Ontological Mappings, well… simple.

With a similar logic you could simply dispense with declarations altogether and say: if a slot is not standard (not in the spec), it is a non-standard extension.

Yes. But then it’s all or nothing: either we require that implementations should preserve all non-standard slots, or we don’t require that and implementations are free to discard all non-standard slots.

Right now, the behaviour of both sssom-py and sssom-java is to discard non-standard slots. The specification should reflect that.

Having declared non-standard slots allows existing implementations to keep their default behaviour (of discarding undeclared non-standard slots) while also making it possible to require that declared non-standard slots, at least, be preserved.

@matentzn
Collaborator

matentzn commented Nov 9, 2023

I see your point and I agree that having “namespaced” column names, similar to XML names, would be nice. But nobody would expect that from a TSV file,

Just to be clear: what I am saying is that the global "extra_slots" establishes a different namespace. I do understand that you find the risk of "confusion" too high as evidenced by the fact that I could not even get that point across.. 😂

Alright, let's go ahead to a concrete proposal then, I think we are on the same page.

@gouttegd
Contributor Author

Having given this a bit more thought, I’d like to amend my proposal with the following:

Regarding non-standard slots at the level of the set

I don’t think there is actually a need to do this (as proposed initially):

#extra_set_slots:
#  - foo
#foo: "The yellow fox jumped over the lazy dog"

That is, first declaring foo as a non-standard slot and then using foo at the top level, as any other slot.

Instead, we could simply change the definition of the other slot. Currently, as defined in the spec (and as mentioned in the first message), it is supposed to be used like this:

#other:
#  - "foo=The yellow fox jumped over the lazy dog"

i.e. a list of key/value pairs. We could make it into a proper dictionary instead, and use it like this:

#other:
#  foo: "The yellow fox jumped over the lazy dog"

This would both make parsing and using those extra metadata easier (no more need to “manually” extract the key and the value downstream of the YAML parser), and remove the need for declaring non-standard slots (by definition, any “sub-slot” under the other slot is a non-standard slot).

This is a breaking change compared to the current definition of the other slot, but one that, I believe, is worth it – and one whose impact can be lessened if we require that the implementations should still support the former syntax.
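Supporting both syntaxes on the reading side could be fairly cheap. A minimal Python sketch, assuming the YAML block has already been parsed into plain Python objects (a list of "key=value" strings for the former syntax, a dict for the proposed one); normalise_set_level_other is a hypothetical helper name:

```python
def normalise_set_level_other(value):
    """Accept either syntax of the set-level `other` slot and return a dict."""
    if value is None:
        return {}
    if isinstance(value, dict):
        # Proposed syntax: already a dictionary, just copy it.
        return dict(value)
    result = {}
    for item in value:
        # Former syntax: a list of "key=value" strings.
        key, sep, val = item.partition("=")
        if not sep:
            raise ValueError(f"not a key=value pair: {item!r}")
        result[key] = val
    return result


legacy = normalise_set_level_other(["foo=The yellow fox jumped over the lazy dog"])
proposed = normalise_set_level_other({"foo": "The yellow fox jumped over the lazy dog"})
```

Both calls yield the same dictionary, so downstream code would only ever see the new form.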

Regarding non-standard slots at the level of individual mappings

No change to my initial proposal regarding the format: We would still require that non-standard slots be declared (in order for them to be preserved; it would always be OK not to declare non-standard slots, but then implementations would discard them) in the top-level metadata, and then they can appear as normal columns in the TSV section.

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#extra_slots:
#  - bar
#  - baz
subject_id      predicate_id    object_id       mapping_justification   bar     baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    A       D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    B       E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    C       F

But we should then clarify what to do with those bar and baz columns and their values. Here I propose that we (again) change the type of the mapping-level other slot to make it into a dictionary (instead of a list of strings), and that we use that dictionary to store all the values from the (declared) non-standard columns.

This would imply that under this proposal, the other slot should itself never appear as a column in the TSV section. Either the mapping set has no non-standard mapping metadata slots, in which case the other dictionary would always be empty for each mapping, or the mapping set does have non-standard mapping metadata slots (as bar and baz in the example above), in which case it is those slots that should appear as extra columns.

@matentzn
Collaborator

Here I propose that we (again) change the type of the mapping-level other slot to make it into a dictionary (instead of a list of strings), and that we use that dictionary to store all the values from the (declared) non-standard columns.

Just to clarify: you are saying that:

  1. other is typed as a dictionary
  2. other cannot be used at mapping level
  3. The key-value pairs in other are at mapping set level, not mapping level (i.e. a hypothetical method that would "denormalise" a mapping set would not migrate foo to become a column in the data frame) in your example:
#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#other:
#  foo: "The yellow fox jumped over the lazy dog"
subject_id      predicate_id    object_id       mapping_justification
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration

would NOT become after denormalisation:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#extra_slots:
#  - foo
subject_id      predicate_id    object_id       mapping_justification   foo
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    "The yellow fox jumped over the lazy dog"

But instead remain the same as it was (unchanged).

@gouttegd
Contributor Author

gouttegd commented Dec 27, 2023

other is typed as a dictionary

Yes. Assuming it is possible to have dictionary-typed slots in LinkML, but after looking at the documentation (after writing the previous comment unfortunately), I have seen nothing to suggest that it is the case – LinkML has support for lists (through the multivalued option), but it does not seem to have support for dictionaries (which may be why the proposal in #263 had to define its own key value pair range to emulate dictionary entries, and also why the current definition of the other slot is a string intended to contain entries of the form key=value, again emulating a dictionary).

If LinkML really does not have support for dictionary-typed values, it can only be by design (I cannot imagine that dictionaries would simply have been overlooked), which means I am not going to request that such a support be added as a new feature. Which in turn means that this entire idea can be scrapped.

other cannot be used at mapping level

Not in the TSV representation. other would only be present in the model, where it would be used to store the values from the non-standard columns.

The key-value pairs in other are at mapping set level, not mapping level.

Yes.

To try to give a complete example in one go, the following TSV+YAML file:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#other:
#  foo: "The yellow fox jumped over the lazy dog"
#extra_slots:
#  - bar
subject_id      predicate_id    object_id       mapping_justification        bar       baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration        A       B

would lead to an in-memory representation as follows:

[... other set-level slots ...]
other:
  foo: "The yellow fox jumped over the lazy dog"
mappings:
  - subject_id: http://purl.obolibrary.org/obo/FBbt_00000001
    predicate_id: https://w3id.org/semapv/vocab/crossSpeciesExactMatch
    object_id: http://purl.obolibrary.org/obo/UBERON_0000468
    mapping_justification: https://w3id.org/semapv/vocab/ManualMappingCuration
    other:
      bar: A

That is, the set-level other slot is merely passed as it is, whereas the mapping-level other slot is constructed from the declared non-standard columns (note that the baz column is discarded, since it is not declared as an extra slot).
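The same construction, as a rough Python sketch. It assumes the YAML metadata block has already been parsed into a dict, uses a hypothetical truncated STANDARD_SLOTS set, and leaves CURIEs unexpanded for brevity (so the values stay in their short form, unlike in the representation above). build_mapping_set is an illustrative name only.

```python
import csv
import io

# Hypothetical, truncated list of standard mapping slots.
STANDARD_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}


def build_mapping_set(metadata, tsv_text):
    """Build an in-memory set from parsed metadata and the TSV section."""
    # A declared name that is also a standard slot is never treated as extra.
    declared = set(metadata.get("extra_slots", [])) - STANDARD_SLOTS
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mappings = []
    for row in reader:
        mapping = {k: v for k, v in row.items() if k in STANDARD_SLOTS}
        # Mapping-level `other` is built from the declared extra columns only;
        # undeclared columns (baz below) are silently dropped.
        mapping["other"] = {k: v for k, v in row.items() if k in declared}
        mappings.append(mapping)
    # Set-level `other` is passed through as it is.
    return {"other": dict(metadata.get("other", {})), "mappings": mappings}


metadata = {
    "other": {"foo": "The yellow fox jumped over the lazy dog"},
    "extra_slots": ["bar"],
}
tsv_text = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\tbar\tbaz\n"
    "FBbt:00000001\tsemapv:crossSpeciesExactMatch\tUBERON:0000468\t"
    "semapv:ManualMappingCuration\tA\tB\n"
)
mapping_set = build_mapping_set(metadata, tsv_text)
```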

gouttegd added a commit to gouttegd/sssom-java that referenced this issue Dec 27, 2023
This commit implements the reading part of the behaviour proposed in
mapping-commons/sssom#328 to deal with non-standard slots.

It supposes that the SSSOM data model has been extended to add two new
mapping-level metadata slots:

* extra_set_slots, to declare set-level non-standard slots;
* extra_slots, to declare mapping-level non-standard slots.

If the extra_set_slots slot is present, then when a non-standard slot is
encountered at the set level, it is:

* discarded (as before) if the non-standard slot is not declared in the
  extra_set_slots list;
* added as a new key/value pair to the special __extra dictionary
  (specifically intended to hold non-standard slots) if it is declared
  in the extra_set_slots list.

Likewise, if the extra_slots slot is present, then if the TSV section
contains a non-standard column, the values from that column are:

* discarded (as before) if the name of the column is not declared in the
  extra_slots list;
* added as a new key/value pair to the special __extra dictionary
  associated with each mapping, if the name of the column is declared in
  the extra_slots list.
gouttegd added a commit to gouttegd/sssom-java that referenced this issue Dec 27, 2023
This commit implements the writing part of the proposed behaviour in
mapping-commons/sssom#328 for dealing with non-standard slots.

If the __extra special dictionary at the mapping set level contains
entries whose keys are declared in extra_set_slots, those entries are
written with the set metadata as if they were normal slot. They are
written in key alphabetical order, after all the other slots.

If the __extra special dictionary at the mapping level contains entries
whose keys are declared in the extra_slots, those entries are written
with the mappings metadata in the TSV section as supplementary columns.
The columns are written in key alphabetical order, after all the other
columns.
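
The column ordering described in that commit message can be illustrated with a short Python sketch. This is not the SSSOM-Java code (which is in Java), only an approximation of the described behaviour; the `__extra` key mirrors the name used in the commit message, and write_mappings is a hypothetical function.

```python
def write_mappings(mappings, standard_columns, declared_extra_slots):
    """Serialise mappings to TSV, with declared extra columns written last."""
    declared = set(declared_extra_slots)
    # Collect declared extra keys across all mappings, in alphabetical order.
    extra_columns = sorted(
        {key for m in mappings for key in m.get("__extra", {}) if key in declared}
    )
    lines = ["\t".join(list(standard_columns) + extra_columns)]
    for m in mappings:
        cells = [m.get(column, "") for column in standard_columns]
        cells += [m.get("__extra", {}).get(column, "") for column in extra_columns]
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


mappings = [
    {
        "subject_id": "FBbt:00000001",
        "object_id": "UBERON:0000468",
        # qux is present in __extra but not declared, so it is not written.
        "__extra": {"baz": "D", "bar": "A", "qux": "Z"},
    }
]
text = write_mappings(mappings, ["subject_id", "object_id"], ["baz", "bar"])
```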
@gouttegd
Contributor Author

OK, I’ve played a bit with various ideas and in particular I have fully implemented my initial proposal in SSSOM-Java (in a test branch – nothing definitive now, this is just for testing and exploring)… and I would like to discuss a bit more before we commit to anything.

In particular, in my initial proposal, the idea was to do as follows for non-standard slots at the set level:

#extra_set_slots:
#  - foo
#foo: "This non-standard slot has been declared and will be PRESERVED."
#notfoo: "This non-standard slot has not been declared and will be DISCARDED."

That is, we distinguish non-standard slots that the user wants to be preserved by requiring that such slots are declared as “slots-to-be-preserved”. Upon encountering a declared non-standard slot, the parser MUST preserve it, while non-declared non-standard slots are automatically discarded.

But now that I have implemented this behaviour, the more I look at it the more ridiculous it seems to me: This two-step process (first declaration, then use) is cumbersome, why couldn’t we simply do something like:

#extra_metadata:
#  foo: "This non-standard slot is in the extra_metadata section and will be PRESERVED."
#notfoo: "This non-standard slot is outside of the extra_metadata section and will be DISCARDED."

That is, instead of requiring that non-standard slots be declared before they are used, we require that they are confined to a dedicated part of the metadata block (tentatively named extra_metadata for now). This seems much more convenient to me (I have also implemented that behaviour in another branch).

(This would not apply to non-standard slots at the mapping level. We would still require such slots to be declared in order for them to be preserved rather than discarded. We can’t really use the extra_metadata trick for TSV columns, unfortunately.)
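On the reading side, the extra_metadata variant needs no declaration lookup at all for set-level slots. A minimal Python sketch, assuming the YAML block is already parsed into a dict; STANDARD_SET_SLOTS is a small illustrative subset, extra_metadata is the tentative name used above, and read_set_metadata is a hypothetical helper:

```python
# Small illustrative subset of set-level slots, plus the tentative
# extra_metadata container discussed above.
STANDARD_SET_SLOTS = {"curie_map", "mapping_set_id", "license", "extra_metadata"}


def read_set_metadata(raw):
    """Split parsed set metadata into preserved slots and discarded names."""
    kept = {key: value for key, value in raw.items() if key in STANDARD_SET_SLOTS}
    discarded = sorted(key for key in raw if key not in STANDARD_SET_SLOTS)
    return kept, discarded


raw = {
    "extra_metadata": {"foo": "inside extra_metadata, PRESERVED"},
    "notfoo": "outside extra_metadata, DISCARDED",
}
kept, discarded = read_set_metadata(raw)
```

Everything under extra_metadata survives untouched because the container itself is a known slot; its contents are never inspected.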

@gouttegd
Contributor Author

Another point is that I now think it would not necessarily be unreasonable to recommend that implementations preserve all non-standard slots, without requiring any declaration at all (possibly as an optional behaviour).

I am not saying that this is what we should do (I’d be in fact slightly reluctant to do it as this would be the opposite of the existing behaviour of both SSSOM-Py and SSSOM-Java), but it’s something to be considered.

@gouttegd
Contributor Author

Before going any further, could we get a decision on the following 4 propositions?

A) The use of non-standard slots is unsupported.

Only the slots listed in the spec are guaranteed to always be preserved. Implementations are free to do whatever they want (including discarding without even a warning) with any non-standard slot they encounter.

If you want to store/exchange additional metadata for which no standard slot exists, and if you want to be sure said metadata is always preserved no matter what, your only solution is to encode the metadata as key=value pairs in the other slot. That is the only supported mechanism for storing additional metadata in SSSOM.

That is basically the current situation.

B) Implementations MUST preserve any non-standard slot.

The YAML metadata block can always contain non-standard keys and the TSV section can always contain non-standard columns. Implementations MUST always preserve such slots without requiring the user to do anything special.

That is basically the opposite of the current situation. Note that in this case, the other slot, both at the set level and at the mapping level, could probably be obsoleted as it would no longer really serve any purpose.

C) Implementations MAY/SHOULD preserve any non-standard slot.

Similar to B, with the big difference that implementations are not required to support non-standard slots. For implementations that do support non-standard slots, it is up to them to decide whether this should be the default behaviour or not.

Slightly easier for implementers (who can decide to avoid implementing that feature if they want). More annoying for users as it means they can’t be sure their additional metadata will be preserved or not, since that would depend on whether the implementation supports it or not. (Whether this is really a problem is debatable: possibly, non-standard metadata will only be meaningful within the confines of a given project anyway, so it may not be a big deal if they are lost once the mapping set is re-used outside of that project.)

D) Implementations MUST preserve any non-standard slot that is clearly identified as being “to-be-preserved”

There is a mechanism to ”declare” non-standard slots, and users who want to be sure their additional metadata are preserved (but who don’t want to encode said metadata in the other slot, because it is too cumbersome) must use that mechanism. Implementations are required to always preserve any non-standard slot that has been properly declared; non-declared slots can be discarded as usual.

That is basically my initial proposal. I deliberately skip the details of how non-standard slots must be declared for now.
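One way to compare the four options is to ask which non-standard slots a user can rely on being preserved by any conformant implementation. A small Python sketch of that reading (the Policy enum and function name are invented for this comparison and are not part of any proposal):

```python
from enum import Enum


class Policy(Enum):
    A_UNSUPPORTED = "A"
    B_MUST_PRESERVE_ALL = "B"
    C_MAY_PRESERVE_ALL = "C"
    D_MUST_PRESERVE_DECLARED = "D"


def guaranteed_preserved(policy, non_standard, declared):
    """Return the non-standard slots that every conformant tool must keep."""
    if policy is Policy.B_MUST_PRESERVE_ALL:
        return set(non_standard)
    if policy is Policy.D_MUST_PRESERVE_DECLARED:
        return set(non_standard) & set(declared)
    # A: nothing is guaranteed. C: preservation depends on the tool,
    # so nothing is guaranteed across implementations either.
    return set()


found = ["bar", "baz"]
declared = ["bar"]
guarantees = {p.value: guaranteed_preserved(p, found, declared) for p in Policy}
```

With bar declared and baz undeclared, only B guarantees both columns, D guarantees bar, and A and C guarantee nothing.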

@gouttegd
Contributor Author

gouttegd commented Dec 28, 2023

For what it’s worth, as the SSSOM-Java developer I tend slightly toward D, but I would have no objection to implementing B or C (A doesn’t require any implementation as it’s the existing behaviour – obviously I would have no objection to doing nothing :D ).

@matentzn
Collaborator

matentzn commented Jan 2, 2024

Awesome analysis, I know you know that :)

It is difficult for me to support one solution over another. Before I give you my opinion, I hope you don't mind me returning to an earlier point first, and forgive me it is about the how rather than the what, which is I agree, a separate thing we should decide after we make an informed decision about the what.

This pertains to your 7 November comment above.

I see your point and I agree that having “namespaced” column names, similar to XML names, would be nice. But nobody would expect that from a TSV file, and none of the standard tools available to manipulate TSV files (such as Pandas) has support for that. I don’t think it is worth the trouble. Let’s keep the Simple Standard for Sharing Ontological Mappings, well… simple.

Here I gave two examples with dc:modified or foaf:funded_by. This could be, or I daresay is, important because many people who would consider implementing a standard such as SSSOM are deeply familiar with metadata modelling, and would be happy if their RDF serialisation ended up mapped to standard properties (such as the popular Dublin Core). Note that the FAIR Impact, Elixir and EOSC life communities, for example, are pretty keen on the RDF representation in particular.

(EDIT I just deleted a lengthy paragraph I wrote about in-TSV schema extensions thinking it a bit over the top for that early in the year).

Regarding the four proposals:

I am slightly (not strongly) biased against A and B.

  • A defies reality and will mean that people who use non-standard slots, which is basically everyone I surveyed so far, will not be able to use standard tooling for filtering and annotating. The other slot really was a hack; no one is using that, and certainly not for the purpose of "adding columns to a CSV file".
  • B seems to a high degree policing - a specification conformant implementor (for example a browser, a registry API) should be able to decide that it only supports specification internal slots. I am even more against this than A.

With C and D, I have the following thoughts for now:

  1. If we can agree on a slightly more complex model for representing slot definitions (it does not have to be a full-fledged schema definition, but at least some very basic things that allow me to control the RDF serialisation) I would favour D.
  2. If we cannot agree on a more complex model for representing slot definitions, and basically, I don't know, apply a default namespace to custom slots in RDF, I would favour C, as being less prescriptive for tool developers. Users can still maintain some level of control by choosing the right tool.

@gouttegd
Contributor Author

gouttegd commented Jan 2, 2024

A defies reality and will mean that people who use non-standard slots, which is basically everyone I surveyed so far, will not be able to use standard tooling for filtering and annotating.

I agree. I mentioned A merely for completeness – and also as a reminder that it’s the only thing we’ll have unless and until we come up with a better solution.

The other slot really was a hack; no one is using that, and certainly not for the purpose of "adding columns to a CSV file".

Also for completeness: we could define a mechanism by which non-standard slots and columns are encoded into the other slot.

That is, if we have a set as follows:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#foo: "The yellow fox jumped over the lazy dog"
subject_id      predicate_id    object_id       mapping_justification   bar     baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    A       D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    B       E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    C       F

implementations could automatically transform it into:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#other:
#  - "foo=The yellow fox jumped over the lazy dog"
subject_id      predicate_id    object_id       mapping_justification   other
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    bar=A|baz=D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    bar=B|baz=E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    bar=C|baz=F

And of course there would be an operation to do the opposite, i.e. to “decode” the contents of the other slot into individual non-standard slots.
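A minimal sketch of what such an encode/decode pair could look like. The function names and the abbreviated list of standard columns are hypothetical; the `key=value|key=value` layout is taken from the example above. It assumes the input rows do not already have an `other` column and that values contain no `|` character.

```python
# Abbreviated, illustrative list of standard columns (not the full SSSOM list).
STANDARD_COLUMNS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}


def encode_other(row: dict) -> dict:
    """Fold every non-standard column of one mapping row into a single 'other' value."""
    encoded = {k: v for k, v in row.items() if k in STANDARD_COLUMNS}
    extras = {k: v for k, v in row.items() if k not in STANDARD_COLUMNS}
    if extras:
        encoded["other"] = "|".join(f"{k}={v}" for k, v in extras.items())
    return encoded


def decode_other(row: dict) -> dict:
    """Reverse operation: expand the 'other' value back into individual columns."""
    decoded = {k: v for k, v in row.items() if k != "other"}
    for pair in filter(None, row.get("other", "").split("|")):
        key, _, value = pair.partition("=")  # split on the first '=' only
        decoded[key] = value
    return decoded


row = {
    "subject_id": "FBbt:00000001",
    "predicate_id": "semapv:crossSpeciesExactMatch",
    "object_id": "UBERON:0000468",
    "mapping_justification": "semapv:ManualMappingCuration",
    "bar": "A",
    "baz": "D",
}
encoded = encode_other(row)
```

A real implementation would also need an escaping rule for `|` and `=` inside values, which this sketch leaves out.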

This would allow users of a set containing non-standard metadata to be sure that their non-standard metadata “survive” through any SSSOM-compliant tool when needed, without requiring any change to the existing model. That’d be kind of a hack piled on top of another hack, though. Still, it’s a possibility.

B seems to a high degree policing […] I am even more against this than A.

OK, let’s rule this one out.

  1. If we can agree on a slightly more complex model for representing slot definitions (it does not have to be a full-fledged schema definition, but at least some very basic things that allow me to control the RDF serialisation) I would favour D.
  2. If we cannot agree on a more complex model for representing slot definitions, and basically, I don't know, apply a default namespace to custom slots in RDF, I would favour C, as being less prescriptive for tool developers.

How about:

Upon encountering a non-standard slot (i.e. an unknown key in the set-level metadata block, or an unknown column in the TSV section):

  • if the slot name looks like a Curie (e.g. foaf:fundedBy) that is resolvable (meaning the prefix is declared in the set’s Curie map, or is one of the built-in prefixes), then we expand the Curie (if the Curie is not resolvable, then it is an error: a SSSOM set must never use undeclared prefixes, the spec is already clear on that);
  • if the slot name does not look like a Curie, we construct an IRI by prefixing the slot name with either a spec-mandated default namespace (e.g. https://w3id.org/sssom/user_extensions/) or the set’s ID.

That is, the following set:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#  foaf: "http://xmlns.com/foaf/0.1/"
#"foaf:fundedBy": "Scrooge McDuck Foundation"
#foo: "The yellow fox jumped over the lazy dog."
subject_id      predicate_id    object_id       mapping_justification   foaf:bar     baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    A       D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    B       E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    C       F

would have, after parsing, two non-standard set-level metadata:

  • http://xmlns.com/foaf/0.1/fundedBy (with the value "Scrooge McDuck Foundation"),
  • https://w3id.org/sssom/user_extensions/foo (with the value "The yellow fox jumped over the lazy dog.");

and each mapping would have two non-standard mapping-level metadata:

  • http://xmlns.com/foaf/0.1/bar (with respective values "A", "B", and "C"),
  • https://w3id.org/sssom/user_extensions/baz (with respective values "D", "E", and "F").

Would that solve your RDF serialisation issue?
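As a rough illustration of the two rules above, here is a sketch in Python. The function name is hypothetical, "looks like a Curie" is simplified to "contains a colon", and the default namespace is passed in as a parameter (the value used here is only a placeholder), so either a spec-mandated namespace or the set’s ID could be plugged in. Built-in prefixes are assumed to have been merged into the Curie map by the caller.

```python
def slot_name_to_iri(slot_name: str, curie_map: dict, default_namespace: str) -> str:
    """Turn a non-standard slot name into an IRI, following the two rules above."""
    if ":" in slot_name:
        # Rule 1: the name looks like a Curie, so its prefix must be declared.
        prefix, local = slot_name.split(":", 1)
        if prefix not in curie_map:
            raise ValueError(f"undeclared prefix in slot name: {slot_name!r}")
        return curie_map[prefix] + local
    # Rule 2: plain name, so prepend the default namespace.
    return default_namespace + slot_name


# Placeholder namespace, for illustration only.
DEFAULT_NS = "https://example.org/user_extensions/"
curie_map = {"foaf": "http://xmlns.com/foaf/0.1/"}

funded_by_iri = slot_name_to_iri("foaf:fundedBy", curie_map, DEFAULT_NS)
baz_iri = slot_name_to_iri("baz", curie_map, DEFAULT_NS)
```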

@gouttegd
Contributor Author

gouttegd commented Jan 2, 2024

The last proposal above (treating the non-standard slot name as a Curie) is actually quite orthogonal to the question of requesting non-standard slots to be declared. The example I gave was for undeclared slots, but the same logic would work for declared non-standard slots:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#  foaf: "http://xmlns.com/foaf/0.1/"
#extra_set_metadata:
#  "foaf:fundedBy": "Scrooge McDuck Foundation"
#  foo: "The yellow fox jumped over the lazy dog."
#extra_mapping_metadata:
#  - "foaf:bar"
#  - baz
subject_id      predicate_id    object_id       mapping_justification   foaf:bar     baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    A       D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    B       E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    C       F

@gouttegd
Contributor Author

gouttegd commented Jan 2, 2024

Or, instead of making a distinction between declared and undeclared slots, we could make the distinction between “slots whose name is a Curie” and “slots whose name is not a Curie” – only the former would be REQUIRED to be preserved, implementations would be free to discard the latter.

That is, with that set again:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#  foaf: "http://xmlns.com/foaf/0.1/"
#"foaf:fundedBy": "Scrooge McDuck Foundation"
#foo: "The yellow fox jumped over the lazy dog."
subject_id      predicate_id    object_id       mapping_justification   foaf:bar     baz
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration    A       D
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:6000002  semapv:ManualMappingCuration    B       E
FBbt:00000003   semapv:crossSpeciesExactMatch   UBERON:0000914  semapv:ManualMappingCuration    C       F

Only the http://xmlns.com/foaf/0.1/fundedBy set-level metadata and http://xmlns.com/foaf/0.1/bar mapping-level metadata would be guaranteed to be preserved.

@gouttegd
Contributor Author

gouttegd commented Jan 3, 2024

OK, here is my current proposition in 5 points.

1. Support for non-standard metadata is OPTIONAL.

Implementations are only REQUIRED to support the standard, spec-defined slots. Support for any kind of non-standard slots is always OPTIONAL.

This is based on what you said above: “a specification conformant implementor (for example a browser, a registry API) should be able to decide that it only supports specification internal slots”.

2. OPTIONAL, but RECOMMENDED support for DECLARED EXTENSIONS

It is possible to declare non-standard metadata slots using a mechanism similar to the one proposed in #263.

That is, the set gets a new metadata slot called extension_definitions specifically intended to declare non-standard slots, as in the following example:

#curie_map:
#  FOAF: "http://xmlns.com/foaf/0.1/"
#  EXAMPLE: "https://example.org/extensions/"
#extension_definitions:
#  - slot_name: funded_by
#    property: "FOAF:fundedBy"
#  - slot_name: bar
#    property: "EXAMPLE:bar"

Where slot_name is the name of the extension slot as it will appear in the rest of the set (as an extra key in the top-level metadata block or as an extra column name in the TSV section) and property is the IRI (possibly CURIEfied as in the example above) identifying the metadata stored in the slot.

There would be a single extension_definitions slot for the entire set, to be used to declare both set-level extra slots and mapping-level extra slots. This is for the sake of simplicity (I don’t see any compelling reason to have, say, a set_extension_definitions for declaring set-level extension metadata and a mapping_extension_definitions for declaring mapping-level extension metadata).

So with the above declarations, a set like this:

#funded_by: "Scrooge McDuck Foundation"
subject_id      predicate_id                    object_id       mapping_justification         bar
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration  A

would end up with the following extended metadata pairs:

  • http://xmlns.com/foaf/0.1/fundedBy = Scrooge McDuck Foundation (at the set level),
  • https://example.org/extensions/bar = A (at the mapping level).
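To make the lookup concrete, here is a small sketch of how a parser might turn the declarations into a slot-name-to-IRI index. It assumes the YAML metadata block has already been parsed into a plain dictionary; the function names are hypothetical and not taken from sssom-py or sssom-java.

```python
def expand_curie(value: str, curie_map: dict) -> str:
    """Expand a CURIE using the curie_map; anything else is returned unchanged."""
    prefix, sep, local = value.partition(":")
    if sep and prefix in curie_map:
        return curie_map[prefix] + local
    return value  # assumed to be a full IRI already


def build_extension_index(metadata: dict) -> dict:
    """Map each declared slot_name to the expanded IRI of its property."""
    curie_map = metadata.get("curie_map", {})
    return {
        definition["slot_name"]: expand_curie(definition["property"], curie_map)
        for definition in metadata.get("extension_definitions", [])
    }


metadata = {
    "curie_map": {
        "FOAF": "http://xmlns.com/foaf/0.1/",
        "EXAMPLE": "https://example.org/extensions/",
    },
    "extension_definitions": [
        {"slot_name": "funded_by", "property": "FOAF:fundedBy"},
        {"slot_name": "bar", "property": "EXAMPLE:bar"},
    ],
}
index = build_extension_index(metadata)
```

The same index serves both set-level keys and TSV column names, which matches the idea of a single extension_definitions slot for the whole set.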

I’d like to impose two additional constraints:

A. Non-standard metadata slots can only contain string values. No nesting of complex data structures (lists or dictionaries). That is, it would be forbidden to do things like this, even if funded_by has been properly declared:

#funded_by:
#  - organization: "Scrooge McDuck Foundation"
#    address: "Killmotor Hill, Duckburg, Calisota"
#    amount: "36 cents"

B. The slot_name should be constrained to a small character set (letters, digits, _, a few others) to minimise potential problems when the slot name is used in the YAML metadata block or as a column header. For example, it should not contain a colon character (:), which could confuse a YAML parser.

Implementations would be strongly encouraged (“SHOULD”) to support such declared non-standard extensions, though (as stated in 1. above) it would still be optional.
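Constraints A and B could be checked along these lines. This is only a sketch: the exact character set is not settled in this discussion, so the pattern below (ASCII letters, digits and underscore) is an assumption, and the function names are hypothetical.

```python
import re

# Assumed character set for slot names; the final list is still to be decided.
SLOT_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_slot_name(name: str) -> bool:
    """Constraint B: the whole name must be made of allowed characters."""
    return SLOT_NAME_RE.fullmatch(name) is not None


def is_valid_extension_value(value: object) -> bool:
    """Constraint A: only plain strings, no lists or dictionaries."""
    return isinstance(value, str)


nested = [{"organization": "Scrooge McDuck Foundation", "amount": "36 cents"}]
```

With these checks, `funded_by` passes while `FOAF:fundedBy` (colon) and the nested `nested` value above are rejected.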

3. FULLY OPTIONAL support for UNDECLARED EXTENSIONS

Additionally, it would also be possible for an implementation to support undeclared extensions. When the parser encounters a non-standard foo slot (either as an extra key in the YAML metadata block or as an extra column in the TSV section) that does not have a corresponding declaration in the extension_definitions slot, it would construct an IRI of the form https://w3id.org/sssom/undeclared_extensions/foo, and use that IRI as if it had been declared as the property associated with foo.

That is, a non-standard slot like this:

#foo: "The brown fox jumped over the lazy dog."

would be treated as if the metadata block contained the following declaration:

#extension_definitions:
#  - slot_name: foo
#    property: "https://w3id.org/sssom/undeclared_extensions/foo"

Support for such undeclared non-standard slots would be purely optional (MAY, not SHOULD).

The rationale for supporting undeclared extensions is notably to allow sets created before the introduction of the extension_definitions mechanism to still be usable – at least with the implementations that will support such undeclared extensions.
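The fallback for undeclared slots could be sketched as follows. The function name and the `support_undeclared` switch are hypothetical; the namespace is the one given in the text above.

```python
from typing import Optional

UNDECLARED_NS = "https://w3id.org/sssom/undeclared_extensions/"


def property_for(slot_name: str, declared: dict, support_undeclared: bool = True) -> Optional[str]:
    """Return the property IRI for a non-standard slot, or None if it is to be ignored."""
    if slot_name in declared:
        return declared[slot_name]  # properly declared extension
    if support_undeclared:
        return UNDECLARED_NS + slot_name  # treated as if it had been declared
    return None  # implementation without support for undeclared extensions


declared = {"funded_by": "http://xmlns.com/foaf/0.1/fundedBy"}
foo_iri = property_for("foo", declared)
```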

4. Overriding standard slots is strictly forbidden

It is forbidden to declare an extension slot with the same name as an existing standard slot. That is, something like this

#extension_definitions:
#  - slot_name: mapping_tool
#    property: "https://example.org/my_own_mapping_tool_property"

MUST be rejected as invalid.

Rationale: Since support for extensions is optional, not all implementations will use the extension_definitions slot. If we allowed such redefinitions of standard slots, the same mapping set would be interpreted differently depending on the implementation: an implementation aware of extension_definitions would interpret mapping_tool as an extension slot, whereas an implementation that does not support non-standard metadata would interpret mapping_tool as the standard, spec-defined slot. This is a huge no-no in my opinion. Standard slots exist for a reason and users MUST NOT be allowed to redefine them.
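A sketch of the corresponding check, assuming the implementation has the list of standard slot names at hand (abbreviated here) and signals an invalid set by raising an exception. All names are hypothetical.

```python
# Abbreviated, illustrative subset of the standard slot names.
STANDARD_SLOTS = frozenset(
    {"subject_id", "predicate_id", "object_id", "mapping_justification", "mapping_tool"}
)


def check_no_standard_override(definitions: list) -> None:
    """Reject any extension definition that reuses the name of a standard slot."""
    clashes = sorted(d["slot_name"] for d in definitions if d["slot_name"] in STANDARD_SLOTS)
    if clashes:
        raise ValueError(f"extension definitions redefine standard slots: {clashes}")


bad = [{"slot_name": "mapping_tool", "property": "https://example.org/my_own_mapping_tool_property"}]
good = [{"slot_name": "funded_by", "property": "FOAF:fundedBy"}]
```

Running the check before any other processing of extension_definitions means the set is rejected up front rather than being interpreted differently by different implementations.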

5. Serialisation details

It is RECOMMENDED that a SSSOM/TSV writer adheres to the following rules when writing a set containing non-standard metadata:

  1. The contents of the extension_definitions slot should be sorted alphabetically on the slot_name key.

  2. All set-level non-standard metadata slots should be written after all the standard slots in the YAML metadata block. They should be sorted alphabetically on the slot name.

  3. All mapping-level non-standard metadata slots (i.e. extra columns) should be written after all the standard columns in the TSV section. They should be sorted alphabetically on the slot name.

This is intended to ensure that a mapping set can always be written in a predictable way regardless of the presence of non-standard slots (same reason why we already recommend that standard slots are written in the order they are listed in the spec).

This does not concern the parser. A parser that supports non-standard extensions MUST support them independently of the order in which they appear in the file.
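For the TSV columns, the recommended ordering could be sketched like this. The list of standard columns is abbreviated and the function name is hypothetical; "alphabetically" is approximated here by Python's default string ordering (code point order), which is an assumption.

```python
# Abbreviated spec order of the standard columns, for illustration only.
STANDARD_COLUMN_ORDER = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def order_columns(columns: list) -> list:
    """Standard columns first (in spec order), then extra columns sorted by name."""
    rank = {name: i for i, name in enumerate(STANDARD_COLUMN_ORDER)}
    standard = sorted((c for c in columns if c in rank), key=rank.__getitem__)
    extra = sorted(c for c in columns if c not in rank)
    return standard + extra


ordered = order_columns(
    ["baz", "object_id", "bar", "subject_id", "mapping_justification", "predicate_id"]
)
```

The same approach would apply to the keys of the YAML metadata block and to the entries of extension_definitions (sorted on slot_name).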

@gouttegd
Contributor Author

gouttegd commented Jan 3, 2024

Alternative proposition (based on the idea of allowing CURIEfied slot names, as discussed here):

1. Support for non-standard metadata is OPTIONAL

Implementations are only REQUIRED to support the standard, spec-defined slots. Support for any kind of non-standard slots is always OPTIONAL.

Rationale: There is an explicit desideratum that “a specification conformant implementor (for example a browser, a registry API) should be able to decide that it only supports specification internal slots”.

2. OPTIONAL, but RECOMMENDED support for NAMESPACED EXTENSIONS

It is possible to use non-standard slots provided that:

  1. their name is in a “CURIEfied” form (prefix:name), and
  2. the prefix is dutifully declared using the normal prefix declaration mechanism (i.e. in the curie_map slot), unless it is one of the known “built-in” prefixes (those are the same rules that apply to any use of a CURIE everywhere in SSSOM, nothing special here).

Example:

#curie_map:
#  EXAMPLE: "https://example.org/extensions/"
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  FOAF: "http://xmlns.com/foaf/0.1/"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#"FOAF:fundedBy": "Scrooge McDuck Foundation"
subject_id      predicate_id                    object_id       mapping_justification         EXAMPLE:bar
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration  A

The set contains the non-standard key/value pair {http://xmlns.com/foaf/0.1/fundedBy, "Scrooge McDuck Foundation"}, and the first mapping contains the non-standard key/value pair {https://example.org/extensions/bar, "A"}.

Implementations are strongly encouraged (SHOULD) to support such namespaced extensions.

3. FULLY OPTIONAL support for NON-NAMESPACED EXTENSIONS

Additionally, an implementation that supports NAMESPACED EXTENSIONS can also decide to support NON-NAMESPACED EXTENSIONS. When a slot is both a) not described by the standard and b) not in a CURIE form (no prefix), an implementation MAY decide to keep the metadata. In that case, it should construct an IRI for the metadata by concatenating the default namespace https://w3id.org/sssom/extensions/ and the slot name as found in the file. (An implementation that decides not to support non-namespaced extensions would simply discard the slot.)

Example:

#curie_map:
#  FBbt: "http://purl.obolibrary.org/obo/FBbt_"
#  UBERON: "http://purl.obolibrary.org/obo/UBERON_"
#funded_by: "Scrooge McDuck Foundation"
subject_id      predicate_id                    object_id       mapping_justification         bar
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration  A

The set contains the non-standard key/value pair {https://w3id.org/sssom/extensions/funded_by, "Scrooge McDuck Foundation"}, and the first mapping contains the non-standard key/value pair {https://w3id.org/sssom/extensions/bar, "A"}.

4. Non-standard slots can only contain string values

Whether namespaced or not, a non-standard slot can only contain simple string values. No nesting of complex data structures such as this:

#"FOAF:fundedBy":
#  - organization: "Scrooge McDuck Foundation"
#    address: "Killmotor Hill, Duckburg, Calisota"
#    amount: "36 cents"

5. Serialisation considerations

To ensure that a set can be written in a predictable way, it is RECOMMENDED that a SSSOM/TSV writer adheres to the following rules when writing a set containing non-standard slots (whether namespaced or not):

  1. All set-level non-standard metadata slots should be written after all the standard slots in the YAML metadata block. They should be sorted alphabetically on the expanded slot name.

  2. All mapping-level non-standard metadata slots (i.e. extra columns) should be written after all the standard columns in the TSV section. They should be sorted alphabetically on the expanded slot name.

Comments

Compared to the other proposition, this one has two advantages:

  1. No need to declare extension slots (so, no need for a new extension_definitions slot or similar in the model), while still allowing RDF serialisation (since the expanded slot name is an IRI).
  2. No risk of name clash with standard-defined slots:
  • if the extension slot is namespaced, then EXAMPLE:mapping_tool can never be confused with the standard mapping_tool slot (though I think it should be discouraged to create such slots: parsers may not be confused by similar slot names but people may be);
  • if the slot is not namespaced, then mapping_tool is necessarily treated as the standard slot, since there is no indication it is anything else.

The disadvantage is the need to use CURIEfied slot names. This should not be a problem for mapping-level slots (nothing wrong in having a colon in the middle of a column name, all TSV parsers should be able to handle that), but for set-level slots, this requires that the names are escaped, since the colon is a special character in YAML. That is, FOAF:fundedBy at the set level MUST be written as follows:

#"FOAF:fundedBy": "Scrooge McDuck Foundation"

because the following would be invalid YAML:

#FOAF:fundedBy: "Scrooge McDuck Foundation"

Whether this is a big deal or not, I don’t know.

@matentzn
Collaborator

matentzn commented Jan 3, 2024

Wow man. Ok, let's get started:

Feedback on "Declared extensions" proposal

Support for non-standard metadata is OPTIONAL.

We are agreed.

OPTIONAL, but RECOMMENDED support for DECLARED EXTENSIONS

Wow, great idea! That would solve everything. Some minor corrections:

A. Non-standard metadata slots can only contain string values. No nesting of complex data structures (lists or dictionaries).

Just to be a tiny bit more flexible I would propose the word string to be changed to literal. This would leave the door open for other kinds of literals in the future, like those specified by the xsd namespace (and, what is most important to me, IRI).

I agree with everything else, including the constraints on slot_name. I don't see a reason to allow anything but [a-z0-9_], but maybe we can leave at least upper case characters in there. I am fine with anything; we are in the same boat.

FULLY OPTIONAL support for UNDECLARED EXTENSIONS

  • I would for the sake of precision stick to RFC 2119 with our phrasing: https://datatracker.ietf.org/doc/html/rfc2119 and avoid OPTIONAL or FULLY OPTIONAL in favour of MAY.
  • I think proposal three needs a note for developers what should happen if a future version of SSSOM moves the slot to "standard". At least a note that "there is a risk that, in the case of undeclared extensions, the slot may become standard in the future and treated as such".

Non-standard slots can only contain string values

Agreed, but I would prefer we say "literal" rather than string in the specification, to leave the door open for xsd datatype map later.

Serialisation considerations

It is important we use SHOULD rather than MUST here, but else all agreed.

Other considerations

Also for completeness: we could define a mechanism by which non-standard slots and columns are encoded into the other slot.

This would allow users of a set containing non-standard metadata to be sure that their non-standard metadata “survive” through any SSSOM-compliant tool when needed, without requiring any change to the existing model.

Hm, smart. A bit over the top IMO to provide a specification here, but good thinking. I was thinking to drop the level of other to string type once the rest of the proposal is through and market it as a slot that allows to hold arbitrary metadata (whether in dict form or otherwise).

I personally, right this moment, favour the "declared extensions" proposal for the following reasons:

  • I actually don't so much like the idea of CURIEfied slot names, and this namespace proposal would make it impossible to use unCURIEfied slot names with a property mapping
  • I want to keep the door open for literal typing, especially IRI and xsd:integer to ensure that I can do this:
#FOAF:fundedBy: wikidata:Q11937

Which is a common use case in the FAIR data world.

@gouttegd
Contributor Author

gouttegd commented Jan 3, 2024

I actually don't so much like the idea of CURIEfied slot names, and this namespace proposal would make it impossible to use unCURIEfied slot names with a property mapping

All right, let’s ditch the “namespaced extensions” proposition and polish the “declared extensions” one then.

I would for the sake of precision stick to RFC 2119 with our phrasing

That’s what I intend to do when I rewrite the spec. For this discussion, I’d rather not assume that everyone here is familiar with RFC 2119 terminology, given that I am the one who had to suggest that it should be used more often.

I think proposal three needs a note for developers what should happen if a future version of SSSOM moves the slot to "standard". At least a note that "there is a risk that, in the case of undeclared extensions, the slot may become standard in the future and treated as such".

Agreed. We have already discussed that before, my position is still the same: users are strongly encouraged, when crafting non-standard slot names, to prefix them with ext_, because we explicitly promise that we will never create a new standard slot with a name starting with ext_, so such names would automatically be forever protected against any name clash in any future version of the standard. If users don’t follow that advice and craft non-standard slots with names that do not start with ext_, well, they have been warned: their mapping sets may break with future versions of the standard.

I was thinking to drop the level of other to string type once the rest of the proposal is through and market it as a slot that allows to hold arbitrary metadata (whether in dict form or otherwise).

Not sure what you mean here. Note that currently, other is already of type string (which is one of the problems with that slot: its specification in the model does not match the way the documentation says it should be used: it is specified as a string but documented as a list).

About literal typing

This would leave the door open for other kinds of literals in the future
[…]
I want to keep the door open for literal typing

If you want to allow for explicit typing of non-standard metadata, we should do that from the start and not “in the future”. Putting out a half-baked mechanism, waiting for developers to implement it, and then coming back to them later saying “oh wait, actually non-standard slots should be typed, sorry I don’t give a damn about the time you spent implementing that and I don’t care if you now have to start all over again because the prerequisites have changed!” would be a moronic move.

Typing, then.

How about:

#extension_definitions:
#  - slot_name: funded_by
#    property: "FOAF:fundedBy"
#    type: "xsd:anyURI"

Random thoughts:

  1. If we do that it would probably be wise to add http://www.w3.org/2001/XMLSchema# to the list of “built-in” prefixes.
  2. In the absence of a type key, xsd:string must be assumed.
  3. No other assumptions should be made. For example, implementations should not be expected to assume that a given property is of a particular type: either the type is explicitly specified, or it is assumed to be xsd:string, whatever the property is.
  4. Open-ended typing will be a royal pain in the ass for implementors using a statically typed language.

Because of (4), and since according to you the common use case is to be able to mark non-standard slots as being IRIs, I would very much favour a less ambitious system where:

  1. a non-standard slot can only be a string;
  2. a non-standard slot can be marked as being a (possibly CURIEfied) IRI (which would still be stored as string).

Something like that:

#extension_definitions:
#  - slot_name: funded_by
#    property: "FOAF:funded_by"
#    is_iri: true

If people need to store other data types (e.g. datetimes, floats, etc.), they can store them as strings and let their use-case-specific applications parse them into the appropriate data type.

@gouttegd
Contributor Author

gouttegd commented Jan 3, 2024

If you really, really want explicit typing, at the very least we should only allow a restricted list of possible types, such as xsd:string, xsd:dateTime, xsd:double, xsd:anyURI, possibly a few others – but non-standard slots should not become the place where users can do whatever the hell they want.

@gouttegd
Contributor Author

gouttegd commented Jan 3, 2024

Just one example of the kind of mess explicit typing could cause.

Let the following mapping set (curie_map omitted for the sake of brevity):

#extension_definitions:
#  - slot_name: validation_date
#    property: "EXAMPLE:validationDate"
#    type: "xsd:date"
subject_id      predicate_id                    object_id       mapping_justification         validation_date
FBbt:00000001   semapv:crossSpeciesExactMatch   UBERON:0000468  semapv:ManualMappingCuration  2024-01-03

(Let’s assume EXAMPLE:validationDate is a site-specific property intended to record when the mappings were last validated by a custom process – akin to the standard mapping_date, but still different enough that whoever created that set felt they needed a different metadata.)

We parse that set. The 2024-01-03 value in the validation_date column is parsed into a java.time.Date object (the parser knows to do that since it knows the type of the slot, all good). Because we are working in a statically typed language, and extension slots can be of an arbitrary type (unknown at compile-time), we store the java.time.Date object as a simple java.lang.Object.

When the client code wants to do something with the validation_date slot, it has to consult the extension_definitions slot to find out that the actual type of validationDate is a date, and then cast the java.lang.Object value into a java.time.Date. So far, so good.

But now let’s say we have to merge the above set with that other set:

#extension_definitions:
#  - slot_name: validation_date
#    property: "EXAMPLE:validationDate"
subject_id      predicate_id                    object_id       mapping_justification         validation_date
FBbt:00000002   semapv:crossSpeciesExactMatch   UBERON:0000469  semapv:ManualMappingCuration  2024-01-04

Notice how the creator of that second set forgot to specify the type of the validation_date slot? So when we parse that set, the 2024-01-04 value is parsed as a java.lang.String, then again stored as a java.lang.Object.

Now what happens when we merge the two sets together?

First, we have to decide which of the two validation_date definitions takes precedence, since they are not the same (the first one declares the slot as being of type xsd:date, the second declares – implicitly – the slot as being of type xsd:string). Let’s say we decide the definitions of the first merged set take precedence (we could decide the opposite, that would not change anything).

Now we have a merged set in which the extension_definitions says that the validation_date slot contains date-typed values. So when iterating over the mappings to process the validation_date slot, the client code will confidently try to cast the value from java.lang.Object to the date type – and KABOOM, when attempting to do that over the mappings coming from the second set it will result in an invalid cast exception!

Seems far-fetched? However the only thing needed for that to happen is for someone to forget the type qualifier in an extension_definitions entry – I can pretty much guarantee that it is something that will happen.

And this is not a problem only for statically typed languages like Java. Let’s imagine the same scenario with a duck-typed language like Python: you believe (because that’s what the extension_definitions slot says) that the validation_date slot contains dates. So you confidently call a date-specific method (say, isoformat()) on values from that slot – and KABOOM, when doing that over the mappings from the second set you will get an AttributeError ('str' object has no attribute 'isoformat')!
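The Python variant of the scenario can be reproduced with a few lines. This is a deliberately simplified sketch: `parse_value` is a hypothetical stand-in for whatever a real parser would do, and the two lists stand for the validation_date values of the two sets.

```python
from datetime import date


def parse_value(raw: str, xsd_type: str = "xsd:string"):
    """Parse a raw TSV value according to the declared type (string by default)."""
    return date.fromisoformat(raw) if xsd_type == "xsd:date" else raw


# First set declares validation_date as xsd:date; the second one omits the type.
first_set = [parse_value("2024-01-03", "xsd:date")]
second_set = [parse_value("2024-01-04")]

# After merging, the first set's definition wins, so client code expects dates.
merged = first_set + second_set

errors = []
for value in merged:
    try:
        value.isoformat()  # fine for date objects, fails for plain strings
    except AttributeError as exc:
        errors.append(str(exc))
```

Only the value coming from the second set ends up in `errors`, which is exactly the mismatch between the merged definition and the stored values.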

@gouttegd
Contributor Author

gouttegd commented Jan 3, 2024

I could be OK with explicit typing if we agree on two things:

A. Whatever type is used has to be a “simple” type that can be represented as a string (e.g. dates, floats, booleans, IRIs – in CURIE form or not —, etc. are all good; lists, dictionaries and other complex structures are verboten).

B. It is understood that the type indicated in the extension_definitions is merely a type hint, and that client code should never assume that the type hints are 100% accurate. Because the type hints and the actual values are stored in different places, there would always be a risk that the type hints are “out of sync” (as in the case above when we merge several datasets with conflicting extension_definitions). Therefore client code should always adopt a cautious approach whenever it manipulates non-standard slots (as in “the extension definition says the value of that should be a date; well, let’s try to parse it as a date, but let’s be ready to catch an InvalidDateFormatError just in case it is not a date”).

This would still allow a SSSOM/TSV-to-RDF serialiser to use the type hint to write something like this, which I assume is what you want to be able to do:

<EXAMPLE:validation_date rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2024-01-03</EXAMPLE:validation_date>

where the datatype would be provided by the type hint.

@gouttegd
Contributor Author

gouttegd commented Jan 4, 2024

After further thoughts, let’s scrap this idea:

I would very much favour a less ambitious system where:

  • a non-standard slot can only be a string;
  • a non-standard slot can be marked as being a (possibly CURIEfied) IRI (which would still be stored as string).
    Something like that:
#extension_definitions:
#  - slot_name: funded_by
#    property: "FOAF:fundedBy"
#    is_iri: true

That was a ridiculous idea: if we have to store one extra bit of information to record the fact that the slot is an IRI, we might as well store the full type information.

@gouttegd
Contributor Author

gouttegd commented Jan 4, 2024

So let’s try to put this typing thing into shape.

1. Syntax

What I already proposed yesterday, i.e.:

#extension_definitions:
#  - slot_name: funded_by
#    property: "FOAF:fundedBy"
#    type: "xsd:anyURI"

A type of xsd:anyURI is to be treated the same way as a LinkML-style uriOrCurie. That is, it is formally a URI but it may be written as a CURIE – in which case the normal SSSOM rules about CURIEs apply: the prefix name MUST be declared for the set to be valid.

If the type key is absent, the slot is assumed to be of type xsd:string.

(Alternatively, if it is expected that most uses of extension slots will be to store identifiers, we could decide that the default type should be xsd:anyURI instead. No preference on that from me.)

2. Range of allowed types

Several possibilities here.

A. Only allow a small subset of types

ONLY the following types are allowed:

  • xsd:string
  • xsd:integer
  • xsd:boolean
  • xsd:double
  • xsd:time
  • xsd:date
  • xsd:dateTime
  • xsd:anyURI

Rationale: Having a fixed number of cases to deal with will make it easier for implementors and that list should still be enough to cover most use cases.

B. No limitation on allowed types

No restriction on what the type key can contain, as long as it is a (possibly CURIEfied) IRI.

In that case, the type is merely a hint (and I would actually suggest renaming the type key to type_hint to make that clear). Implementations should treat all extension values as nothing more than opaque strings. It is entirely left to client code to interpret the type hint to decide what to do with the values. The only exception is that for extension slots with an xsd:anyURI type, CURIE expansion may be performed.

C. No limitation but basic types are recognised for what they are

Basically a compromise between A and B. There is no restriction on what the type key can contain, but:

  • if it is one of the types listed in A above, implementations must guarantee the type safety of any operation involving the corresponding slots (e.g., if an extension slot is typed with xsd:double, implementations must check that any value in that slot is actually a floating-point number, and reject the set otherwise);
  • if it is any other type, then the indicated type is merely a hint and the values are effectively treated as opaque strings.

I strongly favour A here. Then again, I’m afraid that the mere possibility of declaring a type will inevitably lead to people putting whatever they want in the type key (possibly even a custom identifier for a custom type that only makes sense for their particular use case), so B or C could be more realistic.

3. Cardinality

Any mapping set or any mapping can only contain one value for any given extension slot. If an extension slot is present more than once (i.e. two keys with the same name in the YAML metadata block or two columns with the same name in the TSV section), the behaviour is unspecified. Implementations are free to (a) reject the set outright, (b) ignore all duplicated slots, (c) accept only the first occurrence of the slot, (d) accept only the last occurrence of the slot, etc.
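As a rough sketch of what detecting repeated slots in a TSV header could look like (the function names are hypothetical, and only strategy (c), keeping the first occurrence, is shown):

```python
from collections import Counter


def duplicated_slots(header):
    """Return the set of column names that appear more than once in the header."""
    counts = Counter(header)
    return {name for name, n in counts.items() if n > 1}


def first_occurrence_indices(header):
    """Strategy (c): map each column name to the index of its first occurrence."""
    indices = {}
    for i, name in enumerate(header):
        indices.setdefault(name, i)
    return indices
```

An implementation choosing strategy (a) would instead raise an error as soon as duplicated_slots returns a non-empty set.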

4. Optionality

Two possibilities here.

A. The typing feature is an integral part of the declared extensions feature.

Support for declared extensions is only RECOMMENDED (as already agreed upon above), but an implementation that decides to support declared extensions MUST support typing. The two features cannot be separated.

B. The typing feature is optional.

An implementation that supports declared extensions MAY (or maybe SHOULD, but not MUST) support the typing feature. If it does not, then it should treat all extension values as opaque strings.

Option B allows implementers to opt out from having to deal with types at all, while still allowing extension values to be preserved (except for their types). But option A is nicer for users, because they only need to worry about whether their sets will pass through an implementation that supports extensions or not, without having to worry about the supplementary issue of whether the implementation supports typing or not.

@matentzn
Collaborator

matentzn commented Jan 5, 2024

I was thinking to drop the level of other to string type once the rest of the proposal is through and market it as a slot that allows to hold arbitrary metadata (whether in dict form or otherwise).

Not sure what you mean here. Note that currently, other is already of type string (which is one of the problems with that slot: its specification in the model does not match the way the documentation says it should be used: it is specified as a string but documented as a list).

Yeah, then, just adjust the "documentation" to say, "wild west".

Now what happens when we merge the two sets together?

😱 OMG I never even considered this.. MERGING. I always assumed that custom slots are namespaced by the mapping set they appear in, implicitly, and never considered how complicated the merge implementation will become.

A. Whatever type is used has to be a “simple” type that can be represented as a string (e.g. dates, floats, booleans, IRIs – in CURIE form or not —, etc. are all good; lists, dictionaries and other complex structures are verboten).

100% agreed

B. It is understood that the type indicated in the extension_definitions is merely a type hint, and that client code should never assume that the type hints are 100% accurate.

Great idea, also agreed.

Feedback on final proposal:

If the type key is absent, the slot is assumed to be of type xsd:string.

Yes, string. Not anyURI

A type of xsd:anyURI is to be treated the same way as a LinkML-style uriOrCurie.

I would personally prefer to just use linkml:uriOrCurie as a type, because the standard RDF rendering for an xsd:anyURI typed literal is "http://ex.org"^^xsd:anyURI and not <http://ex.org>.

2. Range of allowed types

I am fine with A, with the caveat that I would like to support xsd:anyURI (a literal URI) and linkml:uriOrCurie (an entity reference / RDF resource) individually. I would prefer C to be more future proof, but given your strong feeling I'd rather trust your instinct. (C in the sense of "if datatype x is not recognised we assume xsd:string")

3. Cardinality

ok by me.

4. Optionality

I favour B as the less prescriptive of the two. My tendency is to say: "if you want to do this thing (custom extensions), this is how you can" and "if you are a tool developer and you want to support custom extensions, this is what they would look like. Do as you will".

Pandora's box: merges and derivations

The biggest box you opened here is the question of derived mapping sets, in particular through merge processes. The scenario I am worried about is this (similar to your scenario):

Two mapping sets declare the same custom extension slot, with possibly different slot definitions.

My instinct would say that the slot name should always be namespaced during a merge, no matter what, but this is very inconvenient as you say; merges are most often performed within a project, and it would be crazy if this meant that the slots get artificially namespaced during the merge. So, my instinct is definitely wrong.

The only sane thing to do is to check the slot definitions for conflicts and refuse to proceed with the process. Which then puts quite a bit of burden on the merge implementor.. But, well, I guess this is optional anyways, so..

@gouttegd
Contributor Author

gouttegd commented Jan 5, 2024

I always assumed that custom slots are namespaced by the mapping set they appear in, implicitly

Well, that was automatically the case in one of my previous propositions:

we construct an IRI by prefixing the slot name with either a spec-mandated default namespace (e.g. https://w3ids.org/sssom/user_extensions/) or the set’s ID.

Under that system and if we choose to use the set’s ID rather than a global, spec-mandated default namespace, a non-standard slot funded_by would get associated to the IRI <set_ID>/funded_by. This would automatically preclude any clash when merging two sets.

The obvious problem with that solution, though, is that it is then no longer possible to assign existing properties (such as FOAF:fundedBy) to a slot. From what you said, I think it is (much) more important to be able to use such existing properties than to prevent possible definition clashes when merging.

I would personally prefer to just use linkml:uriOrCurie as a type

Fine with me. So xsd:anyURI for literal URIs (as in https://example.org/^^xsd:anyURI) and linkml:uriOrCurie for identifiers.

B. It is understood that the type indicated in the extension_definitions is merely a type hint, and that client code should never assume that the type hints are 100% accurate.
Great idea, also agreed.
[…]
I am fine with A […] I would prefer C to be more future proof, but given your strong feeling I'd rather trust your instinct. (C in the sense of "if datatype x is not recognised we assume xsd:string")

To be clear, the idea in A is that we do not treat the indicated type as merely a hint. We restrict the list of allowed types precisely so that it is reasonably doable to enforce the typing, without having to coerce everything into strings.

In B, on the contrary, we do coerce all extension values into strings, without enforcing anything.

And in C, we enforce the typing if the indicated type is one of the agreed-upon “basic types”, or coerce into strings otherwise.

From the point of view of a developer using a SSSOM parsing library (be it SSSOM-Java, SSSOM-Py, or any future compliant implementation), the difference would be in what the parser provides:

  • in A, the parser would say “Here’s an extension value. The associated property is http://example.org/my_property, and the value is an integer – trust me, I checked the type when parsing, if the value had not been a valid integer I would have rejected the set”;
  • in B, the parser would say “Here’s an extension value. The associated property is http://example.org/my_property, and the value is supposedly an http://www.w3.org/2001/XMLSchema#integer – at least that’s what the extension_definitions slot in the mapping set metadata says, I have no clue whether this is correct or not. As far as I am concerned, it’s a string, it’s up to you to figure out what to do with it. You could try Integer.parseInt() maybe?”;
  • in C, the parser would say one thing or the other depending on whether the indicated type is one of the “basic types” or not (“here’s an extension value that is an integer – trust me, I checked; here’s another value that is supposedly a http://example.org/my_custom_type, I didn’t know what to do with it, so I give you a string, up to you to deal with it”).
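To make the difference concrete, here is a minimal Python sketch of what an option C parser might do. The names are hypothetical, and only three of the “basic types” are covered. Note that Python’s int() and float() are more lenient than the XSD lexical rules (they accept surrounding whitespace, for example), so a real implementation would likely need stricter checks.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def _parse_boolean(raw):
    if raw not in ("true", "false"):
        raise ValueError(f"not a boolean: {raw!r}")
    return raw == "true"


# Checked conversions for an illustrative subset of the "basic types".
BASIC_TYPES = {
    XSD + "integer": int,
    XSD + "double": float,
    XSD + "boolean": _parse_boolean,
}


def parse_extension_value(raw, type_iri):
    """Option C: enforce basic types, pass anything else through as a string."""
    converter = BASIC_TYPES.get(type_iri)
    if converter is None:
        return raw  # unrecognised type: the indicated type is merely a hint
    # Raises ValueError on a bad value; the caller would then reject the set.
    return converter(raw)
```

Option A would be the same code with an error raised for unrecognised types, and option B would be the function reduced to returning raw unconditionally.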

I am really on the fence as to which option is “best”.

From the point of view of a developer who would write the actual implementation (i.e., me in the case of SSSOM-Java), option B is clearly the easiest to implement. That is because, in effect, it delegates most of the work to the client code.

Not sure what would be best from the point of view of a user.

Optionality
I favour B as the less prescriptive of the two

OK.

About merging

The only sane thing to do is to check the slot definitions for conflicts and refuse to proceed with the process.

That’s one possibility, yes (and, well, it’s not that much of a burden: just iterate through the extension definitions of every set to merge and raise an error if there’s a conflict).
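A minimal sketch of that check, assuming each set’s extension_definitions has already been loaded as a list of dictionaries with the slot_name, property and type keys used above (the function name is hypothetical). It compares properties as plain strings, so a real implementation would expand CURIEs first.

```python
def check_extension_conflicts(*definition_lists):
    """Raise ValueError if two sets define the same slot name differently.

    Returns {slot_name: (property, type)} when there is no conflict.
    """
    seen = {}
    for definitions in definition_lists:
        for definition in definitions:
            name = definition["slot_name"]
            signature = (definition.get("property"),
                         definition.get("type", "xsd:string"))
            if name in seen and seen[name] != signature:
                raise ValueError(f"conflicting definitions for slot {name!r}")
            seen[name] = signature
    return seen
```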

Another would be to store the full property name (not merely the slot name) with each extension value. So if you merge this set:

#extension_definitions:
#  - slot_name: funded_by
#    property: "FOAF:fundedBy"
#funded_by: "Scrooge Mc Duck"

and this one:

#extension_definitions:
#  - slot_name: funded_by
#    property: "http://example.org/another/property/to/represent/funder"
#funded_by: "Flintheart Glomgold"

You would end up with two extension values:

  • http://xmlns.com/foaf/0.1/fundedBy with value “Scrooge Mc Duck”;
  • http://example.org/another/property/to/represent/funder with value “Flintheart Glomgold”

No problem so far then. The only issue would be: what to do when serialising the merged set? Which slot names to use? One possibility would be to keep using funded_by for the first property, and to craft a new default slot name for the second, resulting in something like this:

#extension_definitions:
#  - slot_name: funded_by
#    property: "FOAF:fundedBy"
#  - slot_name: extraslot1
#    property: "http://example.org/another/property/to/represent/funder"
#funded_by: "Scrooge Mc Duck"
#extraslot1: "Flintheart Glomgold"

Not very fancy, but it would at least preserve all the important information (the slot names are not really important: the properties they are associated with are).
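The renaming scheme above can be sketched as follows. This is only an illustration under simplifying assumptions: each set is given as a pair of dictionaries (slot name to property, and slot name to set-level value), properties are already expanded to comparable strings, and the extraslotN naming is just the placeholder used in the example. If two sets use the same property, the later value overwrites the earlier one.

```python
def merge_extensions(sets):
    """Merge set-level extensions keyed on the property, renaming clashing slots.

    sets: list of (definitions, values) pairs, where definitions maps
    slot_name -> property and values maps slot_name -> value.
    Returns (merged_definitions, merged_values).
    """
    merged_defs = {}   # slot_name -> property in the merged set
    merged_vals = {}   # slot_name -> value in the merged set
    by_property = {}   # property -> slot_name chosen in the merged set
    counter = 0
    for definitions, values in sets:
        for slot, prop in definitions.items():
            if prop not in by_property:
                name = slot
                # The slot name is already taken by another property: craft a new one.
                while name in merged_defs:
                    counter += 1
                    name = f"extraslot{counter}"
                by_property[prop] = name
                merged_defs[name] = prop
            if slot in values:
                merged_vals[by_property[prop]] = values[slot]
    return merged_defs, merged_vals
```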

Another problem that can occur when merging, as already discussed above, is if there is a conflict between the indicated types, e.g. one set defines funded_by as an identifier and the other defines it as a string. Here, we could again refuse to merge, but it would also be possible to simply fall back to the string type.

@gouttegd
Contributor Author

gouttegd commented Jan 5, 2024

I am really on the fence as to which option is “best”.

Actually it’s not so much that I don’t know which option is best. It’s more that each option is best for a different category of persons.

Option A is best for developers using a SSSOM library: all the type checking is done by the library, they can just use whatever values are provided with their (guaranteed) correct types. Of course it means that it is harder for the developers of said SSSOM library.

Option B is best for developers writing a SSSOM library: It’s the easiest to implement. But then it’s harder for developers using the library, since they have to do the type checking that the library does not do for them.

Option C is (probably) best for users: they are free to use whatever types they want instead of being limited to a few basic types. Type checking for the most commonly used types will be done by the library, so it’s quite nice for downstream developers as well. But it puts most of the burden on the developers of SSSOM libraries.

So who do you want to please the most? :D

@gouttegd
Contributor Author

gouttegd commented Jan 5, 2024

I am on the verge of recommending option C at this point, especially since we agreed above that the typing feature would be optional (an implementation can decide to support extensions without supporting typed extensions), meaning developers of SSSOM libraries who don’t want to implement typing can opt out of doing so. (In which case they would treat everything as strings, which is more or less the same thing as option B except that the type hints would not be preserved.)

@gouttegd
Contributor Author

gouttegd commented Jan 5, 2024

But I guess there’s an argument to be made that both options A and C are “overkill”, and that maybe we do not want to transform SSSOM into a generic engine for storing typed key/value pairs…

Rah, I don’t know.

@gouttegd
Contributor Author

gouttegd commented Jan 5, 2024

In fact, my concern with option C is that I wonder if it would qualify as an instance of the second-system effect.

@gouttegd
Contributor Author

gouttegd commented Jan 6, 2024

All right, here is now my “final” proposition.

Not “final” in the sense “I won’t accept any further discussion on this, take it or leave it!” – more in the sense “I sure hope I won’t have any other contradicting idea 10 minutes after writing this message”…

1. Support for non-standard metadata is OPTIONAL

a. Implementations are only REQUIRED to support the standard, spec-defined metadata slots. Support for any kind of non-standard slot is always OPTIONAL or RECOMMENDED, but never MANDATORY.

Therefore, users must expect that any non-standard metadata slot will be discarded by an implementation that does not support them.

b. It is left at the discretion of implementations to decide whether to print a warning (or give any other kind of signal to the user) when a non-standard metadata slot is discarded.

2. RECOMMENDED support for DEFINED EXTENSIONS

2.1. How to define extension slots

a. It is possible to define non-standard metadata slots using an extension_definitions key in the top-level metadata block.

b. The format of the extension_definitions key is a list of extension_definition, where an extension_definition is itself a dictionary with the following keys:

  • slot_name: the name of the non-standard metadata slot, as it will appear in the top-level metadata block (for set-level metadata) and/or in the header of the TSV section (for mapping-level metadata);
  • property: the property associated with the non-standard metadata slot;
  • type_hint: the expected type of values in the non-standard metadata slot.

c. The slot_name key is MANDATORY. It MUST be a simple string containing only alphanumeric characters and underscores ([a-zA-Z0-9_]).

FIXME: Do we want to allow non-ASCII characters here? I can imagine that non-English-speaking users could want to be able to use letters from their language…

d. The property key is MANDATORY. It MUST be either an IRI or a CURIE. If it is a CURIE, it MUST be unambiguously resolvable, meaning the prefix name is either one of the built-in prefixes or is declared in the set’s curie_map.

e. The type_hint key is OPTIONAL. If present, it MUST be either an IRI or a CURIE, and is intended to represent the type of values the non-standard slot is expected to contain. If it is a CURIE, it MUST be unambiguously resolvable.

f. When the type_hint is absent, a default type of http://www.w3.org/2001/XMLSchema#string is assumed.

g. For convenience, the following two prefix names are added to the list of the built-in prefixes that do not need to be declared in the curie_map:

  • xsd: http://www.w3.org/2001/XMLSchema#,
  • linkml: https://w3id.org/linkml/.

h. A set MUST NOT define an extension slot with a slot_name that matches an existing standard slot.

i. To avoid any conflict with a future version of the SSSOM standard (which could introduce new standard slot names), users are strongly encouraged to craft non-standard slot names that start with the ext_ prefix. The SSSOM authors guarantee that no new standard slot with a name starting with ext_ will ever be introduced in any future version of the standard.

This is advice for users only. Implementations MUST NOT reject a non-standard slot solely on the basis that its name does not start with ext_!

Example of an extension_definitions key (assuming FOAF and EX are declared prefixes in the curie_map):

extension_definitions:
  - slot_name: funded_by
    property: "FOAF:fundedBy"
    type_hint: "linkml:uriOrCurie"
  - slot_name: foo
    property: "EX:fooProperty"
  - slot_name: bar
    property: "EX:barProperty"
    type_hint: "xsd:integer"

2.2. Support for defined extensions

a. Implementations SHOULD support defined extensions. For implementations that do support them, whether such support is enabled by default or must be explicitly enabled by a user is left at their discretion.

b. Implementations that do not support defined extensions MUST ignore the extension_definitions key. As in (1b) above, they MAY print a warning (or give any other kind of signal to the user) that they are ignoring that key.

For implementations that do support defined extensions:

c. They MUST check that the extension_definitions key is compliant with the rules of section (2.1). Any non-compliant entry (for example, an entry that is missing either the slot_name or the property key) MUST be ignored. A warning SHOULD be emitted in that case.

d. Upon encountering a non-standard metadata slot (be it in the top-level metadata block or in the TSV header), implementations MUST check whether the name of the slot matches the slot_name of one of the defined extensions. If it does, then the slot MUST be preserved through any further processing.
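Rules (2.1c), (2.1d), (2.1h), (2.2c) and (2.2d) could be sketched in Python roughly as follows. All names are hypothetical, STANDARD_SLOTS is a deliberately incomplete stand-in for the real list of standard slots, and the sketch does not check that the property or type hint is a resolvable CURIE or a valid IRI.

```python
import re
import warnings

SLOT_NAME = re.compile(r"[a-zA-Z0-9_]+")
# Illustrative subset only; a real implementation would use the full list.
STANDARD_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}


def compliant_definitions(entries):
    """Return {slot_name: definition} for compliant entries, warning about the rest."""
    accepted = {}
    for entry in entries:
        name = entry.get("slot_name")
        prop = entry.get("property")
        if not isinstance(name, str) or not SLOT_NAME.fullmatch(name):
            warnings.warn(f"ignoring extension with invalid slot_name: {name!r}")
        elif name in STANDARD_SLOTS:
            warnings.warn(f"ignoring extension that shadows standard slot: {name}")
        elif not prop:
            warnings.warn(f"ignoring extension without a property: {name}")
        else:
            accepted[name] = {
                "property": prop,
                # Rule (2.1f): a missing type_hint means xsd:string.
                "type_hint": entry.get("type_hint", "xsd:string"),
            }
    return accepted


def is_preserved(slot_name, accepted):
    """Rule (2.2d): a non-standard slot is preserved if it is a defined extension."""
    return slot_name in accepted
```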

3. OPTIONAL support for undefined extensions

a. An undefined extension is a non-standard metadata slot that is not defined in an extension_definitions key as described in section (2.1).

b. Implementations MAY support undefined extensions. It is left at their discretion whether such support is enabled by default or not.

c. Upon encountering a non-standard metadata slot that is not a defined extension, an implementation that supports undefined extensions MUST behave as if the slot had been declared with a property whose name is the slot name prefixed by the default prefix https://w3id.org/sssom/undefined_extensions/ and a type hint of xsd:string.

4. Restrictions on the values of extension slots.

4.1. General restrictions

The following restrictions apply to all extension slots, regardless of whether they are defined or undefined.

a. Each mapping set and each mapping can have at most one value for each extension slot. This means that the same extension slot cannot be present more than once in the top-level metadata block, and more than once in the TSV header.

How implementations behave upon encountering a repeated extension slot is left at their discretion. They MAY either

  • ignore the repeated slot(s) entirely;
  • accept only one of the occurrences and ignore the others.

b. All extension values MUST be representable as literal strings.

4.2. Optional further restrictions for typed defined extension

a. If a defined extension slot has a type_hint other than xsd:string, implementations MAY enforce further constraints on extension values based on the type hint.

b. If the type hint is linkml:uriOrCurie: implementations SHOULD check that, if the value is a CURIE, it is unambiguously resolvable.

c. If the type hint is xsd:integer: implementations MAY check that the value is an integer.

d. If the type hint is xsd:double: implementations MAY check that the value is a floating-point number.

e. If the type hint is xsd:boolean: implementations MAY check that the value is either true or false.

f. If the type hint is xsd:date: implementations MAY check that the value is a date in the ISO 8601 format (yyyy-mm-dd).

g. If the type hint is xsd:time: implementations MAY check that the value is a time in the ISO 8601 format (hh:mm:ssTZ, where TZ is either Z to indicate UTC time or a time offset of the form [+-]hh:mm).

h. If the type hint is xsd:dateTime: implementations MAY check that the value is a date and time value in the ISO 8601 format (yyyy-mm-ddThh:mm:ssTZ).

i. Implementations MAY decide to recognise more types and to enforce type-specific constraints. For example, an implementation could recognise the type xsd:negativeInteger and check that the value starts with a minus sign.
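As an illustration of these optional checks, here is a small Python sketch based on regular expressions. The names are hypothetical, the type hints are matched in their CURIE form for brevity, and the patterns only check the shape of the value (they would accept a month of 13, for instance). The time zone part is treated as required because that is how the formats are written above; that is an assumption of this sketch.

```python
import re

TZ = r"(Z|[+-]\d{2}:\d{2})"

# Shape-only patterns for some of the types listed above.
CHECKS = {
    "xsd:integer": re.compile(r"[+-]?\d+"),
    "xsd:boolean": re.compile(r"true|false"),
    "xsd:date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "xsd:time": re.compile(r"\d{2}:\d{2}:\d{2}" + TZ),
    # XSD spells this type "dateTime", with a capital T.
    "xsd:dateTime": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}" + TZ),
}


def looks_valid(value, type_hint):
    """Return False only if the hint is recognised and the value does not fit it."""
    pattern = CHECKS.get(type_hint)
    if pattern is None:
        return True  # unrecognised hint: nothing to enforce
    return pattern.fullmatch(value) is not None
```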

Rationale for this section: It is expected that most implementations will NOT, actually, bother to enforce typing, and will simply coerce all extension values into strings. But it is still useful to state somewhere that some commonly used types should use a standard format. This will reduce the risk that, for example, some sets will use American-style dates (MM-DD-YYYY) for their date-typed slots while some others will use European-style dates (DD-MM-YYYY) while some others yet will use ISO-style dates (YYYY-MM-DD).

5. Recommendations for SSSOM/TSV serialisers

a. All set-level non-standard metadata slots (whether defined or undefined, if the implementation supports undefined extensions) SHOULD be written after all the standard slots in the YAML metadata block. They SHOULD be sorted lexicographically on the property name.

b. All mapping-level non-standard metadata slots (i.e., supplementary columns) SHOULD be written after all the standard columns in the TSV section. They SHOULD be sorted lexicographically on the property name.
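A minimal sketch of the ordering in (5b), assuming the writer already has the standard columns in their desired order and a dictionary mapping each extension slot name to its (expanded) property. The function name is hypothetical.

```python
def order_columns(standard, extensions):
    """Return standard columns first, then extension slots sorted on their property.

    standard: list of standard column names, already in the desired order.
    extensions: dict mapping slot_name -> property.
    """
    return list(standard) + sorted(extensions, key=lambda slot: extensions[slot])
```

Note that the sort key is the property, not the slot name, so two writers that use different slot names for the same properties would still produce columns in the same order.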

@gouttegd
Contributor Author

gouttegd commented Jan 7, 2024

Small amendment to the “final” proposition:

Change (3c) from:

Upon encountering a non-standard metadata slot that is not a defined extension, an implementation that supports undefined extensions MUST behave as if the slot had been declared with a property whose name is the slot name prefixed by the default prefix https://w3id.org/sssom/undefined_extensions/ and a type hint of xsd:string.

to

Upon encountering a non-standard metadata slot that is not a defined extension, an implementation that supports undefined extensions MUST behave as if the slot had been declared with a property whose name is the slot name prefixed by the default prefix http://sssom.invalid/ and a type hint of xsd:string.

I.e., the default namespace for undefined extensions is http://sssom.invalid/ instead of https://w3id.org/sssom/undefined_extensions/.

Rationale: It doesn’t seem right that extensions that are completely undefined end up in the official namespace for SSSOM – it makes them look like they are valid properties supported by the standard, which is exactly what they are not. Using the .invalid TLD makes it more obvious that those IRIs are constructed IRIs that are not supposed to point to anything.
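In code, the amended (3c) amounts to something like the following Python sketch (hypothetical names; defined is assumed to map the slot names of defined extensions to their definitions):

```python
UNDEFINED_NS = "http://sssom.invalid/"


def definition_for(slot_name, defined):
    """Return the definition of a non-standard slot, synthesising one if undefined."""
    if slot_name in defined:
        return defined[slot_name]
    # Undefined extension: constructed property IRI and a string type hint.
    return {"property": UNDEFINED_NS + slot_name, "type_hint": "xsd:string"}
```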

@matentzn
Collaborator

matentzn commented Jan 8, 2024

AWESOME. I have some tiny qualms with some details (property key mandatory), and I am missing an important semantic restriction on the custom slots (custom slots may never break the principle of idempotence, which means that no custom column may change the meaning of any column already present, like "predicate_modifier", "subject_pattern", or "author_role" etc) but I think this thread has lost all linearity now and we should move to a PR, where we can work on the details of the phrasing.

Fantastic work!

@gouttegd
Contributor Author

gouttegd commented Jan 8, 2024

I have some tiny qualms with some details (property key mandatory)

So you would want it to be possible to do:

#extension_definitions:
#  - slot_name: foo

No strong objection to that, but I think this should be discouraged even if supported.

“The property key is OPTIONAL but strongly RECOMMENDED. It allows associating an unambiguous property with the slot, thereby providing a strong indicator of the intended meaning of the slot. If the property key is absent, a default IRI is constructed in the same way as for undefined extensions: by prefixing the slot name with the default namespace http://sssom.invalid/.”

no custom column may change the meaning of any column already present

100% agree, but do note that this is not something an implementation can enforce. A parser would have no way of knowing that a custom column “changes the meaning” of a standard column.

Such a restriction can only be targeted at users, not implementors.

@matentzn
Collaborator

matentzn commented Jan 8, 2024

Sounds good!

@matentzn matentzn added this to the 1.0.0 milestone Jul 22, 2024
gouttegd added a commit that referenced this issue Aug 6, 2024
This commit updates the SSSOM model to allow for _defined extensions_ as
proposed in #328.

It also updates the description of the data model to describe the use of
both defined and undefined extension, and the specification of the
SSSOM/TSV format to explain how SSSOM/TSV parsers and writers should
deal with such extensions.

Overall, this is exactly what was proposed in [this
comment](#328 (comment))
in #328, except that here we need to split the specification in two
parts (one about extensions in general, independently of the
serialisation format, and one about the SSSOM/TSV serialisation of
extensions), while the initial proposition was in a single block.

Co-authored-by: Nico Matentzoglu <[email protected]>