Dealing with non-standard slots #328
Of note: if we decide to keep the current way of encoding additional metadata in the |
Important discussion thanks for rekindling. I like your solution! What about (we discussed this before):
Yes, we’ve discussed it in #263. And so you know what I propose to avoid those issues: requesting that non-standard fields, when someone needs to use them, must be prefixed by That’s the same idea as So the entire idea would be:
If someone has used a non-prefixed name (e.g.
I would personally favour rejecting the set outright, since there is a clear conflict between the spec and the extension declaration. |
I think making this "ext" thing a requirement will make your solution less valuable for some people. We could say: if it is declared as an extension slot it is, by definition, not a standard slot. Even if it sounds the same. It still irks me that this precludes any way to serialise it neatly into RDF. That was nicer in the more complex solution.
I don't think this should happen. I prefer the other two options. But:
Why do you say "overwrite"? If |
To clarify, it’s not a “requirement” – as the spec’s authors, we have no way to require producers of SSSOM to do things one way or another. If people want to use non-standard slots without prefixing with The idea is that if they do not use the So people who don’t like the idea of prefixing their custom slots with For example, if biomappings published mappings with a hypothetical
I think this would introduce needless confusion. That would mean that nobody could be sure of what a slot means, even if it is described in the standard, without first checking the list of declared extension slots.
As I said: needless confusion. We have a standard that assigns an agreed-upon meaning to a handful of slot names. Allowing people to override that and to say, “in my set, the slot foo means whatever I intend it to mean, regardless of what the specification says”, negates the very principle of a standard exchange format. Again: under my proposition, people would be free to use whatever slot names they want. As long as the name (1) is declared and (2) does not clash with a standard name, implementations will be required to preserve such extra slots. The recommendation to prefix the non-standard name with |
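To make the "declared and not clashing with a standard name" rule concrete, here is a minimal Python sketch of the check a parser could run on the declared names. The set of standard slots is a small illustrative subset and the function name is hypothetical; nothing here is taken from SSSOM-Py or SSSOM-Java.

```python
# Illustrative subset of standard slot names; a real implementation would
# take the full list from the specification.
STANDARD_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}


def check_declared_slots(declared):
    """Return the declared names, or raise if any clashes with a standard slot."""
    clashes = sorted(set(declared) & STANDARD_SLOTS)
    if clashes:
        # Rejecting the set outright is one of the options discussed above.
        raise ValueError(f"extension slots clash with standard slots: {clashes}")
    return list(declared)
```

With this check, a declaration such as `["ext_reviewer"]` passes, while `["subject_id"]` is rejected before any mapping is read.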
Another clarification: The above does not mean that we (again, the authors of the spec) have to be unhelpful with people who would end up in the situation of having one of their non-standard slot names on the verge of becoming a standard one, if we know about it. For example, let’s say that at some point (after SSSOM 1.0 is out), we are considering adding a unique identifier to each mapping (I know that idea has already been evoked). We create a ticket to discuss the idea, and after some discussion we decide that indeed it would be a good thing, and we propose Then someone intervenes in the discussion to point out that they have already been using And in that situation, I would be completely open to the idea of renaming our future slot (which would not exist yet, so it wouldn’t be a big deal). I would certainly mention to those folks that it would have been better if they had followed our advice and named their non-standard slot |
Ok, sold on your treatment of ext_.
And
While I can see your point clearly here, I am not sure it is as "clear cut" as that. :contributor and :contributor mean something entirely different depending on which base-namespace is defined. It's a normal mechanism in xml and various RDF serialisations. I totally get you want to protect users from such base namespace declarations, and probably you are right, just not as doubtlessly. We could even inject ext_ in the serialisation if a clash occurs in a future version 😂 and would be killed. Ok, let's go for now with your proposal of dealing with slot clashes a bit less "confusingly". |
With a similar logic you could simply dispense with declarations altogether and say: if a slot is not standard (not in the spec), it is a non-standard extension. I don't think we should do that; but the argument could come up.
Yeah, except that the TSV format has no notion of namespace. Column names are in a flat namespace. I see your point and I agree that having “namespaced” column names, similar to XML names, would be nice. But nobody would expect that from a TSV file, and none of the standard tools available to manipulate TSV files (such as Pandas) has support for that. I don’t think it is worth the trouble. Let’s keep the Simple Standard for Sharing Ontological Mappings, well… simple.
Yes. But then it’s all or nothing: either we require that implementations should preserve all non-standard slots, or we don’t require that and implementations are free to discard all non-standard slots. Right now, the behaviour of both Having declared non-standard slots allows existing implementations to keep their default behaviour (of discarding undeclared non-standard slots) while also allowing to require that declared non-standard slots, at least, should be preserved. |
Just to be clear: what I am saying is that the global "extra_slots" establishes a different namespace. I do understand that you find the risk of "confusion" too high as evidenced by the fact that I could not even get that point across.. 😂 Alright, let's go ahead to a concrete proposal then, I think we are on the same page. |
Having given this a bit more thought, I’d like to amend my proposal with the following:
Regarding non-standard slots at the level of the set
I don’t think there is actually a need to do this (as proposed initially):
That is, first declaring Instead, we could simply change the definition of the
i.e. a list of key/value pairs. We could make it into a proper dictionary instead, and use it like this:
This would both make parsing and using those extra metadata easier (no more need to “manually” extract the key and the value downstream of the YAML parser), and remove the need for declaring non-standard slots (by definition, any “sub-slot” under the This is a breaking change compared to the current definition of the Regarding non-standard slots at the level of individual mappings No change to my initial proposal regarding the format: We would still require that non-standard slots be declared (in order for them to be preserved; it would always be OK not to declare non-standard slots, but then implementations would discard them) in the top-level metadata, and then they can appear as normal columns in the TSV section.
But we should then clarify what to do with those This would imply that under this proposal, the |
Just to clarify: you are saying that:
would NOT become after denormalisation:
But instead remain the same as it was (unchanged). |
Yes. Assuming it is possible to have dictionary-typed slots in LinkML, but after looking at the documentation (after writing the previous comment unfortunately), I have seen nothing to suggest that it is the case – LinkML has support for lists (through the If LinkML really does not have support for dictionary-typed values, it can only be by design (I cannot imagine that dictionaries would simply have been overlooked), which means I am not going to request that such a support be added as a new feature. Which in turn means that this entire idea can be scrapped.
Not in the TSV representation.
Yes. To try giving a complete example in one go: if we consider the following TSV+YAML file:
would lead to an in-memory representation as follows:
That is, the set-level |
This commit implements the reading part of the behaviour proposed in mapping-commons/sssom#328 to deal with non-standard slots. It supposes that the SSSOM data model has been extended to add two new mapping-level metadata slots:
* extra_set_slots, to declare set-level non-standard slots;
* extra_slots, to declare mapping-level non-standard slots.
If the extra_set_slots slot is present, then when a non-standard slot is encountered at the set level, it is:
* discarded (as before) if the non-standard slot is not declared in the extra_set_slots list;
* added as a new key/value pair to the special __extra dictionary (specifically intended to hold non-standard slots) if it is declared in the extra_set_slots list.
Likewise, if the extra_slots slot is present, then if the TSV section contains a non-standard column, the values from that column are:
* discarded (as before) if the name of the column is not declared in the extra_slots list;
* added as a new key/value pair to the special __extra dictionary associated with each mapping, if the name of the column is declared in the extra_slots list.
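The mapping-level half of that reading behaviour can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of standard columns is a small subset, the function name is hypothetical, and the code is not the SSSOM-Java implementation the commit refers to.

```python
import csv
import io

# Illustrative subset of standard columns (assumption, not the full spec list).
STANDARD_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def read_mappings(tsv_text, extra_slots):
    """Parse a TSV section, keeping declared non-standard columns in __extra."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mappings = []
    for row in reader:
        mapping = {"__extra": {}}
        for column, value in row.items():
            if column in STANDARD_COLUMNS:
                mapping[column] = value
            elif column in extra_slots:
                # Declared non-standard column: preserved.
                mapping["__extra"][column] = value
            # Undeclared non-standard column: discarded, as before.
        mappings.append(mapping)
    return mappings
```

Calling `read_mappings(text, ["ext_score"])` on a file with an `ext_score` column and an undeclared `junk` column keeps the former under `__extra` and drops the latter.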
This commit implements the writing part of the proposed behaviour in mapping-commons/sssom#328 for dealing with non-standard slots.
If the __extra special dictionary at the mapping set level contains entries whose keys are declared in extra_set_slots, those entries are written with the set metadata as if they were normal slots. They are written in key alphabetical order, after all the other slots.
If the __extra special dictionary at the mapping level contains entries whose keys are declared in extra_slots, those entries are written with the mappings metadata in the TSV section as supplementary columns. The columns are written in key alphabetical order, after all the other columns.
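The writing side for the TSV section could look roughly like the sketch below: standard columns first, then declared extra columns in alphabetical order. As with any illustration here, the column list is an assumed subset and `write_mappings` is a hypothetical name, not an existing API.

```python
# Illustrative subset of standard columns, in spec order (assumption).
STANDARD_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def write_mappings(mappings, extra_slots):
    """Serialise mappings to TSV, appending declared extra columns alphabetically."""
    extra_columns = sorted(
        {key for m in mappings for key in m.get("__extra", {}) if key in extra_slots}
    )
    # Only write the standard columns that are actually used.
    standard = [c for c in STANDARD_COLUMNS if any(c in m for m in mappings)]
    lines = ["\t".join(standard + extra_columns)]
    for m in mappings:
        cells = [m.get(c, "") for c in standard]
        cells += [m.get("__extra", {}).get(c, "") for c in extra_columns]
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"
```

Entries in `__extra` whose keys are not declared in `extra_slots` are silently left out, which mirrors the discard-by-default behaviour on the reading side.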
OK, I’ve played a bit with various ideas and in particular I have fully implemented my initial proposal in SSSOM-Java (in a test branch – nothing definitive now, this is just for testing and exploring)… and I would like to discuss a bit more before we commit to anything. In particular, in my initial proposal, the idea was to do as follows for non-standard slots at the set level:
That is, we distinguish non-standard slots that the user wants to be preserved by requiring that such slots are declared as “slots-to-be-preserved”. Upon encountering a declared non-standard slot, the parser MUST preserve it, while non-declared non-standard slots are automatically discarded. But now that I have implemented this behaviour, the more I look at it the more ridiculous it seems to me: this two-step process (first declaration, then use) is cumbersome. Why couldn’t we simply do something like:
That is, instead of requiring that non-standard slots be declared before they are used, we require that they are confined to a dedicated part of the metadata block (tentatively named (This would not apply to non-standard slots at the mapping level. We would still require such slots to be declared in order for them to be preserved rather than discarded. We can’t really use the |
Another point is that I now think it would not necessarily be unreasonable to recommend that implementations preserve all non-standard slots, without requiring any declaration at all (possibly as an optional behaviour). I am not saying that this is what we should do (I’d be in fact slightly reluctant to do it as this would be the opposite of the existing behaviour of both SSSOM-Py and SSSOM-Java), but it’s something to be considered. |
Before going any further, could we get a decision on the following 4 propositions?
A) The use of non-standard slots is unsupported. Only the slots listed in the spec are guaranteed to always be preserved. Implementations are free to do whatever they want (including discarding without even a warning) with any non-standard slot they encounter. If you want to store/exchange additional metadata for which no standard slot exists, and if you want to be sure said metadata is always preserved no matter what, your only solution is to encode the metadata as That is basically the current situation.
B) Implementations MUST preserve any non-standard slot. The YAML metadata block can always contain non-standard keys and the TSV section can always contain non-standard columns. Implementations MUST always preserve such slots without requiring the user to do anything special. That is basically the opposite of the current situation. Note that in this case, the
C) Implementations MAY/SHOULD preserve any non-standard slot. Similar to B, with the big difference that implementations are not required to support non-standard slots. For implementations that do support non-standard slots, it is up to them to decide whether this should be the default behaviour or not. Slightly easier for implementers (who can decide to avoid implementing that feature if they want). More annoying for users as it means they can’t be sure whether their additional metadata will be preserved, since that would depend on whether the implementation supports it or not. (Whether this is really a problem is debatable: possibly, non-standard metadata will only be meaningful within the confines of a given project anyway, so it may not be a big deal if they are lost once the mapping set is re-used outside of that project.)
D) Implementations MUST preserve any non-standard slot that is clearly identified as being “to-be-preserved”. There is a mechanism to “declare” non-standard slots, and users who want to be sure their additional metadata are preserved (but who don’t want to encode said metadata in the That is basically my initial proposal. I deliberately skip the details of how non-standard slots must be declared for now.
For what it’s worth, as the SSSOM-Java developer I tend slightly toward D, but I would have no objection to implementing B or C (A doesn’t require any implementation as it’s the existing behaviour – obviously I would have no objection to doing nothing :D ). |
Awesome analysis, I know you know that :) It is difficult for me to support one solution over another. Before I give you my opinion, I hope you don't mind me returning to an earlier point first, and forgive me it is about the how rather than the what, which is I agree, a separate thing we should decide after we make an informed decision about the what. This pertains to your 7 November comment above.
Here I gave two examples with dc:modified or foaf:funded_by. This could be, or I daresay is, important because many people that would consider implementing a standard such as SSSOM are deeply familiar with metadata modelling, and would be happy if their RDF serialisation would end up mapped to standard properties (such as the popular dublin core). Note that the FAIR Impact, Elixir and EOSC life communities, for example, are pretty keen on the RDF representation in particular. (EDIT I just deleted a lengthy paragraph I wrote about in-TSV schema extensions thinking it a bit over the top for that early in the year). Regarding the four proposals: I am slightly (not strongly) biased against A and B.
With C and D, I have the following thoughts for now:
I agree. I mentioned A merely for completeness – and also as a reminder that it’s the only thing we’ll have unless and until we come up with a better solution.
Also for completeness: we could define a mechanism by which non-standard slots and columns are encoded into the That is, if we have a set as follows:
implementations could automatically transform it into:
And of course there would be an operation to do the opposite, i.e. to “decode” the contents of the This would allow users of a set containing non-standard metadata to be sure that their non-standard metadata “survive” through any SSSOM-compliant tool when needed, without requiring any change to the existing model. That’d be kind of a hack piled on top of another hack, though. Still, it’s a possibility.
OK, let’s rule this one out.
How about: Upon encountering a non-standard slot (i.e. an unknown key in the set-level metadata block, or an unknown column in the TSV section):
That is, the following set:
would have, after parsing, two non-standard set-level metadata:
and each mapping would have two non-standard mapping-level metadata:
Would that solve your RDF serialisation issue? |
The last proposal above (treating the non-standard slot name as a Curie) is actually quite orthogonal to the question of requesting non-standard slots to be declared. The example I gave was for undeclared slots, but the same logic would work for declared non-standard slots:
Or, instead of making a distinction between declared and undeclared slots, we could make the distinction between “slots whose name is a Curie” and “slots whose name is not a Curie” – only the former would be REQUIRED to be preserved, implementations would be free to discard the latter. That is, with that set again:
Only the |
OK, here is my current proposition in 5 points.
1. Support for non-standard metadata is OPTIONAL.
Implementations are only REQUIRED to support the standard, spec-defined slots. Support for any kind of non-standard slots is always OPTIONAL. This is based on what you said above: “a specification conformant implementor (for example a browser, a registry API) should be able to decide that it only supports specification internal slots”.
2. OPTIONAL, but RECOMMENDED support for DECLARED EXTENSIONS
It is possible to declare non-standard metadata slots using a mechanism similar to the one proposed in #263. That is, the set gets a new metadata slot called
Where There would be a single So with the above declarations, a set like this:
would end up with the following extended metadata pairs:
I’d like to impose two additional constraints:
A. Non-standard metadata slots can only contain string values. No nesting of complex data structures (lists or dictionaries). That is, it would be forbidden to do things like this, even if
B. The Implementations would be strongly encouraged (“SHOULD”) to support such declared non-standard extensions, though (as stated in 1. above) it would still be optional.
3. FULLY OPTIONAL support for UNDECLARED EXTENSIONS
Additionally, it would also be possible for an implementation to support undeclared extensions. When the parser encounters a non-standard That is, a non-standard slot like this:
would be treated as if the metadata block contained the following declaration:
Support for such undeclared non-standard slots would be purely optional (MAY, not SHOULD). The rationale for supporting undeclared extensions is notably to allow sets created before the introduction of the
4. Overriding standard slots is strictly forbidden
It is forbidden to declare an extension slot with the same name as an existing standard slot. That is, something like this
MUST be rejected as invalid. Rationale: Since support for extensions is optional, not all implementations will use the
5. Serialisation details
It is RECOMMENDED that a SSSOM/TSV writer adheres to the following rules when writing a set containing non-standard metadata:
This is intended to ensure that a mapping set can always be written in a predictable way regardless of the presence of non-standard slots (same reason why we already recommend that standard slots are written in the order they are listed in the spec). This does not concern the parser. A parser that supports non-standard extensions MUST support them independently of the order in which they appear in the file. |
Alternative proposition (based on the idea of allowing CURIEfied slot names, as discussed here):
1. Support for non-standard metadata is OPTIONAL
Implementations are only REQUIRED to support the standard, spec-defined slots. Support for any kind of non-standard slots is always OPTIONAL. Rationale: There is an explicit desideratum that “a specification conformant implementor (for example a browser, a registry API) should be able to decide that it only supports specification internal slots”.
2. OPTIONAL, but RECOMMENDED support for NAMESPACED EXTENSIONS
It is possible to use non-standard slots provided that:
Example:
The set contains the non-standard key/value pair { Implementations are strongly encouraged (SHOULD) to support such namespaced extensions.
3. FULLY OPTIONAL support for NON-NAMESPACED EXTENSIONS
Additionally, an implementation that supports NAMESPACED EXTENSIONS can also decide to support NON-NAMESPACED EXTENSIONS. When a slot is both a) not described by the standard and b) not in a CURIE form (no prefix), an implementation MAY decide to keep the metadata. In that case, it should construct an IRI for the metadata by catenating the default namespace Example:
The set contains the non-standard key/value pair {
4. Non-standard slots can only contain string values
Whether namespaced or not, a non-standard slot can only contain simple string values. No nesting of complex data structures such as this:
5. Serialisation considerations
To ensure that a set can be written in a predictable way, it is RECOMMENDED that a SSSOM/TSV writer adheres to the following rules when writing a set containing non-standard slots (whether namespaced or not):
Comments
Compared to the other proposition, this one has two advantages:
The disadvantage is the need to use CURIEfied slot names. This should not be a problem for mapping-level slots (nothing wrong in having a colon in the middle of a column name, all TSV parsers should be able to handle that), but for set-level slots, this requires that the names are escaped, since the colon is a special character in YAML. That is,
because the following would be invalid YAML:
Whether this is a big deal or not, I don’t know.
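For readers who want to see how points 2 and 3 of this alternative proposition would play out in code, here is a rough Python sketch that turns a slot name into a property IRI. The default namespace URL, the function name, and the choice to raise on an undeclared prefix are all assumptions made for the example; the proposition itself does not fix them here.

```python
# Hypothetical default namespace for non-namespaced extensions (assumption).
DEFAULT_NAMESPACE = "https://example.org/default-extensions/"


def slot_to_iri(slot_name, prefix_map, allow_non_namespaced=False):
    """Expand a CURIE-style slot name to an IRI, or return None to discard it."""
    prefix, separator, local_name = slot_name.partition(":")
    if separator:
        # Namespaced extension: the prefix must be resolvable.
        if prefix not in prefix_map:
            # Assumption: an undeclared prefix is treated as an error.
            raise ValueError(f"undeclared prefix: {prefix}")
        return prefix_map[prefix] + local_name
    if allow_non_namespaced:
        # Non-namespaced extension: optional support, default namespace prepended.
        return DEFAULT_NAMESPACE + slot_name
    return None
```

An implementation that only supports namespaced extensions would call it with the default `allow_non_namespaced=False` and drop any slot for which it returns `None`.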
Wow man. Ok, let's get started:
Feedback on "Declared extensions" proposal
We are agreed.
Wow, great idea! That would solve everything. Some minor corrections:
Just to be a tiny bit more flexible I would propose the word I agree with everything else, including the constraints on
Agreed, but I would prefer we say "literal" rather than string in the specification, to leave the door open for xsd datatype map later.
It is important we use SHOULD rather than MUST here, but else all agreed.
Other considerations
Hm, smart. A bit over the top IMO to provide a specification here, but good thinking. I was thinking to drop the level of I personally, right this moment, favour the "declared extensions" proposal for the following reasons:
Which is a common use case in the FAIR data world. |
All right, let’s ditch the “namespaced extensions” proposition and polish the “declared extensions” one then.
That’s what I intend to do when I rewrite the spec. For this discussion, I’d rather not assume that everyone here is familiar with RFC 2119 terminology, given that I am the one who had to suggest that it should be used more often.
Agreed. We have already discussed that before, my position is still the same: users are strongly encouraged, when crafting non-standard slot names, to prefix them with
Not sure what you mean here. Note that currently, About literal typing
If you want to allow for explicit typing of non-standard metadata, we should do that from the start and not “in the future“. Putting out a half-baked mechanism, waiting for developers to implement it, and then coming back to them later saying “oh wait, actually non-standard slots should be typed, sorry I don’t give a damn about the time you spent implementing that and I don’t care if you now have to start all over again because the prerequisites have changed!” would be a moronic move. Typing, then. How about:
Random thoughts:
Because of (4), and since according to you the common use case is to be able to mark non-standard slots as being IRIs, I would very much favour a less ambitious system where:
Something like that:
If people need to store other data types (e.g. datetimes, floats, etc.), they can store them as strings and let their use-case-specific applications parse them into the appropriate data type. |
If you really, really want explicit typing, at the very least we should only allow a restricted list of possible types, such as |
Just one example of the kind of mess explicit typing could cause. Let the following mapping set (
(Let’s assume We parse that set. The When the client code wants to do something with the But now let’s say we have to merge the above set with that other set:
Notice how the creator of that second set forgot to specify the type of the Now what happens when we merge the two sets together? First, we have to decide which of the two Now we have a merged set in which the Seems far-fetched? However the only thing needed for that to happen is for someone to forget the And this is not a problem only for statically typed languages like Java. Let’s imagine the same scenario with a duck-typed language like Python: you believe (because that’s what the |
I could be OK with explicit typing if we agree on two things: A. Whatever type is used has to be a “simple” type that can be represented as a string (e.g. dates, floats, booleans, IRIs – in CURIE form or not —, etc. are all good; lists, dictionaries and other complex structures are verboten). B. It is understood that the type indicated in the This would still allow a SSSOM/TSV-to-RDF serialiser to use the type hint to write something like this, which I assume is what you want to be able to do: <EXAMPLE:validation_date rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2024-01-03</EXAMPLE:validation_date> where the datatype would be provided by the type hint. |
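The "type as a hint for the RDF serialiser" idea can be illustrated with a small Python sketch that renders an extension value as an N-Triples literal. It assumes the hint is given in `xsd:` CURIE form and only escapes backslashes and double quotes; both are simplifications for the example, and the function name is hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def to_ntriples_literal(value, type_hint=None):
    """Render a string value as an N-Triples literal, typed if the hint is an XSD type."""
    # Minimal escaping only; a real serialiser would also handle newlines etc.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if type_hint and type_hint.startswith("xsd:") and type_hint != "xsd:string":
        return f'"{escaped}"^^<{XSD}{type_hint[4:]}>'
    # No hint, a string hint, or an unrecognised hint: plain string literal.
    return f'"{escaped}"'
```

The value itself is never parsed or validated: it stays a string in memory, and the hint only decides which datatype IRI is attached on output.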
After further thought, let’s scrap this idea:
That was a ridiculous idea: if we have to store one extra bit of information to record the fact that the slot is an IRI, we might as well store the full type information. |
So let’s try to put this typing thing into shape.
1. Syntax
What I already proposed yesterday, i.e.:
A type of If the (Alternatively, if it is expected that most uses of extension slots will be to store identifiers, we could decide that the default type should be
2. Range of allowed types
Several possibilities here.
A. Only allow a small subset of types
ONLY the following types are allowed:
Rationale: Having a fixed number of cases to deal with will make it easier for implementors and that list should still be enough to cover most use cases.
B. No limitation on allowed types
No restriction on what the In that case, the type is merely a hint (and I would actually suggest renaming to
C. No limitation but basic types are recognised for what they are
Basically a compromise between A and B. There is no restriction on what the
I strongly favour A here. Then again, I’m afraid that the mere possibility of declaring a type will inevitably lead to people putting whatever they want in the
3. Cardinality
Any mapping set or any mapping can only contain one value for any given extension slot. If an extension slot is present more than once (i.e. two keys with the same name in the YAML metadata block or two columns with the same name in the TSV section), the behaviour is unspecified. Implementations are free to (a) reject the set outright, (b) ignore all duplicated slots, (c) accept only the first occurrence of the slot, (d) accept only the last occurrence of the slot, etc.
4. Optionality
Two possibilities here.
A. The typing feature is an integral part of the declared extensions feature. Support for declared extensions is only RECOMMENDED (as already agreed upon above), but an implementation that decides to support declared extensions MUST support typing. The two features cannot be separated.
B. The typing feature is optional. An implementation that supports declared extensions MAY (or maybe SHOULD, but not MUST) support the typing feature. If it does not, then it should treat all extension values as opaque strings.
Option B allows implementers to opt out from having to deal with types at all, while still allowing extension values to be preserved (except for their types). But option A is nicer for users, because they only need to worry about whether their sets will pass through an implementation that supports extensions or not, without having to worry about the supplementary issue of whether the implementation supports typing or not.
Yeah, then, just adjust the "documentation" to say, "wild west".
😱 OMG I never even considered this.. MERGING. I always assumed that custom slots are namespaced by the mapping set they appear in, implicitly, and never considered how complicated the
100% agreed
Great idea, also agreed.
Feedback on final proposal:
Yes, string. Not anyURI
I would personally prefer to just use
I am fine with A, with the caveat that I would like to support
ok by me.
I favour B as the less prescriptive of the two. My tendency is to say: "if you want to do this thing (custom extensions), this is how you can" and "if you are a tool developer and you want to support custom extensions, this is how they would look like. Do as you will".
Pandora's box: merges and derivations
The biggest box you opened here is the question of derived mapping sets, in particular through merge processes. The scenario I am worried about is this (similar to your scenario):
My instinct would say that the slot name should always be namespaced during a merge, no matter what, but this is very inconvenient as you say; merges are most often performed within a project, and it would be crazy if this meant that the slots get artificially namespaced during the merge. So, my instinct is definitely wrong. The only sane thing to do is to check the slot definitions for conflicts and refuse to proceed with the process. Which then puts quite a bit of burden on the |
Well, that was automatically the case in one of my previous propositions:
Under that system and if we choose to use the set’s ID rather than a global, spec-mandated default namespace, a non-standard slot The obvious problem with that solution, though, is that it is then no longer possible to assign existing properties (such as
Fine with me. So
To be clear, the idea in A is that we do not treat the indicated type as merely a hint. We restrict the list of allowed types precisely so that it is reasonably doable to enforce the typing, without having to coerce everything into strings. In B, on the contrary, we do coerce every extension value into a string, without enforcing anything. And in C, we enforce the typing if the indicated type is one of the agreed-upon “basic types”, or coerce into strings otherwise. From the point of view of a developer using a SSSOM parsing library (be it SSSOM-Java, SSSOM-Py, or any future compliant implementation), the difference would be in what the parser provides:
I am really on the fence as to which option is “best”. From the point of view of a developer who would write the actual implementation (i.e., me in the case of SSSOM-Java), option B is clearly the easiest to implement. That is because, in effect, it delegates most of the work to the client code. Not sure what would be best from the point of view of a user.
OK. About merging
That’s one possibility, yes (and, well, it’s not that much of a burden: just iterate through the extension definitions of every set to merge and raise an error if there’s a conflict). Another would be to store the full property name (not merely the slot name) with each extension value. So if you merge this set:
and this one:
You would end up with two extension values:
No problem so far then. The only issue would be: what to do when serialising the merged set? Which slot names to use? One possibility would be to keep using
Not very fancy, but it would at least preserve all important information (the slot names are not really important: the properties they are associated with are). Another problem that can occur when merging, as already discussed above, is if there is a conflict between the indicated types, e.g. one set defines |
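The first possibility mentioned in this comment (iterate through the extension definitions of every set to merge and raise an error on conflict) is short enough to sketch in Python. The dictionary keys `slot_name`, `property` and `type_hint` are names assumed for the example, since the exact field names are not settled at this point of the thread.

```python
def merge_extension_definitions(*definition_lists):
    """Merge extension definitions from several sets, refusing conflicting ones."""
    merged = {}
    for definitions in definition_lists:
        for definition in definitions:
            name = definition["slot_name"]
            known = merged.get(name)
            # Any difference (property or type hint, including a missing hint)
            # counts as a conflict in this sketch.
            if known is not None and known != definition:
                raise ValueError(f"conflicting definitions for extension slot {name!r}")
            merged[name] = definition
    return list(merged.values())
```

Because a definition with a type hint is not equal to the same definition without one, the "someone forgot the type" scenario described earlier is caught at merge time instead of surfacing later in client code.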
Actually it’s not so much that I don’t know which option is best. It’s more that each option is best for a different category of persons. Option A is best for developers using a SSSOM library: all the type checking is done by the library, they can just use whatever values are provided with their (guaranteed) correct types. Of course it means that it is harder for the developers of said SSSOM library. Option B is best for developers writing a SSSOM library: It’s the easiest to implement. But then it’s harder for developers using the library, since they have to do the type checking that the library does not do for them. Option C is (probably) best for users: they are free to use whatever types they want instead of being limited to a few basic types. Type checking for the most commonly used types will be done by the library, so it’s quite nice for downstream developers as well. But it puts most of the burden on the developers of SSSOM libraries. So who do you want to please the most? :D |
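To give a feel for what option C means for a library, here is a minimal Python sketch: values whose declared type is one of a few recognised basic types are parsed and checked, everything else is handed back as a string. The set of recognised types and the `xsd:` spelling are assumptions chosen for the example, not the list proposed in the thread.

```python
from datetime import date

# Hypothetical set of "basic types" the library knows how to enforce.
PARSERS = {
    "xsd:integer": int,
    "xsd:double": float,
    "xsd:date": date.fromisoformat,
    "xsd:boolean": lambda s: {"true": True, "false": False}[s],
}


def coerce(value, type_hint=None):
    """Parse value according to a recognised type hint, else return it unchanged."""
    parser = PARSERS.get(type_hint)
    if parser is None:
        # Unknown or missing type: treat the value as an opaque string.
        return value
    try:
        return parser(value)
    except (ValueError, KeyError):
        raise ValueError(f"{value!r} is not a valid {type_hint}") from None
```

Dropping the `PARSERS` table entirely gives option B (everything is a string), while rejecting any hint not in the table gives option A.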
I am on the verge of recommending option C at this point, especially since we agreed above that the typing feature would be optional (an implementation can decide to support extensions without supporting typed extensions), meaning developers of SSSOM libraries who don’t want to implement typing can opt out of doing so. (In which case they would treat everything as strings, which is more or less the same thing as option B except that the type hints would not be preserved.) |
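To make the difference concrete, here is a rough Python sketch of what “optional typing” could mean for a library. Everything in it is an assumption made for the example: the `CHECKERS` table, the `read_extension_value` function and its `typed` switch are hypothetical, the two recognised type hints are merely plausible choices, and raising `ValueError` on a malformed value is one possible behaviour among others.

```python
# Hypothetical table of the type hints a library chooses to recognise.
CHECKERS = {
    "xsd:integer": int,
    "xsd:double": float,
}


def read_extension_value(raw, type_hint=None, typed=True):
    """Convert a raw string according to its type hint, when recognised.

    With typed=False (typing not supported), or with an unrecognised or
    missing type hint, the value is returned unchanged as a string.
    """
    checker = CHECKERS.get(type_hint) if typed else None
    if checker is None:
        return raw
    return checker(raw)  # raises ValueError on a malformed value
```

With typing enabled, `read_extension_value("42", "xsd:integer")` gives an integer; with typing disabled, or with a type hint the library does not know, the caller gets the original string back and does any checking itself.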
But I guess there’s an argument to be made that both options A and C are “overkill”, and that maybe we do not want to transform SSSOM into a generic engine for storing typed key/value pairs… Rah, I don’t know. |
In fact, my concern with option C is that I wonder if it would qualify as an instance of the second-system effect… |
All right, here is now my “final” proposition.
1. Support for non-standard metadata is OPTIONAL

a. Implementations are only REQUIRED to support the standard, spec-defined metadata slots. Support for any kind of non-standard slot is always OPTIONAL or RECOMMENDED, but never MANDATORY. Therefore, users must expect that any non-standard metadata slot will be discarded by an implementation that does not support them.

b. It is left at the discretion of implementations to decide whether to print a warning (or give any other kind of signal to the user) when a non-standard metadata slot is discarded.

2. RECOMMENDED support for DEFINED EXTENSIONS

2.1. How to define extension slots

a. It is possible to define non-standard metadata slots using an

b. The format of the
c. The
d. The e. The f. When the g. For convenience, the following two prefix names are added to the list of the built-in prefixes that do not need to be declared in the
h. A set MUST NOT define an extension slot with a

i. To avoid any conflict with a future version of the SSSOM standard (which could introduce new standard slot names), users are strongly encouraged to craft non-standard slot names that start with the

This is advice for users only. Implementations MUST NOT reject a non-standard slot solely on the basis that its name does not start with

Example of an extension_definitions:
- slot_name: funded_by
  property: "FOAF:fundedBy"
  type_hint: "linkml:uriOrCurie"
- slot_name: foo
  property: "EX:fooProperty"
- slot_name: bar
  property: "EX:barProperty"
  type_hint: "xsd:integer"

2.2. Support for defined extensions

a. Implementations SHOULD support defined extensions. For implementations that do support them, whether such support is enabled by default or must be explicitly enabled by a user is left at their discretion.

b. Implementations that do not support defined extensions MUST ignore the

For implementations that do support defined extensions:

c. They MUST check that the

d. Upon encountering a non-standard metadata slot (be it in the top-level metadata block or in the TSV header), implementations MUST check whether the name of the slot matches the

3. OPTIONAL support for undefined extensions

a. An undefined extension is a non-standard metadata slot that is not defined in a

b. Implementations MAY support undefined extensions. It is left at their discretion whether such support is enabled by default or not.

c. Upon encountering a non-standard metadata slot that is not a defined extension, an implementation that supports undefined extensions MUST behave as if the slot had been declared with a property whose name is the slot name prefixed by the default prefix

4. Restrictions on the values of extension slots.

4.1. General restrictions

The following restrictions apply to all extension slots, regardless of whether they are defined or undefined.

a. Each mapping set and each mapping can have at most one value for each extension slot. This means that the same extension slot cannot be present more than once in the top-level metadata block, nor more than once in the TSV header. How implementations behave upon encountering a repeated extension slot is left at their discretion. They MAY either
b. All extension values MUST be representable as literal strings.

4.2. Optional further restrictions for typed defined extension

If a defined extension slot has a

b. If the type hint is

c. If the type hint is

d. If the type hint is

e. If the type hint is

f. If the type hint is

g. If the type hint is

h. If the type hint is

i. Implementations MAY decide to recognise more types and to enforce type-specific constraints. For example, an implementation could recognise the type
5. Recommendations for SSSOM/TSV serialisers

a. All set-level non-standard metadata slots (whether defined or undefined, if the implementation supports undefined extensions) SHOULD be written after all the standard slots in the YAML metadata block. They SHOULD be sorted lexicographically on the property name.

b. All mapping-level non-standard metadata slots (i.e., supplementary columns) should be written after all the standard columns in the TSV section. They SHOULD be sorted lexicographically on the property name.
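As a rough illustration of how sections 2.2, 3 and 5 fit together, here is a minimal Python sketch. It is not taken from any existing implementation: `resolve_property`, `order_extension_slots` and the `allow_undefined` switch are hypothetical names, and `UNDEFINED_PREFIX` is a placeholder, since the actual default prefix is whatever the proposal specifies. The sketch also sorts on the property exactly as written in the definition; an implementation may well prefer to expand CURIEs first.

```python
# Placeholder only: stands in for the default prefix of undefined extensions.
UNDEFINED_PREFIX = "http://example.org/undefined/"


def resolve_property(slot_name, definitions, allow_undefined=False):
    """Return the property for a non-standard slot, or None to discard it."""
    for definition in definitions:
        if definition["slot_name"] == slot_name:
            return definition["property"]  # defined extension
    if allow_undefined:
        return UNDEFINED_PREFIX + slot_name  # undefined extension
    return None  # unsupported: the slot is discarded


def order_extension_slots(slot_names, definitions, allow_undefined=False):
    """Drop unsupported slots and sort the others on their property name."""
    resolved = []
    for name in slot_names:
        prop = resolve_property(name, definitions, allow_undefined)
        if prop is not None:
            resolved.append((prop, name))
    # Tuples sort on the property first, giving a lexicographic order.
    return [name for _prop, name in sorted(resolved)]
```

A serialiser would write the standard slots (or columns) first, then the names returned by `order_extension_slots`.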
Small amendment to the “final” proposition: Change (3c) from :
to
I.e., the default namespace for undefined extensions is Rationale: It doesn’t seem right that extensions that are completely undefined end up in the official namespace for SSSOM – it makes them look like they are valid properties supported by the standard, which is exactly what they are not. Using the |
AWESOME. I have some tiny qualms with some details ( Fantastic work! |
So you would want it to be possible to do:
No strong objection to that, but I think this should be discouraged even if supported. “The
100% agree, but do note that this is not something an implementation can enforce. A parser would have no way of knowing that a custom column “changes the meaning” of a standard column. Such a restriction can only be targeted at users, not implementors. |
Sounds good! |
This commit updates the SSSOM model to allow for _defined extensions_ as proposed in #328. It also updates the description of the data model to describe the use of both defined and undefined extensions, and the specification of the SSSOM/TSV format to explain how SSSOM/TSV parsers and writers should deal with such extensions.

Overall, this is exactly what was proposed in [this comment](#328 (comment)) in #328, except that here we need to split the specification in two parts (one about extensions in general, independently of the serialisation format, and one about the SSSOM/TSV serialisation of extensions), while the initial proposition was in a single block.

Co-authored-by: Nico Matentzoglu <[email protected]>
Among the issues that I believe should be decided along the road to 1.0 is: How should implementations deal with the possible presence of non-standard slots (slots not described anywhere in the specification and/or the schema) in a mapping set?
That is, considering this mapping set (note the presence of a non-standard foo slot in the set metadata and of non-standard bar and baz slots in the mappings metadata):
what should implementations do?
The current behaviour of both sssom-py and sssom-java is to ignore the non-standard slots (silently, in the case of sssom-java; sssom-py emits a warning) and to accept the rest of the set. That is, running this set through (for example) will produce
I think this behaviour is sane and should be explicitly prescribed in the specification (“Implementations SHOULD ignore any non-standard slot not described in this specification; they MAY print a warning upon encountering any such slot.”).
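The suggested wording boils down to something like the following Python sketch. It is illustrative only and does not reproduce the actual code of sssom-py or sssom-java: `drop_non_standard` is a hypothetical function, and `STANDARD_SLOTS` is a small hand-picked subset of standard slots, where a real implementation would take the full list from the SSSOM schema.

```python
import warnings

# Illustrative subset only; the real list comes from the SSSOM schema.
STANDARD_SLOTS = {
    "mapping_set_id",
    "license",
    "subject_id",
    "predicate_id",
    "object_id",
    "mapping_justification",
}


def drop_non_standard(metadata, standard_slots=STANDARD_SLOTS):
    """Return a copy of metadata without non-standard slots.

    A warning is emitted for each discarded slot (the MAY part of the
    proposed wording); the rest of the set is accepted as is.
    """
    kept = {}
    for slot, value in metadata.items():
        if slot in standard_slots:
            kept[slot] = value
        else:
            warnings.warn(f"ignoring non-standard slot {slot!r}")
    return kept
```

An implementation that prefers to stay silent would simply omit the `warnings.warn` call.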
A related issue is: If a SSSOM producer does want to include additional metadata for which there is no suitable standard slot, and wants that additional metadata to be preserved by the existing tooling (instead of being ignored and discarded as above), what should be the proper way to do that?
Currently, the official way is through the (poorly specified) other slot, which accepts a “list of key value pairs for properties not part of the SSSOM spec” and “can be used to encode additional provenance data”. So, according to the current spec, the example above should have been:

This method of encoding additional metadata doesn’t look great, which is probably why there is a proposition to replace it, in which additional slots would be explicitly defined in an extension_definition slot:

I think that whatever method we think is best to encode additional metadata, it should be decided in time for 1.0.
I’d like to propose a simpler method than the one drafted in #263. I like the idea of explicitly declaring the additional metadata but I don’t see the point of specifying the type of value. Non-standard metadata would be tied to a particular use-case and to particular users, so it should be up to them to know what to expect in any non-standard field that they need.
Instead, I propose that non-standard slots should simply be declared as a list of slot names:
The effect of such a declaration would be that the declared slots should be preserved exactly as they are, instead of being ignored. That is, the mapping set above should be “round-trippable” through sssom-py parse, because the foo, bar, and baz slots, being declared, should not be discarded.

Thoughts?