-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Last hour fixes #91
base: master
Are you sure you want to change the base?
Last hour fixes #91
Conversation
…calculating the top-level digest
…hat create inconsistencies or are plainly incorrect now
…ell as consistent with RefSeq Sequences
… to me this should depend on the use case/implementation
@@ -725,7 +725,7 @@ This digest allows the comparison to be pre-computed, and more easily compared. | |||
- inherent: false | |||
- collated: false | |||
- passthru: false | |||
- transient: false | |||
- transient: true or false (depending on the use case/implementation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The content is a duplicate of the sequences
attribute, except for the sorting. IMO, the considerations to make this transient
or not might include 1) storage space on server 2) performance/overhead on server of just-in-time sorting on requests, 3) whether client-side sorting is problematic for the use-case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsheff Did you review this? If so, please resolve
Decisions from today:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few clarifications suggested on your edits, around the same_order
section of the comparison results.
|
||
Because the sequence array would contain the same sequences, the comparison function will count them all as matching. | ||
Similarly, the names arrays contain the same names and so all will be counted as a match. | ||
However, the same_order will *not* be true, so it will yield false. | ||
However, the same_order will *not* be true; it will yield false for all attributes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is again incorrect. Here's an example:
it does not yield false for all attributes. The lengths and sequences are in the same order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it will yield false only for the names
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
EDIT: I agree, based on your original example. My idea was to change the example slightly (which also unfortunately makes it more complex) so that the line "This is the same output as a comparison of two sequence collections in different orders" becomes true (see comment below)
@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com | |||
|
|||
## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | |||
|
|||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | |||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This correctly is not correct. This example was not about different orders, just about a name swap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to change the example slightly, see: https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123
@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com | |||
|
|||
## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | |||
|
|||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | |||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? | |
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but where two of the sequences have swapped names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another consequence of changing the example (see https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123)
|
||
Because the sequence array would contain the same sequences, the comparison function will count them all as matching. | ||
Similarly, the names arrays contain the same names and so all will be counted as a match. | ||
However, the same_order will *not* be true, so it will yield false. | ||
However, the same_order will *not* be true; it will yield false for all attributes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, the same_order will *not* be true; it will yield false for all attributes. | |
However, the same_order will *not* necessarily be true; it will yield false for at least one attribute (the `names`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another consequence of changing the example slightly
@sveinugu I made a few suggestions on top of yours. Can you accept/confirm my corrections to your edits now? Then we can merge your changes. |
|
||
Because the sequence array would contain the same sequences, the comparison function will count them all as matching. | ||
Similarly, the names arrays contain the same names and so all will be counted as a match. | ||
However, the same_order will *not* be true, so it will yield false. | ||
However, the same_order will *not* be true; it will yield false for all attributes. | ||
|
||
This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same output as a comparison of two sequence collections in different orders, without the name swap.
My main problem is with this line. In the example where everything is the same except for a name swap, then the output will NOT be the same as a comparison of two sequence collections in different order. As you mention above, in the main example, only the same_order
value for the names
attribute will be false
. In the comparison of two sequence collections in different orders, all same_order
values will be false
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for spamming you with single comments, I should have started the batch review straight away.
So I think we are on the same page except for the last example in the "Compare two sequence collections" (under "limitations"). I suggest to change the example to make the sentence "This is the same output as a comparison of two sequence collections in different orders" correct, which I do not think it is now, and all the small changes are just a consequence of that major change. However, there might be a simpler way to get the message across, as my suggestion makes the whole text slightly more complidated. Either way, perhaps adding an explicit JSON example will simplify things?
@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com | |||
|
|||
## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | |||
|
|||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | |||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to change the example slightly, see: https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123
@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com | |||
|
|||
## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays | |||
|
|||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names? | |||
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another consequence of changing the example (see https://github.com/ga4gh/refget/pull/91/files/fc9624d21aef1486488fe89e5cfe1f3885848f29#r1969833123)
Changes from @nsheff I agree with Co-authored-by: Nathan Sheffield <[email protected]>
@@ -66,5 +66,5 @@ This is the same output as a comparison of two sequence collections in different | |||
|
|||
In this particular example, these results can be distinguished by the `sorted_name_length_pairs` attribute, because this would yield a perfect match for the second example, where all the pairs are intact but in a different order -- but it would NOT yield a match for the example with swapped names, because the name-length pairs would be different. | |||
|
|||
This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent array, similar to `sorted_name_length_pairs`, but including *all* attributes for each element rather than just the names and lengths. The comparison function would then immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections. | |||
This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent collated attribute, similar to `sorted_name_length_pairs`, but including *all* collated attributes for each element rather than just the names and lengths. The comparison function would then immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsheff The Github review process confused me. Here, I approved my own new suggestion, please review this independently.
@nsheff @andrewyatz @tcezard
I should of course have done this earlier, but I have now read the spec detailedly and found a number of issues, errors, and inconsistencies.
The two most important of which is that:
the recommended bas schema includes the
accession
attribute. While I am not against this attribute per se, the issue on this is still open (New schema term: accessions #79) and there is nothing in the decision record about this. On the top of my head, I have at least two issues with this attribute (other than the formalities): 1. Should it allow for more than one accession per sequence? 2. Should the accessions somehow be namespaced, e.g. be in CURIE form?The
length
attribute is now digested, even though it is notinherent
. Related to this is the presentation of the digestion algorithm, where I think the filtering step should be moved to later in order to allow for the creation of level 1 digests for inherent attributes.Please look at the different commits for other issues.
Hope there is still time to fix these issues!