Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Last hour fixes #91

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Last hour fixes #91

wants to merge 13 commits into from

Conversation

sveinugu
Copy link
Collaborator

@nsheff @andrewyatz @tcezard

I should of course have done this earlier, but I have now read the spec detailedly and found a number of issues, errors, and inconsistencies.

The two most important of which is that:

  1. the recommended bas schema includes the accession attribute. While I am not against this attribute per se, the issue on this is still open (New schema term: accessions #79) and there is nothing in the decision record about this. On the top of my head, I have at least two issues with this attribute (other than the formalities): 1. Should it allow for more than one accession per sequence? 2. Should the accessions somehow be namespaced, e.g. be in CURIE form?

  2. The length attribute is now digested, even though it is not inherent. Related to this is the presentation of the digestion algorithm, where I think the filtering step should be moved to later in order to allow for the creation of level 1 digests for inherent attributes.

Please look at the different commits for other issues.

Hope there is still time to fix these issues!

@sveinugu sveinugu requested a review from nsheff February 19, 2025 00:15
@@ -725,7 +725,7 @@ This digest allows the comparison to be pre-computed, and more easily compared.
- inherent: false
- collated: false
- passthru: false
- transient: false
- transient: true or false (depending on the use case/implementation)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content is a duplicate of the sequences attribute, except for the sorting. IMO, the considerations to make this transient or not might include 1) storage space on server 2) performance/overhead on server of just-in-time sorting on requests, 3) whether client-side sorting is problematic for the use-case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsheff Did you review this? If so, please resolve

@nsheff
Copy link
Member

nsheff commented Feb 19, 2025

Decisions from today:

  • filtering should be moved to between steps 3 and 4
  • accessions should be removed for now (need to disambiguate sequence accessions from seqcol accessions)

Copy link
Member

@nsheff nsheff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few clarifications suggested on your edits, around the same_order section of the comparison results.


Because the sequence array would contain the same sequences, the comparison function will count them all as matching.
Similarly, the names arrays contain the same names and so all will be counted as a match.
However, the same_order will *not* be true, so it will yield false.
However, the same_order will *not* be true; it will yield false for all attributes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is again incorrect. Here's an example:

https://refget.databio.org/comparison/fLf5M0BOIPIqcfbE6R8oYwxsy-PnoV32/E6zGtGuc8wKYmCMw5gaLW3ppyXsoO6p4

it does not yield false for all attributes. The lengths and sequences are in the same order.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it will yield false only for the names attribute.

Copy link
Collaborator Author

@sveinugu sveinugu Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

EDIT: I agree, based on your original example. My idea was to change the example slightly (which also unfortunately makes it more complex) so that the line "This is the same output as a comparison of two sequence collections in different orders" becomes true (see comment below)

@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com

## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays

One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names?
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This correctly is not correct. This example was not about different orders, just about a name swap.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com

## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays

One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names?
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but where two of the sequences have swapped names?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Because the sequence array would contain the same sequences, the comparison function will count them all as matching.
Similarly, the names arrays contain the same names and so all will be counted as a match.
However, the same_order will *not* be true, so it will yield false.
However, the same_order will *not* be true; it will yield false for all attributes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
However, the same_order will *not* be true; it will yield false for all attributes.
However, the same_order will *not* necessarily be true; it will yield false for at least one attribute (the `names`).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another consequence of changing the example slightly

@nsheff
Copy link
Member

nsheff commented Feb 25, 2025

@sveinugu I made a few suggestions on top of yours. Can you accept/confirm my corrections to your edits now? Then we can merge your changes.


Because the sequence array would contain the same sequences, the comparison function will count them all as matching.
Similarly, the names arrays contain the same names and so all will be counted as a match.
However, the same_order will *not* be true, so it will yield false.
However, the same_order will *not* be true; it will yield false for all attributes.

This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same output as a comparison of two sequence collections in different orders, without the name swap.

My main problem is with this line. In the example where everything is the same except for a name swap, then the output will NOT be the same as a comparison of two sequence collections in different order. As you mention above, in the main example, only the same_order value for the names attribute will be false. In the comparison of two sequence collections in different orders, all same_order values will be false!

Copy link
Collaborator Author

@sveinugu sveinugu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for spamming you with single comments, I should have started the batch review straight away.

So I think we are on the same page except for the last example in the "Compare two sequence collections" (under "limitations"). I suggest to change the example to make the sentence "This is the same output as a comparison of two sequence collections in different orders" correct, which I do not think it is now, and all the small changes are just a consequence of that major change. However, there might be a simpler way to get the message across, as my suggestion makes the whole text slightly more complidated. Either way, perhaps adding an explicit JSON example will simplify things?

@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com

## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays

One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names?
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com

## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays

One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names?
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sveinugu and others added 2 commits February 25, 2025 15:25
Changes from @nsheff I agree with

Co-authored-by: Nathan Sheffield <[email protected]>
@@ -66,5 +66,5 @@ This is the same output as a comparison of two sequence collections in different

In this particular example, these results can be distinguished by the `sorted_name_length_pairs` attribute, because this would yield a perfect match for the second example, where all the pairs are intact but in a different order -- but it would NOT yield a match for the example with swapped names, because the name-length pairs would be different.

This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent array, similar to `sorted_name_length_pairs`, but including *all* attributes for each element rather than just the names and lengths. The comparison function would then immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections.
This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent collated attribute, similar to `sorted_name_length_pairs`, but including *all* collated attributes for each element rather than just the names and lengths. The comparison function would then immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsheff The Github review process confused me. Here, I approved my own new suggestion, please review this independently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants