Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Last hour fixes #91

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions docs/compare_collections.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ Two collections meet the criteria for order-relaxed identity if:

1. the value of the `elements.total.a` and `elements.total.b` match, (the collections have the same number of elements).
2. this value is the same as `elements.a_and_b.<attribute>` for all attributes (the content is the same)
3. any entries in `elements.a_and_b-same-order.<attribute>` may be true (indicating the order matches) or false (indicating the order differs)
3. entries in `elements.a_and_b-same-order.<attribute>` may be either all true (indicating the order matches) or all false (indicating the order differs)

Then, we know the sequence collection content is identical, without controlling for order.

Expand All @@ -56,15 +56,15 @@ Now that you know the basics, and once you have an understanding of what the com

## Limitation of the comparison function: distinguishing out-of-order from mismatched arrays

One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that are exact duplicates except for two component sequences that have swapped names?
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This correctly is not correct. This example was not about different orders, just about a name swap.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but in different orders, AND where in addition two of the sequences have swapped names?
One limitation of the comparison function is that it does comparisons at the level of arrays, not at the level of individual elements. What would the comparison function return for two sequence collections that have the same content, but where two of the sequences have swapped names?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Because the sequence array would contain the same sequences, the comparison function will count them all as matching.
Similarly, the names arrays contain the same names and so all will be counted as a match.
However, the same_order will *not* be true, so it will yield false.
However, the same_order will *not* be true; it will yield false for all attributes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is again incorrect. Here's an example:

https://refget.databio.org/comparison/fLf5M0BOIPIqcfbE6R8oYwxsy-PnoV32/E6zGtGuc8wKYmCMw5gaLW3ppyXsoO6p4

it does not yield false for all attributes. The lengths and sequences are in the same order.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it will yield false only for the names attribute.

Copy link
Collaborator Author

@sveinugu sveinugu Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

EDIT: I agree, based on your original example. My idea was to change the example slightly (which also unfortunately makes it more complex) so that the line "This is the same output as a comparison of two sequence collections in different orders" becomes true (see comment below)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
However, the same_order will *not* be true; it will yield false for all attributes.
However, the same_order will *not* necessarily be true; it will yield false for at least one attribute (the `names`).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another consequence of changing the example slightly


This is the same output as a comparison of two sequence collections in different orders, without the name swap. This is a fundamental limitation of the array-based method of comparing.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same output as a comparison of two sequence collections in different orders, without the name swap.

My main problem is with this line. In the example where everything is the same except for a name swap, then the output will NOT be the same as a comparison of two sequence collections in different order. As you mention above, in the main example, only the same_order value for the names attribute will be false. In the comparison of two sequence collections in different orders, all same_order values will be false!


In this particular example, these results can be distinguished by the `sorted_name_length_pairs` attribute, because this would yield a perfect match for the second example, where all the pairs are intact but in a different order -- but it would NOT yield a match for the example with swapped names, because the name-length pairs would be different.

This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent array, similar to `sorted_name_length_pairs`, but including *all* inherent arrays for each element rather than just the names and lengths. As this would be automatically picked up by the comparison function, it would immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections.
This solves the issue for swapped names, but there is still potential for problems with other arrays or custom attributes. Therefore, we warn users that when the `_same_order` is flagged as false, this *does not imply that the pairs are intact*, and if this is a requirement, further investigation would be necessary. If distinguishing these scenarios is important, one possible solution would be to add another non-inherent array, similar to `sorted_name_length_pairs`, but including *all* attributes for each element rather than just the names and lengths. The comparison function would then immediately provide an answer as to whether the annotated sequence elements match *as units* between two collections.

Loading