Clarify inclusivity of slice-indexing #51

Closed
johnkerl opened this issue Nov 30, 2022 · 3 comments

@johnkerl
Member

The spec says things like

slice - per-dimension slice, expressed as a scalar, a range, an Arrow array or chunked array of scalar, or a list of both.

-- we need to make this clear. I believe doubly inclusive is the right way to go but this issue will involve getting consensus on a soma-spec PR.

@johnkerl
Member Author

johnkerl commented Dec 2, 2022

Background and proposals from today's weekly spec meeting.

Background

  • A nice-to-have is to conform with users' expectations (Principle of Least Surprise at the language level)
    • Python:
      • Arrays are indexed 0-up
      • Slices are singly inclusive: [2:4] means indices 2 and 3
      • More generally [0:n] specifies n things
      • Negative indices are aliased by n: x[-2] is the same as x[n-2] where n is the array length
    • R:
      • Arrays are indexed 1-up
      • Slices are doubly inclusive: [2:4] means indices 2, 3, and 4
      • More generally [1:n] specifies n things
      • Negative indices are for exclusion: seq(1,5)[c(-2,-4)] results in 1, 3, 5
  • TileDB-Py precedents:
    • TileDB-Py arrays support singly inclusive (half-open) ranges when using the [...] (__getitem__/__setitem__) indexer.
    • The multi_index/df indexer uses doubly inclusive ranges
    • The multi_index indexer works the way it does for the same reason that label-based slicing in pandas (.loc) and xarray (.sel) is doubly inclusive -- in particular, what would the semantics of a half-open query look like for floating-point or string coordinate queries?
  • SOMA:
    • Our NdArray classes are always int-indexed
    • Our DataFrame classes are by default int-indexed, but will be able to be indexed by other types
    • This opens up an opportunity to use doubly inclusive for SOMA DataFrame objects and singly inclusive for SOMA NdArray objects -- tying the exclusivity choice to the array type
  • Error-prone-ness:
    • Users will make off-by-one errors, and off-by-ones can be hard to detect
    • In densely occupied data, users might ask for [0:4] and realize “Hey those are four things not five” (or vice versa). But for longer-length queries, and especially with sparse data, people might not even know they’re off by one -- doing [0:1000] might produce 164 things rather than 163 or 165 and there’s no clear indicator that they got it wrong
    • Tying the exclusivity choice to the array type will be error-prone
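
To make the off-by-one concern above concrete, here is a minimal sketch in plain Python. The coordinate values and the helper names `select_half_open` and `select_closed` are made up for illustration; they are not TileDB-Py or SOMA functions.

```python
# Hypothetical sparse coordinates: only some integer positions are occupied.
occupied = [0, 3, 7, 250, 999, 1000, 1400]


def select_half_open(coords, lo, hi):
    """Python-style slice: lo is inclusive, hi is exclusive."""
    return [c for c in coords if lo <= c < hi]


def select_closed(coords, lo, hi):
    """R-style / multi_index-style slice: both endpoints are inclusive."""
    return [c for c in coords if lo <= c <= hi]


# Dense case: the difference is easy to spot (four things versus five).
dense = list(range(10))
dense_half_open = select_half_open(dense, 0, 4)  # [0, 1, 2, 3]
dense_closed = select_closed(dense, 0, 4)        # [0, 1, 2, 3, 4]

# Sparse case: both results look plausible, so a wrong choice goes unnoticed.
sparse_half_open = select_half_open(occupied, 0, 1000)  # excludes 1000
sparse_closed = select_closed(occupied, 0, 1000)        # includes 1000

# Non-integer coordinates: a closed range needs no notion of "the next value".
labels = ["apple", "banana", "cherry"]
label_closed = select_closed(labels, "apple", "banana")  # ["apple", "banana"]
```

In the sparse case the two conventions return five and six coordinates respectively, and nothing in either result signals which convention was applied. For string coordinates, a half-open query that should include "banana" would need some value just past "banana" as its right endpoint, which is the awkwardness the multi_index design avoids.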

Proposals

  • Language-dependence
    • We are not going to have indexing 0-up or 1-up depending on whether the user uses TileDB-SOMA-Py or TileDB-SOMA-R
    • If we stick with always being 0-up (which we will) but let the implementation language choose the right-endpoint inclusivity then we lose the invariant that indexing up to n results in n things
    • In short, since we're not willing to accommodate all Python/R indexing conventions, it would be worse to accommodate only some of them
    • Thus we want to have the indexing semantics be the same for TileDB-SOMA-Py and TileDB-SOMA-R (and any future implementation languages)
    • Which means in particular that indexing semantics belong in the language-independent SOMA spec
  • Conventions:
    • Arrays are 0-up for any implementation language
    • Negative indexes are explicitly disallowed for any implementation language
    • Slice left endpoints are inclusive
    • Slice right endpoints are inclusive or exclusive depending on the indexer method presented to the user
      • E.g. not experiment.obs[m:n] but rather experiment.obs.method_name_goes_here[m:n]
      • These names will allow users to choose the semantics
      • If we ever need to change anything over time in a migration path, we will have old method names which we deprecate and new ones we can introduce
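
One way the conventions above could look in Python is sketched below. This is an illustration only, not the SOMA API: the class `ToyObs` and the indexer names `closed` and `half_open` are hypothetical placeholders for the still-undecided `method_name_goes_here`.

```python
class _Indexer:
    """Turns obj.name[m:n] into a selection with explicit endpoint semantics."""

    def __init__(self, coords, right_inclusive):
        self._coords = coords
        self._right_inclusive = right_inclusive

    def __getitem__(self, key):
        if not isinstance(key, slice) or key.step is not None:
            raise TypeError("this sketch only supports [m:n] slices")
        if key.start is None or key.stop is None:
            raise ValueError("this sketch requires both endpoints")
        lo, hi = key.start, key.stop
        # Negative indices are disallowed rather than aliased or excluded.
        if lo < 0 or hi < 0:
            raise IndexError("negative indices are disallowed")
        # The left endpoint is always inclusive; the right one depends on
        # which named indexer the user picked.
        if self._right_inclusive:
            return [c for c in self._coords if lo <= c <= hi]
        return [c for c in self._coords if lo <= c < hi]


class ToyObs:
    """Stand-in for something like experiment.obs, with 0-up int coordinates."""

    def __init__(self, coords):
        self._coords = list(coords)

    @property
    def closed(self):  # hypothetical name: right endpoint inclusive
        return _Indexer(self._coords, right_inclusive=True)

    @property
    def half_open(self):  # hypothetical name: right endpoint exclusive
        return _Indexer(self._coords, right_inclusive=False)


obs = ToyObs(range(10))
closed_result = obs.closed[2:4]        # indices 2, 3, and 4
half_open_result = obs.half_open[2:4]  # indices 2 and 3
```

Because the semantics live in the indexer name rather than in a bare `obs[m:n]`, a later change could add a new name and deprecate an old one without silently changing what existing code selects.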

@johnkerl
Member Author

Pending review of #52

@johnkerl
Member Author

johnkerl commented Feb 15, 2023
