Clarify inclusivity of slice-indexing #51

johnkerl · 2022-11-30T20:51:28Z

The spec says things like

slice - per-dimension slice, expressed as a scalar, a range, an Arrow array or chunked array of scalar, or a list of both.

-- we need to make this clear. I believe doubly inclusive is the right way to go but this issue will involve getting consensus on a soma-spec PR.

johnkerl · 2022-12-02T20:04:08Z

Background and proposals from today's weekly spec meeting.

Background

A nice-to-have is to conform with users' expectations (Principle of Least Surprise at the language level)
- Python:
  - Arrays are indexed 0-up
  - Slices are singly inclusive: [2:4] means indices 2 and 3
  - More generally [0:n] specifies n things
  - Negative indices are aliased by n: x[-2] is the same as x[n-2] where n is the array length
- R:
  - Arrays are indexed 1-up
  - Slices are doubly inclusive: [2:4] means indices 2, 3, and 4
  - More generally [1:n] specifies n things
  - Negative indices are for exclusion: seq(1,5)[c(-2,-4)] results in 1, 3, 5
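
For reference, the Python conventions listed above can be checked directly. This is only an illustration of the language's built-in list behavior; the variable names are arbitrary.

```python
x = list(range(10, 20))  # ten elements at indices 0..9
n = len(x)

# Half-open slice: [2:4] selects indices 2 and 3 only.
picked = x[2:4]
assert picked == [12, 13]

# [0:n] selects exactly n elements.
assert len(x[0:n]) == n

# Negative indices are aliased by n: x[-2] is x[n-2].
assert x[-2] == x[n - 2]
```
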
TileDB-Py precedents:
- TileDB-Py arrays support singly inclusive (half-open) ranges when using the [...] (__getitem__/__setitem__) indexer.
- The multi_index/df indexer uses doubly inclusive ranges
- The multi_index indexer works the way it does for the same reason that pandas .loc label slicing is doubly inclusive -- in particular, what would the semantics of a half-open query look like for floating-point or string coordinate queries?
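
To make the floating-point/string point concrete, here is a minimal sketch of a closed-interval lookup over sorted coordinates using only the standard library. The helper name `closed_range` and the sample coordinates are made up for illustration; this is not how TileDB-Py implements `multi_index`.

```python
from bisect import bisect_left, bisect_right

def closed_range(sorted_coords, lo, hi):
    """Return every coordinate c with lo <= c <= hi (both endpoints included)."""
    return sorted_coords[bisect_left(sorted_coords, lo):bisect_right(sorted_coords, hi)]

# String coordinates: the caller names the last label they want.
genes = ["ACTB", "CD4", "GAPDH", "MYC", "TP53"]
gene_hits = closed_range(genes, "CD4", "MYC")

# Float coordinates: same call, no need to invent "the next float after 0.5".
values = [0.1, 0.25, 0.5, 0.75]
value_hits = closed_range(values, 0.25, 0.5)
```

With a half-open convention the caller would have to supply some value just past "MYC" or 0.5 to include it, and there is no natural successor to name for strings or floats.
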
SOMA:
- Our NdArray classes are always int-indexed
- Our DataFrame classes are by default int-indexed, but will be able to be indexed by other types
- This opens up an opportunity to use doubly inclusive for SOMA DataFrame objects and singly inclusive for SOMA NdArray objects -- tying the exclusivity choice to the array type
Error-prone-ness:
- Users will make off-by-one errors, and off-by-ones can be hard to detect
- In densely occupied data, users might ask for [0:4] and realize “Hey those are four things not five” (or vice versa). But for longer-length queries, and especially with sparse data, people might not even know they’re off by one -- doing [0:1000] might produce 164 things rather than 163 or 165 and there’s no clear indicator that they got it wrong
- Tying the exclusivity choice to the array type will be error-prone
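
The sparse-data concern can be sketched as follows. The coordinate list and the two counting helpers are hypothetical and only model the two conventions; they are not SOMA or TileDB calls.

```python
# Hypothetical sparse coordinates: only some integer positions are occupied.
occupied = [0, 3, 7, 42, 500, 999, 1000, 1500]

def count_half_open(coords, lo, hi):
    """Count coordinates c with lo <= c < hi."""
    return sum(1 for c in coords if lo <= c < hi)

def count_closed(coords, lo, hi):
    """Count coordinates c with lo <= c <= hi."""
    return sum(1 for c in coords if lo <= c <= hi)

half_open_count = count_half_open(occupied, 0, 1000)
closed_count = count_closed(occupied, 0, 1000)

# If position 1000 happens to be unoccupied, the two conventions agree,
# so a wrong assumption would go unnoticed on that data.
without_1000 = [c for c in occupied if c != 1000]
same_count = count_half_open(without_1000, 0, 1000) == count_closed(without_1000, 0, 1000)
```

Neither count looks obviously wrong on its own, which is the point: the result size does not reveal which convention was applied.
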

Proposals

Language-dependence
- We are not going to have indexing 0-up or 1-up depending on whether the user uses TileDB-SOMA-Py or TileDB-SOMA-R
- If we stick with always being 0-up (which we will) but let the implementation language choose the right-endpoint inclusivity, then we lose the invariant that indexing up to n results in n things
- In short, since we're not willing to accommodate all Python/R indexing conventions, it would be worse to accommodate only some of them
- Thus we want to have the indexing semantics be the same for TileDB-SOMA-Py and TileDB-SOMA-R (and any future implementation languages)
- Which means in particular that indexing semantics belong in the language-independent SOMA spec
Conventions:
- Arrays are 0-up for any implementation language
- Negative indexes are explicitly disallowed for any implementation language
- Slice left endpoints are inclusive
- Slice right endpoints are inclusive or exclusive depending on the indexer method presented to the user
  - E.g. not experiment.obs[m:n] but rather experiment.obs.method_name_goes_here[m:n]
  - These names will allow users to choose the semantics
  - If we ever need to change anything over time in a migration path, we will have old method names which we deprecate and new ones we can introduce
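
A rough sketch of what the named-indexer convention could look like in Python. The class names `Column` and `_Indexer` and the indexer names `closed` and `half_open` are placeholders invented for this example (the proposal deliberately leaves the method names open), and the sketch only handles plain `[m:n]` slices over an in-memory list.

```python
class _Indexer:
    """Supports obj[m:n] with a fixed right-endpoint convention."""

    def __init__(self, data, right_inclusive):
        self._data = data
        self._right_inclusive = right_inclusive

    def __getitem__(self, key):
        if not isinstance(key, slice) or key.step is not None:
            raise TypeError("only [m:n] slices are supported in this sketch")
        start, stop = key.start, key.stop
        if start is None or stop is None:
            raise ValueError("both endpoints are required in this sketch")
        if start < 0 or stop < 0:
            # Negative indexes are explicitly disallowed.
            raise ValueError("negative indices are disallowed")
        # Left endpoint is always inclusive; the right one depends on the indexer.
        end = stop + 1 if self._right_inclusive else stop
        return self._data[start:end]


class Column:
    """Hypothetical 0-up container exposing two named indexers."""

    def __init__(self, data):
        self._data = list(data)

    @property
    def closed(self):
        return _Indexer(self._data, right_inclusive=True)

    @property
    def half_open(self):
        return _Indexer(self._data, right_inclusive=False)


obs = Column(range(10))
closed_hits = obs.closed[2:4]        # indices 2, 3, and 4
half_open_hits = obs.half_open[2:4]  # indices 2 and 3
```

Because the convention is attached to the name rather than to the object, a later change of semantics could ship under a new name while the old one is deprecated.
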

johnkerl · 2022-12-12T15:34:55Z

Pending review of #52

johnkerl · 2023-02-15T13:34:03Z

Specify double inclusivity for slices #52 is long-since merged
The desired language is already in place at https://github.com/single-cell-data/SOMA/blob/python-v0.0.0a14/abstract_specification.md?plain=1#L629
The typo "Slices define a closed," is corrected to "Slices define a closed interval," in Make backticks et al. consistent in the abstract spec #127

johnkerl self-assigned this Nov 30, 2022

johnkerl added the sprint-2022-11-28-2022-12-12 label Nov 30, 2022

This was referenced Nov 30, 2022

Specify double inclusivity for slices #52

Merged

[python] Implement __getitem__/__setitem__ for foundational classes single-cell-data/TileDB-SOMA#573

Open

This was referenced Dec 2, 2022

Consider negative slicing single-cell-data/TileDB-SOMA#519

Closed

Fix DenseNdArray slicing single-cell-data/TileDB-SOMA#581

Merged

johnkerl added the sprint-2022-12-12-2022-12-26 label Dec 9, 2022

johnkerl added the sprint-2023-01-03-2023-01-16 label Jan 3, 2023

johnkerl removed sprint-2022-11-28-2022-12-12 labels Jan 12, 2023

johnkerl closed this as completed Feb 15, 2023
