Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-40053: Add Datastore records to DatasetRef and use them in butler.get #875

Merged
merged 12 commits into from
Oct 17, 2023

Conversation

andy-slac
Copy link
Contributor

@andy-slac andy-slac commented Aug 4, 2023

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@andy-slac
Copy link
Contributor Author

@timj, @TallJimbo, here is the current state of my branch, if you want to check it. Not so much code changes, but there were a lot of experiments along the way. Any comments are welcome!

@timj
Copy link
Member

timj commented Sep 12, 2023

I finally had a chance to look at this and given,

  • We want datastores to have the option of not having to manage records at all.
  • Each DatasetRef should carry its own datastore records.

then this looks like a reasonable approach. We do have to decide whether it's only ephemeral datastores that don't have to return records. Does SasquatchDatastore have to return empty records or only do that if get() is implemented?

@andy-slac
Copy link
Contributor Author

Does SasquatchDatastore have to return empty records or only do that if get() is implemented?

I'm not familiar with SasquatchDatastore, but I can imagine that if it does not need datastore records stored in registry then it can return None or empty list/map for records.

@andy-slac andy-slac force-pushed the tickets/DM-40053 branch 2 times, most recently from 4c9f3eb to 5eb7f94 Compare September 13, 2023 20:24
@codecov
Copy link

codecov bot commented Sep 13, 2023

Codecov Report

Attention: 87 lines in your changes are missing coverage. Please review.

Comparison is base (73579f0) 87.73% compared to head (2a1b9c1) 87.58%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #875      +/-   ##
==========================================
- Coverage   87.73%   87.58%   -0.15%     
==========================================
  Files         271      271              
  Lines       36364    36506     +142     
  Branches     7573     7610      +37     
==========================================
+ Hits        31904    31974      +70     
- Misses       3278     3347      +69     
- Partials     1182     1185       +3     
Files Coverage Δ
python/lsst/daf/butler/_dataset_ref.py 82.53% <100.00%> (+1.67%) ⬆️
python/lsst/daf/butler/_formatter.py 89.63% <100.00%> (-0.22%) ⬇️
python/lsst/daf/butler/datastore/_datastore.py 95.03% <100.00%> (+0.23%) ⬆️
python/lsst/daf/butler/datastore/generic_base.py 92.10% <100.00%> (-2.73%) ⬇️
...ython/lsst/daf/butler/registry/_butler_registry.py 89.74% <100.00%> (+2.24%) ⬆️
python/lsst/daf/butler/registry/_registry.py 96.72% <ø> (ø)
tests/test_datasets.py 100.00% <100.00%> (ø)
tests/test_datastore.py 99.72% <ø> (-0.01%) ⬇️
tests/test_quantumBackedButler.py 99.14% <100.00%> (+<0.01%) ⬆️
python/lsst/daf/butler/_butler.py 79.21% <83.33%> (+0.08%) ⬆️
... and 8 more

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


__slots__ = {"table_spec", "record_class"}
table_spec: ddl.TableSpec
record_class: type[StoredDatastoreItemInfo]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a minor side issue, but I'd prefer not to assume anywhere that opaque tables are just for datastores, and I'd be great to cut down on the number of conversions (SQLAlchemy row <-> StoredDatastoreItemInfo <-> pydantic/json mapping).

So, could we put this OpaqueTableDefinition struct in registry.interfaces.opaque?

And could we make record_class just "some pydantic model type"? Or maybe even just use bare dict here? Dynamic class types aren't getting us any static-analysis advantages, and runtime checking is probably a waste of cycles (except at the JSON boundary).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be OK to move OpaqueTableDefinition to registry.interfaces. I was worried that this may cause core to be (logically) dependent on registry, but we can do it as mypy-only dependency, as we do for a couple of other classes already. We do need then to replace StoredDatastoreItemInfo with registry-side type (or possibly use dict), I need to look at the code and figure out a reasonable option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of completely abandoning typing and using dict for arbitrary datastore records. The StoredDatastoreItemInfo dataclass definition is seemingly very close to defining the database schema directly through introspection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have typed datastore records, but it makes life a bit harder for registry. But it's probably unavoidable.

python/lsst/daf/butler/registries/sql.py Show resolved Hide resolved
@TallJimbo
Copy link
Member

I think this is broadly consistent with what I think I'd have put in DMTN-249 if I hadn't been overzealous about optimizing the back-and-forth in put in client-server, but I need to go read that again (and maybe attempt to fix it) to be sure, at least if my question about datastores fetching missing records is resolved in favor of that being transitional.

@andy-slac andy-slac force-pushed the tickets/DM-40053 branch 2 times, most recently from 87a138c to e10be0f Compare October 4, 2023 18:00
@andy-slac andy-slac force-pushed the tickets/DM-40053 branch 13 times, most recently from 6a3b2e9 to 1c5ee3e Compare October 11, 2023 17:15
@andy-slac andy-slac marked this pull request as ready for review October 11, 2023 18:12
@andy-slac andy-slac changed the title DM-40053: Draft PR for changes on this ticket so far DM-40053: Initial registry-free Datastore with butler.get Oct 11, 2023
@timj timj changed the title DM-40053: Initial registry-free Datastore with butler.get DM-40053: Add Datastore records to DatasetRef and use them in butler.get Oct 11, 2023
Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a lot of little comments and questions but I have no big-picture concerns; it's all a step in the right direction.

FWIW, your put_new does not match the one I have been prototyping with transaction support over on DMTN-249 anymore, and I think it'll probably end up changing again when we do transactions and/or client-server put, but my prototyping doesn't include ChainedDatastore so I'm not confident it will work as-is either.

python/lsst/daf/butler/datastore/_datastore.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/_butler.py Show resolved Hide resolved
python/lsst/daf/butler/_dataset_ref.py Show resolved Hide resolved
python/lsst/daf/butler/datastore/stored_file_info.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/datastore/stored_file_info.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/datastores/inMemoryDatastore.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registries/sql.py Show resolved Hide resolved
andy-slac and others added 12 commits October 17, 2023 13:21
There are few changes in Registry and Datastore classes that reduce use of
the bridge and opaque tables by datastores:
- New method `Datastore.get_opaque_table_definitions` returns opaque table
  definitoons which are used by Registry method `make_datastore_tables`
  to create those tables.
- `Registry.findDataset` has an option to retrieve datastore records
  and attach them to returned ref.
- `Datastore.get` can use the datastore records from the ref instead of
  using opaque tables.

Note that `Datastore.get` can still work with refs that don't have datastore
records, this is mostly to keep the unit tests running, a future commit will
remove that option and update all unit tests.
Adds `Datastore.put_new`, we want to keep old `Datastore.put` to keep
unit tests running (no unit tests for `put_new()` yet). `put_new()`
returns DatasetRef with datastore records, and new Registry method
is called by Butler to store those records in database. Better
approach would be for Registry to store datastore records in
`insertDatasets`/`_importDatasets` but that needs a bit of code
re-arrangement, will do that later.
This is to make it more similar to the `OpaqueTableValues` in DMTN-249
prototype. The `DatasetDatastoreRecords` type alias is similar to
`OpaqueTableBatch` from prototype. `StoredDatastoreItemInfo` is always
used with its corresponding `DatasetRef` or `DatasetId`, so keeping
dataset_id in each record is wasteful.
This is only to suppress errors in the tests.
Old name was too generic, in reality this is very datastore-specific.
I do not think we need generic structure of the same kind, at least not
right now.
Jim suggested that we use new `OpaqueRecordSet` class to keep datastore
records, but after discussion we figured it would requre lots of additional
structure to support it. For now there is no clear need for that class and
we can continue using `StoredDatastoreItemInfo` in DatasetRefs, but want to
make records private to give us freedom to change it later.
Datastore records in DatasetRef can become invalidated in some cases,
this is mostly true for our unit tests that do crazy things to test
consistency. This patch adds a flag parameter to some methods that
tells them to ignore ref-supplied datastore records.
After trying to re-implement datastore unit tests I realized that the
`Datastore.put_new` and `Registry.store_datastore_records` combination
(used to implement `Butler.put`) is not very compatible with transaction
rollback system that we now have in place. So I decided to keep
`Butler.put` unchanged for now until we do something more drastic with
transactions. I keep `Datastore.put_new` and `Registry.store_datastore_records`
implementations, they are not used but may be useful in the future.
Co-authored-by: Jim Bosch <[email protected]>
@andy-slac andy-slac merged commit 828e697 into main Oct 17, 2023
14 of 16 checks passed
@andy-slac andy-slac deleted the tickets/DM-40053 branch October 17, 2023 23:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants