asdf "super dictionaries" for lazy tree access #1705

braingram · 2023-12-14T15:43:52Z

braingram
Dec 14, 2023
Maintainer

For an as-yet undetermined upcoming asdf version (possibly 3.x) we're considering adding a new way of interacting with the tree. The general idea is that when an ASDF file is opened, the components/nodes of the tree will not be immediately converted to custom objects. Conversion of a node from a tagged object to a custom object would only occur when the node is accessed (except for the top-level AsdfObject, which is accessed as part of opening the file).

Example

For example, suppose you have an ASDF file whose tree has a custom object (an ndarray instance) at tree['a']['b'], where tree['a'] is a dict instance. With the super dictionary approach, something like the following would occur when the file is opened:

with asdf.open('test.asdf', tree_type="super") as af:
    # at this point af['a']['b'] is still a tagged object and the ndarray or NDArrayType for af['a']['b'] has not been converted
    a = af['a']  # as this is only taking a reference to the dict at 'a' the ndarray instance is still not converted
    print(a['b'] * 2)   # accessing the array here both causes the conversion and loading of the ndarray

Note that the tree_type argument above is only an example; this may not be the final API.
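To make the access pattern above concrete, here is a minimal standalone sketch of a dict-like node that converts tagged children only when they are read. It does not use asdf at all: `Tagged`, `LazyDictNode`, and `convert_array` are hypothetical names invented for illustration, and a plain list stands in for the ndarray.

```python
from collections import UserDict


class Tagged:
    """Hypothetical stand-in for a tagged node that has not been converted."""

    def __init__(self, tag, data):
        self.tag = tag
        self.data = data


class LazyDictNode(UserDict):
    """Dict-like node that converts tagged children on first access."""

    def __init__(self, data, converters):
        super().__init__(data)
        # Maps a tag name to a callable that builds the custom object.
        self._converters = converters

    def __getitem__(self, key):
        value = self.data[key]
        if isinstance(value, Tagged):
            # First access: convert, then cache the result in place.
            value = self._converters[value.tag](value.data)
            self.data[key] = value
        elif isinstance(value, dict):
            # Wrap plain mappings so their children stay lazy too.
            value = LazyDictNode(value, self._converters)
            self.data[key] = value
        return value


calls = []


def convert_array(data):
    """Record each conversion; a list stands in for an ndarray."""
    calls.append(data)
    return list(data)


tree = LazyDictNode(
    {"a": {"b": Tagged("ndarray", (1, 2, 3))}},
    {"ndarray": convert_array},
)
a = tree["a"]             # only a reference to the mapping; nothing converted
calls_before = len(calls)
b = a["b"]                # first access triggers the conversion
b_again = a["b"]          # served from the cache, no second conversion
calls_after = len(calls)
```

The converter runs once, on the first read of `a["b"]`, and later reads return the cached object.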

Similarity with lazy_load

This is similar in some ways to the lazy loading currently used for blocks (and NDArrayType), but it is a more generic lazy loading mechanism that would also work for objects that are difficult to lazy load (such as astropy Quantity instances, which force array loading when they are created). For example, consider a tree with a quantity as tree['science']['data']:

with asdf.open('test.asdf', tree_type="super") as af:
    # af['science']['data'] is not yet loaded
    print(af['science']['data'].unit)  # this causes af['science']['data'] to be converted to an astropy Quantity

Super dictionary tree does not use builtin `dict` and `list` for mappings and sequences

One change with super dictionaries is that what were previously dict and list instances in the ASDF tree will be returned as dict-like and list-like objects (possibly subclasses of dict, Mapping or UserDict). Any code that checks specifically for dict (or list) instances may need to be updated, or may continue to use the non-super dictionary interface (at the moment we're planning to keep both).
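As an illustration of the kind of update this might require, the sketch below checks against the abstract base classes in `collections.abc` rather than the concrete builtins. `UserDict` and `UserList` are used here only as stand-ins for whatever node types the super-dictionary interface ends up returning; `is_mapping` and `is_sequence` are hypothetical helper names.

```python
from collections import UserDict, UserList
from collections.abc import Mapping, Sequence


def is_mapping(node):
    """True for dict and for dict-like objects such as UserDict."""
    return isinstance(node, Mapping)


def is_sequence(node):
    """True for list and list-like objects, excluding str and bytes."""
    return isinstance(node, Sequence) and not isinstance(node, (str, bytes))


dict_like = UserDict({"b": 1})   # stand-in for a dict-like tree node
list_like = UserList([1, 2, 3])  # stand-in for a list-like tree node

strict_check = isinstance(dict_like, dict)  # the check that would break
loose_check = is_mapping(dict_like)         # the check that keeps working
```

A check written this way accepts both builtin containers and dict-like or list-like wrappers, so the same code could run against either interface.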

Performance tradeoffs

It also seems likely that all accesses will be slightly slower. Loading files should be significantly faster (as not all objects will need to be converted) but first access for nodes will be slower (when the object is converted and cached).

Additional features

Although not planned for the initial implementation, super-dictionaries might allow for per-node or per-branch methods and options. Some examples would be:

validating a portion of the tree (this wouldn't work for all validation checks and instances as some validation depends on higher-level schema)
providing additional loading options for conversion of nodes (for example, passing a 'memmap' option when accessing an array so that any converted array is memory mapped for that particular node/branch; this might allow for the removal of NDArrayType entirely)

perrygreenfield · 2024-01-02T17:29:25Z

perrygreenfield
Jan 2, 2024
Maintainer

In the case of arrays, will this allow obtaining information about the array (size, shape, type, etc) without loading it?

2 replies

braingram Jan 2, 2024
Maintainer Author

The short answer is hopefully yes. I am currently of the mind that by default, accessing a tree node that contains an array should return a fully-fledged numpy array. If this becomes the default then accessing the array size will result in loading the entire array. For a user who knows they do not want the array data, we can hopefully provide an API for querying array metadata without loading the array. This could be:

have the tree node access return an NDArrayType (most similar to what we do now) that can have its size/shape etc. inspected without loading the array data
fetch the tagged dictionary for the node without converting it to an array (in most cases this would contain the discussed array metadata, although for certain cases, like streamed arrays, the size/shape would not yet be known).

To spitball a possible API, let's assume we have a file af with an array at data. Currently af['data'] returns an NDArrayType instance (which was created when the file was opened). For the super-dictionary interface, I'm proposing that af['data'] trigger the conversion of the tagged dictionary backing data to a normal in-memory numpy array (so af['data'] would return an ndarray instance). To control the conversion (to perhaps an NDArrayType or memmap), one possible API might be to add a public convert method to the AsdfDictNode that implements the super-dictionary interface and use it to control how a given node is deserialized.

af.convert('data', astype=NDArrayType)   # or np.memmap

This would require providing astype to the NDArrayConverter and expanding the Converter API to support optional arguments. It may also be cumbersome for a file with a large number of objects, and nested objects (e.g. arrays inside quantities) would be difficult to control.
Alternatively a configuration context (similar to and perhaps an extension of AsdfConfig) could be used to control the conversion.

with converter_options(...):
    arr = af['data']
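As a rough illustration of how such a configuration context could be built, the sketch below uses the standard library's `contextvars` and `contextlib`. This is not asdf's implementation: `converter_options` borrows its name from the snippet above, while `current_options` and the option names (`memmap`, `astype`) are assumptions made for the example.

```python
import contextlib
import contextvars

# Holds the options active in the current context; never mutated in place.
_options = contextvars.ContextVar("converter_options", default={})


@contextlib.contextmanager
def converter_options(**options):
    """Layer conversion options on top of any already active ones."""
    token = _options.set({**_options.get(), **options})
    try:
        yield
    finally:
        # Restore whatever was active before this block was entered.
        _options.reset(token)


def current_options():
    """Return a copy of the options a converter would see right now."""
    return dict(_options.get())


with converter_options(memmap=True):
    inside = current_options()   # a converter called here would see memmap
outside = current_options()      # options no longer apply after the block
```

Because each block stores a new dictionary and resets to the previous one on exit, nested blocks compose and the options apply to every node converted inside the block, including nested objects.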

Accessing a node as the 'raw' tagged structure seems general enough that I don't think the above proposed API will be necessary. Perhaps something as simple as the following could work for accessing the tagged structure:

af.as_tagged('data')

I think it's easiest and safest to only have this work for non-converted objects (nodes that haven't yet been deserialized) because once an object is converted/deserialized there is no guarantee that the tagged object matches the now deserialized custom object.
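The restriction described here could look something like the following standalone sketch, in which `as_tagged` refuses to return the tagged structure once a node has been converted. All of the names (`Tagged`, `LazyNode`, `build_array`) and the tag and metadata fields are illustrative assumptions, not asdf's API, and a nested list stands in for the array.

```python
class Tagged:
    """Hypothetical stand-in for a tagged node that has not been converted."""

    def __init__(self, tag, data):
        self.tag = tag
        self.data = data


class LazyNode:
    """Node that converts children on access and guards as_tagged."""

    def __init__(self, raw, converters):
        self._raw = dict(raw)
        self._converted = {}
        self._converters = converters

    def __getitem__(self, key):
        if key not in self._converted:
            tagged = self._raw[key]
            self._converted[key] = self._converters[tagged.tag](tagged.data)
        return self._converted[key]

    def as_tagged(self, key):
        # Once converted, the tagged form may no longer match the object.
        if key in self._converted:
            raise ValueError(f"{key!r} has already been converted")
        return self._raw[key]


def build_array(meta):
    """Build a zero-filled nested list with the recorded shape."""
    rows, cols = meta["shape"]
    return [[0.0] * cols for _ in range(rows)]


af = LazyNode(
    {"data": Tagged("ndarray", {"shape": [2, 3], "datatype": "float64"})},
    {"ndarray": build_array},
)
shape = af.as_tagged("data").data["shape"]  # metadata only, no conversion
arr = af["data"]                            # conversion happens here
try:
    af.as_tagged("data")
except ValueError:
    refused = True
else:
    refused = False
```

Reading the shape from the tagged structure does not trigger conversion, while asking for the tagged structure after `af["data"]` has been read raises an error instead of returning possibly stale data.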

perrygreenfield Jan 2, 2024
Maintainer

I think I would prefer a means of retrieving the raw tagged structure as both generic and allowing us to eliminate supporting the existing NDArrayType if possible.

perrygreenfield · 2024-01-02T20:01:09Z

perrygreenfield
Jan 2, 2024
Maintainer

Also, the current functionality of the .info() method should allow displaying such information about arrays without loading them.

1 reply

braingram Jan 2, 2024
Maintainer Author

Excellent point! .info does seem like a prime candidate for a function that should display as much as possible without expensive loading.

braingram · 2024-06-13T20:00:53Z

braingram
Jun 13, 2024
Maintainer Author

FYI: this feature is implemented in #1733, which is currently targeting 3.3.

0 replies
