Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Converter.to_info to customize info and search #1884

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

braingram
Copy link
Contributor

@braingram braingram commented Dec 20, 2024

Description

This PR adds support for a new Converter method to_info. This new method is for converting a custom object to an "info" container (list, tuple, dict) to provide additional information during info (and as a side effect during search since this uses common code in _node_info.py).

To provide an example where this is useful. Currently an AsdfFile with an astropy Table renders with largely unhelpful info:

>>> from astropy.table import Table
>>> import asdf
>>> af = asdf.AsdfFile({"my_table": Table({"col_a": [1, 2, 3], "col_b": [4, 5, 6]})})
>>> af.info()
root (AsdfObject)
└─my_table (Table)

With this PR the Table (and Column) Converters in asdf-astropy can be updated (only the TableConverter update shown here).

    def to_info(self, obj):
        return {
            "columns": {n: obj[n] for n in obj.colnames},
            "meta": dict(obj.meta),
            "n_rows": len(obj),
        }

to improve the output:

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column): shape=(3,), dtype=int64
  │ │ ├─name (str): col_a
  │ │ └─dtype (dtype[int64]): int64
  │ └─col_b (Column): shape=(3,), dtype=int64
  │   ├─name (str): col_b
  │   └─dtype (dtype[int64]): int64
  ├─meta (dict): {}
  └─n_rows (int): 3

This has the added benefit of allowing search to find column names for the above example:

>>> af.search("col_a")
root (AsdfObject)
└─my_table (Table)
  └─columns (dict)
    └─col_a (Column): shape=(3,), dtype=int64

This PR uses the Converter.to_info feature to implement custom information of ndarray/NDArrayType. On main the following:

import asdf, numpy as np
asdf.AsdfFile({"arr": np.arange(42)}).info()

produces:

root (AsdfObject)
└─arr (ndarray): shape=(42,), dtype=int64

with this PR the output is different:

root (AsdfObject)
└─arr (ndarray)
  ├─shape (tuple)
  │ └─[0] (int): 42
  └─dtype (Int64DType): int64

During working on this feature several bugs were uncovered and a few were fixed:

Fixes #1882 #1889 #1890 #1892

Tasks

  • run pre-commit on your machine
  • run pytest on your machine
  • Does this PR add new features and / or change user-facing code / API? (if not, label with no-changelog-entry-needed)
    • write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
    • update relevant docstrings and / or docs/ page
    • for any new features, add unit tests
news fragment change types...
  • changes/<PR#>.feature.rst: new feature
  • changes/<PR#>.bugfix.rst: bug fix
  • changes/<PR#>.doc.rst: documentation change
  • changes/<PR#>.removal.rst: deprecation or removal of public API
  • changes/<PR#>.general.rst: infrastructure or miscellaneous change

@braingram braingram force-pushed the info_v2 branch 2 times, most recently from 412664d to c1713c3 Compare December 24, 2024 14:02
@braingram braingram changed the title WIP: info/seach/schema_info fixes Add Converter.to_info to customize info and search Dec 24, 2024
@braingram
Copy link
Contributor Author

The stdatamodels downstream failure is unrelated: spacetelescope/stdatamodels#371

@braingram braingram marked this pull request as ready for review December 24, 2024 19:21
@braingram braingram requested a review from a team as a code owner December 24, 2024 19:21
@perrygreenfield
Copy link
Contributor

This is a nice feature. I wonder if there should be a keyword option to suppress this option in the cases it will really blow up the output (tables with lots of columns, or a very complex GWCS).

@braingram
Copy link
Contributor Author

This is a nice feature. I wonder if there should be a keyword option to suppress this option in the cases it will really blow up the output (tables with lots of columns, or a very complex GWCS).

Thanks for giving it a look.

At the moment the output is subject to the max_rows (and max_cols) options. For the example output if the total number of rows exceedsmax_rows (if max_rows is an integer) the output will begin to be truncated starting with the most distal branches of the tree, pruning the output to:

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column) ...
  │ └─col_b (Column) ...
  ├─meta (dict): {}
  └─n_rows (int): 3

and then to:

└─my_table (Table) ...

There is also an existing show_values option. If that's disabled (and max_rows=None):

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column)
  │ │ ├─name (str)
  │ │ └─dtype (dtype[int64])
  │ └─col_b (Column)
  │   ├─name (str)
  │   └─dtype (dtype[int64])
  ├─meta (dict)
  └─n_rows (int)

That output doesn't look too helpful so one option might be to not use to_info when show_values=False. That seems a little hacky and may require some additional code refactoring (as to_info is used during the NodeInfo collection whereas show_values is used in the post-collection display of the NodeInfo instance).

We could add a flag use_to_info to NodeInfo to control if to_info is used to expand nodes and then expose it for control via:

  • AsdfFile.info
  • asdf.info
  • asdftool info
  • etc...

At the moment I'm leaning towards relying on max_rows as the way to control the verbosity of the output but don't object to adding more API to control this output. Let me know what works best for you.

@perrygreenfield
Copy link
Contributor

I worry that if the very lengthy item comes before items of interest, max_rows may not be that much help. For your example, while the column info is attenuated somewhat, there may still be a lot of lines for big tables. This may work better for GWCS if that shows a deep expression tree (though some representations are more text based; it depends on how the to_info is implemented). I wonder if another approach is truncate a node's output based on how many lines it contains (max_node_info?), though I would probably exclude meta from that.

@braingram
Copy link
Contributor Author

I worry that if the very lengthy item comes before items of interest, max_rows may not be that much help. For your example, while the column info is attenuated somewhat, there may still be a lot of lines for big tables. This may work better for GWCS if that shows a deep expression tree (though some representations are more text based; it depends on how the to_info is implemented). I wonder if another approach is truncate a node's output based on how many lines it contains (max_node_info?), though I would probably exclude meta from that.

While working on this and #1875 I noticed that max_rows is rather more complicated than I expected. It starts hiding the deepest/most-distal rows/branches first. Since the to_info generated subtree will be 1 deeper than the item this can help (for some trees) to keep things tidy when limiting the number of rows. If I modify the above example adding a nested structure prior to my_table and call af.info (with the default max_rows=24) the output includes:

─aaa (dict)
│ ├─bbb (dict)
│ │ ├─ccc (dict) ...
│ │ └─ccc2 (int): 1
│ └─bbb2 (list)
│   ├─[0] (int): 5
│   ├─[1] (int): 6
│   └─[2] (int): 7
└─my_table (Table)
  ├─columns (dict) ...
  ├─meta (dict): {}
  └─n_rows (int): 3

In this case the columns subtree was hidden (because of it's depth) but so was ccc in the nested aaa structure. If I use max_rows=12 the output condenses to:

├─aaa (dict) ...
└─my_table (Table) ...
Some nodes not shown.

Because aaa and my_table are the most shallow structure they will be hidden last (put another way is that info has a preference for displaying more root-level objects). max_rows can also be a tuple of rows per depth. Calling af.info(max_rows=(24, None, 3)) outputs:

├─aaa (dict)
│ ├─bbb (dict)
│ │ ├─ccc (dict) ...
│ │ └─ccc2 (int): 1
│ └─bbb2 (list)
│   ├─[0] (int): 5
│   ├─[1] (int): 6
│   └─[2] (int): 7
└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column) ...
  │ └─col_b (Column) ...
  ├─meta (dict): {}
  └─n_rows (int): 3
Some nodes not shown.

I think the existing max_rows options are pretty useful here and if a user wants further filtering they can always select an item/branch/etc from the tree asdf.info(af['aaa']):

root (AsdfObject)
├─bbb (dict)
│ ├─ccc (dict)
│ │ └─ddd (list)
│ │   ├─[0] (int): 1
│ │   ├─[1] (int): 2
│ │   ├─[2] (int): 3
│ │   └─[3] (int): 4
│ └─ccc2 (int): 1
└─bbb2 (list)
  ├─[0] (int): 5
  ├─[1] (int): 6
  └─[2] (int): 7

As you noted the details about how to_info is implemented for Table, Column, WCS etc will help to inform where and how we may want to control the output. I suggest that we hold off adding any other options to info until after we have some experience with to_info.

@perrygreenfield
Copy link
Contributor

Yes, it (max_rows) may already doing mostly what we want.

Copy link
Contributor

@perrygreenfield perrygreenfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking for a clarification on a code comment

if root_info is None:
root_info = info

if parent is not None:
if parent.schema is not None and not cls.traversable(node):
# Why check that __asdf_traverse__ doesn't exist here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the comment regarding transversable. It seems to be arguing against the check, but retains it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it makes sense to remove this note?
I added it while trying to understand the code and I think it leads to an issue:
#1892

It still doesn't make sense to me why not cls.traversable (or with this PR not traversable) is checked. This PR at the moment doesn't change the behavior (to try an keep things consistent). However I think that consistency is enforcing a bug.

If I remove the traversable check no tests fail (in asdf or roman_datamodels) so I doubt anything is relying on this bug. We could fix it in this PR by removing the check. Any ideas why not cls.traversable(node) was added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I can recall. Sounds like we should try just getting rid of that and see what the consequences are. Yes, something may be relying on that, but there is a benefit to getting rid of it if nothing is depending on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. I modified one of the existing tests to also cover the issue:
8f95e65
and removed the not traversable:
b8bd0eb

@braingram
Copy link
Contributor Author

The jwst downstream failure is unrelated. It's due to the use of pytest-xdist and CRDS not file locking and not verifying local files (spacetelescope/crds#930) leading to one test process downloading a partial file and another process attempting to use it before it's complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

info search and schema_info do not use the AsdfFile.extension_manager
2 participants