Add `Converter.to_info` to customize `info` and `search` #1884

braingram · 2024-12-20T17:48:11Z

Description

This PR adds support for a new Converter method to_info. This new method is for converting a custom object to an "info" container (list, tuple, dict) to provide additional information during info (and as a side effect during search since this uses common code in _node_info.py).

To provide an example where this is useful. Currently an AsdfFile with an astropy Table renders with largely unhelpful info:

>>> from astropy.table import Table
>>> import asdf
>>> af = asdf.AsdfFile({"my_table": Table({"col_a": [1, 2, 3], "col_b": [4, 5, 6]})})
>>> af.info()
root (AsdfObject)
└─my_table (Table)

With this PR the Table (and Column) Converters in asdf-astropy can be updated (only the TableConverter update shown here).

    def to_info(self, obj):
        return {
            "columns": {n: obj[n] for n in obj.colnames},
            "meta": dict(obj.meta),
            "n_rows": len(obj),
        }

to improve the output:

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column): shape=(3,), dtype=int64
  │ │ ├─name (str): col_a
  │ │ └─dtype (dtype[int64]): int64
  │ └─col_b (Column): shape=(3,), dtype=int64
  │   ├─name (str): col_b
  │   └─dtype (dtype[int64]): int64
  ├─meta (dict): {}
  └─n_rows (int): 3

This has the added benefit of allowing search to find column names for the above example:

>>> af.search("col_a")
root (AsdfObject)
└─my_table (Table)
  └─columns (dict)
    └─col_a (Column): shape=(3,), dtype=int64

This PR uses the Converter.to_info feature to implement custom information of ndarray/NDArrayType. On main the following:

import asdf, numpy as np
asdf.AsdfFile({"arr": np.arange(42)}).info()

produces:

root (AsdfObject)
└─arr (ndarray): shape=(42,), dtype=int64

with this PR the output is different:

root (AsdfObject)
└─arr (ndarray)
  ├─shape (tuple)
  │ └─[0] (int): 42
  └─dtype (Int64DType): int64

During working on this feature several bugs were uncovered and a few were fixed:

info/search/schema_info doesn't use the AsdfFile extension manager: info search and schema_info do not use the AsdfFile.extension_manager #1882
info output is garbled if the output doesn't support ANSI formatting codes: info output garbled when terminal does not support formatting #1889
info/search/schema_info infinitely loops and consumes unbounded memory for recursive lists: info can loop infinitely leaking memory if passed a recursive list #1890

Fixes #1882 #1889 #1890 #1892

Tasks

run pre-commit on your machine
run pytest on your machine
Does this PR add new features and / or change user-facing code / API? (if not, label with no-changelog-entry-needed)
- write news fragment(s) in changes/: echo "changed something" > changes/<PR#>.<changetype>.rst (see below for change types)
- update relevant docstrings and / or docs/ page
- for any new features, add unit tests

news fragment change types...

changes/<PR#>.feature.rst: new feature
changes/<PR#>.bugfix.rst: bug fix
changes/<PR#>.doc.rst: documentation change
changes/<PR#>.removal.rst: deprecation or removal of public API
changes/<PR#>.general.rst: infrastructure or miscellaneous change

braingram · 2024-12-24T19:21:43Z

The stdatamodels downstream failure is unrelated: spacetelescope/stdatamodels#371

perrygreenfield · 2025-01-02T23:20:29Z

This is a nice feature. I wonder if there should be a keyword option to suppress this option in the cases it will really blow up the output (tables with lots of columns, or a very complex GWCS).

braingram · 2025-01-03T15:20:46Z

This is a nice feature. I wonder if there should be a keyword option to suppress this option in the cases it will really blow up the output (tables with lots of columns, or a very complex GWCS).

Thanks for giving it a look.

At the moment the output is subject to the max_rows (and max_cols) options. For the example output if the total number of rows exceedsmax_rows (if max_rows is an integer) the output will begin to be truncated starting with the most distal branches of the tree, pruning the output to:

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column) ...
  │ └─col_b (Column) ...
  ├─meta (dict): {}
  └─n_rows (int): 3

and then to:

└─my_table (Table) ...

There is also an existing show_values option. If that's disabled (and max_rows=None):

└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column)
  │ │ ├─name (str)
  │ │ └─dtype (dtype[int64])
  │ └─col_b (Column)
  │   ├─name (str)
  │   └─dtype (dtype[int64])
  ├─meta (dict)
  └─n_rows (int)

That output doesn't look too helpful so one option might be to not use to_info when show_values=False. That seems a little hacky and may require some additional code refactoring (as to_info is used during the NodeInfo collection whereas show_values is used in the post-collection display of the NodeInfo instance).

We could add a flag use_to_info to NodeInfo to control if to_info is used to expand nodes and then expose it for control via:

AsdfFile.info
asdf.info
asdftool info
etc...

At the moment I'm leaning towards relying on max_rows as the way to control the verbosity of the output but don't object to adding more API to control this output. Let me know what works best for you.

perrygreenfield · 2025-01-03T15:37:40Z

I worry that if the very lengthy item comes before items of interest, max_rows may not be that much help. For your example, while the column info is attenuated somewhat, there may still be a lot of lines for big tables. This may work better for GWCS if that shows a deep expression tree (though some representations are more text based; it depends on how the to_info is implemented). I wonder if another approach is truncate a node's output based on how many lines it contains (max_node_info?), though I would probably exclude meta from that.

braingram · 2025-01-03T17:54:20Z

I worry that if the very lengthy item comes before items of interest, max_rows may not be that much help. For your example, while the column info is attenuated somewhat, there may still be a lot of lines for big tables. This may work better for GWCS if that shows a deep expression tree (though some representations are more text based; it depends on how the to_info is implemented). I wonder if another approach is truncate a node's output based on how many lines it contains (max_node_info?), though I would probably exclude meta from that.

While working on this and #1875 I noticed that max_rows is rather more complicated than I expected. It starts hiding the deepest/most-distal rows/branches first. Since the to_info generated subtree will be 1 deeper than the item this can help (for some trees) to keep things tidy when limiting the number of rows. If I modify the above example adding a nested structure prior to my_table and call af.info (with the default max_rows=24) the output includes:

─aaa (dict)
│ ├─bbb (dict)
│ │ ├─ccc (dict) ...
│ │ └─ccc2 (int): 1
│ └─bbb2 (list)
│   ├─[0] (int): 5
│   ├─[1] (int): 6
│   └─[2] (int): 7
└─my_table (Table)
  ├─columns (dict) ...
  ├─meta (dict): {}
  └─n_rows (int): 3

In this case the columns subtree was hidden (because of it's depth) but so was ccc in the nested aaa structure. If I use max_rows=12 the output condenses to:

├─aaa (dict) ...
└─my_table (Table) ...
Some nodes not shown.

Because aaa and my_table are the most shallow structure they will be hidden last (put another way is that info has a preference for displaying more root-level objects). max_rows can also be a tuple of rows per depth. Calling af.info(max_rows=(24, None, 3)) outputs:

├─aaa (dict)
│ ├─bbb (dict)
│ │ ├─ccc (dict) ...
│ │ └─ccc2 (int): 1
│ └─bbb2 (list)
│   ├─[0] (int): 5
│   ├─[1] (int): 6
│   └─[2] (int): 7
└─my_table (Table)
  ├─columns (dict)
  │ ├─col_a (Column) ...
  │ └─col_b (Column) ...
  ├─meta (dict): {}
  └─n_rows (int): 3
Some nodes not shown.

I think the existing max_rows options are pretty useful here and if a user wants further filtering they can always select an item/branch/etc from the tree asdf.info(af['aaa']):

root (AsdfObject)
├─bbb (dict)
│ ├─ccc (dict)
│ │ └─ddd (list)
│ │   ├─[0] (int): 1
│ │   ├─[1] (int): 2
│ │   ├─[2] (int): 3
│ │   └─[3] (int): 4
│ └─ccc2 (int): 1
└─bbb2 (list)
  ├─[0] (int): 5
  ├─[1] (int): 6
  └─[2] (int): 7

As you noted the details about how to_info is implemented for Table, Column, WCS etc will help to inform where and how we may want to control the output. I suggest that we hold off adding any other options to info until after we have some experience with to_info.

perrygreenfield · 2025-01-06T14:10:58Z

Yes, it (max_rows) may already doing mostly what we want.

perrygreenfield

Asking for a clarification on a code comment

perrygreenfield · 2025-01-09T13:51:51Z

asdf/_node_info.py

                    if root_info is None:
                        root_info = info

                    if parent is not None:
-                        if parent.schema is not None and not cls.traversable(node):
+                        # Why check that __asdf_traverse__ doesn't exist here?


I'm not sure I understand the comment regarding transversable. It seems to be arguing against the check, but retains it?

Maybe it makes sense to remove this note?
I added it while trying to understand the code and I think it leads to an issue:
#1892

It still doesn't make sense to me why not cls.traversable (or with this PR not traversable) is checked. This PR at the moment doesn't change the behavior (to try an keep things consistent). However I think that consistency is enforcing a bug.

If I remove the traversable check no tests fail (in asdf or roman_datamodels) so I doubt anything is relying on this bug. We could fix it in this PR by removing the check. Any ideas why not cls.traversable(node) was added?

Not that I can recall. Sounds like we should try just getting rid of that and see what the consequences are. Yes, something may be relying on that, but there is a benefit to getting rid of it if nothing is depending on it.

Sounds good to me. I modified one of the existing tests to also cover the issue:
8f95e65
and removed the not traversable:
b8bd0eb

Leave search usage as a hasattr(__asdf_traverse__) check since search doesn't have access to an extension manager Update info usage to allow use of an extension manager

braingram · 2025-01-09T18:29:24Z

The jwst downstream failure is unrelated. It's due to the use of pytest-xdist and CRDS not file locking and not verifying local files (spacetelescope/crds#930) leading to one test process downloading a partial file and another process attempting to use it before it's complete.

braingram added the Downstream CI label Dec 20, 2024

braingram force-pushed the info_v2 branch 2 times, most recently from 412664d to c1713c3 Compare December 24, 2024 14:02

braingram changed the title ~~WIP: info/seach/schema_info fixes~~ Add Converter.to_info to customize info and search Dec 24, 2024

braingram marked this pull request as ready for review December 24, 2024 19:21

braingram requested a review from a team as a code owner December 24, 2024 19:21

braingram requested review from perrygreenfield and nden January 2, 2025 19:32

perrygreenfield reviewed Jan 9, 2025

View reviewed changes

braingram added 15 commits January 9, 2025 12:30

reducing some redundancy

c8e4498

Fix issue asdf-format#1882: info not using custom extensions

e767db0

removed unused variable

bdbd7af

reorganize "traversable" detection

375675e

Leave search usage as a hasattr(__asdf_traverse__) check since search doesn't have access to an extension manager Update info usage to allow use of an extension manager

check isatty before using ANSI codes in info

f4bf586

only ANSI format diff if supported

4c11b9a

actually fix info formatting

0dcbb76

prevent infinite loop on info

3a1a824

remove traversable class method

d2e5e5e

WIP use extension_manager for info

fca8d30

remove unused code, add comments

8cd1e28

add to_info to Converter

e9ff632

use to_info for ndarray

b898f60

document using to_info to extend search and info

a4606b2

add changelog entries

df5b7c9

braingram added 3 commits January 9, 2025 12:42

assign parent schema if provided

d74b551

walk schema for traversable nodes

b8bd0eb

modify test to cover issue

8f95e65

braingram force-pushed the info_v2 branch from 68146ab to 8f95e65 Compare January 9, 2025 17:51

add another changelog fragment

e7b3f5a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `Converter.to_info` to customize `info` and `search` #1884

Add `Converter.to_info` to customize `info` and `search` #1884

braingram commented Dec 20, 2024 •

edited

Loading

braingram commented Dec 24, 2024

perrygreenfield commented Jan 2, 2025

braingram commented Jan 3, 2025

perrygreenfield commented Jan 3, 2025

braingram commented Jan 3, 2025

perrygreenfield commented Jan 6, 2025

perrygreenfield left a comment

perrygreenfield Jan 9, 2025

braingram Jan 9, 2025

perrygreenfield Jan 9, 2025

braingram Jan 9, 2025

braingram commented Jan 9, 2025

Add Converter.to_info to customize info and search #1884

Are you sure you want to change the base?

Add Converter.to_info to customize info and search #1884

Conversation

braingram commented Dec 20, 2024 • edited Loading

Description

Tasks

braingram commented Dec 24, 2024

perrygreenfield commented Jan 2, 2025

braingram commented Jan 3, 2025

perrygreenfield commented Jan 3, 2025

braingram commented Jan 3, 2025

perrygreenfield commented Jan 6, 2025

perrygreenfield left a comment

Choose a reason for hiding this comment

perrygreenfield Jan 9, 2025

Choose a reason for hiding this comment

braingram Jan 9, 2025

Choose a reason for hiding this comment

perrygreenfield Jan 9, 2025

Choose a reason for hiding this comment

braingram Jan 9, 2025

Choose a reason for hiding this comment

braingram commented Jan 9, 2025

Add `Converter.to_info` to customize `info` and `search` #1884

Add `Converter.to_info` to customize `info` and `search` #1884

braingram commented Dec 20, 2024 •

edited

Loading