Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Table output in Data.read_many #11546

Open
wants to merge 60 commits into
base: develop
Choose a base branch
from

Conversation

radeusgd
Copy link
Member

@radeusgd radeusgd commented Nov 13, 2024

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    TypeScript,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • Unit tests have been written where possible.
  • If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries,
    or the Snowflake database integration, a run of the Extra Tests has been scheduled.
    • If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

@radeusgd radeusgd added the CI: No changelog needed Do not require a changelog entry for this PR. label Nov 13, 2024
@radeusgd radeusgd self-assigned this Nov 13, 2024
@radeusgd radeusgd force-pushed the wip/radeusgd/11311-read-many-to-table branch 2 times, most recently from 4c4b4e0 to 68df118 Compare November 14, 2024 13:44
@radeusgd radeusgd marked this pull request as ready for review November 14, 2024 13:58
Comment on lines 11 to 21
type Return_As_Table
## Returns a table with a new column `Value` containing the objects loaded
from each file.

When the source for files to load was a table, all columns from the
original table are also retained. In case of name clashes, the newly
added columns will get a suffix.

When the source was a simple Vector, the returned table will also contain
a `File Name` column.
Table_Of_Objects
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name Table_Of_Objects was a working name. I'm not sure if we want to keep it or come up with something better?

Definitely Table_Of_Tables in Excel sounded much better. But here it may not necessarily be tables, it may be various objects. So Table_Of_Objects is quite precise, but it sounds weird.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should change approach altogether and try something like Added_Column? (meaning that the loaded files are added as a column to the input)

  • Values_Column
  • Table_With_New_Column

any thoughts on these names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Table_With_New_Column or Table_Of_Objects seem the best to me, Values_Column I think could end up being confusing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about more of a boolean? Merged/UnMerged. Merged True/False? I feel like it would be hard for someone to guess what table_of_objects is going to do without reading the help or trying it

Copy link
Member

@jdunkerley jdunkerley Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if:

type Return_As
   ## Single table with values expanded
   Merged_Table

   ## PRIVATE
       Add this for backwards compatiblity then hide in widget?
   Table_Of_Tables

   ## 
   Individual_Cells

   ## Objects returned as a raw vector allowing zipping to input
   Vector

Not sure on name yet either - will think some more

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about more of a boolean? Merged/UnMerged. Merged True/False?

But note that Merged_Table has additional arguments (columns_to_keep and match) that are only relevant in this mode. Switching to a boolean would make this problematic.

Plus dropdown options are clearer in the GUI generally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sorry that wasn't clear. When I said boolean I didn't mean a boolean type. More text options that were Merged or Not Merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vector is handled by Return_As_Base - the type needs to be split up so that the Vector return works even if Table is not imported. It is opaque to the user - the dropdown displays all available types and autoscoping works as shown in tests.

I'd not add Table_Of_Tables to Data.read_many. It is supported in Excel_Workbook.read_many to retain backwards compatibility. But adding it to a brand new function seems unnecessary - why add an obsolete option?

I was in my head thinking of it as a single type used by both functions but I appreciate we are actually building this dynamically. Just for working out how it might look in the dropdown thinking as a single type is a bit easier!

If the ..Table_Of_Tables was an option not shown in the dropdown but in one of the mix in types then no special handling would be needed for the two reads to share the same object was my thought.

I wonder if:

  • ..As_Vector
  • ..With_New_Column
  • ..As_Merged_Table

Might be cleanest?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ..Table_Of_Tables was an option not shown in the dropdown but in one of the mix in types then no special handling would be needed for the two reads to share the same object was my thought.

But then more code would be affected by the legacy options, I'd rather not grow places that take legacy options when not needed. Unless there's any problem with the current approach, I'd keep the compatibility conversion as-is.

I wonder if:

  • ..As_Vector
  • ..With_New_Column
  • ..As_Merged_Table

Might be cleanest?

That sounds quite clear to me 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I like those options

@@ -16,146 +7,3 @@ type Match_Columns

Note: column names are not compared.
By_Position

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No meaningful code changes in this file, just moving the helpers to separate module - Match_Columns_Helpers.

@radeusgd radeusgd force-pushed the wip/radeusgd/11311-read-many-to-table branch from 771a52c to eead84a Compare November 15, 2024 13:23
@radeusgd
Copy link
Member Author

radeusgd commented Nov 15, 2024

I've got a one remaining failing test:
image

I'm wondering what to do about it - at the moment of union, Sheet Name is not more special than any other column - the order in which the columns appear depends on the order in which they appear in the inputs - so if the first files are .tsv and not .xls, it appears after these columns. What do you think? Is that an issue?

Ideally the column should be first always. In most cases, when loading only Excel files (or just enough that an .xls file goes first) it will already be first. I'm wondering if we should introduce additional logic to handle this edge case where a .tsv comes first and possibly changes the ordering.

What's more important I guess is that when matching By_Position, the Sheet Name column could get merged with the first column of the .tsv - that is actually problematic and probably needs for some special logic. I will later add a test for it and try to address it.


Edit: This has been addressed.

I split columns into 3 kinds: input table, metadata and data; each are handled separately. Metadata columns are always matched by name and all are kept (In_Any); the matching settings on As_Matched_Table only affects the data columns. Tests were added (mostly to Excel_Spec as currently only Excel has 'metadata' columns) to illustrate this and check some edge cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: No changelog needed Do not require a changelog entry for this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Data.read_many
4 participants