Support for `Table` output in `Data.read_many` #11546

radeusgd · 2024-11-13T13:21:21Z

Pull Request Description

Closes #11311 (Implement Data.read_many)

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

- The documentation has been updated, if necessary.
- Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
- All code follows the Scala, Java, TypeScript, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
- Unit tests have been written where possible.
- If meaningful changes were made to logic or tests affecting Enso Cloud integration in the libraries, or the Snowflake database integration, a run of the Extra Tests has been scheduled.
  - If applicable, it is suggested to paste a link to a successful run of the Extra Tests.

radeusgd · 2024-11-14T14:01:03Z

distribution/lib/Standard/Table/0.0.0-dev/src/Return_As_Table.enso

+type Return_As_Table
+    ## Returns a table with a new column `Value` containing the objects loaded
+       from each file.
+
+       When the source for files to load was a table, all columns from the
+       original table are also retained. In case of name clashes, the newly
+       added columns will get a suffix.
+
+       When the source was a simple Vector, the returned table will also contain
+       a `File Name` column.
+    Table_Of_Objects


The name Table_Of_Objects was a working name. I'm not sure if we want to keep it or come up with something better?

Definitely Table_Of_Tables in Excel sounded much better. But here it may not necessarily be tables, it may be various objects. So Table_Of_Objects is quite precise, but it sounds weird.

Maybe we should change approach altogether and try something like Added_Column? (meaning that the loaded files are added as a column to the input)

Values_Column

Table_With_New_Column

any thoughts on these names?

I think Table_With_New_Column or Table_Of_Objects seem the best to me, Values_Column I think could end up being confusing

How about more of a boolean? Merged/UnMerged. Merged True/False? I feel like it would be hard for someone to guess what table_of_objects is going to do without reading the help or trying it

I wonder if:

    type Return_As
        ## Single table with values expanded
        Merged_Table

        ## PRIVATE
           Add this for backwards compatibility then hide in widget?
        Table_Of_Tables

        ##
        Individual_Cells

        ## Objects returned as a raw vector allowing zipping to input
        Vector

Not sure on name yet either - will think some more

> How about more of a boolean? Merged/UnMerged. Merged True/False?

But note that Merged_Table has additional arguments (columns_to_keep and match) that are only relevant in this mode. Switching to a boolean would make this problematic.

Plus dropdown options are clearer in the GUI generally.
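
To illustrate the point about mode-specific arguments, here is a minimal sketch in Python (a model only, not Enso and not the code in this PR). The argument names `columns_to_keep` and `match` come from the comment above; the class names, the string values and the defaults are hypothetical. Only the merged mode carries settings, which a plain True/False flag could not express.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class RawVector:
    """Return the loaded objects as a plain vector; no extra settings."""


@dataclass(frozen=True)
class TableOfObjects:
    """Return a table with the loaded objects in a new column; no extra settings."""


@dataclass(frozen=True)
class MergedTable:
    """Merge the loaded tables into one; only this mode has extra settings."""
    columns_to_keep: str = "In_Any"  # hypothetical default
    match: str = "By_Name"           # hypothetical default


# One option type with three cases, as a dropdown would present it.
ReturnAs = Union[RawVector, TableOfObjects, MergedTable]


def describe(mode: ReturnAs) -> str:
    """Summarise what a given return mode would produce."""
    if isinstance(mode, MergedTable):
        return (
            "merged table "
            f"(columns_to_keep={mode.columns_to_keep}, match={mode.match})"
        )
    if isinstance(mode, TableOfObjects):
        return "table with a new column of loaded objects"
    return "vector of loaded objects"


print(describe(MergedTable(match="By_Position")))
print(describe(RawVector()))
```

With a boolean, `columns_to_keep` and `match` would have to live next to the flag and be silently ignored whenever merging is off.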

Yes sorry that wasn't clear. When I said boolean I didn't mean a boolean type. More text options that were Merged or Not Merged.

Vector is handled by Return_As_Base - the type needs to be split up so that the Vector return works even if Table is not imported. It is opaque to the user - the dropdown displays all available types and autoscoping works as shown in tests.

I'd not add Table_Of_Tables to Data.read_many. It is supported in Excel_Workbook.read_many to retain backwards compatibility. But adding it to a brand new function seems unnecessary - why add an obsolete option?

I was in my head thinking of it as a single type used by both functions but I appreciate we are actually building this dynamically. Just for working out how it might look in the dropdown thinking as a single type is a bit easier!

If the ..Table_Of_Tables was an option not shown in the dropdown but in one of the mix in types then no special handling would be needed for the two reads to share the same object was my thought.

I wonder if:

..As_Vector

..With_New_Column

..As_Merged_Table

Might be cleanest?

> If the ..Table_Of_Tables was an option not shown in the dropdown but in one of the mix in types then no special handling would be needed for the two reads to share the same object was my thought.

But then more code would be affected by the legacy options, I'd rather not grow places that take legacy options when not needed. Unless there's any problem with the current approach, I'd keep the compatibility conversion as-is.

> I wonder if:
>
> ..As_Vector
>
> ..With_New_Column
>
> ..As_Merged_Table
>
> Might be cleanest?

That sounds quite clear to me 👍

Yes I like those options

radeusgd · 2024-11-14T14:14:45Z

distribution/lib/Standard/Table/0.0.0-dev/src/Match_Columns.enso

@@ -16,146 +7,3 @@ type Match_Columns

       Note: column names are not compared.
    By_Position
-


No meaningful code changes in this file, just moving the helpers to a separate module, Match_Columns_Helpers.

radeusgd · 2024-11-15T15:39:17Z

I've got one remaining failing test:

I'm wondering what to do about it. At the moment of the union, Sheet Name is no more special than any other column: the order of the columns in the result depends on the order in which they appear in the inputs. So if the first files are .tsv and not .xls, Sheet Name appears after the .tsv columns. What do you think? Is that an issue?

Ideally the column should be first always. In most cases, when loading only Excel files (or just enough that an .xls file goes first) it will already be first. I'm wondering if we should introduce additional logic to handle this edge case where a .tsv comes first and possibly changes the ordering.

What's more important, I guess, is that when matching By_Position, the Sheet Name column could get merged with the first column of the .tsv. That is actually problematic and probably needs some special logic. I will add a test for it later and try to address it.
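
The By_Position problem can be reproduced with a small Python model (an illustration only, not the Enso implementation). Tables are modelled as ordered lists of `(name, values)` pairs; `union_by_position` is a hypothetical helper that stacks column i of every input into column i of the result, and the sample data is made up.

```python
def union_by_position(tables):
    """Naive positional union: column i of every table is appended to
    column i of the result, which is named after the first table."""
    first = tables[0]
    result = [(name, list(values)) for name, values in first]
    for table in tables[1:]:
        # Extra trailing columns of later tables are dropped in this model.
        for i, (_, values) in enumerate(table[:len(result)]):
            result[i][1].extend(values)
    return result


# A .tsv file has only data columns.
tsv = [("A", [1, 2]), ("B", ["x", "y"])]
# An Excel sheet carries an extra leading Sheet Name column.
xls = [("Sheet Name", ["Sheet1"]), ("A", [3]), ("B", ["z"])]

merged = union_by_position([tsv, xls])
print(merged)
# [('A', [1, 2, 'Sheet1']), ('B', ['x', 'y', 3])]
```

The sheet name lands in column `A`, the Excel `A` values shift into `B`, and the Excel `B` values are lost.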

Edit: This has been addressed.

I split the columns into 3 kinds: input table, metadata and data; each is handled separately. Metadata columns are always matched by name and all are kept (In_Any); the matching settings on As_Merged_Table only affect the data columns. Tests were added (mostly to Excel_Spec, as currently only Excel has 'metadata' columns) to illustrate this and check some edge cases.
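
A rough Python model of this split (a sketch under assumptions, not the code in this PR): each loaded table keeps its metadata and data columns apart, metadata is always unioned by name with every column kept, and only the data columns follow the chosen matching mode. The dictionary layout, the function name `merge_loaded`, the choice to place metadata first, and the omission of input-table columns are all assumptions made for brevity; data column names are assumed unique.

```python
def merge_loaded(tables, match="By_Position"):
    """Each table is a dict:
    {"rows": n, "metadata": {name: values}, "data": [(name, values), ...]}."""
    # 1. Metadata: always matched by name; keep every column seen in any
    #    input and pad with None where a table lacks it.
    meta_names = []
    for t in tables:
        for name in t["metadata"]:
            if name not in meta_names:
                meta_names.append(name)
    meta = {name: [] for name in meta_names}
    for t in tables:
        for name in meta_names:
            meta[name].extend(t["metadata"].get(name, [None] * t["rows"]))

    # 2. Data: only these columns follow the user's matching setting.
    if match == "By_Position":
        data_names = [name for name, _ in tables[0]["data"]]
        data = {name: [] for name in data_names}
        for t in tables:
            for i, name in enumerate(data_names):
                if i < len(t["data"]):
                    data[name].extend(t["data"][i][1])
                else:
                    data[name].extend([None] * t["rows"])
    else:  # "By_Name", keeping all columns
        data_names = []
        for t in tables:
            for name, _ in t["data"]:
                if name not in data_names:
                    data_names.append(name)
        data = {name: [] for name in data_names}
        for t in tables:
            by_name = dict(t["data"])
            for name in data_names:
                data[name].extend(by_name.get(name, [None] * t["rows"]))

    # Assumption: metadata columns are placed before the data columns.
    return list(meta.items()) + list(data.items())


tsv = {"rows": 2, "metadata": {}, "data": [("A", [1, 2]), ("B", ["x", "y"])]}
xls = {
    "rows": 1,
    "metadata": {"Sheet Name": ["Sheet1"]},
    "data": [("A", [3]), ("B", ["z"])],
}

print(merge_loaded([tsv, xls], match="By_Position"))
# [('Sheet Name', [None, None, 'Sheet1']), ('A', [1, 2, 3]), ('B', ['x', 'y', 'z'])]
```

Because Sheet Name never takes part in positional matching, a .tsv coming first no longer shifts the Excel data columns.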


radeusgd added the "CI: No changelog needed" label Nov 13, 2024

radeusgd self-assigned this Nov 13, 2024

radeusgd force-pushed the wip/radeusgd/11311-read-many-to-table branch 2 times, most recently from 4c4b4e0 to 68df118 Compare November 14, 2024 13:44

radeusgd marked this pull request as ready for review November 14, 2024 13:58

radeusgd requested review from jdunkerley, GregoryTravis, AdRiley and marthasharkey as code owners November 14, 2024 13:58


radeusgd added 19 commits November 15, 2024 14:23

WIP merging tables new type

8326daf

basic logic for returning tables and merging

e50e0b0

WIP dependent defaults

ff3d6ab

updating tests

73ee0e0

big merge test

a7026cd

more tests

bb202ce

fixes to make it compile

faf4c8e

checkpoint

02fa764

adding/fixing widget

813f8cb

fix test

8af415a

better to_text for Excel Workbook

7cd6f71

move excel tests, but keep for later

4825597

checkpoint widgets, adding args, moving towards union

806733a

WIP

6281296

WIP

9d1751d

compat

da7b165

ensure backwards compatibility for Excel read_many

6fe5f33

adding tests for union behaviour, empty tables, mixing types etc.

2c59261

spec for Excel

73f906f

radeusgd added 15 commits November 15, 2024 14:23

yet another union edge case

48cf53f

empty tables working

6f92c9c

WIP empty arrays

1b903ba

refactor

67e1dbb

test union params

50be1a5

syntax

c0a793d

union edge case on empty table

7a62de4

update spec after discussion

909df75

WIP

49b79e3

DRY

008ae5d

more edge cases

5262bed

fix edge case with no / empty tables

c942a78

yet another edge case for #6281

8352acb

handling weird objects

6d74597

workaround for #11570

eead84a

radeusgd force-pushed the wip/radeusgd/11311-read-many-to-table branch from 771a52c to eead84a Compare November 15, 2024 13:23

radeusgd added 5 commits November 15, 2024 14:42

CR: rename according to discussion

fd0f5fe

fix forgotten to rename

7e0c6a4

fix typo

3e2fe23

improve tests - TODO: expand excel!

55ba25c

custom handling to expand excel

17e2cab

radeusgd added 8 commits November 18, 2024 15:56

rename

b23687a

WIP adding tests for edge cases of metadata columns in read many - clashing names with Excel Sheet Name

183cf8e

edge cases with metadata in Excel

0879f85

separate metadata for merging

dff447e

fix test

22f5a91

imports

9d987cd

fix too private ctor

c719992

remove file that shouldve been ignored

d1e7ee4
