✨ enhance dataset.read_table(...) method with type normalisation #3277

Open
Marigold opened this issue Sep 12, 2024 · 1 comment

@Marigold
Collaborator

Marigold commented Sep 12, 2024

Motivation

Historically, we've used dataset["my_table"] to access a table from a dataset. Recently, a new helper method dataset.read_table(reset_index: bool = False) was added that lets us read the table with its index reset, which is significantly faster for large datasets with many dimensions.

Concept

We could add more functionality to read_table and make it the de facto standard way to read tables. It could:

  • Retype all columns to "standard" types (e.g. uint8 -> int64, Float16 -> float64) and categorical to string type
  • Underscore column names etc. (see the .format method)
    • Should already be in snake case by the time we're talking tables
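
A minimal sketch of what the retyping step might look like, assuming tables behave like pandas DataFrames. The name `normalise_types` is hypothetical and is not an existing function in the codebase; it only illustrates the uint8 -> int64, float16 -> float64 and categorical -> string mapping proposed above.

```python
import pandas as pd
from pandas.api import types as ptypes


def normalise_types(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: widen numeric columns and convert categoricals to strings.

    Assumes unique column names and no missing values in integer columns
    (casting to plain int64 would fail on missing values).
    """
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            # categorical -> string
            out[col] = out[col].astype("string")
        elif ptypes.is_integer_dtype(dtype):
            # any integer width (int8, uint8, ...) -> int64
            out[col] = out[col].astype("int64")
        elif ptypes.is_float_dtype(dtype):
            # any float width (float16, float32, ...) -> float64
            out[col] = out[col].astype("float64")
    return out


df = pd.DataFrame(
    {
        "count": pd.Series([1, 2], dtype="uint8"),
        "share": pd.Series([1.5, 2.5], dtype="float16"),
        "region": pd.Categorical(["x", "y"]),
    }
)
normalised = normalise_types(df)
```

Note that very large uint64 values would not fit in int64, so a real implementation would need to decide how to handle that case.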
@larsyencken larsyencken changed the title from "✨ enhance dataset.read_table(...) method" to "✨ enhance dataset.read_table(...) method with type normalisation" on Oct 17, 2024
@larsyencken
Collaborator

larsyencken commented Oct 17, 2024

We think type standardising would remove some common footguns for data folk:

  • categorical -> string
  • Float* -> Float64
  • Int* -> Int64

Maybe this should live in the repack module.

It would be nice to turn this on by default.
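
As a rough illustration of the "on by default" idea, here is a sketch that targets pandas' nullable extension dtypes (Int64, Float64, string) and exposes an opt-out flag. Both `to_nullable_standard` and `read_table_sketch`, and the `normalise_types` parameter, are hypothetical names for illustration; a plain dict stands in for the dataset, and none of this reflects the actual read_table signature.

```python
import pandas as pd
from pandas.api import types as ptypes


def to_nullable_standard(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: cast columns to Int64 / Float64 / string."""
    out = df.copy()
    for col in out.columns:
        dtype = out[col].dtype
        if isinstance(dtype, pd.CategoricalDtype):
            out[col] = out[col].astype("string")
        elif ptypes.is_integer_dtype(dtype):
            # Int* -> Int64 (nullable, so missing values survive the cast)
            out[col] = out[col].astype("Int64")
        elif ptypes.is_float_dtype(dtype):
            # Float* -> Float64
            out[col] = out[col].astype("Float64")
    return out


def read_table_sketch(
    tables: dict, name: str, normalise_types: bool = True
) -> pd.DataFrame:
    """Stand-in for read_table: normalisation is on unless the caller opts out."""
    df = tables[name]
    return to_nullable_standard(df) if normalise_types else df


tables = {
    "my_table": pd.DataFrame(
        {
            "year": pd.Series([2020, None], dtype="Int32"),
            "value": pd.Series([0.5, 1.5], dtype="float32"),
            "country": pd.Categorical(["a", "b"]),
        }
    )
}
standard = read_table_sketch(tables, "my_table")
raw = read_table_sketch(tables, "my_table", normalise_types=False)
```

Keeping the flag as an explicit keyword would let existing callers that depend on the original dtypes opt out while the default behaviour changes.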
