Dataset types should stand alone as a module #12

Open
mahiki opened this issue Aug 9, 2024 · 1 comment

Comments

@mahiki
Owner

mahiki commented Aug 9, 2024

When you are working only off local datastores, it's a bit clunky to have to define the connection to the Prefect API and the names of both the remote and local Prefect blocks.

Until you can define flow code in Julia scripts, there's no upside to the Prefect integration, since you are writing your flow code in Python and calling a Julia process.

Example of a local, exploratory Julia use case:

using DataFrames, UnicodePlots, PrefectInterfaces
ENV["PREFECT_API_URL"] = "http://127.0.0.1:4204/api"    # dev environment

# need to define both of these to use the `read(Dataset)` function
ENV["PREFECT_DATA_BLOCK_LOCAL"] = "local-file-system/datastore"
ENV["PREFECT_DATA_BLOCK_REMOTE"] = "s3-bucket/datastore"


dsz = Dataset(dataset_name = "my_cool_data_extract", datastore_type = "local")
dfz = read(dsz)
#   404×4 DataFrame
#   ..etc

If a 'Dataset' module (that name is already taken) could stand alone from PrefectInterfaces, you could bring it on as an extension when needed. In standalone mode, you'll need to define the filesystem block yourself instead of calling the API URL to get it:

# proposed standalone API (does not exist yet; names are illustrative)
using Dataset: LocalDatastore
dstore = LocalDatastore()
dstore.basepath = joinpath(homedir(), "toodata", "templisher", "dev")

And that's all you need to find datasets on your local system. You are working in Julia outside of any Prefect orchestration.
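As a rough sketch of what that standalone mode might look like, assuming nothing beyond the Julia standard library: `LocalDatastore`, `dataset_path`, and the `<basepath>/<dataset_name>.<file_type>` layout are all hypothetical names and conventions used for illustration, not the current PrefectInterfaces API.

```julia
# Hypothetical standalone datastore: only a base path, no Prefect block or API URL.
Base.@kwdef mutable struct LocalDatastore
    basepath::String = pwd()
end

# Assumed layout for illustration: <basepath>/<dataset_name>.<file_type>
function dataset_path(store::LocalDatastore, name::AbstractString;
                      file_type::AbstractString = "csv")
    return joinpath(store.basepath, string(name, ".", file_type))
end

dstore = LocalDatastore()
dstore.basepath = joinpath(homedir(), "toodata", "templisher", "dev")

# Resolve where a dataset would live; nothing is read from disk here.
path = dataset_path(dstore, "my_cool_data_extract")
```

The point of the sketch is that locating a dataset needs only a base path and a name, so no environment variables or running Prefect server are involved.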

@mahiki
Owner Author

mahiki commented Aug 28, 2024

A key part of this is that read(::Dataset) is currently implemented through a read_path function attached to a Prefect block, which gives it a very object-oriented structure.

The way I'm using Dataset, it's just a metadata reference, mostly carrying file path locations and local/remote labels.

I do not want to define a 'dataset' with a block; the only Prefect block reference needed is the base path to the data store.

  • Remove the read_path/write_path functionality. read(::Dataset) should take the data-type reader as an argument, and a dataset carries a data type such as "csv". The default would then be CSV.read, but it could be overridden with a keyword argument or dispatched on the Dataset type somehow.
  • prefect_block.block.read_path(path_key) is awkward: the function is stored as a struct field (struct.function) as part of the block definition.
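One way the first bullet could look, as a minimal sketch rather than a proposed implementation: `DatasetRef`, `default_reader`, and `read_csv_rows` are hypothetical names, and the stdlib-only row reader stands in for CSV.read so the example runs without extra packages. With CSV.jl and DataFrames.jl loaded, one could pass `reader = p -> CSV.read(p, DataFrame)` instead.

```julia
# Hypothetical metadata-only reference: a path and a data type label, no Prefect block.
struct DatasetRef
    path::String
    file_type::String
end

# Stand-in reader using only the standard library (a real setup might use CSV.read).
read_csv_rows(path) = [split(line, ',') for line in eachline(path)]

# Default reader chosen by dispatch on the data type label.
default_reader(::Val{:csv}) = read_csv_rows
default_reader(::Val) = path -> read(path, String)   # fallback: raw file contents

# The reader is an ordinary keyword argument, so callers can override it.
function Base.read(ds::DatasetRef; reader = default_reader(Val(Symbol(ds.file_type))))
    return reader(ds.path)
end

# Usage with a throwaway file
path = joinpath(mktempdir(), "demo.csv")
write(path, "a,b\n1,2\n")
ds = DatasetRef(path, "csv")

rows = read(ds)                                   # uses the "csv" default
raw  = read(ds; reader = p -> read(p, String))    # explicit override
```

Here the dataset stays a plain value and the reading logic is a function passed in or selected by dispatch, instead of a method carried on the block object.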

Again, this was borrowed from the way Prefect file blocks include read_path/write_path object methods, which creates too much coupling between Prefect's internal object-oriented structure and the structure of my data application.

I guess this is called a 'leaky abstraction'.
