Add a Dataset Class to Load Datasets from Snowflake #71
base: main
Conversation
@@ -58,3 +58,4 @@ typing-extensions==3.10.0.2
 urllib3==1.26.7
 wandb==0.12.21
 tqdm
+snowflake-connector-python[pandas]==3.12.1
We need to add this library to communicate with Snowflake.
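For context, here is a minimal sketch of how this connector is typically used; all identifiers and credentials below are placeholders, not values from this PR:

```python
import snowflake.connector

# Placeholder connection parameters; real values would come from configuration.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_warehouse",
    database="my_database",
    schema="my_schema",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT * FROM my_table LIMIT 10")
    df = cur.fetch_pandas_all()  # needs the [pandas] extra pinned in requirements
finally:
    conn.close()
```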
wicker/schema/schema.py (Outdated)
    @abc.abstractmethod
    def process_sf_variant_field(self, field: ObjectField) -> _T:
        pass
Is it better to add a function to handle the new field type, or better to expand the existing function to handle `str` instances?
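The first option, sketched loosely below (the visitor class name is invented; only the new abstract method mirrors the diff above), is to give the schema visitor a dedicated hook for the new field type:

```python
import abc
from typing import Generic, TypeVar

_T = TypeVar("_T")


class SchemaVisitorSketch(abc.ABC, Generic[_T]):
    """Hypothetical visitor: each field type gets its own processing hook."""

    @abc.abstractmethod
    def process_object_field(self, field: "ObjectField") -> _T:
        ...

    @abc.abstractmethod
    def process_sf_variant_field(self, field: "ObjectField") -> _T:
        # New hook for Snowflake VARIANT-backed fields, as in the diff above.
        ...
```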
wicker/schema/codecs.py (Outdated)
@@ -58,7 +58,7 @@ def validate_and_encode_object(self, obj: Any) -> bytes:
        pass

    @abc.abstractmethod
-    def decode_object(self, data: bytes) -> Any:
+    def decode_object(self, data: bytes | str) -> Any:
An alternative approach is to accept `str` values in `decode_object`.
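That alternative, roughly, is to branch on the input type inside the codec; this is a sketch with an invented codec class, not code from the PR:

```python
import io
import json
from typing import Any, Union

import numpy as np


class VariantTolerantNumpyCodec:
    """Hypothetical codec: keeps the bytes path, adds a str (VARIANT JSON) path."""

    def decode_object(self, data: Union[bytes, str]) -> Any:
        if isinstance(data, str):
            # Snowflake VARIANT values arrive as JSON strings, e.g. "[1.0, 2.0]".
            return np.asarray(json.loads(data))
        # Existing behaviour: decode the raw bytes payload (npy shown as an example).
        return np.load(io.BytesIO(data), allow_pickle=False)
```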
        dataset_version: str,
        dataset_partition_name: str,
        table_name: str,
        connection_parameters: Optional[Dict[str, str]] = None,
From a consistency perspective, we probably need to move this into the Wicker config, as the S3 dataset does.
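For illustration, such a config section could look roughly like this; the field names are invented, not Wicker's actual config schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SnowflakeConfigSketch:
    """Hypothetical Wicker config section mirroring the existing S3 settings."""

    account: str
    user: str
    warehouse: str
    database: str
    schema: str
    # Secrets should ideally come from the environment, not the config file.
    password: Optional[str] = None
```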
for name, type in zip(schema_table["lowercase_name"], schema_table["type"]):
    schema = self._get_schema_type(type.as_py())
    if schema == SfNumpyField:
        schema_instance = schema(name=name.as_py(), shape=(1, -1), dtype="float32")  # type: ignore
Why only float32?
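If float32 is not always appropriate, one possible approach is a small mapping from Snowflake column types to dtypes instead of a hard-coded value; the type names below are illustrative and would need checking against the actual schema table:

```python
# Illustrative mapping; the real Snowflake type names and desired dtypes
# would need to be confirmed for this dataset.
SNOWFLAKE_TYPE_TO_DTYPE = {
    "FLOAT": "float64",
    "NUMBER": "float64",
    "VARIANT": "float32",  # the PR's current default
}


def dtype_for(snowflake_type: str) -> str:
    return SNOWFLAKE_TYPE_TO_DTYPE.get(snowflake_type.upper(), "float32")
```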
@@ -318,3 +318,6 @@ def process_object_field(self, field: schema.ObjectField) -> Any:
            return data
        cbf_info = ColumnBytesFileLocationV1.from_bytes(data)
        return self.cbf_cache.read(cbf_info)

    def process_sf_variant_field(self, field: schema.VariantField) -> Any:
Do you want to store the reference in the S3-backed column files? I don't think we should store numpy arrays in the column files; that would generate tons of small files, which would hurt both data governance and loading performance.
    )

    @property
    def connection(self):
I am assuming this property should be private, WDYT?
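A possible shape for that suggestion (sketch only; attribute names are invented):

```python
import snowflake.connector


class SFConnectionHolderSketch:
    """Hypothetical: keep the connection behind a private, lazily-built property."""

    def __init__(self, connection_parameters: dict) -> None:
        self._connection_parameters = connection_parameters
        self._conn = None

    @property
    def _connection(self):
        # Lazily open and cache the connection; the leading underscore keeps it
        # out of the public API, as the review suggests.
        if self._conn is None:
            self._conn = snowflake.connector.connect(**self._connection_parameters)
        return self._conn
```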
        return self._schema

    def arrow_table(self) -> pyarrow.Table:
        """Returns a table of the dataset as pyarrow table.
Generally speaking, I am not sure whether we should wrap the Snowflake database table as a pyarrow table.
- Loading performance: every data loading process needs its own copy of this dataset, i.e. a connection to Snowflake and a remote Arrow table load. Even though Snowflake is designed as an OLAP system rather than OLTP, the DB service overhead should theoretically be much higher than a direct S3 connection.
- Data volume: pyarrow loading from a local file (even one downloaded from S3) uses mmap and can theoretically handle data volumes larger than memory, but your approach loads everything into memory, with no local file-based cache.
The design principle of Wicker is to use an ETL stage (the data dumping pipeline) to transform the format into parquet plus column files (the heavy-duty fields, which should only be used for large values such as images), making it more IO-friendly for the GPU instances, shortening loading time and saving money.
Based on the above, I would request changes on this PR; let's have a face-to-face discussion.
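To make the contrast concrete, here is a rough sketch of the two access patterns being compared; paths, credentials, and table names are placeholders:

```python
import pyarrow.parquet as pq
import snowflake.connector

# Wicker's current S3-backed path: the parquet file was already materialised
# locally by the ETL stage, so it can be memory-mapped instead of fully loaded.
local_table = pq.read_table("/tmp/wicker/my_dataset/part-0.parquet", memory_map=True)

# This PR's Snowflake-backed path: each loader process opens a connection and
# pulls the whole result set into memory as an Arrow table.
conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
remote_table = conn.cursor().execute("SELECT * FROM my_table").fetch_arrow_all()
```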
@@ -404,6 +404,147 @@ def __init__(
        )


class VariantField(SchemaField):
The schema was introduced to give a clear data type definition, so personally I don't think we should add a new VariantField.
@@ -87,6 +87,12 @@ def process_object_field(self, field: schema.ObjectField) -> Optional[Any]:
            return data
        return field.codec.decode_object(data)

    def process_sf_variant_field(self, field: schema.VariantField) -> Optional[Any]:
The field type should not be tied to the storage backend (here, Snowflake); I would suggest adding more abstraction.
We could add a backend field to the schema definition, similar to the heavy-duty flag: default S3, optionally Snowflake. If the backend is Snowflake, we call the Snowflake-specific decoder even for the same data type, e.g. numpy.
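A rough illustration of that idea; the names below are invented and not Wicker's actual API:

```python
import json
from dataclasses import dataclass
from enum import Enum

import numpy as np


class StorageBackend(Enum):
    S3 = "s3"          # default: pointer into the S3-backed column files
    SNOWFLAKE = "sf"   # value comes back from a Snowflake VARIANT column


@dataclass
class NumpyFieldSketch:
    """Hypothetical field: the data type stays numpy; only the backend varies."""

    name: str
    dtype: str = "float32"
    backend: StorageBackend = StorageBackend.S3


def decode_value(field: NumpyFieldSketch, raw) -> np.ndarray:
    # Dispatch on the backend, not on a storage-specific field type.
    if field.backend is StorageBackend.SNOWFLAKE:
        return np.asarray(json.loads(raw), dtype=field.dtype)
    raise NotImplementedError("the S3 path would reuse the existing column-file decoding")
```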
@kds1010, could we close this PR?
Draft: marked as draft to consult on a better way to implement the dataset class.

1. Overview

This PR adds a Dataset class (`SFDataset`) to load data from Snowflake.

2. Implementation details

- `SFDataset` to load data from Snowflake
- `VARIANT` field to `ndarray`
- `SfNumpyField` to handle the `VARIANT` column with the codec

The main difficulty is mapping types between Snowflake and Wicker, especially `ObjectField` values (e.g. ndarray). Wicker handles these values by encoding them and dumping them to `__COLUMN_CONCATENATED_FILES__`, and each column stores the pointer to the values as `bytes` in `ObjectField`. Snowflake stores these labels (e.g. bbox coordinates) as the `VARIANT` class, which is like a `JsonString`, and we need to parse it with `np.array(json.loads(value))`. Since the input is a `str`, it is rejected by the existing `decode_object(self, data: bytes)` code shown above. So, is it better to add a new field to handle `str`, or to expand the existing one to accept `str`, given that the function to decode the data is up to the implementation of the `Codec` class?
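For reference, a minimal sketch of the VARIANT-to-ndarray parsing described above, assuming the VARIANT holds a JSON array such as bbox coordinates:

```python
import json

import numpy as np

# A Snowflake VARIANT value typically reaches the client as a JSON string.
variant_value = "[[10.0, 20.0, 110.0, 220.0], [30.0, 40.0, 130.0, 240.0]]"

# np.array(json.loads(value)), as described in the PR.
bboxes = np.array(json.loads(variant_value))
assert bboxes.shape == (2, 4)
```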