improve handling of prefixes in io #4083

Open
ajpotts opened this issue Feb 5, 2025 · 1 comment
Comments

@ajpotts
Contributor

ajpotts commented Feb 5, 2025

This example illustrates some weaknesses of using filename prefixes to identify Arkouda data.

The following code executes correctly.

df = ak.DataFrame({"x1":ak.arange(100)})
df2 = ak.DataFrame({"x1":ak.arange(111)})
df3 = ak.DataFrame({"x1":ak.arange(119)})

df.to_parquet("df")
df2.to_parquet("df2")
df3.to_parquet("d")

df = ak.DataFrame(ak.read_parquet("df_*"))
df2 = ak.DataFrame(ak.read_parquet("df2*"))
df3 = ak.DataFrame(ak.read_parquet("d_*"))

However, notice the _ appended to the prefix passed to read_parquet. Removing the _ causes similarly named files to be matched together:

In [7]: df = ak.DataFrame(ak.read_parquet("df*"))

In [8]: df.size
Out[8]: 211

Note that 211 = 100 + 111: the df2 files were read in along with df's. Worse, the bare prefix d also matches unrelated paths:

In [9]: df3 = ak.DataFrame(ak.read_parquet("d*"))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 df3 = ak.DataFrame(ak.read_parquet("d*"))

File ~/git/arkouda/arkouda/io.py:863, in read_parquet(filenames, datasets, iterative, strict_types, allow_errors, tag_data, read_nested, has_non_float_nulls, fixed_len)
    849     return {
    850         dset: read_parquet(
    851             filenames,
   (...)
    860         for dset in datasets
    861     }
    862 else:
--> 863     rep_msg = generic_msg(
    864         cmd="readAllParquet",
    865         args={
    866             "strict_types": strict_types,
    867             "dset_size": len(datasets),
    868             "filename_size": len(filenames),
    869             "allow_errors": allow_errors,
    870             "dsets": datasets,
    871             "filenames": filenames,
    872             "tag_data": tag_data,
    873             "has_non_float_nulls": has_non_float_nulls,
    874             "fixed_len": fixed_len,
    875         },
    876     )
    877     rep = json.loads(rep_msg)  # See GenSymIO._buildReadAllMsgJson for json structure
    878     _parse_errors(rep, allow_errors)

File ~/git/arkouda/arkouda/client.py:1012, in generic_msg(cmd, args, payload, send_binary, recv_binary)
   1010     else:
   1011         assert payload is None
-> 1012         return cast(Channel, channel).send_string_message(
   1013             cmd=cmd, args=msg_args, size=size, recv_binary=recv_binary
   1014         )
   1015 except KeyboardInterrupt as e:
   1016     # if the user interrupts during command execution, the socket gets out
   1017     # of sync reset the socket before raising the interrupt exception
   1018     cast(Channel, channel).connect(timeout=0)

File ~/git/arkouda/arkouda/client.py:534, in ZmqChannel.send_string_message(self, cmd, recv_binary, args, size, request_id)
    532 # raise errors or warnings sent back from the server
    533 if return_message.msgType == MessageType.ERROR:
--> 534     raise RuntimeError(return_message.msg)
    535 elif return_message.msgType == MessageType.WARNING:
    536     warnings.warn(return_message.msg)

RuntimeError: Other error in accessing file dep: ParquetError 380 getArrSize:ParquetMsg Cannot open for reading: path 'dep' is a directory
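
A minimal sketch of the underlying collision, using Python's fnmatch to stand in for the server-side glob (the per-locale file names below assume Arkouda's usual <prefix>_LOCALE#### naming convention, and the stray dep entry mirrors the directory that triggered the ParquetError above):

import fnmatch

# Hypothetical directory contents after the three to_parquet calls above,
# plus an unrelated "dep" directory that happens to share the prefix.
entries = [
    "df_LOCALE0000", "df_LOCALE0001",
    "df2_LOCALE0000", "df2_LOCALE0001",
    "d_LOCALE0000", "d_LOCALE0001",
    "dep",
]

for pattern in ["df_*", "df*", "d_*", "d*"]:
    print(pattern, "->", fnmatch.filter(entries, pattern))

# df_* and d_* each isolate one dataset, but df* also matches the df2
# chunks, and d* matches everything, including the "dep" directory.
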
@e-kayrakli
Contributor

Thanks for exploring this, @ajpotts!

Copy/pasting my comments under the PR above for better visibility:

I am worried about removing files based on name-matching. In other words, the user could have a file whose name matches the pattern purely by coincidence. I think there are two things that could mitigate that:

  • Save per-locale files in a directory. This doesn't eliminate the problem as the user may just create a problematic file in the given directory, which would, again, be removed.
  • Like above, but also add a metadata file. The directory could contain a metadata file that lists all the files that represent chunks of an array. Instead of matching files by name, we can read that metadata and delete files based on the names stored there.

To elaborate more, I prefer:

df.to_parquet("df")
df2.to_parquet("df2")
df3.to_parquet("d")

to result in df.metadata, df2.metadata and d.metadata files (I am not wedded to extensions; I use md in #3915, but that reads as "markdown"). These metadata files can then store information about the actual data files, which may or may not live in the same path as the metadata. This would also allow storing the metadata and the actual data on different file systems if need be. In that world, you wouldn't need to glob as in

df = ak.DataFrame(ak.read_parquet("df_*"))
df2 = ak.DataFrame(ak.read_parquet("df2*"))
df3 = ak.DataFrame(ak.read_parquet("d_*"))

which could simply be

df = ak.DataFrame(ak.read_parquet("df"))
df2 = ak.DataFrame(ak.read_parquet("df2"))
df3 = ak.DataFrame(ak.read_parquet("d"))

where the arguments represent the metadata names.
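
As a rough sketch of what that could look like (the .metadata format and the helper names below are hypothetical, not an existing Arkouda API), the writer could emit a JSON sidecar listing the chunk files, and the reader could resolve file names from it instead of globbing:

import json

def write_metadata(prefix, num_locales):
    # Hypothetical: record the exact per-locale chunk files for `prefix`.
    chunks = [f"{prefix}_LOCALE{i:04d}" for i in range(num_locales)]
    with open(f"{prefix}.metadata", "w") as f:
        json.dump({"prefix": prefix, "files": chunks}, f)

def files_for(prefix):
    # Hypothetical: read back the exact file list; no pattern matching,
    # so "df" can never pick up "df2" chunks or a stray "dep" directory.
    with open(f"{prefix}.metadata") as f:
        return json.load(f)["files"]

With file lists resolved this way, read_parquet("df") would open df.metadata and read exactly the files recorded there.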
