improve handling of prefixes in io #4083

Open
ajpotts opened this issue Feb 5, 2025 · 1 comment
Comments

@ajpotts
Contributor

ajpotts commented Feb 5, 2025

This example illustrates some weaknesses of using filename prefixes to identify Arkouda data.

The following code executes correctly.

df = ak.DataFrame({"x1":ak.arange(100)})
df2 = ak.DataFrame({"x1":ak.arange(111)})
df3 = ak.DataFrame({"x1":ak.arange(119)})

df.to_parquet("df")
df2.to_parquet("df2")
df3.to_parquet("d")

df = ak.DataFrame(ak.read_parquet("df_*"))
df2 = ak.DataFrame(ak.read_parquet("df2*"))
df3 = ak.DataFrame(ak.read_parquet("d_*"))

However, notice the _ appended to the prefix passed to read_parquet. Removing the _ causes similarly named files to be matched together:

In [7]: df = ak.DataFrame(ak.read_parquet("df*"))

In [8]: df.size
Out[8]: 211

Note that 211 = 100 + 111: the df2 files were read in along with df's. Worse, the bare prefix d also matches unrelated paths:

In [9]: df3 = ak.DataFrame(ak.read_parquet("d*"))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[9], line 1
----> 1 df3 = ak.DataFrame(ak.read_parquet("d*"))

File ~/git/arkouda/arkouda/io.py:863, in read_parquet(filenames, datasets, iterative, strict_types, allow_errors, tag_data, read_nested, has_non_float_nulls, fixed_len)
    849     return {
    850         dset: read_parquet(
    851             filenames,
   (...)
    860         for dset in datasets
    861     }
    862 else:
--> 863     rep_msg = generic_msg(
    864         cmd="readAllParquet",
    865         args={
    866             "strict_types": strict_types,
    867             "dset_size": len(datasets),
    868             "filename_size": len(filenames),
    869             "allow_errors": allow_errors,
    870             "dsets": datasets,
    871             "filenames": filenames,
    872             "tag_data": tag_data,
    873             "has_non_float_nulls": has_non_float_nulls,
    874             "fixed_len": fixed_len,
    875         },
    876     )
    877     rep = json.loads(rep_msg)  # See GenSymIO._buildReadAllMsgJson for json structure
    878     _parse_errors(rep, allow_errors)

File ~/git/arkouda/arkouda/client.py:1012, in generic_msg(cmd, args, payload, send_binary, recv_binary)
   1010     else:
   1011         assert payload is None
-> 1012         return cast(Channel, channel).send_string_message(
   1013             cmd=cmd, args=msg_args, size=size, recv_binary=recv_binary
   1014         )
   1015 except KeyboardInterrupt as e:
   1016     # if the user interrupts during command execution, the socket gets out
   1017     # of sync reset the socket before raising the interrupt exception
   1018     cast(Channel, channel).connect(timeout=0)

File ~/git/arkouda/arkouda/client.py:534, in ZmqChannel.send_string_message(self, cmd, recv_binary, args, size, request_id)
    532 # raise errors or warnings sent back from the server
    533 if return_message.msgType == MessageType.ERROR:
--> 534     raise RuntimeError(return_message.msg)
    535 elif return_message.msgType == MessageType.WARNING:
    536     warnings.warn(return_message.msg)

RuntimeError: Other error in accessing file dep: ParquetError 380 getArrSize:ParquetMsg Cannot open for reading: path 'dep' is a directory
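
A minimal sketch of the underlying collision, using Python's fnmatch to stand in for the server-side glob (the per-locale file names below assume Arkouda's usual <prefix>_LOCALE#### naming convention, and the stray dep entry mirrors the directory that triggered the ParquetError above):

import fnmatch

# Hypothetical directory contents after the three to_parquet calls above,
# plus an unrelated "dep" directory that happens to share the prefix.
entries = [
    "df_LOCALE0000", "df_LOCALE0001",
    "df2_LOCALE0000", "df2_LOCALE0001",
    "d_LOCALE0000", "d_LOCALE0001",
    "dep",
]

for pattern in ["df_*", "df*", "d_*", "d*"]:
    print(pattern, "->", fnmatch.filter(entries, pattern))

# df_* and d_* each isolate one dataset, but df* also matches the df2
# chunks, and d* matches everything, including the "dep" directory.
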
@e-kayrakli
Contributor

Thanks for exploring this, @ajpotts!

Copy/pasting my comments under the PR above for better visibility:

I am worried about removing files based on name-matching. In other words, the user could have a file whose name matches the pattern purely by coincidence. I think there are two things that could mitigate that:

  • Save per-locale files in a directory. This doesn't eliminate the problem as the user may just create a problematic file in the given directory, which would, again, be removed.
  • Like above, but also add a metadata file. The directory could contain a metadata file that lists all the files that represent chunks of an array. Instead of matching files by name, we can read that metadata and delete files based on the names stored there.

To elaborate more, I prefer:

df.to_parquet("df")
df2.to_parquet("df2")
df3.to_parquet("d")

to result in df.metadata, df2.metadata and d.metadata files (I am not wedded to extensions; I use md in #3915, but that reads as "markdown"). These metadata files can then store information about the actual data files, which may or may not live in the same path as the metadata. This would also allow storing the metadata and the actual data on different file systems if need be. In that world, you wouldn't need to glob as in

df = ak.DataFrame(ak.read_parquet("df_*"))
df2 = ak.DataFrame(ak.read_parquet("df2*"))
df3 = ak.DataFrame(ak.read_parquet("d_*"))

which could simply be

df = ak.DataFrame(ak.read_parquet("df"))
df2 = ak.DataFrame(ak.read_parquet("df2"))
df3 = ak.DataFrame(ak.read_parquet("d"))

where the arguments represent the metadata names.
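
As a rough sketch of what that could look like (the .metadata format and the helper names below are hypothetical, not an existing Arkouda API), the writer could emit a JSON sidecar listing the chunk files, and the reader could resolve file names from it instead of globbing:

import json

def write_metadata(prefix, num_locales):
    # Hypothetical: record the exact per-locale chunk files for `prefix`.
    chunks = [f"{prefix}_LOCALE{i:04d}" for i in range(num_locales)]
    with open(f"{prefix}.metadata", "w") as f:
        json.dump({"prefix": prefix, "files": chunks}, f)

def files_for(prefix):
    # Hypothetical: read back the exact file list; no pattern matching,
    # so "df" can never pick up "df2" chunks or a stray "dep" directory.
    with open(f"{prefix}.metadata") as f:
        return json.load(f)["files"]

With file lists resolved this way, read_parquet("df") would open df.metadata and read exactly the files recorded there.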
