Race conditions when writing / reading parquet datasets #3021

pvieito · 2024-11-15T08:12:30Z

Hi! Currently, writing a parquet dataset with mode overwrite / overwrite_partitions creates a race condition between the writer and any reader (aws-wrangler / Spark / Athena, for example), because aws-wrangler first removes the files in each partition and then creates objects with new random UUID-based names.

This behaviour is quite unsafe: any reader that lists the objects during the overwrite and then tries to read them will fail with one of these errors (or worse, fail silently, because it listed the path just after aws-wrangler removed all the files and sees an empty dataset):

botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
Athena: HIVE_CANNOT_OPEN_SPLIT errors
etc.

We would like a new option to ensure that, in overwrite & overwrite_partitions modes, aws-wrangler does a safe, deterministic & atomic replacement of the destination objects. This could be done as follows:

1. Use deterministic output names (for example part-0.parquet, part-1.parquet).
2. Atomically replace any existing files in the output path.
3. Finally, clean up any extra files that are not expected in the output path (for example, if the new upload has fewer part files).
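
The three steps above could be sketched roughly as follows. This is only an illustration of the proposal, not existing aws-wrangler behaviour: `part_key`, `list_keys` and `safe_overwrite` are hypothetical names, `s3` is assumed to be a boto3 S3 client, and the parts are assumed to be already-serialized parquet bytes. It relies on a single-object S3 PUT replacing the existing object in one step, so a reader of a given key gets either the old or the new object in full.

```python
from typing import Any, List


def part_key(prefix: str, index: int) -> str:
    """Deterministic object key for part `index` (hypothetical naming scheme)."""
    return f"{prefix.rstrip('/')}/part-{index}.parquet"


def list_keys(s3: Any, bucket: str, prefix: str) -> List[str]:
    """List every key under `prefix`, following continuation tokens."""
    keys: List[str] = []
    kwargs = {"Bucket": bucket, "Prefix": prefix.rstrip("/") + "/"}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]


def safe_overwrite(s3: Any, bucket: str, prefix: str, parts: List[bytes]) -> List[str]:
    """Overwrite a partition in place, then delete leftovers.

    Returns the list of stale keys that were removed.
    """
    # Step 1: deterministic names, so new parts land on the old keys.
    expected = [part_key(prefix, i) for i in range(len(parts))]

    # Step 2: each PUT replaces the existing object under the same key.
    for key, body in zip(expected, parts):
        s3.put_object(Bucket=bucket, Key=key, Body=body)

    # Step 3: only after all uploads, remove keys that are no longer expected.
    expected_set = set(expected)
    stale = [k for k in list_keys(s3, bucket, prefix) if k not in expected_set]
    for start in range(0, len(stale), 1000):  # delete_objects takes up to 1000 keys
        batch = stale[start:start + 1000]
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
        )
    return stale
```

Note that this only makes each object replacement atomic, not the dataset as a whole: a reader could still see a mix of old and new parts mid-write, or hit a missing key if the number of parts shrinks.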

This would avoid the vast majority of race conditions, since in a typical overwrite the number of parts stays the same or increases.

// cc. @jack-dell

pvieito added the enhancement label on Nov 15, 2024