Add an option to store casting errors in a separate field #723

Open
yruslan opened this issue Oct 25, 2024 · 0 comments
Labels
enhancement New feature or request

yruslan commented Oct 25, 2024

Background

Currently, if EBCDIC data fails to cast to the proper type, for example when invalid bytes are provided for COMP-3 decoding, Cobrix silently returns null.

It would be great if such casting errors were gathered in a special column of the returned dataset.

For comparison, spark-csv adds a '_corrupt_record' column when it can't parse a CSV record.

In Cobrix's case, the column name can be chosen by the user, and the column should contain an array of issues.

Feature

Add an option to store casting errors in a separate field.

Example

.option("decode_error_column", "errors")

Which might return something like:

{ 
   /*...*/
   "errors": [
      "Decoding error for COMP-3, bytes: 0x01231A",
      "Decoding error for COMP, 4 digits, overflow, number=12345, bytes: 0x011223"
   ]
}
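
To make the proposed output more concrete, here is a minimal sketch in Python (chosen for brevity; it is not Cobrix code) of a COMP-3 decoder that reports a message instead of only yielding null. The names `decode_comp3` and `decode_record` are hypothetical. The sketch also assumes that only the sign nibbles C, D and F are accepted, which is why the bytes 0x01231A from the example above are rejected; a real decoder may apply different rules.

```python
from typing import Dict, Optional, Tuple

# Assumption: only the preferred sign nibbles are accepted
# (C = positive, D = negative, F = unsigned).
_SIGNS = {0xC: 1, 0xD: -1, 0xF: 1}


def decode_comp3(data: bytes) -> Tuple[Optional[int], Optional[str]]:
    """Return (value, None) on success or (None, message) on failure."""
    if not data:
        return None, "Decoding error for COMP-3, empty field"
    # Split every byte into its high and low nibble.
    nibbles = []
    for b in data:
        nibbles.append(b >> 4)
        nibbles.append(b & 0x0F)
    *digits, sign = nibbles  # the last nibble carries the sign
    if sign not in _SIGNS or any(d > 9 for d in digits):
        return None, f"Decoding error for COMP-3, bytes: 0x{data.hex().upper()}"
    value = 0
    for d in digits:
        value = value * 10 + d
    return _SIGNS[sign] * value, None


def decode_record(fields: Dict[str, bytes]) -> dict:
    """Decode every field and gather failures into an 'errors' array."""
    row, errors = {}, []
    for name, raw in fields.items():
        value, error = decode_comp3(raw)
        row[name] = value  # stays None (null) when decoding fails
        if error is not None:
            errors.append(error)
    row["errors"] = errors
    return row
```

A failed field still comes back as null, so existing consumers see the same values, while the extra array explains why the value is missing.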

Proposed Solution

Add errors only if the setting is enabled, since collecting them might have an impact on performance and output size.
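
One way the opt-in behaviour could look is sketched below in Python, purely as an illustration and not as Cobrix's implementation. The names `decode_display` and `decode_row` are hypothetical, and the sketch uses a simple EBCDIC zoned-digit field (Python's `cp037` codec) as a stand-in for any decoder. The point it shows is that error messages are built, and the extra column added, only when an error column name is configured.

```python
from typing import Dict, Optional


def decode_display(raw: bytes) -> Optional[int]:
    """Decode unsigned EBCDIC digits; return None when the bytes are invalid."""
    text = raw.decode("cp037", errors="replace")
    if text.isascii() and text.isdigit():
        return int(text)
    return None


def decode_row(fields: Dict[str, bytes], error_column: Optional[str] = None) -> dict:
    """Decode a row; collect errors only when error_column is set."""
    collect = error_column is not None
    row = {}
    errors = []
    for name, raw in fields.items():
        value = decode_display(raw)
        row[name] = value
        # The message is formatted only when the option is enabled,
        # so the default path does no extra work per failed field.
        if collect and value is None:
            errors.append(f"Decoding error for DISPLAY, bytes: 0x{raw.hex().upper()}")
    if collect:
        row[error_column] = errors
    return row
```

With no column name given, the row has exactly the same shape as today; with `error_column="errors"`, every row gains an array that is empty when all fields decode cleanly.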

yruslan added the enhancement label on Oct 25, 2024