Add support for Apache.Arrow.Types.Decimal128Type #7094
base: main
Conversation
@asmirnov82 Can you please help review this?
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #7094      +/-   ##
==========================================
- Coverage   68.82%   68.47%   -0.35%
==========================================
  Files        1255     1262       +7
  Lines      250358   254272    +3914
  Branches    25550    26237     +687
==========================================
+ Hits       172310   174117    +1807
- Misses      71438    73473    +2035
- Partials     6610     6682      +72

Flags with carried forward coverage won't be shown.
@@ -124,6 +124,15 @@ private static void AppendDataFrameColumnFromArrowArray(Field field, IArrowArray
        }
        break;
    case ArrowTypeId.Decimal128:
Is it possible to include a test for this?
As far as I understand, Data.Analysis.DataFrame was initially designed to support the Apache.Arrow format and to allow constructing columns from Arrow arrays without memory copying (at least for reading; on any attempt to edit data, the Arrow buffers are copied anyway). But at some point new columns were added to the DataFrame that didn't follow this rule, for example StringDataFrameColumn, DateTimeDataFrameColumn and DecimalDataFrameColumn. These columns use an underlying data format that differs from Arrow's. For Arrow strings the DataFrame provides an alternative implementation.

One possible issue we have to pay attention to is that Arrow Decimal128 uses 16 bytes to store its data, and the user is able to configure precision and scale. As far as I know, .NET decimal has no concept of precision. It uses 3 DWORDs (12 bytes) to store the actual significand, and therefore has a maximum precision of 28 (not 34). So converting an Arrow Decimal128 to a .NET decimal may lose information.

As for the Decimal256 implementation, I see two possibilities: using …

@luisquintanilla @JakeRadMSFT @ericstj could you please provide your thoughts?
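The precision mismatch described above can be checked with a quick back-of-the-envelope calculation. This is a language-agnostic sketch in Python, not part of the PR:

```python
# Back-of-the-envelope check (illustrative only) of the precision
# mismatch between .NET decimal and Arrow Decimal128.

# System.Decimal stores its significand in 96 bits (3 DWORDs).
dotnet_max_significand = 2**96 - 1
# This is a 29-digit number, but not every 29-digit value fits,
# so .NET decimal only guarantees 28 digits of precision.
assert len(str(dotnet_max_significand)) == 29

# Arrow Decimal128 is a 128-bit scaled integer whose precision can be
# configured up to 38 decimal digits.
arrow_max_unscaled = 10**38 - 1

# A maximal-precision Arrow value cannot fit in a .NET decimal
# significand, so the conversion can lose information.
assert arrow_max_unscaled > dotnet_max_significand
```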
                    decimal128DataFrameColumn[i] = arrowDecimal128Array.GetValue(i);
                }
            }
            break;
This code converts from Arrow to DecimalDataFrameColumn. Could you please also add the backward conversion from DecimalDataFrameColumn to Arrow? It's done in the PrimitiveDataFrameColumn.ToArrowArray(long startIndex, int numberOfRows) method.
I don't think that proposal is something that's being actively worked on at this time, so probably not something that would happen in the .NET 9 timeframe.
That's worthwhile to call out in the docs. We should update the API docs here:
FWIW, it does look like Arrow has precedent for this conversion, though it also offers SqlDecimal. Neither can be done with zero-copy, but the latter wouldn't lose precision (I think): https://github.com/apache/arrow/blob/2babda0ba22740c092166b5c5d5d7aa9b4797953/csharp/src/Apache.Arrow/Arrays/Decimal128Array.cs#L42-L63

Another option would be to expose some types unique to this assembly that are meant to represent this data without copying. Not saying we should, but that's a possibility. Probably those types would be better to live in the Arrow project if at all. @tannergooding and @eerhardt might have more thoughts on this.
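The size trade-off mentioned above can be sanity-checked numerically. An illustrative Python sketch, not the Arrow C# code:

```python
# SqlDecimal keeps a 128-bit integer significand (4 x 32-bit words)
# plus separate precision, scale and sign bytes -- 160 bits overall,
# as noted in the discussion.
arrow_max_unscaled = 10**38 - 1   # Arrow Decimal128 at maximum precision

# Every 38-digit unscaled value fits in SqlDecimal's 128-bit
# significand, which is why the conversion can be lossless.
assert arrow_max_unscaled < 2**128
```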
If there's need for the IEEE 754 … It just isn't in scope to add the entirety of the approved API surface area, including operators.
Yes, SqlDecimal wouldn't lose precision for Arrow's decimal128, but that's at the cost of occupying 160 bits instead of 128. The IEEE Decimal128 isn't wide enough to hold the 38 digits of decimal precision supported by Arrow, though it is (as noted) better than the CLR decimal. (I think there were some backwards-compatibility considerations in CLR decimal for OLEDB.)

There's an interesting tension between traditional database representations of decimal values and the requirements of general-purpose programming languages. In a database, you're always operating in the context of a particular column, so you can factor out properties like precision and scale to optimize storage. In a programming language, individual values need to carry that context with them because most type systems aren't designed to support that kind of parameterization. So you either end up needing extra bytes (like SqlDecimal) or potentially losing precision (like IEEE Decimal128) when working with database decimals. And because most database decimal values don't actually need 38 digits of precision, the latter often ends up looking more attractive.

I like to think that if you squint a little, you can almost see the historical split between FORTRAN and COBOL (and their associated use cases) in this topic.

One nice thing about incorporating precision and scale into the type is that the values are automatically normalized. This is super important in a database representation because comparisons and simple computation (like sums) are going to be the majority of operations performed on these values.

Anyway, sorry for the digression. Representation of numbers in a computer is a fun topic.
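The digit counts in this thread line up as follows. An illustrative sketch in Python, not .NET code:

```python
# Significant-digit capacities discussed above.
clr_decimal_digits = 28        # .NET System.Decimal (96-bit significand)
ieee_decimal128_digits = 34    # IEEE 754 decimal128
arrow_decimal128_digits = 38   # Arrow Decimal128 maximum precision

# IEEE decimal128 beats CLR decimal, but still can't hold
# Arrow's maximum precision.
assert clr_decimal_digits < ieee_decimal128_digits < arrow_decimal128_digits

# Rounding a 38-digit value down to 34 significant digits
# (zeroing the last 4 digits) loses information.
x = 10**38 - 1
assert round(x, -4) != x
```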
Fixes #7082