Add support for Apache.Arrow.Types.Decimal128Type #7094
base: main
Conversation
@asmirnov82 Can you please help review this?
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #7094      +/-   ##
==========================================
- Coverage   68.82%   68.47%   -0.35%
==========================================
  Files        1255     1262       +7
  Lines      250358   254272    +3914
  Branches    25550    26237     +687
==========================================
+ Hits       172310   174117    +1807
- Misses      71438    73473    +2035
- Partials     6610     6682      +72

Flags with carried forward coverage won't be shown.
@@ -124,6 +124,15 @@ private static void AppendDataFrameColumnFromArrowArray(Field field, IArrowArray
        }
        break;
    case ArrowTypeId.Decimal128:
Is it possible to include a test for this?
As far as I understand, Data.Analysis.DataFrame was initially designed to support the Apache.Arrow format and to allow constructing columns from Arrow arrays without memory copying (at least for reading; on any attempt to edit data, the Arrow buffers are copied anyway). But at some point new columns were added to the DataFrame that didn't follow this rule, for example StringDataFrameColumn, DateTimeDataFrameColumn and DecimalDataFrameColumn. These columns use an underlying data format that differs from Arrow's. For Arrow strings the DataFrame provides an alternative implementation.

One possible issue we have to pay attention to is that Arrow Decimal128 uses 16 bytes to store its data, and the user is able to configure precision and scale. As far as I know, .NET decimal has no concept of precision. It uses 3 DWORDs (12 bytes) to store the actual significand, and therefore has a maximum precision of 28 (not 34). So converting an Arrow Decimal128 to a .NET decimal may lose information.

As for the Decimal256 implementation, I see two possibilities: using …

@luisquintanilla @JakeRadMSFT @ericstj could you please provide your thoughts?
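The precision mismatch described above can be checked with a quick back-of-the-envelope calculation. This is a language-agnostic sketch in Python, not part of the PR:

```python
# Back-of-the-envelope check (illustrative only) of the precision
# mismatch between .NET decimal and Arrow Decimal128.

# System.Decimal stores its significand in 96 bits (3 DWORDs).
dotnet_max_significand = 2**96 - 1
# This is a 29-digit number, but not every 29-digit value fits,
# so .NET decimal only guarantees 28 digits of precision.
assert len(str(dotnet_max_significand)) == 29

# Arrow Decimal128 is a 128-bit scaled integer whose precision can be
# configured up to 38 decimal digits.
arrow_max_unscaled = 10**38 - 1

# A maximal-precision Arrow value cannot fit in a .NET decimal
# significand, so the conversion can lose information.
assert arrow_max_unscaled > dotnet_max_significand
```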
                    decimal128DataFrameColumn[i] = arrowDecimal128Array.GetValue(i);
                }
            }
            break;
This code converts from Arrow to DecimalDataFrameColumn. Could you please also add the backward conversion from DecimalDataFrameColumn to Arrow? It's done in the PrimitiveDataFrameColumn.ToArrowArray(long startIndex, int numberOfRows) method.
I don't think that proposal is something that's being actively worked on at this time, so probably not something that would happen in the .NET 9 timeframe.
That's worthwhile to call out in the docs. We should update the API docs here:
FWIW, it does look like Arrow has precedent for this conversion, though it also offers SqlDecimal. Neither can be done with zero-copy, but the latter wouldn't lose precision (I think): https://github.com/apache/arrow/blob/2babda0ba22740c092166b5c5d5d7aa9b4797953/csharp/src/Apache.Arrow/Arrays/Decimal128Array.cs#L42-L63

Another option would be to expose some types unique to this assembly that are meant to represent this data without copying. Not saying we should, but that's a possibility. Probably those types would be better to live in the Arrow project if at all. @tannergooding and @eerhardt might have more thoughts on this.
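The size trade-off mentioned above can be sanity-checked numerically. An illustrative Python sketch, not the Arrow C# code:

```python
# SqlDecimal keeps a 128-bit integer significand (4 x 32-bit words)
# plus separate precision, scale and sign bytes -- 160 bits overall,
# as noted in the discussion.
arrow_max_unscaled = 10**38 - 1   # Arrow Decimal128 at maximum precision

# Every 38-digit unscaled value fits in SqlDecimal's 128-bit
# significand, which is why the conversion can be lossless.
assert arrow_max_unscaled < 2**128
```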
If there's need for the IEEE 754 … It just isn't in scope to add the entirety of the approved API surface area, including operators.
Yes, SqlDecimal wouldn't lose precision for Arrow's decimal128, but that's at the cost of occupying 160 bits instead of 128. The IEEE Decimal128 isn't wide enough to hold the 38 digits of decimal precision supported by Arrow, though it is (as noted) better than the CLR decimal. (I think there were some backwards-compatibility considerations in CLR decimal for OLEDB.)

There's an interesting tension between traditional database representations of decimal values and the requirements of general-purpose programming languages. In a database, you're always operating in the context of a particular column, so you can factor out properties like precision and scale to optimize storage. In a programming language, individual values need to carry that context with them because most type systems aren't designed to support that kind of parameterization. So you either end up needing extra bytes (like SqlDecimal) or potentially losing precision (like IEEE Decimal128) when working with database decimals. And because most database decimal values don't actually need 38 digits of precision, the latter often ends up looking more attractive.

I like to think that if you squint a little, you can almost see the historical split between FORTRAN and COBOL (and their associated use cases) in this topic.

One nice thing about incorporating precision and scale into the type is that the values are automatically normalized. This is super important in a database representation because comparisons and simple computation (like sums) are going to be the majority of operations performed on these values.

Anyway, sorry for the digression. Representation of numbers in a computer is a fun topic.
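The digit counts in this thread line up as follows. An illustrative sketch in Python, not .NET code:

```python
# Significant-digit capacities discussed above.
clr_decimal_digits = 28        # .NET System.Decimal (96-bit significand)
ieee_decimal128_digits = 34    # IEEE 754 decimal128
arrow_decimal128_digits = 38   # Arrow Decimal128 maximum precision

# IEEE decimal128 beats CLR decimal, but still can't hold
# Arrow's maximum precision.
assert clr_decimal_digits < ieee_decimal128_digits < arrow_decimal128_digits

# Rounding a 38-digit value down to 34 significant digits
# (zeroing the last 4 digits) loses information.
x = 10**38 - 1
assert round(x, -4) != x
```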
Fixes #7082