Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Delta Kernel FFI support for read path and expose Apache.Arrow.Table and Microsoft.Data.Analysis.DataFrame read interface #89

Merged
merged 107 commits into from
Oct 12, 2024

Conversation

mdrakiburrahman
Copy link
Collaborator

@mdrakiburrahman mdrakiburrahman commented Oct 6, 2024

Why this change is needed

This PR adds Delta Kernel FFI based read support.

Closes issue: #82

How

Delta Kernel integration

pr

  • Adds delta-kernel-rs as a pinned submodule. Uses the same structure as Bridge to generate the FFI + Rust Binary

  • Interfaces the user facing entrypoints and sets up some simple model relationships

    • DeltaEngine as IEngine and DeltaTable as ITable
    • Sets up Kernel to override Bridge and fall back when implementation is missing
      • Models Bridge Runtime/Table as base class for Kernel, overrides with the subset of read methods Kernel exposes
  • Implements Kernel FFI InterOp - most of which is Pointer management

    Note that Kernel read operations are currently synchronous - we will make it asynchronous in a follow up PR: Make Kernel FFI calls async #91 - by wrapping it in a TaskCompletionSource

  • Adds an Arrow.Table and DataFrame method to expose the Kernel scanned RecordBatches** as a queryable

  • Adds write concurrency and read concurrency tests - to find any memory management problems and test resiliency etc.

Misc

  • Adds a bootstrap-dev-env.sh idempotent script to quickly get a dev env up and running that can run Unit Tests + Example project (e.g. using a throwaway WSL box)
  • Updates README and architecture diagram

Test

  • Add a unit test that tests read/write concurrency - 27 concurrent writers, 50 readers.

  • Stress tested the new read/write concurrency unit test across 2800+ loops on Windows + Linux overnight:

    Windows
    image

    Linux
    image

  • Ran cloud write example project (Azure Storage Account)

  • Tested Nuget Package pipeline and all targes with cross - sample run

    image

Using cargo to recursively generate FFI headers from the repo was near impossible for my Rust skill level 😄
README.md Outdated Show resolved Hide resolved
@mdrakiburrahman mdrakiburrahman marked this pull request as ready for review October 8, 2024 14:56
@mightyshazam
Copy link
Collaborator

The methods don't have to be synchronous. Technically, C is always synchronous. However, any method that takes a callback function can be wrapped with a TaskCompletionSource to make it asynchronous in C#. tokio only matters if the underlying rust method is async. If that were the case, then, like the Bridge code, it would require a runtime or create it upon the method call.

@mdrakiburrahman
Copy link
Collaborator Author

Makes sense @mightyshazam - I had the incorrect understanding that this works through and through due to the Tokio wrapper. But I'm guessing the idea is the Cancellation Token would be respected by that TaskCompletionSource and abandon whatever C FFI call was being made that's taking too long.

Also, do you prefer I get this TSC wrapper done before merging this PR, so we change the ITable interface to be completely async?

Or should we have the synchronous methods as the PR currently has, and expose additional async methods?

@mightyshazam
Copy link
Collaborator

Makes sense @mightyshazam - I had the incorrect understanding that this works through and through due to the Tokio wrapper. But I'm guessing the idea is the Cancellation Token would be respected by that TaskCompletionSource and abandon whatever C FFI call was being made that's taking too long.

Also, do you prefer I get this TSC wrapper done before merging this PR, so we change the ITable interface to be completely async?

Or should we have the synchronous methods as the PR currently has, and expose additional async methods?

We can change the methods to async later. As long as we don't increment the version of the package, it won't push a new one.

@mdrakiburrahman
Copy link
Collaborator Author

mdrakiburrahman commented Oct 8, 2024

@mightyshazam - yeap good point, sounds good, I added this issue to track this:

#91

Will get it done in a separate smaller PR

@mdrakiburrahman mdrakiburrahman linked an issue Oct 8, 2024 that may be closed by this pull request
@mdrakiburrahman mdrakiburrahman merged commit 6ecd720 into main Oct 12, 2024
5 checks passed
@mdrakiburrahman mdrakiburrahman deleted the dev/mdrrahman/kernel-ffi-submodule branch October 12, 2024 13:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add delta-kernel FFI support in read path
3 participants