[Java] How dictionaries work - roundtrip Java-Python #327
@@ -0,0 +1,201 @@
.. _arrow-python-java:

========================
PyArrow Java Integration
========================

The PyArrow library offers a powerful API for Python that can be integrated with Java applications.
This document provides a guide on how to enable seamless data exchange between Python and Java components using PyArrow.

.. contents::

Dictionary Data Roundtrip
=========================

This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
Review comment: Suggested change
Review comment: This description is misleading. You cannot mutate data in the C Data Interface.
Reply: I changed the wording. This section demonstrates a data roundtrip where the C Data Interface is being used to provide seamless access to data across language boundaries.

Python Component:
-----------------

The Python code uses jpype to start the JVM and make the Java class MapValuesConsumer available to Python.
Data is generated in PyArrow and exported through C Data to Java.

.. code-block:: python
Review comment: Run your code through a formatter so that it's consistent. (You can use Sphinx directives that include code from files instead of having to inline the code here, to make it easier.)

import jpype
import jpype.imports
from jpype.types import *

import pyarrow as pa
from pyarrow.cffi import ffi as arrow_c

# Init the JVM and make MapValuesConsumer class available to Python.
jpype.startJVM(classpath=["../target/*"])
java_c_package = jpype.JPackage("org").apache.arrow.c
MapValuesConsumer = JClass("MapValuesConsumer")
CDataDictionaryProvider = JClass("org.apache.arrow.c.CDataDictionaryProvider")

# Starting from Python and generating data

# Create a Python DictionaryArray
dictionary = pa.dictionary(pa.int64(), pa.utf8())
array = pa.array(["A", "B", "C", "A", "D"], dictionary)
print("From Python")
print("Dictionary Created: ", array)

# Create the CDataDictionaryProvider instance, which Java needs
# in order to import the dictionary-encoded array
c_provider = CDataDictionaryProvider()

consumer = MapValuesConsumer(c_provider)

# Export the Python array through C Data
c_array = arrow_c.new("struct ArrowArray*")
c_array_ptr = int(arrow_c.cast("uintptr_t", c_array))
array._export_to_c(c_array_ptr)

# Export the Schema of the Array through C Data
c_schema = arrow_c.new("struct ArrowSchema*")
c_schema_ptr = int(arrow_c.cast("uintptr_t", c_schema))
array.type._export_to_c(c_schema_ptr)

# Send Array and its Schema to the Java function
# that will update the dictionary
consumer.update(c_array_ptr, c_schema_ptr)
Review comment: misleading wording/naming
Reply: I assume this is about the comment, how about # update values in Java
Reply: It is about both. See above.
Reply: Got it.

# Importing updated values from Java to Python

# Allocate an empty ArrowArray struct for Java to export into
updated_c_array = arrow_c.new("struct ArrowArray*")
updated_c_array_ptr = int(arrow_c.cast("uintptr_t", updated_c_array))

# Allocate an empty ArrowSchema struct for Java to export into
updated_c_schema = arrow_c.new("struct ArrowSchema*")
updated_c_schema_ptr = int(arrow_c.cast("uintptr_t", updated_c_schema))

java_wrapped_array = java_c_package.ArrowArray.wrap(updated_c_array_ptr)
java_wrapped_schema = java_c_package.ArrowSchema.wrap(updated_c_schema_ptr)

java_c_package.Data.exportVector(
    consumer.getAllocatorForJavaConsumer(),
    consumer.getVector(),
    c_provider,
    java_wrapped_array,
    java_wrapped_schema
)

print("From Java back to Python")
updated_array = pa.Array._import_from_c(updated_c_array_ptr, updated_c_schema_ptr)

# In Java and Python, the same memory is being accessed through the C Data interface,
# so the array from Java and the array created in Python should have the same data.
assert updated_array.equals(array)
print("Updated Array: ", updated_array)

del updated_array
Review comment: Explicit del should be unnecessary.
Reply: I get the following warning when I remove that line (I added it for this reason, but I may be missing something on the Java end). WARNING: Failed to release Java C Data resource: Failed to attach the current thread to a Java VM
Reply: In that case document why it is necessary.
Reply: Got it, I have one question since this API is pretty new to me. What happens here is that we call Java from Python. The Python VM is up first, then from the Python VM we bring up a JVM. Then we access the memory from Java, and from that we create a Python object. So the Python object and the Java object point to the same memory. Is this statement correct? Then what could happen is that Python shuts down its VM, and in the process it would try to shut down the JVM first. The Further according to a comment in the
// It is possible for the JVM to be shut down when this is called;
// guard against that. Example: Python code using JPype may shut
// down the JVM before releasing the stream.
I believe the above warning could be caused when attempting to delete global references?
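
One possible reading of the warning discussed in this thread is an ordering problem: the imported object carries a release callback that needs a live runtime, so releasing it after that runtime has stopped cannot succeed. The sketch below is a toy model in plain Python, not JPype or Arrow code. `FakeRuntime` and `ImportedArray` are hypothetical names, and the assumption that the callback needs a running runtime is taken from the quoted source comment rather than verified here.

```python
class FakeRuntime:
    """Hypothetical stand-in for a JVM that can be shut down."""

    def __init__(self):
        self.running = True

    def shutdown(self):
        self.running = False


class ImportedArray:
    """Hypothetical consumer-side object that owns a producer's release callback."""

    def __init__(self, runtime, log):
        self._runtime = runtime
        self._log = log
        self._released = False

    def release(self):
        # A release callback must run at most once.
        if self._released:
            return
        self._released = True
        if self._runtime.running:
            self._log.append("released cleanly")
        else:
            self._log.append("warning: runtime already shut down")


log = []

# Case 1: release before shutdown, which is what an explicit `del` forces.
runtime = FakeRuntime()
early = ImportedArray(runtime, log)
early.release()
runtime.shutdown()

# Case 2: release after shutdown, which is what interpreter teardown may do.
runtime = FakeRuntime()
late = ImportedArray(runtime, log)
runtime.shutdown()
late.release()

print(log)
```

Under this model, the explicit `del updated_array` simply moves the release into case 1.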

.. code-block:: shell

From Python
Dictionary Created:
-- dictionary:
  [
    "A",
    "B",
    "C",
    "D"
  ]
-- indices:
  [
    0,
    1,
    2,
    0,
    3
  ]
Doing work in Java
From Java back to Python
Updated Array:
-- dictionary:
  [
    "A",
    "B",
    "C",
    "D"
  ]
-- indices:
  [
    2,
    1,
    2,
    0,
    3
  ]

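
The output above shows the two parts of a dictionary-encoded array: a dictionary of unique values and a list of indices into it. As a minimal sketch in plain Python (no PyArrow involved; `dictionary_encode` and `dictionary_decode` are illustrative helper names, not library functions), the relationship between the two parts can be reproduced like this:

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a list of hashable values."""
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> index into `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the logical values by looking each index up in the dictionary."""
    return [dictionary[i] for i in indices]


dictionary, indices = dictionary_encode(["A", "B", "C", "A", "D"])
print(dictionary)  # ['A', 'B', 'C', 'D']
print(indices)     # [0, 1, 2, 0, 3]

# Overwriting the first index with 2, as the Java side does, changes the
# logical value at position 0 from "A" to "C" without touching the dictionary.
indices[0] = 2
print(dictionary_decode(dictionary, indices))  # ['C', 'B', 'C', 'A', 'D']
```

This matches the second half of the output: the dictionary is unchanged and only the first index differs.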
In the Python component, the following steps are executed to demonstrate the data roundtrip:

1. Create data in Python
2. Export data to Java
3. Import updated data from Java
4. Validate the data consistency
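
In step 2, the only things that cross the language boundary are two plain integers holding struct addresses. The sketch below illustrates that hand-off with the standard `ctypes` module instead of `pyarrow.cffi`. `FakeArrowArray` is a hypothetical two-field struct used only for illustration; the real `ArrowArray` struct has more members.

```python
import ctypes


class FakeArrowArray(ctypes.Structure):
    # Hypothetical, simplified layout; not the real ArrowArray definition.
    _fields_ = [("length", ctypes.c_int64), ("null_count", ctypes.c_int64)]


# Producer side: allocate the struct and fill it in.
produced = FakeArrowArray(length=5, null_count=0)

# Hand-off: only a plain integer address crosses the boundary.
address = ctypes.addressof(produced)

# Consumer side: reinterpret the address as the same struct, without copying.
consumed = FakeArrowArray.from_address(address)
print(consumed.length, consumed.null_count)

# Both names refer to the same memory, so a write through one is visible
# through the other. `produced` must stay alive while `consumed` is in use.
consumed.length = 4
print(produced.length)
```

The same lifetime concern applies in the roundtrip example: the object that owns the struct has to outlive every consumer that was given its address.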

Java Component:
---------------

In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
Review comment: Suggested change
Review comment: Also, misleading wording
Reply: Changed the wording. Thanks for catching this.

.. code-block:: java
Review comment: Java component is tested by custom directives created with During internal works, Java code is executed by Jshell and output defined by system.out.print on the testcode is compared to the value defined on the test output, such as:

import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.c.CDataDictionaryProvider;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.BigIntVector;

public class MapValuesConsumer {
    private static final BufferAllocator allocator = new RootAllocator();
    private final CDataDictionaryProvider provider;
    private FieldVector vector;

    public MapValuesConsumer(CDataDictionaryProvider provider) {
        this.provider = provider;
    }

    public static BufferAllocator getAllocatorForJavaConsumer() {
        return allocator;
    }

    public FieldVector getVector() {
        return this.vector;
    }

    public void update(long c_array_ptr, long c_schema_ptr) {
Review comment: Could be an option to also validate the Python call as part of Java testing using Something like this:
Reply: Thank you for the insight, @davisusanibar. I will work on this.
Reply: @davisusanibar this code doesn't seem to be working, but I get your idea.
        // Wrap the raw addresses received from Python and import them as a vector
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
        this.doWorkInJava(vector);
    }

    private void doWorkInJava(FieldVector vector) {
        System.out.println("Doing work in Java");
        // The imported vector holds the int64 dictionary indices; overwrite index 0 with 2
        BigIntVector bigIntVector = (BigIntVector) vector;
        bigIntVector.setSafe(0, 2);
    }
}

The Java component performs the following actions:

1. Receives data from the Python component.
2. Updates the data.
3. Exports the updated data back to Python.

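
The review comments on this page point out that "updating" is misleading wording for data shared through the C Data Interface. The sketch below uses only the Python standard library (`array` and `memoryview`, not Arrow) to show what a zero-copy view implies: a write through the consumer's view lands in the producer's buffer, so both sides always compare equal, whether or not the write was intended.

```python
from array import array

# Producer: a contiguous buffer of signed 64-bit indices.
indices = array("q", [0, 1, 2, 0, 3])

# Consumer: a zero-copy view over the producer's buffer.
view = memoryview(indices)

# A write through the view lands directly in the producer's memory.
view[0] = 2

print(indices.tolist())
same = view.tolist() == indices.tolist()
print(same)

# Release the view so the producer's buffer is no longer exported.
view.release()
```

This is why an equality check between the original array and the re-imported one can pass even though the indices changed: under this model there is only one buffer.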
By integrating PyArrow in Python and Java components, this example demonstrates that
a system can be created where data is shared and updated across both languages seamlessly.

Review comment: Would "C Data Interface" potentially be a better name? The section below uses pyarrow/jpype as a Python example, but in the future we could also add other language interfaces.
Reply: @danepitkin Thank you for the quick review and suggestions. I agree with your comment and will make the change.