[Java] How dictionaries work - roundtrip Java-Python #327

Open
wants to merge 12 commits into
base: main
from
1 change: 1 addition & 0 deletions java/source/index.rst
@@ -43,6 +43,7 @@ This cookbook is tested with Apache Arrow |version|.
data
avro
jdbc
python_java
Member

Would "c data interface" potentially be a better name? The section below uses pyarrow/jpype as a python example, but in the future we could also add other language interfaces.

Contributor Author

@danepitkin
Thank you for the quick review and suggestions. I agree with your comment and will make the change.


Indices and tables
==================
201 changes: 201 additions & 0 deletions java/source/python_java.rst
@@ -0,0 +1,201 @@
.. _arrow-python-java:

========================
PyArrow Java Integration
========================

The PyArrow library offers a powerful API for Python that can be integrated with Java applications.
This document provides a guide on how to enable seamless data exchange between Python and Java components using PyArrow.

.. contents::

Dictionary Data Roundtrip
=========================

This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
Member

Suggested change
This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.

Member

This description is misleading. You cannot mutate data in the C Data Interface.

Contributor Author

I changed the wording.

This section demonstrates a data roundtrip in which the C Data Interface is used to provide
seamless access to data across language boundaries.



Python Component:
-----------------

The Python code uses jpype to start the JVM and make the Java class MapValuesConsumer available to Python.
Data is generated in PyArrow and exported through C Data to Java.

.. code-block:: python
Member

Run your code through a formatter so that it's consistent. (You can use Sphinx directives that include code from files instead of having to inline the code here, to make it easier.)


import jpype
import jpype.imports
from jpype.types import *
import pyarrow as pa
from pyarrow.cffi import ffi as arrow_c

# Start the JVM and make the MapValuesConsumer class available to Python.
jpype.startJVM(classpath=["../target/*"])
java_c_package = jpype.JPackage("org").apache.arrow.c
MapValuesConsumer = JClass("MapValuesConsumer")
CDataDictionaryProvider = JClass("org.apache.arrow.c.CDataDictionaryProvider")

# Generate the data in Python: create a dictionary-encoded array.
dictionary = pa.dictionary(pa.int64(), pa.utf8())
array = pa.array(["A", "B", "C", "A", "D"], dictionary)
print("From Python")
print("Dictionary Created: ", array)

# Create the CDataDictionaryProvider instance, which Java needs
# in order to import the dictionary-encoded array.
c_provider = CDataDictionaryProvider()

consumer = MapValuesConsumer(c_provider)

# Export the Python array through C Data
c_array = arrow_c.new("struct ArrowArray*")
c_array_ptr = int(arrow_c.cast("uintptr_t", c_array))
array._export_to_c(c_array_ptr)

# Export the Schema of the Array through C Data
c_schema = arrow_c.new("struct ArrowSchema*")
c_schema_ptr = int(arrow_c.cast("uintptr_t", c_schema))
array.type._export_to_c(c_schema_ptr)

# Send Array and its Schema to the Java function
# that will update the dictionary
consumer.update(c_array_ptr, c_schema_ptr)
Member

misleading wording/naming

Contributor Author

I assume this is about the comment, how about

# update values in Java

Member

It is about both. See above.

Contributor Author

Got it.


# Importing updated values from Java to Python

# Allocate an empty C ArrowArray struct for Java to export into
updated_c_array = arrow_c.new("struct ArrowArray*")
updated_c_array_ptr = int(arrow_c.cast("uintptr_t", updated_c_array))

# Allocate an empty C ArrowSchema struct for Java to export into
updated_c_schema = arrow_c.new("struct ArrowSchema*")
updated_c_schema_ptr = int(arrow_c.cast("uintptr_t", updated_c_schema))

java_wrapped_array = java_c_package.ArrowArray.wrap(updated_c_array_ptr)
java_wrapped_schema = java_c_package.ArrowSchema.wrap(updated_c_schema_ptr)

java_c_package.Data.exportVector(
    consumer.getAllocatorForJavaConsumer(),
    consumer.getVector(),
    c_provider,
    java_wrapped_array,
    java_wrapped_schema
)

print("From Java back to Python")
updated_array = pa.Array._import_from_c(updated_c_array_ptr, updated_c_schema_ptr)

# Java and Python access the same memory through the C Data interface,
# so the array returned from Java should equal the array created in Python.
assert updated_array.equals(array)
print("Updated Array: ", updated_array)

del updated_array
Member

Explicit del should be unnecessary.

Contributor Author

I get the following warning when I remove that line. (I added it for this reason, but I may be missing something on the Java end.)

WARNING: Failed to release Java C Data resource: Failed to attach the current thread to a Java VM

Member

In that case document why it is necessary.

Contributor Author

Got it. I have one question, since this API is pretty new to me.

Here we call Java from Python: the Python interpreter starts first, and from it we start a JVM. We then access the memory from Java and create a Python object from it, so the Python object and the Java object point to the same memory. Is this statement correct?

What could then happen is that Python shuts down and, in the process, tries to shut down the JVM first. The exportVector call to Java would then invoke a function called release_exported. This is where we see that warning.

Further, according to a comment in release_exported in jni_wrapper.cc:

// It is possible for the JVM to be shut down when this is called;
// guard against that.  Example: Python code using JPype may shut
// down the JVM before releasing the stream.

I believe the above warning could be triggered when attempting to delete global references?
Please correct me if I am wrong. If there is a better, more accurate explanation, I would appreciate learning about it.
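
The shared-memory part of the question can be illustrated without Arrow or a JVM. The sketch below is only an analogy built on the standard-library ctypes module: it assumes a single process, and the names `producer_view` and `consumer_view` are hypothetical. It passes a raw address as an integer, much like the `c_array_ptr` values in the example, and shows that a view rebuilt from that address sees the same bytes.

```python
import ctypes

# Producer side: allocate five 64-bit integers and publish only the address,
# much like the integer pointers handed to Java in the example.
Int64x5 = ctypes.c_int64 * 5
producer_view = Int64x5(0, 1, 2, 0, 3)
address = ctypes.addressof(producer_view)

# Consumer side: rebuild a view from the raw address. No bytes are copied.
consumer_view = Int64x5.from_address(address)
consumer_view[0] = 2  # a write through one view ...

# ... is visible through the other view.
print(list(producer_view))  # [2, 1, 2, 0, 3]
```

The consumer view does not own the memory, so it is only valid while `producer_view` is alive. In the C Data Interface that lifetime is managed through the release callback that the consumer calls when it is done, which is why the order of shutdown and release matters here.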


.. code-block:: shell

From Python
Dictionary Created:
-- dictionary:
[
"A",
"B",
"C",
"D"
]
-- indices:
[
0,
1,
2,
0,
3
]
Doing work in Java
From Java back to Python
Updated Array:
-- dictionary:
[
"A",
"B",
"C",
"D"
]
-- indices:
[
2,
1,
2,
0,
3
]
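
The printed output can be read as follows: a dictionary array stores each distinct value once and refers to it by position. The sketch below is plain Python, not the PyArrow API, and `encode` and `decode` are hypothetical helper names used only for illustration. It reproduces the dictionary and indices shown above and shows why overwriting index slot 0 with 2 changes the first logical value from "A" to "C".

```python
def encode(values):
    """Dictionary-encode a list: return (dictionary, indices)."""
    dictionary = []  # distinct values in first-seen order
    positions = {}   # value -> position in `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the logical values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


dictionary, indices = encode(["A", "B", "C", "A", "D"])
print(dictionary)  # ['A', 'B', 'C', 'D']
print(indices)     # [0, 1, 2, 0, 3]

# Same change as the Java side makes with setSafe(0, 2) on the index vector.
indices[0] = 2
print(decode(dictionary, indices))  # ['C', 'B', 'C', 'A', 'D']
```

Only the index buffer changes in this sketch; the dictionary itself stays the same, which matches the two printouts above.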

In the Python component, the following steps are executed to demonstrate the data roundtrip:

1. Create data in Python
2. Export data to Java
3. Import updated data from Java
4. Validate the data consistency


Java Component:
---------------

In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
Member

Suggested change
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.

Member

Also, misleading wording

Contributor Author

Changed the wording. Thanks for catching this.


.. code-block:: java
Contributor

The Java component is tested with custom directives built on testcode and testoutput.

Internally, the Java code is executed by JShell, and the output that System.out.print produces in the testcode block is compared with the value defined in the testoutput block, for example:

.. testcode::

    System.out.print("testme");

.. testoutput::

    testme
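
The compare step described above can be sketched in a few lines. This is a Python analogue under stated assumptions, not the cookbook's actual directive code: `run_and_compare` is a hypothetical helper, and it runs a Python snippet with `exec` where the cookbook hands Java code to JShell.

```python
import contextlib
import io


def run_and_compare(snippet, expected_output):
    """Run `snippet`, capture what it prints, and compare with the expectation."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(snippet, {})  # stand-in for handing the testcode block to JShell
    return buffer.getvalue().strip() == expected_output.strip()


print(run_and_compare('print("testme", end="")', "testme"))  # True
print(run_and_compare('print("something else")', "testme"))  # False
```

Because only printed output is compared, a testcode block has to print whatever it wants verified.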


import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.Data;
import org.apache.arrow.c.CDataDictionaryProvider;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.BigIntVector;


public class MapValuesConsumer {
    private static final BufferAllocator allocator = new RootAllocator();
    private final CDataDictionaryProvider provider;
    private FieldVector vector;

    public MapValuesConsumer(CDataDictionaryProvider provider) {
        this.provider = provider;
    }

    public static BufferAllocator getAllocatorForJavaConsumer() {
        return allocator;
    }

    public FieldVector getVector() {
        return this.vector;
    }

    public void update(long c_array_ptr, long c_schema_ptr) {
Contributor

It could be an option to also validate the Python call as part of Java testing, using the .. testcode:: and .. testoutput:: directives.

Something like this:

import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.CDataDictionaryProvider;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;


public class MapValuesConsumer {
    private final static BufferAllocator allocator = new RootAllocator();
    private final CDataDictionaryProvider provider;
    private FieldVector vector;

    public MapValuesConsumer(CDataDictionaryProvider provider) {
        this.provider = provider;
    }

    public MapValuesConsumer() {
        this.provider = null;
    }

    public static BufferAllocator getAllocatorForJavaConsumer() {
        return allocator;
    }

    public FieldVector getVector() {
        return this.vector;
    }

    public void update(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
        this.doWorkInJava(vector);
    }

    public void update2(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, null);
        this.doWorkInJava(vector);
    }

    private void doWorkInJava(FieldVector vector) {
        System.out.println("Doing work in Java");
        BigIntVector bigIntVector = (BigIntVector)vector;
        bigIntVector.setSafe(0, 2);
    }

    public static void main(String[] args) {
        simulateAsAJavaConsumers();
    }

    final static BigIntVector intVector =
            new BigIntVector("internal_test", allocator);

    public static BigIntVector getIntVectorForJavaConsumers() {
        intVector.allocateNew(3);
        intVector.set(0, 1);
        intVector.set(1, 7);
        intVector.set(2, 93);
        intVector.setValueCount(3);
        return intVector;
    }

    public static void simulateAsAJavaConsumers() {
        MapValuesConsumer mvc = new MapValuesConsumer();//FIXME! Use constructor with dictionary provider
        try (
            ArrowArray arrowArray = ArrowArray.allocateNew(allocator);
            ArrowSchema arrowSchema = ArrowSchema.allocateNew(allocator)
        ) {
            //FIXME! Add custom logic to emulate adding a dictionary provider
            Data.exportVector(allocator, getIntVectorForJavaConsumers(), null, arrowArray, arrowSchema);
            mvc.update2(arrowArray.memoryAddress(), arrowSchema.memoryAddress());
            try (FieldVector valueVectors = Data.importVector(allocator, arrowArray, arrowSchema, null);) {
                System.out.print(valueVectors); //FIXME! Validate on .. testoutput::
            }
        }
        intVector.close(); //FIXME! Expose this method also to be called by end python program
        allocator.close();
    }
}

Contributor Author

Thank you for the insight, @davisusanibar. I will work on this.

Contributor Author

@davisusanibar this code doesn't seem to be working, but I get your idea.

        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
        this.doWorkInJava(vector);
    }

    private void doWorkInJava(FieldVector vector) {
        System.out.println("Doing work in Java");
        BigIntVector bigIntVector = (BigIntVector) vector;
        bigIntVector.setSafe(0, 2);
    }
}

The Java component performs the following actions:

1. Receives data from the Python component.
2. Updates the data.
3. Exports the updated data back to Python.

By integrating PyArrow in Python and Java components, this example demonstrates that
a system can be created where data is shared and updated across both languages seamlessly.