Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Java] How dictionaries work - roundtrip Java-Python #327

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

vibhatha
Copy link
Contributor

@vibhatha vibhatha commented Sep 5, 2023

This PR closes #213

@davisusanibar
Copy link
Contributor

Hi @vibhatha, just reviewing error logs and error mention that arrow-memory-netty is not available on the nightly packages, and that is true, as you can see at https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-memory-netty/, which is very rare, but it needs to be reviewed.

Alternatively, you could start using the 14.0.0-SNAPSHOT version which can be configured in the arrow-cookbook/java/source/conf.py file as follows:

if arrow_nightly and arrow_nightly != '0':
    version = "14.0.0-SNAPSHOT"
else:
    version = "13.0.0"

@vibhatha
Copy link
Contributor Author

vibhatha commented Sep 6, 2023

Hi @vibhatha, just reviewing error logs and error mention that arrow-memory-netty is not available on the nightly packages, and that is true, as you can see at https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-memory-netty/, which is very rare, but it needs to be reviewed.

Alternatively, you could start using the 14.0.0-SNAPSHOT version which can be configured in the arrow-cookbook/java/source/conf.py file as follows:

if arrow_nightly and arrow_nightly != '0':
    version = "14.0.0-SNAPSHOT"
else:
    version = "13.0.0"

If I understand correctly, should I make that change in this PR itself? Or should we create a separate PR for that?

@davisusanibar
Copy link
Contributor

davisusanibar commented Sep 6, 2023

Hi @vibhatha, just reviewing error logs and error mention that arrow-memory-netty is not available on the nightly packages, and that is true, as you can see at https://nightlies.apache.org/arrow/java/org/apache/arrow/arrow-memory-netty/, which is very rare, but it needs to be reviewed.
Alternatively, you could start using the 14.0.0-SNAPSHOT version which can be configured in the arrow-cookbook/java/source/conf.py file as follows:

if arrow_nightly and arrow_nightly != '0':
    version = "14.0.0-SNAPSHOT"
else:
    version = "13.0.0"

If I understand correctly, should I make that change in this PR itself? Or should we create a separate PR for that?

This change is being added to #320 as well, but it is true, this must have been created independently.

@vibhatha
Copy link
Contributor Author

vibhatha commented Sep 6, 2023

@davisusanibar sounds good and I will rebase once this PR is merged. Thank you 👍

@vibhatha vibhatha force-pushed the feat-dictionary-python-java-rndtrp branch from 7befe38 to c35eefd Compare September 11, 2023 00:33
@vibhatha vibhatha marked this pull request as ready for review September 11, 2023 17:12
@vibhatha
Copy link
Contributor Author

@danepitkin @davisusanibar This PR is ready for review, please take a look.

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vibhatha ! This will be a great addition to the cook book. I'm not super familiar with the C Data interface / jpype so reviewing this was a good learning opportunity for me.

java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
@@ -43,6 +43,7 @@ This cookbook is tested with Apache Arrow |version|.
data
avro
jdbc
python_java
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would "c data interface" potentially be a better name? The section below uses pyarrow/jpype as a python example, but in the future we could also add other language interfaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danepitkin
Thank you for the quick review and suggestions. And I agree with your comment. Will make the changed.

In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.

.. code-block:: java
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java component is tested by custom directives created with testcode and testoutput.

During internal works, Java code is executed by Jshell and output defined by system.out.print on the testcode is compared to the value defined on the test output, such as:

.. testcode::

    System.out.print("testme");

.. testoutput::

    testme

return this.vector;
}

public void update(long c_array_ptr, long c_schema_ptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be an option to also validate Python call as as part of Java testing using .. testcode:: and .. testoutput:: directives?

Something like this:

import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.CDataDictionaryProvider;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;


public class MapValuesConsumer {
    private final static BufferAllocator allocator = new RootAllocator();
    private final CDataDictionaryProvider provider;
    private FieldVector vector;

    public MapValuesConsumer(CDataDictionaryProvider provider) {
        this.provider = provider;
    }

    public MapValuesConsumer() {
        this.provider = null;
    }

    public static BufferAllocator getAllocatorForJavaConsumer() {
        return allocator;
    }

    public FieldVector getVector() {
        return this.vector;
    }

    public void update(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
        this.doWorkInJava(vector);
    }

    public void update2(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, null);
        this.doWorkInJava(vector);
    }

    private void doWorkInJava(FieldVector vector) {
        System.out.println("Doing work in Java");
        BigIntVector bigIntVector = (BigIntVector)vector;
        bigIntVector.setSafe(0, 2);
    }

    public static void main(String[] args) {
        simulateAsAJavaConsumers();
    }

    final static BigIntVector intVector =
            new BigIntVector("internal_test", allocator);

    public static BigIntVector getIntVectorForJavaConsumers() {
        intVector.allocateNew(3);
        intVector.set(0, 1);
        intVector.set(1, 7);
        intVector.set(2, 93);
        intVector.setValueCount(3);
        return intVector;
    }

    public static void simulateAsAJavaConsumers() {
        MapValuesConsumer mvc = new MapValuesConsumer();//FIXME! Use constructor with dictionary provider
        try (
            ArrowArray arrowArray = ArrowArray.allocateNew(allocator);
            ArrowSchema arrowSchema = ArrowSchema.allocateNew(allocator)
        ) {
            //FIXME! Add custo  logic to emulate a dictionary provider adding
            Data.exportVector(allocator, getIntVectorForJavaConsumers(), null, arrowArray, arrowSchema);
            mvc.update2(arrowArray.memoryAddress(), arrowSchema.memoryAddress());
            try (FieldVector valueVectors = Data.importVector(allocator, arrowArray, arrowSchema, null);) {
                System.out.print(valueVectors); //FIXME! Validate on .. testoutput::
            }
        }
        intVector.close(); //FIXME! Expose this method also to be called by end python program
        allocator.close();
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the insight, @davisusanibar. I will work on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davisusanibar this code doesn't seem to be working, but I get your idea.

@vibhatha vibhatha force-pushed the feat-dictionary-python-java-rndtrp branch from 68d140b to 46bba74 Compare September 13, 2023 04:40
@vibhatha
Copy link
Contributor Author

@danepitkin I am addressing the reviews and thank you for reviewing this PR.

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @vibhatha ! A few small comments, but overall LGTM otherwise.

java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
@vibhatha
Copy link
Contributor Author

@danepitkin I am working on updating this PR for Java doctests. I will address the reviews as well.

java/source/c_data.rst Outdated Show resolved Hide resolved
java/source/c_data.rst Outdated Show resolved Hide resolved
Comment on lines 15 to 16
This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is misleading. You cannot mutate data in the C Data Interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the wording.

This section demonstrates a data roundtrip where C Data interface is being used to provide
the seamless access to data across language boundaries.

java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
java/source/python_java.rst Outdated Show resolved Hide resolved
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.

.. testcode::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Format the code here as well.

Comment on lines 148 to 149
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, misleading wording

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the wording. Thanks for catching this.

java/source/python_java.rst Outdated Show resolved Hide resolved
@vibhatha
Copy link
Contributor Author

@lidavidm thanks a lot for your feedback. I will address these.

@lidavidm
Copy link
Member

Given we have no roundtrip examples at all, I don't see why we are starting with dictionaries vs a simpler type

@davisusanibar
Copy link
Contributor

Given we have no roundtrip examples at all, I don't see why we are starting with dictionaries vs a simpler type

Could it be consider as a simpler type #325?

@vibhatha
Copy link
Contributor Author

Given we have no roundtrip examples at all, I don't see why we are starting with dictionaries vs a simpler type

Good point. Although, I am merely picking up an existing issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] How dictionaries work - roundtrip Java-Python
4 participants