Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use arrow-csv #646

Closed
wants to merge 1 commit into from
Closed

Conversation

gadomski
Copy link
Contributor

To get more familiar with things around here, I'm taking a stab at #613. Opening up this draft PR to ask two questions:

  • Is Use arrow-csv for CSV operations #613 still useful? Don't want to work something that's OBE
  • If Use arrow-csv for CSV operations #613 is still useful, I'd appreciate some guidance about the correct use of MixedGeometryArray::into_arrow. Specifically, into_arrow produces a "minimal" union (only the types actually present in the mixed array) while ChunkedMixedGeometryArray::data_type produces a superset union of all possible types. I've provided some test output below for more info.

Related issues

More information

The code in the PR works, but feels wrong to me. The "correct" code (to my naive eyes) produces schema mismatch.

Code and test output
table.set_column(
    geometry_column_index,
    geometries
        .data_type()
        .to_field(geometry_field.name(), geometry_field.is_nullable())
        .into(),
    geometries.array_refs(),
)?;

Test output:

test io::csv::reader::tests::read ... FAILED

failures:

---- io::csv::reader::tests::read stdout ----
thread 'io::csv::reader::tests::read' panicked at src/io/csv/reader.rs:127:72:
called `Result::unwrap()` on an `Err` value: Arrow(InvalidArgumentError("column types must match schema types, expected Union([(1, Field { name: \"\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.point\"} }), (2, Field { name: \"\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.linestring\"} }), (3, Field { name: \"\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.polygon\"} }), (4, Field { name: \"\", data_type: List(Field { name: \"points\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipoint\"} }), (5, Field { name: \"\", data_type: List(Field { name: \"linestrings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multilinestring\"} }), (6, Field { name: \"\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) but found Union([(1, Field { name: \"geometry\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:metadata\": \"{\\\"crs\\\":null,\\\"edges\\\":null}\", \"ARROW:extension:name\": \"geoarrow.point\"} })], Dense) at column index 3"))

@kylebarron
Copy link
Member

Yes! I think in general it's better to use arrow tools where possible, and add the spatial part on top, rather than relying on a row-based geozero and having to do the conversion to Arrow ourselves.

@kylebarron
Copy link
Member

kylebarron commented May 30, 2024

Specifically, into_arrow produces a "minimal" union (only the types actually present in the mixed array) while ChunkedMixedGeometryArray::data_type produces a superset union of all possible types. I've provided some test output below for more info.

Yeah... we have some known issues around the union types. E.g. I got stuck in #498 around something quite similar.

I should check with @paleolimbot and @jorisvandenbossche how we want to handle union types canonically. Should we expect that GeoArrow Geometry arrays have all fields or just the minimal fields? I added a comment to geoarrow/geoarrow#43 (comment) about this.

@gadomski
Copy link
Contributor Author

@kylebarron any movement on the union type issue? I ask because I'm running into this issue w/ parquet writing ATM. It looks like you have some skeleton code here:

// let mut geom_types = HashSet::new();
// arr.as_mixed_2d().chunks().iter().for_each(|chunk| {
// if chunk.has_points() {
// geom_types.insert("Point".to_string());
// }
// if chunk.has_line_string_2ds() {
// geom_types.insert("LineString".to_string());
// }
// if chunk.has_polygon_2ds() {
// geom_types.insert("Polygon".to_string());
// }
// if chunk.has_multi_point_2ds() {
// geom_types.insert("MultiPoint".to_string());
// }
// if chunk.has_multi_line_string_2ds() {
// geom_types.insert("MultiLineString".to_string());
// }
// if chunk.has_multi_polygon_2ds() {
// geom_types.insert("MultiPolygon".to_string());
// }
// });
// geom_types.into_iter().collect()
}

Do you reckon that's valid, at least as a guess on where things will settle?

@kylebarron
Copy link
Member

It looks like you have some skeleton code here:

I don't think that itself is a relevant part of the code. You're probably hitting union type issues somewhere else. Do you have a specific traceback?

@gadomski
Copy link
Contributor Author

gadomski commented Aug 23, 2024

Do you have a specific traceback?

Yeah:

called `Result::unwrap()` on an `Err` value: GeoArrow(Arrow(CastError("Invalid argument error: column types must match schema types, expected Union([(1, Field { name: \"\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.point\"} }), (2, Field { name: \"\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.linestring\"} }), (3, Field { name: \"\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.polygon\"} }), (4, Field { name: \"\", data_type: List(Field { name: \"points\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipoint\"} }), (5, Field { name: \"\", data_type: List(Field { name: \"linestrings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multilinestring\"} }), (6, Field { name: \"\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) but found Union([(6, Field { name: \"geometry\", data_type: List(Field { name: \"polygons\", data_type: List(Field { name: \"rings\", data_type: List(Field { name: \"vertices\", data_type: FixedSizeList(Field { name: \"xy\", data_type: Float64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 2), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {\"ARROW:extension:name\": \"geoarrow.multipolygon\"} })], Dense) at column index 8")))

This is happening on a simple roundtrip test in stac-geoparquet (write a mixed geometry table to some bytes, then read it back).

@kylebarron
Copy link
Member

But can you remind me what line of code that's coming from? It's not in the writer code you linked to

@kylebarron kylebarron mentioned this pull request Aug 23, 2024
5 tasks
@kylebarron
Copy link
Member

#714 that'll be a big PR to work on next week.

@gadomski
Copy link
Contributor Author

Sweet ... guessing you've got it, but if need be a minimum reproducible example is in a draft PR here: #717.

kylebarron added a commit that referenced this pull request Aug 26, 2024
### Change list

- Fix DataType creation to match the spec, with hard-coded type ids.
- Don't include geoarrow metadata on underlying arrays when exporting to
arrow-rs. Only include geoarrow metadata on top-level
`geoarrow.geometry` array
- We no longer need a `map` attribute in the struct because the ordering
of the fields is guaranteed by the spec now.
- Don't store underlying arrays under an `Option`

Closes #717, closes
#714

Unblocks #646

Progress towards #679
@kylebarron
Copy link
Member

@gadomski can you try again from latest main?

@gadomski
Copy link
Contributor Author

@kylebarron looks like it's fixed for me over at stac-geoparquet! I'll see if I can update this PR as well 🙇🏼

@kylebarron kylebarron mentioned this pull request Oct 10, 2024
2 tasks
@gadomski
Copy link
Contributor Author

Closing in favor of #826

@gadomski gadomski closed this Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use arrow-csv for CSV operations
2 participants