Parquet: Add readers and writers for the internal object model #11904
Conversation
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Resolved (outdated): parquet/src/test/java/org/apache/iceberg/parquet/TestInternalWriter.java
Force-pushed from 772f5c2 to 233a00b.
```java
@Override
public UUID read(UUID reuse) {
  return UUIDUtil.convert(column.nextBinary().toByteBuffer());
}
```
This looks fine to me.
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/InternalWriter.java (four threads)
```java
  return new ParquetValueReaders.UnboxedReader<>(desc);
}

private static class ParquetStructReader extends StructReader<StructLike, StructLike> {
```

Here also, there's not much value in using `Parquet` in the class name. Since this will produce `GenericRecord` instances, how about `RecordReader`?
When checking that name (`RecordReader`) for consistency, I noticed that there's already a `RecordReader` in `GenericParquetReaders`. You can reuse that class.
We cannot reuse the class from `GenericParquetReaders` because it is based on the `Record` interface; we need a class based on the `StructLike` interface. I will rename it to `StructLikeReader`, matching the `StructLikeWriter` in the `InternalWriter` class.
```java
@Override
protected ParquetValueReaders.PrimitiveReader<?> int96Reader(ColumnDescriptor desc) {
  // normal handling as int96
  return new ParquetValueReaders.UnboxedReader<>(desc);
```

This isn't correct. The unboxed reader will return a `Binary` for int96 columns. Instead, this needs to use the same logic as the Spark reader (which also uses the internal representation):

```java
private static class TimestampInt96Reader extends UnboxedReader<Long> {
  TimestampInt96Reader(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public Long read(Long ignored) {
    return readLong();
  }

  @Override
  public long readLong() {
    final ByteBuffer byteBuffer =
        column.nextBinary().toByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
    return ParquetUtil.extractTimestampInt96(byteBuffer);
  }
}
```

You can move that class into the `parquet` package to share it.
```diff
@@ -359,10 +250,10 @@ public ParquetValueReader<?> primitive(
   ColumnDescriptor desc = type.getColumnDescription(currentPath());

-  if (primitive.getOriginalType() != null) {
+  if (primitive.getLogicalTypeAnnotation() != null) {
```
I agree with this change, but please point these kinds of changes out for reviewers. The old version worked because all of the supported logical type annotations had an equivalent `ConvertedType` (which is what `OriginalType` is called in the Parquet format and the logical type docs).
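One concrete case where the two checks differ (a small illustration, not code from this PR): the UUID logical type has no `ConvertedType`/`OriginalType` equivalent, so the old null check would skip it even though a logical annotation is present.

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

// UUID is annotated only as a logical type; it has no ConvertedType.
PrimitiveType uuidType =
    Types.required(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY)
        .length(16)
        .as(LogicalTypeAnnotation.uuidType())
        .named("uuid_col");

// uuidType.getOriginalType()          -> null
// uuidType.getLogicalTypeAnnotation() -> UUID annotation (non-null)
```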
Resolved: parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java
```diff
@@ -76,6 +64,16 @@ protected ParquetValueReader<T> createReader(
   protected abstract ParquetValueReader<T> createStructReader(
       List<Type> types, List<ParquetValueReader<?>> fieldReaders, Types.StructType structType);

+  protected abstract LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<ParquetValueReader<?>>
```
I don't think it makes sense to have the subclasses provide this visitor.
```java
private static final OffsetDateTime EPOCH = Instant.ofEpochSecond(0).atOffset(ZoneOffset.UTC);
private static final LocalDate EPOCH_DAY = EPOCH.toLocalDate();

private static class DateReader extends ParquetValueReaders.PrimitiveReader<LocalDate> {
```
I agree with moving the date/time reader classes here.
```java
@Override
public Optional<ParquetValueReader<?>> visit(
    LogicalTypeAnnotation.TimestampLogicalTypeAnnotation timestampLogicalType) {
  return Optional.of(new ParquetValueReaders.UnboxedReader<>(desc));
```

This isn't correct. The unit of the incoming timestamp value still needs to be handled, even if the in-memory representation of the value is the same (a `long`).
Looks like the Spark implementations for this should work well, just like the int96 cases.
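For illustration, a minimal sketch of that unit handling for millisecond timestamps, assuming the internal model stores timestamps as microsecond longs (this mirrors the `ParquetValueReaders.TimestampMillisReader` referenced later in this thread; it is not the PR's exact code):

```java
// Sketch: a TIMESTAMP(MILLIS) column stores int64 milliseconds, so the
// reader must scale values to the microsecond longs of the internal model.
private static class TimestampMillisReader extends ParquetValueReaders.UnboxedReader<Long> {
  TimestampMillisReader(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public Long read(Long ignored) {
    return readLong();
  }

  @Override
  public long readLong() {
    return 1000L * column.nextLong(); // millis -> micros
  }
}
```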
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java (two threads)
```java
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class TestInternalWriter {
```

As with the Avro tests, I think this should extend `DataTest`. It is probably easier to do the Avro work first and then reuse it here.
.palantir/revapi.yml (outdated)

```yaml
        \ org.apache.iceberg.data.parquet.BaseParquetReaders<T>::logicalTypeReaderVisitor(org.apache.parquet.column.ColumnDescriptor,\
        \ org.apache.iceberg.types.Type.PrimitiveType, org.apache.parquet.schema.PrimitiveType)"
      justification: "{Refactor Parquet reader and writer}"
  - code: "java.method.abstractMethodAdded"
```
This PR should not introduce revapi failures. Instead, the new methods should have default implementations that match the previous behavior (returning the generic representations).
The new methods are abstract, and an abstract method cannot have a default implementation. So I think we have to handle the revapi failures.
Oh, I think what you mean is: don't add them as abstract methods; add them as regular methods with a default implementation. I got it. I will update it today.
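A sketch of that approach (the hook name `dateReader` below is illustrative, not necessarily the PR's final method): the base class ships a non-abstract method whose default body keeps the previous generic behavior, so existing subclasses compile unchanged and revapi stays clean, while `InternalReader` overrides it for the internal representation.

```java
// In BaseParquetReaders: a regular method with a default body instead of a
// new abstract method, preserving the old generic behavior for subclasses.
protected ParquetValueReader<?> dateReader(ColumnDescriptor desc) {
  return new DateReader(desc); // generic model: dates read as LocalDate
}

// In InternalReader: override to produce the internal representation.
@Override
protected ParquetValueReader<?> dateReader(ColumnDescriptor desc) {
  return new ParquetValueReaders.UnboxedReader<>(desc); // days since epoch as int
}
```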
```diff
@@ -850,4 +919,42 @@ private TripleIterator<?> firstNonNullColumn(List<TripleIterator<?>> columns) {
     return NullReader.NULL_COLUMN;
   }
 }

+  private static class RecordReader<T extends StructLike> extends StructReader<T, T> {
```
This returns `Record`, so I don't think it needed to be modified. It doesn't return any other subclass of `StructLike`.
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetReaders.java
```java
protected ParquetValueWriter<?> uuidWriter(ColumnDescriptor desc) {
  // Use primitive-type writer (as FIXED_LEN_BYTE_ARRAY); no special writer needed.
  return null;
```
I think I commented on this in the last round of reviews. This isn't correct. Incoming values are of type `UUID`, so this needs a writer that can convert a UUID into a byte array. This should return `ParquetValueWriters.uuids(desc)`.
There's also no need to add a method for this because it is the same between the generic and internal object models.
Sorry, the existing test cases were the reason for the confusion, as I mentioned in #11904 (comment). I will also update the existing Arrow test cases in this PR.
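For reference, the conversion such a writer has to perform looks roughly like this (a standalone sketch of the idea behind `ParquetValueWriters.uuids`, not its actual implementation):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Sketch: a UUID is written as a 16-byte FIXED_LEN_BYTE_ARRAY,
// big-endian, most significant bits first.
static byte[] uuidToBytes(UUID uuid) {
  ByteBuffer buffer = ByteBuffer.allocate(16);
  buffer.putLong(uuid.getMostSignificantBits());
  buffer.putLong(uuid.getLeastSignificantBits());
  return buffer.array();
}
```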
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/BaseParquetWriter.java
```java
  return new ParquetValueReaders.TimestampMillisReader(desc);
}

public static <T extends StructLike> StructReader<T, T> recordReader(
```

Should be `ParquetValueReader<Record>`.
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/InternalWriter.java
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetReaders.java
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/GenericParquetWriter.java
@rdblue: Thanks for giving additional context on the unresolved comments. I think I understood all the comments this time. The PR is ready. It also fixes base-code issues and test cases.
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
```java
@Override
public long readLong() {
  return 1000L * column.nextInteger();
```
This is valid for `time` but not for `timestamp`. I may have mixed up the `timestamp` reader and the `time` reader in an earlier comment. This needs to be `nextLong`.
I think the confusion came from this comment: #11904 (comment). I was talking about the time type, but the code I pasted had the wrong class name: `TimestampMillisReader` should have been `TimeMillisReader`. Timestamps (millis) should use `nextLong`, and time (millis) should use `nextInteger`.
Fixed in ajantha-bhat#74
The lack of base-code test coverage for millisecond time, millisecond timestamp, and int96 timestamps is the reason for the back and forth. Tests would have caught this. I will try to add them in a follow-up.
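To make the distinction concrete, here is a sketch of the time counterpart, again assuming the internal model stores time-of-day as microsecond longs: a TIME(MILLIS) column is physically an int32, so it is read with `nextInteger` and scaled, whereas a TIMESTAMP(MILLIS) column is an int64 read with `nextLong` (as in the timestamp sketch earlier in this thread).

```java
// Sketch: TIME(MILLIS) is stored as an int32 count of milliseconds since
// midnight; read it with nextInteger() and scale to microseconds.
private static class TimeMillisReader extends ParquetValueReaders.UnboxedReader<Long> {
  TimeMillisReader(ColumnDescriptor desc) {
    super(desc);
  }

  @Override
  public Long read(Long ignored) {
    return readLong();
  }

  @Override
  public long readLong() {
    return 1000L * column.nextInteger(); // millis (int32) -> micros
  }
}
```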
Resolved (outdated): parquet/src/main/java/org/apache/iceberg/data/parquet/InternalReader.java
Rebasing the PR, as Flink hit a flaky test: #11833 (comment)
Force-pushed from f3d9245 to 20f7c26.
Thanks, @ajantha-bhat! Good to get this in.

PR summary:
- Refactored `BaseParquetWriter` and `BaseParquetReaders` so they can be reused for internal writers and readers.
- Added `InternalWriter` and `InternalReader` classes for Parquet that consume and produce the Iceberg in-memory data model.
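As a usage illustration (the factory method names on `InternalWriter`/`InternalReader` below are assumptions, not confirmed by this thread), the new classes plug into the existing `Parquet.write`/`Parquet.read` builders the same way the generic ones do:

```java
// Hypothetical usage sketch; outputFile, inputFile, and schema are assumed
// to be an OutputFile, an InputFile, and an Iceberg Schema in scope. The
// buildWriter/buildReader names mirror the generic classes and may differ.
FileAppender<StructLike> appender =
    Parquet.write(outputFile)
        .schema(schema)
        .createWriterFunc(messageType -> InternalWriter.buildWriter(messageType))
        .build();

CloseableIterable<StructLike> rows =
    Parquet.read(inputFile)
        .project(schema)
        .createReaderFunc(messageType -> InternalReader.buildReader(schema, messageType))
        .build();
```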