
Parquet: Add readers and writers for the internal object model #11904

Merged — 9 commits merged into apache:main on Jan 24, 2025

Conversation

ajantha-bhat (Member) commented Jan 3, 2025:

  • Refactor BaseParquetWriter and BaseParquetReaders so they can be reused for internal writers and readers.
  • Added InternalWriter and InternalReader classes for Parquet that consume and produce the Iceberg in-memory data model.
  • Fixed some bugs in the generic readers, such as UUID handling, timestamp millis, and fixed-length validation.

@ajantha-bhat marked this pull request as draft on January 3, 2025
@ajantha-bhat requested a review from rdblue on January 3, 2025
@ajantha-bhat reopened this on Jan 4, 2025
@ajantha-bhat force-pushed the parquet_internal_writer branch from 772f5c2 to 233a00b on January 6, 2025
.palantir/revapi.yml (outdated)
@rdblue changed the title from "Parquet: Internal writer and reader" to "Parquet: Add readers and writers for the internal object model" on Jan 7, 2025

@Override
public UUID read(UUID reuse) {
return UUIDUtil.convert(column.nextBinary().toByteBuffer());

Contributor:
This looks fine to me.

return new ParquetValueReaders.UnboxedReader<>(desc);
}

private static class ParquetStructReader extends StructReader<StructLike, StructLike> {

Contributor:

Here also, there's not much value in using Parquet in the class name. Since this will produce GenericRecord instances, how about RecordReader?

Contributor:

When checking that name (RecordReader) for consistency, I noticed that there's already a RecordReader in GenericParquetReaders. You can reuse that class.

ajantha-bhat (Member, Author) commented Jan 9, 2025:

Cannot reuse the class from GenericParquetReaders as it is based on the Record interface; we need a class based on the StructLike interface.

I will rename it to StructLikeReader, matching the StructLikeWriter from the InternalWriter class.

@Override
protected ParquetValueReaders.PrimitiveReader<?> int96Reader(ColumnDescriptor desc) {
// normal handling as int96
return new ParquetValueReaders.UnboxedReader<>(desc);

Contributor:

This isn't correct. The unboxed reader will return a Binary for int96 columns. Instead, this needs to use the same logic as the Spark reader (which also uses the internal representation):

  private static class TimestampInt96Reader extends UnboxedReader<Long> {
    TimestampInt96Reader(ColumnDescriptor desc) {
      super(desc);
    }

    @Override
    public Long read(Long ignored) {
      return readLong();
    }

    @Override
    public long readLong() {
      final ByteBuffer byteBuffer =
          column.nextBinary().toByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
      return ParquetUtil.extractTimestampInt96(byteBuffer);
    }
  }

You can move that class into the parquet package to share it.
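For readers following along, the decoding that ParquetUtil.extractTimestampInt96 performs can be sketched with plain java.nio (the class and method names below are illustrative, not Iceberg API): an int96 timestamp is 8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Int96Sketch {
  // Julian day number of the Unix epoch (1970-01-01).
  private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L;
  private static final long MICROS_PER_DAY = 86_400_000_000L;

  // Decodes an int96 value (nanos-of-day + Julian day, little-endian)
  // into microseconds from the Unix epoch.
  static long toMicrosFromEpoch(ByteBuffer int96) {
    ByteBuffer buf = int96.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();
    long julianDay = buf.getInt();
    return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
  }
}
```

This is why the UnboxedReader cannot be used directly: the raw value is a Binary, not a long, until this decoding is applied.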

@@ -359,10 +250,10 @@ public ParquetValueReader<?> primitive(

ColumnDescriptor desc = type.getColumnDescription(currentPath());

-    if (primitive.getOriginalType() != null) {
+    if (primitive.getLogicalTypeAnnotation() != null) {

Contributor:

I agree with this change, but please point these kinds of changes out for reviewers.

The old version worked because all of the supported logical type annotations had an equivalent ConvertedType (which is what OriginalType is called in Parquet format and the logical type docs).

@@ -76,6 +64,16 @@ protected ParquetValueReader<T> createReader(
protected abstract ParquetValueReader<T> createStructReader(
List<Type> types, List<ParquetValueReader<?>> fieldReaders, Types.StructType structType);

protected abstract LogicalTypeAnnotation.LogicalTypeAnnotationVisitor<ParquetValueReader<?>>

Contributor:

I don't think it makes sense to have the subclasses provide this visitor.

private static final OffsetDateTime EPOCH = Instant.ofEpochSecond(0).atOffset(ZoneOffset.UTC);
private static final LocalDate EPOCH_DAY = EPOCH.toLocalDate();

private static class DateReader extends ParquetValueReaders.PrimitiveReader<LocalDate> {

Contributor:

I agree with moving the date/time reader classes here.

@Override
public Optional<ParquetValueReader<?>> visit(
LogicalTypeAnnotation.TimestampLogicalTypeAnnotation timestampLogicalType) {
return Optional.of(new ParquetValueReaders.UnboxedReader<>(desc));

Contributor:

This isn't correct. The unit of the incoming timestamp value still needs to be handled, even if the in-memory representation of the value is the same (a long).

Contributor:

Looks like the Spark implementations for this should work well, just like the int96 cases.
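The unit handling described here amounts to scaling the raw Parquet long into Iceberg's internal microsecond representation. A minimal sketch of that scaling, under illustrative names (this is not the Iceberg or Spark API):

```java
import java.util.concurrent.TimeUnit;

class TimestampUnitSketch {
  // Iceberg's internal model stores timestamps as microseconds from the
  // epoch, so a reader must scale the raw Parquet long by the column's
  // declared unit instead of passing it through unboxed.
  static long toMicros(long value, TimeUnit unit) {
    switch (unit) {
      case MILLISECONDS:
        return Math.multiplyExact(value, 1_000L);
      case MICROSECONDS:
        return value;
      case NANOSECONDS:
        return Math.floorDiv(value, 1_000L);
      default:
        throw new IllegalArgumentException("Unsupported timestamp unit: " + unit);
    }
  }
}
```

Only the micros case is a pass-through; the other units need an explicit conversion even though the in-memory type is the same long.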

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public class TestInternalWriter {

Contributor:

As with the Avro tests, I think this should extend DataTest. It is probably easier to do the Avro work first and then reuse it here.

\ org.apache.iceberg.data.parquet.BaseParquetReaders<T>::logicalTypeReaderVisitor(org.apache.parquet.column.ColumnDescriptor,\
\ org.apache.iceberg.types.Type.PrimitiveType, org.apache.parquet.schema.PrimitiveType)"
justification: "{Refactor Parquet reader and writer}"
- code: "java.method.abstractMethodAdded"

Contributor:

This PR should not introduce revapi failures. Instead, the new methods should have default implementations that match the previous behavior (returning the generic representations).

ajantha-bhat (Member, Author):

The new methods are abstract, and abstract methods cannot have default implementations. So I think we have to handle the revapi failures.

ajantha-bhat (Member, Author):

Oh, I think what you mean is: don't add them as abstract methods; add them as concrete methods with default implementations. Got it. I will update it today.
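The resolution agreed on above can be sketched as follows (hypothetical class names, not the actual Iceberg classes): the base class gains a concrete method whose body preserves the previous generic behavior, so no abstract method is added and revapi stays clean, while the new internal subclass overrides it.

```java
// Hypothetical sketch of the revapi-friendly approach: a concrete method
// with the old (generic) behavior as its default body, overridden by the
// new internal subclass instead of being declared abstract.
abstract class BaseReadersSketch {
  Object timestampReader() {
    return "generic-timestamp-reader"; // default: previous generic behavior
  }
}

class InternalReadersSketch extends BaseReadersSketch {
  @Override
  Object timestampReader() {
    return "internal-timestamp-reader"; // internal object model
  }
}
```

Existing subclasses compile unchanged because they inherit the default body, which is exactly what revapi's `abstractMethodAdded` check is protecting.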

@@ -850,4 +919,42 @@ private TripleIterator<?> firstNonNullColumn(List<TripleIterator<?>> columns) {
return NullReader.NULL_COLUMN;
}
}

private static class RecordReader<T extends StructLike> extends StructReader<T, T> {

Contributor:

This returns Record so I don't think it needed to be modified. It doesn't return any other subclass of StructLike.


protected ParquetValueWriter<?> uuidWriter(ColumnDescriptor desc) {
// Use primitive-type writer (as FIXED_LEN_BYTE_ARRAY); no special writer needed.
return null;

rdblue (Contributor) commented Jan 22, 2025:

I think I commented on this in the last round of reviews. This isn't correct. Incoming values are of type UUID so this needs a writer that can convert UUID into a byte array. This should return ParquetValueWriters.uuids(desc).

There's also no need to add a method for this because it is the same between the generic and internal object models.
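The conversion that ParquetValueWriters.uuids(desc) has to perform is essentially UUID-to-16-byte fixed binary. A self-contained sketch (the class and helper names are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

class UuidBytesSketch {
  // A UUID is written as a 16-byte FIXED_LEN_BYTE_ARRAY: the most
  // significant long followed by the least significant long, big-endian.
  static byte[] toBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
  }
}
```

Returning null here would skip this conversion entirely, which is why a dedicated UUID writer is required for the internal model's UUID values.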

ajantha-bhat (Member, Author) commented Jan 23, 2025:

Sorry, the existing test cases were the reason for the confusion, as I mentioned in #11904 (comment)

I will also update the existing Arrow test cases in this PR.

return new ParquetValueReaders.TimestampMillisReader(desc);
}

public static <T extends StructLike> StructReader<T, T> recordReader(

Contributor:

Should be ParquetValueReader<Record>

ajantha-bhat (Member, Author):

@rdblue: Thanks for giving additional context for the unresolved comments. I think I understood all the comments this time. The PR is ready. It also fixes base-code issues and test cases.


@Override
public long readLong() {
return 1000L * column.nextInteger();

Contributor:

This is valid for time but not for timestamp. I may have mixed up the timestamp reader and time reader in an earlier comment. This needs to be nextLong.

Contributor:

I think the confusion was from this comment: #11904 (comment)

I was talking about the time type, but the code I pasted had the wrong class name, TimestampMillisReader should have been TimeMillisReader. Timestamps (millis) should use nextLong and time (millis) should use nextInteger.
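The distinction can be summarized in a small sketch (illustrative names): time(millis) arrives as an INT32 count of millis-of-day, while timestamp(millis) arrives as an INT64 count of millis-from-epoch. Both scale by 1,000 to the internal microsecond representation, but they must read different physical types.

```java
class MillisReaderSketch {
  // time(millis) is an INT32 millis-of-day value, read via nextInteger().
  static long timeMillisToMicros(int millisOfDay) {
    return 1_000L * millisOfDay;
  }

  // timestamp(millis) is an INT64 millis-from-epoch value, read via nextLong().
  static long timestampMillisToMicros(long millisFromEpoch) {
    return Math.multiplyExact(millisFromEpoch, 1_000L);
  }
}
```

Using nextInteger for a timestamp column would truncate the 64-bit value, which is the bug being fixed here.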

Contributor:

Fixed in ajantha-bhat#74

ajantha-bhat (Member, Author):

Lack of test coverage in the base code for these millisecond time, millisecond timestamp, and int96 timestamp cases is the reason for the back and forth. Tests would have caught this. I will try to add them in a follow-up.

ajantha-bhat (Member, Author):

Rebasing the PR as Flink hit a flaky test (#11833 (comment)).

@ajantha-bhat force-pushed the parquet_internal_writer branch from f3d9245 to 20f7c26 on January 24, 2025
@rdblue merged commit 67c52b5 into apache:main on Jan 24, 2025
47 checks passed

rdblue (Contributor) commented Jan 24, 2025:

Thanks, @ajantha-bhat! Good to get this in.
