This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

the converter should only read all data once #31

Open
wants to merge 1 commit into
base: master

Conversation

NiklasMolin

Issue #, if available: N/A

Description of changes:
The current implementation reads all the data twice, as far as I can see: the DynamicFrame's dropNullFields causes recomputeSchema to be triggered in the toDF call.
There are probably a thousand ways of achieving this. I just added something that solves the matter quickly, since I don't know whether this project is still alive.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

from pyspark.sql.types import NullType

# Drop NullType columns directly from the DataFrame instead of calling
# DropNullFields on the DynamicFrame, so the schema is not recomputed
# (which would force a second full read of the data).
cols = data_frame.schema.fields
for col in cols:
    if isinstance(col.dataType, NullType):
        data_frame = data_frame.drop(col.name)
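The loop above can be exercised without a Spark cluster by mocking the schema objects. This is a minimal sketch: MockField and MockDataFrame are stand-ins invented here, and only the attribute names they expose (schema.fields, dataType, name, drop) mirror what pyspark.sql.DataFrame provides.

```python
class NullType:  # stand-in for pyspark.sql.types.NullType
    pass

class StringType:  # stand-in for pyspark.sql.types.StringType
    pass

class MockField:
    """Mimics a StructField: just a name and a dataType."""
    def __init__(self, name, data_type):
        self.name = name
        self.dataType = data_type

class MockDataFrame:
    """Mimics the two DataFrame features the patch uses: schema.fields and drop()."""
    def __init__(self, fields):
        self.schema = type("Schema", (), {"fields": fields})()

    def drop(self, col_name):
        # Like DataFrame.drop, returns a new frame without the named column.
        return MockDataFrame([f for f in self.schema.fields if f.name != col_name])

data_frame = MockDataFrame([
    MockField("id", StringType()),
    MockField("empty", NullType()),   # null in every record -> NullType
    MockField("payload", StringType()),
])

# The loop from the patch, unchanged:
cols = data_frame.schema.fields
for col in cols:
    if isinstance(col.dataType, NullType):
        data_frame = data_frame.drop(col.name)

print([f.name for f in data_frame.schema.fields])  # ['id', 'payload']
```

Only the all-NullType column is removed; columns that merely contain some nulls keep their real type and are untouched.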
Contributor

It looks like this is dropping the whole column? DropNullFields is intended to convert missing values to null values.

Author

Yes, it is dropping the whole column, just like DropNullFields does, as far as I know. From the first line of the doc link you posted:
"Drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null values in every record in the DynamicFrame dataset."

Author

I don't know if you remember, but the rationale for dropping them is that the writer doesn't handle NullType columns.

Contributor

Ya know, I think I entirely misunderstood the purpose of the DropNullFields function. 🤦‍♂️

If that's the case, this makes total sense, and I don't even know if we want to drop null columns... I'll have to think about that some more. :)

Author

Yep, do it. But I can add that when you're writing Parquet output, you need to remove NullType columns, since Parquet has no such datatype and the writer will fail.
There is no impact when reading, though: if the table schema has all the columns but some of them are missing from the Parquet files, they will be read as null.
