This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

the converter should only read all data once #31

Open
wants to merge 1 commit into
base: master

Conversation

NiklasMolin

Issue #, if available: N/A

Description of changes:
The current implementation reads all the data twice, as far as I can see: the DynamicFrame's dropNullFields causes recomputeSchema to be triggered in the toDF call.
There are probably a thousand ways of achieving this. I just added something that solves the matter quickly, since I don't know whether this project is still alive.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

from pyspark.sql.types import NullType

# Drop NullType columns directly from the DataFrame instead of calling
# DropNullFields on the DynamicFrame, so the schema is not recomputed
# (which would force a second full read of the data).
cols = data_frame.schema.fields
for col in cols:
    if isinstance(col.dataType, NullType):
        data_frame = data_frame.drop(col.name)
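The loop above can be exercised without a Spark cluster by mocking the schema objects. This is a minimal sketch: MockField and MockDataFrame are stand-ins invented here, and only the attribute names they expose (schema.fields, dataType, name, drop) mirror what pyspark.sql.DataFrame provides.

```python
class NullType:  # stand-in for pyspark.sql.types.NullType
    pass

class StringType:  # stand-in for pyspark.sql.types.StringType
    pass

class MockField:
    """Mimics a StructField: just a name and a dataType."""
    def __init__(self, name, data_type):
        self.name = name
        self.dataType = data_type

class MockDataFrame:
    """Mimics the two DataFrame features the patch uses: schema.fields and drop()."""
    def __init__(self, fields):
        self.schema = type("Schema", (), {"fields": fields})()

    def drop(self, col_name):
        # Like DataFrame.drop, returns a new frame without the named column.
        return MockDataFrame([f for f in self.schema.fields if f.name != col_name])

data_frame = MockDataFrame([
    MockField("id", StringType()),
    MockField("empty", NullType()),   # null in every record -> NullType
    MockField("payload", StringType()),
])

# The loop from the patch, unchanged:
cols = data_frame.schema.fields
for col in cols:
    if isinstance(col.dataType, NullType):
        data_frame = data_frame.drop(col.name)

print([f.name for f in data_frame.schema.fields])  # ['id', 'payload']
```

Only the all-NullType column is removed; columns that merely contain some nulls keep their real type and are untouched.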
Contributor

It looks like this is dropping the whole column? DropNullFields is intended to convert missing values to null values.

Author

Yes, it is dropping the whole column, just like DropNullFields does, as far as I know. From the first line of the doc link you posted:
"Drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null values in every record in the DynamicFrame dataset."

Author

I don't know if you remember, but the rationale for dropping them is that the writer doesn't handle NullType columns.

Contributor

Ya know, I think I entirely misunderstood the purpose of the DropNullFields function. 🤦‍♂️

If that's the case, this makes total sense, and I don't even know if we want to drop null columns... I'll have to think about that some more. :)

Author

Yep, do it. But I can add that when you're writing Parquet output, you need to remove NullType columns, since Parquet has no such datatype and the writer will fail.
There is no impact when reading, though: if the table schema has all the columns but some of them are missing from the Parquet files, they will be read as null.
