This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

the converter should only read all data once #31

Open · wants to merge 1 commit into base: master
13 changes: 8 additions & 5 deletions athena_glue_service_logs/converter.py
@@ -33,7 +33,7 @@ def __init__(self, glue_context, data_catalog_ref, optimized_catalog_ref):

     def run(self):
         """Extract data from the data catalog and convert it to parquet, partitioning it along the way"""
-        from awsglue.transforms import DropNullFields
+        from pyspark.sql.types import NullType

         # Retrieve the source data from the Glue catalog
         source_data = self.glue_context.create_dynamic_frame.from_catalog(
@@ -45,11 +45,14 @@ def run(self):
         # Perform any data-source-specific conversions
         optimized_transforms = self.optimized_catalog.conversion_actions(source_data)

-        # Remove nulls and convert to dataframe - dataframe is only needed for replacing the date partitions.
+        # Convert to dataframe - dataframe is only needed for replacing the date partitions.
         # It was previously used to repartition, but Glue supports that now.
-        drop_nulls = DropNullFields.apply(frame=optimized_transforms, transformation_ctx="drop_nulls")
-        data_frame = drop_nulls.toDF()
+        data_frame = optimized_transforms.toDF()

+        # Drop any column whose type is NullType - the Parquet writer can't handle those
+        cols = data_frame.schema.fields
+        for col in cols:
+            if isinstance(col.dataType, NullType):
+                data_frame = data_frame.drop(col.name)
Contributor:
It looks like this is dropping the whole column? DropNullFields is intended to convert missing values to null values.

Author:
Yes, it is dropping the whole column, just as DropNullFields does as far as I know. The first line of the doc link you posted:
"Drops all null fields in a DynamicFrame whose type is NullType. These are fields with missing or null values in every record in the DynamicFrame dataset."

Author:
Don't know if you remember, but the rationale for dropping them is that the writer doesn't handle NullType columns.

Contributor:
Ya know, I think I entirely misunderstood the purpose of the DropNullFields function. 🤦‍♂️

If that's the case, this makes total sense and I don't even know if we want to drop null columns...will have to think about that some more. :)

Author:
Yepp, do it. But I can add that when you're outputting Parquet, you need to remove NullType columns, since Parquet has no such datatype and the writer will fail.
There is no impact while reading, though. So if you have all the columns in the table schema but some of the columns are missing from the Parquet files, they will be read as null.
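
A hedged sketch of the failure mode described here, again in plain PySpark; the output paths are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import NullType

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "label"]).withColumn("empty", lit(None))

try:
    # Fails: Parquet has no equivalent of NullType
    df.write.parquet("/tmp/with_null_col")
except Exception as err:
    print(err)  # e.g. "Parquet data source does not support void data type"

# Drop the NullType columns first and the write succeeds; readers that expect
# "empty" from the table schema will simply get nulls back for it
for field in df.schema.fields:
    if isinstance(field.dataType, NullType):
        df = df.drop(field.name)
df.write.mode("overwrite").parquet("/tmp/without_null_col")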

         # We might have no data - if that's the case, short-circuit
         if not data_frame.head(1):
             LOGGER.info("No data returned, skipping conversion.")