Chile #34
base: main
Conversation
…pal sheets Note: Intersections between categories still exist.
…void clash with existing columns. Lower case to avoid missing variations of condition
…lumn. Changed type of year
…ng to align with subnational population.
…e discrepancies for the years 2007 and 2008, we omit them))
Chile/CHL_transform_load_dlt.py
Outdated
"XV": "Arica y Parinacota", | ||
"XVI": "Ñuble", | ||
} | ||
region_mapping_expr = create_map([lit(key) for pair in region_mapping.items() for key in pair]) |
The nested list comprehension isn't the most readable. Try something like:
region_mapping_expr = create_map(
[lit(key), lit(val) for key, val in region_mapping.items()]
)
This doesn't work since create_map needs the alternating keys and values in a single flattened list. I modified it to be the following:
region_mapping_expr = create_map(
[item for key, val in region_mapping.items() for item in (lit(key), lit(val))]
)
Does this read better?
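For context, a minimal sketch of how the flattened expression ends up being used downstream; the column names region_code and region_name are illustrative assumptions, not the actual columns in the pipeline:

from pyspark.sql.functions import create_map, lit, col

region_mapping = {"XV": "Arica y Parinacota", "XVI": "Ñuble"}

# create_map expects alternating key/value columns, hence the flattening step
region_mapping_expr = create_map(
    [item for key, val in region_mapping.items() for item in (lit(key), lit(val))]
)

# Look up each row's code in the map expression (column names here are hypothetical)
df = df.withColumn("region_name", region_mapping_expr[col("region_code")])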
Chile/CHL_transform_load_dlt.py
Outdated
df2 = (spark.read
    .format("csv")
    .options(**CSV_READ_OPTIONS)
    .option("inferSchema", "true")
    .load(f'{COUNTRY_MICRODATA_DIR}/Mun2.csv')
We can have two bronze_municipal tables for loading from each Mun sheet, and then either two corresponding silver tables, or a single silver table that reads from both bronze tables, processes them, and then combines them. I feel like that would be cleaner and preserve data at each stage as defined by the medallion architecture.
I have updated the code to read mun1 and mun2 into their own bronze tables (with no transformations there). Then they are merged into chl_boost_bronze_mun.
In a similar manner, cen is read into chl_boost_bronze_cen with no modifications. Then, transformations are done to produce chl_boost_silver_cen.
Finally, chl_boost_silver_cen and chl_boost_silver_mun are merged to produce chl_boost_silver.
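For illustration, a rough sketch of that layering under Delta Live Tables, assuming the decorator-based Python API; CSV_READ_OPTIONS and COUNTRY_MICRODATA_DIR are the existing constants from the script, the Mun1.csv path is inferred from the Mun2.csv path in the diff, and the actual silver transformations are omitted:

import dlt

@dlt.table(name="chl_boost_bronze_mun1")
def bronze_mun1():
    # Raw load only; no transformations at the bronze stage
    return (spark.read.format("csv")
            .options(**CSV_READ_OPTIONS)
            .load(f"{COUNTRY_MICRODATA_DIR}/Mun1.csv"))

@dlt.table(name="chl_boost_bronze_mun2")
def bronze_mun2():
    return (spark.read.format("csv")
            .options(**CSV_READ OPTIONS)
            .load(f"{COUNTRY_MICRODATA_DIR}/Mun2.csv"))

@dlt.table(name="chl_boost_bronze_mun")
def bronze_mun():
    # Combine the two municipal sheets; allowMissingColumns tolerates columns present in only one sheet
    return dlt.read("chl_boost_bronze_mun1").unionByName(
        dlt.read("chl_boost_bronze_mun2"), allowMissingColumns=True
    )

@dlt.table(name="chl_boost_silver")
def silver():
    # chl_boost_silver_cen and chl_boost_silver_mun hold the processed data (definitions not shown)
    return dlt.read("chl_boost_silver_cen").unionByName(dlt.read("chl_boost_silver_mun"))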
# Normalize cells
df = df.applymap(normalize_cell)
The applymap function is deprecated; use the map function instead: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html
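For reference, the drop-in replacement would look roughly like this; normalize_cell below is a hypothetical stand-in for the existing helper:

import pandas as pd

def normalize_cell(value):
    # Hypothetical placeholder for the real helper in the script
    return value.strip().lower() if isinstance(value, str) else value

df = pd.DataFrame({"region": [" XV ", "XVI"], "amount": [1.0, 2.0]})

# DataFrame.map applies a function elementwise, replacing the deprecated applymap (pandas >= 2.1)
df = df.map(normalize_cell)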
I think there are multiple places across the countries where this has been used so I'll clean this up along with those. For now, I am leaving it as is instead of looping over the columns and using map.
…e Municipal bronze and silver stages
This contains the processing of the Chile BOOST data. In the raw data there are two sheets for the Municipal data (before 2017, and 2017 and after).
In the transform load stage, we process these separately and then merge them with the processed central sheet.
Among the func categories, 'Social protection' doesn't have any formulae to define it, so the tables are missing that category even though in the raw data it is manually coded as 0.0 for all years.
We omit the years 2007 and 2008 since there are some encoding issues with some line items (notably the letter 'o' written as '0'), and these years are also missing some func categories (which only appear from 2009 onwards).
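As a rough sketch of that omission in PySpark (the year column name is an assumption about the merged municipal data):

from pyspark.sql import functions as F

# Drop the problematic years before producing the silver table
df = df.filter(~F.col("year").isin(2007, 2008))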