
Chile #34

Open · wants to merge 10 commits into main

Conversation
bhupatiraju (Contributor):
This contains the processing of the Chile BOOST data. In the raw data there are two sheets for the municipal data (before 2017, and 2017 onwards).

In the transform-load stage, we process these separately and then merge them with the processed central sheet.

In the functional categories, 'Social protection' has no formula to define it, so the tables are missing that category even though in the raw data it is manually coded as 0.0 for all years.

We omit the years 2007 and 2008 since some line items have encoding issues (notably the letter 'o' written where the digit '0' was intended), and some functional categories are missing for those years (they only appear from 2009 onwards).
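The year filter described above can be sketched in pandas; the frame and column names here are hypothetical stand-ins for the processed BOOST line items.

```python
import pandas as pd

# Hypothetical frame standing in for the processed BOOST line items.
df = pd.DataFrame({
    "year": [2007, 2008, 2009, 2010],
    "executed": [1.0, 2.0, 3.0, 4.0],
})

# Drop 2007 and 2008: their line items have encoding issues
# (e.g. the letter "o" typed where the digit "0" was intended)
# and their functional categories are incomplete.
OMITTED_YEARS = {2007, 2008}
df = df[~df["year"].isin(OMITTED_YEARS)].reset_index(drop=True)
```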

bhupatiraju and others added 8 commits November 6, 2024 12:37
…pal sheets

Note: Intersections between categories still exist.
…void clash with existing columns. Lower-case to avoid missing variations of the condition
…e discrepancies for the years 2007 and 2008, we omit them)
@bhupatiraju bhupatiraju requested a review from weilu December 20, 2024 13:27
Chile/CHL_extract_microdata_excel_to_csv.py (outdated comments, resolved)
Chile/CHL_transform_load_dlt.py (outdated comments, resolved)
"XV": "Arica y Parinacota",
"XVI": "Ñuble",
}
region_mapping_expr = create_map([lit(key) for pair in region_mapping.items() for key in pair])
Contributor:
A nested list comprehension isn't the most readable. Try something like:

region_mapping_expr = create_map(
    [lit(key), lit(val) for key, val in region_mapping.items()]
)

Contributor Author:

This doesn't work since I need the alternating keys and values in a flattened list for create_map.
I modified it to be the following:

region_mapping_expr = create_map(
    [item for key, val in region_mapping.items() for item in (lit(key), lit(val))]
)

Does this read better?
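The flattening pattern can be checked in plain Python, independent of Spark; in the actual pipeline each item would additionally be wrapped in pyspark.sql.functions.lit before being passed to create_map.

```python
# create_map expects a flat list alternating key, value, key, value, ...
# Demonstrated on plain strings so it runs without a Spark session.
region_mapping = {"XV": "Arica y Parinacota", "XVI": "Ñuble"}

flattened = [item for key, val in region_mapping.items()
             for item in (key, val)]
# → ["XV", "Arica y Parinacota", "XVI", "Ñuble"]
```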

Chile/CHL_transform_load_dlt.py (5 outdated comments, resolved)
Comment on lines 98 to 102
df2 = (spark.read
.format("csv")
.options(**CSV_READ_OPTIONS)
.option("inferSchema", "true")
.load(f'{COUNTRY_MICRODATA_DIR}/Mun2.csv')
Contributor:
We can have two bronze municipal tables, one loaded from each Mun sheet, and then either two corresponding silver tables, or a single silver table that reads from both bronze tables, processes them, and combines them. I feel that would be cleaner and preserve data at each stage, as the medallion architecture intends.

Contributor Author:

I have updated the code to read mun1 and mun2 into their own bronze tables (with no transformations there). Then they are merged into chl_boost_bronze_mun.

In a similar manner, cen is read into chl_boost_bronze_cen with no modifications. Then, transformations are done to produce chl_boost_silver_cen.

Finally, chl_boost_silver_cen and chl_boost_silver_mun are merged to produce chl_boost_silver.
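The table flow described above can be sketched with pandas stand-ins for the DLT tables; the column names and values are illustrative only, and the silver-stage transformations are omitted.

```python
import pandas as pd

# pandas stand-ins for the DLT tables; columns and values are illustrative.
bronze_mun1 = pd.DataFrame({"year": [2016], "executed": [10.0]})
bronze_mun2 = pd.DataFrame({"year": [2017], "executed": [12.0]})

# chl_boost_bronze_mun: the two municipal sheets combined, no other changes.
bronze_mun = pd.concat([bronze_mun1, bronze_mun2], ignore_index=True)

# chl_boost_silver_cen / chl_boost_silver_mun: transformed tables
# (transformations omitted in this sketch).
silver_cen = pd.DataFrame({"year": [2016, 2017], "executed": [100.0, 110.0]})
silver_mun = bronze_mun

# chl_boost_silver: the merged silver table.
silver = pd.concat([silver_cen, silver_mun], ignore_index=True)
```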

Comment on lines 24 to 25
# Normalize cells
df = df.applymap(normalize_cell)
@elysenko Jan 15, 2025:

The applymap function is deprecated; use the map function instead.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html

Contributor Author:

I think there are multiple places across the countries where this has been used, so I'll clean this up along with those. For now, I am leaving it as is rather than looping over the columns and using map.
