
Chile #34

Open · wants to merge 10 commits into main

Conversation
bhupatiraju (Contributor):
This contains the processing of the Chile BOOST data. In the raw data there are two sheets for the municipal data (before 2017, and 2017 onwards).

In the transform-load stage, we process these separately and then merge them with the processed central sheet.

In the functional categories, 'Social protection' has no formula to define it, so the tables are missing that category even though in the raw data it is manually coded as 0.0 for all years.

We omit the years 2007 and 2008 since some line items have encoding issues (notably the letter 'o' written where the digit '0' was intended), and some functional categories are missing for those years (they only appear from 2009 onwards).
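The year filter described above can be sketched in pandas; the frame and column names here are hypothetical stand-ins for the processed BOOST line items.

```python
import pandas as pd

# Hypothetical frame standing in for the processed BOOST line items.
df = pd.DataFrame({
    "year": [2007, 2008, 2009, 2010],
    "executed": [1.0, 2.0, 3.0, 4.0],
})

# Drop 2007 and 2008: their line items have encoding issues
# (e.g. the letter "o" typed where the digit "0" was intended)
# and their functional categories are incomplete.
OMITTED_YEARS = {2007, 2008}
df = df[~df["year"].isin(OMITTED_YEARS)].reset_index(drop=True)
```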

bhupatiraju and others added 8 commits November 6, 2024 12:37
…pal sheets

Note: Intersections between categories still exist.
…void clash with existing columns. Lower-case to avoid missing variations of the condition
…e discrepancies for the years 2007 and 2008, we omit them)
@bhupatiraju bhupatiraju requested a review from weilu December 20, 2024 13:27
Chile/CHL_extract_microdata_excel_to_csv.py (outdated comments, resolved)
Chile/CHL_transform_load_dlt.py (outdated comments, resolved)
"XV": "Arica y Parinacota",
"XVI": "Ñuble",
}
region_mapping_expr = create_map([lit(key) for pair in region_mapping.items() for key in pair])
Contributor:
A nested list comprehension isn't the most readable. Try something like:

region_mapping_expr = create_map(
    [lit(key), lit(val) for key, val in region_mapping.items()]
)

Contributor Author:

This doesn't work since I need the alternating keys and values in a flattened list for create_map.
I modified it to be the following:

region_mapping_expr = create_map(
    [item for key, val in region_mapping.items() for item in (lit(key), lit(val))]
)

Does this read better?
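The flattening pattern can be checked in plain Python, independent of Spark; in the actual pipeline each item would additionally be wrapped in pyspark.sql.functions.lit before being passed to create_map.

```python
# create_map expects a flat list alternating key, value, key, value, ...
# Demonstrated on plain strings so it runs without a Spark session.
region_mapping = {"XV": "Arica y Parinacota", "XVI": "Ñuble"}

flattened = [item for key, val in region_mapping.items()
             for item in (key, val)]
# → ["XV", "Arica y Parinacota", "XVI", "Ñuble"]
```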

Chile/CHL_transform_load_dlt.py (5 outdated comments, resolved)
Comment on lines 98 to 102
df2 = (spark.read
.format("csv")
.options(**CSV_READ_OPTIONS)
.option("inferSchema", "true")
.load(f'{COUNTRY_MICRODATA_DIR}/Mun2.csv')
Contributor:
We can have two bronze municipal tables, one loaded from each Mun sheet, and then either two corresponding silver tables, or a single silver table that reads from both bronze tables, processes them, and combines them. I feel that would be cleaner and preserve data at each stage, as the medallion architecture intends.

Contributor Author:

I have updated the code to read mun1 and mun2 into their own bronze tables (with no transformations there). Then they are merged into chl_boost_bronze_mun.

In a similar manner, cen is read into chl_boost_bronze_cen with no modifications. Then, transformations are done to produce chl_boost_silver_cen.

Finally, chl_boost_silver_cen and chl_boost_silver_mun are merged to produce chl_boost_silver.
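The table flow described above can be sketched with pandas stand-ins for the DLT tables; the column names and values are illustrative only, and the silver-stage transformations are omitted.

```python
import pandas as pd

# pandas stand-ins for the DLT tables; columns and values are illustrative.
bronze_mun1 = pd.DataFrame({"year": [2016], "executed": [10.0]})
bronze_mun2 = pd.DataFrame({"year": [2017], "executed": [12.0]})

# chl_boost_bronze_mun: the two municipal sheets combined, no other changes.
bronze_mun = pd.concat([bronze_mun1, bronze_mun2], ignore_index=True)

# chl_boost_silver_cen / chl_boost_silver_mun: transformed tables
# (transformations omitted in this sketch).
silver_cen = pd.DataFrame({"year": [2016, 2017], "executed": [100.0, 110.0]})
silver_mun = bronze_mun

# chl_boost_silver: the merged silver table.
silver = pd.concat([silver_cen, silver_mun], ignore_index=True)
```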

Comment on lines 24 to 25
# Normalize cells
df = df.applymap(normalize_cell)
@elysenko Jan 15, 2025:

The applymap function is deprecated; use the map function instead.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html

Contributor Author:

I think there are multiple places across the countries where this has been used, so I'll clean this up along with those. For now, I am leaving it as is rather than looping over the columns and using map.
