📊 wpp: new 2024 release (#53)
* chore: update python deps

* feat: indicator-based population explorer mostly works

* feat: indicator-based population explorer works

* feat: get rid of table defs!!

* fix: fix up color scales

* feat: drop/rename explicit title and subtitle fields

I think these should all come from the etl now?

* feat: use 2024 data!

* population_broad -> population

* wip

* wip

* population change

* wip: explorer config

* wip

* add f/m migration

* feat: no longer drop title/subtitle/note

* Revert "feat: get rid of table defs!!"

This reverts commit bd5c65c.

* feat: specify map color schemes as metadata

* feat: get rid of explicit source information

* enhance: clarify titles and subtitles

* enhance: specify sensible `yAxisMin` for "Life expectancy at age ..."

* enhance: use better color scheme for 100+ deaths

* enhance: remove most explicit units

* enhance: explicitly set empty subtitles to single-space string

* enhance: don't specify column display name in most cases

* enhance: no special treatment for column type

* enhance: use `catalogPath` column

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip map brackets

* map brackets

* remove 'fertility' keyword

* remove note on mortality rates

* fix unwanted '-' removal

* remove dimensions based on public release

* bump wpp dataset version

* disable scenarios for num deaths

---------

Co-authored-by: Marcel Gerber <[email protected]>
Co-authored-by: owidbot <[email protected]>
3 people authored Jul 12, 2024
1 parent ab880cf commit bb894df
Showing 7 changed files with 2,227 additions and 2,955 deletions.
4,448 changes: 1,800 additions & 2,648 deletions explorers/population-and-demography.explorer.tsv

Large diffs are not rendered by default.

107 changes: 56 additions & 51 deletions scripts/demography-explorer/age_group.csv
@@ -1,51 +1,56 @@
slug,csv_slug,name,title_suffix
all,all,Total,
none,none,None
0,0,Under 1 year,of children under the age of 1
0_4,0_4,Under 5 years,of children under the age of 5
0_14,0_14,Under 15 years,of children under the age of 15
0_24,0_24,Under 25 years,under the age of 25
15_64,15_64,15-64 years,aged 15 to 64 years
1_4,1_4,1–4 years,aged 1 to 4 years
5_9,5_9,5–9 years,aged 5 to 9 years
10_14,10_14,10–14 years,aged 10 to 14 years
15_19,15_19,15–19 years,aged 15 to 19 years
15plus,15plus,15+ years,older than 15 years
18plus,18plus,18+ years,older than 18 years
20_29,20_29,20–29 years,aged 20 to 29 years
30_39,30_39,30–39 years,aged 30 to 39 years
40_49,40_49,40–49 years,aged 40 to 49 years
50_59,50_59,50–59 years,aged 50 to 59 years
60_69,60_69,60–69 years,aged 60 to 69 years
70_79,70_79,70–79 years,aged 70 to 79 years
80_89,80_89,80–89 years,aged 80 to 89 years
90_99,90_99,90–99 years,aged 90 to 99 years
100plus,100plus,100+ years,older than 100 years
mothers_15_19,15_19,Mothers aged 15–19 years,from mothers aged 15 to 19 years
mothers_20_24,20_24,Mothers aged 20–24 years,from mothers aged 20 to 24 years
mothers_25_29,25_29,Mothers aged 25–29 years,from mothers aged 25 to 29 years
mothers_30_34,30_34,Mothers aged 30–34 years,from mothers aged 30 to 34 years
mothers_35_39,35_39,Mothers aged 35–39 years,from mothers aged 35 to 39 years
mothers_40_44,40_44,Mothers aged 40–44 years,from mothers aged 40 to 44 years
mothers_45_49,45_49,Mothers aged 45–49 years,from mothers aged 45 to 49 years
aged_15,15,Aged 15,at age 15
aged_65,65,Aged 65,at age 65
aged_80,80,Aged 80,at age 80
at_birth,at_birth,At birth,at birth
1,1,At age 1,at age 1
5,5,At age 5,at age 5
10,10,At age 10,at age 10
15,15,At age 15,at age 15
20,20,At age 20,at age 20
30,30,At age 30,at age 30
40,40,At age 40,at age 40
50,50,At age 50,at age 50
60,60,At age 60,at age 60
65,65,At age 65,at age 65
70,70,At age 70,at age 70
80,80,At age 80,at age 80
90,90,At age 90,at age 90
100_and_over,100plus,At age 100 and over,at age 100 and over
dependency_total,dependency_total,Total dependency ratio,
dependency_child,dependency_child,Youth dependency ratio,
dependency_old,dependency_old,Old-age dependency ratio,
slug,csv_slug,name,title_suffix,plain
all,all,Total,,
none,none,None,,
0,0,Under 1 year,of children under the age of 1,
0_4,0_4,Under 5 years,of children under the age of 5,
0_14,0_14,Under 15 years,of children under the age of 15,
0_24,0_24,Under 25 years,under the age of 25,
15_64,15_64,15-64 years,aged 15 to 64 years,
1_4,1_4,1–4 years,aged 1 to 4 years,
5_9,5_9,5–9 years,aged 5 to 9 years,
10_14,10_14,10–14 years,aged 10 to 14 years,
15_19,15_19,15–19 years,aged 15 to 19 years,
15plus,15plus,15+ years,older than 15 years,
18plus,18plus,18+ years,older than 18 years,
20_29,20_29,20–29 years,aged 20 to 29 years,
30_39,30_39,30–39 years,aged 30 to 39 years,
40_49,40_49,40–49 years,aged 40 to 49 years,
50_59,50_59,50–59 years,aged 50 to 59 years,
60_69,60_69,60–69 years,aged 60 to 69 years,
70_79,70_79,70–79 years,aged 70 to 79 years,
80_89,80_89,80–89 years,aged 80 to 89 years,
90_99,90_99,90–99 years,aged 90 to 99 years,
100plus,100plus,100+ years,older than 100 years,
mothers_10_14,10_14,Mothers aged 10–14 years,from mothers aged 10 to 14 years,
mothers_15_19,15_19,Mothers aged 15–19 years,from mothers aged 15 to 19 years,
mothers_20_24,20_24,Mothers aged 20–24 years,from mothers aged 20 to 24 years,
mothers_25_29,25_29,Mothers aged 25–29 years,from mothers aged 25 to 29 years,
mothers_30_34,30_34,Mothers aged 30–34 years,from mothers aged 30 to 34 years,
mothers_35_39,35_39,Mothers aged 35–39 years,from mothers aged 35 to 39 years,
mothers_40_44,40_44,Mothers aged 40–44 years,from mothers aged 40 to 44 years,
mothers_45_49,45_49,Mothers aged 45–49 years,from mothers aged 45 to 49 years,
mothers_50_54,50_54,Mothers aged 50–54 years,from mothers aged 50 to 54 years,
aged_15,15,Aged 15,at age 15,15
aged_30,30,Aged 30,at age 30,30
aged_45,45,Aged 45,at age 45,45
aged_65,65,Aged 65,at age 65,65
aged_80,80,Aged 80,at age 80,80
aged_90,90,Aged 90,at age 90,90
at_birth,0,At birth,at birth,0
1,1,At age 1,at age 1,1
5,5,At age 5,at age 5,5
10,10,At age 10,at age 10,10
15,15,At age 15,at age 15,15
20,20,At age 20,at age 20,20
30,30,At age 30,at age 30,30
40,40,At age 40,at age 40,40
50,50,At age 50,at age 50,50
60,60,At age 60,at age 60,60
65,65,At age 65,at age 65,65
70,70,At age 70,at age 70,70
80,80,At age 80,at age 80,80
90,90,At age 90,at age 90,90
100_and_over,100plus,At age 100 and over,at age 100 and over,
dependency_total,dependency_total,Total dependency ratio,,
dependency_child,dependency_child,Youth dependency ratio,,
dependency_old,dependency_old,Old-age dependency ratio,,
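
A minimal sketch of how the updated `age_group.csv` could be loaded, assuming pandas. The inline sample rows are copied from the new file; the `dtype=str` and `keep_default_na=False` options and the `by_slug` lookup are illustrative assumptions, not something this commit shows. The aim is only to show that numeric-looking slugs and the new `plain` column stay strings, and that empty cells stay empty strings.

```python
import io

import pandas as pd

# A few rows copied from the new age_group.csv (inline so the sketch is self-contained).
SAMPLE = """slug,csv_slug,name,title_suffix,plain
all,all,Total,,
0_4,0_4,Under 5 years,of children under the age of 5,
aged_15,15,Aged 15,at age 15,15
at_birth,0,At birth,at birth,0
"""

# dtype=str keeps slugs such as "0" or "15" as strings;
# keep_default_na=False keeps empty cells as "" instead of NaN.
age_groups = pd.read_csv(io.StringIO(SAMPLE), dtype=str, keep_default_na=False)

# Hypothetical lookup table keyed by slug.
by_slug = age_groups.set_index("slug")
```

With this setup, `by_slug.loc["at_birth", "csv_slug"]` is the string `"0"`, matching the changed `at_birth` row above.
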
93 changes: 45 additions & 48 deletions scripts/demography-explorer/demography-explorer.py
@@ -4,13 +4,14 @@
import textwrap
import pandas as pd
import re
from collections import defaultdict

# There are two datasets available:
# - DATASET_PATH_PREFIX: Classic dataset, with estimates for 1950-2023 and projections for 2024-2100.
# - DATASET_PATH_PREFIX_FULL: Alternative dataset, with projections for 1950-2100 (the 1950-2023 part is the same in all projections). This dataset is helpful in explorers to be able to plot the complete time series (estimates + projections) for a given projection.
DATASET_PATH_PREFIX = "grapher/un/2024-07-12/un_wpp/"
DATASET_PATH_PREFIX_FULL = "grapher/un/2024-07-12/un_wpp_full/"

def file_url(tableSlug):
    return (
        f"https://catalog.ourworldindata.org/explorers/un/2022/un_wpp/{tableSlug}.csv"
    )
COLS_TO_DROP = []


# %%
@@ -20,52 +21,43 @@ def substitute_rows(row):
if isinstance(row[key], str):
while "${" in row[key]:
template = Template(row[key])
row[key] = template.substitute(**row)
row[key] = template.substitute(
**row, DATASET_PATH_PREFIX=DATASET_PATH_PREFIX
)
return row
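
The change to `substitute_rows` passes `DATASET_PATH_PREFIX` as an extra keyword, so templates can reference it alongside the row's own fields. A minimal sketch of that repeated `string.Template` substitution follows. The helper name `substitute_value`, the row keys and the variable-id layout are hypothetical; only the constant's value comes from the script.

```python
from string import Template

# Same value as the constant defined in the script.
DATASET_PATH_PREFIX = "grapher/un/2024-07-12/un_wpp/"


def substitute_value(value, row):
    """Resolve ${...} placeholders in value, repeating until none are left."""
    while "${" in value:
        value = Template(value).substitute(
            **row, DATASET_PATH_PREFIX=DATASET_PATH_PREFIX
        )
    return value


# Hypothetical row: the keys and the id layout are made up for illustration.
row = {"metric": "population", "id_suffix": "${metric}__all"}
resolved = substitute_value("${DATASET_PATH_PREFIX}${metric}#${id_suffix}", row)
```

The loop matters because a substituted value can itself contain placeholders (here `id_suffix` expands to `${metric}__all`, which needs a second pass). Note that `Template.substitute` raises `KeyError` for an unknown placeholder, and a row that already has a `DATASET_PATH_PREFIX` key would clash with the extra keyword.
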


def table_def(tableSlug, rows, display_names):
table_def = f"table {file_url(tableSlug)} {tableSlug}"
rows["ySlugs"] = rows["ySlugs"].map(lambda x: x.split(" "))
rows = rows.explode("ySlugs").drop_duplicates("ySlugs").reset_index(drop=True)
def table_def(rows, display_names):
rows["yVariableIds"] = rows["yVariableIds"].map(lambda x: x.split(" "))
rows = (
rows.explode("yVariableIds")
.drop_duplicates("yVariableIds")
.reset_index(drop=True)
)

column_defs = rows.filter(regex="^column__", axis=1).rename(
columns=lambda x: re.sub("^column__", "", x)
)
column_defs = column_defs.drop(columns=["type"])
col_names = [
"slug",
"catalogPath",
"name",
"type",
"sourceName",
"sourceLink",
"dataPublishedBy",
"additionalInfo",
*column_defs.columns,
]
col_names = "\t".join(col_names)

col_defs = [
[
row["ySlugs"],
display_names[row["ySlugs"]],
row["column__type"],
"United Nations, World Population Prospects (2022)",
"https://population.un.org/wpp/",
"United Nations, Department of Economic and Social Affairs, Population Division (2022). World Population Prospects 2022, Online Edition.",
"""The 2022 Revision of World Population Prospects was released on 11 July 2022 by the Population Division of the Department of Economic and Social Affairs of the United Nations.\\n\\nIt presents population estimates from 1950 to the present, based on historical demographic trends. It also includes projections to the year 2100 based on a range of demographic scenarios. The three scenarios that we show (‘Low’, ‘Medium’, ‘High’) differ only with respect to the level of fertility; they share the same assumptions for sex ratio at birth, life expectancy and international migration.\\n\\nAll values are estimated based on current country borders.\\n\\nThe next revision of this data by the UN is due in 2024.""",
row["yVariableIds"],
display_names.get(row["yVariableIds"]) or "",
*column_defs.loc[idx].values.tolist(),
]
for (idx, row) in rows.iterrows()
]
col_defs = ["\t".join(col) for col in col_defs]
col_defs = textwrap.indent("\n".join(col_defs), "\t")

return f"""{table_def}
columns {tableSlug}
return f"""columns
{col_names}
location Country name EntityName
year Year Year
{col_defs}"""


@@ -117,20 +109,32 @@ def table_def(tableSlug, rows, display_names):
.apply(lambda x: x.strip())
.apply(lambda x: x[0].upper() + x[1:] if len(x) else x)
.apply(lambda x: re.sub(" {2,}", " ", x))
.apply(lambda x: x.replace("\\-\\", " "))
# .apply(lambda x: x or " ")
)
# explicitly set empty strings to a single space, so we don't inherit it from ETL
# df.loc[df[col] == "-", col] = " "


# %%
# Use DATASET_PATH_PREFIX_FULL when the projection is not "estimates" (i.e. for any projection scenario)
mask = df["projection__slug"] != "estimates"
df.loc[mask, "yVariableIds"] = df.loc[mask, "yVariableIds"].str.replace(
DATASET_PATH_PREFIX, DATASET_PATH_PREFIX_FULL
)
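
A minimal sketch of the masked prefix swap above on toy data, assuming pandas. The two prefixes are the constants from the script; the `"medium"` slug and the id suffixes are hypothetical. `regex=False` is added here only to make the literal-string match explicit.

```python
import pandas as pd

DATASET_PATH_PREFIX = "grapher/un/2024-07-12/un_wpp/"
DATASET_PATH_PREFIX_FULL = "grapher/un/2024-07-12/un_wpp_full/"

# Hypothetical rows; "medium" and the id suffixes are made up for illustration.
df = pd.DataFrame(
    {
        "projection__slug": ["estimates", "medium"],
        "yVariableIds": [
            DATASET_PATH_PREFIX + "population#a",
            DATASET_PATH_PREFIX + "population#a " + DATASET_PATH_PREFIX + "population#b",
        ],
    }
)

# Only rows that are not plain estimates are pointed at the "full" dataset.
mask = df["projection__slug"] != "estimates"
df.loc[mask, "yVariableIds"] = df.loc[mask, "yVariableIds"].str.replace(
    DATASET_PATH_PREFIX, DATASET_PATH_PREFIX_FULL, regex=False
)
```

Every occurrence in a cell is replaced, so rows that list several indicators end up pointing entirely at the full dataset.
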

# %%
# Extract column display names from ySlugs
# The `ySlugs` column can contain names for column slugs, e.g.:
# Extract column display names from yVariableIds
# The `yVariableIds` column can contain names for column slugs, e.g.:
# population_broad__all__15-24__records:"15-24 years"
# Note the colon, and especially the quotes around the name. They are required!
# This config will use the name "15-24 years" as the display name for the column.
# If an explicit name is not given, the row's title will be used instead.
col_display_names = {}

y_slug_re = r"([\w\-+]+):\"([^\"]+)\""
y_slug_re = r"([\w\-\/_#]+):\"([^\"]+)\""
for idx, row in df.iterrows():
matches = re.finditer(y_slug_re, row["ySlugs"])
matches = re.finditer(y_slug_re, row["yVariableIds"])
slugs = []
for match in matches:
col_slug, col_name = match.groups()
@@ -139,21 +143,13 @@ def table_def(tableSlug, rows, display_names):
col_display_names[col_slug] = col_name

if len(slugs):
row["ySlugs"] = " ".join(slugs)
elif row["ySlugs"] not in col_display_names:
col_display_names[row["ySlugs"]] = row["title"]
df.loc[idx, "yVariableIds"] = " ".join(slugs)
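
A minimal sketch of the display-name extraction, using the new `y_slug_re` pattern from this commit. The helper name `split_names` and the sample ids are hypothetical, and the lines of the loop that the diff collapses are not reproduced here; this only illustrates what the pattern captures.

```python
import re

# The pattern from the new version of the script: <id>:"<display name>"
y_slug_re = r"([\w\-\/_#]+):\"([^\"]+)\""


def split_names(y_variable_ids):
    """Return (ids joined by spaces, {id: display name}) for one cell."""
    names = {}
    ids = []
    for match in re.finditer(y_slug_re, y_variable_ids):
        col_id, col_name = match.groups()
        ids.append(col_id)
        names[col_id] = col_name
    # With no annotated ids, the cell is returned unchanged.
    return (" ".join(ids) if ids else y_variable_ids), names


# Hypothetical cell value; the ids are made up for illustration.
cell = 'un_wpp/population#pop__0-14:"Under 15 years" un_wpp/population#pop__15-64:"15-64 years"'
ids, names = split_names(cell)
```

The widened character class (`/`, `_`, `#`) is what lets the pattern match path-like ids rather than the shorter slugs the old pattern targeted. In this sketch, an id without a quoted name is dropped whenever another id in the same cell has one.
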

# %%
tables = df["tableSlug"].unique()
table_defs = [
table_def(
tableSlug,
df[df["tableSlug"] == tableSlug].reset_index(drop=True),
col_display_names,
)
for tableSlug in tables
if tableSlug != ""
]
col_defs = table_def(
df.reset_index(drop=True),
col_display_names,
)

# %%

@@ -175,12 +171,13 @@ def table_def(tableSlug, rows, display_names):
# Drop all remaining programmatic columns containing __
df = df.drop(columns=df.filter(regex="__"))

# %%
df = df.rename(columns={col_name: "_" + col_name for col_name in COLS_TO_DROP})

# %%
graphers_tsv = df.to_csv(sep="\t", index=False)
graphers_tsv_indented = textwrap.indent(graphers_tsv, "\t")

table_defs = "\n".join(table_defs)

# %%
warning = "# DO NOT EDIT THIS FILE BY HAND. It is automatically generated using a set of input files. Any changes made directly to it will be overwritten.\n\n"

@@ -189,7 +186,7 @@ def table_def(tableSlug, rows, display_names):
warning
+ template.substitute(
graphers_tsv=graphers_tsv_indented,
table_defs=table_defs,
table_defs=col_defs,
)
)
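
A minimal sketch of the final assembly step, where the indented graphers TSV and the single `columns` block are substituted into the explorer template. Only the placeholder names `graphers_tsv` and `table_defs` come from the script; the template text, the data frame and the `col_defs` string are hypothetical stand-ins.

```python
import textwrap
from string import Template

import pandas as pd

# Hypothetical explorer template and data; only the placeholder names
# graphers_tsv and table_defs come from the script.
template = Template("graphers\n${graphers_tsv}\n${table_defs}\n")
df = pd.DataFrame({"title": ["Population"], "yVariableIds": ["pop#total"]})
col_defs = "columns\n\tcatalogPath\tname\n\tpop#total\tPopulation"

# Serialise the grapher rows as TSV and indent every line by one tab.
graphers_tsv = df.to_csv(sep="\t", index=False)
graphers_tsv_indented = textwrap.indent(graphers_tsv, "\t")

output = template.substitute(
    graphers_tsv=graphers_tsv_indented,
    table_defs=col_defs,
)
```

After this commit the `table_defs` placeholder receives one `columns` block for all indicators instead of one `table` block per `tableSlug`.
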
