mega-boost

NOTE: images viewable on github

mega-boost is an ETL system that ingests raw BOOST public finance microdata, cleans and harmonizes it across countries, and outputs both micro- and aggregate-level data for policy analysis and reporting.

This documentation guides new contributors through the process of onboarding a country—from data ingestion and transformation to validation, quality assurance, and integration into cross-country pipelines and dashboards.

Cross-Country Harmonization

BOOST data is country-specific — for instance, some ministries of finance have IT systems that record detailed expenditure data for every provider of public services, while others may only have data aggregated at the district levels. The harmonization work in this repo standardizes the labeling of common BOOST features on the fiscal data at the most disaggregated level available to maximize the flexibility for analysis. The resulting data is stored in a table named boost_gold. The column definitions are detailed at the bottom of this file.

Code Organization

Each country's ETL pipeline code resides in its own folder.
cross_country_aggregate_dlt.py vertically stacks all countries' micro-level data into the boost_gold table and creates aggregates at various levels.
/quality contains the ETL code for processing BOOST Cross Country Interface (CCI) Excel files' "Approved" and "Executed" sheets for all available countries. The resulting tables are used for both:
- Programmatic quality checks in cross_country_aggregate_dlt.py, and
- Manual review and resolution of discrepancies along critical dimensions (total, functional, economic, etc.) in the Quality Dashboard

How to Add a New Country

0. Clone the repository (if not already done)

In your Workspace in the top right click Create->Git Folder. The path to the git repository is https://github.com/dime-worldbank/mega-boost.git

1. Create a new branch for your country

Navigate into the directory
Click on the tag for the Main branch
Click create branch

2. Create a new folder named using the country name (see image below).

3. Write a notebook to extract raw data

If the raw data is already in a format supported by DLT load, such as CSV, you may skip this step/notebook altogether. This is required for .xlsx files
Name the file
- Prefix the file name with the three letter country code of your country (so the boost pipelines can pick it up later). 3-letter ISO 3166 country codes.
- Use the following suffix: _extract_microdata_excel_to_csv. see below example.
Extract the BOOST/raw data from the Excel sheet(s) (e.g. ALB_extract_microdata_excel_to_csv). The resulting CSV files can then be loaded directly by a subsequent Delta Live Table (DLT) script (described later).
This notebook should save the data in CSV format. See Albania on Github as an example.

4. Ingest subnational population data. See mega-boost-indicators for more documentation.

Find official statistics and projections from the country's national statistics office if available. Otherwise, check if census.gov might provide subnational population statistics for the given country. Nigeria's subnational population estimates come from census.gov, for example.
Add the ETL code to the mega-indicator repo because it's a public good by itself. Make sure the region names in the subnational population dataset align with the BOOST subnational names for the given country, as the two datasets will be joined on the subnational region names.
Add the country's ETL script to the subnational population pipeline "Indicators on Demand": Workflow > Jobs > Indicators on Demand > Tasks > + Add task.
Global Data Lab's subnational human development datasets are used as some outcome indicators. Align subnational region names with those of population and BOOST if necessary.

5. Locate the source data in the volume

This is the file you will use to create the formulas in your ETL. The root for the volume with country data can be found here.

6. Write a notebook to transform the data (i.e. emulate the calculations in the Excel/CSV document)

Emulate the formulas in the source data to create a table with the tags in the boost_gold table below (e.g. country_name, year, etc.).
The code hould follow the DLT Notebook format (e.g. ALB_transform_load_dlt).
Validate your results by running the notebook with the ITSDA_DAP_TEAM_boostprocessed cluster. Reference an existing country's DLT pipeline settings for other configurations.

After running, the table should appear in the boost_intermediate schema under the prd_mega catalog in the Databricks Catalog.
Be sure to follow the naming convention (code folder & file names, table names, pipeline names, etc.) referencing existing countries. When a country code is needed, use the 3-letter ISO 3166 country codes.

7. Modify the Boost Agg Staging Pipeline to run quality checks against your data.

add your three letter country code to the list of countries (or comment the others out and just include yours for speed)
Verify that your code passes all quality checks (the graph elements should be all green like below)
resolve quality checks if there are problems

8. Use the staging dashboard to verify that your calculations match those in the Excel workbook.

Filter Countries to the newly added country. Check that the year coverage range and number of functional & economic category coverage are as expected.
Further filter by Discrepancy $, set min as 1, and investigate discrepancies at various dimensions (e.g., func/econ/total). Resolution may involve an iterative process of updating the pre-existing BOOST Excel workflows and/or updating the ETL pipeline code.
Our current acceptable threshold of discrepancy is 5%.

9. Update Excel calculations (if necessary)

If there are formulas in Excel that need to be revised because of double counting (e.g. Capital Expenditure and Social Assistance have overlapping costs), tag them under Questions for Massimo. See items there for examples.
If there needs to be a fix in the BOOST CCI EXCEL sheets, document the content of the changes and leave the logs in the following document: BOOST CCI changes reconciliations.
Make sure that the EXCEL is updated at both BOOST's shared project folder and MEGA project folder for databricks processing.
Any changes to the Excel document should reflect in Massimo's source document as well- update his if necessary.

10. Add the new country's ETL pipeline steps to the "BOOST Harmonize" job: Workflows > Jobs > BOOST Harmonize > Tasks > + Add task

Add the extraction step using Type: Notebook, and Source: git. Use the default cluster (ITSDA_DAP_TEAM_boostprocessed) as the compute.
Add the DLT step using Type: Delta Live Table pipeline, and select your DLT pipeline.
Check that the step dependencies are configured correctly.

11. Add the country to the cross-country aggregation DLT pipeline

Update cross_country_aggregate_dlt.py with the country's three letter code to add the table you produce in step 6.

12. Pull request your work

Pull Request Checklist

13. Verify your results in the production pipeline

Execute the "BOOST Agg" DLT workflow to perform the stacking and aggregation: Workflows > Delta Live Tables > BOOST Agg > Start.
Verify your code still passes quality checks.

14. Check the Production Dashbord

ensure the new country is reflected and its narratives are correctly presented.

boost_gold

Column Name	Data Type	Description	Possible Values	Possible Value Catalog Query
country_name	string	The name of the country for which the budget data is recorded	e.g., "Kenya", "Brazil". See wbgapi or WB website for a full list	`SELECT DISTINCT country_name FROM indicator.country`
year	int	The fiscal year for the budget data	e.g., 2023, 2024
admin0	string	Who spent the money at the highest administrative level	Either "Central" or "Regional"
admin1	string	Who spent the money at the first sub-national administrative level	If admin0 is "Regional," this is the state or province name; if admin0 is "Central," this is always "Central Scope." "Scope" is postfixed to avoid conflicting with a subnational region named "Central" in some countries	`SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold`
admin2	string	Who spent the money at the second sub-national administrative level	For "Central" spending, this is the government agency/ministry name; for "Regional" spending, this can be the district or municipality name or subnational government agency/department name
geo0	string	Is the money geographically or centrally allocated (either "Central" or "Regional") regardless of the spender	Either "Central" or "Regional"
geo1	string	Geographically, at which first sub-national administrative level the money was spent	Same as admin1 possible values	`SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold`
func	string	Functional classification of the budget	e.g., Health, Education. See Classification of the Functions of Government (COFOG) for a full list	`SELECT DISTINCT func FROM boost_intermediate.quality_functional_gold`
func_sub	string	Sub-functional classification under the main COFOG function	e.g., "Primary Education", "Secondary Education"
econ	string	Economic classification of the budget	e.g., "Wage bill", "Goods and services"	`SELECT DISTINCT econ FROM boost_intermediate.quality_economic_gold`
econ_sub	string	Sub-economic classification under the main economic category	e.g., "Basic Wages", "Allowances"
approved	double	Amount of budget in current local currency approved by the relevant authority	Numeric value representing local currency
revised	double	Revised budget in current local currency amount during the fiscal year	Numeric value representing local currency
executed	double	Actual amount spent in current local currency by the end of the fiscal year	Numeric value representing local currency

Name		Name	Last commit message	Last commit date
Latest commit History 516 Commits
Albania		Albania
Bangladesh		Bangladesh
Bhutan		Bhutan
Burkina_Faso		Burkina_Faso
Chile		Chile
Colombia		Colombia
Congo_DR		Congo_DR
Ghana		Ghana
Kenya		Kenya
Liberia		Liberia
Mozambique		Mozambique
Nigeria		Nigeria
Pakistan		Pakistan
Paraguay		Paraguay
South_Africa		South_Africa
Togo		Togo
Tunisia		Tunisia
Uruguay		Uruguay
auxiliary_data		auxiliary_data
quality		quality
LICENSE		LICENSE
README.md		README.md
cross_country_aggregate_dlt.py		cross_country_aggregate_dlt.py
publish.py		publish.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

mega-boost

Cross-Country Harmonization

Code Organization

How to Add a New Country

0. Clone the repository (if not already done)

1. Create a new branch for your country

2. Create a new folder named using the country name (see image below).

3. Write a notebook to extract raw data

4. Ingest subnational population data. See mega-boost-indicators for more documentation.

5. Locate the source data in the volume

6. Write a notebook to transform the data (i.e. emulate the calculations in the Excel/CSV document)

7. Modify the Boost Agg Staging Pipeline to run quality checks against your data.

8. Use the staging dashboard to verify that your calculations match those in the Excel workbook.

9. Update Excel calculations (if necessary)

10. Add the new country's ETL pipeline steps to the "BOOST Harmonize" job: Workflows > Jobs > BOOST Harmonize > Tasks > + Add task

11. Add the country to the cross-country aggregation DLT pipeline

12. Pull request your work

13. Verify your results in the production pipeline

14. Check the Production Dashbord

boost_gold

About

Uh oh!

Releases

Packages

Contributors 6

Uh oh!

Languages

License

dime-worldbank/mega-boost

Folders and files

Latest commit

History

Repository files navigation

mega-boost

Cross-Country Harmonization

Code Organization

How to Add a New Country

0. Clone the repository (if not already done)

1. Create a new branch for your country

2. Create a new folder named using the country name (see image below).

3. Write a notebook to extract raw data

4. Ingest subnational population data. See mega-boost-indicators for more documentation.

5. Locate the source data in the volume

6. Write a notebook to transform the data (i.e. emulate the calculations in the Excel/CSV document)

7. Modify the Boost Agg Staging Pipeline to run quality checks against your data.

8. Use the staging dashboard to verify that your calculations match those in the Excel workbook.

9. Update Excel calculations (if necessary)

10. Add the new country's ETL pipeline steps to the "BOOST Harmonize" job: Workflows > Jobs > BOOST Harmonize > Tasks > + Add task

11. Add the country to the cross-country aggregation DLT pipeline

12. Pull request your work

13. Verify your results in the production pipeline

14. Check the Production Dashbord

boost_gold

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 6

Uh oh!

Languages

Packages