Skip to content

dime-worldbank/mega-boost

Repository files navigation

mega-boost

NOTE: images viewable on github

mega-boost is an ETL system that ingests raw BOOST public finance microdata, cleans and harmonizes it across countries, and outputs both micro- and aggregate-level data for policy analysis and reporting.

This documentation guides new contributors through the process of onboarding a country—from data ingestion and transformation to validation, quality assurance, and integration into cross-country pipelines and dashboards.

Cross-Country Harmonization

BOOST data is country-specific — for instance, some ministries of finance have IT systems that record detailed expenditure data for every provider of public services, while others may only have data aggregated at the district levels. The harmonization work in this repo standardizes the labeling of common BOOST features on the fiscal data at the most disaggregated level available to maximize the flexibility for analysis. The resulting data is stored in a table named boost_gold. The column definitions are detailed at the bottom of this file.

Code Organization

  • Each country's ETL pipeline code resides in its own folder.
  • cross_country_aggregate_dlt.py vertically stacks all countries' micro-level data into the boost_gold table and creates aggregates at various levels.
  • /quality contains the ETL code for processing BOOST Cross Country Interface (CCI) Excel files' "Approved" and "Executed" sheets for all available countries. The resulting tables are used for both:

How to Add a New Country

0. Clone the repository (if not already done)

In your Workspace in the top right click Create->Git Folder. The path to the git repository is https://github.com/dime-worldbank/mega-boost.git image

1. Create a new branch for your country

  • Navigate into the directory
  • Click on the tag for the Main branch
  • Click create branch

2. Create a new folder named using the country name (see image below).

image

3. Write a notebook to extract raw data

  • If the raw data is already in a format supported by DLT load, such as CSV, you may skip this step/notebook altogether. This is required for .xlsx files
  • Name the file
    • Prefix the file name with the three letter country code of your country (so the boost pipelines can pick it up later). 3-letter ISO 3166 country codes.
    • Use the following suffix: _extract_microdata_excel_to_csv. see below example.
    • image
  • Extract the BOOST/raw data from the Excel sheet(s) (e.g. ALB_extract_microdata_excel_to_csv). The resulting CSV files can then be loaded directly by a subsequent Delta Live Table (DLT) script (described later).
  • This notebook should save the data in CSV format. See Albania on Github as an example.

4. Ingest subnational population data. See mega-boost-indicators for more documentation.

  • Find official statistics and projections from the country's national statistics office if available. Otherwise, check if census.gov might provide subnational population statistics for the given country. Nigeria's subnational population estimates come from census.gov, for example.
  • Add the ETL code to the mega-indicator repo because it's a public good by itself. Make sure the region names in the subnational population dataset align with the BOOST subnational names for the given country, as the two datasets will be joined on the subnational region names.
  • Add the country's ETL script to the subnational population pipeline "Indicators on Demand": Workflow > Jobs > Indicators on Demand > Tasks > + Add task.
  • Global Data Lab's subnational human development datasets are used as some outcome indicators. Align subnational region names with those of population and BOOST if necessary.

5. Locate the source data in the volume

image This is the file you will use to create the formulas in your ETL. The root for the volume with country data can be found here.

6. Write a notebook to transform the data (i.e. emulate the calculations in the Excel/CSV document)

  • Emulate the formulas in the source data to create a table with the tags in the boost_gold table below (e.g. country_name, year, etc.).
  • The code hould follow the DLT Notebook format (e.g. ALB_transform_load_dlt).
  • Validate your results by running the notebook with the ITSDA_DAP_TEAM_boostprocessed cluster. Reference an existing country's DLT pipeline settings for other configurations.

image

  • After running, the table should appear in the boost_intermediate schema under the prd_mega catalog in the Databricks Catalog. image
  • Be sure to follow the naming convention (code folder & file names, table names, pipeline names, etc.) referencing existing countries. When a country code is needed, use the 3-letter ISO 3166 country codes.

7. Modify the Boost Agg Staging Pipeline to run quality checks against your data.

  • add your three letter country code to the list of countries (or comment the others out and just include yours for speed) image
  • Verify that your code passes all quality checks (the graph elements should be all green like below) image
  • resolve quality checks if there are problems

8. Use the staging dashboard to verify that your calculations match those in the Excel workbook.

  • Filter Countries to the newly added country. Check that the year coverage range and number of functional & economic category coverage are as expected.
  • Further filter by Discrepancy $, set min as 1, and investigate discrepancies at various dimensions (e.g., func/econ/total). Resolution may involve an iterative process of updating the pre-existing BOOST Excel workflows and/or updating the ETL pipeline code.
  • image
  • Our current acceptable threshold of discrepancy is 5%.

9. Update Excel calculations (if necessary)

10. Add the new country's ETL pipeline steps to the "BOOST Harmonize" job: Workflows > Jobs > BOOST Harmonize > Tasks > + Add task

image

  • Add the extraction step using Type: Notebook, and Source: git. Use the default cluster (ITSDA_DAP_TEAM_boostprocessed) as the compute.
  • Add the DLT step using Type: Delta Live Table pipeline, and select your DLT pipeline.
  • Check that the step dependencies are configured correctly.

11. Add the country to the cross-country aggregation DLT pipeline

image

12. Pull request your work

Pull Request Checklist

  • Extraction notebook complete
  • DLT transformation validated
  • Quality checks pass
  • Dashboard discrepancies < 5%
  • Excel reconciliation complete

13. Verify your results in the production pipeline

  • Execute the "BOOST Agg" DLT workflow to perform the stacking and aggregation: Workflows > Delta Live Tables > BOOST Agg > Start.
  • Verify your code still passes quality checks.

14. Check the Production Dashbord

  • ensure the new country is reflected and its narratives are correctly presented.

boost_gold

Column Name Data Type Description Possible Values Possible Value Catalog Query
country_name string The name of the country for which the budget data is recorded e.g., "Kenya", "Brazil". See wbgapi or WB website for a full list SELECT DISTINCT country_name FROM indicator.country
year int The fiscal year for the budget data e.g., 2023, 2024
admin0 string Who spent the money at the highest administrative level Either "Central" or "Regional"
admin1 string Who spent the money at the first sub-national administrative level If admin0 is "Regional," this is the state or province name; if admin0 is "Central," this is always "Central Scope." "Scope" is postfixed to avoid conflicting with a subnational region named "Central" in some countries SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold
admin2 string Who spent the money at the second sub-national administrative level For "Central" spending, this is the government agency/ministry name; for "Regional" spending, this can be the district or municipality name or subnational government agency/department name
geo0 string Is the money geographically or centrally allocated (either "Central" or "Regional") regardless of the spender Either "Central" or "Regional"
geo1 string Geographically, at which first sub-national administrative level the money was spent Same as admin1 possible values SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold
func string Functional classification of the budget e.g., Health, Education. See Classification of the Functions of Government (COFOG) for a full list SELECT DISTINCT func FROM boost_intermediate.quality_functional_gold
func_sub string Sub-functional classification under the main COFOG function e.g., "Primary Education", "Secondary Education"
econ string Economic classification of the budget e.g., "Wage bill", "Goods and services" SELECT DISTINCT econ FROM boost_intermediate.quality_economic_gold
econ_sub string Sub-economic classification under the main economic category e.g., "Basic Wages", "Allowances"
approved double Amount of budget in current local currency approved by the relevant authority Numeric value representing local currency
revised double Revised budget in current local currency amount during the fiscal year Numeric value representing local currency
executed double Actual amount spent in current local currency by the end of the fiscal year Numeric value representing local currency

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 6