NOTE: images viewable on github
mega-boost is an ETL system that ingests raw BOOST public finance microdata, cleans and harmonizes it across countries, and outputs both micro- and aggregate-level data for policy analysis and reporting.
This documentation guides new contributors through the process of onboarding a country—from data ingestion and transformation to validation, quality assurance, and integration into cross-country pipelines and dashboards.
BOOST data is country-specific — for instance, some ministries of finance have IT systems that record detailed expenditure data for every provider of public services, while others may only have data aggregated at the district levels. The harmonization work in this repo standardizes the labeling of common BOOST features on the fiscal data at the most disaggregated level available to maximize the flexibility for analysis. The resulting data is stored in a table named boost_gold. The column definitions are detailed at the bottom of this file.
- Each country's ETL pipeline code resides in its own folder.
- cross_country_aggregate_dlt.py vertically stacks all countries' micro-level data into the
boost_goldtable and creates aggregates at various levels. - /quality contains the ETL code for processing BOOST Cross Country Interface (CCI) Excel files' "Approved" and "Executed" sheets for all available countries. The resulting tables are used for both:
- Programmatic quality checks in cross_country_aggregate_dlt.py, and
- Manual review and resolution of discrepancies along critical dimensions (total, functional, economic, etc.) in the Quality Dashboard
In your Workspace in the top right click Create->Git Folder. The path to the git repository is https://github.com/dime-worldbank/mega-boost.git

- Navigate into the directory
- Click on the tag for the Main branch
- Click
create branch
- If the raw data is already in a format supported by DLT load, such as CSV, you may skip this step/notebook altogether. This is required for .xlsx files
- Name the file
- Prefix the file name with the three letter country code of your country (so the boost pipelines can pick it up later). 3-letter ISO 3166 country codes.
- Use the following suffix:
_extract_microdata_excel_to_csv. see below example. -
- Extract the BOOST/raw data from the Excel sheet(s) (e.g. ALB_extract_microdata_excel_to_csv). The resulting CSV files can then be loaded directly by a subsequent Delta Live Table (DLT) script (described later).
- This notebook should save the data in CSV format. See Albania on Github as an example.
4. Ingest subnational population data. See mega-boost-indicators for more documentation.
- Find official statistics and projections from the country's national statistics office if available. Otherwise, check if census.gov might provide subnational population statistics for the given country. Nigeria's subnational population estimates come from census.gov, for example.
- Add the ETL code to the mega-indicator repo because it's a public good by itself. Make sure the region names in the subnational population dataset align with the BOOST subnational names for the given country, as the two datasets will be joined on the subnational region names.
- Add the country's ETL script to the subnational population pipeline "Indicators on Demand": Workflow > Jobs > Indicators on Demand > Tasks > + Add task.
- Global Data Lab's subnational human development datasets are used as some outcome indicators. Align subnational region names with those of population and BOOST if necessary.
This is the file you will use to create the formulas in your ETL. The root for the volume with country data can be found here.
- Emulate the formulas in the source data to create a table with the tags in the
boost_goldtable below (e.g. country_name, year, etc.). - The code hould follow the DLT Notebook format (e.g. ALB_transform_load_dlt).
- Validate your results by running the notebook with the
ITSDA_DAP_TEAM_boostprocessedcluster. Reference an existing country's DLT pipeline settings for other configurations.
- After running, the table should appear in the
boost_intermediateschema under theprd_megacatalog in the Databricks Catalog.
- Be sure to follow the naming convention (code folder & file names, table names, pipeline names, etc.) referencing existing countries. When a country code is needed, use the 3-letter ISO 3166 country codes.
7. Modify the Boost Agg Staging Pipeline to run quality checks against your data.
- add your three letter country code to the list of countries (or comment the others out and just include yours for speed)

- Verify that your code passes all quality checks (the graph elements should be all green like below)

- resolve quality checks if there are problems
8. Use the staging dashboard to verify that your calculations match those in the Excel workbook.
- Filter Countries to the newly added country. Check that the year coverage range and number of functional & economic category coverage are as expected.
- Further filter by Discrepancy $, set min as 1, and investigate discrepancies at various dimensions (e.g., func/econ/total). Resolution may involve an iterative process of updating the pre-existing BOOST Excel workflows and/or updating the ETL pipeline code.

- Our current acceptable threshold of discrepancy is 5%.
- If there are formulas in Excel that need to be revised because of double counting (e.g.
Capital ExpenditureandSocial Assistancehave overlapping costs), tag them under Questions for Massimo. See items there for examples. - If there needs to be a fix in the BOOST CCI EXCEL sheets, document the content of the changes and leave the logs in the following document: BOOST CCI changes reconciliations.
- Make sure that the EXCEL is updated at both BOOST's shared project folder and MEGA project folder for databricks processing.
- Any changes to the Excel document should reflect in Massimo's source document as well- update his if necessary.
10. Add the new country's ETL pipeline steps to the "BOOST Harmonize" job: Workflows > Jobs > BOOST Harmonize > Tasks > + Add task
- Add the extraction step using Type: Notebook, and Source: git. Use the default cluster (
ITSDA_DAP_TEAM_boostprocessed) as the compute. - Add the DLT step using Type: Delta Live Table pipeline, and select your DLT pipeline.
- Check that the step dependencies are configured correctly.
- Update cross_country_aggregate_dlt.py with the country's three letter code to add the table you produce in step 6.
Pull Request Checklist
- Extraction notebook complete
- DLT transformation validated
- Quality checks pass
- Dashboard discrepancies < 5%
- Excel reconciliation complete
- Execute the "BOOST Agg" DLT workflow to perform the stacking and aggregation: Workflows > Delta Live Tables > BOOST Agg > Start.
- Verify your code still passes quality checks.
14. Check the Production Dashbord
- ensure the new country is reflected and its narratives are correctly presented.
| Column Name | Data Type | Description | Possible Values | Possible Value Catalog Query |
|---|---|---|---|---|
| country_name | string | The name of the country for which the budget data is recorded | e.g., "Kenya", "Brazil". See wbgapi or WB website for a full list | SELECT DISTINCT country_name FROM indicator.country |
| year | int | The fiscal year for the budget data | e.g., 2023, 2024 | |
| admin0 | string | Who spent the money at the highest administrative level | Either "Central" or "Regional" | |
| admin1 | string | Who spent the money at the first sub-national administrative level | If admin0 is "Regional," this is the state or province name; if admin0 is "Central," this is always "Central Scope." "Scope" is postfixed to avoid conflicting with a subnational region named "Central" in some countries | SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold |
| admin2 | string | Who spent the money at the second sub-national administrative level | For "Central" spending, this is the government agency/ministry name; for "Regional" spending, this can be the district or municipality name or subnational government agency/department name | |
| geo0 | string | Is the money geographically or centrally allocated (either "Central" or "Regional") regardless of the spender | Either "Central" or "Regional" | |
| geo1 | string | Geographically, at which first sub-national administrative level the money was spent | Same as admin1 possible values | SELECT DISTINCT admin1_region FROM indicator.admin1_boundaries_gold |
| func | string | Functional classification of the budget | e.g., Health, Education. See Classification of the Functions of Government (COFOG) for a full list | SELECT DISTINCT func FROM boost_intermediate.quality_functional_gold |
| func_sub | string | Sub-functional classification under the main COFOG function | e.g., "Primary Education", "Secondary Education" | |
| econ | string | Economic classification of the budget | e.g., "Wage bill", "Goods and services" | SELECT DISTINCT econ FROM boost_intermediate.quality_economic_gold |
| econ_sub | string | Sub-economic classification under the main economic category | e.g., "Basic Wages", "Allowances" | |
| approved | double | Amount of budget in current local currency approved by the relevant authority | Numeric value representing local currency | |
| revised | double | Revised budget in current local currency amount during the fiscal year | Numeric value representing local currency | |
| executed | double | Actual amount spent in current local currency by the end of the fiscal year | Numeric value representing local currency |


