Almost automatically analyze the quality of a DCAT metadata catalog with a little help from ✨ AI.
```bash
# Clone the repository
git clone https://github.com/statistikZH/ogd_ai-analyzer.git
cd ogd_ai-analyzer

# Install uv and dependencies
pip3 install uv
uv venv
source .venv/bin/activate
uv sync
```

- You need to create an OpenRouter API key to use the LLM-based assessments. Create an `.env` file and input your API key like so (a minimal loading sketch follows after this list):

```
OPENROUTER_API_KEY=sk-or-v1-...
```
- Open the notebooks in your favorite IDE and run the code.
- Check the results (in the `_results` folder) and fix issues in your metadata.
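To check that the key is picked up, here is a minimal sketch of loading it from `.env` and creating an OpenRouter client in Python (assuming the `python-dotenv` and `openai` packages; the notebooks may wire this up differently):

```python
# Minimal sketch: load the OpenRouter key from .env and create a client.
# Assumes the python-dotenv and openai packages; the notebooks may differ.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENROUTER_API_KEY from the .env file into the environment

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)
```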
Note
The notebooks are set up as Quarto files. You don't need to use Quarto. You can simply run the notebooks as is and look at the results. However, we encourage you to try it out with Quarto. The results will be much more shareable, e.g., to a non-technical audience that doesn't want or need to see code. Simply install Quarto, add an extension to your IDE, and convert the notebooks to HTML or PDF files. You can also render the EDA notebook directly from the command line:
```bash
quarto render 01_mdv_quality_checks.ipynb
```

We perform a thorough metadata analysis and quality check using the OGD metadata catalog of the Canton of Zurich as an example.
This project:
- Treats the metadata catalog as a regular dataset and performs a structured, detailed exploratory data analysis (EDA)
- Uses an LLM to analyze titles and descriptions to discover semantic deficits and nonsensical entries that are hard to catch otherwise
We set up the code to perform most of the checks automatically. It should be easy to adapt these notebooks to other data catalogs that conform to the DCAT-AP CH standard.
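To illustrate the «catalog as dataset» idea, here is a minimal sketch that flattens a DCAT-style JSON catalog into a pandas DataFrame for EDA. The URL and the `"dataset"` key are hypothetical placeholders, not necessarily what the notebooks use:

```python
# Sketch: treat a DCAT metadata catalog as a regular dataset.
# The URL is a hypothetical placeholder; substitute your catalog's endpoint.
import pandas as pd
import requests

CATALOG_URL = "https://example.org/api/dcat/catalog.json"  # hypothetical

response = requests.get(CATALOG_URL, timeout=30)
response.raise_for_status()
datasets = response.json()["dataset"]  # assumes a DCAT-style feed with a "dataset" array

# Flatten nested metadata into one row per dataset for EDA.
df = pd.json_normalize(datasets)
print(df.shape)
print(df.isna().sum().sort_values(ascending=False).head(10))  # missing values per field
```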
The two notebooks produce the following outputs:
- an HTML report detailing all issues that were found
- an Excel file with all major issues categorized and sortable
- another Excel file with a qualitative assessment of the title and description of each dataset created by an LLM
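As an illustrative sketch of how such an Excel export can be produced with pandas (the file name, columns, and categories here are assumptions, not the notebooks' actual schema):

```python
# Sketch: write a categorized, sortable issue table to _results/.
# Columns, categories, and the file name are illustrative assumptions.
from pathlib import Path

import pandas as pd

Path("_results").mkdir(exist_ok=True)

issues = pd.DataFrame(
    [
        {"dataset": "example-dataset", "category": "hidden_null", "field": "description"},
        {"dataset": "example-dataset", "category": "date_issue", "field": "startDate"},
    ]
)

# Writing .xlsx requires an Excel engine such as openpyxl.
issues.sort_values(["category", "dataset"]).to_excel("_results/issues.xlsx", index=False)
```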
Important
Use of the LLM-based analysis code results in data being sent to third-party model providers through OpenRouter, which brokers requests to multiple LLM services. Do not submit sensitive or confidential data.
Important
LLMs make errors. This app provides suggestions only and yields a draft analysis that you should always double-check.
The first (EDA) notebook checks for:

- Conformity to the DCAT standard
- Missing values
- Hidden nulls (e.g., "", "null", "none", "nichts"; see the sketch after this list)
- Empty lists and dictionaries
- Duplicates
- Text issues in titles and descriptions, such as unstripped text, line breaks, escape sequences, control characters, and unnecessary whitespace
- Abbreviations that might erode clarity or make search unnecessarily hard
- Titles copied verbatim to descriptions or resource descriptions, adding no new information
- Overall semantic quality of titles and descriptions (✨ powered by an LLM)
- Date issues, such as non-parsable dates and start dates that come after end dates
- Issues in individual properties
- Offline or invalid landing pages and distributions
- and many more...
These checks cover metadata at both dataset and distribution levels.
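To give a flavor of how such checks can be automated, here is a minimal sketch of a hidden-null check and a date-order check. The field names and the null-marker list are illustrative assumptions, not the exact ones used in the notebooks:

```python
# Sketch: flag hidden nulls and reversed date ranges in catalog metadata.
# Field names and the null-marker list are illustrative assumptions.
from datetime import datetime

HIDDEN_NULLS = {"", "null", "none", "nichts"}

def is_hidden_null(value) -> bool:
    """True for empty strings/collections and textual null markers."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in HIDDEN_NULLS
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False

def has_reversed_dates(dataset: dict) -> bool:
    """True if a parsable start date comes after the end date."""
    try:
        start = datetime.fromisoformat(dataset["startDate"])
        end = datetime.fromisoformat(dataset["endDate"])
    except (KeyError, ValueError):
        return False  # missing or non-parsable dates are flagged elsewhere
    return start > end

dataset = {"title": "none", "description": "", "startDate": "2024-01-01", "endDate": "2023-01-01"}
print([k for k, v in dataset.items() if is_hidden_null(v)])  # ['title', 'description']
print(has_reversed_dates(dataset))                           # True
```

Checks like these can then be mapped over every row of the flattened catalog DataFrame.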
The second notebook provides an in-depth analysis of each dataset's title and description. An ✨ LLM assesses whether the title and description clearly explain:
- what the dataset is about («Dateninhalt»),
- how the data was collected («Entstehungszusammenhang»),
- what the quality of the data is («Datenqualität»),
- what the spatial aggregation is («Räumlicher Bezug»),
- and how the data can be linked to other data («Verknüpfungsmöglichkeiten»).
Each dataset receives a score from 1 (least informative) to 5 (most informative):
- 1 point - No information about this criterion.
- 2 points - Little information, much is missing.
- 3 points - Average information, some information is available, some is missing.
- 4 points - Good information, most information is available.
- 5 points - Excellent information, everything is very clear, complete, and detailed.
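A minimal sketch of such an assessment call through OpenRouter is shown below; the model ID, prompt wording, and output format are illustrative assumptions, while the rubric criteria come from the list above:

```python
# Sketch: ask an LLM via OpenRouter to rate a title/description pair.
# Model ID and prompt wording are illustrative assumptions.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

RUBRIC = (
    "Rate the following dataset metadata from 1 (least informative) to 5 "
    "(most informative) on each criterion: Dateninhalt, "
    "Entstehungszusammenhang, Datenqualität, Räumlicher Bezug, "
    "Verknüpfungsmöglichkeiten. Answer as JSON."
)

completion = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # hypothetical choice; any OpenRouter model ID works
    messages=[
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": "Title: ...\nDescription: ..."},
    ],
)
print(completion.choices[0].message.content)
```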
Metadata is essential for data users to fully understand context, methodology, content, and quality. Creating good metadata requires time and effort, yet not all metadata meets sufficient quality standards. We observe issues in our catalog and others, such as opendata.swiss.
Swiss OGD offerings follow the DCAT-AP CH standard. While widely adopted, it can easily be «hacked»:
- Dataset entries can conform to the standard yet lack meaningful content, e.g., by entering empty strings, lists, or dictionaries for mandatory fields, or a single nonsensical element such as one character or number
- The standard can be «misused» by copying the title into the description field, adding no additional information

OGD catalogs contain many such examples, plus datasets that perfectly adhere to the standard but are completely broken (see the made-up example below).
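For illustration, here is a made-up record (not taken from any real catalog) that would pass a purely structural check while carrying no information:

```python
# Made-up example of a record that is formally complete but meaningless:
# every mandatory field is present, yet none carries information.
dataset = {
    "title": "x",               # single nonsensical character
    "description": "x",         # title copied verbatim, no added information
    "keywords": [],             # empty list still "fills" the field
    "publisher": {"name": ""},  # empty string hides a missing value
}
```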
Note
These problems are not the «fault» of DCAT. The standard is a sincere recommendation, but it cannot ensure that every entry is meaningful. This responsibility lies with us as data stewards and publishers.
Our OGD catalog lists ~1,050 datasets and opendata.swiss lists ~14,000 datasets. Manually checking each dataset for metadata quality issues is unrealistic. We address this by developing automatic procedures to programmatically check and highlight metadata issues. This project provides a template and fresh ideas to achieve this.
Laure Stadler, Chantal Amrhein, Patrick Arnecke – Statistisches Amt Zürich: Team Data
Many thanks also go to Corinna Grobe and our former colleague Adrian Rupp.
We would love to hear from you. Please share your feedback and let us know how you use the code. You can write an email or share your ideas by opening an issue or pull request.
Please note that we use Ruff for linting and code formatting with default settings.
This software (the Software) incorporates models (Models) from OpenAI and others and has been developed according to and with the intent to be used under Swiss law. Please be aware that the EU Artificial Intelligence Act (EU AI Act) may, under certain circumstances, be applicable to your use of the Software. You are solely responsible for ensuring that your use of the Software as well as of the underlying Models complies with all applicable local, national and international laws and regulations. By using this Software, you acknowledge and agree (a) that it is your responsibility to assess which laws and regulations, in particular regarding the use of AI technologies, are applicable to your intended use and to comply therewith, and (b) that you will hold us harmless from any action, claims, liability or loss in respect of your use of the Software.