
🦄 OGD Auto AI Analyzer

Almost automatically analyze the quality of a DCAT metadata catalog with a little help from ✨ AI.


Usage

# Clone the repository
git clone https://github.com/statistikZH/ogd_ai-analyzer.git
cd ogd_ai-analyzer

# Install uv and dependencies
pip3 install uv
uv venv
source .venv/bin/activate
uv sync
  • You need an OpenRouter API key to use the LLM-based assessments. Create a .env file and add your API key like so (a sketch of how such a key is typically used follows after this list):
    OPENROUTER_API_KEY=sk-or-v1-...
  • Open the notebooks in your favorite IDE and run the code.
  • Check the results (in the _results folder) and fix any issues in your metadata.
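
As a minimal illustration of how the key comes into play (not necessarily how the notebooks wire it up), the following sketch loads the .env file with python-dotenv and sends a request through OpenRouter's OpenAI-compatible endpoint; the model name is a placeholder:

# Minimal sketch: load the key from .env and call a model via OpenRouter.
# Assumes the python-dotenv and openai packages; the model name is a placeholder.
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENROUTER_API_KEY from the .env file

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter is OpenAI-compatible
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # placeholder model choice
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)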

Note

The notebooks are set up as Quarto files. You don't need to use Quarto. You can simply run the notebooks as is and look at the results. However, we encourage you to try it out with Quarto. The results will be much more shareable, e.g., to a non-technical audience that doesn't want or need to see code. Simply install Quarto, add an extension to your IDE, and convert the notebooks to HTML or PDF files. You can also render the EDA notebook directly from the command line:

quarto render 01_mdv_quality_checks.ipynb

What does the code do?

We perform a thorough metadata analysis and quality check using the OGD metadata catalog of the Canton of Zurich as an example.

This project:

  • Treats the metadata catalog as a regular dataset and performs a structured, detailed exploratory data analysis (EDA)
  • Uses an LLM to analyze titles and descriptions to discover semantic deficits and nonsensical entries that are hard to catch otherwise

We set up the code to perform most of the checks automatically. It should be easy to adapt these notebooks to other data catalogs that conform to the DCAT-AP CH standard.
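
As a rough sketch of what such an adaptation could start from, the snippet below loads a DCAT catalog from an RDF export with rdflib and collects title and description per dataset. The catalog URL is a placeholder, and the notebooks themselves may ingest the catalog differently:

# Rough sketch: load a DCAT catalog from an RDF export and collect
# title/description per dataset. The URL is a placeholder.
from rdflib import Graph
from rdflib.namespace import RDF, DCAT, DCTERMS

g = Graph()
g.parse("https://example.org/catalog.rdf", format="xml")  # placeholder URL

records = [
    {
        "dataset": str(ds),
        "title": str(g.value(ds, DCTERMS.title) or ""),
        "description": str(g.value(ds, DCTERMS.description) or ""),
    }
    for ds in g.subjects(RDF.type, DCAT.Dataset)
]
print(f"Found {len(records)} datasets")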

The two notebooks produce the following outputs:

  • an HTML report detailing all issues that were found
  • an Excel file with all major issues categorized and sortable
  • another Excel file with a qualitative assessment of the title and description of each dataset created by an LLM

Important

Use of the LLM-based analysis code results in data being sent to third-party model providers through OpenRouter, which brokers requests to multiple LLM services. Do not submit sensitive or confidential data.

Important

LLMs make errors. This app provides suggestions only and yields a draft analysis that you should always double-check.

What exactly do we check?

  • Conformity to the DCAT standard
  • Missing values
  • Hidden nulls (e.g., "", "null", "none", "nichts")
  • Empty lists and dictionaries
  • Duplicates
  • Text issues in titles and descriptions, such as unstripped text, line breaks, escape sequences, control characters, and unnecessary whitespace
  • Abbreviations that might erode clarity or make search unnecessarily hard
  • Titles copied verbatim to descriptions or resource descriptions, adding no new information
  • Overall semantic quality of titles and descriptions (✨ powered by an LLM)
  • Date issues, such as non-parsable dates and start dates that come after end dates
  • Issues in individual properties
  • Offline or invalid landing pages and distributions
  • and many more...

These checks cover metadata at both dataset and distribution levels; a simplified sketch of two of them follows below.
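
For instance, the hidden-null and unstripped-text checks can be expressed in a few lines of pandas. Column names and sample values here are illustrative, not the notebooks' exact implementation:

# Illustrative sketch of two checks: hidden nulls and unstripped text.
# Column names are made up; the notebooks' implementation may differ.
import pandas as pd

HIDDEN_NULLS = {"", "null", "none", "nichts"}

df = pd.DataFrame({
    "title": ["Luftqualität 2023", "none", "  Bevölkerung "],
    "description": ["Messwerte der Stationen ...", "", "Bevölkerung"],
})

def hidden_null(series: pd.Series) -> pd.Series:
    """Flag values that are present but carry no information."""
    return series.astype(str).str.strip().str.lower().isin(HIDDEN_NULLS)

def unstripped(series: pd.Series) -> pd.Series:
    """Flag values with leading or trailing whitespace."""
    s = series.astype(str)
    return s != s.str.strip()

report = pd.DataFrame({
    "hidden_null_title": hidden_null(df["title"]),
    "hidden_null_description": hidden_null(df["description"]),
    "unstripped_title": unstripped(df["title"]),
})
print(report)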

The second notebook provides an in-depth analysis of each dataset's title and description. An ✨ LLM assesses whether the title and description clearly explain:

  • what the dataset is about («Dateninhalt»),
  • how the data was collected («Entstehungszusammenhang»),
  • how the data quality is («Datenqualität»),
  • what the spatial aggregation is («Räumlicher Bezug»),
  • and how the data can be linked to other data («Verknüpfungsmöglichkeiten»).

Each dataset receives a score from 1 (least informative) to 5 (most informative):

  • 1 point - No information about this criterion.
  • 2 points - Little information, much is missing.
  • 3 points - Average information, some information is available, some is missing.
  • 4 points - Good information, most information is available.
  • 5 points - Excellent information, everything is very clear, complete, and detailed.
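
One way to make such a rubric machine-checkable is to validate the LLM's answer against a schema. The sketch below uses pydantic (v2) with illustrative field names; it is not the exact schema used in the notebooks:

# Sketch: validate an LLM answer against the five criteria and the 1-5 scale.
# Field names are illustrative, not the notebooks' exact schema. Pydantic v2.
from pydantic import BaseModel, Field

class MetadataAssessment(BaseModel):
    dateninhalt: int = Field(..., ge=1, le=5)                  # what the dataset is about
    entstehungszusammenhang: int = Field(..., ge=1, le=5)      # how the data was collected
    datenqualitaet: int = Field(..., ge=1, le=5)               # data quality information
    raeumlicher_bezug: int = Field(..., ge=1, le=5)            # spatial aggregation
    verknuepfungsmoeglichkeiten: int = Field(..., ge=1, le=5)  # linkability
    comment: str                                               # short justification

raw = (
    '{"dateninhalt": 4, "entstehungszusammenhang": 2, "datenqualitaet": 1,'
    ' "raeumlicher_bezug": 3, "verknuepfungsmoeglichkeiten": 1,'
    ' "comment": "Spatial scope is clear, provenance is missing."}'
)
assessment = MetadataAssessment.model_validate_json(raw)
print(assessment.dateninhalt)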

Background: Why check metadata?

Metadata is essential for data users to fully understand context, methodology, content, and quality. Creating good metadata requires time and effort, yet not all metadata meets sufficient quality standards. We observe issues in our catalog and others, such as opendata.swiss.

Swiss OGD offerings follow the DCAT-AP CH standard. While widely adopted, it can be easily «hacked».

  • Dataset entries can conform to the standard yet lack meaningful content, e.g., by filling mandatory fields with empty strings, lists, or dictionaries, or with a single nonsensical element such as one character or number
  • The standard can be «misused» by copying the title into the description field, adding no additional information

OGD catalogues contain many such examples, plus datasets that perfectly adhere to the standard but are completely broken; a contrived illustration follows below.
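
To make this concrete, here is a contrived entry that a purely structural check would accept; the keys are illustrative, not actual DCAT property names:

# Contrived example: an entry that passes a structural check but carries
# no information. Keys are illustrative, not actual DCAT property names.
broken_but_valid = {
    "title": "x",               # a single meaningless character
    "description": "x",         # title copied verbatim into the description
    "keywords": [],             # empty list for a mandatory field
    "publisher": {"name": ""},  # hidden null
}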

Note

These problems are not the «fault» of DCAT. The standard is a sincere recommendation, but it cannot ensure that every entry is meaningful. This responsibility lies with us as data stewards and publishers.

How to fix this?

Our OGD catalog lists ~1,050 datasets and opendata.swiss lists ~14,000 datasets. Manually checking each dataset for metadata quality issues is unrealistic. We address this by developing automatic procedures to programmatically check and highlight metadata issues. This project provides a template and fresh ideas to achieve this.

Project Team

Laure Stadler, Chantal Amrhein, Patrick Arnecke
Statistisches Amt Zürich: Team Data

Many thanks also go to Corinna Grobe and our former colleague Adrian Rupp.

Feedback and contributing

We would love to hear from you. Please share your feedback and let us know how you use the code. You can write us an email or share your ideas by opening an issue or pull request.

Please note that we use Ruff for linting and code formatting with default settings.

Disclaimer

This software (the Software) incorporates models (Models) from OpenAI and others and has been developed according to and with the intent to be used under Swiss law. Please be aware that the EU Artificial Intelligence Act (EU AI Act) may, under certain circumstances, be applicable to your use of the Software. You are solely responsible for ensuring that your use of the Software as well as of the underlying Models complies with all applicable local, national and international laws and regulations. By using this Software, you acknowledge and agree (a) that it is your responsibility to assess which laws and regulations, in particular regarding the use of AI technologies, are applicable to your intended use and to comply therewith, and (b) that you will hold us harmless from any action, claims, liability or loss in respect of your use of the Software.
