A DuckDB extension for parsing and analyzing chess games in PGN format.
The core idea: load raw PGNs into DuckDB via read_pgn(), then do opening detection, deduplication, and move/position analysis directly in SQL.
- Parse PGN files with
read_pgn()(single file or glob patterns) - Lichess-style columns (Event/Site/players/elos/opening/time control/movetext)
- Movetext utilities: normalize, hash, ply count
- Position tracing: convert movetext into per-ply JSON including FEN/EPD (useful for joining to openings datasets)
-- Read a pgn file
SELECT Event, White, Black, Result, Termination, TimeControl FROM read_pgn('test/pgn_files/sample.pgn');
-- Read multiple pgn files
SELECT COUNT(*) FROM read_pgn('test/pgn_files/*.pgn');
-- Read zstd-compressed pgn files
SELECT COUNT(*) FROM read_pgn('test/pgn_files/sample.pgn.zst', compression := 'zstd');
-- How many games started with 1. e4 e5
SELECT COUNT_IF(chess_moves_subset('1. e4 e5', movetext)) FROM read_pgn('test/pgn_files/sample.pgn');
-- Removes comments/variations/NAGs and normalizes move numbers
SELECT chess_moves_normalize(movetext) FROM read_pgn('test/pgn_files/sample.pgn');
-- Zobrist hash of the final mainline position
SELECT chess_moves_hash('1. e4 e5 2. Nf3 Nc6') AS hash;
-- Ply count
SELECT chess_ply_count('1. e4 e5 2. Nf3') AS ply;
-- Normalize Time control notation
SELECT chess_timecontrol_normalize('90min./40 + 30min. + 30s./move'); -- 40/5400+30:1800+30
-- Converts FEN to EPD
SELECT chess_fen_epd('rnbq1rk1/1pp1bppp/p3pn2/8/2pP4/2N2NP1/PP2PPBP/R1BQ1RK1 w - - 0 8') AS epd;
-- Get the first 40 moves of a game as ply, san and epd
WITH g AS (
SELECT movetext
FROM read_pgn('test/pgn_files/sample.pgn')
WHERE parse_error IS NULL
LIMIT 1
)
SELECT
json_extract(m.value, '$.ply')::INT AS ply,
json_extract_string(m.value, '$.move') AS san,
json_extract_string(m.value, '$.epd') AS epd
FROM g,
json_each(CAST(chess_moves_json(g.movetext, 40) AS JSON)) m;- Rust toolchain: repo toolchain is
1.93(rust-toolchain.toml), minimum supported Rust version for this crate (MSRV) is1.89 - DuckDB
1.4.4 - Run
just install-toolsto install:- cargo-duckdb-ext-tools: A Rust-based toolkit for building and packaging DuckDB extensions without Python dependencies
- duckdb-slt: A Rust-based sqllogictest runner for DuckDB.
# format + lint
just check
# build
just debug # (debug)
just release # (release)
# tests (unit + SQLLogicTest)
just test # (debug)
just test-release # (release)
# main dev loop
just dev # (debug)The just recipes call Rust-first commands (cargo duckdb-ext-build, cargo test, cargo fmt, cargo clippy).
extension-ci-tools/ is kept for DuckDB community extension template compatibility. Local Rust-first targets above do not require Python/venv, but template/CI compatibility targets may.
Local builds are unsigned; start DuckDB with -unsigned:
duckdb -unsignedThen:
LOAD './target/release/chess.duckdb_extension';- Make changes to the source code
- Run the main workflow:
just dev - Run release checks:
just test-release - Test manually with DuckDB CLI
LOAD './target/release/chess.duckdb_extension';
SELECT Event, White, Black, Result, Opening
FROM read_pgn('games.pgn')
WHERE parse_error IS NULL
LIMIT 10;
-- Glob patterns work too
SELECT count(*)
FROM read_pgn('lichess_db_2024-*.pgn');Notes:
- Glob expansion currently triggers when
path_patterncontains*or?. movetextis mainline only; variations are skipped,{ ... }comments are preserved.- Terminal result markers are not appended to
movetext; use theResultcolumn for game result metadata. - If a game fails to parse, you still get a row with
parse_errorset. - When reading multiple files (via glob), unreadable files are skipped with a warning; a single explicit file path fails hard.
SELECT chess_moves_normalize('1. e4! {comment} e5?? $1 2. Nf3') AS clean;
-- clean = '1. e4 e5 2. Nf3'
SELECT chess_moves_hash('1. e4 e5 2. Nf3 Nc6') AS h; -- UBIGINT
SELECT chess_ply_count('1. e4 e5 2. Nf3') AS ply_count; -- BIGINTSELECT chess_timecontrol_normalize('90min./40 + 30min. + 30s./move');
-- 40/5400+30:1800+30
SELECT chess_timecontrol_normalize('15 + 10');
-- 900+10 -- Infers 15 is probably minutes and not seconds. To see if it was inferred, you can use chess_timecontrol_json
SELECT chess_timecontrol_json('15 + 10');
-- {"raw":"15 + 10","normalized":"900+10","mode":"normal","periods":[{"base":900,"increment":10}],"warnings":["normalized_operator_whitespace","interpreted_small_base_as_minutes"],"inferred":true}
SELECT chess_timecontrol_category('2+12');
-- rapid
SELECT chess_timecontrol_category('29''''');
-- ultra-bullet
SELECT chess_timecontrol_category('29+0');
-- classical (small-base shorthand is interpreted as minutes)Use the pattern that matches your data quality and workload.
| Data shape | Recommended pattern | Why |
|---|---|---|
| Raw/noisy movetext (comments, NAGs, variations, mixed formatting) | chess_moves_subset(short, long) |
Parser-backed subset semantics |
| Canonical/materialized movetext (already normalized) | starts_with(long_norm, short_norm) |
Faster repeated prefix filtering |
Raw/noisy workflow:
SELECT *
FROM read_pgn('games/*.pgn')
WHERE chess_moves_subset('1. e4 e5 2. Nf3', movetext);Canonical/materialized workflow:
CREATE OR REPLACE TABLE games_norm AS
SELECT
*,
chess_moves_normalize(movetext) AS movetext_norm
FROM read_pgn('games/*.pgn')
WHERE parse_error IS NULL;
WITH needle AS (
SELECT chess_moves_normalize('1. e4 e5 2. Nf3') AS short_norm
)
SELECT g.*
FROM games_norm g, needle n
WHERE starts_with(g.movetext_norm, n.short_norm);contains is not subset-prefix semantics:
SELECT
contains('1. d4 d5 2. e4 e5', 'e4 e5') AS contains_match,
chess_moves_subset('e4 e5', '1. d4 d5 2. e4 e5') AS subset_match;
-- contains_match = true, subset_match = falseCaveats:
read_pgn(...).movetextdoes not include a terminal result marker (1-0,0-1,1/2-1/2,*).starts_withis only safe when both sides are canonicalized with the same normalization pipeline.- If input quality is uncertain, prefer
chess_moves_subset.
chess_moves_json() returns a JSON string (cast to JSON if you want to use JSON functions).
WITH g AS (
SELECT movetext
FROM read_pgn('test/pgn_files/sample.pgn')
WHERE parse_error IS NULL
LIMIT 1
)
SELECT
json_extract(m.value, '$.ply')::INT AS ply,
json_extract_string(m.value, '$.move') AS san,
json_extract_string(m.value, '$.fen') AS fen,
json_extract_string(m.value, '$.epd') AS epd
FROM g,
json_each(CAST(chess_moves_json(g.movetext, 40) AS JSON)) m;Assumes an openings table with columns epd, eco, name, and a per-opening mainline in uci.
If your openings dataset is a Parquet file, a typical setup looks like:
CREATE OR REPLACE TABLE openings AS
SELECT * FROM read_parquet('openings.parquet');WITH params AS (
SELECT max(array_length(string_split(uci, ' '))) AS max_opening_ply
FROM openings
),
games AS (
SELECT row_number() OVER () AS game_id, movetext
FROM read_pgn('games.pgn')
WHERE parse_error IS NULL
),
pos AS (
SELECT
g.game_id,
json_extract(m.value, '$.ply')::INT AS ply,
trim(json_extract_string(m.value, '$.epd')) AS epd
FROM games g,
params p,
json_each(CAST(chess_moves_json(g.movetext, p.max_opening_ply) AS JSON)) m
),
matches AS (
SELECT p.game_id, p.ply, o.eco, o.name
FROM pos p
JOIN openings o ON trim(o.epd) = p.epd
)
SELECT game_id, eco, name
FROM (
SELECT *, row_number() OVER (PARTITION BY game_id ORDER BY ply DESC) AS rn
FROM matches
)
WHERE rn = 1;Reads chess games from one or more PGN files.
path_pattern can be a single path or a glob pattern (e.g. lichess_db_2024-*.pgn).
compression is optional. Supported values:
NULLor omitted: plain PGN input (default)'zstd': zstd-compressed PGN input streams (e.g..pgn.zst)
Returned columns:
| Column | Type | Notes |
|---|---|---|
| Event | VARCHAR | PGN tag |
| Site | VARCHAR | PGN tag |
| White | VARCHAR | PGN tag |
| Black | VARCHAR | PGN tag |
| Result | VARCHAR | PGN tag |
| WhiteTitle | VARCHAR | PGN tag (nullable) |
| BlackTitle | VARCHAR | PGN tag (nullable) |
| WhiteElo | UINTEGER | PGN tag (nullable) |
| BlackElo | UINTEGER | PGN tag (nullable) |
| UTCDate | DATE | PGN tag (nullable) |
| UTCTime | TIMETZ | PGN tag (nullable, treated as UTC) |
| ECO | VARCHAR | PGN tag |
| Opening | VARCHAR | PGN tag |
| Termination | VARCHAR | PGN tag |
| TimeControl | VARCHAR | PGN tag |
| movetext | VARCHAR | Mainline only, includes {...} comments, no terminal result marker |
| parse_error | VARCHAR | NULL on success; error message on failure |
| Source | VARCHAR | PGN tag (nullable) |
| Function | Returns | Notes |
|---|---|---|
chess_moves_normalize(movetext) |
VARCHAR | Removes comments/variations/NAGs and normalizes move numbers |
chess_moves_hash(movetext) |
UBIGINT | Zobrist hash of the final mainline position (comments/variations/NAGs ignored); NULL for empty/unparseable input |
chess_ply_count(movetext) |
BIGINT | Ply count (NULL-safe macro) |
chess_moves_json(movetext, max_ply := NULL) |
VARCHAR | JSON string of {ply, move, fen, epd} (NULL-safe macro) |
chess_fen_epd(fen) |
VARCHAR | Converts FEN to EPD join key (board/side/castling/ep) |
chess_moves_subset(short_movetext, long_movetext) |
BOOLEAN | True if short mainline is a prefix of long mainline |
chess_timecontrol_normalize(timecontrol) |
VARCHAR | Normalizes PGN TimeControl to canonical seconds; NULL on failure |
chess_timecontrol_json(timecontrol) |
VARCHAR | JSON with raw, normalized, mode, periods, warnings, inferred |
chess_timecontrol_category(timecontrol) |
VARCHAR | Lichess-style category from estimated seconds (base + 40*increment); NULL for unsupported/unparseable inputs |
MIT. See LICENSE.
- Built on DuckDB's extension framework
- Uses the modern duckdb-ext-rs-template by @redraiment
- PGN parsing by pgn-reader
- Chess logic by shakmaty