JeffCarpenter · JeffCarpenter · Jun 28, 2025 · Sep 29, 2025
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -18,7 +18,17 @@ jobs:
         cache-dependency-path: setup.py
     - name: Install dependencies
       run: |
-        pip install -e '.[test]'
+        pip install -e '.[test,docs]'
+        pip install mypy ruff
     - name: Run tests
       run: |
-        pytest
+        pytest --cov=github_to_sqlite --cov-branch -q
+    - name: Run ruff
+      run: |
+        ruff check . --output-format=github --exit-zero
+    - name: Run mypy
+      run: |
+        mypy github_to_sqlite --no-error-summary
+    - name: Build docs
+      run: |
+        sphinx-build -b html docs docs/_build -W
diff --git a/PLAN.md b/PLAN.md
@@ -1,5 +1,168 @@
 # Embeddings feature plan
 
+This document decomposes the work required to generate sentence-transformer embeddings for starred repositories. The work is split into three phases so that core functionality lands first, then documentation tooling, followed by publishing the new docs.
+
+## Phase 1: Generate and store embeddings
+
+This phase introduces embeddings for starred repositories.
+
+### Dependencies
+- [x] **Add runtime dependencies**
+  - [x] Install `sentence-transformers` for embedding inference.
+  - [x] Install `sqlite-vec` to store and query embedding vectors in SQLite.
+  - [x] Install `semantic-chunkers` from GitHub to chunk README text using
+    `semantic_chunkers.chunkers.StatisticalChunker`.
+  - [x] Install `fd` to locate build definition files across the repository tree.
+    `find_build_files()` prefers `fd` but falls back to `find` or `os.walk` if
+    needed.
+- [x] **Add development dependencies**
+  - [x] Include `pytest-cov` for coverage reports.
+  - [x] Update `setup.py` or `pyproject.toml` accordingly.
+
+### Database changes
+- [x] **Create `repo_embeddings` table**
+  - [x] Columns: `repo_id` (FK to `repos`), `title_embedding`, `description_embedding`, `readme_embedding`.
+  - [x] Store embeddings using `sqlite-vec` vec0 virtual tables for efficient vector search.
+  - [x] Add indexes on `repo_id` for fast lookup.
+- [x] **Create `readme_chunk_embeddings` table**
+  - [x] Columns: `repo_id` (FK to `repos`), `chunk_index`, `chunk_text`, `embedding`.
+  - [x] Use `sqlite-vec` for the `embedding` column to enable similarity search over
+    individual README chunks.
+  - [x] Add a composite index on `repo_id` and `chunk_index`.
+- [x] **Create `repo_build_files` table**
+  - [x] Columns: `repo_id` (FK to `repos`), `file_path`, `metadata` (JSON).
+  - [x] Store one row per build definition (e.g. `pyproject.toml`, `package.json`).
+  - [x] The `metadata` column captures the entire parsed contents of the file so that
+    fields such as package name or author can be queried later.
+- [x] **Create `repo_metadata` table**
+  - [x] Columns: `repo_id` (FK to `repos`), `language`, `directory_tree`.
+  - [x] Capture the primary programming language and a serialized directory structure
+    for quick reference.
+- [x] **Migration script**
+  - [x] Provide a SQL script or CLI command that creates the tables if they do not exist.
+  - [x] Document migration process in README.
+
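A minimal sketch of what the migration could look like, assuming plain `BLOB` columns rather than `sqlite-vec` vec0 virtual tables. The table and column names come from the plan above; the column types, the primary keys and the `ensure_embedding_tables` name are assumptions made for illustration.

```python
import sqlite3

# Hypothetical DDL. Names follow the plan; types and keys are assumptions,
# and embeddings are stored as plain BLOBs instead of vec0 virtual tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repo_embeddings (
    repo_id INTEGER PRIMARY KEY REFERENCES repos(id),
    title_embedding BLOB,
    description_embedding BLOB,
    readme_embedding BLOB
);
CREATE TABLE IF NOT EXISTS readme_chunk_embeddings (
    repo_id INTEGER REFERENCES repos(id),
    chunk_index INTEGER,
    chunk_text TEXT,
    embedding BLOB,
    PRIMARY KEY (repo_id, chunk_index)
);
CREATE TABLE IF NOT EXISTS repo_build_files (
    repo_id INTEGER REFERENCES repos(id),
    file_path TEXT,
    metadata TEXT
);
CREATE TABLE IF NOT EXISTS repo_metadata (
    repo_id INTEGER PRIMARY KEY REFERENCES repos(id),
    language TEXT,
    directory_tree TEXT
);
"""


def ensure_embedding_tables(conn):
    """Create the four tables if they are missing; safe to call repeatedly."""
    conn.executescript(SCHEMA)
```

Because every statement uses `IF NOT EXISTS`, running the function twice against the same database leaves existing rows untouched.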
+### Embedding generation
+- [x] **Model loading**
+  - [x] Default to `huggingface.co/Alibaba-NLP/gte-modernbert-base`.
+  - [x] Allow overriding the model path via CLI option or environment variable.
+- [x] **Data collection**
+  - [x] Fetch starred repositories from GitHub using existing API utilities.
+  - [x] Retrieve README HTML or markdown for each repo.
+  - [x] Locate common build files (`pyproject.toml`, `package.json`,
+    `Cargo.toml`, `Gemfile`) using `fd` when available, otherwise `find` or
+    `os.walk`.
+  - [x] Parse each file and store its entire contents as JSON in the
+    `repo_build_files.metadata` column. Package name and author can then be
+    derived from this JSON as needed.
+  - [x] Record the repository's primary programming language and generate a serialized
+    directory tree for storage in `repo_metadata`.
+- [x] **Chunking**
+  - [x] Use `semantic_chunkers.chunkers.StatisticalChunker` to split README text
+    into semantically meaningful chunks. See `docs/00-chunkers-intro.ipynb` in
+    the `semantic-chunkers` repository for usage examples.
+  - [x] If that library is not available at runtime, fall back to splitting on
+    blank lines to ensure tests run without optional dependencies.
+- [x] **Vector inference**
+  - [x] Run the model on the repository title, description and each README chunk.
+  - [x] Batch requests when possible to speed up inference.
+- [x] **Storage**
+  - [x] Save repository-level vectors to `repo_embeddings`.
+  - [x] Save each chunk's embedding to `readme_chunk_embeddings` along with the
+    chunk text and index.
+  - [x] Skip entries that already exist unless `--force` is supplied.
+
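The blank-line fallback mentioned under **Chunking** can be sketched as follows. The function name `split_on_blank_lines` is hypothetical, and the sketch deliberately leaves out the `StatisticalChunker` path because its configuration is not described here.

```python
import re


def split_on_blank_lines(text):
    """Fallback chunker: split on runs of blank lines and drop empty chunks."""
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]
```

A caller would use this only when importing `semantic_chunkers` fails, so tests can run without the optional dependency.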
+### CLI integration
+- [x] **New command** `starred-embeddings`
+  - [x] Accept database path and optional model path.
+  - [x] Iterate through all starred repos and compute embeddings.
+  - [x] Chunk each README using `StatisticalChunker` and store chunk embeddings.
+  - [x] Collect build metadata using `find_build_files()` (using `fd`, `find` or
+    `os.walk` as available) and store the entire parsed JSON in the
+    `repo_build_files.metadata` column.
+  - [x] Support `--force` and `--verbose` flags.
+- [x] **Error handling**
+  - [x] Handle missing READMEs gracefully.
+  - [x] Retry transient network failures.
+
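One way to retry transient network failures is a small wrapper with exponential backoff. This is a sketch under stated assumptions: `with_retries`, its defaults and the exception types are hypothetical, and a real caller would pass whichever exception classes its HTTP client raises.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps unit tests fast, since a test can record the delays instead of waiting.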
+### Testing
+- [x] **Unit tests**
+  - [x] Mock GitHub API calls and README fetches.
+  - [x] Verify embeddings are generated and stored correctly, including per-chunk
+    embeddings.
+  - [x] Ensure build metadata is parsed and stored as JSON in `repo_build_files`.
+- [x] **Coverage**
+  - [x] Run `pytest --cov --cov-branch` in CI to ensure branch coverage does not regress.
+- [x] **Integration tests**
+  - [x] Simulate GitHub API responses with `requests_mock` and run the
+    `starred-embeddings` command end-to-end.
+  - [x] Confirm embeddings, chunked README data and build metadata are
+    stored in the database.
+
+### Documentation
+- [x] **README updates**
+  - [x] Describe the new command and its options.
+  - [x] Mention default model and how to override it.
+  - [x] Document how README files are chunked using `semantic-chunkers` before
+    embedding.
+  - [x] Explain how build files are detected using `find_build_files()`
+    (preferring `fd`) and stored for analysis.
+- [x] **Changelog entry**
+  - [x] Summarize the feature and dependencies.
+
+## Phase 2: Documentation tooling
+
+- [x] **Introduce RST and Sphinx**
+  - [x] Add `sphinx` and `sphinx-rtd-theme` to development dependencies.
+  - [x] Configure a `docs/` directory with Sphinx `conf.py` and initial structure.
+- [x] **Convert existing documentation**
+  - [x] Migrate `README.md` or relevant guides into RST as needed.
+  - [x] Ensure the embeddings feature is documented in the new docs site.
+- [x] **Automation**
+  - [x] Update CI to build documentation and fail on warnings.
+
+## Phase 3: Publish documentation
+
+- [ ] **Deployment**
+  - [ ] Publish the documentation using GitHub Pages or another hosting service.
+  - [ ] Automate deployment on release so new docs are available immediately.
+
+## Next task: publish documentation site
+
+With the documentation building in CI, the next step is to publish it so users
+can browse the docs online.
+
+Steps:
+
+- [ ] Set up a GitHub Pages workflow that uploads `docs/_build`
+  from the main branch.
+- [ ] Trigger the deployment after tests pass on `main`.
+
+Completed build steps:
+
+- [x] Install documentation dependencies in the CI environment.
+- [x] Run `sphinx-build -b html docs docs/_build` during CI.
+- [x] Treat warnings as errors so the build fails on broken docs.
+
+- [x] Add a `starred-embeddings` Click command in `cli.py`.
+  - [x] Accept a database path argument.
+  - [x] Accept `--model` to override the default model.
+  - [x] Support `--force` and `--verbose` flags.
+- [x] Load the sentence-transformers model using the configured name.
+- [x] Iterate through starred repositories using existing API helpers.
+  - [x] Save repository metadata to the database.
+  - [x] Fetch README content for each repository.
+  - [x] Use `StatisticalChunker` to split README text.
+  - [x] Run embeddings for titles, descriptions and README chunks.
+  - [x] Save vectors to `repo_embeddings` and `readme_chunk_embeddings`.
+  - [x] Extract build files using `find_build_files()` and store metadata in
+    `repo_build_files`.
+  - [x] Capture the primary language and directory tree in `repo_metadata`.
+- [x] Write unit tests for the new command using mocks to avoid network calls.
+  - [x] Ensure coverage passes with `pytest --cov --cov-branch`.
+  - [x] Add tests for utility helpers like `vector_to_blob`, `parse_build_file`,
+    `directory_tree` and `_maybe_load_sqlite_vec`.
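For reference, a helper such as `vector_to_blob` could pack an embedding as consecutive float32 values, which is a compact binary form suitable for a BLOB column. The body below is an assumption about how such a helper might work, not the implementation in this change, and `blob_to_vector` is a hypothetical inverse added so the round trip can be tested.

```python
import struct


def vector_to_blob(vector):
    """Pack a sequence of floats as little-endian float32 bytes (assumed layout)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def blob_to_vector(blob):
    """Hypothetical inverse: unpack float32 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))
```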
 This document decomposes the work required to generate sentence-transformer embeddings for starred repositories. Each bullet point expands into further tasks until reaching granular actionable steps.
 
 ## 1. Dependencies
@@ -57,3 +220,4 @@ This document decomposes the work required to generate sentence-transformer embe
 - **Changelog entry**
   - Summarize the feature and dependencies.
 
+
diff --git a/README.md b/README.md
@@ -27,6 +27,8 @@ Save data from GitHub to a SQLite database.
 - [Scraping dependents for a repository](#scraping-dependents-for-a-repository)
 - [Fetching emojis](#fetching-emojis)
 - [Making authenticated API calls](#making-authenticated-api-calls)
+- [Running migrations](#running-migrations)
+- [Generating embeddings for starred repositories](#generating-embeddings-for-starred-repositories)
 
 <!-- tocstop -->
 
@@ -258,3 +260,38 @@ Many GitHub APIs are [paginated using the HTTP Link header](https://docs.github.
 You can output newline-delimited JSON for each item using `--nl`. This can be useful for streaming items into another tool.
 
     $ github-to-sqlite get /users/simonw/repos --nl
+
+## Running migrations
+
+Run the `migrate` command to create any optional tables and indexes:
+
+    $ github-to-sqlite migrate github.db
+
+The command ensures embedding tables exist and sets up FTS, foreign keys and
+views using the same logic as the main CLI commands. It will create
+`repo_embeddings`, `readme_chunk_embeddings`, `repo_build_files`, and
+`repo_metadata` tables, using `sqlite-vec` if available.
+
+## Build file detection
+
+Some commands extract metadata from standard build files such as
+`pyproject.toml` or `package.json`. The `find_build_files()` helper prefers the
+[`fd`](https://github.com/sharkdp/fd) tool if available, falling back to the
+`find` utility or a Python implementation.
+
+## Generating embeddings for starred repositories
+
+Use the `starred-embeddings` command to compute embeddings for repositories you
+have starred. The command loads the sentence-transformers model configured in
+`config.default_model` (currently `Alibaba-NLP/gte-modernbert-base`) unless you
+specify `--model`. You can also set the `GITHUB_TO_SQLITE_MODEL` environment
+variable to override the default.
+
+```
+$ github-to-sqlite starred-embeddings github.db --model my/custom-model
+```
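A sketch of how the model name might be resolved. The environment variable and default come from the text above; the `resolve_model` function and the precedence (`--model` first, then the environment variable, then the default) are assumptions, since the README does not say which wins when both are set.

```python
import os

DEFAULT_MODEL = "Alibaba-NLP/gte-modernbert-base"
ENV_VAR = "GITHUB_TO_SQLITE_MODEL"


def resolve_model(cli_model=None, environ=None):
    """Pick the model name: CLI option, then environment variable, then default."""
    environ = os.environ if environ is None else environ
    return cli_model or environ.get(ENV_VAR) or DEFAULT_MODEL
```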
+
+Embeddings for repository titles, descriptions and README chunks are stored in
+`repo_embeddings` and `readme_chunk_embeddings`. Build files discovered using
+`find_build_files()` are parsed and saved to `repo_build_files`, while basic
+language information and a directory listing are recorded in `repo_metadata`.
+
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,12 @@
+import os
+import sys
+
+sys.path.insert(0, os.path.abspath('..'))
+
+project = 'github-to-sqlite'
+author = 'Simon Willison'
+release = '2.9'
+
+extensions = ['sphinx.ext.autodoc']
+
+html_theme = 'sphinx_rtd_theme'
diff --git a/docs/embeddings.rst b/docs/embeddings.rst
@@ -0,0 +1,10 @@
+Generating embeddings
+=====================
+
+The ``starred-embeddings`` command computes sentence-transformer embeddings for repositories you have starred. It loads the model configured in ``config.default_model`` (``Alibaba-NLP/gte-modernbert-base`` by default) unless you specify ``--model`` or set the ``GITHUB_TO_SQLITE_MODEL`` environment variable.
+
+.. code-block:: console
+
+    $ github-to-sqlite starred-embeddings github.db --model my/custom-model
+
+The command stores repository-level vectors in ``repo_embeddings`` and README chunk vectors in ``readme_chunk_embeddings``. Build files discovered via ``find_build_files()`` are parsed and saved to ``repo_build_files``. Basic language information and the directory listing are recorded in ``repo_metadata``.
diff --git a/docs/index.rst b/docs/index.rst
@@ -0,0 +1,9 @@
+Welcome to github-to-sqlite's documentation!
+=============================================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   embeddings
+   migrations
diff --git a/docs/migrations.rst b/docs/migrations.rst
@@ -0,0 +1,15 @@
+Migrations and build files
+==========================
+
+Run the ``migrate`` command to create any optional tables and indexes used by the embeddings feature:
+
+.. code-block:: console
+
+    $ github-to-sqlite migrate github.db
+
+This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The embedding tables use the ``sqlite-vec`` extension when it is available.
+
+Build file detection
+--------------------
+
+Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.