14 changes: 12 additions & 2 deletions .github/workflows/test.yml
@@ -18,7 +18,17 @@ jobs:
         cache-dependency-path: setup.py
     - name: Install dependencies
       run: |
-        pip install -e '.[test]'
+        pip install -e '.[test,docs]'
+        pip install mypy ruff
     - name: Run tests
       run: |
-        pytest
+        pytest --cov=github_to_sqlite --cov-branch -q
+    - name: Run ruff
+      run: |
+        ruff check . --format=github --exit-zero
+    - name: Run mypy
+      run: |
+        mypy github_to_sqlite --no-error-summary
+    - name: Build docs
+      run: |
+        sphinx-build -b html docs docs/_build -W
164 changes: 164 additions & 0 deletions PLAN.md
@@ -1,5 +1,168 @@
# Embeddings feature plan

This document decomposes the work required to generate sentence-transformer embeddings for starred repositories. The work is split into three phases so that core functionality lands first, then documentation tooling, followed by publishing the new docs.

## Phase 1: Generate and store embeddings

This phase introduces embeddings for starred repositories.

### Dependencies
- [x] **Add runtime dependencies**
- [x] Install `sentence-transformers` for embedding inference.
- [x] Install `sqlite-vec` to store and query embedding vectors in SQLite.
- [x] Install `semantic-chunkers` from GitHub to chunk README text using
`semantic_chunkers.chunkers.StatisticalChunker`.
- [x] Install `fd` to locate build definition files across the repository tree.
`find_build_files()` prefers `fd` but falls back to `find` or `os.walk` if
needed.
- [x] **Add development dependencies**
- [x] Include `pytest-cov` for coverage reports.
- [x] Update `setup.py` or `pyproject.toml` accordingly.

### Database changes
- [x] **Create `repo_embeddings` table**
- [x] Columns: `repo_id` (FK to `repos`), `title_embedding`, `description_embedding`, `readme_embedding`.
- [x] Store embeddings using `sqlite-vec` vec0 virtual tables for efficient vector search.
- [x] Add indexes on `repo_id` for fast lookup.
- [x] **Create `readme_chunk_embeddings` table**
- [x] Columns: `repo_id` (FK to `repos`), `chunk_index`, `chunk_text`, `embedding`.
- [x] Use `sqlite-vec` for the `embedding` column to enable similarity search over
individual README chunks.
- [x] Add a composite index on `repo_id` and `chunk_index`.
- [x] **Create `repo_build_files` table**
- [x] Columns: `repo_id` (FK to `repos`), `file_path`, `metadata` (JSON).
- [x] Store one row per build definition (e.g. `pyproject.toml`, `package.json`).
- [x] The `metadata` column captures the entire parsed contents of the file so that
fields such as package name or author can be queried later.
- [x] **Create `repo_metadata` table**
- [x] Columns: `repo_id` (FK to `repos`), `language`, `directory_tree`.
- [x] Capture the primary programming language and a serialized directory structure
for quick reference.
- [x] **Migration script**
- [x] Provide SQL script or CLI command that creates the table if it does not exist.
- [x] Document migration process in README.
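
The table layout listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's migration code: it assumes plain SQLite tables with `BLOB` columns in place of the `sqlite-vec` vec0 virtual tables so that it runs without the extension. The helper name `ensure_embedding_tables`, the column types and the primary keys are all assumptions, and the foreign keys to `repos` are left out to keep the sketch standalone.

```python
import sqlite3


def ensure_embedding_tables(conn: sqlite3.Connection) -> None:
    """Create the four tables from the plan if they are missing (hypothetical helper)."""
    # Column types and keys are assumptions; the plan only names the columns.
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS repo_embeddings (
            repo_id INTEGER PRIMARY KEY,
            title_embedding BLOB,
            description_embedding BLOB,
            readme_embedding BLOB
        );
        CREATE TABLE IF NOT EXISTS readme_chunk_embeddings (
            repo_id INTEGER,
            chunk_index INTEGER,
            chunk_text TEXT,
            embedding BLOB,
            PRIMARY KEY (repo_id, chunk_index)
        );
        CREATE TABLE IF NOT EXISTS repo_build_files (
            repo_id INTEGER,
            file_path TEXT,
            metadata TEXT,  -- parsed file contents serialized as JSON
            PRIMARY KEY (repo_id, file_path)
        );
        CREATE TABLE IF NOT EXISTS repo_metadata (
            repo_id INTEGER PRIMARY KEY,
            language TEXT,
            directory_tree TEXT
        );
        """
    )


conn = sqlite3.connect(":memory:")
ensure_embedding_tables(conn)
ensure_embedding_tables(conn)  # IF NOT EXISTS makes a second run a no-op
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Because every statement uses `IF NOT EXISTS`, the same function can back both a migration command and a lazy check at the start of other commands.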

### Embedding generation
- [x] **Model loading**
- [x] Default to `huggingface.co/Alibaba-NLP/gte-modernbert-base`.
- [x] Allow overriding the model path via CLI option or environment variable.
- [x] **Data collection**
- [x] Fetch starred repositories from GitHub using existing API utilities.
- [x] Retrieve README HTML or markdown for each repo.
- [x] Locate common build files (`pyproject.toml`, `package.json`,
`Cargo.toml`, `Gemfile`) using `fd` when available, otherwise `find` or
`os.walk`.
- [x] Parse each file and store its entire contents as JSON in the
`repo_build_files.metadata` column. Package name and author can then be
derived from this JSON as needed.
- [x] Record the repository's primary programming language and generate a serialized
directory tree for storage in `repo_metadata`.
- [x] **Chunking**
- [x] Use `semantic_chunkers.chunkers.StatisticalChunker` to split README text
into semantically meaningful chunks. See `docs/00-chunkers-intro.ipynb` in
the `semantic-chunkers` repository for usage examples.
- [x] If that library is not available at runtime, fall back to splitting on
blank lines to ensure tests run without optional dependencies.
- [x] **Vector inference**
- [x] Run the model on the repository title, description and each README chunk.
- [x] Batch requests when possible to speed up inference.
- [x] **Storage**
- [x] Save repository-level vectors to `repo_embeddings`.
- [x] Save each chunk's embedding to `readme_chunk_embeddings` along with the
chunk text and index.
- [x] Skip entries that already exist unless `--force` is supplied.
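
The blank-line fallback mentioned under Chunking could look roughly like this. The function name `split_on_blank_lines` is hypothetical, and the `StatisticalChunker` path is deliberately not shown because its configuration is not described in this plan.

```python
import re
from typing import List


def split_on_blank_lines(text: str) -> List[str]:
    """Fallback chunker: treat each run of blank lines as a chunk boundary."""
    parts = re.split(r"\n\s*\n", text)
    # Drop empty fragments and surrounding whitespace so chunk indexes stay dense.
    return [part.strip() for part in parts if part.strip()]


chunks = split_on_blank_lines("# Title\n\nIntro text.\n\n\n## Usage\nRun it.\n")
```

Each returned string would then be stored with its position in the list as `chunk_index`.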

### CLI integration
- [x] **New command** `starred-embeddings`
- [x] Accept database path and optional model path.
- [x] Iterate through all starred repos and compute embeddings.
- [x] Chunk each README using `StatisticalChunker` and store chunk embeddings.
- [x] Collect build metadata using `find_build_files()` (using `fd`, `find` or
`os.walk` as available) and store the entire parsed JSON in the
`repo_build_files.metadata` column.
- [x] Support `--force` and `--verbose` flags.
- [x] **Error handling**
- [x] Handle missing READMEs gracefully.
- [x] Retry transient network failures.
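
One way to retry transient failures is a small wrapper with exponential backoff. This is a sketch under assumptions, not the project's implementation: the name `retry`, the default of three attempts and the exception types are illustrative, and real code would pass the exception classes raised by its HTTP client.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

Accepting `sleep` as a parameter keeps the helper easy to test without real delays.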

### Testing
- [x] **Unit tests**
- [x] Mock GitHub API calls and README fetches.
- [x] Verify embeddings are generated and stored correctly, including per-chunk
embeddings.
- [x] Ensure build metadata is parsed and stored as JSON in `repo_build_files`.
- [x] **Coverage**
- [x] Run `pytest --cov --cov-branch` in CI to ensure branch coverage does not regress.
- [x] **Integration tests**
- [x] Simulate GitHub API responses with `requests_mock` and run the
`starred-embeddings` command end-to-end.
- [x] Confirm embeddings, chunked README data and build metadata are
stored in the database.

### Documentation
- [x] **README updates**
- [x] Describe the new command and its options.
- [x] Mention default model and how to override it.
- [x] Document how README files are chunked using `semantic-chunkers` before
embedding.
- [x] Explain how build files are detected using `find_build_files()`
(preferring `fd`) and stored for analysis.
- [x] **Changelog entry**
- [x] Summarize the feature and dependencies.

## Phase 2: Documentation tooling

- [x] **Introduce RST and Sphinx**
- [x] Add `sphinx` and `sphinx-rtd-theme` to development dependencies.
- [x] Configure a `docs/` directory with Sphinx `conf.py` and initial structure.
- [x] **Convert existing documentation**
- [x] Migrate `README.md` or relevant guides into RST as needed.
- [x] Ensure the embeddings feature is documented in the new docs site.
- [x] **Automation**
- [x] Update CI to build documentation and fail on warnings.

## Phase 3: Publish documentation

- [ ] **Deployment**
- [ ] Publish the documentation using GitHub Pages or another hosting service.
- [ ] Automate deployment on release so new docs are available immediately.

## Next task: publish documentation site

With the documentation building in CI, the next step is to publish it so users
can browse the docs online.

Steps:

- [ ] Set up a GitHub Pages workflow that uploads ``docs/_build``
from the main branch.
- [ ] Trigger the deployment after tests pass on ``main``.

Completed build steps:

- [x] Install documentation dependencies in the CI environment.
- [x] Run ``sphinx-build -b html docs docs/_build`` during CI.
- [x] Treat warnings as errors so the build fails on broken docs.

- [x] Add a `starred-embeddings` Click command in `cli.py`.
- [x] Accept a database path argument.
- [x] Accept `--model` to override the default model.
- [x] Support `--force` and `--verbose` flags.
- [x] Load the sentence-transformers model using the configured name.
- [x] Iterate through starred repositories using existing API helpers.
- [x] Save repository metadata to the database.
- [x] Fetch README content for each repository.
- [x] Use `StatisticalChunker` to split README text.
- [x] Run embeddings for titles, descriptions and README chunks.
- [x] Save vectors to `repo_embeddings` and `readme_chunk_embeddings`.
- [x] Extract build files using `find_build_files()` and store metadata in
`repo_build_files`.
- [x] Capture the primary language and directory tree in `repo_metadata`.
- [x] Write unit tests for the new command using mocks to avoid network calls.
- [x] Ensure coverage passes with `pytest --cov --cov-branch`.
- [x] Add tests for utility helpers like `vector_to_blob`, `parse_build_file`,
`directory_tree` and `_maybe_load_sqlite_vec`.
=======
This document decomposes the work required to generate sentence-transformer embeddings for starred repositories. Each bullet point expands into further tasks until reaching granular actionable steps.

## 1. Dependencies
@@ -57,3 +220,4 @@ This document decomposes the work required to generate sentence-transformer embe
- **Changelog entry**
- Summarize the feature and dependencies.


37 changes: 37 additions & 0 deletions README.md
@@ -27,6 +27,8 @@ Save data from GitHub to a SQLite database.
- [Scraping dependents for a repository](#scraping-dependents-for-a-repository)
- [Fetching emojis](#fetching-emojis)
- [Making authenticated API calls](#making-authenticated-api-calls)
- [Running migrations](#running-migrations)
- [Generating embeddings for starred repositories](#generating-embeddings-for-starred-repositories)

<!-- tocstop -->

@@ -258,3 +260,38 @@ Many GitHub APIs are [paginated using the HTTP Link header](https://docs.github.
You can output newline-delimited JSON for each item using `--nl`. This can be useful for streaming items into another tool.

    $ github-to-sqlite get /users/simonw/repos --nl

## Running migrations

Run the `migrate` command to create any optional tables and indexes:

    $ github-to-sqlite migrate github.db

The command ensures embedding tables exist and sets up FTS, foreign keys and
views using the same logic as the main CLI commands. It will create
`repo_embeddings`, `readme_chunk_embeddings`, `repo_build_files`, and
`repo_metadata` tables, using `sqlite-vec` if available.

## Build file detection

Some commands extract metadata from standard build files. The `find_build_files()` helper prefers the
[`fd`](https://github.com/sharkdp/fd) tool if available, falling back to the
`find` utility or a Python implementation.

## Generating embeddings for starred repositories

Use the `starred-embeddings` command to compute embeddings for repositories you
have starred. The command loads the sentence-transformers model configured in
`config.default_model` (currently `Alibaba-NLP/gte-modernbert-base`) unless you
specify `--model`. You can also set the `GITHUB_TO_SQLITE_MODEL` environment
variable to override the default.

```
$ github-to-sqlite starred-embeddings github.db --model my/custom-model
```
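
The model selection described above could be resolved with a helper like the one below. It is a sketch: the function name `resolve_model_name` is hypothetical, and the precedence (an explicit `--model` value wins over `GITHUB_TO_SQLITE_MODEL`, which wins over the default) is an assumption, since the text does not say which takes priority when both are set.

```python
import os
from typing import Mapping, Optional

# Default named in the README text above.
DEFAULT_MODEL = "Alibaba-NLP/gte-modernbert-base"


def resolve_model_name(
    cli_model: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Pick the model: CLI value, then environment variable, then default."""
    if cli_model:
        return cli_model
    # An unset or empty variable falls through to the default.
    return environ.get("GITHUB_TO_SQLITE_MODEL") or DEFAULT_MODEL
```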

Embeddings for repository titles, descriptions and README chunks are stored in
`repo_embeddings` and `readme_chunk_embeddings`. Build files discovered using
`find_build_files()` are parsed and saved to `repo_build_files`, while basic
language information and a directory listing are recorded in `repo_metadata`.
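
Storing a vector in a SQLite column requires turning it into bytes first. The sketch below assumes embeddings are packed as little-endian float32 values, a compact layout commonly used for vector BLOBs. The function names are hypothetical and are not taken from the project's own helpers.

```python
import struct
from typing import List, Sequence


def vector_to_blob_sketch(vector: Sequence[float]) -> bytes:
    """Pack floats as little-endian float32 values (4 bytes per dimension)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def blob_to_vector_sketch(blob: bytes) -> List[float]:
    """Inverse of vector_to_blob_sketch."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


blob = vector_to_blob_sketch([0.5, -1.0, 2.0])
```

Values that are not exactly representable in float32 come back slightly rounded, so comparisons after a round trip should use a tolerance.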

12 changes: 12 additions & 0 deletions docs/conf.py
@@ -0,0 +1,12 @@
import os
import sys

sys.path.insert(0, os.path.abspath('..'))

project = 'github-to-sqlite'
author = 'Simon Willison'
release = '2.9'

extensions = ['sphinx.ext.autodoc']

html_theme = 'sphinx_rtd_theme'
10 changes: 10 additions & 0 deletions docs/embeddings.rst
@@ -0,0 +1,10 @@
Generating embeddings
=====================

The ``starred-embeddings`` command computes sentence-transformer embeddings for repositories you have starred. It loads the model configured in ``config.default_model`` (``Alibaba-NLP/gte-modernbert-base`` by default) unless you specify ``--model`` or set the ``GITHUB_TO_SQLITE_MODEL`` environment variable.

.. code-block:: console

   $ github-to-sqlite starred-embeddings github.db --model my/custom-model

The command stores repository-level vectors in ``repo_embeddings`` and README chunk vectors in ``readme_chunk_embeddings``. Build files discovered via ``find_build_files()`` are parsed and saved to ``repo_build_files``. Basic language information and the directory listing are recorded in ``repo_metadata``.
9 changes: 9 additions & 0 deletions docs/index.rst
@@ -0,0 +1,9 @@
Welcome to github-to-sqlite's documentation!
=============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   embeddings
   migrations
15 changes: 15 additions & 0 deletions docs/migrations.rst
@@ -0,0 +1,15 @@
Migrations and build files
==========================

Run the ``migrate`` command to create any optional tables and indexes used by the embeddings feature:

.. code-block:: console

   $ github-to-sqlite migrate github.db

This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The command uses the ``sqlite-vec`` extension when it is available.

Build file detection
--------------------

Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.