
Add integration test for starred embeddings #5

Open
JeffCarpenter wants to merge 1 commit into main from
dldg6y-codex/plan-feature-to-infer-embeddings-for-starred-repos

Conversation


@JeffCarpenter commented Jun 13, 2025

Summary

  • document upcoming integration tests in the plan
  • add a new end-to-end test exercising starred-embeddings

Testing

  • ruff check --fix .
  • pytest -q
  • pytest --cov=github_to_sqlite --cov-branch -q
  • mypy github_to_sqlite

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Add a new end-to-end starred embeddings feature and supporting infrastructure: introduce the starred-embeddings and migrate commands, create embedding and metadata tables with sqlite-vec support, add utilities for chunking and build file discovery, update dependencies, CI, documentation, and comprehensive tests.

New Features:

  • Add starred-embeddings CLI command to compute and store embeddings for user-starred repositories
  • Introduce migrate command and embedding tables (repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata) with optional sqlite-vec support
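As a rough sketch of what the non-sqlite-vec fallback schema for these tables could look like (column names follow the ER diagram in the reviewer's guide below; this is illustrative, not the PR's exact DDL):

```python
import sqlite3

# Hypothetical fallback DDL: embeddings stored as raw BLOBs
# when the sqlite-vec extension is unavailable.
SCHEMA = """
CREATE TABLE IF NOT EXISTS repo_embeddings (
    repo_id INTEGER PRIMARY KEY REFERENCES repos(id),
    title_embedding BLOB,
    description_embedding BLOB,
    readme_embedding BLOB
);
CREATE TABLE IF NOT EXISTS readme_chunk_embeddings (
    repo_id INTEGER REFERENCES repos(id),
    chunk_index INTEGER,
    chunk_text TEXT,
    embedding BLOB,
    PRIMARY KEY (repo_id, chunk_index)
);
CREATE TABLE IF NOT EXISTS repo_build_files (
    repo_id INTEGER REFERENCES repos(id),
    file_path TEXT,
    metadata TEXT,
    PRIMARY KEY (repo_id, file_path)
);
CREATE TABLE IF NOT EXISTS repo_metadata (
    repo_id INTEGER PRIMARY KEY REFERENCES repos(id),
    language TEXT,
    directory_tree TEXT
);
"""

def ensure_embedding_tables(conn: sqlite3.Connection) -> None:
    """Create the embedding tables if they do not already exist (idempotent)."""
    conn.executescript(SCHEMA)
```

Because every statement uses `CREATE TABLE IF NOT EXISTS`, running it on an already-migrated database is a no-op.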

Enhancements:

  • Add utility functions for README chunking fallback, build file discovery, metadata parsing, and vector serialization
  • Integrate Pydantic-based configuration for default models and chunking parameters
  • Improve GitHub pagination and commit fetching robustness with safer link handling and type hints
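The "vector serialization" helper mentioned above is presumably a float32-to-BLOB packer; a minimal sketch (assuming NumPy, and contiguous float32 as the on-disk format, which is what sqlite-vec expects for BLOB inputs) might look like:

```python
import numpy as np

def serialize_vector(vec) -> bytes:
    # Pack a sequence of floats into contiguous float32 bytes for BLOB storage.
    return np.asarray(vec, dtype=np.float32).tobytes()

def deserialize_vector(blob: bytes) -> list[float]:
    # Inverse operation: recover the float32 values from the stored BLOB.
    return np.frombuffer(blob, dtype=np.float32).tolist()
```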

Build:

  • Expand runtime dependencies to include sentence-transformers, sqlite-vec, nltk, onnx, pydantic, tokenizers and update extras_require for tests, docs, and GPU

CI:

  • Enhance CI workflow to install docs dependencies, run ruff and mypy checks, build Sphinx documentation, and enforce coverage

Documentation:

  • Update README with migration and starred embeddings usage
  • Add Sphinx docs structure and reStructuredText files for migrations and embeddings

Tests:

  • Add unit tests for new utilities, chunkers, tokenization, migration and CLI commands
  • Add integration test for starred-embeddings end-to-end

Copilot AI review requested due to automatic review settings June 13, 2025 12:37

sourcery-ai bot commented Jun 13, 2025

Reviewer's Guide

This PR implements a complete embeddings pipeline for user-starred repositories: it extends the database layer to conditionally load and manage vector tables, adds new utility modules for chunking and tokenization, introduces a starred-embeddings CLI command with model selection and force/verbose flags, updates dependencies and CI to support the feature, and delivers thorough unit and integration tests alongside updated documentation and a phased plan.

Sequence Diagram for starred-embeddings Command

sequenceDiagram
    actor User
    participant CLI as "CLI (starred-embeddings)"
    participant GitHubAPI as "GitHub API"
    participant SentenceTransformer as "SentenceTransformer Model"
    participant SQLiteDB as "SQLite Database"
    participant Utils as "utils.py functions"

    User->>CLI: Executes `github-to-sqlite starred-embeddings github.db`
    CLI->>Utils: load_token(auth)
    Utils-->>CLI: token
    CLI->>SQLiteDB: Initialize DB connection
    CLI->>Utils: _maybe_load_sqlite_vec(db)
    Utils-->>CLI: using_vec_status
    CLI->>Utils: fetch_all_starred(token)
    Utils->>GitHubAPI: GET /user/starred
    GitHubAPI-->>Utils: Starred Repos List (repo_data)
    Utils-->>CLI: Starred Repos List

    loop For each starred repo
        CLI->>Utils: save_repo(db, repo_data)
        Utils->>SQLiteDB: UPSERT into repos table
        SQLiteDB-->>Utils: repo_id
        Utils-->>CLI: repo_id

        opt Skip if embeddings exist and --force is not set
            CLI->>SQLiteDB: Check repo_embeddings for repo_id
            alt Embeddings exist
                CLI->>User: Output "Skipping {repo_name} (already processed)" (if verbose)
                Note right of CLI: Continue to next repo
            end
        end

        CLI->>Utils: fetch_readme(token, repo_full_name)
        Utils->>GitHubAPI: GET /repos/{repo_full_name}/readme
        GitHubAPI-->>Utils: README content
        Utils-->>CLI: README content

        CLI->>Utils: chunk_readme(readme_content)
        Utils-->>CLI: readme_chunks

        CLI->>SentenceTransformer: encode([title, description] + readme_chunks)
        SentenceTransformer-->>CLI: vectors (title_vec, desc_vec, chunk_vecs)

        CLI->>SQLiteDB: UPSERT into repo_embeddings (repo_id, title_vec, desc_vec, readme_avg_vec)
        loop For each chunk_vec
            CLI->>SQLiteDB: UPSERT into readme_chunk_embeddings (repo_id, chunk_idx, chunk_text, chunk_vec)
        end

        CLI->>Utils: find_build_files(repo_path)
        Utils-->>CLI: build_file_paths
        loop For each build_file_path
            CLI->>Utils: parse_build_file(build_file_path)
            Utils-->>CLI: build_file_metadata (JSON)
            CLI->>SQLiteDB: UPSERT into repo_build_files (repo_id, file_path, metadata)
        end

        CLI->>Utils: directory_tree(repo_path)
        Utils-->>CLI: dir_tree_json
        CLI->>SQLiteDB: UPSERT into repo_metadata (repo_id, language, dir_tree_json)
        CLI->>User: Output "Processed {repo_name}" (if verbose)
    end
    CLI->>User: Output "Embeddings generation complete"

Entity Relationship Diagram for New Embeddings Tables

erDiagram
    repos {
        int id PK "Primary Key"
        text full_name
        text name
        text description
        text language
        text readme
        text readme_html
    }

    repo_embeddings {
        int repo_id PK "Primary Key, FK to repos.id"
        bytes title_embedding "BLOB or sqlite-vec float[768]"
        bytes description_embedding "BLOB or sqlite-vec float[768]"
        bytes readme_embedding "BLOB or sqlite-vec float[768]"
    }

    readme_chunk_embeddings {
        int repo_id PK "Part of PK, FK to repos.id"
        int chunk_index PK "Part of PK"
        text chunk_text
        bytes embedding "BLOB or sqlite-vec float[768]"
    }

    repo_build_files {
        int repo_id PK "Part of PK, FK to repos.id"
        text file_path PK "Part of PK"
        text metadata "JSON"
    }

    repo_metadata {
        int repo_id PK "Primary Key, FK to repos.id"
        text language
        text directory_tree "JSON"
    }

    repos |o--|| repo_embeddings : "identifies"
    repos |o--|| repo_metadata : "identifies"
    repos ||--o{ readme_chunk_embeddings : "contains"
    repos ||--o{ repo_build_files : "contains"

Class Diagram for New Helper Classes in Embeddings Feature

classDiagram
    class Config {
        +str default_model
        +str onnx_provider
        +int max_length
    }

    class Token {
        +int id
        +str value
        +tuple offsets
    }

    class Chunk {
        +str content
        +int token_count
        +bool is_triggered
        +float triggered_score
    }

    class BaseSplitter {
        <<Interface>>
        +__call__(doc: str) List~str~
    }

    class BaseChunker {
        +str name
        +BaseSplitter splitter
        +DenseEncoder encoder
        +__call__(docs: List~str~) List~List~Chunk~~
        #_split(doc: str) List~str~
        #_chunk(splits: List~any~) List~Chunk~
        +print(document_splits: List~Chunk~) void
    }

    class SimpleChunker {
        +int target_length
        +__call__(docs: List~str~) List~List~Chunk~~
        #_split(doc: str) List~str~
        #_chunk(sentences: List~str~) List~Chunk~
    }
    BaseChunker <|-- SimpleChunker

    class BasicSentencizerChunker {
        +str period_token
        +chunk(tokens: Sequence~str~, vectors: Sequence~any~) List~List~any~~
        +__call__(tokens: Sequence~str~, vectors: Sequence~any~) List~List~any~~
    }

    class DenseEncoder {
        %% Fallback definition if semantic_router not installed
        +str name
    }
    BaseChunker o-- "1" DenseEncoder : uses >
    BaseChunker o-- "1" BaseSplitter : uses >
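The Config class in the diagram is described elsewhere in this PR as a pydantic model; a dependency-free dataclass stand-in with the same three fields (the default values here are illustrative guesses, not the PR's actual settings) would be:

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Field names match the class diagram; defaults are hypothetical placeholders.
    default_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    onnx_provider: str = "CPUExecutionProvider"
    max_length: int = 512
```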

File-Level Changes

Change | Details | Files
Load sqlite-vec extension and define embedding tables
  • Implement _maybe_load_sqlite_vec to detect and load sqlite-vec
  • Add ensure_embedding_tables to create repo_embeddings, readme_chunk_embeddings, repo_build_files and repo_metadata
  • Invoke ensure_embedding_tables from ensure_db_shape
github_to_sqlite/utils.py
Introduce new modules for chunking, tokenization, and configuration
  • Add simple_chunker for sentence-based text splitting with fallback
  • Implement BasicSentencizerChunker in sentencizer_chunker for token-vector chunking
  • Provide load_tokenizer helper and global config via pydantic model
github_to_sqlite/simple_chunker.py
github_to_sqlite/sentencizer_chunker.py
github_to_sqlite/tokenization.py
github_to_sqlite/config.py
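The "sentence-based text splitting with fallback" in simple_chunker suggests a blank-line paragraph splitter when semantic chunking is unavailable; a sketch of that fallback (the `max_chars` budget is an assumed parameter, not necessarily what the PR uses) could be:

```python
def chunk_readme_fallback(readme: str, max_chars: int = 2000) -> list[str]:
    """Split a README on blank lines, greedily packing paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in readme.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```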
Add starred-embeddings CLI command
  • Register starred-embeddings command with options for model, force and verbose
  • Fetch starred repos, chunk READMEs, encode texts and upsert vectors into SQLite
  • Discover build files, parse metadata and store language/directory information
github_to_sqlite/cli.py
Update project setup and CI for embedding support
  • Add sentence-transformers, sqlite-vec, nltk, onnx and tokenizers to install_requires
  • Expand extras_require for tests, docs, semantic_chunkers and GPU
  • Extend GitHub Actions to run pytest, ruff, mypy and Sphinx docs build
  • Update README.md with migrate and starred-embeddings usage
setup.py
README.md
.github/workflows/test.yml
Add unit and integration tests for new features
  • Test embedding table creation, utils functions and migration command
  • Verify starred-embeddings command behavior with and without sqlite-vec, including end-to-end integration
  • Cover chunking, build-file detection, tokenization and utility helpers
tests/test_repos.py
tests/test_starred.py
tests/test_starred_embeddings_command.py
tests/test_starred_embeddings_integration.py
tests/test_utils_functions.py
tests/test_find_build_files.py
tests/test_simple_chunker.py
tests/test_sentencizer_chunker.py
tests/test_chunk_readme.py
tests/test_config.py
tests/test_embedding_tables.py
tests/test_migrate_command.py
Provide detailed plan and documentation for the embeddings feature
  • Add PLAN.md with phased implementation roadmap
  • Introduce Sphinx conf and rst documents for embeddings and migrations
  • Include AGENTS.md and update docs index
PLAN.md
docs/conf.py
docs/index.rst
docs/embeddings.rst
docs/migrations.rst
AGENTS.md


Copilot AI left a comment


Pull Request Overview

Implements the starred-embeddings feature end-to-end and adds supporting unit tests, CI updates, and documentation.

  • Introduce utility functions for embedding tables, build-file parsing, README chunking, and vector serialization.
  • Add a new CLI command starred-embeddings with DB migrations, schema updates, and optional sqlite-vec support.
  • Update dependencies, tests, CI workflow, and docs (RST and README) to cover the new embedding functionality.

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.

File | Description
tests/test_find_build_files.py Unit tests for the new find_build_files helper.
tests/test_embedding_tables.py Verifies creation of embedding tables via ensure_db_shape.
tests/test_config.py Tests tokenizer loading with env var and HF from_pretrained.
tests/test_chunk_readme.py Tests semantic-chunking fallback and blank-line split.
setup.py Adds sentence-transformers, sqlite-vec, semantic-chunkers, etc.
github_to_sqlite/utils.py Adds _maybe_load_sqlite_vec, ensure_embedding_tables, parsing, etc.
github_to_sqlite/cli.py New starred-embeddings command, type casts, and imports.
docs/embeddings.rst, migrations.rst Document the new embeddings command and migrations.
.github/workflows/test.yml CI now runs pytest-cov, mypy, ruff, and builds docs.
Comments suppressed due to low confidence (1)

.github/workflows/test.yml:22

  • [nitpick] You already include mypy and ruff in the [test] extras, so this manual install is redundant. You can remove it to simplify the CI step.
        pip install mypy ruff

)

# Build file metadata
for build_path in utils.find_build_files(repo["full_name"]):

Copilot AI Jun 13, 2025


This attempts to scan local filesystem paths named by repo["full_name"], but the repo directory isn’t cloned locally. You’ll need to clone or download the repo contents before calling find_build_files, or switch to using the GitHub contents API.

return arr.tobytes()


def parse_build_file(path: str) -> dict:

Copilot AI Jun 13, 2025


On Python 3.10, import tomllib will fail (it’s only in 3.11+), causing .toml files to always return {}. Consider importing tomli as a fallback when tomllib isn’t available.
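The fix Copilot suggests would look like the following (untested sketch; `parse_toml_text` is a hypothetical name standing in for the TOML branch of parse_build_file, and the `tomli` backport must be declared as a dependency for Python < 3.11):

```python
# Version-safe TOML import: tomllib is stdlib only on Python 3.11+.
try:
    import tomllib
except ModuleNotFoundError:  # Python 3.10 and earlier
    import tomli as tomllib  # requires the "tomli" backport package

def parse_toml_text(text: str) -> dict:
    """Parse TOML text, returning {} on syntax errors (mirrors the lenient parser)."""
    try:
        return tomllib.loads(text)
    except tomllib.TOMLDecodeError:
        return {}
```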


@sourcery-ai sourcery-ai bot left a comment


Hey @JeffCarpenter - I've reviewed your changes and they look great!



Comment on lines +40 to +41
if cmd == 'find':
    return '/usr/bin/find'

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation: Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +71 to +72
if "repo_embeddings" not in tables:
    db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id")

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +73 to +74
if "readme_chunk_embeddings" not in tables:
    db["readme_chunk_embeddings"].create({"repo_id": int, "chunk_index": int, "chunk_text": str, "embedding": bytes}, pk=("repo_id", "chunk_index"))

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +75 to +76
if "repo_build_files" not in tables:
    db["repo_build_files"].create({"repo_id": int, "file_path": str, "metadata": str}, pk=("repo_id", "file_path"))

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +77 to +78
if "repo_metadata" not in tables:
    db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


    default="auth.json",
    help="Path to auth.json token file",
)
def starred_embeddings(db_path, model, force, verbose, auth):

issue (code-quality): Low code quality found in starred_embeddings - 17% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

from bs4 import BeautifulSoup

url = "https://github.com/{}/network/dependents".format(repo)
url: str | None = "https://github.com/{}/network/dependents".format(repo)

suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)

Suggested change
url: str | None = "https://github.com/{}/network/dependents".format(repo)
url: str | None = f"https://github.com/{repo}/network/dependents"

    check=True,
)
for line in result.stdout.splitlines():
    line = line.strip()

issue (code-quality): Use named expression to simplify assignment and conditional [×2] (use-named-expression)

Comment on lines +40 to +49
if cmd == 'find':
    return '/usr/bin/find'
return None

def fake_run(args, capture_output, text, check):
    pattern = args[3]
    calls.append((args[0], pattern))

    class R:
        def __init__(self):
            self.stdout = 'repo/' + pattern + '\n'

issue (code-quality): We've found these issues:
