Skip to content

Add comprehensive integration test for embeddings#6

Open
JeffCarpenter wants to merge 1 commit intomainfrom
4nd0vj-codex/plan-feature-to-infer-embeddings-for-starred-repos
Open

Add comprehensive integration test for embeddings#6
JeffCarpenter wants to merge 1 commit intomainfrom
4nd0vj-codex/plan-feature-to-infer-embeddings-for-starred-repos

Conversation

@JeffCarpenter
Copy link
Owner

@JeffCarpenter JeffCarpenter commented Jun 13, 2025

Summary

  • mark integration testing complete in the plan
  • expand test_starred_embeddings_integration to verify stored data

Testing

  • ruff check --fix .
  • pytest -q
  • pytest --cov=github_to_sqlite --cov-branch -q
  • mypy github_to_sqlite (failed: command produced no output)

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Add a new embeddings feature for starred repositories, including database schema changes, CLI commands, utility functions, documentation, and tests to support generating, storing, and migrating vector embeddings for repo titles, descriptions, README chunks, and build metadata.

New Features:

  • Add starred-embeddings CLI command to compute and store embeddings for user-starred repositories
  • Introduce new database tables (repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata) with optional sqlite-vec support
  • Add migrate CLI command to initialize embedding tables and other optional schema elements
  • Provide utilities for chunking README text, locating and parsing build files, building directory trees, and serializing vectors

Enhancements:

  • Integrate embedding table creation into ensure_db_shape
  • Improve pagination logic, type annotations, and stop-when callbacks in GitHub API utilities

Build:

  • Update dependencies in setup.py to include sentence-transformers, sqlite-vec, semantic-chunkers, docs and testing tools

CI:

  • Enhance CI workflow to run ruff, mypy, coverage, and build documentation

Documentation:

  • Update README with migration and embeddings usage instructions
  • Add PLAN.md and Sphinx documentation for embeddings, migrations, and build file detection

Tests:

  • Add unit and integration tests covering the new CLI, utils, chunkers, build-file detection, tokenization, and migration command
  • Expand integration test for starred embeddings to verify stored data

Copilot AI review requested due to automatic review settings June 13, 2025 15:11
@sourcery-ai
Copy link

sourcery-ai bot commented Jun 13, 2025

Reviewer's Guide

This PR expands the embeddings feature with full end-to-end support and testing by loading the sqlite-vec extension, creating new embedding and metadata tables, adding a Click command to generate and store starred-repo embeddings, introducing helper utilities for chunking and file parsing, updating dependencies, CI, and docs, and providing comprehensive unit and integration tests to verify stored vectors and metadata.

File-Level Changes

Change Details Files
SQLite-vec extension support and embedding table creation
  • Add _maybe_load_sqlite_vec to load sqlite-vec extension
  • Implement ensure_embedding_tables to create repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata based on extension availability
  • Integrate ensure_embedding_tables call into ensure_db_shape
github_to_sqlite/utils.py
starred-embeddings CLI command
  • Add new Click command to compute and store embeddings for starred repos
  • Load model via sentence-transformers with CLI and env var override
  • Iterate over starred repos, chunk README, find and parse build files
  • Upsert embedding vectors and metadata into new tables, support --force and --verbose
github_to_sqlite/cli.py
Utility helpers for chunking, file detection, and serialization
  • Implement chunk_readme with semantic_chunkers fallback
  • Add find_build_files preferring fd, then find, then os.walk
  • Add vector_to_blob, parse_build_file, directory_tree helpers
  • Introduce simple_chunker, sentencizer_chunker, and tokenization module
  • Add config for default model
github_to_sqlite/utils.py
github_to_sqlite/simple_chunker.py
github_to_sqlite/sentencizer_chunker.py
github_to_sqlite/tokenization.py
github_to_sqlite/config.py
Dependency, CI, and documentation updates
  • Extend install_requires and extras_require in setup.py for embeddings and tools
  • Update GitHub Actions to run pytest, ruff, mypy, and docs build
  • Add README sections for migrations and starred-embeddings
  • Add Sphinx docs configuration and new docs stubs
setup.py
.github/workflows/test.yml
README.md
docs/conf.py
Comprehensive tests for embedding workflow
  • Add integration tests verifying stored embeddings and metadata end-to-end
  • Add command tests for starred-embeddings with sqlite-vec and env var scenarios
  • Add unit tests for utils, chunkers, tokenization, migration
  • Update existing tests to include new tables
tests/test_starred_embeddings_integration.py
tests/test_starred_embeddings_command.py
tests/test_utils_functions.py
tests/test_find_build_files.py
tests/test_simple_chunker.py
tests/test_chunk_readme.py
tests/test_sentencizer_chunker.py
tests/test_migrate_command.py
tests/test_config.py
tests/test_embedding_tables.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements the end-to-end embeddings feature and adds corresponding tests, CLI commands, documentation updates, and dependency changes.

  • Introduces embedding storage tables (repo_embeddings, readme_chunk_embeddings), build-file metadata, and ensure_embedding_tables
  • Adds new CLI command starred-embeddings plus updated setup.py, CI workflows, and docs
  • Expands unit tests for utilities (find_build_files, chunk_readme, ensure_db_shape, etc.)

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
github_to_sqlite/utils.py Added embedding table creation, chunking & build-file helpers
github_to_sqlite/tokenization.py Exposed load_tokenizer using HF Tokenizer
github_to_sqlite/simple_chunker.py New simple chunker implementation
github_to_sqlite/sentencizer_chunker.py Added BasicSentencizerChunker
github_to_sqlite/config.py New Pydantic config defaults
github_to_sqlite/cli.py New starred-embeddings command and type casting
setup.py Updated dependencies and extras
mypy.ini Added strict type‐checking flags
tests/ New unit tests for all new utils and helpers
docs/ Added RST docs for embeddings and migrations
PLAN.md, AGENTS.md Marked embedding plan complete and notes
.github/workflows/test.yml CI now runs ruff, mypy, docs build

click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs")

env_model = os.environ.get("GITHUB_TO_SQLITE_MODEL")
model_name = model or env_model or config.config.default_model
Copy link

Copilot AI Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference config.config.default_model is incorrect—config is the instance imported from config.py, so it should be config.default_model.

Suggested change
model_name = model or env_model or config.config.default_model
model_name = model or env_model or config.default_model

Copilot uses AI. Check for mistakes.
calls.append(model)
return 'tok'

monkeypatch.setattr('tokenizers.Tokenizer.from_pretrained', fake_from_pretrained)
Copy link

Copilot AI Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The monkeypatch.setattr target should be a module attribute, not a string—replace with e.g. monkeypatch.setattr(tokenization.Tokenizer, 'from_pretrained', fake_from_pretrained) or patch Tokenizer.from_pretrained on the actual imported class.

Suggested change
monkeypatch.setattr('tokenizers.Tokenizer.from_pretrained', fake_from_pretrained)
monkeypatch.setattr(Tokenizer, 'from_pretrained', fake_from_pretrained)

Copilot uses AI. Check for mistakes.
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

  • In ensure_embedding_tables you have nearly identical logic for creating non-vec tables—consider extracting a small helper (e.g. create_table_if_missing) to reduce duplication and make future additions easier.
  • find_build_files’ Python fallback could be simplified by using pathlib.Path(path).rglob(pattern) instead of manual os.walk and subprocess calls, making it more cross-platform and easier to maintain.
  • The starred-embeddings command currently encodes each repo’s chunks in one shot; for users with many stars you might batch or stream encode calls to avoid blowing up memory or exceeding model input limits.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In ensure_embedding_tables you have nearly identical logic for creating non-vec tables—consider extracting a small helper (e.g. create_table_if_missing) to reduce duplication and make future additions easier.
- find_build_files’ Python fallback could be simplified by using pathlib.Path(path).rglob(pattern) instead of manual os.walk and subprocess calls, making it more cross-platform and easier to maintain.
- The starred-embeddings command currently encodes each repo’s chunks in one shot; for users with many stars you might batch or stream encode calls to avoid blowing up memory or exceeding model input limits.

## Individual Comments

### Comment 1
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+
+    if shutil.which("fd"):
+        for pattern in BUILD_PATTERNS:
+            result = subprocess.run(
+                ["fd", "-HI", "-t", "f", pattern, path],
+                capture_output=True,
</code_context>

<issue_to_address>
subprocess.run(check=True) can raise and crash

Wrap the call in try/except to handle CalledProcessError and ensure the program handles failures gracefully.
</issue_to_address>

### Comment 2
<location> `docs/migrations.rst:12` </location>
<code_context>
+
+This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The helper prefers the ``sqlite-vec`` extension when available.
+
+Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.
</code_context>

<issue_to_address>
Consider adjusting the presentation of the build file detection paragraph.

Clarify the relationship between this paragraph and the earlier discussion of the `migrate` command, or consider making it a separate sub-topic for improved readability.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.


if shutil.which("fd"):
for pattern in BUILD_PATTERNS:
result = subprocess.run(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): subprocess.run(check=True) can raise and crash

Wrap the call in try/except to handle CalledProcessError and ensure the program handles failures gracefully.


This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The helper prefers the ``sqlite-vec`` extension when available.

Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Consider adjusting the presentation of the build file detection paragraph.

Clarify the relationship between this paragraph and the earlier discussion of the migrate command, or consider making it a separate sub-topic for improved readability.

Comment on lines +40 to +41
if cmd == 'find':
return '/usr/bin/find'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +71 to +72
if "repo_embeddings" not in tables:
db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +73 to +74
if "readme_chunk_embeddings" not in tables:
db["readme_chunk_embeddings"].create({"repo_id": int, "chunk_index": int, "chunk_text": str, "embedding": bytes}, pk=("repo_id", "chunk_index"))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +77 to +78
if "repo_metadata" not in tables:
db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

ExplanationAvoid complex code, like conditionals, in test functions.

Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

default="auth.json",
help="Path to auth.json token file",
)
def starred_embeddings(db_path, model, force, verbose, auth):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Low code quality found in starred_embeddings - 17% (low-code-quality)


ExplanationThe quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

from bs4 import BeautifulSoup

url = "https://github.com/{}/network/dependents".format(repo)
url: str | None = "https://github.com/{}/network/dependents".format(repo)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)

Suggested change
url: str | None = "https://github.com/{}/network/dependents".format(repo)
url: str | None = f"https://github.com/{repo}/network/dependents"

check=True,
)
for line in result.stdout.splitlines():
line = line.strip()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): Use named expression to simplify assignment and conditional [×2] (use-named-expression)

Comment on lines +40 to +49
if cmd == 'find':
return '/usr/bin/find'
return None

def fake_run(args, capture_output, text, check):
pattern = args[3]
calls.append((args[0], pattern))
class R:
def __init__(self):
self.stdout = 'repo/' + pattern + '\n'
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (code-quality): We've found these issues:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments