Add comprehensive integration test for embeddings by JeffCarpenter · Pull Request #6 · JeffCarpenter/github-to-sqlite

JeffCarpenter · 2025-06-13T15:11:28Z

Summary

mark integration testing complete in the plan
expand test_starred_embeddings_integration to verify stored data

Testing

ruff check --fix .
pytest -q
pytest --cov=github_to_sqlite --cov-branch -q
mypy github_to_sqlite (failed: command produced no output)

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Add a new embeddings feature for starred repositories, including database schema changes, CLI commands, utility functions, documentation, and tests to support generating, storing, and migrating vector embeddings for repo titles, descriptions, README chunks, and build metadata.

New Features:

Add starred-embeddings CLI command to compute and store embeddings for user-starred repositories
Introduce new database tables (repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata) with optional sqlite-vec support
Add migrate CLI command to initialize embedding tables and other optional schema elements
Provide utilities for chunking README text, locating and parsing build files, building directory trees, and serializing vectors

Enhancements:

Integrate embedding table creation into ensure_db_shape
Improve pagination logic, type annotations, and stop-when callbacks in GitHub API utilities

Build:

Update dependencies in setup.py to include sentence-transformers, sqlite-vec, semantic-chunkers, docs and testing tools

CI:

Enhance CI workflow to run ruff, mypy, coverage, and build documentation

Documentation:

Update README with migration and embeddings usage instructions
Add PLAN.md and Sphinx documentation for embeddings, migrations, and build file detection

Tests:

Add unit and integration tests covering the new CLI, utils, chunkers, build-file detection, tokenization, and migration command
Expand integration test for starred embeddings to verify stored data

sourcery-ai · 2025-06-13T15:11:33Z

Reviewer's Guide

This PR expands the embeddings feature with full end-to-end support and testing by loading the sqlite-vec extension, creating new embedding and metadata tables, adding a Click command to generate and store starred-repo embeddings, introducing helper utilities for chunking and file parsing, updating dependencies, CI, and docs, and providing comprehensive unit and integration tests to verify stored vectors and metadata.

File-Level Changes

Change	Details	Files
SQLite-vec extension support and embedding table creation	Add _maybe_load_sqlite_vec to load sqlite-vec extension Implement ensure_embedding_tables to create repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata based on extension availability Integrate ensure_embedding_tables call into ensure_db_shape	`github_to_sqlite/utils.py`
starred-embeddings CLI command	Add new Click command to compute and store embeddings for starred repos Load model via sentence-transformers with CLI and env var override Iterate over starred repos, chunk README, find and parse build files Upsert embedding vectors and metadata into new tables, support --force and --verbose	`github_to_sqlite/cli.py`
Utility helpers for chunking, file detection, and serialization	Implement chunk_readme with semantic_chunkers fallback Add find_build_files preferring fd, then find, then os.walk Add vector_to_blob, parse_build_file, directory_tree helpers Introduce simple_chunker, sentencizer_chunker, and tokenization module Add config for default model	`github_to_sqlite/utils.py` `github_to_sqlite/simple_chunker.py` `github_to_sqlite/sentencizer_chunker.py` `github_to_sqlite/tokenization.py` `github_to_sqlite/config.py`
Dependency, CI, and documentation updates	Extend install_requires and extras_require in setup.py for embeddings and tools Update GitHub Actions to run pytest, ruff, mypy, and docs build Add README sections for migrations and starred-embeddings Add Sphinx docs configuration and new docs stubs	`setup.py` `.github/workflows/test.yml` `README.md` `docs/conf.py`
Comprehensive tests for embedding workflow	Add integration tests verifying stored embeddings and metadata end-to-end Add command tests for starred-embeddings with sqlite-vec and env var scenarios Add unit tests for utils, chunkers, tokenization, migration Update existing tests to include new tables	`tests/test_starred_embeddings_integration.py` `tests/test_starred_embeddings_command.py` `tests/test_utils_functions.py` `tests/test_find_build_files.py` `tests/test_simple_chunker.py` `tests/test_chunk_readme.py` `tests/test_sentencizer_chunker.py` `tests/test_migrate_command.py` `tests/test_config.py` `tests/test_embedding_tables.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

Copilot

Pull Request Overview

This PR implements the end-to-end embeddings feature and adds corresponding tests, CLI commands, documentation updates, and dependency changes.

Introduces embedding storage tables (repo_embeddings, readme_chunk_embeddings), build-file metadata, and ensure_embedding_tables
Adds new CLI command starred-embeddings plus updated setup.py, CI workflows, and docs
Expands unit tests for utilities (find_build_files, chunk_readme, ensure_db_shape, etc.)

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
github_to_sqlite/utils.py	Added embedding table creation, chunking & build-file helpers
github_to_sqlite/tokenization.py	Exposed `load_tokenizer` using HF Tokenizer
github_to_sqlite/simple_chunker.py	New simple chunker implementation
github_to_sqlite/sentencizer_chunker.py	Added BasicSentencizerChunker
github_to_sqlite/config.py	New Pydantic config defaults
github_to_sqlite/cli.py	New `starred-embeddings` command and type casting
setup.py	Updated dependencies and extras
mypy.ini	Added strict type‐checking flags
tests/	New unit tests for all new utils and helpers
docs/	Added RST docs for embeddings and migrations
PLAN.md, AGENTS.md	Marked embedding plan complete and notes
.github/workflows/test.yml	CI now runs ruff, mypy, docs build

Copilot · 2025-06-13T15:12:41Z

github_to_sqlite/cli.py

+            click.echo("sqlite-vec extension not loaded; storing embeddings as BLOBs")
+
+    env_model = os.environ.get("GITHUB_TO_SQLITE_MODEL")
+    model_name = model or env_model or config.config.default_model


The reference config.config.default_model is incorrect—config is the instance imported from config.py, so it should be config.default_model.

Suggested change

model_name = model or env_model or config.config.default_model

model_name = model or env_model or config.default_model

Copilot · 2025-06-13T15:12:42Z

tests/test_config.py

+        calls.append(model)
+        return 'tok'
+
+    monkeypatch.setattr('tokenizers.Tokenizer.from_pretrained', fake_from_pretrained)


The monkeypatch.setattr target should be a module attribute, not a string—replace with e.g. monkeypatch.setattr(tokenization.Tokenizer, 'from_pretrained', fake_from_pretrained) or patch Tokenizer.from_pretrained on the actual imported class.

Suggested change

monkeypatch.setattr('tokenizers.Tokenizer.from_pretrained', fake_from_pretrained)

monkeypatch.setattr(Tokenizer, 'from_pretrained', fake_from_pretrained)

sourcery-ai

Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

In ensure_embedding_tables you have nearly identical logic for creating non-vec tables—consider extracting a small helper (e.g. create_table_if_missing) to reduce duplication and make future additions easier.
find_build_files’ Python fallback could be simplified by using pathlib.Path(path).rglob(pattern) instead of manual os.walk and subprocess calls, making it more cross-platform and easier to maintain.
The starred-embeddings command currently encodes each repo’s chunks in one shot; for users with many stars you might batch or stream encode calls to avoid blowing up memory or exceeding model input limits.

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- In ensure_embedding_tables you have nearly identical logic for creating non-vec tables—consider extracting a small helper (e.g. create_table_if_missing) to reduce duplication and make future additions easier.
- find_build_files’ Python fallback could be simplified by using pathlib.Path(path).rglob(pattern) instead of manual os.walk and subprocess calls, making it more cross-platform and easier to maintain.
- The starred-embeddings command currently encodes each repo’s chunks in one shot; for users with many stars you might batch or stream encode calls to avoid blowing up memory or exceeding model input limits.

## Individual Comments

### Comment 1
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+
+    if shutil.which("fd"):
+        for pattern in BUILD_PATTERNS:
+            result = subprocess.run(
+                ["fd", "-HI", "-t", "f", pattern, path],
+                capture_output=True,
</code_context>

<issue_to_address>
subprocess.run(check=True) can raise and crash

Wrap the call in try/except to handle CalledProcessError and ensure the program handles failures gracefully.
</issue_to_address>

### Comment 2
<location> `docs/migrations.rst:12` </location>
<code_context>
+
+This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The helper prefers the ``sqlite-vec`` extension when available.
+
+Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.
</code_context>

<issue_to_address>
Consider adjusting the presentation of the build file detection paragraph.

Clarify the relationship between this paragraph and the earlier discussion of the `migrate` command, or consider making it a separate sub-topic for improved readability.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2025-06-13T15:14:34Z

github_to_sqlite/utils.py

+
+    if shutil.which("fd"):
+        for pattern in BUILD_PATTERNS:
+            result = subprocess.run(


issue (bug_risk): subprocess.run(check=True) can raise and crash

Wrap the call in try/except to handle CalledProcessError and ensure the program handles failures gracefully.

sourcery-ai · 2025-06-13T15:14:35Z

docs/migrations.rst

+
+This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The helper prefers the ``sqlite-vec`` extension when available.
+
+Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.


suggestion: Consider adjusting the presentation of the build file detection paragraph.

Clarify the relationship between this paragraph and the earlier discussion of the migrate command, or consider making it a separate sub-topic for improved readability.

sourcery-ai · 2025-06-13T15:14:35Z

tests/test_find_build_files.py

+        if cmd == 'find':
+            return '/usr/bin/find'


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-06-13T15:14:35Z

tests/test_starred_embeddings_command.py

+        if "repo_embeddings" not in tables:
+            db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id")


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-06-13T15:14:35Z

tests/test_starred_embeddings_command.py

+        if "readme_chunk_embeddings" not in tables:
+            db["readme_chunk_embeddings"].create({"repo_id": int, "chunk_index": int, "chunk_text": str, "embedding": bytes}, pk=("repo_id", "chunk_index"))


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-06-13T15:14:35Z

tests/test_starred_embeddings_command.py

+        if "repo_metadata" not in tables:
+            db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")


issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation
Avoid complex code, like conditionals, in test functions.
Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:

loops

conditionals

Some ways to fix this:

Use parametrized tests to get rid of the loop.

Move the complex logic into helpers.

Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

sourcery-ai · 2025-06-13T15:14:35Z

github_to_sqlite/cli.py

+    default="auth.json",
+    help="Path to auth.json token file",
+)
+def starred_embeddings(db_path, model, force, verbose, auth):


issue (code-quality): Low code quality found in starred_embeddings - 17% (low-code-quality)

Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.

Reduce nesting, perhaps by introducing guard clauses to return early.

Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.

sourcery-ai · 2025-06-13T15:14:35Z

github_to_sqlite/utils.py

    from bs4 import BeautifulSoup

-    url = "https://github.com/{}/network/dependents".format(repo)
+    url: str | None = "https://github.com/{}/network/dependents".format(repo)


suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)

Suggested change

url: str | None = "https://github.com/{}/network/dependents".format(repo)

url: str | None = f"https://github.com/{repo}/network/dependents"

sourcery-ai · 2025-06-13T15:14:35Z

github_to_sqlite/utils.py

+                check=True,
+            )
+            for line in result.stdout.splitlines():
+                line = line.strip()


issue (code-quality): Use named expression to simplify assignment and conditional [×2] (use-named-expression)

sourcery-ai · 2025-06-13T15:14:35Z

tests/test_find_build_files.py

+        if cmd == 'find':
+            return '/usr/bin/find'
+        return None
+
+    def fake_run(args, capture_output, text, check):
+        pattern = args[3]
+        calls.append((args[0], pattern))
+        class R:
+            def __init__(self):
+                self.stdout = 'repo/' + pattern + '\n'


issue (code-quality): We've found these issues:

Lift code into else after jump in control flow (reintroduce-else)

Replace if statement with if expression (assign-if-exp)

Use f-string instead of string concatenation (use-fstring-for-concatenation)

Add comprehensive integration test

185345e

Copilot AI review requested due to automatic review settings June 13, 2025 15:11

JeffCarpenter added the codex label Jun 13, 2025 — with ChatGPT Codex Connector

Copilot AI reviewed Jun 13, 2025

View reviewed changes

sourcery-ai bot reviewed Jun 13, 2025

View reviewed changes

	model_name = model or env_model or config.config.default_model
	model_name = model or env_model or config.default_model

	monkeypatch.setattr('tokenizers.Tokenizer.from_pretrained', fake_from_pretrained)
	monkeypatch.setattr(Tokenizer, 'from_pretrained', fake_from_pretrained)


		This sets up the ``repo_embeddings``, ``readme_chunk_embeddings``, ``repo_build_files`` and ``repo_metadata`` tables. The helper prefers the ``sqlite-vec`` extension when available.

		Some commands look for standard build definitions such as ``pyproject.toml`` or ``package.json``. The ``find_build_files()`` helper uses the ``fd`` command if installed, otherwise falling back to ``find`` or a Python implementation.

		if "repo_embeddings" not in tables:
		db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id")

		if "readme_chunk_embeddings" not in tables:
		db["readme_chunk_embeddings"].create({"repo_id": int, "chunk_index": int, "chunk_text": str, "embedding": bytes}, pk=("repo_id", "chunk_index"))

		if "repo_metadata" not in tables:
		db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")

	url: str \| None = "https://github.com/{}/network/dependents".format(repo)
	url: str \| None = f"https://github.com/{repo}/network/dependents"

Conversation

JeffCarpenter commented Jun 13, 2025 • edited by sourcery-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Testing

Summary by Sourcery

Uh oh!

sourcery-ai bot commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

sourcery-ai bot Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

JeffCarpenter commented Jun 13, 2025 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Jun 13, 2025 •

edited

Loading