Refine virtual table creation and build file discovery #9

Open

JeffCarpenter wants to merge 2 commits into main from
97bx9d-codex/plan-feature-to-infer-embeddings-for-starred-repos

Conversation

@JeffCarpenter (Owner) commented Jun 28, 2025

Summary

  • avoid AttributeError when loading sqlite-vec
  • create vec0 tables using CREATE VIRTUAL TABLE IF NOT EXISTS
  • deduplicate build file filtering logic

Testing

  • pytest -q
  • mypy github_to_sqlite

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Refactor virtual table creation and build file discovery, and add a new embeddings feature with supporting utilities and CLI commands.

New Features:

  • Add starred-embeddings CLI command to compute and store embeddings for starred repositories
  • Introduce migrate CLI command to initialize optional tables, FTS, and foreign keys
  • Add find_build_files utility to locate and normalize build definition files using fd/find/os.walk (see the sketch after this list)
  • Add chunk_readme, parse_build_file, directory_tree, and vector_to_blob utilities
  • Introduce simple_chunker and sentencizer_chunker modules for text splitting and chunking
  • Add tokenization module for loading a Hugging Face tokenizer
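
A minimal sketch of the fd → find → os.walk fallback and the shared post-processing described above (helper names come from this PR; the pattern list and exact signatures are assumptions):

import os
import pathlib
import shutil
import subprocess

# Hard-coded for now; see the review feedback below about making this configurable.
BUILD_PATTERNS = ["pyproject.toml", "package.json", "Cargo.toml", "Gemfile"]

def _post_process_build_files(found):
    # Normalize paths, drop .git/node_modules entries, and deduplicate in order.
    unique, seen = [], set()
    for item in found:
        item = os.path.normpath(item)
        if f"{os.sep}.git{os.sep}" in item or f"{os.sep}node_modules{os.sep}" in item:
            continue
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

def find_build_files(path):
    found = []
    for pattern in BUILD_PATTERNS:
        if shutil.which("fd"):
            out = subprocess.run(["fd", "--type", "f", "--glob", pattern, path],
                                 capture_output=True, text=True, check=True)
            found.extend(out.stdout.splitlines())
        elif shutil.which("find"):
            out = subprocess.run(["find", path, "-type", "f", "-name", pattern],
                                 capture_output=True, text=True, check=True)
            found.extend(out.stdout.splitlines())
        else:
            found.extend(str(p) for p in pathlib.Path(path).rglob(pattern) if p.is_file())
    return _post_process_build_files(found)

The tests below suggest the real implementation also tolerates subprocess failures by falling back to the next strategy; this sketch shows only the happy path.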

Enhancements:

  • Refactor embedding table creation with _create_table_if_missing/_create_virtual_table_if_missing and use CREATE VIRTUAL TABLE IF NOT EXISTS for vec0 (illustrated in the sketch after this list)
  • Improve sqlite-vec extension loading to handle missing or failing extensions gracefully
  • Deduplicate and normalize build file path filtering logic in _post_process_build_files
  • Cast tables to sqlite_utils.Table in CLI to avoid AttributeError
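
For context, the idempotent vec0 creation might look roughly like the sketch below (the helper signature and column layout are assumptions based on this description and the ER diagram further down; the sqlite-vec extension must already be loaded on the connection):

import sqlite_utils

def _create_virtual_table_if_missing(db, name, columns):
    # IF NOT EXISTS makes the migration idempotent: re-running it against an
    # existing database is a no-op rather than an OperationalError.
    db.execute(f"CREATE VIRTUAL TABLE IF NOT EXISTS {name} USING vec0({columns})")

db = sqlite_utils.Database("github.db")  # assumes sqlite-vec has been loaded on db.conn
_create_virtual_table_if_missing(
    db,
    "repo_embeddings",
    "repo_id INTEGER PRIMARY KEY, title_embedding float[768], "
    "description_embedding float[768], readme_embedding float[768]",
)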

CI:

  • Extend CI to run ruff linting, mypy type checks, pytest-cov coverage, and Sphinx doc builds

Documentation:

  • Update README with migration, build file detection, and starred-embeddings usage
  • Add Sphinx docs for migrations and embeddings feature

Tests:

  • Add comprehensive tests for embedding generation, build file discovery, utilities, chunkers, CLI commands, and migration

Copilot AI review requested due to automatic review settings June 28, 2025 21:45

sourcery-ai bot commented Jun 28, 2025

Reviewer's Guide

This PR overhauls how embedding tables and build files are created and discovered by introducing safe sqlite-vec loading with vec0 virtual tables, unifying build file search/filtering, handling API pagination errors, and extending the CLI and CI with new commands and metadata.

Entity relationship diagram for new and updated embedding tables

erDiagram
    repos ||--o{ repo_embeddings : "has"
    repos ||--o{ readme_chunk_embeddings : "has"
    repos ||--o{ repo_build_files : "has"
    repos ||--o{ repo_metadata : "has"

    repo_embeddings {
        int repo_id PK
        float[768] title_embedding
        float[768] description_embedding
        float[768] readme_embedding
    }
    readme_chunk_embeddings {
        int repo_id PK
        int chunk_index PK
        text chunk_text
        float[768] embedding
    }
    repo_build_files {
        int repo_id PK
        string file_path PK
        string metadata
    }
    repo_metadata {
        int repo_id PK
        string language
        string directory_tree
    }
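
The float[768] columns above are vec0 vector columns; when sqlite-vec is unavailable the same values go into plain BLOB columns (the fallback tables shown later in this review use bytes columns), which is where the vector_to_blob utility named in the summary comes in. A minimal sketch, assuming the little-endian float32 layout that sqlite-vec accepts:

import struct

def vector_to_blob(values):
    # One 4-byte little-endian float per dimension: 768 * 4 bytes for a float[768] vector.
    return struct.pack(f"<{len(values)}f", *values)

blob = vector_to_blob([0.0] * 768)
assert len(blob) == 768 * 4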

Class diagram for chunker and config utilities

classDiagram
    class Config {
        +str default_model
        +str onnx_provider
        +int max_length
    }
    class Chunk {
        +str content
        +int token_count
        +bool is_triggered
        +float triggered_score
    }
    class BaseSplitter {
        <<abstract>>
        +__call__(doc: str) List~str~
    }
    class BaseChunker {
        +str name
        +BaseSplitter splitter
        +DenseEncoder encoder
        +__call__(docs: List~str~) List~List~Chunk~~
        +_split(doc: str) List~str~
        +_chunk(splits: List~Any~) List~Chunk~
        +print(document_splits: List~Chunk~)
    }
    class SimpleChunker {
        +int target_length
        +__call__(docs: List~str~) List~List~Chunk~~
        +_split(doc: str) List~str~
        +_chunk(sentences: List~str~) List~Chunk~
    }
    class BasicSentencizerChunker {
        +str period_token
        +chunk(tokens: Sequence~str~, vectors: Sequence~Any~) List~List~Any~~
        +__call__ = chunk
    }
    class Token {
        +int id
        +str value
        +tuple~int, int~ offsets
    }
    class Tokenizer {
    }
    Config <|-- config
    BaseChunker <|-- SimpleChunker
    BaseSplitter <|-- BaseChunker
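
A rough usage example inferred from the diagram above (the constructor argument and Chunk fields shown are assumptions; check the simple_chunker module for the actual API):

from github_to_sqlite.simple_chunker import SimpleChunker

chunker = SimpleChunker(target_length=200)  # assumed keyword argument
batches = chunker(["First sentence. Second sentence. Third sentence."])
for chunks in batches:      # one list of Chunk objects per input document
    for chunk in chunks:
        print(chunk.token_count, chunk.content)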

File-Level Changes

Safe sqlite-vec extension loading and unified embedding table creation
  • Add _maybe_load_sqlite_vec with caching and error handling for ImportError, OSError, DatabaseError, and AttributeError
  • Introduce helper functions _create_table_if_missing and _create_virtual_table_if_missing
  • Refactor ensure_embedding_tables to conditionally create vec0 virtual tables with IF NOT EXISTS and fallback to standard tables
  • Ensure all embedding‐related tables (repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata) are created if missing
Files: github_to_sqlite/utils.py
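
A sketch of what the guarded loader might look like (not the exact implementation; the cache key and the sqlite_vec.load call are assumptions):

import sqlite3

_vec_available = {}  # cache keyed by connection, so the extension is only probed once

def _maybe_load_sqlite_vec(db):
    key = id(db.conn)
    if key in _vec_available:
        return _vec_available[key]
    try:
        import sqlite_vec  # ImportError if the package is not installed
        db.conn.enable_load_extension(True)  # AttributeError on builds without extension support
        sqlite_vec.load(db.conn)             # OSError/DatabaseError if the extension fails to load
        db.conn.enable_load_extension(False)
        ok = True
    except (ImportError, OSError, sqlite3.DatabaseError, AttributeError):
        ok = False
    _vec_available[key] = ok
    return ok
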
Refactor build file discovery with normalization and deduplication
  • Extract _post_process_build_files to normalize paths, filter out .git and node_modules, and deduplicate in order
  • Update find_build_files to prefer fd, then find, then os.walk, and pipe results through the post‐processor
Files: github_to_sqlite/utils.py
Fix API pagination and stop_when defaults to avoid AttributeError
  • Replace lambda default in fetch_commits with a proper function definition
  • Use response.links.get("next", {}) to safely handle missing link entries
Files: github_to_sqlite/utils.py
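
An illustrative sketch of the safer pagination loop (fetch_commits itself takes more parameters; this only shows the two fixes named above):

import requests

def paginate(url, headers=None, stop_when=None):
    def never_stop(item):
        return False  # named default instead of a lambda in the signature

    stop_when = stop_when or never_stop
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        for item in response.json():
            if stop_when(item):
                return
            yield item
        # .get("next", {}) tolerates a last page whose Link header has no "next" entry
        url = response.links.get("next", {}).get("url")
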
Enhance CLI with new commands, typing casts, and dependency checks
  • Add starred-embeddings and migrate commands with appropriate options and help text
  • Insert typing casts (cast(sqlite_utils.db.Table, ...)) and import hints for table operations
  • Refactor optional dependency checks for bs4 and unify stop_when function signature
Files: github_to_sqlite/cli.py
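
For example, the cast pattern looks like this (command name from the PR; options and table layout are illustrative):

from typing import cast

import click
import sqlite_utils

@click.command(name="migrate")
@click.argument("db_path")
def migrate(db_path):
    "Initialize optional tables, FTS, and foreign keys."
    db = sqlite_utils.Database(db_path)
    # db["repos"] is typed as Table | View; cast() narrows it for the type checker
    # so Table-only methods check cleanly. It has no runtime effect.
    repos = cast(sqlite_utils.db.Table, db["repos"])
    if not repos.exists():
        repos.create({"id": int, "full_name": str}, pk="id")
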
Update project metadata and CI pipeline
  • Expand setup.py extras to include semantic_chunkers, vector-search-index, GPU, and docs
  • Enhance README.md with migration and starred embeddings sections
  • Improve GitHub Actions (test.yml) to run mypy, ruff, coverage, and docs build
Files: setup.py, README.md, .github/workflows/test.yml


@sourcery-ai sourcery-ai bot left a comment


Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

  • Extract the hard-coded embedding dimension (768) into a single constant or config value so it isn’t duplicated in the vec0 DDL strings and Python code.
  • Refactor ensure_embedding_tables by driving table creation from a schema mapping or list rather than repeating nearly identical blocks for each table and storage mode.
  • Make the build file patterns (pyproject.toml, package.json, Cargo.toml, Gemfile) configurable or injectable rather than hard-coding them in find_build_files for easier future extension.
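
To illustrate the second point, a mapping-driven version could look roughly like this (table names are from the PR; the column sets and fallback types are illustrative):

EMBEDDING_DIM = 768  # single source of truth for the vector size

EMBEDDING_TABLES = {
    "repo_embeddings": ["title_embedding", "description_embedding", "readme_embedding"],
    "readme_chunk_embeddings": ["embedding"],
}

def ensure_embedding_tables(db, using_vec):
    existing = set(db.table_names())
    for name, vector_columns in EMBEDDING_TABLES.items():
        if name in existing:
            continue
        if using_vec:
            columns = ", ".join(f"{col} float[{EMBEDDING_DIM}]" for col in vector_columns)
            db.execute(
                f"CREATE VIRTUAL TABLE IF NOT EXISTS {name} "
                f"USING vec0(repo_id INTEGER PRIMARY KEY, {columns})"
            )
        else:
            # Fallback: a plain table with one BLOB column per embedding.
            db[name].create(
                {"repo_id": int, **{col: bytes for col in vector_columns}},
                pk="repo_id",
            )
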
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the hard-coded embedding dimension (768) into a single constant or config value so it isn’t duplicated in the vec0 DDL strings and Python code.
- Refactor ensure_embedding_tables by driving table creation from a schema mapping or list rather than repeating nearly identical blocks for each table and storage mode.
- Make the build file patterns (pyproject.toml, package.json, Cargo.toml, Gemfile) configurable or injectable rather than hard-coding them in find_build_files for easier future extension.

## Individual Comments

### Comment 1
<location> `github_to_sqlite/tokenization.py:16` </location>
<code_context>
+
+
+
+def load_tokenizer(model: str | None = None) -> Tokenizer:
+    """Load a Hugging Face tokenizer.
+
</code_context>

<issue_to_address>
No error handling if the tokenizer model cannot be loaded.

Catch exceptions from Tokenizer.from_pretrained to provide a clearer error message or fallback if the model cannot be loaded.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

=======
def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

    Args:
        model: The name or path of the pretrained tokenizer.

    Returns:
        Tokenizer: The loaded tokenizer.

    Raises:
        RuntimeError: If the tokenizer cannot be loaded.
    """
    try:
        return Tokenizer.from_pretrained(model)
    except Exception as e:
        raise RuntimeError(
            f"Failed to load tokenizer for model '{model}': {e}"
        ) from e

>>>>>>> REPLACE

</suggested_fix>


Comment on lines +16 to +18
def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.


suggestion: No error handling if the tokenizer model cannot be loaded.

Catch exceptions from Tokenizer.from_pretrained to provide a clearer error message or fallback if the model cannot be loaded.

Suggested change

Before:

def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

After:

def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

    Args:
        model: The name or path of the pretrained tokenizer.

    Returns:
        Tokenizer: The loaded tokenizer.

    Raises:
        RuntimeError: If the tokenizer cannot be loaded.
    """
    try:
        return Tokenizer.from_pretrained(model)
    except Exception as e:
        raise RuntimeError(
            f"Failed to load tokenizer for model '{model}': {e}"
        ) from e

Comment on lines +17 to +18
if pattern == 'package.json':
    raise subprocess.CalledProcessError(1, args)

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation: Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +41 to +42
if cmd == 'find':
    return '/usr/bin/find'

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +48 to +49
if pattern == 'package.json':
    raise subprocess.CalledProcessError(1, args)

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +71 to +72
if "repo_embeddings" not in tables:
db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id")

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +77 to +78
if "repo_metadata" not in tables:
db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


from bs4 import BeautifulSoup

url = "https://github.com/{}/network/dependents".format(repo)
url: str | None = "https://github.com/{}/network/dependents".format(repo)

suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)

Suggested change
url: str | None = "https://github.com/{}/network/dependents".format(repo)
url: str | None = f"https://github.com/{repo}/network/dependents"

Comment on lines +1138 to +1140
for full in pathlib.Path(path).rglob(pattern):
    if full.is_file():
        found.append(str(full))

suggestion (code-quality): Replace a for append loop with list extend (for-append-to-extend)

Suggested change

Before:

for full in pathlib.Path(path).rglob(pattern):
    if full.is_file():
        found.append(str(full))

After:

found.extend(
    str(full)
    for full in pathlib.Path(path).rglob(pattern)
    if full.is_file()
)

return tree


def generate_starred_embeddings(

issue (code-quality): Low code quality found in generate_starred_embeddings - 16% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

calls = []

def fake_which(cmd):
    if cmd == 'find':

issue (code-quality): We've found these issues:


Copilot AI left a comment


Pull Request Overview

This PR enhances embedding support by using CREATE VIRTUAL TABLE IF NOT EXISTS for vec0 tables, hardens sqlite-vec loading to avoid AttributeError, and cleans up build file discovery logic with deduplication and consistent filtering.

  • Introduce helper functions to create virtual tables if missing
  • Improve _maybe_load_sqlite_vec to guard against missing or broken extensions
  • Centralize and post-process build file paths to remove duplicates and ignore unwanted directories

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.

Summary per file:

  • github_to_sqlite/utils.py: Added virtual-table creation and deduplicated find_build_files() logic
  • tests/test_find_build_files.py: New tests covering fd/find/os.walk fallbacks and path normalization
  • tests/test_embedding_tables.py: New test ensuring embedding tables are created via ensure_db_shape
  • setup.py: Updated extras_require to include sqlite-vec, NLP & testing tools
  • docs/migrations.rst: Documented the migrate command for embedding table setup
  • docs/embeddings.rst: Added docs for the starred-embeddings command and table schemas
  • README.md: Extended with sections on migrations, build file detection, embeddings

"""Create tables used for embedding storage if they do not exist."""
using_vec = _maybe_load_sqlite_vec(db)

tables = set(db.table_names())

Copilot AI Jun 28, 2025


After creating each table, consider updating or re-fetching the tables set instead of using a single snapshot. This ensures subsequent checks use the current schema and prevents unnecessary creation attempts.

unique: list[str] = []
seen = set()
for item in found:
    if "/.git/" in item or "/node_modules/" in item:

Copilot AI Jun 28, 2025


Use os.path.sep when checking path segments to ensure the filter works on Windows as well as POSIX. For example, build the pattern with os.path.sep + '.git' + os.path.sep.

Suggested change

Before:
if "/.git/" in item or "/node_modules/" in item:

After:
if os.path.sep + ".git" + os.path.sep in item or os.path.sep + "node_modules" + os.path.sep in item:

Comment on lines +29 to +32
monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', lambda t: ['ok'])
raise LookupError

monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', failing)

Copilot AI Jun 28, 2025


The use of monkeypatch.setattr with a single string path is incorrect here. It’s safer to import the module and do monkeypatch.setattr(simple_chunker, 'sent_tokenize', lambda t: ['ok']) to ensure the stub is applied correctly.

Suggested change

Before:

monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', lambda t: ['ok'])
raise LookupError
monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', failing)

After:

monkeypatch.setattr(sent_tokenize, lambda t: ['ok'])
raise LookupError
monkeypatch.setattr(sent_tokenize, failing)

