Refine virtual table creation and build file discovery #9

Open

JeffCarpenter wants to merge 2 commits into main from
97bx9d-codex/plan-feature-to-infer-embeddings-for-starred-repos

Conversation

@JeffCarpenter (Owner) commented Jun 28, 2025

Summary

  • avoid AttributeError when loading sqlite-vec
  • create vec0 tables using CREATE VIRTUAL TABLE IF NOT EXISTS
  • deduplicate build file filtering logic

Testing

  • pytest -q
  • mypy github_to_sqlite

https://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28

Summary by Sourcery

Refactor virtual table creation and build file discovery, and add a new embeddings feature with supporting utilities and CLI commands.

New Features:

  • Add starred-embeddings CLI command to compute and store embeddings for starred repositories
  • Introduce migrate CLI command to initialize optional tables, FTS, and foreign keys
  • Add find_build_files utility to locate and normalize build definition files using fd/find/os.walk (see the sketch after this list)
  • Add chunk_readme, parse_build_file, directory_tree, and vector_to_blob utilities
  • Introduce simple_chunker and sentencizer_chunker modules for text splitting and chunking
  • Add tokenization module for loading a Hugging Face tokenizer
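
A minimal sketch of the fd → find → os.walk fallback and the shared post-processing described above (helper names come from this PR; the pattern list and exact signatures are assumptions):

import os
import pathlib
import shutil
import subprocess

# Hard-coded for now; see the review feedback below about making this configurable.
BUILD_PATTERNS = ["pyproject.toml", "package.json", "Cargo.toml", "Gemfile"]

def _post_process_build_files(found):
    # Normalize paths, drop .git/node_modules entries, and deduplicate in order.
    unique, seen = [], set()
    for item in found:
        item = os.path.normpath(item)
        if f"{os.sep}.git{os.sep}" in item or f"{os.sep}node_modules{os.sep}" in item:
            continue
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique

def find_build_files(path):
    found = []
    for pattern in BUILD_PATTERNS:
        if shutil.which("fd"):
            out = subprocess.run(["fd", "--type", "f", "--glob", pattern, path],
                                 capture_output=True, text=True, check=True)
            found.extend(out.stdout.splitlines())
        elif shutil.which("find"):
            out = subprocess.run(["find", path, "-type", "f", "-name", pattern],
                                 capture_output=True, text=True, check=True)
            found.extend(out.stdout.splitlines())
        else:
            found.extend(str(p) for p in pathlib.Path(path).rglob(pattern) if p.is_file())
    return _post_process_build_files(found)

The tests below suggest the real implementation also tolerates subprocess failures by falling back to the next strategy; this sketch shows only the happy path.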

Enhancements:

  • Refactor embedding table creation with _create_table_if_missing/_create_virtual_table_if_missing and use CREATE VIRTUAL TABLE IF NOT EXISTS for vec0 (illustrated in the sketch after this list)
  • Improve sqlite-vec extension loading to handle missing or failing extensions gracefully
  • Deduplicate and normalize build file path filtering logic in _post_process_build_files
  • Cast tables to sqlite_utils.Table in CLI to avoid AttributeError
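
For context, the idempotent vec0 creation might look roughly like the sketch below (the helper signature and column layout are assumptions based on this description and the ER diagram further down; the sqlite-vec extension must already be loaded on the connection):

import sqlite_utils

def _create_virtual_table_if_missing(db, name, columns):
    # IF NOT EXISTS makes the migration idempotent: re-running it against an
    # existing database is a no-op rather than an OperationalError.
    db.execute(f"CREATE VIRTUAL TABLE IF NOT EXISTS {name} USING vec0({columns})")

db = sqlite_utils.Database("github.db")  # assumes sqlite-vec has been loaded on db.conn
_create_virtual_table_if_missing(
    db,
    "repo_embeddings",
    "repo_id INTEGER PRIMARY KEY, title_embedding float[768], "
    "description_embedding float[768], readme_embedding float[768]",
)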

CI:

  • Extend CI to run ruff linting, mypy type checks, pytest-cov coverage, and Sphinx doc builds

Documentation:

  • Update README with migration, build file detection, and starred-embeddings usage
  • Add Sphinx docs for migrations and embeddings feature

Tests:

  • Add comprehensive tests for embedding generation, build file discovery, utilities, chunkers, CLI commands, and migration

Copilot AI review requested due to automatic review settings June 28, 2025 21:45

sourcery-ai bot commented Jun 28, 2025

Reviewer's Guide

This PR overhauls how embedding tables and build files are created and discovered by introducing safe sqlite-vec loading with vec0 virtual tables, unifying build file search/filtering, handling API pagination errors, and extending the CLI and CI with new commands and metadata.

Entity relationship diagram for new and updated embedding tables

erDiagram
    repos ||--o{ repo_embeddings : "has"
    repos ||--o{ readme_chunk_embeddings : "has"
    repos ||--o{ repo_build_files : "has"
    repos ||--o{ repo_metadata : "has"

    repo_embeddings {
        int repo_id PK
        float[768] title_embedding
        float[768] description_embedding
        float[768] readme_embedding
    }
    readme_chunk_embeddings {
        int repo_id PK
        int chunk_index PK
        text chunk_text
        float[768] embedding
    }
    repo_build_files {
        int repo_id PK
        string file_path PK
        string metadata
    }
    repo_metadata {
        int repo_id PK
        string language
        string directory_tree
    }
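
The float[768] columns above are vec0 vector columns; when sqlite-vec is unavailable the same values go into plain BLOB columns (the fallback tables shown later in this review use bytes columns), which is where the vector_to_blob utility named in the summary comes in. A minimal sketch, assuming the little-endian float32 layout that sqlite-vec accepts:

import struct

def vector_to_blob(values):
    # One 4-byte little-endian float per dimension: 768 * 4 bytes for a float[768] vector.
    return struct.pack(f"<{len(values)}f", *values)

blob = vector_to_blob([0.0] * 768)
assert len(blob) == 768 * 4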

Class diagram for chunker and config utilities

classDiagram
    class Config {
        +str default_model
        +str onnx_provider
        +int max_length
    }
    class Chunk {
        +str content
        +int token_count
        +bool is_triggered
        +float triggered_score
    }
    class BaseSplitter {
        <<abstract>>
        +__call__(doc: str) List~str~
    }
    class BaseChunker {
        +str name
        +BaseSplitter splitter
        +DenseEncoder encoder
        +__call__(docs: List~str~) List~List~Chunk~~
        +_split(doc: str) List~str~
        +_chunk(splits: List~Any~) List~Chunk~
        +print(document_splits: List~Chunk~)
    }
    class SimpleChunker {
        +int target_length
        +__call__(docs: List~str~) List~List~Chunk~~
        +_split(doc: str) List~str~
        +_chunk(sentences: List~str~) List~Chunk~
    }
    class BasicSentencizerChunker {
        +str period_token
        +chunk(tokens: Sequence~str~, vectors: Sequence~Any~) List~List~Any~~
        +__call__ = chunk
    }
    class Token {
        +int id
        +str value
        +tuple~int, int~ offsets
    }
    class Tokenizer {
    }
    Config <|-- config
    BaseChunker <|-- SimpleChunker
    BaseSplitter <|-- BaseChunker
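
A rough usage example inferred from the diagram above (the constructor argument and Chunk fields shown are assumptions; check the simple_chunker module for the actual API):

from github_to_sqlite.simple_chunker import SimpleChunker

chunker = SimpleChunker(target_length=200)  # assumed keyword argument
batches = chunker(["First sentence. Second sentence. Third sentence."])
for chunks in batches:      # one list of Chunk objects per input document
    for chunk in chunks:
        print(chunk.token_count, chunk.content)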

File-Level Changes

Safe sqlite-vec extension loading and unified embedding table creation
  • Add _maybe_load_sqlite_vec with caching and error handling for ImportError, OSError, DatabaseError, and AttributeError
  • Introduce helper functions _create_table_if_missing and _create_virtual_table_if_missing
  • Refactor ensure_embedding_tables to conditionally create vec0 virtual tables with IF NOT EXISTS and fallback to standard tables
  • Ensure all embedding‐related tables (repo_embeddings, readme_chunk_embeddings, repo_build_files, repo_metadata) are created if missing
Files: github_to_sqlite/utils.py
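
A sketch of what the guarded loader might look like (not the exact implementation; the cache key and the sqlite_vec.load call are assumptions):

import sqlite3

_vec_available = {}  # cache keyed by connection, so the extension is only probed once

def _maybe_load_sqlite_vec(db):
    key = id(db.conn)
    if key in _vec_available:
        return _vec_available[key]
    try:
        import sqlite_vec  # ImportError if the package is not installed
        db.conn.enable_load_extension(True)  # AttributeError on builds without extension support
        sqlite_vec.load(db.conn)             # OSError/DatabaseError if the extension fails to load
        db.conn.enable_load_extension(False)
        ok = True
    except (ImportError, OSError, sqlite3.DatabaseError, AttributeError):
        ok = False
    _vec_available[key] = ok
    return ok
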
Refactor build file discovery with normalization and deduplication
  • Extract _post_process_build_files to normalize paths, filter out .git and node_modules, and deduplicate in order
  • Update find_build_files to prefer fd, then find, then os.walk, and pipe results through the post‐processor
Files: github_to_sqlite/utils.py
Fix API pagination and stop_when defaults to avoid AttributeError
  • Replace lambda default in fetch_commits with a proper function definition
  • Use response.links.get("next", {}) to safely handle missing link entries
Files: github_to_sqlite/utils.py
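
An illustrative sketch of the safer pagination loop (fetch_commits itself takes more parameters; this only shows the two fixes named above):

import requests

def paginate(url, headers=None, stop_when=None):
    def never_stop(item):
        return False  # named default instead of a lambda in the signature

    stop_when = stop_when or never_stop
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        for item in response.json():
            if stop_when(item):
                return
            yield item
        # .get("next", {}) tolerates a last page whose Link header has no "next" entry
        url = response.links.get("next", {}).get("url")
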
Enhance CLI with new commands, typing casts, and dependency checks
  • Add starred-embeddings and migrate commands with appropriate options and help text
  • Insert typing casts (cast(sqlite_utils.db.Table, ...)) and import hints for table operations
  • Refactor optional dependency checks for bs4 and unify stop_when function signature
Files: github_to_sqlite/cli.py
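
For example, the cast pattern looks like this (command name from the PR; options and table layout are illustrative):

from typing import cast

import click
import sqlite_utils

@click.command(name="migrate")
@click.argument("db_path")
def migrate(db_path):
    "Initialize optional tables, FTS, and foreign keys."
    db = sqlite_utils.Database(db_path)
    # db["repos"] is typed as Table | View; cast() narrows it for the type checker
    # so Table-only methods check cleanly. It has no runtime effect.
    repos = cast(sqlite_utils.db.Table, db["repos"])
    if not repos.exists():
        repos.create({"id": int, "full_name": str}, pk="id")
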
Update project metadata and CI pipeline
  • Expand setup.py extras to include semantic_chunkers, vector-search-index, GPU, and docs
  • Enhance README.md with migration and starred embeddings sections
  • Improve GitHub Actions (test.yml) to run mypy, ruff, coverage, and docs build
Files: setup.py, README.md, .github/workflows/test.yml


@sourcery-ai sourcery-ai bot left a comment


Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:

  • Extract the hard-coded embedding dimension (768) into a single constant or config value so it isn’t duplicated in the vec0 DDL strings and Python code.
  • Refactor ensure_embedding_tables by driving table creation from a schema mapping or list rather than repeating nearly identical blocks for each table and storage mode.
  • Make the build file patterns (pyproject.toml, package.json, Cargo.toml, Gemfile) configurable or injectable rather than hard-coding them in find_build_files for easier future extension.
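
To illustrate the second point, a mapping-driven version could look roughly like this (table names are from the PR; the column sets and fallback types are illustrative):

EMBEDDING_DIM = 768  # single source of truth for the vector size

EMBEDDING_TABLES = {
    "repo_embeddings": ["title_embedding", "description_embedding", "readme_embedding"],
    "readme_chunk_embeddings": ["embedding"],
}

def ensure_embedding_tables(db, using_vec):
    existing = set(db.table_names())
    for name, vector_columns in EMBEDDING_TABLES.items():
        if name in existing:
            continue
        if using_vec:
            columns = ", ".join(f"{col} float[{EMBEDDING_DIM}]" for col in vector_columns)
            db.execute(
                f"CREATE VIRTUAL TABLE IF NOT EXISTS {name} "
                f"USING vec0(repo_id INTEGER PRIMARY KEY, {columns})"
            )
        else:
            # Fallback: a plain table with one BLOB column per embedding.
            db[name].create(
                {"repo_id": int, **{col: bytes for col in vector_columns}},
                pk="repo_id",
            )
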
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the hard-coded embedding dimension (768) into a single constant or config value so it isn’t duplicated in the vec0 DDL strings and Python code.
- Refactor ensure_embedding_tables by driving table creation from a schema mapping or list rather than repeating nearly identical blocks for each table and storage mode.
- Make the build file patterns (pyproject.toml, package.json, Cargo.toml, Gemfile) configurable or injectable rather than hard-coding them in find_build_files for easier future extension.

## Individual Comments

### Comment 1
<location> `github_to_sqlite/tokenization.py:16` </location>
<code_context>
+
+
+
+def load_tokenizer(model: str | None = None) -> Tokenizer:
+    """Load a Hugging Face tokenizer.
+
</code_context>

<issue_to_address>
No error handling if the tokenizer model cannot be loaded.

Catch exceptions from Tokenizer.from_pretrained to provide a clearer error message or fallback if the model cannot be loaded.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

=======
def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

    Args:
        model: The name or path of the pretrained tokenizer.

    Returns:
        Tokenizer: The loaded tokenizer.

    Raises:
        RuntimeError: If the tokenizer cannot be loaded.
    """
    try:
        return Tokenizer.from_pretrained(model)
    except Exception as e:
        raise RuntimeError(
            f"Failed to load tokenizer for model '{model}': {e}"
        ) from e

>>>>>>> REPLACE

</suggested_fix>


Comment on lines +16 to +18
def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.


suggestion: No error handling if the tokenizer model cannot be loaded.

Catch exceptions from Tokenizer.from_pretrained to provide a clearer error message or fallback if the model cannot be loaded.

Suggested change

Before:

def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

After:

def load_tokenizer(model: str | None = None) -> Tokenizer:
    """Load a Hugging Face tokenizer.

    Args:
        model: The name or path of the pretrained tokenizer.

    Returns:
        Tokenizer: The loaded tokenizer.

    Raises:
        RuntimeError: If the tokenizer cannot be loaded.
    """
    try:
        return Tokenizer.from_pretrained(model)
    except Exception as e:
        raise RuntimeError(
            f"Failed to load tokenizer for model '{model}': {e}"
        ) from e

Comment on lines +17 to +18
if pattern == 'package.json':
    raise subprocess.CalledProcessError(1, args)

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)

Explanation: Avoid complex code, like conditionals, in test functions.

Google's software engineering guidelines say:
"Clear tests are trivially correct upon inspection"
To achieve that, avoid complex code in tests:

  • loops
  • conditionals

Some ways to fix this:

  • Use parametrized tests to get rid of the loop.
  • Move the complex logic into helpers.
  • Move the complex part into pytest fixtures.

Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.

Software Engineering at Google / Don't Put Logic in Tests

Comment on lines +41 to +42
if cmd == 'find':
    return '/usr/bin/find'

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +48 to +49
if pattern == 'package.json':
    raise subprocess.CalledProcessError(1, args)

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +71 to +72
if "repo_embeddings" not in tables:
db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id")

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


Comment on lines +77 to +78
if "repo_metadata" not in tables:
db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id")

issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)


from bs4 import BeautifulSoup

url = "https://github.com/{}/network/dependents".format(repo)
url: str | None = "https://github.com/{}/network/dependents".format(repo)

suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)

Suggested change
url: str | None = "https://github.com/{}/network/dependents".format(repo)
url: str | None = f"https://github.com/{repo}/network/dependents"

Comment on lines +1138 to +1140
for full in pathlib.Path(path).rglob(pattern):
    if full.is_file():
        found.append(str(full))

suggestion (code-quality): Replace a for append loop with list extend (for-append-to-extend)

Suggested change

Before:

for full in pathlib.Path(path).rglob(pattern):
    if full.is_file():
        found.append(str(full))

After:

found.extend(
    str(full)
    for full in pathlib.Path(path).rglob(pattern)
    if full.is_file()
)

return tree


def generate_starred_embeddings(

issue (code-quality): Low code quality found in generate_starred_embeddings - 16% (low-code-quality)


Explanation: The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

calls = []

def fake_which(cmd):
    if cmd == 'find':

issue (code-quality): We've found these issues:


Copilot AI left a comment


Pull Request Overview

This PR enhances embedding support by using CREATE VIRTUAL TABLE IF NOT EXISTS for vec0 tables, hardens sqlite-vec loading to avoid AttributeError, and cleans up build file discovery logic with deduplication and consistent filtering.

  • Introduce helper functions to create virtual tables if missing
  • Improve _maybe_load_sqlite_vec to guard against missing or broken extensions
  • Centralize and post-process build file paths to remove duplicates and ignore unwanted directories

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.

Summary per file:

  • github_to_sqlite/utils.py: Added virtual-table creation and deduplicated find_build_files() logic
  • tests/test_find_build_files.py: New tests covering fd/find/os.walk fallbacks and path normalization
  • tests/test_embedding_tables.py: New test ensuring embedding tables are created via ensure_db_shape
  • setup.py: Updated extras_require to include sqlite-vec, NLP & testing tools
  • docs/migrations.rst: Documented the migrate command for embedding table setup
  • docs/embeddings.rst: Added docs for the starred-embeddings command and table schemas
  • README.md: Extended with sections on migrations, build file detection, embeddings

"""Create tables used for embedding storage if they do not exist."""
using_vec = _maybe_load_sqlite_vec(db)

tables = set(db.table_names())

Copilot AI Jun 28, 2025


After creating each table, consider updating or re-fetching the tables set instead of using a single snapshot. This ensures subsequent checks use the current schema and prevents unnecessary creation attempts.

unique: list[str] = []
seen = set()
for item in found:
    if "/.git/" in item or "/node_modules/" in item:

Copilot AI Jun 28, 2025


Use os.path.sep when checking path segments to ensure the filter works on Windows as well as POSIX. For example, build the pattern with os.path.sep + '.git' + os.path.sep.

Suggested change

Before:
if "/.git/" in item or "/node_modules/" in item:

After:
if os.path.sep + ".git" + os.path.sep in item or os.path.sep + "node_modules" + os.path.sep in item:

Comment on lines +29 to +32
monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', lambda t: ['ok'])
raise LookupError

monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', failing)

Copilot AI Jun 28, 2025


The use of monkeypatch.setattr with a single string path is incorrect here. It’s safer to import the module and do monkeypatch.setattr(simple_chunker, 'sent_tokenize', lambda t: ['ok']) to ensure the stub is applied correctly.

Suggested change

Before:

monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', lambda t: ['ok'])
raise LookupError
monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', failing)

After:

monkeypatch.setattr(sent_tokenize, lambda t: ['ok'])
raise LookupError
monkeypatch.setattr(sent_tokenize, failing)

