Refine virtual table creation and build file discovery#9
Refine virtual table creation and build file discovery#9JeffCarpenter wants to merge 2 commits intomainfrom
Conversation
Reviewer's GuideThis PR overhauls how embedding tables and build files are created and discovered by introducing safe sqlite-vec loading with vec0 virtual tables, unifying build file search/filtering, handling API pagination errors, and extending the CLI and CI with new commands and metadata. Entity relationship diagram for new and updated embedding tableserDiagram
repos ||--o{ repo_embeddings : "has"
repos ||--o{ readme_chunk_embeddings : "has"
repos ||--o{ repo_build_files : "has"
repos ||--o{ repo_metadata : "has"
repo_embeddings {
int repo_id PK
float[768] title_embedding
float[768] description_embedding
float[768] readme_embedding
}
readme_chunk_embeddings {
int repo_id PK
int chunk_index PK
text chunk_text
float[768] embedding
}
repo_build_files {
int repo_id PK
string file_path PK
string metadata
}
repo_metadata {
int repo_id PK
string language
string directory_tree
}
Class diagram for chunker and config utilitiesclassDiagram
class Config {
+str default_model
+str onnx_provider
+int max_length
}
class Chunk {
+str content
+int token_count
+bool is_triggered
+float triggered_score
}
class BaseSplitter {
<<abstract>>
+__call__(doc: str) List~str~
}
class BaseChunker {
+str name
+BaseSplitter splitter
+DenseEncoder encoder
+__call__(docs: List~str~) List~List~Chunk~~
+_split(doc: str) List~str~
+_chunk(splits: List~Any~) List~Chunk~
+print(document_splits: List~Chunk~)
}
class SimpleChunker {
+int target_length
+__call__(docs: List~str~) List~List~Chunk~~
+_split(doc: str) List~str~
+_chunk(sentences: List~str~) List~Chunk~
}
class BasicSentencizerChunker {
+str period_token
+chunk(tokens: Sequence~str~, vectors: Sequence~Any~) List~List~Any~~
+__call__ = chunk
}
class Token {
+int id
+str value
+tuple~int, int~ offsets
}
class Tokenizer {
}
Config <|-- config
BaseChunker <|-- SimpleChunker
BaseSplitter <|-- BaseChunker
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:
- Extract the hard-coded embedding dimension (768) into a single constant or config value so it isn’t duplicated in the vec0 DDL strings and Python code.
- Refactor ensure_embedding_tables by driving table creation from a schema mapping or list rather than repeating nearly identical blocks for each table and storage mode.
- Make the build file patterns (pyproject.toml, package.json, Cargo.toml, Gemfile) configurable or injectable rather than hard-coding them in find_build_files for easier future extension.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the hard-coded embedding dimension (768) into a single constant or config value so it isn’t duplicated in the vec0 DDL strings and Python code.
- Refactor ensure_embedding_tables by driving table creation from a schema mapping or list rather than repeating nearly identical blocks for each table and storage mode.
- Make the build file patterns (pyproject.toml, package.json, Cargo.toml, Gemfile) configurable or injectable rather than hard-coding them in find_build_files for easier future extension.
## Individual Comments
### Comment 1
<location> `github_to_sqlite/tokenization.py:16` </location>
<code_context>
+
+
+
+def load_tokenizer(model: str | None = None) -> Tokenizer:
+ """Load a Hugging Face tokenizer.
+
</code_context>
<issue_to_address>
No error handling if the tokenizer model cannot be loaded.
Catch exceptions from Tokenizer.from_pretrained to provide a clearer error message or fallback if the model cannot be loaded.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def load_tokenizer(model: str | None = None) -> Tokenizer:
"""Load a Hugging Face tokenizer.
=======
def load_tokenizer(model: str | None = None) -> Tokenizer:
"""Load a Hugging Face tokenizer.
Args:
model: The name or path of the pretrained tokenizer.
Returns:
Tokenizer: The loaded tokenizer.
Raises:
RuntimeError: If the tokenizer cannot be loaded.
"""
try:
return Tokenizer.from_pretrained(model)
except Exception as e:
raise RuntimeError(
f"Failed to load tokenizer for model '{model}': {e}"
) from e
>>>>>>> REPLACE
</suggested_fix>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def load_tokenizer(model: str | None = None) -> Tokenizer: | ||
| """Load a Hugging Face tokenizer. | ||
|
|
There was a problem hiding this comment.
suggestion: No error handling if the tokenizer model cannot be loaded.
Catch exceptions from Tokenizer.from_pretrained to provide a clearer error message or fallback if the model cannot be loaded.
| def load_tokenizer(model: str | None = None) -> Tokenizer: | |
| """Load a Hugging Face tokenizer. | |
| def load_tokenizer(model: str | None = None) -> Tokenizer: | |
| """Load a Hugging Face tokenizer. | |
| Args: | |
| model: The name or path of the pretrained tokenizer. | |
| Returns: | |
| Tokenizer: The loaded tokenizer. | |
| Raises: | |
| RuntimeError: If the tokenizer cannot be loaded. | |
| """ | |
| try: | |
| return Tokenizer.from_pretrained(model) | |
| except Exception as e: | |
| raise RuntimeError( | |
| f"Failed to load tokenizer for model '{model}': {e}" | |
| ) from e | |
| if pattern == 'package.json': | ||
| raise subprocess.CalledProcessError(1, args) |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if cmd == 'find': | ||
| return '/usr/bin/find' |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if pattern == 'package.json': | ||
| raise subprocess.CalledProcessError(1, args) |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_embeddings" not in tables: | ||
| db["repo_embeddings"].create({"repo_id": int, "title_embedding": bytes, "description_embedding": bytes, "readme_embedding": bytes}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_metadata" not in tables: | ||
| db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| from bs4 import BeautifulSoup | ||
|
|
||
| url = "https://github.com/{}/network/dependents".format(repo) | ||
| url: str | None = "https://github.com/{}/network/dependents".format(repo) |
There was a problem hiding this comment.
suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)
| url: str | None = "https://github.com/{}/network/dependents".format(repo) | |
| url: str | None = f"https://github.com/{repo}/network/dependents" |
| for full in pathlib.Path(path).rglob(pattern): | ||
| if full.is_file(): | ||
| found.append(str(full)) |
There was a problem hiding this comment.
suggestion (code-quality): Replace a for append loop with list extend (for-append-to-extend)
| for full in pathlib.Path(path).rglob(pattern): | |
| if full.is_file(): | |
| found.append(str(full)) | |
| found.extend( | |
| str(full) | |
| for full in pathlib.Path(path).rglob(pattern) | |
| if full.is_file() | |
| ) |
| return tree | ||
|
|
||
|
|
||
| def generate_starred_embeddings( |
There was a problem hiding this comment.
issue (code-quality): Low code quality found in generate_starred_embeddings - 16% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%.This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines. - Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.
| calls = [] | ||
|
|
||
| def fake_which(cmd): | ||
| if cmd == 'find': |
There was a problem hiding this comment.
issue (code-quality): We've found these issues:
- Lift code into else after jump in control flow (
reintroduce-else) - Replace if statement with if expression (
assign-if-exp) - Use f-string instead of string concatenation (
use-fstring-for-concatenation)
There was a problem hiding this comment.
Pull Request Overview
This PR enhances embedding support by using CREATE VIRTUAL TABLE IF NOT EXISTS for vec0 tables, hardens sqlite-vec loading to avoid AttributeError, and cleans up build file discovery logic with deduplication and consistent filtering.
- Introduce helper functions to create virtual tables if missing
- Improve
_maybe_load_sqlite_vecto guard against missing or broken extensions - Centralize and post-process build file paths to remove duplicates and ignore unwanted directories
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| github_to_sqlite/utils.py | Added virtual-table creation and deduplicated find_build_files() logic |
| tests/test_find_build_files.py | New tests covering fd/find/os.walk fallbacks and path normalization |
| tests/test_embedding_tables.py | New test ensuring embedding tables are created via ensure_db_shape |
| setup.py | Updated extras_require to include sqlite-vec, NLP & testing tools |
| docs/migrations.rst | Documented the migrate command for embedding table setup |
| docs/embeddings.rst | Added docs for the starred-embeddings command and table schemas |
| README.md | Extended with sections on migrations, build file detection, embeddings |
| """Create tables used for embedding storage if they do not exist.""" | ||
| using_vec = _maybe_load_sqlite_vec(db) | ||
|
|
||
| tables = set(db.table_names()) |
There was a problem hiding this comment.
After creating each table, consider updating or re-fetching the tables set instead of using a single snapshot. This ensures subsequent checks use the current schema and prevents unnecessary creation attempts.
| unique: list[str] = [] | ||
| seen = set() | ||
| for item in found: | ||
| if "/.git/" in item or "/node_modules/" in item: |
There was a problem hiding this comment.
Use os.path.sep when checking path segments to ensure the filter works on Windows as well as POSIX. For example, build the pattern with os.path.sep + '.git' + os.path.sep.
| if "/.git/" in item or "/node_modules/" in item: | |
| if os.path.sep + ".git" + os.path.sep in item or os.path.sep + "node_modules" + os.path.sep in item: |
| monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', lambda t: ['ok']) | ||
| raise LookupError | ||
|
|
||
| monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', failing) |
There was a problem hiding this comment.
The use of monkeypatch.setattr with a single string path is incorrect here. It’s safer to import the module and do monkeypatch.setattr(simple_chunker, 'sent_tokenize', lambda t: ['ok']) to ensure the stub is applied correctly.
| monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', lambda t: ['ok']) | |
| raise LookupError | |
| monkeypatch.setattr('github_to_sqlite.simple_chunker.sent_tokenize', failing) | |
| monkeypatch.setattr(sent_tokenize, lambda t: ['ok']) | |
| raise LookupError | |
| monkeypatch.setattr(sent_tokenize, failing) |
…s-for-starred-repos
Summary
sqlite-vecCREATE VIRTUAL TABLE IF NOT EXISTSTesting
pytest -qmypy github_to_sqlitehttps://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28
Summary by Sourcery
Refactor virtual table creation and build file discovery, and add a new embeddings feature with supporting utilities and CLI commands.
New Features:
Enhancements:
CI:
Documentation:
Tests: