Refactor utils and improve tokenizer handling#13
Conversation
Reviewer's GuideRefactors utility code into dedicated modules and implements a new embeddings pipeline with CLI commands for starred repositories and migrations, switches to AutoTokenizer, enhances build file discovery and chunking, caches sqlite-vec loading, and updates dependencies, CI, documentation, and tests. Class diagram for new and refactored utility modulesclassDiagram
class Config {
+str default_model
+str onnx_provider
+int max_length
+int embedding_dim
+list[str] build_patterns
}
class Token {
+int id
+str value
+tuple[int, int] offsets
}
class Chunk {
+str content
+int token_count
+bool is_triggered
+float triggered_score
}
class BaseSplitter {
+__call__(doc: str) List[str]
}
class BaseChunker {
+str name
+BaseSplitter splitter
+DenseEncoder encoder
+__call__(docs: List[str]) List[List[Chunk]]
+_split(doc: str) List[str]
+_chunk(splits: List[Any]) List[Chunk]
+print(document_splits: List[Chunk])
}
class SimpleChunker {
+int target_length
+__call__(docs: List[str]) List[List[Chunk]]
+_split(doc: str) List[str]
+_chunk(sentences: List[str]) List[Chunk]
}
class BasicSentencizerChunker {
+str period_token
+chunk(tokens: Sequence[str], vectors: Sequence[Any]) List[List[Any]]
+__call__ = chunk
}
class Tokenizer
class AutoTokenizer
class DenseEncoder
SimpleChunker --|> BaseChunker
BaseChunker o-- BaseSplitter
BaseChunker o-- DenseEncoder
Tokenization "1" --o "*" Token
AutoTokenizer <|-- Tokenizer
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Pull Request Overview
This PR introduces a comprehensive embeddings feature for starred repositories while improving tokenizer handling and refactoring utility code. The changes enable generation and storage of sentence-transformer embeddings for repository metadata, with proper chunking of README content and build file metadata extraction.
- Add
starred-embeddingsCLI command for computing repository embeddings using sentence-transformers - Refactor utilities into focused modules (build files, embeddings, tokenization, chunking)
- Switch to
AutoTokenizerfor broader model compatibility and add better error handling
Reviewed Changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_workflows.py | Remove unused import for ForeignKey |
| tests/test_stargazers.py | Remove unused import for ForeignKey |
| tests/test_repos.py | Remove unused import and duplicate pytest import |
| tests/*.py (new files) | Add comprehensive test coverage for new modules and CLI commands |
| setup.py | Add new dependency groups for testing, vector search, chunking, and documentation |
| mypy.ini | Add mypy configuration for type checking |
| github_to_sqlite/utils.py | Add embedding generation function and refactor imports from new modules |
| github_to_sqlite/*.py (new modules) | Add modular code for tokenization, chunking, build files, and embedding management |
| docs/*.rst | Add Sphinx documentation for embeddings and migrations |
| .github/workflows/test.yml | Update CI to run additional tools and build documentation |
| README.md | Add documentation for new migration and embedding features |
Comments suppressed due to low confidence (1)
| @@ -2,11 +2,9 @@ | |||
| import pytest | |||
There was a problem hiding this comment.
Duplicate import statement for pytest. Remove one of the pytest imports as it's imported on line 2.
github_to_sqlite/cli.py
Outdated
| repos_seen.add(pr_repo_url) | ||
| utils.save_pull_requests(db, [pull_request], pr_repo) | ||
| else: | ||
| from typing import Iterable, Any |
There was a problem hiding this comment.
[nitpick] Import statement inside function body. Consider moving this import to the top of the file with other imports for better code organization.
| from typing import Iterable, Any |
github_to_sqlite/cli.py
Outdated
|
|
||
| def stop_when(commit): | ||
| from typing import Callable | ||
|
|
There was a problem hiding this comment.
[nitpick] Import statement inside function body. Consider moving this import to the top of the file with other imports for better code organization.
github_to_sqlite/simple_chunker.py
Outdated
| Style = _Style | ||
|
|
||
| from typing import TYPE_CHECKING | ||
|
|
There was a problem hiding this comment.
[nitpick] TYPE_CHECKING import should be grouped with other typing imports at the top of the file for better organization.
There was a problem hiding this comment.
Hey @JeffCarpenter - I've reviewed your changes - here's some feedback:
- There are still many print statements sprinkled throughout the core functions—consider switching to the standard logging module so verbosity and error reporting can be controlled more flexibly.
- The utils.py file has grown very large again; think about breaking out remaining HTTP/repo helpers into their own modules to keep each file’s responsibilities focused.
- CLI code is repetitively casting tables via cast(sqlite_utils.db.Table, …); you could introduce a small helper to fetch a typed table instance and DRY up those calls.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- There are still many print statements sprinkled throughout the core functions—consider switching to the standard logging module so verbosity and error reporting can be controlled more flexibly.
- The utils.py file has grown very large again; think about breaking out remaining HTTP/repo helpers into their own modules to keep each file’s responsibilities focused.
- CLI code is repetitively casting tables via cast(sqlite_utils.db.Table, …); you could introduce a small helper to fetch a typed table instance and DRY up those calls.
## Individual Comments
### Comment 1
<location> `github_to_sqlite/simple_chunker.py:135` </location>
<code_context>
+
+ def _chunk(self, sentences: List[str]) -> List[Chunk]:
+ chunks: List[Chunk] = []
+ for i in range(0, len(sentences), self.target_length):
+ piece = sentences[i : i + self.target_length]
+ if len(piece) < self.target_length:
</code_context>
<issue_to_address>
The chunking logic skips the final chunk if it is shorter than target_length.
Yield the final chunk even if it's shorter than target_length to avoid missing content.
</issue_to_address>
### Comment 2
<location> `github_to_sqlite/utils.py:1053` </location>
<code_context>
+ pk=("repo_id", "chunk_index"),
+ )
+
+ for build_path in find_build_files(repo["full_name"], patterns=patterns):
+ metadata = parse_build_file(os.path.join(repo["full_name"], build_path))
+ cast(sqlite_utils.db.Table, db["repo_build_files"]).upsert(
</code_context>
<issue_to_address>
find_build_files is called with repo["full_name"], which may not be a local path.
If repo["full_name"] is a GitHub identifier like 'owner/repo', it may not map to a local directory, which could cause find_build_files to fail. Please verify that a valid local path is used.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
|
||
| def _chunk(self, sentences: List[str]) -> List[Chunk]: | ||
| chunks: List[Chunk] = [] | ||
| for i in range(0, len(sentences), self.target_length): |
There was a problem hiding this comment.
suggestion (bug_risk): The chunking logic skips the final chunk if it is shorter than target_length.
Yield the final chunk even if it's shorter than target_length to avoid missing content.
github_to_sqlite/utils.py
Outdated
| pk=("repo_id", "chunk_index"), | ||
| ) | ||
|
|
||
| for build_path in find_build_files(repo["full_name"], patterns=patterns): |
There was a problem hiding this comment.
issue (bug_risk): find_build_files is called with repo["full_name"], which may not be a local path.
If repo["full_name"] is a GitHub identifier like 'owner/repo', it may not map to a local directory, which could cause find_build_files to fail. Please verify that a valid local path is used.
| if pattern == 'package.json': | ||
| raise subprocess.CalledProcessError(1, args) |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if cmd == 'find': | ||
| return '/usr/bin/find' |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if pattern == 'package.json': | ||
| raise subprocess.CalledProcessError(1, args) |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
| if "repo_metadata" not in tables: | ||
| db["repo_metadata"].create({"repo_id": int, "language": str, "directory_tree": str}, pk="repo_id") |
There was a problem hiding this comment.
issue (code-quality): Avoid conditionals in tests. (no-conditionals-in-tests)
Explanation
Avoid complex code, like conditionals, in test functions.Google's software engineering guidelines says:
"Clear tests are trivially correct upon inspection"
To reach that avoid complex code in tests:
- loops
- conditionals
Some ways to fix this:
- Use parametrized tests to get rid of the loop.
- Move the complex logic into helpers.
- Move the complex part into pytest fixtures.
Complexity is most often introduced in the form of logic. Logic is defined via the imperative parts of programming languages such as operators, loops, and conditionals. When a piece of code contains logic, you need to do a bit of mental computation to determine its result instead of just reading it off of the screen. It doesn't take much logic to make a test more difficult to reason about.
Software Engineering at Google / Don't Put Logic in Tests
github_to_sqlite/build_files.py
Outdated
| for full in pathlib.Path(path).rglob(pattern): | ||
| if full.is_file(): | ||
| found.append(str(full)) |
There was a problem hiding this comment.
suggestion (code-quality): Replace a for append loop with list extend (for-append-to-extend)
| for full in pathlib.Path(path).rglob(pattern): | |
| if full.is_file(): | |
| found.append(str(full)) | |
| found.extend( | |
| str(full) | |
| for full in pathlib.Path(path).rglob(pattern) | |
| if full.is_file() | |
| ) |
github_to_sqlite/utils.py
Outdated
| from bs4 import BeautifulSoup | ||
|
|
||
| url = "https://github.com/{}/network/dependents".format(repo) | ||
| url: str | None = "https://github.com/{}/network/dependents".format(repo) |
There was a problem hiding this comment.
suggestion (code-quality): Replace call to format with f-string (use-fstring-for-formatting)
| url: str | None = "https://github.com/{}/network/dependents".format(repo) | |
| url: str | None = f"https://github.com/{repo}/network/dependents" |
| ) | ||
|
|
||
|
|
||
| def generate_starred_embeddings( |
There was a problem hiding this comment.
issue (code-quality): Low code quality found in generate_starred_embeddings - 15% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%.This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines. - Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.
| calls = [] | ||
|
|
||
| def fake_which(cmd): | ||
| if cmd == 'find': |
There was a problem hiding this comment.
issue (code-quality): We've found these issues:
- Lift code into else after jump in control flow (
reintroduce-else) - Replace if statement with if expression (
assign-if-exp) - Use f-string instead of string concatenation (
use-fstring-for-concatenation)
Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>
Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>
Co-authored-by: JeffCarpenter <178424+JeffCarpenter@users.noreply.github.com>
Summary
AutoTokenizerfor broader compatibility and update testsTesting
ruff check --fix .pytest --cov=github_to_sqlite --cov-branch -qmypy github_to_sqlitehttps://chatgpt.com/codex/tasks/task_e_684789d88fd48326886ef74e111feb28
Summary by Sourcery
Refactor utility code into separate modules and introduce comprehensive support for generating and storing embeddings for starred repositories, along with improved tokenizer and build-file handling, enhanced database setup, updated CI, and documentation.
New Features:
Bug Fixes:
Enhancements:
Build:
CI:
Documentation:
Tests: