Commit 15110c0

Search: Add scope filtering and improve AI prompt
- Add search scope field: comma-separated folder paths with `!` prefix for excludes, wildcard support in bare names, quoting/escaping for commas
- Ancestor-walk filtering via `parent_id` chain in the in-memory index, with pre-resolved include path IDs and compiled exclude patterns for O(1) per-level checks
- `parse_scope()` parser, `ScopeFilter` with `prepare_scope_filter()`, `parse_search_scope` IPC command
- Scope row in SearchDialog between pattern and filter rows: text input, info `(i)` tooltip with syntax examples, `⌥F` (current folder) and `⌥D` (entire drive) buttons
- Improved AI system prompt: glob-only `*`/`?` limitation (fixes `{a,b}` brace expansion failures), naming convention awareness, category→extension mapping, size inference, macOS screenshot naming, default code exclusions (`node_modules`, `.git`, etc.)
- AI now returns `searchPaths`/`excludeDirs` fields, auto-populating the scope field
- MCP `search` and `ai_search` tools accept optional `scope` parameter
1 parent c4cc26f commit 15110c0
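
The scope syntax listed in the commit message (comma-separated paths, `!` for excludes, quotes or backslashes to protect commas) can be illustrated with a small standalone parser. This is a minimal sketch under stated assumptions, not the `parse_scope()` added by this commit: `ScopeSketch`, `parse_scope_sketch`, and `push_token` are hypothetical names, and `~` expansion and wildcard handling are deliberately left out.

```rust
/// Hypothetical result type; the fields of the real `ParsedScope`
/// are not shown in this diff.
#[derive(Debug, Default, PartialEq)]
struct ScopeSketch {
    include: Vec<String>,
    exclude: Vec<String>,
}

/// Splits the input on commas that are neither quoted nor escaped.
/// A token starting with `!` becomes an exclude; anything else is an include.
fn parse_scope_sketch(input: &str) -> ScopeSketch {
    let mut out = ScopeSketch::default();
    let mut token = String::new();
    let mut in_quotes = false;
    let mut chars = input.chars();

    while let Some(c) = chars.next() {
        match c {
            // A backslash keeps the next character literally (for example `\,`).
            '\\' => {
                if let Some(next) = chars.next() {
                    token.push(next);
                }
            }
            // Double quotes toggle quoting and are not part of the token.
            '"' => in_quotes = !in_quotes,
            // An unquoted comma ends the current token.
            ',' if !in_quotes => push_token(&mut out, &mut token),
            _ => token.push(c),
        }
    }
    push_token(&mut out, &mut token);
    out
}

/// Trims the token, routes it to `include` or `exclude`, then clears it.
fn push_token(out: &mut ScopeSketch, token: &mut String) {
    let trimmed = token.trim();
    if let Some(rest) = trimmed.strip_prefix('!') {
        let rest = rest.trim();
        if !rest.is_empty() {
            out.exclude.push(rest.to_string());
        }
    } else if !trimmed.is_empty() {
        out.include.push(trimmed.to_string());
    }
    token.clear();
}
```

With this sketch, an input such as `~/projects, !node_modules, "a,b"` yields two includes (`~/projects` and `a,b`) and one exclude (`node_modules`). Whitespace at the edges of a quoted token is trimmed here, which a production parser may not want.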

17 files changed, +1180 -68 lines changed


apps/desktop/src-tauri/src/commands/CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ immediately to business-logic modules. No significant logic lives here.
 | `licensing.rs` | Licensing | Status query, activation, expiry, reminder, key validation |
 | `indexing.rs` | Drive index | `start_drive_index`, `stop_drive_index`, `get_index_status`, `get_dir_stats`, `get_dir_stats_batch`, `clear_drive_index`, `set_indexing_enabled`, `get_index_debug_status` (dev-only extended stats). Uses `State<IndexManagerState>`. |
 | `clipboard.rs` | Clipboard file ops | `copy_files_to_clipboard`, `cut_files_to_clipboard`, `read_clipboard_files`, `clear_clipboard_cut_state`. macOS uses NSPasteboard via `clipboard::pasteboard`; non-macOS stubs return errors. |
-| `search.rs` | Drive search | `prepare_search_index`, `search_files`, `release_search_index`, `translate_search_query`. Thin wrappers over `indexing::search` module. Post-filters directory sizes after `fill_directory_sizes`. |
+| `search.rs` | Drive search | `prepare_search_index`, `search_files`, `release_search_index`, `translate_search_query`, `parse_search_scope`. Thin wrappers over `indexing::search` module. Post-filters directory sizes after `fill_directory_sizes`. |
 | `sync_status.rs` | Cloud sync status | `get_sync_status` — macOS delegates to `file_system::sync_status`; non-macOS returns empty map via `#[cfg]` on the function itself (not the module). |
 
 ## Key decisions

apps/desktop/src-tauri/src/commands/search.rs

Lines changed: 124 additions & 11 deletions
@@ -11,7 +11,7 @@ use serde::{Deserialize, Serialize};
 use crate::ai::client::{AiBackend, ChatCompletionOptions};
 use crate::indexing::get_read_pool;
 use crate::indexing::search::{
-    self, DIALOG_OPEN, SEARCH_INDEX, SearchIndexState, SearchQuery, SearchResult, drop_search_index,
+    self, DIALOG_OPEN, ParsedScope, SEARCH_INDEX, SearchIndexState, SearchQuery, SearchResult, drop_search_index,
     fill_directory_sizes, start_backstop_timer, start_idle_timer, touch_activity,
 };
 use crate::indexing::writer::WRITER_GENERATION;
@@ -250,6 +250,12 @@ pub async fn release_search_index() -> Result<(), String> {
     Ok(())
 }
 
+/// Parse a scope string into structured include/exclude data.
+#[tauri::command]
+pub fn parse_search_scope(scope: String) -> ParsedScope {
+    search::parse_scope(&scope)
+}
+
 // ============================================================================
 // AI search query translation
 // ============================================================================
@@ -265,6 +271,8 @@ pub(crate) struct AiSearchQuery {
     pub(crate) modified_after: Option<String>,
     pub(crate) modified_before: Option<String>,
     pub(crate) is_directory: Option<bool>,
+    pub(crate) search_paths: Option<Vec<String>>,
+    pub(crate) exclude_dirs: Option<Vec<String>>,
 }
 
 /// Human-readable field values returned alongside the structured query.
@@ -286,6 +294,8 @@ pub struct TranslatedQuery {
     pub modified_after: Option<u64>,
     pub modified_before: Option<u64>,
     pub is_directory: Option<bool>,
+    pub include_paths: Option<Vec<String>>,
+    pub exclude_dir_names: Option<Vec<String>>,
 }
 
 /// Human-readable values so the frontend can populate filter UI.
@@ -299,6 +309,8 @@ pub struct TranslateDisplay {
     pub modified_after: Option<String>,
     pub modified_before: Option<String>,
     pub is_directory: Option<bool>,
+    pub include_paths: Option<Vec<String>>,
+    pub exclude_dir_names: Option<Vec<String>>,
 }
 
 /// Converts an ISO date string (YYYY-MM-DD) to a unix timestamp (seconds since epoch).
@@ -318,27 +330,44 @@ pub(crate) fn build_search_system_prompt() -> String {
     let format = time::macros::format_description!("[year]-[month]-[day]");
     let today_str = today.format(&format).expect("date format always succeeds");
 
+    let one_year_ago = today.replace_year(today.year() - 1).unwrap_or(today);
+    let one_year_ago_str = one_year_ago.format(&format).expect("date format always succeeds");
+
     format!(
         "You translate natural language file search queries into structured JSON filters.\n\
          \n\
          Return ONLY a JSON object with these optional fields:\n\
-         - \"namePattern\": a filename pattern. Use glob (*, ?) for simple cases, regex for complex ones.\n\
-         - \"patternType\": \"glob\" or \"regex\" — specify which format you used for namePattern.\n\
-         - \"minSize\": size in bytes (e.g., 1048576 for 1 MB)\n\
-         - \"maxSize\": size in bytes\n\
-         - \"modifiedAfter\": ISO date string (e.g., \"2025-01-01\")\n\
-         - \"modifiedBefore\": ISO date string\n\
+         - \"namePattern\": filename pattern (glob or regex)\n\
+         - \"patternType\": \"glob\" or \"regex\"\n\
+         - \"minSize\"/\"maxSize\": size in bytes\n\
+         - \"modifiedAfter\"/\"modifiedBefore\": ISO date (YYYY-MM-DD)\n\
          - \"isDirectory\": true for folders only, false for files only, omit for both\n\
+         - \"searchPaths\": array of paths to search within (for example, [\"~/projects\"])\n\
+         - \"excludeDirs\": array of directory names to exclude (for example, [\"node_modules\", \".git\"])\n\
+         \n\
+         Glob only supports * and ?. For multiple extensions or alternation, use regex.\n\
+         Regex: Rust `regex` crate syntax (no lookahead/lookbehind, no backreferences, \
+         no \\d — use [0-9]). Case-insensitive, unanchored unless you add ^ or $.\n\
          \n\
-         For regex patterns, use Rust `regex` crate syntax (PCRE-like but no lookahead/lookbehind, \
-         no backreferences, no \\d — use [0-9] instead). All regex is case-insensitive and unanchored \
-         (partial match) unless you add ^ or $.\n\
+         Category mapping: \"documents\" → regex for .pdf/.doc/.docx/.txt/.odt/.xls/.xlsx, \
+         \"photos\"/\"images\" → .jpg/.jpeg/.png/.heic/.webp/.gif, \
+         \"videos\" → .mp4/.mov/.avi/.mkv/.webm, \
+         \"music\"/\"audio\" → .mp3/.m4a/.flac/.wav/.ogg/.aac.\n\
+         Size hints: \"big\"/\"large\"/\"huge\" → minSize 100 MB+, \"taking up space\" → minSize 500 MB+.\n\
+         If the user describes their naming convention (\"I name them...\", \"I mark them as...\", \
+         \"tagged with...\"), use that as the filename pattern.\n\
+         For code queries (programming languages, source files), auto-exclude: \
+         excludeDirs: [\"node_modules\", \".git\", \"__pycache__\", \"vendor\", \".venv\", \"target\", \"build\", \"dist\"].\n\
          \n\
          Examples:\n\
          \"large pdfs\" → {{\"namePattern\": \"*.pdf\", \"patternType\": \"glob\", \"minSize\": 10485760}}\n\
          \"quarterly reports\" → {{\"namePattern\": \"(Q[1-4]|quarterly).*\\.pdf\", \"patternType\": \"regex\"}}\n\
-         \"photos from last month\" → {{\"namePattern\": \"*.jpg\", \"patternType\": \"glob\", \"modifiedAfter\": \"2026-02-15\"}}\n\
+         \"photos from last month\" → {{\"namePattern\": \"\\\\.(jpg|jpeg|png|heic|webp|gif)$\", \"patternType\": \"regex\", \"modifiedAfter\": \"2026-02-15\"}}\n\
          \"folders bigger than 1gb\" → {{\"isDirectory\": true, \"minSize\": 1073741824}}\n\
+         \"screenshots from today\" → {{\"namePattern\": \"Screenshot.*\", \"patternType\": \"regex\", \"modifiedAfter\": \"{today_str}\"}}\n\
+         \"invoices I mark as rymd\" → {{\"namePattern\": \"*rymd*\", \"patternType\": \"glob\"}}\n\
+         \"documents older than a year\" → {{\"namePattern\": \"(\\\\.(pdf|doc|docx|txt|odt|xls|xlsx))$\", \"patternType\": \"regex\", \"modifiedBefore\": \"{one_year_ago_str}\"}}\n\
+         \"python files in my projects\" → {{\"namePattern\": \"*.py\", \"patternType\": \"glob\", \"searchPaths\": [\"~/projects\"], \"excludeDirs\": [\"node_modules\", \".git\", \"__pycache__\", \".venv\"]}}\n\
          \n\
          Today's date is {today_str}. Return ONLY the JSON, no explanation."
     )
@@ -372,6 +401,16 @@ pub(crate) fn build_translate_result(ai_query: AiSearchQuery) -> Result<Translat
 
     let pattern_type = ai_query.pattern_type.clone().unwrap_or_else(|| "glob".to_string());
 
+    // Expand ~ in search paths
+    let include_paths = ai_query.search_paths.map(|paths| {
+        paths
+            .into_iter()
+            .map(|p| crate::commands::file_system::expand_tilde(&p))
+            .collect::<Vec<_>>()
+    });
+
+    let exclude_dir_names = ai_query.exclude_dirs.clone();
+
     Ok(TranslateResult {
         query: TranslatedQuery {
             name_pattern: ai_query.name_pattern.clone(),
@@ -381,6 +420,8 @@ pub(crate) fn build_translate_result(ai_query: AiSearchQuery) -> Result<Translat
             modified_after: modified_after_ts,
             modified_before: modified_before_ts,
             is_directory: ai_query.is_directory,
+            include_paths: include_paths.clone(),
+            exclude_dir_names: exclude_dir_names.clone(),
         },
         display: TranslateDisplay {
             name_pattern: ai_query.name_pattern,
@@ -390,6 +431,8 @@ pub(crate) fn build_translate_result(ai_query: AiSearchQuery) -> Result<Translat
             modified_after: ai_query.modified_after,
             modified_before: ai_query.modified_before,
             is_directory: ai_query.is_directory,
+            include_paths,
+            exclude_dir_names,
         },
     })
 }
@@ -565,6 +608,8 @@ mod tests {
                 modified_after: Some(1_735_689_600),
                 modified_before: None,
                 is_directory: None,
+                include_paths: None,
+                exclude_dir_names: None,
             },
             display: TranslateDisplay {
                 name_pattern: Some("*.pdf".to_string()),
@@ -574,6 +619,8 @@ mod tests {
                 modified_after: Some("2025-01-01".to_string()),
                 modified_before: None,
                 is_directory: None,
+                include_paths: None,
+                exclude_dir_names: None,
             },
         };
         let json = serde_json::to_string(&result).unwrap();
@@ -628,6 +675,8 @@ mod tests {
             modified_after: None,
             modified_before: None,
             is_directory: None,
+            search_paths: None,
+            exclude_dirs: None,
         };
         assert!(validate_regex_pattern(&q).is_ok());
     }
@@ -642,6 +691,8 @@ mod tests {
             modified_after: None,
             modified_before: None,
             is_directory: None,
+            search_paths: None,
+            exclude_dirs: None,
         };
         assert!(validate_regex_pattern(&q).is_err());
     }
@@ -656,6 +707,8 @@ mod tests {
             modified_after: None,
             modified_before: None,
             is_directory: None,
+            search_paths: None,
+            exclude_dirs: None,
         };
         // Glob patterns aren't validated as regex
         assert!(validate_regex_pattern(&q).is_ok());
@@ -671,6 +724,8 @@ mod tests {
             modified_after: None,
             modified_before: None,
             is_directory: None,
+            search_paths: None,
+            exclude_dirs: None,
         };
         assert!(validate_regex_pattern(&q).is_ok());
     }
@@ -685,9 +740,67 @@ mod tests {
             modified_after: None,
             modified_before: None,
             is_directory: None,
+            search_paths: None,
+            exclude_dirs: None,
         };
         let result = build_translate_result(q).unwrap();
         assert_eq!(result.query.pattern_type, "regex");
         assert_eq!(result.display.pattern_type.as_deref(), Some("regex"));
     }
+
+    #[test]
+    fn test_ai_search_query_deserialization_with_scope_fields() {
+        let json = r#"{
+            "namePattern": "*.py",
+            "patternType": "glob",
+            "searchPaths": ["~/projects", "~/work"],
+            "excludeDirs": ["node_modules", ".git", "__pycache__"]
+        }"#;
+        let q: AiSearchQuery = serde_json::from_str(json).unwrap();
+        assert_eq!(q.name_pattern.as_deref(), Some("*.py"));
+        let paths = q.search_paths.unwrap();
+        assert_eq!(paths.len(), 2);
+        assert_eq!(paths[0], "~/projects");
+        assert_eq!(paths[1], "~/work");
+        let excludes = q.exclude_dirs.unwrap();
+        assert_eq!(excludes.len(), 3);
+        assert_eq!(excludes[0], "node_modules");
+    }
+
+    #[test]
+    fn test_build_translate_result_with_search_paths_and_excludes() {
+        let q = AiSearchQuery {
+            name_pattern: Some("*.py".to_string()),
+            pattern_type: Some("glob".to_string()),
+            min_size: None,
+            max_size: None,
+            modified_after: None,
+            modified_before: None,
+            is_directory: None,
+            search_paths: Some(vec!["~/projects".to_string()]),
+            exclude_dirs: Some(vec!["node_modules".to_string(), ".git".to_string()]),
+        };
+        let result = build_translate_result(q).unwrap();
+
+        // search_paths should have ~ expanded
+        let paths = result.query.include_paths.unwrap();
+        assert!(!paths[0].starts_with('~'), "~ should be expanded");
+        assert!(paths[0].contains("projects"));
+
+        // exclude_dirs passed through
+        let excludes = result.query.exclude_dir_names.unwrap();
+        assert_eq!(excludes, vec!["node_modules", ".git"]);
+
+        // display should also have the values
+        assert!(result.display.include_paths.is_some());
+        assert!(result.display.exclude_dir_names.is_some());
+    }
+
+    #[test]
+    fn test_build_search_system_prompt_contains_scope_fields() {
+        let prompt = build_search_system_prompt();
+        assert!(prompt.contains("searchPaths"));
+        assert!(prompt.contains("excludeDirs"));
+        assert!(prompt.contains("node_modules"));
+    }
 }

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ Full design: `docs/specs/drive-indexing/plan.md`
 - **reconciler.rs** -- Buffers FSEvents during scan (capped at 500K events; overflow sets `buffer_overflow` flag forcing full rescan), replays after scan completes using event IDs to skip stale events. Processes live events for file creates/removes/modifies using integer-keyed write messages (`UpsertEntryV2`, `DeleteEntryById`, `DeleteSubtreeById`, `PropagateDeltaById`). Resolves filesystem paths to entry IDs via `store::resolve_path()` using a read connection passed by callers. Key functions (`process_fs_event`, `emit_dir_updated`) are `pub(super)` so `mod.rs` can call them directly during cold-start replay. `reconcile_subtree()` handles MustScanSubDirs by diffing filesystem vs DB directory-by-directory instead of delete-then-reinsert, making it safe to interrupt at any point.
 - **firmlinks.rs** -- Parses `/usr/share/firmlinks`, builds prefix map, normalizes paths. Converts `/System/Volumes/Data/Users/foo` to `/Users/foo`.
 - **verifier.rs** -- Per-navigation background readdir diff. On each directory navigation, `trigger_verification()` (called from `streaming.rs` and `operations.rs` after enrichment) is fully fire-and-forget: it spawns a task that acquires the `INDEXING` lock (never blocking the navigation thread), checks dedup/debounce via static `VerifierState` (in-flight set + recent timestamps), then spawns a second async task that: (1) reads DB children via `ReadPool`, (2) reads disk via `read_dir` (filtering through `scanner::should_exclude`), (3) diffs by normalized name, sending `UpsertEntryV2`/`DeleteEntryById`/`DeleteSubtreeById`/`PropagateDeltaById` corrections to the writer. New directories are flushed then scanned via `scan_subtree` with delta propagation. Debounce: 30s per path, max 2 concurrent verifications. Only runs after initial scan is complete (checks `scanning` flag). `invalidate()` clears state on shutdown/clear.
-- **search.rs** -- In-memory search index for whole-drive file search. Lazily loads all entries from the index DB into a `Vec<SearchEntry>` for fast parallel scanning with rayon. Filenames are arena-allocated: all names are concatenated into a single `SearchIndex.names: String` buffer, and each `SearchEntry` stores `name_offset: u32` + `name_len: u16` instead of an owned `String`. During load, `row.get_ref(col).as_str()` borrows directly from SQLite's internal buffer (zero per-row heap allocations), then pushes into the arena. `name_folded` is NOT stored in the search index — instead, the search pattern is NFD-normalized at query time on macOS (APFS filenames are already NFD). `SearchIndex::name(&self, entry)` retrieves a `&str` slice from the arena. `search()` is a pure function: compiles glob/regex patterns, parallel-filters entries, sorts by recency. Global `SEARCH_INDEX` state with `Arc<SearchIndex>`, idle timer (5 min after dialog close), backstop timer (10 min with no activity), and load cancellation via `AtomicBool` checked every 100K rows. `WRITER_GENERATION` in writer.rs tracks mutations; stale indexes are detected on search. IPC commands in `commands/search.rs`: `prepare_search_index` (emits `search-index-ready` event when load completes), `search_files`, `release_search_index`, `translate_search_query` (AI natural language → structured query).
+- **search.rs** -- In-memory search index for whole-drive file search. Lazily loads all entries from the index DB into a `Vec<SearchEntry>` for fast parallel scanning with rayon. Filenames are arena-allocated: all names are concatenated into a single `SearchIndex.names: String` buffer, and each `SearchEntry` stores `name_offset: u32` + `name_len: u16` instead of an owned `String`. During load, `row.get_ref(col).as_str()` borrows directly from SQLite's internal buffer (zero per-row heap allocations), then pushes into the arena. `name_folded` is NOT stored in the search index — instead, the search pattern is NFD-normalized at query time on macOS (APFS filenames are already NFD). `SearchIndex::name(&self, entry)` retrieves a `&str` slice from the arena. `search()` is a pure function: compiles glob/regex patterns, parallel-filters entries, sorts by recency. Global `SEARCH_INDEX` state with `Arc<SearchIndex>`, idle timer (5 min after dialog close), backstop timer (10 min with no activity), and load cancellation via `AtomicBool` checked every 100K rows. `WRITER_GENERATION` in writer.rs tracks mutations; stale indexes are detected on search. Scope filtering: `SearchQuery` accepts optional `include_paths` (absolute paths — search only within these subtrees) and `exclude_dir_names` (directory names/patterns to exclude at any depth). `prepare_scope_filter()` pre-resolves include paths to entry IDs and compiles exclude patterns as regexes. `ScopeFilter::matches()` walks the ancestor chain via `id_to_index` (O(1) per level) after all other filters pass. `parse_scope()` parses a user-typed comma-separated scope string (with quoting, escaping, `~` expansion, `!` excludes) into a `ParsedScope` struct. IPC commands in `commands/search.rs`: `prepare_search_index` (emits `search-index-ready` event when load completes), `search_files`, `release_search_index`, `translate_search_query` (AI natural language → structured query), `parse_search_scope` (scope string → structured `ParsedScope`).
 
 IPC commands in `commands/indexing.rs` -- thin wrappers over `IndexManager` methods.
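
The ancestor-walk filtering described in the `search.rs` note (pre-resolved include IDs, an `id_to_index` lookup per level, excludes that apply at any depth) can be sketched as follows. This is an illustration under stated assumptions rather than the code in this commit: `Entry`, `Index`, and `ScopeFilterSketch` are hypothetical, simplified types, names are owned `String`s instead of arena offsets, and excludes are compared as plain names where the commit compiles them to regexes.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified entry; the real `SearchEntry` stores arena offsets.
struct Entry {
    id: u64,
    parent_id: Option<u64>,
    name: String,
}

/// Hypothetical index: entries plus an id-to-position map for O(1) parent lookups.
struct Index {
    entries: Vec<Entry>,
    id_to_index: HashMap<u64, usize>,
}

impl Index {
    fn new(entries: Vec<Entry>) -> Self {
        let id_to_index: HashMap<u64, usize> =
            entries.iter().enumerate().map(|(i, e)| (e.id, i)).collect();
        Index { entries, id_to_index }
    }
}

/// Hypothetical filter: include roots are already resolved to entry IDs.
struct ScopeFilterSketch {
    /// Empty means "no include restriction" (whole drive).
    include_ids: HashSet<u64>,
    /// Directory names to exclude at any depth (plain names in this sketch).
    exclude_names: HashSet<String>,
}

impl ScopeFilterSketch {
    /// Walks from the entry's parent up to the root. Returns false as soon as
    /// an excluded ancestor is seen; otherwise returns whether an include root
    /// was found on the way (or true if there is no include restriction).
    fn matches(&self, index: &Index, entry: &Entry) -> bool {
        let mut included = self.include_ids.is_empty();
        let mut current = entry.parent_id;
        while let Some(id) = current {
            // A dangling parent ID ends the walk.
            let Some(&pos) = index.id_to_index.get(&id) else { break };
            let ancestor = &index.entries[pos];
            if self.exclude_names.contains(&ancestor.name) {
                return false;
            }
            if self.include_ids.contains(&id) {
                included = true;
            }
            current = ancestor.parent_id;
        }
        included
    }
}
```

The cost per entry is proportional to its depth in the tree, with one hash lookup per level. In this sketch only ancestors are checked, so an entry that is itself an include root or an excluded directory is not treated specially.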