Commit aff2046

Indexing: truncate before every full scan
Move the truncate step from the `StaleIndex` path into `start_scan()` so that every full-scan caller gets it. Orphaned subtrees from `INSERT OR REPLACE` accumulate on every scan cycle, not just on `StaleIndex` rescans.
1 parent 96323e9 commit aff2046

2 files changed: +15 -12 lines changed

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 4 additions & 3 deletions
@@ -33,13 +33,14 @@ App startup
 |-- start_indexing(): create IndexManager, open SQLite, spawn writer thread
 |-- resume_or_scan():
 | |-- macOS: Has existing index + last_event_id?
-| | |-- Pre-check: event gap > 1M? -> emit index-rescan-notification (StaleIndex), truncate entries+dir_stats, full scan
+| | |-- Pre-check: event gap > 1M? -> emit index-rescan-notification (StaleIndex), full scan
 | | |-- Otherwise -> sinceWhen replay (FSEvents journal)
 | |-- Linux: Always full rescan (no event journal; existing DB used for instant enrichment)
 | |-- Incomplete previous scan (has data but no scan_completed_at)? -> notify + fresh scan
 | |-- Otherwise -> fresh full scan
 |
-Full scan:
+Full scan (start_scan):
+|-- Truncate entries + dir_stats (TruncateData + flush_blocking)
 |-- Start DriveWatcher (sinceWhen=0, buffers events)
 |-- ScanContext initialized: root -> ROOT_ID, next_id from DB
 |-- jwalk parallel walk -> ScanContext assigns IDs -> batched InsertEntriesV2 -> writer -> SQLite
@@ -125,7 +126,7 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 ## Gotchas
 
-**INSERT OR REPLACE on a populated DB is catastrophically slow**: The `platform_case` collation (NFD + case fold on macOS) runs for every B-tree comparison during unique index lookups. On an empty DB a full scan takes ~2.5 min; on a populated DB with 5.5M entries the same scan takes ~30 min because each `INSERT OR REPLACE` triggers ~20 collation calls to traverse the B-tree. The `StaleIndex` path truncates `entries` and `dir_stats` via `TruncateData` + `flush_blocking()` before starting the scan to avoid this. Never do a full rescan into a populated DB without clearing first.
+**INSERT OR REPLACE on a populated DB is catastrophically slow**: The `platform_case` collation (NFD + case fold on macOS) runs for every B-tree comparison during unique index lookups. On an empty DB a full scan takes ~2.5 min; on a populated DB with 5.5M entries the same scan takes ~30 min because each `INSERT OR REPLACE` triggers ~20 collation calls to traverse the B-tree. `start_scan()` truncates `entries` and `dir_stats` via `TruncateData` + `flush_blocking()` before every scan to avoid this. Additionally, without truncation, old rows accumulate as orphaned subtrees (3-4x DB bloat per scan cycle) because `INSERT OR REPLACE` only deduplicates at the root level.
 
 **Cold-start replay enters live mode immediately after flush**: The `run_replay_event_loop` doesn't emit `index-dir-updated` during Phase 1 (replay). It collects affected paths, flushes the writer (ensuring all writes are committed), emits a single batched notification, re-enables micro-scans, and enters live mode right away (~100ms from startup). Post-replay verification (`verify_affected_dirs`) runs in a background task (`run_background_verification`) concurrently with live events. This is safe because the writer serializes all writes. Any corrections found by verification are emitted as a separate `index-dir-updated` batch.
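
The orphaned-subtree effect named in the gotcha above can be illustrated with a small in-memory sketch. It rests on assumptions the diff does not confirm: that `entries` has a unique key on `(parent_id, name)`, and that each scan assigns fresh IDs to everything except the root. `Table`, `insert_or_replace`, and `scan` are hypothetical stand-ins, not the project's code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the `entries` table: a unique key on
/// (parent_id, name) mapping to the row's own id. The schema is an assumption.
#[derive(Default)]
struct Table {
    rows: HashMap<(u64, String), u64>,
}

const ROOT_ID: u64 = 1;

impl Table {
    /// Mimics INSERT OR REPLACE: a row with the same unique key is
    /// overwritten; rows with any other key are left untouched.
    fn insert_or_replace(&mut self, parent_id: u64, name: &str, id: u64) {
        self.rows.insert((parent_id, name.to_string()), id);
    }

    /// Mimics truncating the table before a scan.
    fn truncate(&mut self) {
        self.rows.clear();
    }

    /// Simulates one full scan of the path /a/b/c, assigning fresh ids
    /// starting at `next_id`. Returns the next unused id.
    fn scan(&mut self, mut next_id: u64) -> u64 {
        let mut parent = ROOT_ID;
        for name in ["a", "b", "c"] {
            let id = next_id;
            next_id += 1;
            self.insert_or_replace(parent, name, id);
            parent = id;
        }
        next_id
    }
}

fn main() {
    // Two scans without truncation: only the root-level row (ROOT_ID, "a")
    // collides with its old key. "b" and "c" get new parent ids, so the old
    // rows survive as an unreachable subtree.
    let mut table = Table::default();
    let next = table.scan(2);
    table.scan(next);
    println!("rows without truncate: {}", table.rows.len());

    // Two scans with truncation in between: only the current tree remains.
    let mut table = Table::default();
    let next = table.scan(2);
    table.truncate();
    table.scan(next);
    println!("rows with truncate: {}", table.rows.len());
}
```

In this toy model the second scan leaves five rows instead of three. The real growth factor depends on tree shape, which is why the 3-4x figure in the gotcha should be read as the author's measurement rather than something this sketch reproduces.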

apps/desktop/src-tauri/src/indexing/mod.rs

Lines changed: 11 additions & 9 deletions
@@ -432,15 +432,6 @@ impl IndexManager {
 The app likely hasn't run for a long time."
 ),
 );
-// Truncate entries + dir_stats before scanning. INSERT OR REPLACE on a
-// populated DB with the `platform_case` collation is extremely slow
-// (30 min vs 2.5 min on empty). The stale data is useless anyway.
-if let Err(e) = self.writer.send(WriteMessage::TruncateData) {
-    log::warn!("Failed to send TruncateData: {e}");
-}
-if let Err(e) = self.writer.flush_blocking() {
-    log::warn!("Failed to flush after TruncateData: {e}");
-}
 return self.start_scan();
 }

@@ -588,6 +579,17 @@ impl IndexManager {
 return Err("Scan already running".to_string());
 }
 
+// Step 0: Truncate entries + dir_stats so the scan inserts into an empty DB.
+// Without this, INSERT OR REPLACE on a populated table with the `platform_case`
+// collation is ~12x slower (30 min vs 2.5 min), and old rows with stale IDs
+// accumulate as orphaned subtrees, bloating the DB 3-4x per scan cycle.
+if let Err(e) = self.writer.send(WriteMessage::TruncateData) {
+    log::warn!("Failed to send TruncateData: {e}");
+}
+if let Err(e) = self.writer.flush_blocking() {
+    log::warn!("Failed to flush after TruncateData: {e}");
+}
+
 // Step 1: Start the FSEvents watcher BEFORE the scan so we don't miss events
 let (event_tx, event_rx) = tokio::sync::mpsc::channel(WATCHER_CHANNEL_CAPACITY);
 let scan_start_event_id = watcher::current_event_id();
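
The "send `TruncateData`, then `flush_blocking()`" step added in this hunk relies on the writer thread applying messages in order. A minimal, std-only sketch of that pattern follows. Only the names `WriteMessage::TruncateData`, `send`, and `flush_blocking` appear in the diff; the `Insert` and `Flush` variants, the `Writer` type, the ack-channel mechanism, and `run_demo` are assumptions made for illustration and may differ from the real implementation.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Hypothetical message set; only `TruncateData` is named in the diff.
enum WriteMessage {
    Insert(String),
    TruncateData,
    /// Carries a sender that the writer signals once every message queued
    /// before it has been applied (assumed flush mechanism).
    Flush(Sender<()>),
}

struct Writer {
    tx: Sender<WriteMessage>,
}

impl Writer {
    /// Spawns a writer thread that owns the "table" and applies messages
    /// strictly in the order they were sent.
    fn spawn() -> (Writer, thread::JoinHandle<Vec<String>>) {
        let (tx, rx) = mpsc::channel::<WriteMessage>();
        let handle = thread::spawn(move || {
            let mut table: Vec<String> = Vec::new();
            for msg in rx {
                match msg {
                    WriteMessage::Insert(row) => table.push(row),
                    WriteMessage::TruncateData => table.clear(),
                    WriteMessage::Flush(ack) => {
                        // The caller may have gone away; ignore that case.
                        let _ = ack.send(());
                    }
                }
            }
            table
        });
        (Writer { tx }, handle)
    }

    fn send(&self, msg: WriteMessage) -> Result<(), String> {
        self.tx.send(msg).map_err(|e| e.to_string())
    }

    /// Blocks until the writer has processed everything queued so far.
    fn flush_blocking(&self) -> Result<(), String> {
        let (ack_tx, ack_rx) = mpsc::channel();
        self.send(WriteMessage::Flush(ack_tx))?;
        ack_rx.recv().map_err(|e| e.to_string())
    }
}

/// Leaves a stale row in the table, then runs the "Step 0" sequence before
/// inserting fresh data. Returns the final table contents.
fn run_demo() -> Vec<String> {
    let (writer, handle) = Writer::spawn();
    writer.send(WriteMessage::Insert("stale row".into())).unwrap();

    // Step 0: truncate, then wait until the truncate has been applied.
    writer.send(WriteMessage::TruncateData).unwrap();
    writer.flush_blocking().unwrap();

    writer.send(WriteMessage::Insert("fresh row".into())).unwrap();
    drop(writer); // closing the channel ends the writer loop
    handle.join().unwrap()
}

fn main() {
    println!("{:?}", run_demo());
}
```

Because a single thread drains the channel, the flush acknowledgement can only arrive after the truncate has run. That is the property Step 0 appears to depend on, so the scan never inserts into a populated table.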
