Commit ca7cece
Indexing: auto-rescan on FSEvents channel overflow
- Increase `cmdr-fsevent-stream` internal channel from 1,024 to 32,768 batches
- Add `Arc<AtomicBool>` overflow flag set on `try_send` failure (logs once, then silent)
- `DriveWatcher` exposes `overflow_flag()` for passing to async event loop tasks
- Both `run_live_event_loop` and `run_replay_event_loop` check the flag every 1s flush tick — on overflow, emit `WatcherChannelOverflow` rescan notification, drain channel, exit
- Add `WatcherChannelOverflow` variant to `RescanReason` enum and frontend toast map
1 parent b74ed39 commit ca7cece
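The "set a flag on `try_send` failure, log once, then stay silent" behavior described above can be sketched in miniature with std's bounded channel. This is a hypothetical standalone version for illustration, not the vendored cmdr-fsevent-stream code; the `OverflowSender` name is invented:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Wraps a bounded sender; when the channel is full, drops the batch and
/// latches a shared overflow flag, logging only on the first drop.
struct OverflowSender<T> {
    tx: SyncSender<T>,
    overflow: Arc<AtomicBool>,
}

impl<T> OverflowSender<T> {
    fn send(&self, batch: T) {
        if let Err(TrySendError::Full(_)) = self.tx.try_send(batch) {
            // swap returns the previous value: false only on the first overflow
            if !self.overflow.swap(true, Ordering::Relaxed) {
                eprintln!("event channel overflowed; dropping batches until rescan");
            }
        }
    }
}
```

Because the flag only latches (it is never cleared by the sender), a consumer polling it on a timer sees the overflow even if it happened between ticks.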

4 files changed: +108 −8 lines

apps/desktop/src-tauri/src/indexing/CLAUDE.md

Lines changed: 3 additions & 3 deletions
@@ -118,17 +118,17 @@ Key test files are alongside each module (test functions within `#[cfg(test)]` b
 
 **Subtree aggregation uses scoped queries**: `scoped_get_children_stats_by_id` and `scoped_get_child_dir_ids` in `aggregator.rs` use recursive CTEs scoped to the target subtree, not full-table scans. This keeps subtree aggregation O(subtree_size) regardless of total DB size.
 
-**Bounded buffers prevent OOM**: All buffers have capacity limits. Reconciler buffer: 500K events (overflow triggers full rescan). Writer channel: 20K messages (bounded `sync_channel`, backpressure). Replay `affected_paths`: 50K entries (overflow emits full refresh). Replay `pending_rescans`: 1K entries (overflow triggers full rescan). Replay event count: 1M events max (overflow falls back to full scan). Memory watchdog: warns at 8 GB, stops indexing at 16 GB. The index is a disposable cache, so dropping events and rescanning is always safe.
+**Bounded buffers prevent OOM**: All buffers have capacity limits. FSEvents channel: 32K batches (bounded `try_send` in cmdr-fsevent-stream; overflow sets atomic flag, triggers rescan). Reconciler buffer: 500K events (overflow triggers full rescan). Writer channel: 20K messages (bounded `sync_channel`, backpressure). Replay `affected_paths`: 50K entries (overflow emits full refresh). Replay `pending_rescans`: 1K entries (overflow triggers full rescan). Replay event count: 1M events max (overflow falls back to full scan). Memory watchdog: warns at 8 GB, stops indexing at 16 GB. The index is a disposable cache, so dropping events and rescanning is always safe.
 
 **Disposable cache pattern**: The index DB is a cache, not a source of truth. Any corruption or error triggers delete+rebuild. No user-facing errors for DB issues.
 
-**cmdr-fsevent-stream fork (macOS only)**: Our fork of `fsevent-stream` (v0.3.0) provides direct access to FSEvents event IDs, `sinceWhen` replay, and `MustScanSubDirs` flags. Only used on macOS. On Linux, the `notify` crate (inotify backend) provides recursive directory watching with `RecursiveMode::Recursive`.
+**cmdr-fsevent-stream fork (macOS only)**: Vendored in `crates/fsevent-stream/` (forked from `fsevent-stream` v0.3.0). Provides direct access to FSEvents event IDs, `sinceWhen` replay, and `MustScanSubDirs` flags. Only used on macOS. On Linux, the `notify` crate (inotify backend) provides recursive directory watching with `RecursiveMode::Recursive`.
 
 **Linux inotify watch limits**: Default `fs.inotify.max_user_watches` is ~8192. The `notify` crate's recursive mode adds one inotify watch per directory. Power users with large directory trees may hit this limit; the workaround is `sysctl fs.inotify.max_user_watches=524288`. The watcher gracefully handles watch errors without crashing.
 
 **APFS firmlinks**: Scan from `/` only, skip `/System/Volumes/Data`. Normalize all paths via firmlink prefix map so DB lookups work regardless of how the user navigated to a path.
 
-**Rescan notification system (`RescanReason` enum)**: Every code path that falls back to a full rescan emits an `index-rescan-notification` event with a `RescanReason` variant and human-readable details. The frontend maps each reason to a user-friendly toast message. Seven reasons: `StaleIndex` (pre-check gap), `JournalGap` (in-loop gap), `ReplayOverflow` (>1M events), `TooManySubdirRescans` (>1K MustScanSubDirs), `WatcherStartFailed`, `ReconcilerBufferOverflow` (>500K buffered events during scan), `IncompletePreviousScan` (has data but no `scan_completed_at`). The pre-check in `resume_or_scan()` catches stale indexes before starting the FSEvents stream, preventing the cmdr-fsevent-stream channel (1024 capacity, `try_send`) from being overwhelmed.
+**Rescan notification system (`RescanReason` enum)**: Every code path that falls back to a full rescan emits an `index-rescan-notification` event with a `RescanReason` variant and human-readable details. The frontend maps each reason to a user-friendly toast message. Eight reasons: `StaleIndex` (pre-check gap), `JournalGap` (in-loop gap), `ReplayOverflow` (>1M events), `TooManySubdirRescans` (>1K MustScanSubDirs), `WatcherStartFailed`, `ReconcilerBufferOverflow` (>500K buffered events during scan), `IncompletePreviousScan` (has data but no `scan_completed_at`), `WatcherChannelOverflow` (FSEvents channel full, events dropped). The pre-check in `resume_or_scan()` catches stale indexes before starting the FSEvents stream, preventing the cmdr-fsevent-stream channel (32K capacity, `try_send`) from being overwhelmed.
 
 ## Gotchas
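The frontend toast map is keyed by snake_case strings (`incomplete_previous_scan`, `watcher_channel_overflow`), which implies the `RescanReason` variants serialize to snake_case. A hand-rolled sketch of that mapping, assuming the real code derives it automatically (e.g. via serde's `rename_all = "snake_case"`); this standalone version is illustrative only:

```rust
/// Mirror of the eight `RescanReason` variants described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RescanReason {
    StaleIndex,
    JournalGap,
    ReplayOverflow,
    TooManySubdirRescans,
    WatcherStartFailed,
    ReconcilerBufferOverflow,
    IncompletePreviousScan,
    WatcherChannelOverflow,
}

impl RescanReason {
    /// snake_case key the frontend toast map is indexed by.
    fn as_key(self) -> &'static str {
        match self {
            RescanReason::StaleIndex => "stale_index",
            RescanReason::JournalGap => "journal_gap",
            RescanReason::ReplayOverflow => "replay_overflow",
            RescanReason::TooManySubdirRescans => "too_many_subdir_rescans",
            RescanReason::WatcherStartFailed => "watcher_start_failed",
            RescanReason::ReconcilerBufferOverflow => "reconciler_buffer_overflow",
            RescanReason::IncompletePreviousScan => "incomplete_previous_scan",
            RescanReason::WatcherChannelOverflow => "watcher_channel_overflow",
        }
    }
}
```

An exhaustive `match` like this forces the key list to be updated whenever a variant is added, which is exactly the failure mode this commit's new variant exercises.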

apps/desktop/src-tauri/src/indexing/mod.rs

Lines changed: 79 additions & 1 deletion
@@ -273,6 +273,8 @@ pub enum RescanReason {
     ReconcilerBufferOverflow,
     /// Previous scan didn't complete (app crashed or was force-quit).
     IncompletePreviousScan,
+    /// FSEvents channel overflowed — events were dropped.
+    WatcherChannelOverflow,
 }
 
 #[derive(Debug, Clone, Serialize, Deserialize)]
@@ -469,8 +471,10 @@ impl IndexManager {
         let (event_tx, event_rx) = tokio::sync::mpsc::channel(WATCHER_CHANNEL_CAPACITY);
         let current_id = watcher::current_event_id();
 
+        let watcher_overflow: Option<Arc<AtomicBool>>;
         match DriveWatcher::start(&self.volume_root, since_event_id, event_tx) {
             Ok(watcher) => {
+                watcher_overflow = Some(watcher.overflow_flag());
                 self.drive_watcher = Some(watcher);
                 log::debug!("DriveWatcher started for replay (sinceWhen={since_event_id}, current={current_id})");
             }
@@ -523,6 +527,7 @@ impl IndexManager {
             },
             fallback_tx,
             micro_scans,
+            watcher_overflow,
         )
         .await;
 
@@ -597,12 +602,16 @@ impl IndexManager {
         let (event_tx, event_rx) = tokio::sync::mpsc::channel(WATCHER_CHANNEL_CAPACITY);
         let scan_start_event_id = watcher::current_event_id();
 
+        // watcher_overflow is None if the watcher failed to start (non-fatal).
+        let watcher_overflow: Option<Arc<AtomicBool>>;
         match DriveWatcher::start(&self.volume_root, 0, event_tx) {
             Ok(watcher) => {
+                watcher_overflow = Some(watcher.overflow_flag());
                 self.drive_watcher = Some(watcher);
                 log::debug!("DriveWatcher started (scan_start_event_id={scan_start_event_id})");
             }
             Err(e) => {
+                watcher_overflow = None;
                 // Watcher failure is non-fatal: scan works without it, just no live updates
                 log::warn!("Failed to start DriveWatcher (scan will proceed without watcher): {e}");
             }
@@ -665,6 +674,7 @@ impl IndexManager {
         let micro_scans = self.micro_scans.clone();
         let scanning = Arc::clone(&self.scanning);
         let live_event_task_slot = Arc::clone(&self.live_event_task);
+        let watcher_overflow_flag = watcher_overflow;
         tauri::async_runtime::spawn(async move {
             // Wait for scan to complete
             let join_result = tokio::task::spawn_blocking(move || join_handle.join()).await;
@@ -716,6 +726,22 @@ impl IndexManager {
                 );
             }
 
+            // Check if the FSEvents channel overflowed (events dropped
+            // before reaching the forward task). If so, our buffered events
+            // are incomplete — the reconciler replay will miss changes.
+            // We still proceed (the scan data itself is fine), but log a
+            // warning. The live event loop will detect the overflow flag
+            // and trigger a rescan at that point, since a fresh scan is
+            // the only way to recover from dropped events.
+            if let Some(ref flag) = watcher_overflow_flag {
+                if flag.load(Ordering::Relaxed) {
+                    log::info!(
+                        "FSEvents channel overflowed during scan — some watcher \
+                         events were dropped. Live event loop will trigger a rescan."
+                    );
+                }
+            }
+
             // Emit scan-complete first, then start the flushing phase.
             // Order matters: the frontend's scan-complete handler calls
             // resetAggregation(), so the saving_entries event must come
@@ -794,8 +820,13 @@ impl IndexManager {
         // Step 5: Start live event processing loop
         let writer_live = writer.clone();
         let app_live = app.clone();
+        let volume_id_live = volume_id.clone();
+        let overflow_live = watcher_overflow_flag.clone();
         let handle = tauri::async_runtime::spawn(async move {
-            run_live_event_loop(event_rx, reconciler, writer_live, app_live).await;
+            run_live_event_loop(
+                event_rx, reconciler, writer_live, app_live,
+                volume_id_live, overflow_live,
+            ).await;
         });
 
         // Store the handle so shutdown() can wait for it to drain
@@ -1118,6 +1149,8 @@ async fn run_live_event_loop(
     mut reconciler: EventReconciler,
     writer: IndexWriter,
     app: AppHandle,
+    volume_id: String,
+    watcher_overflow: Option<Arc<AtomicBool>>,
 ) {
     log::debug!("Live event processing started");
 
@@ -1177,6 +1210,28 @@
             }
         }
         _ = flush_interval.tick() => {
+            // Check if the FSEvents channel overflowed — events were dropped
+            // between FSEvents and our forward task. The only safe recovery is
+            // a full rescan.
+            if let Some(ref flag) = watcher_overflow {
+                if flag.load(Ordering::Relaxed) {
+                    emit_rescan_notification(
+                        &app,
+                        &volume_id,
+                        RescanReason::WatcherChannelOverflow,
+                        format!(
+                            "The filesystem watcher's event channel overflowed after \
+                             {event_count} live events. Some file changes were lost."
+                        ),
+                    );
+                    // Drain and discard remaining events — they're a partial
+                    // picture and processing them before a rescan is pointless.
+                    event_rx.close();
+                    while event_rx.recv().await.is_some() {}
+                    break;
+                }
+            }
+
             process_live_batch(
                 &mut pending_events, &mut reconciler, &conn,
                 &writer, &mut pending_paths,
@@ -1272,6 +1327,7 @@ async fn run_replay_event_loop(
     config: ReplayConfig,
     fallback_tx: tokio::sync::oneshot::Sender<()>,
     micro_scans: MicroScanManager,
+    watcher_overflow: Option<Arc<AtomicBool>>,
 ) -> Result<(), String> {
     let ReplayConfig {
         volume_id,
@@ -1584,6 +1640,28 @@
             }
         }
         _ = flush_interval.tick() => {
+            // Check if the FSEvents channel overflowed
+            if let Some(ref flag) = watcher_overflow {
+                if flag.load(Ordering::Relaxed) {
+                    emit_rescan_notification(
+                        &app,
+                        &volume_id,
+                        RescanReason::WatcherChannelOverflow,
+                        format!(
+                            "The filesystem watcher's event channel overflowed after \
+                             {event_count} replay + {live_count} live events. Some file \
+                             changes were lost."
+                        ),
+                    );
+                    if let Some(tx) = fallback_tx.take() {
+                        let _ = tx.send(());
+                    }
+                    event_rx.close();
+                    while event_rx.recv().await.is_some() {}
+                    return Ok(());
+                }
+            }
+
             process_live_batch(
                 &mut live_pending_events, &mut reconciler, &conn,
                 &writer, &mut live_pending_paths,
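Both loops end the overflow path the same way: close the channel, drain it dry, then exit so the rescan starts from a clean slate. A synchronous analogue of that drain step using std's mpsc instead of tokio (`drain_pending` is an invented name for illustration):

```rust
use std::sync::mpsc::{sync_channel, Receiver};

/// Discard everything currently buffered in the channel, returning the count.
/// In the real code this runs after `event_rx.close()`, so the async `recv`
/// terminates once in-flight sends finish; with std mpsc we simply empty the
/// queue with non-blocking `try_recv`.
fn drain_pending<T>(rx: &Receiver<T>) -> usize {
    let mut dropped = 0;
    while rx.try_recv().is_ok() {
        dropped += 1;
    }
    dropped
}
```

Draining before exit matters even though the events are discarded: it releases the batches' memory immediately and unblocks any sender still waiting on channel capacity.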

apps/desktop/src-tauri/src/indexing/watcher.rs

Lines changed: 24 additions & 4 deletions
@@ -12,14 +12,13 @@
 //! `DriveWatcher::start` returns `WatcherError::StreamCreate` and
 //! `current_event_id` returns `0`.
 
+use std::sync::Arc;
+use std::sync::atomic::{AtomicBool, Ordering};
+
 #[cfg(target_os = "macos")]
 use std::path::Path;
-#[cfg(any(target_os = "macos", target_os = "linux"))]
-use std::sync::Arc;
 #[cfg(target_os = "macos")]
 use std::sync::atomic::AtomicU64;
-#[cfg(any(target_os = "macos", target_os = "linux"))]
-use std::sync::atomic::{AtomicBool, Ordering};
 #[cfg(target_os = "macos")]
 use std::time::Duration;
 
@@ -109,6 +108,8 @@ pub struct DriveWatcher {
     running: Arc<AtomicBool>,
     /// Last processed event ID (atomically updated as events arrive).
     last_event_id: Arc<AtomicU64>,
+    /// Set to `true` when the FSEvents channel overflows and events are dropped.
+    overflow: Arc<AtomicBool>,
     /// Handle to abort the FSEvents run loop thread.
     handler: Option<EventStreamHandler>,
     /// Task that reads the event stream and forwards events.
@@ -149,6 +150,8 @@ impl DriveWatcher {
         )
         .map_err(WatcherError::Io)?;
 
+        let overflow = event_stream.overflow_flag();
+
         log::debug!("DriveWatcher started on {} (sinceWhen={since_when})", root.display());
 
         // Spawn a task to read the async event stream and forward events.
@@ -178,6 +181,7 @@ impl DriveWatcher {
         Ok(Self {
             running,
             last_event_id,
+            overflow,
             handler: Some(handler),
             forward_task: Some(forward_task),
         })
@@ -212,6 +216,11 @@ impl DriveWatcher {
     pub fn is_running(&self) -> bool {
         self.running.load(Ordering::Relaxed)
     }
+
+    /// Returns a shared handle to the overflow flag for passing to async tasks.
+    pub fn overflow_flag(&self) -> Arc<AtomicBool> {
+        Arc::clone(&self.overflow)
+    }
 }
 
 #[cfg(target_os = "macos")]
@@ -327,6 +336,13 @@ impl DriveWatcher {
     pub fn is_running(&self) -> bool {
         self.running.load(Ordering::Relaxed)
     }
+
+    /// Returns a shared handle to the overflow flag for passing to async tasks.
+    /// Linux never overflows (backpressure via `blocking_send`), but the API is
+    /// cross-platform.
+    pub fn overflow_flag(&self) -> Arc<AtomicBool> {
+        Arc::new(AtomicBool::new(false))
+    }
 }
 
 #[cfg(target_os = "linux")]
@@ -434,6 +450,10 @@ impl DriveWatcher {
     pub fn is_running(&self) -> bool {
         false
     }
+
+    pub fn overflow_flag(&self) -> Arc<AtomicBool> {
+        Arc::new(AtomicBool::new(false))
+    }
 }
 
 // ── Helpers ──────────────────────────────────────────────────────────
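On macOS, `overflow_flag()` hands out clones of one shared `Arc<AtomicBool>`, so a store from the watcher's forward task is visible to the event loops holding clones. A minimal sketch of that handoff (hypothetical `Watcher` struct, not the real `DriveWatcher`):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

/// The watcher owns the flag; async tasks hold clones of the same
/// allocation, so one store is observed by every holder.
struct Watcher {
    overflow: Arc<AtomicBool>,
}

impl Watcher {
    /// Cheap clone of the shared handle (bumps the Arc refcount only).
    fn overflow_flag(&self) -> Arc<AtomicBool> {
        Arc::clone(&self.overflow)
    }
}
```

Note that the Linux and fallback stubs in the diff return a fresh `Arc` on each call rather than a shared one; that works only because those platforms never set the flag, so every handle stays permanently `false`.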

apps/desktop/src/lib/indexing/index-state.svelte.ts

Lines changed: 2 additions & 0 deletions
@@ -79,6 +79,8 @@ const rescanReasonToMessage: Record<string, string> = {
     'Heavy filesystem activity overwhelmed the event buffer. Running a fresh scan to stay accurate.',
   incomplete_previous_scan:
     "The previous scan didn't finish (the app may have been closed). Restarting the scan from scratch.",
+  watcher_channel_overflow:
+    'A burst of filesystem activity overflowed the watcher channel. Running a fresh scan to stay accurate.',
 }
 
 // Event listener cleanup handles
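Since `rescanReasonToMessage` is keyed by backend-supplied strings, a reason emitted by a newer backend than the frontend would yield `undefined` on direct indexing. A defensive lookup sketch (hypothetical `toastMessage` helper with a trimmed map; not part of this diff):

```typescript
// Trimmed copy of the map for illustration; the real one has eight entries.
const rescanReasonToMessage: Record<string, string> = {
  watcher_channel_overflow:
    'A burst of filesystem activity overflowed the watcher channel. Running a fresh scan to stay accurate.',
};

// Fall back to a generic message for reason strings the frontend doesn't know yet.
function toastMessage(reason: string): string {
  return (
    rescanReasonToMessage[reason] ??
    'The index needed a fresh scan to stay accurate.'
  );
}
```

The fallback keeps toasts working across version skew between the Rust backend and the bundled frontend.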
