Optimize top n selection with min heaps in scan #38
Replaces the get_top_n_per_category function with an in-place min-heap approach (push_top_n) for efficient top-N selection per category, reducing memory usage and improving performance. Updates scan_files_and_dirs to use min-heaps, refactors related tests, and removes now-unnecessary code. Also fixes a typo in a user message and clarifies SKIP_DIRS usage in config.
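For readers unfamiliar with the pattern, here is a minimal sketch of what an in-place min-heap helper of this kind could look like. The signature and the `(size, path)` tuple layout are assumptions made for illustration; they are not taken from the actual code in zpace/core.py.

```python
import heapq
from typing import List, Tuple

Entry = Tuple[int, str]  # (size_in_bytes, path); layout assumed for illustration


def push_top_n(heap: List[Entry], item: Entry, n: int) -> None:
    """Keep only the n largest items seen so far in `heap` (a min-heap)."""
    if n <= 0:
        return
    if len(heap) < n:
        heapq.heappush(heap, item)
    elif item > heap[0]:
        # heap[0] is the smallest kept item; swap it out in one O(log n) step.
        heapq.heapreplace(heap, item)


heap: List[Entry] = []
for entry in [(10, "a"), (50, "b"), (30, "c"), (70, "d"), (20, "e")]:
    push_top_n(heap, entry, n=3)

# The heap itself is not sorted; sort once at the end for display.
top = sorted(heap, reverse=True)
```

Because the heap never holds more than `n` entries, each push costs at most O(log n) and no intermediate list of all candidates is needed.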
Updated to version 0.4.2. Improved streaming Top-N performance by integrating min-heap filtering during scan, reducing memory usage. Removed unused function, added clarifying comment, and introduced new tests for heap helper and integration.
This is an excellent pull request! The changes provide clear improvements in both performance and code quality, and your detailed explanation in the PR description makes it easy to understand the purpose and methodology behind your modifications.
Pros and Strengths Identified
Constructive Feedback & Suggestions
Overall Feedback
This pull request is outstanding! You've addressed performance bottlenecks effectively by implementing a widely accepted and efficient algorithmic choice (min-heap). The refactoring significantly reduces memory overhead, and the removal of redundant code increases maintainability. Your thorough tests are particularly praiseworthy: they show your attention to detail and commitment to code quality. 🎉 With just minor improvements (some additional comments, edge case test coverage, and input type validation), this PR is practically perfect. Fantastic work, keep up the great engineering! 🚀
Expanded documentation to describe the DEEPEST_SKIP_LEVEL optimization and the new streaming Top-N approach using min-heaps. Updated function signatures and added details for new and modified functions in zpace/core.py to reflect recent performance improvements.


Performance
Replaced heapq.nlargest with in-scan min-heap filtering. Memory usage is now O(categories × top_n) instead of O(files_over_min_size), and large file lists are no longer built.
Code Quality
Removed the unused get_top_n_per_category function (top-N logic now integrated into scan)
Added a clarifying comment on the DEEPEST_SKIP_LEVEL optimization
Added tests for the push_top_n heap helper and top-N integration behavior
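To illustrate the memory claim above, the following sketch contrasts streaming per-category heaps with the collect-then-select approach. The category names, file entries, and `TOP_N` value are hypothetical, and the loop is a simplified stand-in rather than the project's actual scan code.

```python
import heapq
from collections import defaultdict

TOP_N = 2  # assumed value for the example

# Hypothetical scan output: (category, size_in_bytes, path)
scanned = [
    ("video", 900, "a.mp4"),
    ("video", 300, "b.mp4"),
    ("video", 700, "c.mp4"),
    ("docs", 40, "d.pdf"),
    ("docs", 80, "e.pdf"),
    ("docs", 10, "f.pdf"),
]

# Streaming: one bounded min-heap per category, updated as files are seen.
heaps = defaultdict(list)
for category, size, path in scanned:
    heap = heaps[category]
    if len(heap) < TOP_N:
        heapq.heappush(heap, (size, path))
    else:
        # Push the new item, then pop the smallest, so the heap
        # never grows past TOP_N entries.
        heapq.heappushpop(heap, (size, path))

streamed = {cat: sorted(h, reverse=True) for cat, h in heaps.items()}

# Collect-then-select: builds the full per-category lists before choosing.
collected = defaultdict(list)
for category, size, path in scanned:
    collected[category].append((size, path))
batch = {cat: heapq.nlargest(TOP_N, items) for cat, items in collected.items()}
```

Both approaches select the same entries here, but the streaming version holds at most `TOP_N` entries per category at any time, while the batch version keeps every scanned file in memory until selection.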