Skip to content

Optimize top n selection with min heaps in scan#38

Merged
AzisK merged 3 commits intomainfrom
Optimize-top-N-selection-with-min-heaps-in-scan
Jan 17, 2026
Merged

Optimize top n selection with min heaps in scan#38
AzisK merged 3 commits intomainfrom
Optimize-top-N-selection-with-min-heaps-in-scan

Conversation

@AzisK
Copy link
Owner

@AzisK AzisK commented Jan 17, 2026

Performance

  • Streaming Top-N: Replaced post-scan heapq.nlargest with in-scan min-heap filtering. Memory usage is now O(categories × top_n) instead of O(files_over_min_size), and large file lists are no longer built.

Code Quality

  • Removed unused get_top_n_per_category function (top-N logic now integrated into scan)
  • Added clarifying comment for DEEPEST_SKIP_LEVEL optimization
  • Added tests for push_top_n heap helper and top-N integration behavior

AzisK added 2 commits January 17, 2026 23:34
Replaces the get_top_n_per_category function with an in-place min-heap approach (push_top_n) for efficient top-N selection per category, reducing memory usage and improving performance. Updates scan_files_and_dirs to use min-heaps, refactors related tests, and removes now-unnecessary code. Also fixes a typo in a user message and clarifies SKIP_DIRS usage in config.
Updated to version 0.4.2. Improved streaming Top-N performance by integrating min-heap filtering during scan, reducing memory usage. Removed unused function, added clarifying comment, and introduced new tests for heap helper and integration.
@github-actions
Copy link

⸜(。˃ ᵕ ˂ )⸝♡ Thank you for opening this Pull Request, AzisK!

( ˶°ㅁ°) !! It's Trivia Time!

Here are 3 trivia questions to keep you entertained while CI runs.
(Feel free to demonstrate your knowledge and reply!)

🧩 Q1: What is the name of the villian in the 2015 Russian-American Sci-Fi Movie "Hardcore Henry"?

A) Akan
B) Henry
C) Jimmy
D) Estelle

🧩 Q2: Who had a 1981 hit with the song "Japanese Boy"?

A) Aneka
B) Madonna
C) Sandra
D) Toyah

🧩 Q3: What three movies, in order from release date, make up the "Dollars Trilogy"?

A) "The Good, the Bad, and the Ugly", "For A Few Dollars More", "A Fistful of Dollars"
B) "A Fistful of Dollars", "For a Few Dollars More", "The Good, the Bad, and the Ugly"
C) "For a Few Dollars More", "The Good, the Bad, and the Ugly", "A Fistful of Dollars"
D) "For a Few Dollars More", "A Fistful of Dollars", "The Good, the Bad, and the Ugly"

You got this! Remember, every bug is just a feature in disguise.

@github-actions
Copy link

This is an excellent pull request! The changes provide clear improvements in both performance and code quality, and your detailed explanation in the PR description makes it easy to understand the purpose and methodology behind your modifications.


Pros and Strengths Identified

  1. Significant Performance Improvement:

    • Using a min-heap for in-scan filtering of the top-N largest files/directory items is a well-considered optimization.
    • The shift from (post-scan) to (in-scan) reduces memory overhead and circumvents the need to store large intermediary lists. This improvement is especially important for situations involving large datasets.
  2. Logical Refactoring:

    • Removing the now-unused further simplifies the codebase (which improves overall maintainability). The logic has been successfully integrated into the scanning process itself through min-heaps, avoiding an extra post-processing step.
  3. Good Unit Testing:

    • Comprehensive unit tests have been added for the new function, including edge cases (e.g., empty or size-one heaps, suboptimal input). Tests also verify the correctness of behavior across different categories and edge scenarios. Great work on covering these scenarios!
    • The descriptions of each test function are clear, adding to the readability and maintainability of the test cases.
  4. Code Readability:

    • The code is structured well, and the new helper function is concise, purposeful, and well-documented.
    • The code changes are consistent in style and formatting with the existing codebase.
    • Emphasis on performance optimizations is complemented by a clarifying comment for the configuration, which improves transparency for other developers reviewing the code in the future.
  5. Backward Compatibility:

    • Consideration was clearly given to maintaining backward compatibility, as the modified methods default to the previous behavior when new arguments (e.g., ) aren't used explicitly.

Constructive Feedback & Suggestions

  1. **Error Handling for **:

    • Potential Issue at line 8:

      Here, if is a tuple and contains non-tuple elements (though unlikely in this PR), the comparison might break, throwing an exception. You might want to explicitly validate the type of inputs or ensure 's data consistency with additional assertions in .

      Suggestion:
      Add a check at the beginning of :

  2. Test Coverage:

    • While your tests for the min-heap helper are excellent, consider adding some edge cases for , such as:
      • A test case where exceeds the number of files in a category (e.g., top_n=10 but category only contains 2 files).
      • Testing with a large number of small files to validate memory usage with the min-heap optimization.
      • Explicitly testing behavior when (if supported; otherwise, document that it is unsupported).
  3. Code Documentation:

    • In (line 120 in ), adding a brief comment to clarify the filter and usage would enhance code clarity for future contributors. For example:
  4. Docstrings:

    • The function's docstring is good, but it could be expanded further to detail the structure of the returned dictionaries ( and ) after the heaps are converted to sorted lists. For instance:
  5. Configuration Defaults:

    • The top-N value now defaults to in several places. Consider documenting the default value and offering configuration examples in the main README or a dedicated documentation section. This would be valuable for users who might want to tweak the new parameter.

Overall Feedback

This pull request is outstanding! You've addressed performance bottlenecks effectively by implementing a widely accepted and efficient algorithmic choice (min-heap). The refactoring significantly reduces memory overhead, and the removal of redundant code increases maintainability. Your thorough tests are particularly praiseworthy — this shows your attention to detail and commitment to code quality. 🎉

With just minor improvements (some additional comments, edge case test coverage, and input type validation), this PR is practically perfect. Fantastic work — keep up the great engineering! 🚀

@github-actions
Copy link

🎉 All tests passed! Here's a dog for you! 🐶

Dog

Expanded documentation to describe the DEEPEST_SKIP_LEVEL optimization and the new streaming Top-N approach using min-heaps. Updated function signatures and added details for new and modified functions in zpace/core.py to reflect recent performance improvements.
@github-actions
Copy link

🎉 All tests passed! Here's a dog for you! 🐶

Dog

@AzisK AzisK merged commit ba2bd6f into main Jan 17, 2026
32 checks passed
@AzisK AzisK deleted the Optimize-top-N-selection-with-min-heaps-in-scan branch February 6, 2026 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant