Semantic clustering tool for finding patterns in AI-generated code review feedback.
- Embeds text using sentence-transformers (all-MiniLM-L6-v2)
- Computes pairwise cosine similarity
- Clusters using three methods: graph-based, agglomerative, HDBSCAN
- Visualizes with t-SNE projections
- Includes threshold sweep for parameter tuning
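
The similarity, graph-based clustering, and threshold sweep steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code. It assumes that "graph-based" means linking every pair whose cosine similarity meets the threshold and taking connected components. It uses plain Python lists and a small union-find in place of NumPy and NetworkX so it runs without dependencies, and it substitutes toy 2-D vectors for real sentence-transformers embeddings. The names `cosine_similarity`, `graph_clusters`, and `sweep` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def graph_clusters(vectors, threshold):
    """Group indices into connected components of the similarity graph.

    Assumption: two items share an edge when their cosine similarity is
    at least `threshold`. Components are tracked with a union-find.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def sweep(vectors, thresholds):
    """Map each threshold to the number of clusters it produces."""
    return {t: len(graph_clusters(vectors, t)) for t in thresholds}


# Toy stand-ins for embeddings: two near-duplicate pairs.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

print(graph_clusters(vectors, 0.75))
print(sweep(vectors, [0.2, 0.75, 0.999]))
```

A sweep like this shows how the cluster count changes as the threshold tightens, which is one way to pick a value for `--cluster-threshold`. Because connected components chain through intermediate items, a low threshold can merge loosely related comments into one large cluster.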
Built to identify duplicate/similar issues in large sets of AI-generated code review comments. Reduces noise, surfaces patterns.
python main.py "data/tasks.*.md" --cluster-threshold 0.75 --sweep --dump

Python, NumPy, scikit-learn, HDBSCAN, NetworkX, sentence-transformers, matplotlib