A high-performance Python toolkit for large-scale fuzzy string matching using MinHash LSH (Locality-Sensitive Hashing) and multiple similarity metrics.
StringMatcher is designed to efficiently match similar strings across large datasets by combining:
- MinHash LSH for fast candidate pair generation in near-linear O(n + m) time
- RapidFuzz for computing multiple fuzzy matching scores on candidate pairs
- Q-gram tokenization for robust string comparison
This two-stage approach dramatically reduces computation time compared to naive pairwise comparison, making it ideal for matching names, addresses, or any text data across millions of records.
- **Scalable**: Handles millions of string comparisons efficiently using LSH indexing
- **Multiple Metrics**: Computes 6 similarity scores (ratio, partial ratio, token sort/set ratios, and their partial variants)
- **Configurable**: Adjustable LSH threshold, number of permutations, and Q-gram size
- **Two-Stage Pipeline**: Fast candidate generation followed by precise similarity scoring
- **Progress Tracking**: Built-in progress bars with tqdm
- **Text Normalization**: Automatic case normalization and special character handling
```bash
pip install pandas "dask[dataframe]" datasketch rapidfuzz tqdm
```

- `pandas`: Data manipulation and analysis
- `dask`: Large-scale data processing
- `datasketch`: MinHash and LSH implementation
- `rapidfuzz`: Fast fuzzy string matching
- `tqdm`: Progress bars
```python
from string_matcher import StringMatcher

# Initialize the matcher
matcher = StringMatcher(
    num_perm=128,       # Number of hash functions (higher = more accurate)
    lsh_threshold=0.6,  # LSH similarity threshold (0.0-1.0)
    qgram_size=3        # Q-gram size for tokenization
)

# Two arrays of strings to match
array_A = ["John Smith", "Jane Doe", "Robert Johnson"]
array_B = ["Jon Smith", "Jane D.", "Bob Johnson", "Alice Williams"]

# Perform matching
results = matcher.match(
    array_A,
    array_B,
    name_A_col='name_A',
    name_B_col='name_B'
)

# View results sorted by similarity
print(results.head())
```

The `match()` method returns a pandas DataFrame with the following columns:
| Column | Description |
|---|---|
| `name_A` | String from array A |
| `name_B` | Matched string from array B |
| `partial_ratio` | Partial string matching score (0-100) |
| `ratio` | Basic Levenshtein distance ratio (0-100) |
| `token_sort_ratio` | Token-sorted comparison (0-100) |
| `token_set_ratio` | Token-set comparison (0-100) |
| `partial_token_sort_ratio` | Partial token-sorted score (0-100) |
| `partial_token_set_ratio` | Partial token-set score (0-100) |
**Stage 1: Candidate generation (MinHash LSH)**

- Convert strings to Q-grams (character n-grams)
- Generate MinHash signatures for each string
- Build an LSH index over array B
- Query the index with array A to find candidate pairs

Time complexity: O(n + m), instead of O(n × m) for naive pairwise comparison

**Stage 2: Similarity scoring (RapidFuzz)**

- Compute multiple similarity metrics for each candidate pair
- Return sorted results with all similarity scores
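The Stage 1 mechanics can be sketched in plain Python. This is a toy MinHash built on seeded MD5 hashes, not the `datasketch` implementation the toolkit actually uses, but it shows why matching signature slots estimate Jaccard similarity between Q-gram sets:

```python
import hashlib

def qgrams(s, q=3):
    """Character q-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def minhash_signature(tokens, num_perm=128):
    """Toy MinHash: one seeded hash per 'permutation'; keep each minimum."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a, b = qgrams("John Smith"), qgrams("Jon Smith")
true_j = len(a & b) / len(a | b)
est_j = estimated_jaccard(minhash_signature(a), minhash_signature(b))
print(f"true Jaccard={true_j:.2f}  MinHash estimate={est_j:.2f}")
```

The estimate concentrates around the true Jaccard value as `num_perm` grows, which is why higher `num_perm` trades speed for accuracy.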
Main class for string matching operations.
Initialize the string matcher.
Parameters:
- `num_perm` (int): Number of MinHash hash functions. Higher values increase accuracy but reduce speed. Default: 128
- `lsh_threshold` (float): LSH similarity threshold (0.0-1.0). Higher values increase precision but may reduce recall. Default: 0.6
- `qgram_size` (int): Q-gram length for tokenization. Default: 3
Execute the full string matching pipeline.
Parameters:
- `arr_A` (List[str]): First string array
- `arr_B` (List[str]): Second string array
- `name_A_col` (str): Column name for array A in the output DataFrame
- `name_B_col` (str): Column name for array B in the output DataFrame
Returns:
pd.DataFrame: DataFrame with matched pairs and similarity scores
Find candidate similar pairs using MinHash LSH.
Parameters:
- `arr_A` (List[str]): First string array
- `arr_B` (List[str]): Second string array
Returns:
Set[Tuple[str, str]]: Set of candidate pairs
Compute fuzzy matching scores for candidate pairs.
Parameters:
- `candidates` (Set[Tuple[str, str]]): Set of candidate pairs
- `name_A_col` (str): Column name for array A
- `name_B_col` (str): Column name for array B
Returns:
pd.DataFrame: DataFrame with similarity metrics
`num_perm`:

- 32-64: Fast but less accurate, suitable for very large datasets
- 128 (default): Good balance of speed and accuracy
- 256+: High accuracy for critical applications
`lsh_threshold`:

- 0.3-0.5: High recall, catches more potential matches (more candidates)
- 0.6 (default): Balanced precision and recall
- 0.7-0.9: High precision, fewer false positives (fewer candidates)
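This precision/recall trade-off follows from standard LSH banding theory: splitting the signature into b bands of r rows makes a pair with Jaccard similarity s a candidate with probability 1 - (1 - s^r)^b. The b=32, r=4 split below is a hypothetical division of 128 permutations; `datasketch` chooses its own banding to fit the configured threshold:

```python
# Candidate probability under LSH banding: 1 - (1 - s^r)^b.
# b=32 bands of r=4 rows is one illustrative way to split 128 permutations.
def candidate_probability(s, b=32, r=4):
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    print(f"similarity {s:.1f} -> candidate probability {candidate_probability(s):.3f}")
```

The curve is S-shaped: pairs well below the threshold are almost never generated as candidates, while pairs well above it almost always are.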
`qgram_size`:

- 2: Better for very short strings or when high sensitivity is needed
- 3 (default): Standard choice, works well for most text
- 4-5: Better for longer strings with less noise
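The effect of the Q-gram size is easy to observe directly: smaller q yields more, shorter grams that tolerate typos, while larger q demands longer exact character runs. This pure-Python sketch (not the toolkit's own tokenizer) compares a misspelled name at three sizes:

```python
def qgrams(s, q):
    """Character q-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

pair = ("Jon Smith", "John Smith")
for q in (2, 3, 4):
    sim = jaccard(qgrams(pair[0], q), qgrams(pair[1], q))
    print(f"q={q}: Jaccard={sim:.2f}")
```

For this pair the similarity drops as q grows, because the single missing letter breaks more of the longer grams.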
- Adjust LSH threshold: Lower threshold = more candidates = higher recall but slower
- Filter results: Use a score threshold on the output DataFrame (e.g., `partial_ratio >= 85`)
- Deduplicate input: Remove duplicates from input arrays to reduce comparisons
- Batch processing: For extremely large datasets, process in chunks and concatenate results
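The batch-processing tip can be sketched with a small stdlib helper; the commented usage below is hypothetical, reusing the `matcher`, `array_A`, and `array_B` names from the quick start and `pandas.concat` to stitch per-chunk results together:

```python
def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Hypothetical usage with the matcher from the quick start:
#   parts = [matcher.match(chunk, array_B) for chunk in chunked(array_A, 100_000)]
#   results = pd.concat(parts, ignore_index=True)
print(list(chunked(["a", "b", "c", "d", "e"], 2)))
```

Chunking array A keeps per-batch memory bounded while the LSH index over array B is reused unchanged.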
```python
# Get only high-confidence matches
results = matcher.match(array_A, array_B)
high_confidence = results[results['partial_ratio'] >= 85]

# Get the best match for each string in array A
best_matches = results.sort_values('partial_ratio', ascending=False).groupby('name_A').first()
```

- Entity Resolution: Match company names, person names, or addresses across datasets
- Data Deduplication: Find and merge duplicate records
- Record Linkage: Connect records from different data sources
- Name Matching: Match variations of names (nicknames, misspellings, different formats)
- Data Integration: Align entities across multiple databases
```
           name_A       name_B  partial_ratio  ratio  token_sort_ratio  ...
0      John Smith    Jon Smith           95.0   90.0              90.0  ...
1  Robert Johnson  Bob Johnson           85.0   75.0              75.0  ...
2        Jane Doe      Jane D.           88.0   70.0              70.0  ...
```
This project is available for use under your chosen license.
Contributions are welcome! Please feel free to submit issues or pull requests.
If you use this toolkit in your research, please cite:
```bibtex
@software{stringmatcher,
  author = {Xiangyu Chen},
  title  = {StringMatcher: Large-Scale Fuzzy String Matching with MinHash LSH},
  year   = {2025},
  url    = {https://github.com/yourusername/StringMatcher}
}
```

For questions or feedback, please open an issue on GitHub.