Skip to content

Document clustering algorithms #3033

@tfmorris

Description

@tfmorris

The documentation on the clustering algorithms is scant, and in some cases, non-existent. The user interface should provide a brief description of each algorithm as well as guidance on how to pick the most appropriate one for their application as well as any pertinent tradeoffs.

Some are obvious from their names (or at least easily Google-able), but other are much more opaque.

PPM is a particularly egregious example, but since I just looked at the implementation, here's what it does:

PPM stands for Prediction by Partial Matching which is an arithmetic coding technique. The implementation that OpenRefine uses is based on the MIT SIMILE project's Vicino package, which, in turn, uses Bob Carpenter's arithcode package, which has a tutorial available here: https://raw.githubusercontent.com/bob-carpenter/java-arithcode/master/arithcode-1.2/doc/tutorial.html. The two strings to be compared are concatenated together and then "compressed" using the coder. The PPMModel uses 8 bytes of context and returns as the score the length of the encoded output. This is used four times to compute distance scores for (x,y), (y,x), (x,x), (y,y) (because the scores are directional) which are then combined together in this fashion to generate a final score:

10.0d * ((d(x,y) + d(y,x)) / (d(x,x) + d(y,y)) - 1.0d)

This score is relatively computationally expensive to compute and has to be done in a pairwise fashion for every pair of strings. To prevent an N^2 explosion, both the "nearest neighbor" clustering algorithms first divide the strings into manageable sized blocks and only does the pairwise comparison for members of the block. Depending on the blocking algorithm used, it's possible for two similar strings to end up in different blocks and never get compared to each other.

Metadata

Metadata

Assignees

Labels

Type: BugIssues related to software defects or unexpected behavior, which require resolution.Type: DocumentationIssues related to improving project documentation or tutorials.clusteringIssues related to the clustering operation, to merge similar values in a text column

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions