Document clustering algorithms

The documentation on the clustering algorithms is scant, and in some cases, non-existent. The user interface should provide a brief description of each algorithm as well as guidance on how to pick the most appropriate one for their application as well as any pertinent tradeoffs.

Some are obvious from their names (or at least easily Google-able), but other are much more opaque.

PPM is a particularly egregious example, but since I just looked at the implementation, here's what it does:

PPM stands for [Prediction by Partial Matching](https://en.wikipedia.org/wiki/Prediction_by_partial_matching) which is an arithmetic coding technique. The implementation that OpenRefine uses is based on the MIT SIMILE project's Vicino package, which, in turn, uses Bob Carpenter's arithcode package, which has a tutorial available here: https://raw.githubusercontent.com/bob-carpenter/java-arithcode/master/arithcode-1.2/doc/tutorial.html. The two strings to be compared are concatenated together and then "compressed" using the coder. The PPMModel uses 8 bytes of context and returns as the score the length of the encoded output. This is used four times to compute distance scores for (x,y), (y,x), (x,x), (y,y) (because the scores are directional) which are then combined together in this fashion to generate a final score: 

`10.0d * ((d(x,y) + d(y,x)) / (d(x,x) + d(y,y)) - 1.0d)`

This score is relatively computationally expensive to compute and has to be done in a pairwise fashion for every pair of strings. To prevent an N^2 explosion, both the "nearest neighbor" clustering algorithms first divide the strings into manageable sized blocks and only does the pairwise comparison for members of the block. Depending on the blocking algorithm used, it's possible for two similar strings to end up in different blocks and never get compared to each other.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Document clustering algorithms #3033

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Document clustering algorithms #3033

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions