tesseract:: when OCR-ing multilingual books / pages, try languages in order of descending importance / occurrence #19

@MrBonkers

This is about bulk-processing pages that contain text in arbitrary languages, where we want tesseract to "auto-discover" the languages actually used on each page.


Tangentially related to #15, but specific to using multiple language models.

1: check exactly how tesseract handles this right now

2: these multi-lang OCR runs take a very long time, so the idea is to borrow from the (hacky) way white text on a black background is currently handled: only when the result so far has a confidence below the threshold of 0.7 is the image inverted and OCR attempted again. Can we speed up "unknown languages" OCR runs using a similar threshold heuristic? That is: when the OCR results so far have a confidence below threshold C, try the next language in the set, in order from most to least important/frequent.
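
A rough sketch of the heuristic from point 2, to make the idea concrete. This is not how tesseract does it today (that is what point 1 is meant to find out). All names here are hypothetical: `ocr_with_language_fallback`, the `ocr_fn` callback, and the assumption that it returns a mean confidence scaled to 0..1. In practice `ocr_fn` would wrap a single-language tesseract run.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical callback: (image, language code) -> (text, mean confidence in 0..1).
# Assumption: the caller wraps a single-language OCR run behind this signature.
OcrFn = Callable[[object, str], Tuple[str, float]]


def ocr_with_language_fallback(
    image: object,
    languages: Sequence[str],
    ocr_fn: OcrFn,
    threshold: float = 0.7,
) -> Optional[Tuple[str, str, float]]:
    """Try languages in the given order (most to least important/frequent).

    Stops at the first language whose confidence reaches `threshold`.
    If none does, returns the best-scoring attempt seen so far.
    Returns (language, text, confidence), or None if `languages` is empty.
    """
    best: Optional[Tuple[str, str, float]] = None
    for lang in languages:
        text, conf = ocr_fn(image, lang)
        # Remember the best attempt in case no language clears the threshold.
        if best is None or conf > best[2]:
            best = (lang, text, conf)
        # Good enough: skip the remaining (less likely) languages.
        if conf >= threshold:
            break
    return best
```

The saving comes from the early `break`: pages in the most common language cost one OCR pass, and only low-confidence pages pay for additional languages. The trade-off is that a page mixing two languages may stop at the first one that scores above C, so this would complement rather than replace a combined multi-language run.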
