tesseract:: when OCR-ing multilingual books / pages, try languages in order of descending importance / occurrence

This is about bulk-processing pages that contain arbitrary text in arbitrary languages, where we want tesseract to "auto-discover" the languages actually used on each page.

---

Tangentially related to #15, but specific to using multiple language models.

1: check how tesseract does this *exactly* right now

2: these multi-language OCR runs take a very long time, so the idea is to borrow the (hacky) way white text on a black background is currently handled: only when the result so far has a confidence below the 0.7 threshold is the image inverted and OCR attempted again. Can we speed up "unknown languages" OCR runs with a similar threshold heuristic? That is: when the OCR results so far have a confidence below some threshold C, try the next language in the set, in order from most to least important/frequent.
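
A rough sketch of the heuristic in point 2, in Python for brevity. This is not how tesseract works today (that is what point 1 is meant to find out). The `ocr` callable, its `(text, confidence)` return shape, and the function name are all hypothetical stand-ins for whatever actually runs one language model on a page; the default threshold of 0.7 just mirrors the inversion threshold mentioned above.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical OCR hook: (image, language) -> (text, mean confidence in [0, 1]).
OcrFn = Callable[[object, str], Tuple[str, float]]


def ocr_with_language_fallback(
    image: object,
    languages: Iterable[str],
    ocr: OcrFn,
    threshold: float = 0.7,
) -> Optional[Tuple[str, str, float]]:
    """Try languages in the given order (most to least important/frequent).

    Stops at the first language whose confidence reaches `threshold`.
    If none does, returns the best-scoring attempt seen so far.
    Returns (language, text, confidence), or None if `languages` is empty.
    """
    best: Optional[Tuple[str, str, float]] = None
    for lang in languages:
        text, conf = ocr(image, lang)
        if best is None or conf > best[2]:
            best = (lang, text, conf)
        if conf >= threshold:
            break  # good enough: skip the remaining, less likely languages
    return best
```

The saving comes from the early `break`: a page that OCRs well with the most common language costs one run instead of one run per language. The trade-off is that a page mixing several languages may still clear the threshold with only the first one, so C would need tuning per collection.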


tesseract:: when OCR-ing multilingual books / pages, try languages in order of descending importance / occurrence #19
