Korean Text Data Generator for OCR tasks. Heavily depends on TRDG by Belval.
You should install trdg and Pillow package with pip, if not installed yet. You might want to use venv or conda.
pip install trdg Pillow
All resources including dictionaries, fonts, and background images are located in respective folders at resources/.
You may add more resources in the corresponding folders.
Run builds/ksx*.sh to build preconfigured datasets.
You can run ./run.py to run kotdg. ./run.py --help will show (shockingly long) list of options.
Behaviors of the options are generally same as the original TRDG
-cor--countis the only required option. It specifies the number of the images to be generated.-lor--languagespecifies the languages to be used. You can refer to the ISO standard- FYI: Korean is
ko, and English isen.
- FYI: Korean is
-oor--output_dir <dir>specifies the location where the generated images and labels will be saved.- Default to
out
- Default to
- Specifying text source
--dict <name>will load dictionary named<name>fromresources/dicts/- If none of the source-specifying options are set, this option (with
<name>aswords.txt) is used by default. - Each line of the file is separated, and at most 200 characters (from the beginning) are used.
- Lines are randomly picked.
- e.g.
./run.py --dict anthem.txt -c 5
- If none of the source-specifying options are set, this option (with
--input_file <path>will load a file from the given path.- Note that
<path>is not related toresources/, unlike other cases. - Lines are sequentially used.
- e.g.
./run.py --input_file resources/dicts/ksx1001.txt -c 5
- Note that
--randomwill load randomly picked letters in the corresponding pool- 한글은 유니코드 0xAC00부터 0xD7A3 영역에서 임의 추출합니다. 이는 가능한 모든 조합형 글자의 영역입니다.
--include_**options will configure what letters can be included in the poolkoandcneach have their own pool, and other latin languages will use ASCII pool
--wikipediawill load random words from random Wikipedia,page, corresponding to the given language code.
- Text options
--font <name>specifies which font to be used- Font will be loaded from
resources/fonts - The default value (even if this options is not set) is
'NanumGothic.ttf'. - Note: all fonts should be
.ttfor.otfformat.- TRDG uses
PIL.Imagefont.truetypefunction. Check corresponding documents for compatibility.
- TRDG uses
- e.g.
./run.py -c 5 --font "Maplestory Bold.ttf"
- Font will be loaded from
--font_dir <dir>specifies the directory where fonts are located- All fonts in the directory will be tried to be used.
- Note that, of course, the total number of generated images is equal to the number passed to the
--count. - e.g.
./run.py -c 5 --font_dir resources/fonts
- Image options
--formatspecifies the 'height' of each image- If the text orientation is vertical, this specifies the width.
- Default to 32.
--widthspecifies the width of each image.- If not set, the width will be 10 +
<width of text> - Not yet tested with vertical texts
- If not set, the width will be 10 +
-b <mode>or--background <mode>specify which backgrounds will be used.- options for
<mode>: 0 (gaussian noise), 1 (plain white), 2 (quasi-crystal), 3 (custom image) - Note that
<mode>=3needs--image_dirto locate the custom background images
- options for
--image_dirspecifies the location of background images.- Only used when
--background 3. - Ideally, this should be
resources/images. Planning to make that as a default. - e.g.
./run.py --background 3 --image_dir resources/images -c 5
- Only used when
--text_color "color_code"specifies the color of the text.- Color code should be in hex type (e.g.
#00FFFF), and it must be enclosed by double quotes. ("") - e.g.
./run.py -c 5 --text_color "#00FFFF"
- Color code should be in hex type (e.g.
- Output options
--name_formatspecifies the format of the generated file names.<text>: text written on the image,<idx>: integer index of the image,<ext>: extension of the image- 0:
<text>_<idx>.<ext> - 1:
<idx>_<text>.<ext> - 2:
<idx>.<ext>, andlabels.txtcontaining<idx>-<text>mapping
Import KoreanTextGenerator with from kotdg.generator import KoreanTextGenerator.
Refer to demo.py for general usage.
resources/- fonts, dictionaries, and background images.
- font file should be
.ttffile
- font file should be
- fonts, dictionaries, and background images.
kotdg/- core files.
KoreanTextGeneatoringenerator.pywraps TRDG generators- use with same parameters as TRDG generators, but with extra meta-parameter
source sourceshould have a value amongsource_options = ("string", "random", "dict", "wiki", "file")
- use with same parameters as TRDG generators, but with extra meta-parameter
demo.py- testing scripts for development
- refer to this when you want to call
kotdgwith code
out/- default image saving directory
- ignored by git by default
Based on TextRecognitionDataGenerator
Some files are from:
딥러닝을 활용한 한글문서 OCR 연구,
Jeongeun So
Note that I do not have rights of most resources in this repository.