PrisTech-Ltd/SwTokenizer

A morphological tokenizer for better Swahili-speaking LLMs

Rationale

Swahili is an agglutinative language that builds words up from semantically meaningful morphemes. For example, the word "kinachohitajika", meaning "the thing which is needed" or "what is required", can be analyzed like this:

  • "ki" is a subject prefix that refers to a Class 7 noun, typically "thing" or "object." This is third person, singular, that is "it" in English.
  • "na" is a temporal verb marker that expresses present tense.
  • "cho" is an object infix that adds specificity to the noun class in the relative clause.
  • "hitaj" is the verb stem. It means "need" or "require."
  • "ik" is a suffix that expresses reflexivity in the passive voice. It morphs the verb to mean "be needed" or "be required."
  • "a" is the Bantu verb ending vowel

Currently available Large Language Models don't tokenize text on such a semantic basis, so they are not aware of this grammatical structure and therefore cannot develop an "understanding" of it. The model's token embeddings won't reflect this semantic knowledge about Swahili. The word will probably be tokenized into an arbitrary sequence such as "kinac", "ho", "hit", "aji", "ka". Using a semantic tokenizer for Swahili should allow us to build a more proficient Swahili-speaking LLM, because the predicted token sequences will take the semantic meaning of the morphemes "ki", "na", etc. into account.
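As a rough illustration, here is a minimal sketch contrasting the two splits of "kinachohitajika": the generic subword sequence mentioned above versus the morpheme-level segmentation from the list. It uses no SwTokenizer API; everything in it comes from the text above.

    # Illustrative only: contrasts a generic subword split with the
    # morphological segmentation described above. No SwTokenizer API is used.

    word = "kinachohitajika"

    # A plausible split from a generic (non-Swahili-aware) subword tokenizer.
    generic_tokens = ["kinac", "ho", "hit", "aji", "ka"]

    # The morpheme-level segmentation: subject prefix, tense marker,
    # relative marker, verb stem, stative extension, final vowel.
    morphemes = ["ki", "na", "cho", "hitaj", "ik", "a"]

    # Both splits reassemble into the same surface form...
    assert "".join(generic_tokens) == word
    assert "".join(morphemes) == word

    # ...but only the second aligns token boundaries with units of meaning.
    for m in morphemes:
        print(m)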

Another reason for creating this tool was that common tokenizers such as spaCy or Snowball don't support Swahili. So this morphological tokenizer, which is compatible with Google Gemma tokenization, fills a gap.

Installation

poetry install
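After installation, usage depends on the package's actual API, which isn't documented here. The following is only a hypothetical sketch of how a morphological tokenizer of this kind might be called; the import path, class name, and method are assumptions, not the package's documented interface.

    # Hypothetical usage sketch: the import path, class name, and method
    # below are assumptions for illustration, not the documented API.
    from swtokenizer import SwTokenizer  # assumed import path

    tokenizer = SwTokenizer()                       # assumed constructor
    tokens = tokenizer.tokenize("kinachohitajika")  # assumed method
    print(tokens)  # expected to resemble ['ki', 'na', 'cho', 'hitaj', 'ik', 'a']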

Status

Swahili is a morphologically rich language with complex rules. Creating a perfect Swahili morphological analyzer would be very difficult, if it's possible at all, and it's not actually necessary to achieve our goals. Running coverage_report.py on the Swahili portion of Wikipedia, currently more than half of the words are analyzed successfully. Words that cannot be analyzed fall back to a traditional BPE tokenization method. Coverage could be increased by adding more noun and verb stems and more grammar rules to handle less common cases. Please study tests/sw_morphs.yaml to see what works now and feel free to extend it.
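As context for how unanalyzed words could be handled, here is a minimal sketch of the analyze-or-fall-back idea described above. Both helper functions (analyze_morphemes, bpe_tokenize) are placeholders for illustration, not functions from this repository.

    # Minimal sketch of the strategy described above: try morphological
    # analysis first, fall back to BPE for words the analyzer can't handle.
    # Both helpers are placeholders, not part of this repository.
    from typing import List, Optional

    def analyze_morphemes(word: str) -> Optional[List[str]]:
        """Placeholder: return morphemes, or None if analysis fails."""
        known = {"kinachohitajika": ["ki", "na", "cho", "hitaj", "ik", "a"]}
        return known.get(word)

    def bpe_tokenize(word: str) -> List[str]:
        """Placeholder standing in for a traditional BPE tokenizer."""
        return [word[i:i + 3] for i in range(0, len(word), 3)]

    def tokenize(word: str) -> List[str]:
        morphemes = analyze_morphemes(word)
        return morphemes if morphemes is not None else bpe_tokenize(word)

    print(tokenize("kinachohitajika"))  # morphological path
    print(tokenize("intaneti"))         # falls back to the BPE placeholder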
