Appendix D. Tokenizers

1. Introduction
2. Languages selection

1. Introduction

Tokenizers (or stemmers) improve the quality of matches by recognizing inflected words in source and translation memory data. They also improve glossary matching.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". Likewise, a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish". This is especially useful for languages that build many inflected forms by adding prefixes and suffixes to the stem. Borrowing an example from Slovenian, here is the adjective "lep" ("beautiful") in several of its grammatically correct forms:

  • lep, lepa, lepo - singular; masculine, feminine, neuter

  • lepši, lepša, lepše - comparative, nominative; masculine, feminine, neuter, or plural forms of the adjective

  • najlepših - superlative, plural, genitive for masculine, feminine and neuter
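The idea of reducing inflected forms to a common stem can be illustrated with a deliberately naive suffix-stripping sketch. This is not the algorithm OmegaT uses; the suffix list, the minimum stem length and the function names (toy_stem, same_stem) are assumptions chosen only to reproduce the English examples above.

```python
# Toy suffix-stripping stemmer (illustrative only, NOT OmegaT's tokenizer).
# The suffix list and the minimum stem length are assumptions for this sketch.
SUFFIXES = ("ing", "ed", "er", "s")  # checked in this order
MIN_STEM_LENGTH = 3
VOWELS = "aeiou"


def toy_stem(word: str) -> str:
    """Return a crude stem by removing one known suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if len(stem) < MIN_STEM_LENGTH:
            continue
        # Undo consonant doubling, e.g. "stemm" -> "stem".
        if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
            stem = stem[:-1]
        return stem
    return word


def same_stem(a: str, b: str) -> bool:
    """Two words match if their crude stems are identical."""
    return toy_stem(a) == toy_stem(b)


if __name__ == "__main__":
    for w in ("fishing", "fished", "fish", "fisher", "cats", "stemming"):
        print(w, "->", toy_stem(w))
```

With such a function, a match lookup can compare stems instead of surface forms, so that a segment containing "fished" can still be matched against a glossary entry for "fish". Real stemmers use far more elaborate, language-specific rules.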

2. Languages selection

Tokenizers are included in OmegaT and active by default. OmegaT automatically selects a tokenizer for the source and the target language according to the language settings of the project. It is possible to select another tokenizer (Language Tokenizer) or a different version of the tokenizer (Tokenizer Behavior) from the Project Properties window.

If no tokenizer exists for the project's languages, OmegaT uses Hunspell instead; in that case, make sure that the relevant Hunspell dictionaries are installed.
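The selection logic described in this section (automatic choice by language, optional manual override, fallback when no tokenizer exists) can be sketched as follows. Everything here is hypothetical: the registry contents, the tokenizer names and the function pick_tokenizer are invented for illustration and do not correspond to OmegaT's internal code or configuration.

```python
from typing import Optional

# Hypothetical registry mapping a base language code to a tokenizer name.
# The entries are placeholders, not a list of what OmegaT actually ships.
TOKENIZERS = {
    "en": "english",
    "de": "german",
    "fr": "french",
}
FALLBACK = "hunspell"  # used when no dedicated tokenizer is registered


def pick_tokenizer(language_code: str, override: Optional[str] = None) -> str:
    """Choose a tokenizer name for a language such as "en-US" or "fr"."""
    if override is not None:
        # A manual choice takes precedence over the automatic one.
        return override
    # Reduce "en-US" or "en_US" to the base language "en".
    base = language_code.replace("_", "-").split("-")[0].lower()
    return TOKENIZERS.get(base, FALLBACK)


if __name__ == "__main__":
    print(pick_tokenizer("en-US"))
    print(pick_tokenizer("xx"))
    print(pick_tokenizer("en", override="german"))
```

The sketch applies the same rule independently to the source and the target language, which mirrors the per-language selection described above.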

Incompatibilities

OmegaT will not launch if tokenizers are found in the /plugin folder. Remove all the tokenizers from the /plugin folder before starting OmegaT.