Chapter 14. Source segmentation

1. Segmentation rules
2. Rule priority
3. Creating a new rule
4. A few simple examples

Translation memory tools work with textual units called segments. OmegaT can segment a text in two ways: by paragraph or by sentence (the latter is also referred to as “rule-based segmentation”). To select the type of segmentation, choose Project → Properties... from the main menu and tick or untick the check box provided. Paragraph segmentation is advantageous in certain cases, such as highly creative or stylistic translations in which the translator may wish to change the order of entire sentences. For the majority of projects, however, sentence segmentation is the better choice, since it delivers better matches from previous translations. If sentence segmentation has been selected, you can set up the rules by choosing Options → Segmentation... from the main menu.

Dependable segmentation rules are already available for many languages, so you will probably not need to write your own. On the other hand, this functionality can be very useful in special cases, where you can increase your productivity by tuning the segmentation rules to the text to be translated.

Warning: the text will segment differently after filter options have been changed, so you may have to start translating from scratch. At the same time, the previously valid segments in the project translation memory will turn into orphan segments. If you change segmentation options while a project is open, you must reload the project for the changes to take effect.

OmegaT uses the following sequence of steps:

Structure level segmentation

OmegaT first parses the text for structure-level segmentation. During this process it is only the structure of the source file that is used to produce segments.

For example, text files may be segmented on line breaks, empty lines, or not be segmented at all. Files containing formatting (ODF documents, HTML documents, etc.) are segmented on the block-level (paragraph) tags. Translatable object attributes in XHTML or HTML files can be extracted as separate segments.

Sentence level segmentation

After segmenting the source file into structural units, OmegaT will segment these blocks further into sentences.
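The structure-level step for plain text files can be pictured with a minimal sketch. The function name structural_units and the mode names "line_breaks", "empty_lines" and "none" are invented for this illustration; they are not OmegaT identifiers, and the real text filter may treat whitespace differently.

```python
import re

def structural_units(text, mode):
    """Split a plain text file into structural units (illustrative only)."""
    if mode == "line_breaks":
        # Every line becomes its own unit.
        units = text.splitlines()
    elif mode == "empty_lines":
        # Units are blocks of lines separated by one or more empty lines.
        units = re.split(r"\n\s*\n", text)
    elif mode == "none":
        # The whole file is a single unit.
        units = [text]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Drop units that contain only whitespace.
    return [u for u in units if u.strip()]

sample = "Title\nFirst line.\n\nSecond paragraph."
print(structural_units(sample, "line_breaks"))
print(structural_units(sample, "empty_lines"))
```

Each unit produced here would then be handed to the sentence-level step.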

1. Segmentation rules

The process of segmenting can be pictured as follows: the cursor moves along the text, one character at a time. Each rule consists of a Before pattern and an After pattern. At each cursor position the rules are tried in their given order: a rule matches if its Before pattern matches the text to the left of the cursor and its After pattern matches the text to the right. If the matching rule is an exception rule, the cursor moves on without inserting a segment break; if it is a break rule, a new segment break is created at the current cursor position.

The two types of rules behave as follows:

Break rule

Separates the source text into segments. For example, "Did it make sense? I was not sure." should be split into two segments. For this to happen, there should be a break rule for "?", when followed by spaces and a capitalized word. To define a rule as a break rule, tick the Break/Exception check box.

Exception rule

Specifies which parts of the text should NOT be separated. In spite of the period, "Mrs. Dalloway" should not be split into two segments, so an exception rule should be established for Mrs (and for Mr, Dr, prof, etc.) followed by a period. To define a rule as an exception rule, leave the Break/Exception check box unticked.
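The cursor model and the two rule types can be sketched in a few lines of Python. This is only an illustration of the description above, not OmegaT's actual code: the Rule class, the segment function and the two sample rules are assumptions made for the example, and Python's re module stands in for the Java regular expressions that OmegaT uses.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    is_break: bool  # True = break rule, False = exception rule
    before: str     # pattern for the text to the left of the cursor
    after: str      # pattern for the text to the right of the cursor

def segment(text, rules):
    """Walk the text one character at a time; the first matching rule decides."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for rule in rules:
            left_ok = re.search("(?:" + rule.before + r")\Z", text[:pos])
            right_ok = re.match(rule.after, text[pos:])
            if left_ok and right_ok:
                if rule.is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # an exception rule simply stops the search here
    segments.append(text[start:])
    return segments

# Hypothetical rules: the exception is listed before the break rule.
RULES = [
    Rule(False, r"Mrs?\.", r"\s"),  # do not break after Mr. or Mrs.
    Rule(True, r"[.?!]", r"\s"),    # break after ., ? or ! followed by white space
]

print(segment("Did it make sense? I was not sure.", RULES))
print(segment("Mrs. Dalloway came. She left.", RULES))
```

In this sketch the white space after a break stays at the start of the following segment; trimming it is left out to keep the example short.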

The predefined break rules should be sufficient for most European languages and Japanese. In view of the flexibility, you may consider defining more exception rules for your source language in order to provide more meaningful and coherent segments.

2. Rule priority

All segmentation rule sets whose language pattern matches the source language are active and are applied in the given order of priority, so rules for a specific language should be placed higher than the default ones. For example, rules for Canadian French (FR-CA) should be placed higher than rules for French (FR.*), and higher than the Default (.*) ones. Thus, when translating from Canadian French, the rules for Canadian French (if any) will be applied first, followed by the rules for French and, lastly, by the Default rules.
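The priority order can be pictured as follows. The rule set contents are placeholders, and the function active_rules is invented for this sketch; it also assumes that a language pattern must match the whole language code and that case is ignored, which may differ from OmegaT's actual behaviour.

```python
import re

# Hypothetical rule sets, listed in priority order (top of the table first).
RULE_SETS = [
    ("Canadian French", r"FR-CA", ["fr-CA rule"]),
    ("French", r"FR.*", ["fr rule"]),
    ("Default", r".*", ["default rule"]),
]

def active_rules(source_lang, rule_sets=RULE_SETS):
    """Collect the rules of every matching set, keeping the table order."""
    rules = []
    for _name, lang_pattern, set_rules in rule_sets:
        # Assumption: full match on the language code, case-insensitive.
        if re.fullmatch(lang_pattern, source_lang, flags=re.IGNORECASE):
            rules.extend(set_rules)
    return rules

print(active_rules("fr-CA"))
print(active_rules("fr-FR"))
print(active_rules("ja"))
```

If the Default set were moved above the Canadian French set, its rules would be tried first and could decide a cursor position before the more specific rules are consulted.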

3. Creating a new rule

Major changes to the segmentation rules should generally be avoided, especially after completion of the first draft, but minor changes, such as the addition of a recognized abbreviation, can be advantageous.

In order to edit or expand an existing set of rules, simply click on it in the top table. The rules for that set will appear in the bottom half of the window.

In order to create an empty set of rules for a new language pattern click Add in the upper half of the dialog. An empty line will appear at the bottom of the upper table (you may have to scroll down to see it). Change the name of the rule set and the language pattern to the language concerned and its code (see Appendix A, Languages - ISO 639 code list for a list of language codes). The syntax of the language pattern conforms to regular expression syntax. If your set of rules handles a language-country pair, we advise you to move it to the top using the Move Up button.

Add the Before and After patterns. To check their syntax and their applicability, it is advisable to use tools which allow you to see their effect directly. See the chapter on Regular expressions. A good starting point will always be the existing rules.

4. A few simple examples

Intention: Set the segment start after a period ('.') followed by a space, tab ...
Before: \.
After: \s
Note: "\." stands for the period character. "\s" means any white space character (space, tab, new page etc.)

Intention: Do not segment after Mr.
Before: Mr\.
After: \s
Note: This is an exception rule, so the rule check box must not be ticked

Intention: Set a segment after "。" (Japanese period)
Before: 。
After: (empty)
Note: Note that After is empty

Intention: Do not segment after M. Mr. Mrs. and Ms.
Before: Mr??s??\.
After: \s
Note: Exception rule - see the use of ? in regular expressions
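
The example patterns can be tried out before they are entered in the dialog. The sketch below uses Python's re module, which handles these particular patterns in the same way as the Java regular expressions used by OmegaT, as far as we are aware; the helper rule_matches_at is invented for the illustration and is not part of OmegaT.

```python
import re

def rule_matches_at(before, after, text, pos):
    """True if Before matches up to the cursor and After matches from it."""
    left = re.search("(?:" + before + r")\Z", text[:pos])
    right = re.match(after, text[pos:])
    return left is not None and right is not None

# Period followed by white space: the cursor sits right after "One."
print(rule_matches_at(r"\.", r"\s", "One. Two", 4))

# Japanese period with an empty After pattern.
ja = "これはペンです。あれは本です。"
print(rule_matches_at("。", "", ja, ja.index("。") + 1))

# One pattern covering M., Mr., Mrs. and Ms.
for abbr in ["M.", "Mr.", "Mrs.", "Ms.", "Dr."]:
    text = abbr + " Smith"
    print(abbr, rule_matches_at(r"Mr??s??\.", r"\s", text, len(abbr)))
```

Only "Dr." fails the last check, which is why it would need an exception rule of its own.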