Chapter 13. Translation memories

1. Translation memories in OmegaT
1.1. tmx folders - location and purpose
1.2. tmx backup
1.3. tmx files and language
1.4. Orphan segments
2. Reusing translation memories
2.1. Importing and exporting translation memories
2.2. Creating a translation memory for selected documents
2.3. Sharing translation memories
2.4. Using TMX with alternative language pairs
3. Sources with existing translations
4. Pseudo-translated memory
5. Upgrading translation memories

1. Translation memories in OmegaT

1.1. tmx folders - location and purpose

OmegaT projects can have translation memory files - i.e. files with the extension tmx - in several different places:

omegat folder

The omegat folder contains the project_save.tmx and possibly a number of backup TMX files. The project_save.tmx file contains all the segments that have been recorded in memory since you started the project. This file always exists in the project. Its contents will always be sorted alphabetically by the source segment.

main project folder

The main project folder contains 3 tmx files, project_name-omegat.tmx, project_name-level1.tmx and project_name-level2.tmx (project_name being the name of your project).

  • The level1 file contains only textual information.

  • The level2 file encapsulates OmegaT specific tags in correct tmx tags so that the file can be used with its formatting information in a translation tool that supports tmx level 2 memories, or OmegaT itself.

  • The OmegaT file includes OmegaT specific formatting tags so that the file can be used in other OmegaT projects.

These files are copies of the file project_save.tmx, i.e. of the project's main translation memory, excluding the so-called orphan segments. They carry appropriately changed names, so that their contents remain identifiable when used elsewhere, for instance in the tm subfolder of some other project (see below).

tm folder

The /tm/ folder can contain any number of ancillary translation memories - i.e. tmx files. Such files can be created in any of the three varieties indicated above. Note that other CAT tools can export (and import as well) tmx files, usually in all three forms. The best thing of course is to use OmegaT-specific TMX files (see above), so that the in-line formatting within the segment is retained.

The contents of translation memories in the tm subfolder serve to generate suggestions for the text(s) to be translated. Any text already translated and stored in those files will appear among the fuzzy matches if it is sufficiently similar to the text currently being translated.

If the source segment in one of the ancillary TMs is identical to the text being translated, OmegaT acts as defined in the Options → Editing Behavior... dialog window. For instance (if the default is accepted), the translation from the ancillary TM is accepted and prefixed with [fuzzy], so that the translator can review the translations at a later stage and check whether the segments tagged this way have been translated correctly (see the Editing behavior chapter).

It may happen that translation memories available in the tm subfolder contain segments with identical source text but differing targets. TMX files are read sorted by their names, and the segments within a given TMX file are read line by line. The last segment with the identical source text will thus prevail (Note: of course it makes more sense to avoid this situation in the first place).
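The precedence rule described above can be modelled in a few lines of Python. This is only an illustrative sketch, not OmegaT's own code: the function name, the in-memory data layout and the sample file names are assumptions made for the example, and Python's default string sort may not order file names exactly as OmegaT does.

```python
def resolve_duplicates(tmx_files):
    """Return {source: (target, file_name)}, letting later entries win.

    tmx_files maps a (hypothetical) file name to its (source, target)
    pairs, listed in the order they appear in the file.
    """
    resolved = {}
    # Files are visited sorted by name, segments in file order.
    for name in sorted(tmx_files):
        for source, target in tmx_files[name]:
            # A later entry with the same source text replaces the earlier one.
            resolved[source] = (target, name)
    return resolved


memories = {
    "b_client.tmx": [("Hello", "Bonjour")],
    "a_legacy.tmx": [("Hello", "Salut"), ("Hello", "Coucou")],
}
print(resolve_duplicates(memories)["Hello"])  # ('Bonjour', 'b_client.tmx')
```

In this model, renaming a file changes which translation prevails, which is one more reason to remove conflicting entries rather than rely on the ordering.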

Note that the TMX files in the tm folder can be compressed with gzip.

tm/auto folder

If it is clear from the very start that the translations in a given TM (or TMs) are all correct, you can put them into the tm/auto folder and avoid confirming a lot of [fuzzy] cases.

  1. Put the TMX in /tm/auto.

  2. Open the project. The changes are displayed.

  3. Make a slight change anywhere in the project. This modifies project_save.tmx (by adding the proper Translation Units from the "auto" TMX).

Note: if the TMX is removed from /tm/auto before step 3, no extra Translation Unit is added.

tm/enforce folder

If you have no doubt that a TMX is more accurate than the project_save.tmx of OmegaT, put this TMX in /tm/enforce to overwrite existing default translations unconditionally.

  1. Put the TMX in /tm/enforce.

  2. Open the project. The changes are displayed.

  3. Make a slight change anywhere in the project. This modifies project_save.tmx.

  4. Decide whether the enforced segments need to stay immune from further changes:

    • If they don't need to stay immune from further changes, then remove the TMX from /tm/enforce.

    • If they need to stay immune from further changes, then keep the TMX in /tm/enforce.

Note: if the TMX is removed from /tm/enforce before step 3, the enforcements are not kept at all.

tm/mt folder

In the editor pane, when a match is inserted from a TMX contained in a folder named mt, the background of the active segment is changed to red. The background is restored to normal when the segment is left.

tm/penalty-xxx folders

Sometimes it is useful to distinguish between high-quality translation memories and those that are less reliable because of the subject matter, client, revision status, etc. For translation memories in folders with a name "penalty-xxx" (with xxx between 0 and 100), matches will be degraded according to the name of the folder: a 100% match in any of the TMs residing in a folder called penalty-30, for instance, will be lowered to a 70% match. The penalty applies to all three match percentages: matches 75, 80, 90 will in this case be lowered to 45, 50, 60.
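The arithmetic can be checked with a short Python sketch. It is a model of the rule as described here, not OmegaT's implementation; the function names are hypothetical, and both the case-insensitive folder name and the lower bound of 0 are assumptions.

```python
import re


def folder_penalty(folder_name):
    """Extract the penalty from a folder name such as 'penalty-30' (0 if none)."""
    match = re.fullmatch(r"penalty-(\d{1,3})", folder_name, flags=re.IGNORECASE)
    if not match:
        return 0
    # The penalty is documented as lying between 0 and 100.
    return min(int(match.group(1)), 100)


def degraded(scores, folder_name):
    """Subtract the folder penalty from each match percentage (assumed floor: 0)."""
    penalty = folder_penalty(folder_name)
    return [max(score - penalty, 0) for score in scores]


print(degraded([100], "penalty-30"))         # [70]
print(degraded([75, 80, 90], "penalty-30"))  # [45, 50, 60]
```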

Optionally, you can let OmegaT have an additional tmx file (OmegaT-style) anywhere you specify, containing all translatable segments of the project. See pseudo-translated memory below.

Note that all the translation memories are loaded into memory when the project is opened. Back-ups of the project translation memory are produced regularly (see tmx backup below), and project_save.tmx is also saved/updated when the project is closed or loaded again. This means for instance that you do not need to exit a project you are currently working on if you decide to add another ancillary TM to it: you simply reload the project, and the changes you have made will be included.

The locations of the various translation memories for a given project are user-defined (see Project dialog window in Project properties).

Depending on the situation, different strategies are thus possible, for instance:

  • several projects on the same subject: keep the project structure, and change source and target folders (Source = source/order1, target = target/order1 etc). Note that segments from order1 that are not present in order2 and other subsequent jobs will be tagged as orphan segments; however, they will still be useful for getting fuzzy matches.

  • several translators working on the same project: split the source files into source/Alice, source/Bob... and allocate them to team members (Alice, Bob ...). They can then create their own projects and deliver their own project_save.tmx when finished or when a given milestone has been reached. The project_save.tmx files are then collected, and possible conflicts, for instance as regards terminology, get resolved. A new version of the master TM is then created, either to be put in team members' tm/auto subfolders or to replace their project_save.tmx files. The team can also use the same subfolder structure for the target files. This allows them, for instance, to check at any moment whether the target version for the complete project is still OK.

1.2. tmx backup

As you translate your files, OmegaT stores your work continually in project_save.tmx in the project's /omegat subfolder.

OmegaT also backs up the translation memory to project_save.tmx.YEARMMDDHHNN.bak in the same subfolder whenever a project is opened or reloaded. YEAR is the 4-digit year, MM the month, DD the day of the month, and HH and NN the hours and minutes at which the previous translation memory was saved.
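As an illustration of the naming pattern, the following Python sketch builds such a file name from a timestamp. The function name is hypothetical, and zero-padded two-digit fields are assumed for MM, DD, HH and NN.

```python
from datetime import datetime


def backup_name(saved_at):
    """Build the backup file name for a memory saved at the given time."""
    # YEAR, MM, DD, HH and NN correspond to %Y, %m, %d, %H and %M.
    return "project_save.tmx." + saved_at.strftime("%Y%m%d%H%M") + ".bak"


print(backup_name(datetime(2024, 3, 5, 14, 7)))
# project_save.tmx.202403051407.bak
```

Because every field has a fixed width, sorting the backup file names alphabetically also sorts them chronologically.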

If you believe you have lost translation data, use the following procedure:

  1. Close the project

  2. Rename the current project_save.tmx file (e.g. to project_save.tmx.temporary)

  3. Select the backup translation memory that is most likely to contain the data you are looking for (e.g. the most recent one, or the last version from the day before)

  4. Copy it to project_save.tmx

  5. Open the project
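Steps 2 to 4 of the procedure above can also be scripted. The following Python sketch is one possible way to do so, under stated assumptions: the function name is hypothetical, the project is already closed, and the most recent backup is the one wanted (pick another file by hand if it is not).

```python
import shutil
from pathlib import Path


def restore_latest_backup(omegat_dir):
    """Restore the most recent backup; the project must be closed first."""
    omegat_dir = Path(omegat_dir)
    current = omegat_dir / "project_save.tmx"
    # Fixed-width timestamps sort chronologically, so the last name is the newest.
    backups = sorted(omegat_dir.glob("project_save.tmx.*.bak"))
    if not backups:
        raise FileNotFoundError("no backup translation memory found")
    if current.exists():
        # Step 2: keep the current file under a temporary name.
        current.rename(omegat_dir / "project_save.tmx.temporary")
    # Steps 3 and 4: copy the chosen backup to project_save.tmx.
    shutil.copy2(backups[-1], current)
    return backups[-1]
```

The function would be called with the path of the project's omegat subfolder, after which the project can be opened again (step 5).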

1.3. tmx files and language

Tmx files contain translation units, made of a number of equivalent segments in several languages. A translation unit comprises at least two translation unit variants (TUV). Either can be used as the source or target.
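To make the structure concrete, the Python sketch below parses a minimal, hand-written TMX document and lists the variants of each translation unit. The sample content and the function name are invented for illustration, and the code uses only the standard library rather than anything OmegaT provides.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="example" adminlang="EN-US" srclang="EN-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN-US"><seg>Open the project.</seg></tuv>
      <tuv xml:lang="FR-FR"><seg>Ouvrez le projet.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(tmx_text):
    """Return one {language code: segment text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        variants = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may carry a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline tags.
            variants[lang] = "".join(seg.itertext()) if seg is not None else ""
        units.append(variants)
    return units


print(read_units(SAMPLE))
# [{'EN-US': 'Open the project.', 'FR-FR': 'Ouvrez le projet.'}]
```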

The settings in your project indicate which is the source and which the target language. OmegaT thus takes the TUV segments corresponding to the project's source and target language codes and uses them as the source and target segments respectively. OmegaT recognizes the language codes using the following two standard conventions:

  • 2 letters (e.g. JA for Japanese), or

  • 2- or 3-letter language code followed by the 2-letter country code (e.g. EN-US - See Appendix A, Languages - ISO 639 code list for a partial list of language and country codes).

If the project language codes and the tmx language codes fully match, the segments are loaded in memory. If the languages match but not the countries, the segments still get loaded. If neither the language code nor the country code matches, the segments will be ignored.
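The three cases can be summarised in a small Python sketch. It models the rule as stated in this section and is not OmegaT's code: the function names are hypothetical, and treating codes case-insensitively and accepting "_" as well as "-" as the separator are assumptions.

```python
def split_code(code):
    """Split 'EN-US' or 'en_us' into ('en', 'us'); the country may be None."""
    parts = code.replace("_", "-").lower().split("-", 1)
    return parts[0], (parts[1] if len(parts) > 1 else None)


def match_level(project_code, tmx_code):
    """Classify how well a TMX language code matches a project language code."""
    p_lang, p_country = split_code(project_code)
    t_lang, t_country = split_code(tmx_code)
    if p_lang != t_lang:
        return "ignored"
    if p_country == t_country:
        return "full match, loaded"
    return "language match only, loaded"


print(match_level("EN-US", "EN-US"))  # full match, loaded
print(match_level("EN-US", "EN-GB"))  # language match only, loaded
print(match_level("EN-US", "JA"))     # ignored
```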

TMX files can generally contain translation units with several candidate languages. If for a given source segment there is no entry for the selected target language, all other target segments are loaded, regardless of the language. For instance, if the language pair of the project is DE-FR, it can still be of some help to see hits in the DE-EN translation if there are none in the DE-FR pair.

1.4. Orphan segments

The file project_save.tmx contains all the segments that have been translated since you started the project. If you modify the project segmentation or delete files from the source, some matches may appear as orphan strings in the Match Viewer: such matches refer to segments that do not exist any more in the source documents, as they correspond to segments translated and recorded before the modifications took place.
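Conceptually, the orphan segments are the entries of the project memory whose source text no longer occurs in the source documents. The Python sketch below illustrates this idea only; the function name and the dictionary representation of the memory are assumptions, not a description of how OmegaT stores its data.

```python
def orphan_segments(memory, project_sources):
    """Return memory entries whose source text no longer occurs in the project."""
    current = set(project_sources)
    return {src: tgt for src, tgt in memory.items() if src not in current}


memory = {
    "Save the file.": "Enregistrez le fichier.",
    "Old sentence.": "Ancienne phrase.",
}
print(orphan_segments(memory, ["Save the file.", "New sentence."]))
# {'Old sentence.': 'Ancienne phrase.'}
```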

2. Reusing translation memories

Initially, that is when the project is created, the main TM of the project, project_save.tmx is empty. This TM gradually becomes filled during the translation. To speed up this process, existing translations can be reused. If a given sentence has already been translated once, and translated correctly, there is no need for it to be retranslated. Translation memories may also contain reference translations: multinational legislation, such as that of the European Community, is a typical example.

When you create the target documents in an OmegaT project, the translation memory of the project is output in the form of three files in the root folder of your OmegaT project (see the above description). You can regard these three tmx files (-omegat.tmx, -level1.tmx and -level2.tmx) as an "export translation memory", i.e. as an export of your current project's content in bilingual form.

Should you wish to reuse a translation memory from a previous project (for example because the new project is similar to the previous project, or uses terminology which might have been used before), you can use these translation memories as "input translation memories", i.e. for import into your new project. In this case, place the translation memories you wish to use in the /tm or /tm/auto folder of your new project: in the former case you will get hits from these translation memories in the fuzzy matches viewer, and in the latter case these TMs will be used to pre-translate your source text.

By default, the /tm folder is below the project's root folder (e.g. .../MyProject/tm), but you can choose a different folder in the project properties dialog if you wish. This is useful if you frequently use translation memories produced in the past, for example because they are on the same subject or for the same customer. In this case, a useful procedure would be:

  • Create a folder (a "repository folder") in a convenient location on your hard drive for the translation memories for a particular customer or subject.

  • Whenever you finish a project, copy one of the three "export" translation memory files from the root folder of the project to the repository folder.

  • When you begin a new project on the same subject or for the same customer, navigate to the repository folder in the Project > Properties > Edit Project dialog and select it as the translation memory folder.

Note that all the tmx files in the /tm repository are parsed when the project is opened, so putting all the different TMs you may have on hand into this folder may unnecessarily slow OmegaT down. You may even consider removing those that are no longer required, once you have used their contents to fill up the project_save.tmx file.

2.1. Importing and exporting translation memories

OmegaT supports imported tmx versions 1.1-1.4b (both level 1 and level 2). This enables the translation memories produced by other tools to be read by OmegaT. However, OmegaT does not fully support imported level 2 tmx files (these store not only the translation, but also the formatting). Level 2 tmx files will still be imported and their textual content can be seen in OmegaT, but the quality of fuzzy matches will be somewhat lower.

OmegaT follows very strict procedures when loading translation memory (tmx) files. If an error is found in such a file, OmegaT will indicate the position within the defective file at which the error is located.

Some tools are known to produce invalid tmx files under certain conditions. If you wish to use such files as reference translations in OmegaT, they must be repaired, or OmegaT will report an error and fail to load them. Fixes are trivial operations and OmegaT assists troubleshooting with the related error message. You can ask the user group for advice if you have problems.

OmegaT exports version 1.4 TMX files (both level 1 and level 2). The level 2 export is not fully compliant with the level 2 standard, but is sufficiently close and will generate correct matches in other translation memory tools supporting TMX Level 2. If you only need textual information (and not formatting information), use the level 1 file that OmegaT has created.

2.2. Creating a translation memory for selected documents

In case translators need to share their TMX bases while excluding some of their parts or including just translations of certain files, sharing the complete ProjectName-omegat.tmx is out of the question. The following recipe is just one of the possibilities, but it is simple enough to follow and poses no danger to the assets.

  • Create a project, separate from other projects, in the desired language pair, with an appropriate name - note that the TMXs created will include this name.

  • Copy the documents you need the translation memory for into the source folder of the project.

  • Copy the translation memories containing the translations of the documents above into the tm/auto subfolder of the new project.

  • Start the project. Check for possible Tag errors with Ctrl+T and untranslated segments with Ctrl+U. To check everything is as expected, you may press Ctrl+D to create the target documents and check their contents.

  • When you exit the project, the TMX files in the main project folder (see above) now contain the translations in the selected language pair for the files you have copied into the source folder. Copy them to a safe place for future reference.

  • To avoid reusing the project and thus possibly polluting future cases, delete the project folder or archive it away from your workplace.

2.3. Sharing translation memories

In cases where a team of translators is involved, translators will prefer to share common translation memories rather than distribute their local versions.

OmegaT interfaces to SVN and Git, two common team software versioning and revision control systems (RCS), available under an open source license. In the case of OmegaT, complete project folders - in other words the translation memories involved as well as source folders, project settings etc. - are managed by the selected RCS. see more in Chapter

2.4. Using TMX with alternative language pairs

There may be cases where you have done a project with e.g. Dutch sources and a translation into, say, English. You then need a translation into e.g. Chinese, but your translator does not understand Dutch; she does, however, understand English perfectly. In this case, the NL-EN translation memory can serve as a go-between to help generate the NL to ZH translation.

The solution in our example is to copy the existing translation memory into the tm/tmx2source/ subfolder and rename it to ZH_CN.tmx to indicate the target language of the tmx. The translator will be shown English translations for source segments in Dutch and use them to create the Chinese translation.

Important: the supporting TMX must be renamed XX_YY.tmx, where XX_YY is the target language of the tmx, for instance to ZH_CN.tmx in the example above. The project and TMX source languages should of course be identical - NL in our example. Note that only one TMX for a given language pair is possible, so if several translation memories should be involved, you will need to merge them all into the XX_YY.tmx.
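The copy-and-rename step can be scripted, for instance as in the Python sketch below. The function name and the example paths are hypothetical; the sketch refuses to overwrite an existing XX_YY.tmx because only one TMX per language pair is possible and several memories would have to be merged first.

```python
import shutil
from pathlib import Path


def install_go_between(tmx_path, project_dir, target_code):
    """Copy a TMX into tm/tmx2source/ under the name XX_YY.tmx."""
    dest_dir = Path(project_dir) / "tm" / "tmx2source"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (target_code + ".tmx")
    if dest.exists():
        # Only one TMX per language pair is possible: merge instead of overwriting.
        raise FileExistsError(str(dest) + " already exists")
    shutil.copy2(tmx_path, dest)
    return dest
```

For the example above, a call such as install_go_between("nl-en-omegat.tmx", "MyProject", "ZH_CN") would create MyProject/tm/tmx2source/ZH_CN.tmx (both the file and the project name are placeholders).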

3. Sources with existing translations

Some types of source files (for instance PO, TTX, etc.) are bilingual, i.e. they serve both as a source and as a translation memory. In such cases, an existing translation found in the file is included in the project_save.tmx. It is treated as a default translation if no match has been found, or as an alternative translation if the same source segment exists but with a different target text. The result will thus depend on the order in which the source segments have been loaded.

All translations from source documents are also displayed in the Comment pane, in addition to the Match pane. In the case of PO files, a 20% penalty is applied to the alternative translation (i.e., a 100% match becomes an 80% match). The word [Fuzzy] is displayed on the source segment.

When you load a segmented TTX file, segments with source = target will be included, if "Allow translation to be equal to source" in Options → Editing Behavior... has been checked. This may be confusing, so you may consider unchecking this option in this case.

4. Pseudo-translated memory

Note

Of interest for advanced users only!

Before segments get translated, you may wish to pre-process them or address them in some other way than is possible with OmegaT. For example, if you wish to create a pseudo-translation for testing purposes, OmegaT enables you to create an additional tmx file that contains all segments of the project. The translation in this tmx can be either

  • translation equals source (default)

  • translation segment is empty

The tmx file can be given any name you specify. A pseudo-translated memory can be generated with the following command line parameters:

java -jar omegat.jar --pseudotranslatetmx=<filename> [--pseudotranslatetype=[equal|empty]]

Replace <filename> with the name of the file you wish to create, either absolute or relative to the working folder (the folder you start OmegaT from). The second argument --pseudotranslatetype is optional. Its value is either equal (default value, for source=target) or empty (target segment is empty). You can process the generated tmx with any tool you want. To reuse it in OmegaT, rename it to project_save.tmx and place it in the omegat folder of your project.
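When the command is launched from a script, it can help to assemble the arguments programmatically. The Python sketch below only builds the argument list from the two parameters named in this section and does not run anything; the function name is hypothetical, and depending on the OmegaT version further arguments may be needed.

```python
def pseudotranslate_command(filename, kind=None):
    """Build the argument list for the command shown above (it is not run here)."""
    if kind not in (None, "equal", "empty"):
        raise ValueError("kind must be 'equal' or 'empty'")
    command = ["java", "-jar", "omegat.jar", "--pseudotranslatetmx=" + filename]
    if kind is not None:
        # Optional second parameter; "equal" is the documented default.
        command.append("--pseudotranslatetype=" + kind)
    return command


print(" ".join(pseudotranslate_command("pseudo.tmx", "empty")))
# java -jar omegat.jar --pseudotranslatetmx=pseudo.tmx --pseudotranslatetype=empty
```

The resulting list can be passed to subprocess.run from the folder that contains omegat.jar.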

5. Upgrading translation memories

Very early versions of OmegaT were capable of segmenting source files into paragraphs only and were inconsistent when numbering formatting tags in HTML and Open Document files. OmegaT can detect and upgrade such tmx files on the fly to increase fuzzy matching quality and leverage your existing translation better, saving you the work of doing this manually.

A project's tmx will be upgraded only once, and will be written in upgraded form into the project_save.tmx; legacy tmx files will be upgraded on the fly each time the project is loaded. Note that in some cases changes in file filters in OmegaT may lead to totally different segmentation; in such rare cases, you will have to upgrade your translation manually.