Chapter 9. Files to translate

1. File formats
1.1. Plain text files
1.2. Formatted text files
1.3. PDF files
2. Other file formats
3. Right to left languages
3.1. Mixing RTL and LTR strings in segments
3.2. OmegaT tags in RTL segments
3.3. Creating translated RTL documents

1. File formats

You can use OmegaT to translate files in a number of file formats. These fall into two broad types: plain text and formatted text.

1.1. Plain text files

Plain text files contain text only, so translating them is as simple as typing the translation. Such files carry no formatting information beyond the "white space" used to align text, indicate paragraphs or insert page breaks, and they cannot hold or retain information about the color, font, etc. of the text. There are several ways to specify a file's encoding so that its contents are not garbled when opened in OmegaT. Currently, OmegaT supports the following plain text formats:

  • ASCII text (*.txt, etc.)

  • Encoded text (*.UTF8)

  • Java resource bundles (*.properties)

  • PO files (*.po)

  • INI (key=value) files (*.ini)

  • DTD files (*.DTD)

  • DokuWiki files (*.txt)

  • SubRip title files (*.srt)

  • Magento CE Locale CSV files (*.csv)
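One way to rule out garbled text is to normalise a plain text file to UTF-8 before adding it to the project. The following is a minimal sketch, not part of OmegaT: the function name `reencode_to_utf8` is hypothetical, and it assumes you already know the encoding the file was saved in.

```python
from pathlib import Path


def reencode_to_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Decode `source` with a known encoding and write the same text as UTF-8."""
    # Working on bytes keeps the original line endings untouched.
    # decode() raises UnicodeDecodeError if the declared encoding cannot
    # represent the file, which is safer than silently importing garbage.
    text = source.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))
```

For instance, `reencode_to_utf8(Path("readme.txt"), Path("source/readme.txt"), "cp1252")` would copy a Windows-1252 file into a project's source folder as UTF-8 (both paths are illustrative).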

Other plain text file types can be handled by OmegaT by associating their file extension with a supported file type (for example, .pod files can be associated with the ASCII text filter) and by pre-processing them with specific segmentation rules.

PO files can contain both the source and the target text. In that sense, they are plain text files plus translation memories. If the project translation memory (project_save.tmx) does not yet contain a translation for a given source segment, the translation found in the PO file is saved in project_save.tmx as the default translation. If, however, the same source segment already exists there with a different translation, the new translation is saved as an alternative.
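The default/alternative rule for PO translations can be pictured with a small model. This is an illustrative sketch only, not OmegaT's implementation: the class `ProjectMemory` and its method are hypothetical, and the model leaves out any context information that would tie an alternative to one particular occurrence of a segment.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectMemory:
    """Toy stand-in for project_save.tmx: one default per source, plus alternatives."""

    defaults: dict[str, str] = field(default_factory=dict)
    alternatives: dict[str, list[str]] = field(default_factory=dict)

    def load_po_entry(self, source: str, translation: str) -> str:
        """Store a translation read from a PO file and report how it was saved."""
        if source not in self.defaults:
            # No translation yet: the PO translation becomes the default.
            self.defaults[source] = translation
            return "default"
        if self.defaults[source] == translation:
            # Same translation already known: nothing to add.
            return "unchanged"
        # A different translation exists: keep it and add this one as an alternative.
        self.alternatives.setdefault(source, []).append(translation)
        return "alternative"
```

Loading the same source segment twice with two different translations therefore yields first `"default"` and then `"alternative"`, leaving the original default in place.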

1.2. Formatted text files

Formatted text files contain information such as font type, size, color, etc. as well as text. They are commonly created in word processors or HTML editors, in file formats designed to hold formatting information. That information can be as simple as “this is bold”, or as complex as table data with different font sizes, colors, positions, etc.

In most translation jobs, it is considered important to retain the formatting of the original text in the translation. OmegaT allows you to do this by marking the characters or words that have special formatting with easy-to-handle tags. Simplifying the formatting of the original text greatly reduces the number of tags: where possible, unifying the fonts, font sizes, colors, etc. used in the document simplifies the task of translation and reduces the number of possible tag errors.

Each file type is handled differently in OmegaT, and specific behavior can be set up in the file filters. At the time of writing, OmegaT supports the following formatted text formats:

  • ODF - OASIS Open Document Format (*.ods, *.ots, *.odt, *.ott, *.odp, *.otp)

  • Microsoft Office Open XML (*.docx, *.dotx, *.xlsx, *.xltx, *.pptx)

  • (X)HTML (*.html, *.xhtml, *.xht)

  • HTML Help Compiler (*.hhc, *.hhk)

  • DocBook (*.xml)

  • XLIFF (*.xlf, *.xliff, *.sdlxliff) - of the source=target variety

  • QuarkXPress CopyFlowGold (*.tag, *.xtg)

  • ResX files (*.resx)

  • Android resource (*.xml)

  • LaTeX (*.tex, *.latex)

  • Help & Manual files (*.xml, *.hmxp)

  • Typo3 LocManager (*.xml)

  • WiX Localization (*.wxl)

  • Iceni Infix (*.xml)

  • Flash XML export (*.xml)

  • Wordfast TXML (*.txml)

  • Camtasia for Windows (*.camproj)

  • Visio (*.vxd)

  • Java property XML (*.xml)

  • Schematron (*.sch)

Other formatted text file types may also be handled by OmegaT by associating their file extensions with a supported file type, provided that the corresponding segmentation rules segment them correctly.

1.3. PDF files

PDF files are a special case. They contain text formatting information, but such information cannot be reused by OmegaT in order to create target files. Thus, PDF files are handled as plain text files, and output files are plain text files.

If you need to reproduce text formatting (as well as other things such as drawings) in your translation, there are three approaches you can try:

  1. Use OmegaT’s default filter (PDF input), translate, create a target file (it will be a plain text file), then add the relevant formatting and other items manually.

  2. Use the Iceni Infix filter. See Howto - Translating PDF files with Iceni Infix and OmegaT.

  3. Import the source file into LibreOffice Draw, save it as an ODG file, translate it, then export it to PDF as needed.

Note: the above information applies only to PDF files with a text layer. If you have a PDF file made of scanned pages (sometimes such files are referred to as ‘dead’ PDFs), you need to use an OCR (optical character recognition) program to recognize the text and convert it to a format that can be handled by OmegaT.

2. Other file formats

Other plain text or formatted text file formats suitable for processing in OmegaT may also exist.

External tools can be used to convert files to supported formats. The translated files will then need to be converted back to the original format. For example, if you have an outdated version of Microsoft Word that does not handle the ODT format, you can use the following round trip for Word files with the DOC extension:

  • import the file into ODF writer

  • save the file in ODT format

  • translate it into the target ODT file

  • load the target file in ODF writer

  • save the file as a DOC file

The quality of formatting of the translated file will depend on the quality of the round-trip conversion. Before proceeding with such conversions, be sure to test all options. Check the OmegaT home page for an up-to-date listing of auxiliary translation tools.
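The two conversion steps of such a round trip can also be scripted rather than done by hand. The sketch below assumes that LibreOffice is installed and that its `soffice` executable is on the PATH; the helper names and the file and folder names in the usage note are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(source: Path, target_format: str, out_dir: Path) -> list[str]:
    """Build a headless LibreOffice conversion command (assumes `soffice` is on PATH)."""
    return [
        "soffice",
        "--headless",                    # run without opening a window
        "--convert-to", target_format,   # e.g. "odt" or "doc"
        "--outdir", str(out_dir),
        str(source),
    ]


def convert(source: Path, target_format: str, out_dir: Path) -> Path:
    """Run the conversion and return the path where the output is expected."""
    subprocess.run(build_convert_command(source, target_format, out_dir), check=True)
    return out_dir / f"{source.stem}.{target_format}"
```

A round trip would then call `convert(Path("report.doc"), "odt", Path("source"))` before translating and `convert(Path("target/report.odt"), "doc", Path("delivery"))` afterwards. As with the manual procedure, test the result on a sample file first, since the quality of the formatting depends on the conversion.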

3. Right to left languages

Justification of source and target segments depends upon the project languages. By default, left justification is used for Left-To-Right (LTR) languages and right justification for Right-To-Left (RTL) languages. You can toggle between different display modes by pressing Shift+Ctrl+O (this is the letter O and not the numeral 0). The Shift+Ctrl+O toggle has three states:

  • default justification, that is as defined by the language

  • left justification

  • right justification

Using the RTL mode in OmegaT has no influence whatsoever on the display mode of the translated documents created in OmegaT. The display mode of the translated documents must be changed within the application (such as Microsoft Word) commonly used to display or modify them (check the relevant manuals for details). Pressing Shift+Ctrl+O changes both text input and display in OmegaT. It can be applied separately to each of the three panes (Editor, Fuzzy Matches and Glossary) by clicking on the pane and toggling the display mode. It can also be used in all the input fields found in OmegaT, for example in the search window or in the segmentation rules.

Note for Mac OS X users: use the Shift+Ctrl+O shortcut, not Cmd+Ctrl+O.

3.1. Mixing RTL and LTR strings in segments

When writing purely RTL text, the default (LTR) view may be used. In many cases, however, it is necessary to embed LTR text in RTL text. Examples include OmegaT tags, product names that must be left in the LTR source language, placeholders in localization files, and numbers in text. In cases like these it becomes necessary to switch to RTL mode, so that the RTL (in fact bidirectional) text is displayed correctly. It should be noted that when OmegaT is in RTL mode, both source and target are displayed in RTL mode. This means that if the source language is LTR and the target language is RTL, or vice versa, it may be necessary to toggle back and forth between RTL and LTR modes to view the source and enter the target easily in their respective modes.
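Whether a segment is "in fact bidirectional" can be checked from the Unicode bidirectional class of its characters. The following sketch uses Python's standard `unicodedata` module; the function names are hypothetical, and it only looks at strong LTR and RTL letters, so digits and punctuation on their own do not count as either direction.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left letters (Hebrew-type, Arabic-type)
LTR_CLASSES = {"L"}        # strong left-to-right letters


def directions(segment: str) -> set[str]:
    """Return the strong directions ("RTL", "LTR") found in a segment."""
    found = set()
    for ch in segment:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            found.add("RTL")
        elif bidi_class in LTR_CLASSES:
            found.add("LTR")
    return found


def is_mixed(segment: str) -> bool:
    """True if the segment contains both strong RTL and strong LTR characters."""
    return directions(segment) == {"RTL", "LTR"}
```

A Hebrew or Arabic segment that contains a Latin-script product name, or a tag containing Latin letters, is reported as mixed. These are the segments for which switching to RTL mode is most likely to be needed.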

3.2. OmegaT tags in RTL segments

As stated above, OmegaT tags are LTR strings. When translating between RTL and LTR languages, correctly reading the tags from the source and entering them properly in the target may require the translator to toggle between LTR and RTL modes numerous times.

If the document allows, the translator is strongly encouraged to remove style information from the original document so that as few tags as possible appear in the OmegaT interface. Follow the indications given in Hints for tags management. Frequently validate tags (see Tag validation) and produce translated documents (see below and Menu) at regular intervals to make it easier to catch any problems that arise. A hint: translating a plain text version of the text and adding the necessary style in the relevant application at a later stage may turn out to be less hassle.
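To show the kind of problem that frequent tag validation catches, here is a rough sketch that compares the tags in a source segment with those in its translation. It is not OmegaT's Tag validation, which remains the authoritative check. The sketch assumes tags of the form `<f0>`, `</f0>` or `<br1/>`, uses hypothetical names, and only counts tags without checking their order or nesting.

```python
import re
from collections import Counter

# Assumed tag shape: lowercase letters followed by digits, e.g. <f0>, </f0>, <br1/>.
TAG_PATTERN = re.compile(r"</?[a-z]+\d+/?>")


def tag_problems(source: str, target: str) -> dict[str, list[str]]:
    """List tags that are missing from, or extra in, the target segment."""
    source_tags = Counter(TAG_PATTERN.findall(source))
    target_tags = Counter(TAG_PATTERN.findall(target))
    return {
        # Counter subtraction keeps only positive counts.
        "missing": sorted((source_tags - target_tags).elements()),
        "extra": sorted((target_tags - source_tags).elements()),
    }
```

Because the tags are LTR strings, a closing tag dropped while typing in RTL mode is easy to overlook on screen; a count-based comparison like this one flags it regardless of how the segment is displayed.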

3.3. Creating translated RTL documents

When the translated document is created, its display direction will be the same as that of the original document. If the original document was LTR, the display direction of the target document must be changed manually to RTL in its viewing application. Each output format has specific ways of dealing with RTL display; check the relevant application manuals for details.

For .docx files, however, a number of changes are made automatically:

  • Paragraphs, sections and tables are set to bidi
  • Runs (text elements) are set to RTL

To avoid changing the display parameters of the target files each time they are opened, it may be possible to change the display parameters of the source file so that they are inherited by the target files. Such modifications are possible in ODF files, for example.