Chapter 11. Working with plain text

1. Default encoding
2. The OmegaT solution

1. Default encoding

Plain text files (in most cases files with a .txt extension) contain only textual information and offer no clearly defined way to tell the computer which language, and therefore which encoding, they use. The most OmegaT can do in such a case is to assume that the text is written in the same language the computer itself uses. This is no problem for files encoded in Unicode using a 16-bit character encoding. If the text is encoded in 8 bits, however, you can be faced with the following awkward situation: instead of displaying the text in Japanese characters...

...the system displays it like this, for instance:

In this example, the computer running OmegaT has Russian as its default language, and thus shows the characters in the Cyrillic alphabet rather than in kanji.
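The effect described above can be reproduced outside OmegaT. The following is a minimal sketch, not OmegaT's own code: it assumes the Japanese text was saved in EUC-JP and that the Russian system decodes it as KOI8-R. Neither encoding is named in this chapter; both are chosen here only for illustration, as is the sample word.

```python
# Hypothetical sample text: the Japanese word for "translation".
original = "翻訳"

# Assumption: the file was saved in EUC-JP, an 8-bit-oriented Japanese encoding.
raw = original.encode("euc_jp")

# Assumption: the computer's default encoding is the Cyrillic KOI8-R.
# Every byte is valid in KOI8-R, so decoding raises no error;
# it silently produces the wrong characters instead.
garbled = raw.decode("koi8_r")

# Decoding with the encoding the file was actually saved in restores the text.
restored = raw.decode("euc_jp")

print("saved as  :", original)
print("shown as  :", garbled)
print("fixed with:", restored)
```

Note that the bytes in the file never change. Only the interpretation differs, which is why choosing the right encoding is enough to fix the display.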

2. The OmegaT solution

There are basically three ways to address this problem in OmegaT. They all involve the application of file filters in the Options menu.

Change the encoding of your files to Unicode

Open your source file in a text editor that correctly interprets its encoding and save the file in UTF-8 encoding. Then change the file extension from .txt to .utf8. OmegaT will automatically interpret the file as a UTF-8 file. This is the most common-sense alternative, sparing you problems in the long run.

Specify the encoding for your plain text files (i.e. files with a .txt extension)

In the Text files section of the file filters dialog, change the Source File Encoding from <auto> to the encoding that corresponds to your source .txt file, for instance to the matching Japanese encoding for the above example.

Change the extensions of your plain text source files

For instance, change the extension from .txt to .jp for Japanese plain texts. Then, in the Text files section of the file filters dialog, add a new Source Filename Pattern (*.jp for this example) and select the appropriate parameters for the source and target encoding.
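The first option, re-saving a file as UTF-8 with a .utf8 extension, can also be scripted when many files are involved. The sketch below is one possible way to do it and is not part of OmegaT; the function name `convert_to_utf8` is hypothetical, and you must supply the correct source encoding yourself, since it cannot be reliably guessed from the file.

```python
from pathlib import Path


def convert_to_utf8(source: Path, source_encoding: str) -> Path:
    """Write a UTF-8 copy of `source` next to it, with a .utf8 extension.

    `source_encoding` is the encoding the file was actually saved in,
    given as a Python codec name (for example "shift_jis").
    """
    # Work on raw bytes so that line endings are preserved exactly.
    text = source.read_bytes().decode(source_encoding)
    target = source.with_suffix(".utf8")
    target.write_bytes(text.encode("utf-8"))
    return target
```

For example, `convert_to_utf8(Path("manual.txt"), "shift_jis")` would create `manual.utf8` alongside the original file, which is left untouched. The file name and encoding here are illustrative only.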

By default, OmegaT provides the following short list of extensions to make it easier to deal with some plain text files:

  • .txt files are automatically (<auto>) interpreted by OmegaT as being encoded in the computer's default encoding.

  • .txt1 files are interpreted as being encoded in ISO-8859-1, which covers most Western European languages.

  • .txt2 files are interpreted as being encoded in ISO-8859-2, which covers most Central and Eastern European languages.

  • .utf8 files are interpreted by OmegaT as being encoded in UTF-8 (an encoding that covers almost all languages in the world).

You can check this yourself by selecting File Filters in the Options menu. For example, when you have a Czech text file (very probably encoded in ISO-8859-2), you just need to change the extension from .txt to .txt2 and OmegaT will interpret its contents correctly. And of course, if you wish to be on the safe side, consider converting this kind of file to Unicode, i.e. to the .utf8 file format.
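The default list above amounts to a lookup from file extension to encoding. The following sketch models that lookup to make the Czech example concrete. It is an illustration only, not OmegaT's implementation: the names `DEFAULT_ENCODINGS` and `encoding_for` are hypothetical, and the use of `locale.getpreferredencoding` to stand in for <auto> is an assumption.

```python
import locale
from pathlib import Path

# Extension-to-encoding table mirroring the default list in this chapter.
# None stands for <auto>, i.e. the computer's default encoding.
DEFAULT_ENCODINGS = {
    ".txt": None,
    ".txt1": "iso-8859-1",
    ".txt2": "iso-8859-2",
    ".utf8": "utf-8",
}


def encoding_for(filename: str) -> str:
    """Return the encoding that would be assumed for `filename`."""
    suffix = Path(filename).suffix.lower()
    if suffix not in DEFAULT_ENCODINGS:
        raise ValueError(f"no default encoding for extension {suffix!r}")
    # Fall back to the system default when the entry is <auto>.
    return DEFAULT_ENCODINGS[suffix] or locale.getpreferredencoding(False)


# A Czech sample sentence, saved in ISO-8859-2 as in the example above.
czech_bytes = "Příliš žluťoučký kůň".encode("iso-8859-2")

# Renaming the file to .txt2 selects ISO-8859-2, so the text decodes correctly.
decoded = czech_bytes.decode(encoding_for("text.txt2"))
print(decoded)
```

With a plain .txt extension the result would instead depend on the computer's default encoding, which is exactly the uncertainty the .txt2 and .utf8 extensions are meant to remove.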