Segmenters for Chinese, Thai and Japanese languages

Traditional Chinese, Thai and Japanese writing has no spaces between the words in a phrase, unlike Western languages. Thus, when indexing documents in these languages, the indexer needs to segment phrases into words.

Japanese language phrase segmenter

For Japanese language phrase segmenting, either ChaSen, a morphological analysis system for the Japanese language, or MeCab, a Japanese morphological analyzer, is used. Thus, one of these systems must be installed before you configure and build mnoGoSearch.

To enable Japanese language phrase segmenting, pass the --with-chasen or --with-mecab switch to configure.

Chinese language phrase segmenter

For Chinese language phrase segmenting, a frequency dictionary of Chinese words is used. The segmenting itself is done by a dynamic programming method that maximizes the cumulative frequency of the produced words.
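The idea of segmenting by dynamic programming can be illustrated with a minimal Python sketch. It is not the mnoGoSearch implementation: the function name segment, the max_word_len limit, the toy frequencies, and the rule that an unknown single character scores 0 are all assumptions made for illustration, and the real scoring may differ.

```python
def segment(text, freq, max_word_len=4):
    """Split text into words so that the sum of word frequencies is maximal.

    freq maps known words to their frequencies (hypothetical values here).
    A single character missing from freq is allowed with score 0, so that
    some segmentation always exists.
    """
    n = len(text)
    # best[i] = (best cumulative frequency for text[:i], start of its last word)
    best = [(0, 0)] + [None] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in freq:
                score = freq[word]
            elif end - start == 1:
                score = 0  # assumption: unknown single characters are kept
            else:
                continue   # unknown multi-character strings are not words
            total = best[start][0] + score
            if best[end] is None or total > best[end][0]:
                best[end] = (total, start)
    # Walk back from the end of the text to recover the chosen words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


# Toy dictionary with made-up frequencies, for illustration only.
toy_freq = {"研究": 50, "研究生": 20, "生命": 40, "生": 5, "命": 5, "起源": 30}
result = segment("研究生命起源", toy_freq)
print(result)
```

With these toy frequencies the split into 研究, 生命 and 起源 has a cumulative frequency of 120, which is higher than the alternative starting with 研究生 (20 + 5 + 30 = 55), so the sketch prefers the former.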

To enable Chinese language phrase segmenting, you need to enable the matching charset support while configuring mnoGoSearch: GB2312 if you want to use mandarin.freq, a Simplified Chinese dictionary, or Big5 if you want to use TraditionalChinese.freq, a Traditional Chinese dictionary. You also need to specify the frequency dictionary of Chinese words with LoadChineseList in the indexer.conf file.


LoadChineseList [charset dictionaryfilename]
The GB2312 charset and the mandarin.freq dictionary are used by default.

Thai language phrase segmenter

For Thai language phrase segmenting, a frequency dictionary of Thai words is used. The segmenting itself is done in the same way as for the Chinese language.

To enable Thai language phrase segmenting, you need to specify the frequency dictionary of Thai words with LoadThaiList in the indexer.conf file.


LoadThaiList [charset dictionaryfilename]
The TIS-620 charset and the thai.freq dictionary are used by default.