Fuzzy search

Ispell

When mnoGoSearch is used with ispell support, all words are normalized. This makes it possible to find different grammatical forms of the same word. During indexing, all words are stored "as is" in the database. During the search, all forms of the given keyword are selected and taken into account. For example, the search front-end will also try to find the word "test" if "testing" or "tests" is given in the search query.

Two types of ispell files

mnoGoSearch understands two types of ispell files: affix files and dictionaries. An ispell affix file contains rules for forming words and has approximately the following format:


Flag V:
       E   > -E, IVE      # As in create > creative
      [^E] > IVE          # As in prevent > preventive
Flag *N:
       E   > -E, ION      # As in create > creation
       Y   > -Y, ICATION  # As in multiply > multiplication
     [^EY] > EN           # As in fall > fallen
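
To make the rule notation concrete, here is a minimal Python sketch that applies the suffix rules shown above to a word. The (condition, strip, append) tuple layout and the names RULES and apply_flag are assumptions made for this illustration; they do not describe how mnoGoSearch stores or applies affix rules internally.

```python
import re

# Suffix rules from the "Flag V" and "Flag *N" examples above, written as
# (condition, strip, append). This layout is an assumption of the sketch.
RULES = {
    "V": [("E", "E", "IVE"), ("[^E]", "", "IVE")],
    "N": [("E", "E", "ION"), ("Y", "Y", "ICATION"), ("[^EY]", "", "EN")],
}


def apply_flag(word, flag):
    """Return the forms produced by each rule of `flag` whose condition
    matches the end of `word`."""
    word = word.upper()
    forms = []
    for condition, strip, append in RULES[flag]:
        if re.search(condition + "$", word):
            # Remove the stripped ending (if any), then add the new suffix.
            stem = word[: len(word) - len(strip)] if strip else word
            forms.append(stem + append)
    return forms


print(apply_flag("create", "V"))    # ['CREATIVE']
print(apply_flag("prevent", "V"))   # ['PREVENTIVE']
print(apply_flag("multiply", "N"))  # ['MULTIPLICATION']
print(apply_flag("fall", "N"))      # ['FALLEN']
```

Each rule reads as "if the word ends with the condition, remove the part after the minus sign and append the new ending", which is why "create" loses its final E before IVE is added.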

An ispell dictionary file contains the words themselves, each optionally followed by a slash and its affix flags, and has the following format:


wop/S
word/DGJMS
wordage/S
wordbook
wordily
wordless/P

Using Ispell

To make mnoGoSearch support ispell, you must specify Affix and Spell commands in the search.htm file. The format of commands:


Affix [lang] [charset] [ispell affixes file name]
Spell [lang] [charset] [ispell dictionary filename]

The first parameter of both commands is a two-letter language abbreviation. The second parameter is the character set of the ispell file. The third one is the file name. File names are relative to mnoGoSearch's /etc directory. Absolute paths can also be specified.

Note: Simultaneous loading of several languages is supported, e.g.:


Affix en iso-8859-1 en.aff
Spell en iso-8859-1 en.dict
Affix de iso-8859-1 de.aff
Spell de iso-8859-1 de.dict

...will load support for both English and German languages.

Customizing dictionaries

Your site may contain rare words that are not in the Ispell dictionaries. You can list such words in a plain text file with the following format (one word per line):


rare.dict:
----------
webmaster
intranet
.......
www
http
---------
			

You may also use ispell flags in this file (refer to the Ispell documentation for the flags). This saves you from listing the same word with different endings in the rare words file, for example "webmaster" and "webmasters". Pick a word from an existing Ispell dictionary that follows the same inflection rules and copy its flags. For example, the English dictionary has this line:

postmaster/MS

So, webmaster with MS flags will probably be OK:

webmaster/MS

Then copy this file to the /etc directory of mnoGoSearch and add it by using the Spell command in the Ispell tab of mnoGoSearch.

During the next re-indexing of all documents, the new words will be treated as correctly spelled. Only the genuinely misspelled words will remain.

Synonyms

Starting from mnoGoSearch version 3.2, synonyms-based fuzzy search is supported.

Synonyms files are installed into the etc/synonym subdirectory of mnoGoSearch's installation.

To enable synonyms, add search template commands like Synonym <filename> to search.htm, e.g.:


Synonym synonym/english.syn
Synonym synonym/russian.syn
  

File names are relative to the etc directory of mnoGoSearch's installation, or absolute if they begin with "/".

Please feel free to send us your own synonym lists.

Use the English synonym file as an example. The following two commands must be specified at the beginning of the file:


Language: en
Charset:  us-ascii

The further lines contain synonyms, one group of synonyms per line. For example:


car auto automobile

All words written on the same line are considered to be equal. If you type one of the words in the search form, all other words from the same line are also searched.

An optional "Mode" command can be used inside a synonym file. It understands three values: "roundtrip", "oneway" and "return", with "roundtrip" value as default.

If "Mode: oneway" is specified then the words written on the same line are not considered as equal synonyms anymore. Only the leftmost word is expanded to other words. For example:


Mode: oneway
car auto automobile

Searching for "car" will also search for "auto" and "automobile", but searching for "auto" will find neither "car" nor "automobile", and searching for "automobile" will find neither "car" nor "auto".

If "Mode: return" is specified then all words are expanded only to the leftmost word, while the leftmost word itself is not expanded. For example:


Mode: return
car auto automobile

Searching for "car" will search for neither "auto" nor "automobile", but searching for "auto" will also search for "car", and searching for "automobile" will also search for "car".

It's possible to use multiple "Mode" commands in the same synonym file and thus switch between "oneway", "return" and "roundtrip" style of synonyms for different lines:


Mode: roundtrip
colour color
Mode: oneway
car auto automobile
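
The three modes can be summarized with a short Python sketch that reads synonym lines and builds a map from each query word to the extra words it expands to. This only models the behavior described above. The function name load_synonyms and the dictionary layout are assumptions for illustration, and quoted phrase synonyms are not handled.

```python
def load_synonyms(lines):
    """Build a word -> set of expansion words map from synonym file lines."""
    mode = "roundtrip"  # the default mode according to the text above
    expansions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        lowered = line.lower()
        if lowered.startswith(("language:", "charset:")):
            continue  # header commands carry no synonyms
        if lowered.startswith("mode:"):
            mode = lowered.split(":", 1)[1].strip()
            continue
        words = line.split()
        head, rest = words[0], words[1:]
        if mode == "roundtrip":
            # Every word expands to every other word on the line.
            for w in words:
                expansions.setdefault(w, set()).update(x for x in words if x != w)
        elif mode == "oneway":
            # Only the leftmost word is expanded.
            expansions.setdefault(head, set()).update(rest)
        elif mode == "return":
            # Every word except the leftmost expands to the leftmost word.
            for w in rest:
                expansions.setdefault(w, set()).add(head)
    return expansions


syn = load_synonyms([
    "Mode: roundtrip",
    "colour color",
    "Mode: oneway",
    "car auto automobile",
])
print(sorted(syn["color"]))  # ['colour']
print(sorted(syn["car"]))    # ['auto', 'automobile']
print("auto" in syn)         # False
```

Because the current mode is tracked line by line, a later "Mode" command affects only the lines that follow it, as in the mixed example above.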

Since 3.2.34, mnoGoSearch also supports a simple type of phrase synonyms:


president "george bush"

That means, if you type the word "president", the phrase "george bush" will also be searched.

Currently, only word-to-phrase synonyms are supported. Phrase-to-phrase and phrase-to-word synonyms do not work yet. That is, if you type the phrase "george bush" (in quotes), the word "president" will not be searched. This feature will be implemented later.

Dehyphenation

Searching for both hyphenated and dehyphenated compound words at the same time is also possible. Refer to the description of the Dehyphenate command for details.

Loading synonyms and word forms from SQL database

It is also possible to load synonyms or word forms from the database. Refer to the description of the SQLWordForms command for details.

Dumping ispell data

To dump ispell data in a format suitable for loading into an SQL table for further use with SQLWordForms, copy all Affix and Spell commands from search.htm into indexer.conf, then run "indexer -Edumpspell > dump.txt". indexer will write all word forms to the given "dump.txt" file, one "word/form" pair per line:


...
abate/abate
abate/abating
abate/abated
abate/abater
abate/abates
...

Use database specific tools or SQL syntax to load the newly created dump file into a SQL table, e.g. with MySQL:


CREATE TABLE spell
(
  word varchar(64) not null,
  form varchar(64) not null,
  key(word),
  key(form)
);
LOAD DATA INFILE 'dump.txt' INTO TABLE spell FIELDS TERMINATED BY '/';
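
For databases without a bulk loader comparable to LOAD DATA INFILE, the dump can be loaded with a small script instead. The following Python sketch uses an in-memory SQLite database as a stand-in. The helper names parse_dump and load_into_sqlite and the index names are assumptions for illustration, and the table layout simply mirrors the MySQL example above.

```python
import sqlite3


def parse_dump(lines):
    """Yield (word, form) pairs from 'word/form' lines, skipping other lines."""
    for line in lines:
        line = line.strip()
        if line.count("/") != 1:
            continue  # not a single word/form pair
        word, form = line.split("/")
        if word and form:
            yield word, form


def load_into_sqlite(conn, lines):
    """Create the spell table if needed and insert all parsed pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spell ("
        "word VARCHAR(64) NOT NULL, form VARCHAR(64) NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS spell_word ON spell (word)")
    conn.execute("CREATE INDEX IF NOT EXISTS spell_form ON spell (form)")
    conn.executemany(
        "INSERT INTO spell (word, form) VALUES (?, ?)", parse_dump(lines)
    )
    conn.commit()


# Sample lines in the dump format shown above.
sample = ["abate/abate", "abate/abating", "abate/abated",
          "abate/abater", "abate/abates"]
conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, sample)
rows = conn.execute(
    "SELECT form FROM spell WHERE word = ? ORDER BY form", ("abate",)
).fetchall()
print([form for (form,) in rows])
```

To load a real dump, pass an open file object for "dump.txt" instead of the sample list.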

Transliteration

Starting from 3.2.34, mnoGoSearch supports transliteration.

Pass the tl=yes parameter to search.cgi to activate transliteration.

Currently, Latin-to-Cyrillic and Cyrillic-to-Latin transliteration is implemented. That is, if you type a word in Latin script, a Cyrillic word with the same spelling is also searched, and vice versa.

Searching numbers

Starting from 3.2.36, mnoGoSearch supports numeric operators.

When UseNumericOperators is set to "yes", the "<" and ">" signs are treated as numeric comparison operators. For example, "<100" finds all documents which have numbers less than 100 in their body, title or other sections, according to the "wf" settings. Numeric operators currently work only with databases which support automatic comparison between VARCHAR and INT and do not require an explicit type cast. MySQL, PostgreSQL and SQLite are known to work.

If you specify two operators in the same search query, e.g. ">100 <200", then the documents having a number greater than 100 and, at the same time, a number less than 200 will be found. The two conditions may be satisfied by different numbers, so the above query does not strictly mean "a number between 100 and 200". A "between"-like operator will be implemented later.
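
The difference from a true "between" test is easy to see in a small Python model of the semantics described above. This is only an illustration of the matching rule, not the SQL that mnoGoSearch issues. The function names numbers_in and matches are assumptions, and the sketch handles non-negative integers only.

```python
import re


def numbers_in(text):
    """Extract all non-negative integers that appear in the text."""
    return [int(n) for n in re.findall(r"\d+", text)]


def matches(text, query):
    """True if, for every '<N' or '>N' term in the query, the text contains
    at least one number satisfying that term."""
    nums = numbers_in(text)
    for op, value in re.findall(r"([<>])\s*(\d+)", query):
        limit = int(value)
        if op == "<" and not any(n < limit for n in nums):
            return False
        if op == ">" and not any(n > limit for n in nums):
            return False
    return True


print(matches("price 150", ">100 <200"))         # True
print(matches("items 50 and 500", ">100 <200"))  # True, yet nothing lies between 100 and 200
print(matches("only 50", ">100 <200"))           # False
```

The second call shows the caveat: 500 satisfies ">100" and 50 satisfies "<200", so the document matches even though it has no number in the 100 to 200 range.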

Accent insensitive search

When doing searches, mnoGoSearch relies on the database collation settings, thus accent insensitive searches will be available if the database software supports and is configured to use an accent insensitive collation.

Accent insensitive search with MySQL

To configure mnoGoSearch for accent insensitive searches for German, French, Italian, Portuguese and some other Western languages, use the latin1_german1_ci collation when creating the database you're going to use with mnoGoSearch:


CREATE DATABASE mnogosearch CHARACTER SET latin1 COLLATE latin1_german1_ci;

With this collation, MySQL ignores all diacritic marks, so, for example, searches for French "cote" will also find "coté" and vice versa.

Accent insensitive search with Firebird

To configure mnoGoSearch for accent insensitive searches for German, French, Italian, Portuguese etc. with Firebird, use the PT_BR collation. Firebird doesn't have the global database default collation, so it must be set in the CREATE TABLE statement for the table "bdict". In order to do so, open the file /usr/local/mnogosearch/share/ibase/create.blob.sql in your favorite text editor and add CHARACTER SET and COLLATE clauses into the "word" column definition:


CREATE TABLE bdict (
        word VARCHAR(64) CHARACTER SET ISO8859_1 NOT NULL COLLATE PT_BR,
        ...
);

Highlighting collation matches

Starting with version 3.3.3, mnoGoSearch can recognize the word forms returned by the underlying SQL collation and use them for generating excerpts and highlighting. For example, if your database is configured to use a German DIN2-based collation (e.g. latin1_german2_ci in MySQL), then searches for "gross" will also return "groß", and both word forms will be highlighted. Prior to 3.3.3, only the exact word forms were used for excerpts and highlighting.

Note: Highlighting collation matches works only with DBMode=blob. Adding this feature for DBMode=single and DBMode=multi would seriously degrade search performance.

Accent insensitive search with other databases

To make accent insensitive searches possible with databases not supporting accent insensitive collations, mnoGoSearch provides the StripAccents yes command. When StripAccents is set to yes, mnoGoSearch converts all accented letters to their non-accented counterparts. Conversion happens both during indexing (before storing data into the word index), and during search (before looking up in the word index). For example, the French word "coté" is converted into "cote".

Removing accents is only done for the word index. Accents are not removed from section values, so sections (e.g. "title", "body", "CachedCopy") are stored with their original accented letters, providing better search results presentation.
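
The idea behind StripAccents can be sketched in Python with Unicode decomposition. This only approximates the behavior described above: mnoGoSearch uses its own conversion, which may treat some letters differently (for example, "ß" has no decomposition and is left unchanged here). The names strip_accents, index_document, search and the in-memory index layout are assumptions for illustration.

```python
import unicodedata


def strip_accents(word):
    """Return the word with combining diacritic marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_document(index, sections, doc_id, title):
    """Store stripped words in the word index, but keep the title as written."""
    sections[doc_id] = title  # section value keeps its accents
    for word in title.lower().split():
        index.setdefault(strip_accents(word), set()).add(doc_id)


def search(index, query_word):
    """Strip accents from the query word before the index lookup."""
    return index.get(strip_accents(query_word.lower()), set())


index, sections = {}, {}
index_document(index, sections, 1, "Le coté gauche")
print(strip_accents("coté"))        # cote
print(sorted(search(index, "cote")))  # [1]
print(sections[1])                  # Le coté gauche
```

Stripping on both the indexing and the search side is what makes "cote" and "coté" meet at the same index key, while the stored title still shows the original spelling in results.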