Chapter 5. Storing mnoGoSearch data

Table of Contents
SQL storage types
Cache mode storage
mnoGoSearch cluster
mnoGoSearch performance issues
Oracle notes
IBM DB2 notes

SQL storage types

Various modes of word storage

The word storage modes currently supported by mnoGoSearch are "single", "multi" and "blob". The default mode is "single". The mode is selected with the DBMode parameter of the DBAddr command in both indexer.conf and search.htm.


Examples:
DBAddr mysql://localhost/test/?DBMode=single
DBAddr mysql://localhost/test/?DBMode=multi
DBAddr mysql://localhost/test/?DBMode=blob

Storage mode - single

When "single" is specified, all words are stored in a single table with the structure (url_id,word,weight), where url_id is the ID of the document and refers to the rec_id field in the "url" table. The word column has the variable char(32) SQL type. Each appearance of the same word in a document produces a separate record in the table. DBMode=single supports live updates: a document updated by "indexer" is immediately visible for searches with its new content.
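
The row-per-appearance layout can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the table name "dict", the placeholder weight values and the use of an in-memory SQLite database are hypothetical and are not taken from the mnoGoSearch schema.

```python
import sqlite3

# Hypothetical schema modelled on the (url_id, word, weight) structure
# described above; the real table and column definitions may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url (rec_id INTEGER PRIMARY KEY, url TEXT)")
conn.execute("CREATE TABLE dict (url_id INTEGER, word VARCHAR(32), weight INTEGER)")
conn.execute("INSERT INTO url (rec_id, url) VALUES (1, 'http://localhost/a.html')")

# Each appearance of a word becomes its own row, so "search" yields two rows.
# The weight column is filled with a placeholder value here.
words = ["fast", "search", "engine", "search"]
conn.executemany(
    "INSERT INTO dict (url_id, word, weight) VALUES (1, ?, 1)",
    [(w,) for w in words],
)

total_rows = conn.execute("SELECT COUNT(*) FROM dict").fetchone()[0]
rows_for_search = conn.execute(
    "SELECT COUNT(*) FROM dict WHERE word = 'search'"
).fetchone()[0]
print(total_rows, rows_for_search)
```

A document with many repeated words therefore produces many rows, which is the cost the other modes try to avoid.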

Storage mode - multi

If "multi" is selected, words are distributed across 256 separate tables using a hash function. The structure of these tables is almost the same as in "single" mode, but all appearances of a word are grouped into a single binary array instead of producing multiple records. This makes "multi" mode much faster than "single" mode. DBMode=multi supports live updates as well.
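
A minimal Python sketch of the two ideas, hashing a word to one of 256 tables and packing its positions into one binary array, might look as follows. The CRC32-based hash, the "dictXX" table names, the per-document grouping and the 16-bit little-endian packing are all assumptions made for illustration; the text above does not specify them.

```python
import struct
import zlib
from collections import defaultdict

NUM_TABLES = 256

def table_for(word):
    # Assumption: CRC32 modulo 256 stands in for the real hash function,
    # and "dictXX" is a hypothetical table naming scheme.
    return "dict%02X" % (zlib.crc32(word.encode("utf-8")) % NUM_TABLES)

def pack_document(url_id, words):
    # Collect the positions of every distinct word in the document.
    positions = defaultdict(list)
    for pos, word in enumerate(words, start=1):
        positions[word].append(pos)
    # One record per distinct word: positions packed as unsigned 16-bit ints.
    return {
        word: (table_for(word), url_id, struct.pack("<%dH" % len(p), *p))
        for word, p in positions.items()
    }

records = pack_document(1, ["fast", "search", "engine", "search"])
for word, (table, url_id, blob) in sorted(records.items()):
    print(word, table, url_id, blob.hex())
```

Four word appearances produce three records here, because the two appearances of "search" share one packed array.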

Storage mode - blob

DBMode=blob is the fastest mode currently available in mnoGoSearch for both indexing and searching. DBMode=blob is known to work with DB2, Mimer, MSSQL, MySQL, PostgreSQL, Oracle, Sybase and Firebird/Interbase. It is currently not supported with SQLite.

When DBMode=blob is selected, words are stored in a single table "bdict" with the structure (word, secno, intag), where intag is a binary array that holds information about all documents in which the word appears (using 32-bit document IDs), as well as the positions of the word in each document (for phrase search). Words from different sections (e.g. title and body) are written in separate records, which optimizes searches like "find only in title".

This data structure is highly optimized for search, but it is poorly suited to updates, so indexing is done in two steps. The first step is crawling the web space by running "indexer". During this step, indexer puts word information about each document into the table "bdicti".

The second step is creating a fast search index from the information collected during the first step. The search index is created by launching "indexer -Eblob". During this step, indexer loads word information from the table "bdicti", groups together all appearances of the same word in different documents and writes the word information into the table "bdict". Additional arrays of data are also written into the "bdict" table:

  • #rec_id - a list of 32-bit document IDs

  • #last_mod_time - an array of 32-bit "Last-Modified" values (in Unix timestamp format) - for fast limiting searches by date.

  • #pop_rank - an array of 32-bit float popularity rank values.

  • #site_id - an array of 32-bit site IDs, for GroupBySite.

  • #limit#name - a list of document IDs covered by a user defined limit with the name "name". A separate '#limit#xxx' record is created for every user defined Limit configured in indexer.conf.

  • #ts - the time when "indexer -Eblob" was last executed, as a textual representation of a Unix timestamp.

  • #version - a string representing the version ID of "indexer" which created search index. For example, indexer from mnoGoSearch 3.3.0 will write the string "30300".

Note that creating a fast search index is also possible for databases using DBMode=single and DBMode=multi. This is useful when you need to switch to DBMode=blob quickly because search performance has become bad, without even having to re-index your web space. However, in these cases consider switching completely to DBMode=blob in both indexer.conf and search.htm, and running indexing from the very beginning.

The drawback of DBMode=blob is that it does not support live updates. New or updated documents crawled by "indexer" are not visible for search until "indexer -Eblob" is run again. Creating the search index takes about 6 minutes on a collection of 200000 HTML documents with a total size of 10 GB (on an Intel Core Duo 2.13GHz CPU), which can be unacceptably long for some applications (especially when using mnoGoSearch as a full-text indexing engine for SQL tables using HTDB).

Live index updates with DBMode=blob

Starting from the 3.3.1 release, mnoGoSearch supports live updates with DBMode=blob by reading word information for newly added or updated documents directly from the table "bdicti". This allows adding or updating up to about 10,000 documents without having to run "indexer -Eblob". To activate live updates, add the LiveUpdates=yes parameter to the DBAddr command.

Example:


DBAddr mysql://root@localhost/test/?DBMode=blob&LiveUpdates=yes
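
The general idea of combining a prebuilt index with freshly crawled documents can be sketched in Python as a generic "main index plus delta" lookup. This is not necessarily how mnoGoSearch implements LiveUpdates; the function name and both data structures are hypothetical.

```python
def search(word, built_index, fresh_docs):
    """Return sorted IDs of documents containing the word.

    built_index: word -> set of doc IDs, as of the last index build.
    fresh_docs:  doc ID -> set of words, for documents crawled since then.
    """
    hits = set(built_index.get(word, set()))
    # A fresh copy of a document overrides whatever the built index says.
    hits -= set(fresh_docs)
    hits |= {doc_id for doc_id, words in fresh_docs.items() if word in words}
    return sorted(hits)

built = {"search": {1, 2}, "engine": {2}}
fresh = {2: {"engine", "crawler"}, 3: {"search"}}  # doc 2 updated, doc 3 new

live_hits = search("search", built, fresh)
print(live_hits)
```

In this sketch the updated document 2 no longer matches "search", and the new document 3 does, without rebuilding the main index. The cost of scanning the fresh documents grows with their number, which is consistent with a practical limit on how many documents can be handled this way.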

Extended features of DBMode=blob

Since 3.3.0, "indexer -Eblob" can be used in combination with URL and Tag limits, as well as with a user defined limit described by a Limit command. This makes it possible to generate a search index over a subset of the documents collected by indexer while crawling.

Examples:


indexer -Eblob -u %/subdir/%
indexer -Eblob -t tag
indexer -Eblob --fl=limitname

Since 3.2.36, an additional "indexer -Erewriteurl" parameter is available. When indexer is invoked with this parameter, it rewrites URL data for DBMode=blob. This is useful for a very quick rewrite of URL data after adding "Deflate=yes", without touching word information.

Maximum amount of words collected from a document

mnoGoSearch uses 16-bit integers to store word positions both on disk (in SQL tables) and in memory (e.g. when searching). In versions prior to 3.3.0, it was possible to store up to 64K words from a single document. Starting from version 3.3.0, word positions are enumerated per section, starting from the beginning of each section separately. This extended the limit from 64K words per document to 64K words per section.
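
The effect of the change can be worked through with a little arithmetic in Python. The section names and word counts below are made up for illustration, and the sketch assumes that every one of the 65536 values of a 16-bit integer is usable as a position.

```python
MAX_POSITIONS = 2 ** 16  # 65536 distinct values fit in a 16-bit integer

def storable_words(section_lengths, per_section):
    if per_section:
        # 3.3.0 and later: each section has its own position counter.
        return sum(min(n, MAX_POSITIONS) for n in section_lengths.values())
    # Before 3.3.0: one position counter for the whole document.
    return min(sum(section_lengths.values()), MAX_POSITIONS)

# Hypothetical document with a very long body.
doc = {"title": 12, "body": 90000, "meta": 40}

before = storable_words(doc, per_section=False)
after = storable_words(doc, per_section=True)
print(before, after)
```

With these numbers the limit is still reached inside the long body section, but words in the other sections are no longer crowded out by it.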

Substring search notes

"single", "multi" and "blob" modes support substring search. An SQL query containing a LIKE predicate is executed internally in order to do substring search. Substring search is usually slower than searching for full words.