mnoGoSearch 3.3.7 reference manual

Full-featured search engine software


Table of Contents
1. Introduction
mnoGoSearch Features
Where to get mnoGoSearch.
Disclaimer
Authors
Contributors (in no particular order)
Frequently Asked Questions
2. Installation
SQL database requirements
Supported operating systems
Tools required for installation
Installing mnoGoSearch
Running search.cgi from inetd/xinetd
Possible installation problems
Creating binary distribution
Installation registration
3. Indexing
Indexing in general
Configuration
Running indexer
SQL back-end notes
How to create SQL table structure
How to drop SQL table structure
Subsection control
How to clear database
Database Statistics
Link validation
Parallel indexing
Supported HTTP response codes
Content-Encoding support
indexer configuration
Specifying WEB space to be indexed
Aliases
ServerTable
FlushServerTable
External parsers
Extended indexing features
News extensions
Indexing SQL database tables (htdb: virtual URL scheme)
Indexing binaries output (exec: and cgi: virtual URL schemes)
Mirroring
Using syslog
Disabling Apache logging
Storing cached copies
Configuring cached copies
Using cached copies at search time
4. mnoGoSearch HTML parser
Tag parser
Special characters
META tags
Links
Comments
5. Storing mnoGoSearch data
SQL storage types
Various modes of words storage
Storage mode - single
Storage mode - multi
Storage mode - blob
Live index updates with DBMode=blob
Extended features of DBMode=blob
Maximum number of words collected from a document
Substring search notes
Cache mode storage
mnoGoSearch cluster
Introduction
How it works
Operations done on the database machines
What a typical XML response looks like
Operations done on the front-end machine
Cluster types
Installing and configuring a "merge" cluster
Installing and configuring a "distributed" cluster
Cluster limitations
mnoGoSearch performance issues
MySQL performance
Post-indexing optimization
Oracle notes
Introduction
Compilation, Installation and Configuration
IBM DB2 notes
6. Subsections
Categories
Tags
Tags in SQL version
7. Multiple languages support
Character sets
Supported character sets
Many languages in the same database
Character set conversion
Character set conversion at search time
Character set aliases
Document character set detection
Automatic character set guesser
Default character set
Default Language
Making multi-language search pages
How does it work?
Possible troubles
Segmenters for Chinese, Thai and Japanese languages
Japanese language phrase segmenter
Chinese language phrase segmenter
Thai language phrase segmenter
Multilingual servers support
8. Searching documents
Using search front-ends
Performing search
Search parameters
Changing the weights of different document parts at search time
Using front-end with an shtml page
Using several templates
Advanced boolean search
Restrict searched words to a section
Phrase search
Exact section match
How search handles expired documents
How to write search result templates
Template sections
Template operators
Includes in templates
Security issues
Designing search.html
How the results page is created
Your HTML
Forms considerations
Relative links in search.htm
Adding Search form to other pages
Relevancy
Ordering documents
Analyzing score values
Crosswords
Search query tracking
Search results cache
Fuzzy search
Ispell
Synonyms
Dehyphenation
Loading synonyms and word forms from SQL database
Dumping ispell data
Transliteration
Searching numbers
Accent insensitive search
9. Miscellaneous
Environment variables
Using libmnogosearch library
udm-config script
mnoGoSearch API
MySQL fulltext parser plugin
Database schema
Reporting bugs
Currently known bugs
Core dump reports
I. Reference
I. mnoGoSearch commands reference
AddType -- associates file names or extensions with mime types
Affix -- loads an Ispell affix file
Alias -- associates master and mirror sites
AliasProg -- calls external URL parser
Allow -- allows indexing of defined URLs
AlwaysFoundWord -- defines word that is always treated as found
AuthBasic -- defines basic HTTP authorization user name and password
BrowserCharset -- defines browser charset
Cache -- enables or disables caching of search results
CaseFolding -- choose alternative case mapping
Category -- defines documents category
CheckMP3 -- checks for MP3 meta information
CheckMP3Only -- check for MP3 meta information
CheckOnly -- checks for file existence only
CollectLinks -- enables or disables storing links between pages - for popularity rank.
CrossWords -- specifies whether to use crosswords
CustomLog -- logging to stdout using the given format
CVSIgnore -- enables or disables indexing internal CVS files
DateFactor -- giving less score to old documents
DateFormat -- defines date format
DBAddr -- sets database address
DefaultContentType -- defines default Content-Type
Dehyphenate -- enables searching for dehyphenated forms of compound words
DefaultLang -- defines default language
DetectClones -- enables or disables clone detection
Disallow -- disallows indexing defined URLs
DocMemCacheSize -- this command is obsolete
DocSizeWeight -- change document size impact on the document score
DocTimeOut -- defines maximal time for document downloading
ExcerptSize -- defines maximal length of excerpt
ExcerptStopword -- whether to highlight stopwords
ExcerptPadding -- defines excerpt padding length
FlushServerTable -- flushes server.active to inactive
FollowSymLinks -- whether to dereference symlinks
ForceIISCharset1251 -- assume windows-1251 charset
GuesserUseMeta -- enables or disables using meta tags
GroupBySite -- enables grouping search results by site
HlBeg -- configures search results highlighting
HlEnd -- configures search results highlighting
HoldBadHrefs -- defines timeout for holding bad URLs
HrefOnly -- scan HTML pages only for URLs
HTDBAddr -- describes a remote SQL data source connection string
HTDBDoc -- describes a query to fetch document content from a SQL source
HTDBLimit -- limits the number of document IDs fetched in a single HTDBList query
HTDBList -- describes a query to fetch document IDs from a SQL data source
HTTPHeader -- adds desired headers in indexer HTTP request
ImportEnv -- imports an environment variable
Include -- includes additional configuration file
Index -- prevents indexer from storing words into database
IndexIf -- allows indexing documents whose section matches the given pattern
IndexTime -- Enables or disables Last-Modified HTTP header processing.
IspellUsePrefixes -- allows using ispell prefixes while searching
LangMapFile -- loads language map for charset and language guesser
LangMapUpdate -- no description available yet
Limit -- describes a fast limit
LoadChineseList -- loads Chinese word frequency list
LoadTagInfo -- load tag values to display in search results
LoadThaiList -- loads Thai word frequency list
LoadURLInfo -- load section values to display in search results
LocalCharset -- defines local charset
Locale -- sets a desired locale
LogLevel -- Verbosity level
MaxDocSize -- defines maximal document size
MaxDocPerSite -- defines maximal number of documents to pick up from each site
MaxHops -- defines maximal path length in "mouse clicks"
MaxNetErrors -- defines maximal network errors
MaxWordLength -- defines maximal word length
Mime -- defines external parser for given mime-type
MinCoordFactor -- giving more score to documents having found words closer to the beginning
MinWordLength -- defines minimal word length
MirrorHeadersRoot -- defines root directory of mirrored document's headers
MirrorPeriod -- defines period for mirrored files
MirrorRoot -- defines root directory to enable sites mirroring
NetErrorDelayTime -- defines document processing delay
NewsExtensions -- enables news extensions
NoIndexIf -- disallows indexing documents whose section matches the given pattern.
NumSections -- specifying the number of sections configured in indexer.conf
WordDensityFactor -- giving more score to documents having higher word density
WordFormFactor -- giving more score to the original query word form (as opposed to generated synonym or ispell forms)
NumDistinctWordFactor -- giving more score to documents having more distinct words
NumWordFactor -- giving more score to documents having more found words
ParserTimeOut -- defines amount of time for parser execution
Period -- defines reindex period
PopRankFeedBack -- calculates sites weights
PopRankShowCntRatio -- PopRankShowCntRatio
PopRankShowCntWeight -- PopRankShowCntWeight
PopRankSkipSameSite -- skips links from same site
PopRankUseShowCnt -- PopRankUseShowCnt
PopRankUseTracking -- PopRankUseTracking
Proxy -- defines HTTP proxy address
ProxyAuthBasic -- defines HTTP proxy user name and password
R0 - R9 -- sets random number
ReadTimeOut -- defines stalled connections timeout
Realm -- describes web-space to index using regex/wild patterns
RemoteCharset -- defines default character set for next Server command(s)
RemoteFileNameCharset -- defines default character set of file and directory names
ReplaceVar -- creates or modifies a variable
ResultContentType -- specifies the "Content-Type" header produced by search.cgi
ResultsLimit -- ResultsLimit
ReverseAlias -- ReverseAlias
Robots -- allows using robots.txt
SaveSectionSize -- use section sizes for better relevancy quality
Section -- defines document's section
Server -- describes web-space you want to index
ServerTable -- loads servers from database
ServerWeight -- defines a server's weight for calculation of popularity
Skip -- skip indexing of the matching URLs
Spell -- loads an Ispell dictionary file
SQLWordForms -- load synonyms or word forms from the database
StartHops -- 'Hops' value for start URLs.
StopwordFile -- loads stopwords file
StrictModeThreshold -- threshold to switch to a less strict search mode
StripAccents -- convert letters to their non-accented counterparts
Subnet -- Subnet
SubstringMatchMinWordLength -- defines minimal word length for substring match
Suggest -- Display misspelled search word suggestions
Synonym -- loads a synonym list from a file
SyslogFacility -- sets syslog facility
Tag -- generic grouping tag
URL -- inserts URL into database
URLDataThreshold -- improves search performance for queries returning small number of results
URLSelectCacheSize -- sets URLs cache size for indexer
UseCookie -- activates using per-session cookies during indexing
UseCRC32URLId -- enables generation of CRC32 URL IDs
UseNumericOperators -- activates interpretation of numeric operators in a search query
UseRemoteContentType -- specifies if the indexer should get content type from server
UserScore -- specifies a SQL query to calculate user defined score for desired documents.
UserScoreFactor -- set the effect of "UserScore" command
VarDir -- defines mnogosearch var directory
VaryLang -- defines languages for multilingual indexing
wf -- sets the default weights of different document parts
WordCacheSize -- defines maximal in-memory words cache size
WordDistanceWeight -- change word distance impact on the document score
II. mnoGoSearch C API function reference
UdmEnvInit -- Allocates or initializes a search context variable
UdmEnvFree -- Closes a search context
UdmAgentInit -- Allocates or initializes a search session variable
UdmAgentFree -- Closes a search session
UdmAgentAddLine -- Adds a configuration command
UdmFind2 -- Executes a search query
UdmResultFree -- Frees a search result
A. mnoGoSearch change history
Changes in 3.3
Changes in 3.3.7 (11 April 2008)
Changes in 3.3.6 (27 November 2007)
Changes in 3.3.5 (17 October 2007)
Changes in 3.3.4 (27 July 2007)
Changes in 3.3.3 (8 May 2007)
Changes in 3.3.2 (19 April 2007)
Changes in 3.3.1 (18 March 2007)
Changes in 3.3.0 (06 March 2007)
Index
List of Tables
3-1. Verbose levels
7-1. Supported character sets
7-2. Charset aliases
8-1. Available search parameters
9-1. Supported environment variables
9-2. server table schema
9-3. Several server parameter values in srvinfo table
List of Examples
1. UdmEnvInit example #1
2. UdmEnvInit example #2
1. UdmEnvFree example #1
2. UdmEnvFree example #2
1. UdmAgentInit example #1
2. UdmAgentInit example #2
1. UdmAgentFree example #1
2. UdmAgentFree example #2
1. UdmAgentAddLine example
1. UdmFind2 example
2. UdmFind2 - a complete search application example
3. Makefile example
1. UdmResultFree example #1