Appendix A. mnoGoSearch change history

Changes in 3.3

Changes in 3.3.7 (11 April 2008)

  • New synonym file command "Mode: return" was added. The words written on the same line in a synonym file are expanded only to the leftmost words in this mode.

  • Synonym file command "Mode: roundtrip" was added as a synonym for "Mode: reverse" to avoid ambiguity. The old name (i.e. reverse) will be removed eventually.

  • search.cgi now can work as an inetd or xinetd service. See the Section called Running search.cgi from inetd/xinetd in Chapter 2 for details.

  • -s flag now understands status range, e.g. "indexer -s200-299" will crawl documents having status in the range 200..299.
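    The range syntax can be illustrated with a minimal Python sketch. It is not the indexer's own parser: the function name parse_status_range is hypothetical, and treating a bare number such as "404" as a one-element range is an assumption.

```python
def parse_status_range(arg: str) -> range:
    """Parse "200" or "200-299" into an inclusive range of HTTP status codes."""
    low, sep, high = arg.partition("-")
    first = int(low)
    # Without a dash the argument names a single status code.
    last = int(high) if sep else first
    if first > last:
        raise ValueError(f"empty status range: {arg!r}")
    return range(first, last + 1)


wanted = parse_status_range("200-299")
print(200 in wanted, 299 in wanted, 304 in wanted)
```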

  • C-API description was added into the manual. See Reference II, mnoGoSearch C API function reference for details.

  • A possibility to debug score values was added. See the Section called Analyzing score values in Chapter 8 for details.

  • ppthtml PPT-to-HTML parser configuration instructions were added into the manual.

  • Performance improvement: when processing wild-card patterns like *.txt or *.htm (e.g. file extensions in the AddType, Allow, Disallow commands etc.), the comparison code now automatically switches from "wild-card comparison" to "string ending comparison" for patterns of this type.
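    The idea behind this optimization can be sketched in Python as follows. This is only an illustration: Python's fnmatch module stands in for the mnoGoSearch wild-card matcher, and compile_pattern is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def compile_pattern(pattern: str):
    """Return a matcher, using a plain suffix test when the pattern allows it."""
    tail = pattern[1:]
    # "*" followed by literal text only: an "ends with" test is equivalent.
    if pattern.startswith("*") and not any(c in tail for c in "*?[]"):
        return lambda s: s.endswith(tail)
    # Anything else falls back to the general wild-card comparison.
    return lambda s: fnmatchcase(s, pattern)


is_text = compile_pattern("*.txt")
print(is_text("docs/readme.txt"), is_text("readme.txt.bak"))
```

    The suffix test avoids backtracking over the whole string, which is why patterns of this shape are worth detecting once, when the configuration is loaded.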

  • Performance improvements were made in creating search index ("indexer -Eblob"), which is now about 30% faster with Firebird, 80% faster with SQLite3, 60% faster with Mimer, 30% faster with Sybase ASE.

  • Minor performance improvements were made in various pieces of the sources.

  • Search now returns at most 1000 results by default, to avoid flood attacks.

  • Fixed that user-defined sections didn't respect <META NAME="Robots" CONTENT="NOINDEX"> tags.

Changes in 3.3.6 (27 November 2007)

  • The default word storage mode was changed to DBMode=blob.

  • DBMode=blob now works with SQLite3.

  • Fixed that the "flags" commands in Ispell affix files were expected to start immediately after the "new line" character. Some affix files available on the Internet have leading spaces and tabs before these commands. Previously mnoGoSearch didn't read these files correctly.

  • Bug#2023 "--disable-mysql-fulltext-plugin doesn't work" was fixed.

  • Fixed that "GroupBySite=yes" didn't work with DBMode=multi correctly.

  • Search and indexing performance improvements were made.

  • search.cgi now uses less memory in DBMode=blob, especially for huge results.

Changes in 3.3.5 (17 October 2007)

  • Fixed an XSS (cross-site scripting) security problem in the default template search.htm-dist. Passing special values of the "t" query string variable to search.cgi resulted in code injection near the OPTION tags of the <SELECT NAME="t"> option list in the extended search form.

    This problem happened only with <SELECT NAME="t">, which is inside an HTML comment in the default template. Other SELECT lists were not affected, unless they had been put inside an HTML comment.

    To prevent this problem, search.cgi was modified to understand variable references with "HTML-encoded" output format:

    
    <OPTION VALUE="val" SELECTED="$&(var)">
    
    Previously only non-encoded variable references worked in OPTION tags:
    
    <OPTION VALUE="val" SELECTED="$(var)">
    
    The default template search.htm-dist was modified to use HTML-encoded output format in variable references in all OPTION tags.

    After upgrading to this release, modify existing templates by replacing every <OPTION VALUE="val" SELECTED="$(var)"> with <OPTION VALUE="val" SELECTED="$&(var)">.
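    Why the HTML-encoded form matters can be shown with a short Python sketch. It only imitates the effect of the two reference styles with the standard html.escape function; render_raw and render_encoded are hypothetical names and are not part of search.cgi.

```python
from html import escape


def render_raw(var: str) -> str:
    # Comparable to $(var): the value is inserted as is.
    return f'<OPTION VALUE="val" SELECTED="{var}">'


def render_encoded(var: str) -> str:
    # Comparable to $&(var): quotes and angle brackets are HTML-encoded.
    return f'<OPTION VALUE="val" SELECTED="{escape(var)}">'


attack = '"><script>alert(1)</script>'
print(render_raw(attack))      # the attribute is closed early, markup is injected
print(render_encoded(attack))  # the value stays inside the attribute
```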

  • Thread concurrency for resolving host names and processing robots.txt files was significantly improved, which makes "indexer -Nnum" work much faster when indexing multiple sites.

  • The SubstringMatchMinWordLength search.htm command was added. Thanks to Matthias Pigulla for contribution.

  • The Skip indexer.conf command was added.

  • The CaseFolding command was added to allow alternative lower case mapping for some languages (e.g. Turkish).

  • In search queries with boolean operator ~ (NOT), e.g. "usa & ~chicago", boolean operator & (AND) is not required anymore. This syntax now works as well: "usa ~chicago". search.cgi automatically assumes & before ~.
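    The rewriting rule can be sketched in Python as below. This is not the search.cgi query parser: the whitespace tokenization and the operator set in OPERATORS are assumptions made for the illustration, and add_implicit_and is a hypothetical name.

```python
# Tokens after which no implicit "&" is inserted (an assumption of this sketch).
OPERATORS = {"&", "|", "("}


def add_implicit_and(query: str) -> str:
    """Insert "&" before a "~word" token that directly follows an operand."""
    out = []
    for token in query.split():
        if token.startswith("~") and out and out[-1] not in OPERATORS:
            out.append("&")
        out.append(token)
    return " ".join(out)


print(add_implicit_and("usa ~chicago"))
```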

  • Udm_Set_Agent_Param_Ex() function in PHP extension module now understands search.htm compatible commands:

                
       Udm_Set_Agent_Param_Ex($udm_agent, "Section body  1 1");
       Udm_Set_Agent_Param_Ex($udm_agent, "Section title 2 1");
    

  • The default value of the VaryLang command was changed from "en" to empty.

  • Cluster now honors the ReadTimeOut command in search.htm to skip the nodes which currently are not available, e.g. because of network problems. Previously, search waited 30 seconds before returning results if one of the nodes was unavailable.

  • Performance improvements in phrase search were made.

  • search.cgi now doesn't try to find clones for a document if value of its "url.crc32" is 0.

  • Column type of "qcache.doclist" was changed from BLOB to LONGBLOB in MySQL structure, to allow storing of longer cached results.

  • Fixed that indexer crashed in some cases when running with many threads.

  • Fixed that <!INCLUDE> didn't work when the CONTENT parameter started with a variable reference, e.g.:

                
      <!SET NAME="x" CONTENT="http://hostname/">
      <!INCLUDE CONTENT="$(x)">
    

  • Bug#1903 "$(tag) doesn't work in cluster" was fixed.

  • Bug#1959 "Confusing message "Unable to find working zlib library" on missing libdmalloc" was fixed. Configure parameter "--enable-dmalloc" was changed to "--with-dmalloc", to be able to specify non-standard dmalloc location.

  • Bug#2022 "search.cgi crashes when searching for a single word with 'Dehyphenate yes' and DBMode=blob" was fixed.

  • Fixed a bug in HTTP content negotiation which made indexer after receiving a "Vary: accept-language" response header download the same URL several times again, even though indexer.conf didn't specify any languages to vary (i.e. when the VaryLang command was not set or was empty).

Changes in 3.3.4 (27 July 2007)

  • mnoGoSearch now works better for huge documents. The maximum number of words collected from each document was changed from "64K words per section" to "2048K words per section". The data format in DBMode=single was changed, so users of DBMode=single have to reindex their documents from the beginning. The data format in DBMode=multi and DBMode=blob was not changed; reindexing in these modes is only necessary for huge documents (bigger than approximately 512K), to make indexer collect more words from these documents. The new limit makes it possible to fully index documents with text size up to about 16Mb.

  • The LoadTagInfo search.htm command was added, to make tag values available in search results using $(tag).

  • The LoadURLInfo search.htm command was added, to switch off loading extra section values from the urlinfo table for performance purposes.

  • The StripAccents yes/no command was added into indexer.conf and search.htm to make accent insensitive searches possible with databases that do not support accent insensitive collations. When StripAccents is set to yes, all accented letters are converted to their non-accented counterparts when writing or looking up the word index.
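    The effect of such a conversion can be approximated in Python with Unicode decomposition. This is a sketch of the general technique only; the mapping actually used by mnoGoSearch is not described here and may differ, and strip_accents is a hypothetical name.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Applying the same function at indexing time and at search time makes
# both spellings meet on the same index key.
print(strip_accents("coté") == strip_accents("cote"))
```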

  • Content-Type "application/http" (an HTTP response with headers) is now understood.

  • Content-Type "application/http" now works with external parsers: if the result type of a parser is "application/http", indexer treats the output as a full HTTP response and parses both headers and content.

  • PostgreSQL driver now understands the "setnames" DBAddr parameter to set client encoding. If a non-empty "setnames" parameter is given, PQsetClientEncoding() is executed immediately after establishing a connection to the server.

  • Fixed that highlighting didn't work in some cases when a search query contained two or more phrases.

Changes in 3.3.3 (8 May 2007)

  • Performance improvement: the "sorting results by score" step is now much faster on big results (0.01 second vs 1.00 second on results returning one million documents).

  • Performance improvement: searching for a single word is now about three times faster on big results.

  • Some indexes were added into SQL schema to make searches with tag and category limit faster (Feature request #772).

  • Feature request #1364 "highlight collation matches" was implemented. Now when using an accent insensitive collation (for example, latin1_general_ci with MySQL), search.cgi will take into account all word forms for excerpts and highlighting. For example, searches for French "cote" will also highlight "coté" and vice versa, if the non-exact word form generated hits.

  • MySQL driver now understands setnames parameter in DBAddr (feature request #1326).

  • MySQL driver now understands sqllogbin parameter in DBAddr (feature request #697).

  • DebugSQL parameter to DBAddr is now understood. When DebugSQL is set to yes, indexer and search.cgi print all SQL queries sent to the database. mnoGoSearch must be compiled using ./configure --with-debug ... to make this feature work.

  • MinCoordFactor and MaxCoordFactor impact is now calculated separately for each section.

  • "nwf" parameter is now understood in DBAddr string, to set its value per database.

  • "HoldBadHrefs 0" now means never delete unavailable documents from the database automatically (e.g. when remote host is down), which improves indexing speed, and which is now default behavior. Only positive HoldBadHrefs values activate automatic deletion.

  • Data type of urlinfo.sval was changed from TEXT to MEDIUMTEXT in MySQL table structure, to allow storing sections longer than 64K.

  • Bug#1733 "'indexer -Ewordstat' problem with PostgreSQL" was fixed.

  • Bug #1054 "indexer does not index html files without body tag" was fixed. A new special section with name "nobody" is now understood. If this section is configured, then indexer collects words outside the <body>...</body> tags. The default behavior is still not to index words outside these tags.

  • Bug#768 "User defined section is too short (1Kb limit)" was fixed.

  • Bug#1654 "SQLWordForms doesn't work with cluster" was fixed. Those using cluster should upgrade node.xml using the latest version of node.xml-dist.

  • Bug#1713 "Square brackets in DOCTYPE makes XML parser fail" was fixed.

  • Bug#1739 "indexer doesn't understand Content-Encoding for robots.txt" was fixed.

  • Bug#1740 "'UseRemoteContentType yes' doesn't work." was fixed.

  • Bug#1741 "'indexer.conf -Eblob -t tag' fails with 'Unknown table 's' in WHERE clause'" was fixed.

  • Fixed that indexer ignored the LogLevel command.

  • Fixed that popularity rank calculation didn't work with Interbase/Firebird. A missing column "url.shows" was added into SQL schema.

  • Fixed that phrase search didn't work in some cases (a bug since 3.3.0).

Changes in 3.3.2 (19 April 2007)

  • "ResultContentType none" is now understood to suppress printing of the "Content-Type" HTTP header by search.cgi. This is useful if you execute search.cgi from another Web application which sends HTTP headers itself.

  • The "ue" search.cgi parameter is now understood again to exclude documents with the given URL pattern from search results. This feature was broken in 3.2.x.

  • indexer now uses the UDM_TMP_DIR and TMPDIR environment variables when creating temporary files (e.g. for external parsers) instead of the default /tmp.

  • Fixed that standalone dash character was considered as a separate word with "Dehyphenate yes", so for the queries like "a - b", search.cgi incorrectly searched for three words: "a", "-", "b", which never returned results in "find all words" mode.

  • Fixed that "UseCookie yes" made indexer crash when fetching data from HTDB sources.

  • Fixed that excerpts generated from cached copy of TEXT files didn't work (bug since 3.3.0).

  • Bug#746 "Stopwords in a long boolean query" was fixed.

  • Bug#1016 "Indexer is selecting wrong Content-Type" was fixed.

  • Bug#1024 "Clear database limitations do not work: error ORA-01795" was fixed.

  • Bug#1044 "-Ewordstat: incorrect unicode sequence" was fixed.

  • Bug#1110 "'invalid UTF-8 byte sequence detected' when INSERT INTO dictXX" was fixed. This error happened when indexing into PostgreSQL with DBMode=multi. The "intag" column type was changed from TEXT to BYTEA in the tables "dict00".."dictFF".

  • Bug#1182 "Indexer crashes with -a -y 'content/type'" was fixed.

  • Bug#1427 "ORA-01785: maximum number of expressions in a list is 1000" was fixed.

  • Bug#1436 "Cannot run -Ewordstat, ORA-01400: cannot insert NULL" was fixed.

  • Bug#1615 "The identifier "PATH_MAX" is undefined" was fixed.

  • Bug#1641 "Documentation problem" was fixed.

  • Bug#1659 "GroupBySite doesn't work in cluster mode" was fixed.

  • Bug#1679 "search.cgi dumps core on OpenBSD 4.0 when I search for non existing word" was fixed.

  • Bug#1693 "User defined sections don't work for text/plain files" was fixed.

  • Bug#1716 "Can't limit indexer to documents matching language" was fixed.

  • Bug#1725 "Navigation doesn't work when using a single cluster node" was fixed.

  • Bug#1726 "DateFormat doesn't work in cluster" was fixed.

Changes in 3.3.1 (18 March 2007)

  • Relevancy improvement: Fixed that average word distance was considered to be very big in the case when words were found in different sections (e.g. one word in "body" and one word in "title"). Word pairs from different sections are not taken into account anymore for distance calculation.

  • Relevancy improvement: Average word distance is now calculated taking into account "wf" values for the sections - the final score is now more sensitive to word distances in the sections with higher "wf" values.

  • DBMode=blob&LiveUpdates=yes is now understood in the DBAddr parameter. If LiveUpdates=yes is specified, it's possible to crawl up to several thousand documents without fully recreating the search index by running "indexer -Eblob". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • The "text" and "html" keywords were added into the "Section" command syntax, to apply either text or HTML parser for data returned from a "simple" HTDBDoc query. This option is useful if the source SQL table stores data in HTML format. The default value is "text". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • Column with name "last_mod_time" is now considered as modification time of the documents, returned from "simple" HTDBDoc queries.

  • A new syntax to display N rightmost characters from a template variable was added. For example, $(URL:-10). Thanks to Eggert Ehmke for the idea and the original patch.
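    In Python terms the new syntax behaves like a slice from the right. The helper below is only an analogy; rightmost is a hypothetical name and the URL is an example value.

```python
def rightmost(value: str, n: int) -> str:
    """Return the n rightmost characters of value, comparable to $(URL:-10) for n=10."""
    return value[-n:] if n > 0 else ""


print(rightmost("http://example.com/index.html", 10))
```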

  • Performance improvements in score calculation with non-empty "nwf" parameter were made.

  • Fixed that "simple" HTDBDoc queries didn't work with Interbase/Firebird, because the driver returned empty column names.

  • Fixed a bug which made search.cgi crash when generating a link to "cached copy" with a template having multiple DBAddr commands.

  • Fixed a bug in character set conversion, which made indexer crash in rare cases.

  • Fixed that "indexer -Cw" didn't empty the "bdict" table.

  • Fixed a bug in cluster code which made search.cgi crash on processing of a front-end template with "Suggest yes" when search didn't return any results.

Changes in 3.3.0 (06 March 2007)

  • Cluster support was added. A typical cluster consists of several database machines and a single front-end machine. The front-end machine receives HTTP requests from a user's browser, forwards search queries to the database machines over HTTP, receives a limited number of the best search results from every database machine (in a simple XML format based on the OpenSearch specifications), then parses and merges the results and displays them ordered by score, applying an HTML template. This approach distributes the operations with high CPU and hard disk consumption between the database machines in parallel, leaving simple merge and HTML template processing functions to the front-end machine. As of version 3.3.0, mnoGoSearch allows joining up to 256 database machines into a single cluster.

  • node.xml-dist, an XML template for a cluster database machine, is now installed into the /etc directory.

  • "DBAddr http://hostname/search.cgi/node.xml" search.htm command was added, to specify a URL of a cluster database machine interface with XML format.
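    The merge step performed by the front-end machine can be sketched in Python. This is a simplified model, not the search.cgi implementation: each node's answer is assumed to be a list of (score, url) pairs already sorted by descending score, and merge_top is a hypothetical name.

```python
import heapq
from itertools import islice


def merge_top(node_results, limit):
    """Merge per-node result lists (each sorted by descending score)."""
    merged = heapq.merge(*node_results, key=lambda hit: hit[0], reverse=True)
    # Only the first `limit` hits are needed for the page being displayed.
    return list(islice(merged, limit))


node1 = [(0.9, "http://a/1"), (0.2, "http://a/2")]
node2 = [(0.5, "http://b/1")]
print(merge_top([node1, node2], 2))
```

    Because every node returns only its own best hits, the front-end merges a few short sorted lists instead of sorting the full result set.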

  • "DBAddr file:///path/to/node.xml" search.htm command was added, to specify a static XML search response. This is mostly for test purposes.

  • Two cluster types were implemented - a merge cluster to join results from several independent databases, each created by its own indexer.conf, as well as a distributed cluster - created by a single indexer.conf when indexer automatically distributes search index between database machines.

  • The default distribution type was changed from "remainder" to "quotient". Thus, for an indexer.conf with three DBAddr commands, distribution is done as follows:

    • URLs with seed 0..85 go to the first DBAddr

    • URLs with seed 86..170 go to the second DBAddr

    • URLs with seed 171..255 go to the third DBAddr

    This distribution style simplifies manual redistribution of an existing clustered database when adding a new DBAddr (i.e. a new database machine). Future releases will provide an automatic tool for redistribution when adding and deleting machines in an existing cluster, as well as more configuration commands to control distribution.
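    The difference between the two distribution styles can be sketched in Python. The formulas below are assumptions chosen because integer division of seed * N by 256 splits the seed space 0..255 into contiguous ranges of this kind; whether indexer computes the node number in exactly this way is not stated here, and both function names are hypothetical.

```python
def remainder_node(seed: int, nodes: int) -> int:
    # Old style: neighbouring seeds are scattered over all nodes.
    return seed % nodes


def quotient_node(seed: int, nodes: int) -> int:
    # New style: each node owns one contiguous block of seeds.
    return seed * nodes // 256


def seed_ranges(nodes: int):
    """List the (first, last) seed owned by every node in quotient style."""
    ranges = []
    for node in range(nodes):
        owned = [s for s in range(256) if quotient_node(s, nodes) == node]
        ranges.append((owned[0], owned[-1]))
    return ranges


print(seed_ranges(3))
```

    With contiguous blocks, adding a machine only moves the seeds near the block boundaries, which is what makes manual redistribution simpler.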

  • Maximum amount of words collected from a document was changed from 64K words per document to 64K words per section - positions are now enumerated per section, starting from the beginning of each section separately.

  • "SaveSectionSize yes/no" indexer.conf and search.htm command was added. When SaveSectionSize is set to yes, indexer stores additional information about section sizes, making it possible to generate better score values, as well as to do "exact section match" searches. Default value is "yes".

  • Relevancy improvement: "WordDensityFactor num" search.htm command was added. Num is a number in the range 0..255 to specify impact of word frequency on the result score. This feature works with "SaveSectionSize yes". The default value is 25.

  • Exact section match syntax was added:

                
    title="Apache web server"
    
    This feature works with "SaveSectionSize yes".

  • "WordFormFactor num" search.htm command was added to give more weight to the word forms originally written in the search query and less weight to generated word forms using ispell dictionaries and synonyms. Use with a number 0..255. Default value is 255. 255 means to give the same weight to the original and generated forms. 0 means maximum effect, i.e. weight for a generated word form is much smaller than weight for the original word form.

  • Performance improvements were made in the excerpt generation code. Excerpt generation from CachedCopy is now about 6-12% faster.

  • Using URL and Tag limits is now possible with "indexer -Eblob", e.g.:

                
    ./indexer -Eblob -u "%subdir%"
    ./indexer -Eblob -t tag
    
    This is to generate a search index over a subset of all documents collected during crawling.

  • Using "Limit" command is also possible with "indexer -Eblob", e.g.:

    indexer.conf command:

                
    Limit subdir "SELECT rec_id FROM url WHERE url LIKE '%/subdir/%'"
    

    command line:

                
    ./indexer -Eblob --fl=subdir
    

  • "ResultContentType type" search.htm command was added to specify Content-Type header generated by search.cgi. The default value is "text/html".

  • "Dehyphenate yes/no" search.htm command was added. When "Dehyphenate yes" is specified, searching for "peace-making" also will return documents having "peacemaking". Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • Clone template variables were changed: clones are now returned in the same row with the document itself, using CloneN prefix, e.g.: $(Clone0.URL). The "<!--clone-->" search.htm section and the $(CL) variable are not supported anymore.

  • DetectClones is now "no" by default, for performance purposes.

  • "CollectLinks yes/no" indexer.conf command was added. The default value is "no", which improves indexing performance by not populating the "links" table. As a side effect, PopRank calculation is not possible in the default configuration. If PopRank is important for your installation, specify "CollectLinks yes" in indexer.conf.

  • Default sort order was changed from "RP" (score, then popularity) to "R" (score). This change improves search performance for the installations where PopRank is not important.

  • Indexer now honors <a rel="nofollow"> tags. Thanks to Jeff Veit for contribution.

  • A simplified format of HTDBDoc command was added:

                
    HTDBDoc "SELECT title, body FROM docs WHERE id=$2"
    
    SQL column names are associated with "Section" names. Thanks to Oz Basarir and Natural Capital Institute for sponsoring this feature.

  • It's now possible to specify wf as a parameter for DBAddr search.htm command, which is useful when merging two or more databases - to give more score to results coming from a desired database.

                
    DBAddr mysql://root@localhost/db1/?wf=FFFF
    DBAddr mysql://root@localhost/db2/?wf=1111
    DBAddr mysql://root@localhost/db3/?wf=1111
    

  • MaxResults parameter was added for DBAddr, which is useful to add a limited number of sponsored links in the top of search results:

    
    DBAddr mysql://root@localhost/avd/?wf=FFFF&MaxResults=1
    DBAddr mysql://root@localhost/db1/?wf=1111
    DBAddr mysql://root@localhost/db2/?wf=1111
    

  • $(DBOrder) template variable was added to display the original order of a document in its database result, before multiple DBAddr search results were merged into the final result. It is equal to $(Order) when using only a single DBAddr command in search.htm.

  • FOR template operator was added. Loop limits can be both constants:

    
      <!FOR NAME="a" FROM="10" TO="20">a=$(a)<!ENDFOR>
    
    and variables that were previously set, for example by the SET operator:
    
      <!SET NAME="from" CONTENT="80">
      <!SET NAME="to" CONTENT="90">
      <!FOR NAME="a" FROM="$(from)" TO="$(to)">a=$(a)<!ENDFOR>
    

  • "[no title]" is not added automatically anymore: an empty string is printed instead. One can use IF template operator to reproduce 3.2.x behaviour:

    
    <!IF NAME="title" CONTENT="">[no title]<!ELSE>$&(title)<!ENDIF>
    

  • Various indexing and search performance improvements were made.

  • Fixed that indexer didn't work with MySQL-5.1.15-GPL.

  • "indexer -?" now prints its help page to stdout instead of stderr.

  • A "#version" record is now put into the table "bdict" when running "indexer -Eblob". mnoGoSearch version ID is put as its value. For example, mnoGoSearch 3.3.0 will put "30300" string.
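    The example value is consistent with an encoding that gives two decimal digits each to the minor and patch numbers. The Python sketch below assumes that scheme; only the 3.3.0 case is confirmed above, and version_id is a hypothetical name.

```python
def version_id(version: str) -> str:
    """Encode "major.minor.patch" as a decimal string, e.g. "3.3.0" -> "30300"."""
    major, minor, patch = (int(part) for part in version.split("."))
    # Assumed layout: major * 10000 + minor * 100 + patch.
    return str(major * 10000 + minor * 100 + patch)


print(version_id("3.3.0"))
```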

  • Preliminary implementation of DBMode=rawblob in search.htm was added. This mode is designed for direct search from the table "bdicti" without having to run "indexer -Eblob" and is intended for use with small search databases as a replacement for DBMode=single. In future releases it will also be reused for real-time index updates, to avoid running "indexer -Eblob" when only a small number of documents were changed.