Chapter 8. Searching documents

Table of Contents
Using search front-ends
How to write search result templates
Designing search.html
Relevancy
Search query tracking
Search results cache
Fuzzy search

Using search front-ends

Performing search

Open your preferred front-end in your web browser:


http://your.web.server/path/to/search.cgi
or
http://your.web.server/path/to/search.php3
or
http://your.web.server/path/to/search.pl

To find something, type the words you want to find and press the SUBMIT button. For example: "mysql odbc". Do not put quotes (") in your query; they are used here only to separate the query from the surrounding text. mnoGoSearch will find all documents that contain the word "mysql" and/or the word "odbc". Documents with higher weights are displayed first.

Search parameters

mnoGoSearch front-ends support the following parameters, passed in the CGI query string. You can also use them in the HTML form on the search page.

Table 8-1. Available search parameters

q - Text parameter with the search query.
s - Character sequence specifying the sort order of results. Lowercase letters mean ascending order, capital letters mean descending order. The following characters can be used: R or r - sort by score; P or p - sort by PopularityRank; D or d - sort by last modified date; U or u - sort by URL; S or s - sort by section (see the "su" parameter). The default value is R (by score).
su - Section name to sort results by. This parameter must be used together with "s=S" or "s=s".
sl.* - Section limits. You can limit a search to documents with a given section value, e.g. sl.title=Top
fl - Loads a fast limit with the given name pattern. The limits must have been created beforehand using the Limit commands. If the "fl" value starts with a minus character, the limit is treated as an excluding limit. For example, fl=-name restricts the search to documents not covered by the limit "name". The SQL LIKE operator is used when loading fast limits at search time, so the % and _ wildcards can be used in the "fl" pattern. If the pattern matches more than one limit, the search is restricted to documents covered by any of them. If an excluding limit pattern matches several limits, the search is restricted to documents covered by none of them.
ps - Page size: the number of search results displayed on one page, 10 by default.
np - Page number, 0 by default (first page).
m - Search mode. "all" and "any" values are supported. Default value is "all".
wm - Word match. Use this parameter to choose the word match type. The values "wrd", "beg", "end" and "sub" mean whole word, word beginning, word ending and word substring match respectively, with whole word match being the default. The minimum word length for substring match is controlled by the SubstringMatchMinWordLength command in search.htm.
t - Tag limit. Limits the search to documents with the given tag only. This parameter has the same effect as the -t indexer option.
cat - Category limit. See the Section called Categories in Chapter 6 for details.
ul

Limits search results by a URL pattern. If the ul value represents a relative URL, then search.cgi automatically adds % wildcards before and after the ul value. For example:


<OPTION VALUE="/manual/">
will add the condition (url LIKE '%/manual/%') to the SQL query. If the ul value is an absolute URL with a scheme, then search.cgi will add the % sign only after the value. For example, for:

<OPTION VALUE="http://localhost/">
search.cgi will add (url LIKE 'http://localhost/%') condition.

Note: Using absolute URLs is more efficient, because the query can then use SQL indexes for optimization.

In addition to the automatically added wildcards, you can use your own % and _ wildcards in the pattern. For example:


<OPTION VALUE="http://localhost/%/archive/">

Multiple ul values can be given in the query string, which makes it possible to use the SELECT MULTIPLE input type in the HTML search form. Multiple values are joined using the OR operator. For example, if the user selects both options from this list:


<SELECT NAME="ul" MULTIPLE>
<OPTION VALUE="/dir1/">Dir1</OPTION>
<OPTION VALUE="/dir2/">Dir2</OPTION>
</SELECT>
search.cgi will add (url LIKE '%/dir1/%' OR url LIKE '%/dir2/%') condition into search query.

ue

Limits search results by excluding documents that match the given URL pattern.

Like the ul parameter, the ue parameter detects absolute and relative URL patterns, automatically adds wildcards, and supports your own wildcards.

Multiple ue parameters can be given to exclude several URL patterns at the same time. Multiple parameters are joined using the SQL AND operator. For example, if the user selects both options from this list:


<SELECT NAME="ue" MULTIPLE>
<OPTION VALUE="/dir1/">Dir1</OPTION>
<OPTION VALUE="/dir2/">Dir2</OPTION>
</SELECT>
search.cgi will add (url NOT LIKE '%/dir1/%' AND url NOT LIKE '%/dir2/%') condition into search query.

Note: ul and ue parameters can be given at the same time.
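The wildcard rules for ul and ue can be sketched in a few lines of Python. This is an illustration only, not the code search.cgi runs: the function names are made up for this example, an absolute URL is assumed to be detected by the presence of "://", and joining the ul and ue groups with AND is an assumption, since the text only says that both can be given together. The sketch does no SQL escaping, so it should not be used to build real queries.

```python
def like_pattern(value):
    """Return the LIKE pattern for one ul/ue value."""
    # Assumption: a value containing "://" is an absolute URL with a scheme.
    if "://" in value:
        return value + "%"        # absolute: wildcard only after the value
    return "%" + value + "%"      # relative: wildcards on both sides


def url_condition(ul=(), ue=()):
    """Build the URL condition for lists of ul and ue values."""
    include = " OR ".join("url LIKE '%s'" % like_pattern(v) for v in ul)
    exclude = " AND ".join("url NOT LIKE '%s'" % like_pattern(v) for v in ue)
    # Assumption: the include and exclude groups are combined with AND.
    return " AND ".join("(%s)" % part for part in (include, exclude) if part)


print(url_condition(ul=["/dir1/", "/dir2/"]))
print(url_condition(ue=["/dir1/", "/dir2/"]))
```

User-supplied % and _ wildcards pass through unchanged, so a value such as http://localhost/%/archive/ only gains the trailing %.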

wf - Weight factor vector. It allows changing the weights of different document sections at search time. It should be passed as a hex number. See the explanation below.
nwf - "No section" weight factor vector. See the explanation below.
g - Language limit. Language abbreviation to limit search results by the url.lang field.
tmplt - Template filename (without path). Use it to specify a template file other than the standard search.htm.
type - Content-Type limit. Content type to limit search results by the url.content_type field. For cache mode storage this must be an exact match. For SQL modes it may be an SQL-like pattern.
sp - Word forms limit. =1 to search for all forms of the entered words; =0 to search only for the entered words. Default value is 1.
sy - Synonyms limit. =1 to add synonyms for the entered words; =0 to not use synonyms. Default value is 1.
tl - Transliteration. =1 or =yes if you want transliteration; =0 or =no if you do not want transliteration. Default value is 0.
dt - Limit by time. Three types are supported.

If dt is set to back, that means you want to limit result to recent pages, and you should specify this recentness in variable dp.

If dt is set to er, that means the search will be limited to pages newer or older than the date given. Variable dx is newer/older flag (1 means newer or after, -1 means older or before). Date is specified in variables dy, dm, dd.

If dt is set to range, that means search within the given range of dates. Variables db and de are used here and stand for the beginning and end date.

All times in cached mode are measured with a one hour precision.
dp - Limit by recentness, if the dt value is back. It should be specified in xxxA[yyyB[zzzC]] format (spaces are allowed between xxx and A, between A and yyy, and so on). xxx, yyy, zzz are numbers (can be negative!); A, B, C can each be one of the following (the letters are the same as in the strptime/strftime functions): s - second, M - minute, h - hour, d - day, m - month, y - year. Examples:

  4h30M           - 4 hours and 30 minutes
  1y6m-15d        - 1 year and 6 months minus 15 days
  1h-60M+1s       - 1 hour minus 60 minutes plus 1 second
dx - Newer/older flag (1 means newer or after, -1 means older or before), if the dt value is er.
dm - Month, if the dt value is er. 0 - January, 1 - February, ... 11 - December.
dy - Year, if the dt value is er. Four digits. For example, 1999 or 2001.
dd - Day, if the dt value is er. 1...31.
db - Beginning date, if the dt value is range. Each date is a string of the form dd/mm/yyyy, where dd is the day, mm is the month and yyyy is a four-digit year.
de - End date, if the dt value is range. Each date is a string of the form dd/mm/yyyy, where dd is the day, mm is the month and yyyy is a four-digit year.
us - Specifies the name of the user-defined score list which should be mixed with the scores internally calculated by mnoGoSearch, according to the UserScore and UserScoreFactor configuration. If the us value is empty, or no "UserScore" command with this name is found, this parameter is ignored.
GroupBySite - Enables or disables grouping results by site. Can be set to yes or no, with the default value no. This parameter has the same effect as the GroupBySite search.htm command.
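The dp format can be read as a sum of signed quantities. As a rough sketch only, the following Python function converts a dp value to seconds using the letter list from the dp description (M for minute, m for month). The name dp_to_seconds is made up for this example, and counting a month as 30 days and a year as 365 days is an assumption; the text does not say how search.cgi measures them.

```python
import re

# Seconds per unit letter. The month and year lengths are assumptions.
UNIT_SECONDS = {
    "s": 1,
    "M": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "m": 30 * 24 * 60 * 60,   # assumed 30-day month
    "y": 365 * 24 * 60 * 60,  # assumed 365-day year
}

# One "number + unit letter" pair, with optional spaces and an optional sign.
TOKEN = re.compile(r"\s*([+-]?\d+)\s*([sMhdmy])")


def dp_to_seconds(dp):
    """Convert a dp value such as '1h-60M+1s' to a number of seconds."""
    dp = dp.strip()
    if not dp:
        raise ValueError("empty dp value")
    total = 0
    pos = 0
    while pos < len(dp):
        match = TOKEN.match(dp, pos)
        if match is None:
            raise ValueError("cannot parse dp value at offset %d" % pos)
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return total


print(dp_to_seconds("1h-60M+1s"))
```

A front-end could subtract the result from the current time to obtain the oldest modification date that still counts as recent.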

Changing different document parts weights at search time

It is possible to pass the "wf" HTML form variable to search.cgi. "wf" variable represents weight factors for specific document parts. Currently body, title, keywords, description, url parts, crosswords as well as user defined META and HTTP headers are supported. Take a look in the "Section" part of indexer.conf-dist.

To be able to use this feature, it is recommended to set different section IDs for different document parts in the "Section" command of indexer.conf. Currently up to 256 different sections are supported.

Imagine that we have these default sections in indexer.conf:


  Section body        1  256
  Section title       2  128
  Section keywords    3  128
  Section description 4  128

The "wf" value is a string of hex digits ABCD. Each digit is a factor for the corresponding section's weight. The rightmost digit corresponds to section 1. For the section configuration given above:


      D is a factor for section 1 (body)
      C is a factor for section 2 (title)
      B is a factor for section 3 (keywords)
      A is a factor for section 4 (description)

Examples:


   wf=0001 will search through body only.

   wf=1110 will search through title, keywords and description, but not through the body.

   wf=F421 will search through:
          Description with factor 15  (F hex)
          Keywords with factor 4
          Title with factor  2
          Body with factor 1
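The digit-to-section mapping can be checked with a short Python sketch. It assumes the four example sections above (IDs 1 to 4); the names SECTIONS and decode_wf are hypothetical and exist only for this illustration.

```python
# Section IDs from the example indexer.conf fragment above.
SECTIONS = {1: "body", 2: "title", 3: "keywords", 4: "description"}


def decode_wf(wf):
    """Map each section ID to its weight factor, given a wf hex string."""
    factors = {}
    # The rightmost hex digit belongs to section 1, the next one to section 2, ...
    for section_id, digit in enumerate(reversed(wf), start=1):
        factors[section_id] = int(digit, 16)
    return factors


for section_id, factor in sorted(decode_wf("F421").items()):
    print(SECTIONS[section_id], factor)
```

A factor of 0 means the section is not searched, which is why wf=0001 searches through the body only.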

It is also possible to set the default "wf" value using the wf search.htm command. If "wf" is omitted in the query and the wf command is not specified in search.htm, all section factors are 1, which means that all sections have the same weight.

Since version 3.3.0, it is also possible to specify "wf" value as a DBAddr search.htm command parameter.

The "nwf" search parameter uses the same format as "wf". If all found words appear in only one section, the resulting score becomes lower. This can be used, for example, to ignore spam in the KEYWORDS meta tag: if you use high "wf" and "nwf" values for the section corresponding to KEYWORDS, the score will be high only if a word appears in KEYWORDS and also in another section such as the title, but not if it appears only in KEYWORDS. Since version 3.3.3, "nwf" can also be set as a DBAddr search.htm command parameter.

Using front-end with an shtml page

When search.cgi is not called directly as a CGI program but from a dynamic shtml page through SSI, it is necessary to override Apache's SCRIPT_NAME environment variable so that all the links on search pages lead to the dynamic page and not to search.cgi.

For example, when an shtml page contains the line <!--#include virtual="search.cgi" -->, the SCRIPT_NAME variable will still point to search.cgi, not to the shtml page.

To override the SCRIPT_NAME variable, we implemented a UDMSEARCH_SELF variable that you may add to Apache's httpd.conf file. search.cgi checks the UDMSEARCH_SELF variable first and then SCRIPT_NAME. Here is an example of using the UDMSEARCH_SELF environment variable with Apache's SetEnv/PassEnv httpd.conf commands:


SetEnv UDMSEARCH_SELF /path/to/search.cgi
PassEnv UDMSEARCH_SELF

Using several templates

It is often required to use several templates with the same search.cgi. There are several ways to do it. They are given here in the order search.cgi detects the template name.

  1. search.cgi checks the environment variable UDMSEARCH_TEMPLATE. So you can put a path to the desired search template in UDMSEARCH_TEMPLATE.

  2. search.cgi also supports Apache's internal redirect. It checks REDIRECT_STATUS and REDIRECT_URL environment variables. To activate this way of template usage you may add these lines in Apache srm.conf:

    
    AddType text/html .zhtml
    AddHandler zhtml .zhtml
    Action zhtml /cgi-bin/search.cgi
    

    Put search.cgi into your /cgi-bin/ directory. Then put the HTML template into your site's directory structure under any name with the .zhtml extension, for example template.zhtml. Now you can open the search page: http://www.site.com/path/to/template.zhtml. You may use any unused extension instead of .zhtml, of course.

  3. search.cgi also checks the part of the URL after "search.cgi", available in the PATH_INFO environment variable. For example, if you point your browser to http://site/search.cgi/search1.html, it uses search1.htm as its template; if you point it to http://site/search.cgi/search2.html, it uses search2.htm, and so on.

  4. If none of the above three ways worked, search.cgi opens a template that has the same name as the script being executed, using the SCRIPT_NAME environment variable. search.cgi will open the template ETC/search.htm, search1.cgi will open ETC/search1.htm and so on, where ETC is mnoGoSearch's /etc directory (usually /usr/local/mnoGoSearch/etc). So you can use the same search.cgi with different templates without having to recompile it. Just create one or several hard or symbolic links to search.cgi, or copy it, and put the corresponding search templates into the /etc directory of the mnoGoSearch installation.

    See also the section Making multi-language search pages.
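The lookup order above can be pictured with a simplified Python sketch. It is not how search.cgi is implemented. Step 2 (Apache's internal redirect) is left out, because the text does not say how the template path is derived from REDIRECT_STATUS and REDIRECT_URL. For step 3, looking up the template in the ETC directory under the base name with an .htm extension is an assumption. The names ETC_DIR and template_path are made up for this example.

```python
import posixpath

# "Usually" this directory, according to the text; adjust for your installation.
ETC_DIR = "/usr/local/mnoGoSearch/etc"


def template_path(env):
    """Pick a template path from CGI environment variables (steps 1, 3 and 4)."""
    # Step 1: an explicit template path wins.
    explicit = env.get("UDMSEARCH_TEMPLATE")
    if explicit:
        return explicit

    # Step 2 (REDIRECT_STATUS / REDIRECT_URL) is intentionally omitted here.

    # Step 3: the URL part after the script name, e.g. "/search1.html".
    # Assumption: the base name with ".htm" is looked up in ETC_DIR.
    name = posixpath.basename(env.get("PATH_INFO", ""))
    if name:
        stem, _ext = posixpath.splitext(name)
        return posixpath.join(ETC_DIR, stem + ".htm")

    # Step 4: fall back to the name of the script being executed.
    script = posixpath.basename(env.get("SCRIPT_NAME", "search.cgi"))
    stem, _ext = posixpath.splitext(script)
    return posixpath.join(ETC_DIR, stem + ".htm")


print(template_path({"SCRIPT_NAME": "/cgi-bin/search1.cgi"}))
```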

Advanced boolean search

If you want more precise results, you can use the boolean query language.

mnoGoSearch understands the following boolean operators:

& - logical AND. For example, "mysql & odbc". mnoGoSearch will find any URLs that contain both "mysql" and "odbc". You can also use + for this operator.

| - logical OR. For example "mysql|odbc". mnoGoSearch will find any URLs that contain the word "mysql" or the word "odbc".

~ - logical NOT. For example "mysql & ~odbc". mnoGoSearch will find URLs that contain the word "mysql" and do not contain the word "odbc" at the same time. Note that ~ just excludes the given word from the results. The query "~mysql & ~odbc" will find nothing!

() - group command to compose more complex queries. For example "(mysql | msql) & ~postgres". Query language is simple and powerful at the same time. Just consider the query as a usual boolean expression.

Note: Boolean operators work only in queries having two or more words. search.cgi ignores boolean operators in queries consisting of a single word. Thus, the query "~odbc" will just search for the word "odbc", without treating the "~" sign as NOT operator.
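To make the operator rules concrete, here is a small Python sketch that evaluates a boolean query against the set of words in one document. It is a toy model, not mnoGoSearch's query engine, and it rests on several assumptions: & (or +) binds tighter than |, operators must be written explicitly between words, and the behaviour of "~ just excludes" is modelled by requiring at least one non-negated query word to be present in the document. The function name matches is hypothetical.

```python
import re

TOKEN = re.compile(r"\w+|[&|~()+]")
OPERATORS = set("&|~()+")


def matches(query, doc_words):
    """Return True if the document word set satisfies the boolean query."""
    tokens = TOKEN.findall(query.lower())
    words = [t for t in tokens if t not in OPERATORS]
    if len(words) == 1:
        # Operators are ignored in single-word queries.
        return words[0] in doc_words
    # Assumption: NOT only filters results found through a non-negated word.
    positive = [t for i, t in enumerate(tokens)
                if t not in OPERATORS and (i == 0 or tokens[i - 1] != "~")]
    if not any(w in doc_words for w in positive):
        return False

    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_not()
        while pos < len(tokens) and tokens[pos] in ("&", "+"):
            pos += 1
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "~":
            return not parse_not()
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in OPERATORS:
            raise ValueError("unexpected operator: " + token)
        return token in doc_words

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


print(matches("(mysql | msql) & ~postgres", {"msql", "driver"}))
```

Under these assumptions the query "~mysql & ~odbc" matches no document, and "~odbc" on its own behaves like a plain search for "odbc".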

Restrict searched words to a section

Since 3.2.39, the search query syntax understands section name specifiers. For example, "title:web body:server" will find those documents having the word "web" in their titles and at the same time the word "server" in their bodies. To make the search recognize section names, copy all Section commands from indexer.conf into search.htm.

Note: Section name references can be combined with boolean operators.

Phrase search

Phrase search is activated by using quote characters around the words. For example, the query `"search engine"' will return only those documents having the word "search" immediately followed by the word "engine", while the query `search engine' (i.e. without surrounding quotes) will not require both words to be close to each other.

Note: It is possible to combine two or more phrases in the same query, as well as combine phrases with boolean operators.

Since 3.2.39, automatic phrase search is forced for complex words having dots, dashes, underscores, commas and slashes (-_.,/) as delimiters between word parts. For example, `max_allowed_packet' now automatically searches for phrase `"max allowed packet"', not just for three separate words.
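The automatic phrase rule amounts to splitting a complex word on its delimiters and searching for the parts as a phrase. A minimal Python sketch of that rewrite follows; the function name auto_phrase is made up for this example, and the real tokenizer in mnoGoSearch may treat edge cases differently.

```python
import re

# The delimiters listed above: dash, underscore, dot, comma and slash.
DELIMITERS = re.compile(r"[-_.,/]")


def auto_phrase(term):
    """Rewrite a complex word as a quoted phrase; leave simple words alone."""
    parts = [part for part in DELIMITERS.split(term) if part]
    if len(parts) > 1:
        return '"' + " ".join(parts) + '"'
    return term


print(auto_phrase("max_allowed_packet"))
```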

Exact section match

Since 3.3.0, exact section match syntax is available. An exact section match query consists of a section specifier (as described in the Section called Restrict searched words to a section), followed by the EQUAL sign, followed by a phrase in quotes. For example, the search query `title="search engine"' will return only those documents whose title is exactly the two words "search engine".

Exact section match is possible only with SaveSectionSize set to yes.

How search handles expired documents

Expired documents are still searchable with their old content.