Extended indexing features

News extensions

Installation

  1. Compile:

    Unpack the mnoGoSearch distribution archive. Run the configure script with the --enable-news option. Then run make and make install as described in the regular installation instructions.

  2. Create Database.

  3. Install indexer.conf.

  4. You are now ready to run indexer for the first time, following the instructions in indexer.conf.

Indexing SQL database tables (htdb: virtual URL scheme)

mnoGoSearch can index text fields of SQL database tables through the so-called htdb: virtual URL scheme.

Using the htdb:/ virtual scheme, you can build a full-text index of your SQL tables, as well as index your database-driven WWW server.

Note: The table you want to index must have a PRIMARY KEY.

HTDB indexer.conf commands

Four indexer.conf commands provide HTDB. They are HTDBAddr, HTDBList, HTDBLimit and HTDBDoc.

HTDBAddr is used to specify a database connection. Its syntax is identical to that of the DBAddr command. If no HTDBAddr command is specified, the data will be fetched using the same connection given in the DBAddr command.

HTDBList specifies the SQL query that generates a list of all URLs corresponding to records in the table, using the PRIMARY KEY field. You may use either absolute or relative URLs in the HTDBList command.

For example:


HTDBList "SELECT CONCAT('htdb:/',id) FROM messages"
    or
HTDBList "SELECT id FROM messages"

HTDBLimit may be used to specify the maximum number of records fetched by one SELECT operation. This reduces memory usage when indexing big tables. For example:


HTDBLimit 512

HTDBDoc is an SQL query that fetches a particular record from the database using the PRIMARY KEY value.

The HTDBList SQL query is used for all URLs which end with the '/' sign. For other URLs, the SQL query given in HTDBDoc is used.

The HTDBDoc SQL query must return a single row. If there is no result from HTDBDoc, or the query returns several records, the HTDB retrieval system generates a "HTTP 404 Not Found" response. This can happen at reindex time if the record was deleted from your table since the last reindexing. You can use HoldBadHrefs 0 to delete such records from the mnoGoSearch tables as well.

Three types of HTDBDoc SQL queries are understood.

  • A single-column result with a fully formatted HTTP response, including the standard HTTP status line. See the Section called Supported HTTP response codes to learn how indexer behaves when it receives the various HTTP statuses. An HTDBDoc SQL query can also optionally include HTTP headers understood by indexer, such as Content-Type, Last-Modified and Content-Encoding. So you can build a very flexible indexing system by returning different HTTP statuses and headers.

    Example:

    
HTDBDoc "SELECT CONCAT(\
    'HTTP/1.0 200 OK\\r\\n',\
    'Content-type: text/plain\\r\\n',\
    '\\r\\n',\
    msg) \
    FROM messages WHERE id='$1'"
    

  • A multiple-column result with the "HTTP/" substring at the beginning of the first column. All columns are concatenated using "\r\n" delimiters to generate the HTTP response. The first column that returns an empty string is considered the delimiter between the header and the content parts of the HTTP response, and is replaced by the "\r\n" string. This is simply an easier way to write the previous type of HTDBDoc response, without concatenation operators/functions and "\r\n" header delimiters.

    Example:

    
HTDBDoc "SELECT 'HTTP/1.0 200 OK','Content-type: text/plain','',msg \
    FROM messages WHERE id='$1'"
    

  • A single- or multiple-column result without an "HTTP/" header. This is the simplest HTDBDoc response type. The SQL column names returned by the query are associated with the names of the Sections present in indexer.conf.

    Example:

    
Section body  1 256
Section title 2 256
HTDBDoc "SELECT title, body FROM messages WHERE id='$1'"
    

    In this example, the values of the columns "title" and "body" will be associated with the sections "title" and "body" respectively.

    Columns with the names "status" and "last_mod_time" have a special meaning: the HTTP status and the document modification time, respectively. The status should be an integer code in HTTP notation, and the modification time should be in Unix timestamp format, the number of seconds since January 1, 1970.

    Example:

    
HTDBDoc "SELECT title, body, \
    CASE WHEN messages.deleted THEN 404 ELSE 200 END as status,\
    timestamp as last_mod_time FROM messages WHERE id='$1'"
    

    The above example demonstrates the use of the special columns. The SQL query will return status 404 (Not Found) for all documents marked as deleted, which will make indexer remove these documents from the search index when re-indexing the data. Also, this query makes indexer use the column "timestamp" as the document modification time.

    If a column contains data in HTML format, you can specify the "html" keyword in the corresponding Section command, which will make indexer apply the HTML parser to this column and thus remove all HTML tags and comments:

    Example:

    
Section title      1 256
Section wiki_text  2 16000 html
HTDBDoc "SELECT title, wiki_text FROM messages WHERE id='$1'"
    

HTDB variables

You may use the PATH parts of URLs as parameters of both HTDBList and HTDBDoc SQL queries. The parts are referenced as $1, $2, ... $n, where n is the position of the PATH part:


htdb:/part1/part2/part3/part4/part5
         $1    $2    $3    $4    $5

For example, you have the indexer.conf command:


HTDBList "SELECT id FROM catalog WHERE category='$1'"

When htdb:/cars/ URL is indexed, $1 will be replaced with 'cars':


SELECT id FROM catalog WHERE category='cars'
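The substitution rule can be sketched in code. The following is an illustrative Python sketch, not part of mnoGoSearch; the function name is hypothetical:

```python
def substitute_htdb_params(url: str, query: str) -> str:
    """Replace $1..$n in an HTDB SQL query with the PATH parts of an htdb:/ URL."""
    # Strip the scheme prefix and split the remaining path into parts.
    path = url[len("htdb:/"):]
    parts = path.split("/")
    # Substitute highest-numbered placeholders first so that e.g. $10
    # is not clobbered by a partial match against $1.
    for n in range(len(parts), 0, -1):
        query = query.replace(f"${n}", parts[n - 1])
    return query

query = "SELECT id FROM catalog WHERE category='$1'"
print(substitute_htdb_params("htdb:/cars/", query))
# SELECT id FROM catalog WHERE category='cars'
```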

You may use long URLs to provide several parameters to both HTDBList and HTDBDoc queries. For example, htdb:/path1/path2/path3/path4/id with query:


HTDBList "SELECT id FROM table WHERE field1='$1' AND field2='$2' and field3='$3'"

This query will generate the following URLs:


htdb:/path1/path2/path3/path4/id1
...
htdb:/path1/path2/path3/path4/idN

for all values of the field "id" returned by the HTDBList query.

Using multiple HTDB sources

You can index multiple HTDB sources by specifying several HTDBDoc, HTDBList and Server commands in the same indexer.conf.


Section body  1 256
Section title 2 256

HTDBList "SELECT id FROM t1"
HTDBDoc  "SELECT title, body FROM t1 WHERE id=$2"
Server htdb:/t1/

HTDBList "SELECT id FROM t2"
HTDBDoc  "SELECT title, body FROM t2 WHERE id=$2"
Server htdb:/t2/

HTDBList "SELECT id FROM t3"
HTDBDoc  "SELECT title, body FROM t3 WHERE id=$2"
Server htdb:/t3/

Creating a full text index

Using the htdb:/ scheme, you can create a full-text index and use it in your SQL application. Imagine you have a big SQL table which stores web board messages in plain-text format, and you want to build an application with a message search facility. Let's say the messages are stored in the "messages" table with two fields, "id" and "msg": "id" is an integer PRIMARY KEY and "msg" is a big text field containing the messages themselves. A usual SQL LIKE search may take a long time to answer:


SELECT id, msg FROM messages WHERE msg LIKE '%someword%'

Using the mnoGoSearch htdb: scheme, you can create a full-text index on the "messages" table. Install mnoGoSearch as usual, then edit your indexer.conf:


DBAddr mysql://foo:bar@localhost/mnogosearch/?dbmode=single

HTDBAddr mysql://foofoo:barbar@localhost/database/

HTDBList "SELECT id FROM messages"

HTDBDoc "SELECT CONCAT(\
'HTTP/1.0 200 OK\\r\\n',\
'Content-type: text/plain\\r\\n',\
'\\r\\n',\
msg) \
FROM messages WHERE id='$1'"

Server htdb:/

When started, indexer will insert the 'htdb:/' URL into the database and run the SQL query given in HTDBList. It will produce the values 1, 2, 3, ..., N as a result. These values are treated as links relative to the 'htdb:/' URL, so a list of new URLs of the form htdb:/1, htdb:/2, ..., htdb:/N will be added into the database. Then the HTDBDoc SQL query will be executed for each new URL, producing an HTTP document of the form:


HTTP/1.0 200 OK
Content-Type: text/plain

<some text from the 'msg' field here>

This document will be used to create a full-text index from the words of the 'msg' field. The words will be stored in the 'dict' table, assuming that we are using the 'single' storage mode.

After indexing you can use mnoGoSearch tables to perform search:


SELECT url.url 
FROM url,dict 
WHERE dict.url_id=url.rec_id 
AND dict.word='someword';

Since the mnoGoSearch 'dict' table has an index on the 'word' field, this query executes much faster than queries which use SQL LIKE searches on the 'messages' table.

You can also use several words in search:


SELECT url.url, count(*) as c 
FROM url,dict
WHERE dict.url_id=url.rec_id 
AND dict.word IN ('some','word')
GROUP BY url.url
ORDER BY c DESC;

Both queries will return 'htdb:/XXX' values in the url.url field. Your application then has to strip the leading 'htdb:/' from those values to obtain the PRIMARY KEY values of your 'messages' table.
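For illustration, stripping the prefix might look like this in application code (a hedged sketch; the helper name is hypothetical):

```python
def htdb_url_to_key(url: str) -> str:
    """Strip the leading 'htdb:/' to recover the PRIMARY KEY value."""
    prefix = "htdb:/"
    if not url.startswith(prefix):
        raise ValueError("not an htdb URL: %s" % url)
    return url[len(prefix):]

# URLs as they would come back from the search queries above:
urls = ["htdb:/17", "htdb:/42", "htdb:/103"]
keys = [htdb_url_to_key(u) for u in urls]
print(keys)  # ['17', '42', '103']
```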

Indexing SQL database driven web server

You can also use the htdb:/ scheme to index your database-driven WWW server. It allows you to create indexes without invoking the web server during indexing, which is much faster and requires less CPU than indexing through the WWW server directly.

The main idea of indexing a database-driven web server is to build the full-text index as usual, except that the search must produce real URLs instead of URLs of the 'htdb:/...' form. This can be achieved using the mnoGoSearch aliasing tools.

Take a look at the sample configuration in doc/samples/htdb.conf. It is the indexer.conf used to index our webboard.

The HTDBList command generates URLs in the form:


http://search.mnogo.ru/board/message.php?id=XXX

where XXX is a PRIMARY KEY value of the "messages" table.

For each PRIMARY KEY value, the HTDBDoc command generates a text/html document with HTTP headers and content like this:


<HTML>
<HEAD>
<TITLE> ... subject field here .... </TITLE>
<META NAME="Description" Content=" ... author here ...">
</HEAD>
<BODY> ... message text here ... </BODY>
</HTML>

At the end of doc/samples/htdb.conf we wrote three commands:


Server htdb:/
Realm  http://search.mnogo.ru/board/message.php?id=*
Alias  http://search.mnogo.ru/board/message.php?id=  htdb:/

The first command tells the indexer to execute the HTDBList query, which will generate a list of messages in the form:


http://search.mnogo.ru/board/message.php?id=XXX

The second command tells the indexer to accept such message URLs, using a string match with the '*' wildcard at the end.

The third command replaces the "http://search.mnogo.ru/board/message.php?id=" substring in the URL with "htdb:/" when the indexer retrieves documents with messages. It means that "http://search.mnogo.ru/board/message.php?id=xxx" URLs will be shown in search results, while "htdb:/xxx" URLs will be indexed instead, where xxx is the PRIMARY KEY value, the ID of a record in the "messages" table.
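As an illustrative sketch (not mnoGoSearch code; the function name is hypothetical), the Alias rewrite described above is a simple prefix replacement, applied before retrieval while the original URL is kept for display:

```python
def apply_alias(url: str, alias_from: str, alias_to: str) -> str:
    """Rewrite a URL prefix the way an Alias command does."""
    if url.startswith(alias_from):
        return alias_to + url[len(alias_from):]
    return url  # no match: leave the URL unchanged

shown_url = "http://search.mnogo.ru/board/message.php?id=123"
fetched_url = apply_alias(
    shown_url,
    "http://search.mnogo.ru/board/message.php?id=",
    "htdb:/",
)
print(fetched_url)  # htdb:/123
```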

Indexing binaries output (exec: and cgi: virtual URL schemes)

mnoGoSearch supports the exec: and cgi: virtual URL schemes. They allow running an external program. This program must write its result to stdout. The result must follow the HTTP standard, i.e. an HTTP response header followed by the document's content.

For example, when indexing both cgi:/usr/local/bin/myprog and exec:/usr/local/bin/myprog, the indexer will execute the /usr/local/bin/myprog program.

Passing parameters to cgi: virtual scheme

When executing a program given in the cgi: virtual scheme, the indexer emulates this program running under an HTTP server. It creates the REQUEST_METHOD environment variable with the value "GET", and the QUERY_STRING variable according to the CGI standard. For example, if cgi:/usr/local/apache/cgi-bin/test-cgi?a=b&d=e is being indexed, the indexer creates QUERY_STRING with the value a=b&d=e. The cgi: virtual URL scheme allows you to index your site without invoking the web server, even if you want to index CGI scripts. For example, suppose you have a web site with static documents under /usr/local/apache/htdocs/ and CGI scripts under /usr/local/apache/cgi-bin/. Use the following configuration:


Server http://localhost/
Alias  http://localhost/cgi-bin/	cgi:/usr/local/apache/cgi-bin/
Alias  http://localhost/		file:/usr/local/apache/htdocs/
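A minimal program indexable through the cgi: scheme can be sketched as follows. This is an illustrative example, not from the mnoGoSearch distribution; it reads QUERY_STRING as set by the indexer and writes a complete HTTP response to stdout, as the schemes require:

```python
#!/usr/bin/env python3
import os
from urllib.parse import parse_qs

def cgi_response() -> str:
    # Read the parameters the indexer passes via the environment
    # (REQUEST_METHOD=GET, QUERY_STRING=...), then build a full
    # HTTP response: status line, headers, blank line, body.
    params = parse_qs(os.environ.get("QUERY_STRING", ""))
    body = "Parameters received: %s" % sorted(params.items())
    return ("HTTP/1.0 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            "\r\n" + body)

if __name__ == "__main__":
    print(cgi_response())
```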

Passing parameters to exec: virtual scheme

For the exec: scheme, the indexer does not create a QUERY_STRING variable. Instead, it builds a command line with everything given in the URL after the '?' sign passed as an argument. For example, when indexing exec:/usr/local/bin/myprog?a=b&d=e, this command will be executed:


/usr/local/bin/myprog "a=b&d=e" 
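A program for the exec: scheme can be sketched the same way as the cgi: example, except that the query arrives in argv rather than in the environment. This is an illustrative sketch, not part of mnoGoSearch:

```python
#!/usr/bin/env python3
import sys

def exec_response(query: str) -> str:
    # Under exec:, the part of the URL after '?' arrives as a
    # command-line argument; still return a full HTTP response.
    body = "Arguments received: %s" % query
    return ("HTTP/1.0 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            "\r\n" + body)

if __name__ == "__main__":
    arg = sys.argv[1] if len(sys.argv) > 1 else ""
    print(exec_response(arg))
```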

Using exec: virtual scheme as an external retrieval system

The exec: virtual scheme can be used as an external retrieval system. It allows using protocols which are not supported natively by mnoGoSearch. For example, you can use the curl program, available from http://curl.haxx.se/, to index HTTPS sites.

Put this short script into /usr/local/mnogosearch/etc/ under the name curl.sh.


#!/bin/sh
/usr/local/bin/curl -i "$1" 2>/dev/null

This script takes a URL given as a command-line argument and runs curl to download it. The -i option tells curl to include the HTTP headers in the output.

Now use these commands in your indexer.conf:


Server https://some.https.site/
Alias  https://  exec:/usr/local/mnogosearch/etc/curl.sh?https://

When indexing https://some.https.site/path/to/page.html, the indexer will translate this URL to


exec:/usr/local/mnogosearch/etc/curl.sh?https://some.https.site/path/to/page.html

then execute the curl.sh script:


/usr/local/mnogosearch/etc/curl.sh "https://some.https.site/path/to/page.html"

and take its output.

Mirroring

You may specify the path to a root directory to enable site mirroring:


MirrorRoot /path/to/mirror

You may also specify a root directory for the headers of mirrored documents; the indexer will then store the HTTP headers on the local disk too.


MirrorHeadersRoot /path/to/headers

You may specify the period during which previously mirrored files are used during indexing instead of being downloaded again.


MirrorPeriod <time>

This is very useful when you experiment with mnoGoSearch, indexing the same hosts repeatedly without wanting to generate much Internet traffic. If MirrorHeadersRoot is not specified and headers are not stored on the local disk, the default Content-Types given in the AddType commands will be used. The default value of MirrorPeriod is -1, which means mirrored files are not used.

<time> is in the form xxxA[yyyB[zzzC]] (spaces are allowed between xxx and A, yyy and B, and so on), where xxx, yyy, zzz are numbers (they can be negative!) and A, B, C is one of the following:


		s - second
		M - minute
		h - hour
		d - day
		m - month
		y - year

(these letters are the same as in strptime/strftime functions)

Examples:


15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and six months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any letter, the time is assumed to be given in seconds (this behavior is kept for compatibility with versions prior to 3.1.7).
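The time format above can be illustrated with a small parser. This is an assumption-laden sketch, not mnoGoSearch code; in particular, the month and year lengths (30 and 365 days) are illustrative choices:

```python
import re

# Seconds per unit; 'm' (month) = 30 days and 'y' (year) = 365 days
# are assumptions made for this sketch.
UNITS = {"s": 1, "M": 60, "h": 3600, "d": 86400,
         "m": 30 * 86400, "y": 365 * 86400}

def parse_period(spec: str) -> int:
    """Parse a MirrorPeriod-style value like '4h30M' into seconds."""
    spec = spec.replace(" ", "")          # spaces between parts are allowed
    if re.fullmatch(r"[+-]?\d+", spec):
        return int(spec)                  # a bare number means seconds
    total = 0
    for sign, num, unit in re.findall(r"([+-]?)(\d+)([sMhdmy])", spec):
        total += (-1 if sign == "-" else 1) * int(num) * UNITS[unit]
    return total

print(parse_period("4h30M"))     # 16200
print(parse_period("1h-10M+1s")) # 3001
```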

The following command will force using local copies for one day:


MirrorPeriod 1d

If your pages are already indexed, then when you re-index with -a, the indexer will check the headers and download only those files that have been modified since the last indexing. Pages that have not been modified will not be downloaded, and therefore not mirrored either. To create a complete mirror you need to either (a) start again with a clean database or (b) use the -m switch.

You can actually use the created files as a full-featured mirror of your site. However, be careful: indexer will not download a document that is larger than MaxDocSize; such a document will be only partially downloaded. If your site has no large documents, everything will be fine.