Chapter 3. Indexing

Table of Contents
Indexing in general
Supported HTTP response codes
Content-Encoding support
indexer configuration
Extended indexing features
Using syslog
Disabling Apache logging
Storing cached copies

Indexing in general

Configuration

First, configure mnoGoSearch. Indexer configuration is covered mostly by the indexer.conf-dist file, which you can find in the etc directory of the mnoGoSearch distribution. You may also look at the other *.conf samples in the doc/samples directory.

To set up the indexer.conf file, change to the /etc directory of the mnoGoSearch installation, copy indexer.conf-dist to indexer.conf, and edit it.

To configure the search front-ends (search.cgi, search.php3, or others), copy the search.htm-dist file in the /etc directory of the mnoGoSearch installation to search.htm and edit it. See the section "How to write search result templates" in Chapter 8 for a detailed description.

Running indexer

Run indexer periodically (once a week, a day, or an hour, for example) to pick up the latest modifications to your web sites. You may also add indexer to your crontab.

SQL back-end notes

By default, indexer called without any command line arguments reindexes only expired documents. You can change the expiration period with the Period command in indexer.conf. To reindex all documents regardless of whether they have expired, use the -a option; indexer will then mark all documents as expired at startup.
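The expiration rule can be pictured with a short Python sketch. It is only a conceptual model, not mnoGoSearch's own code: the function name, the dictionary of expiry times, and the seven-day period are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def select_for_reindex(docs, now, force_all=False):
    """Return the URLs that one indexer run would revisit.

    docs maps a URL to the (hypothetical) time at which it expires.
    force_all models the -a option: every document counts as expired.
    """
    if force_all:
        return sorted(docs)
    return sorted(url for url, expires_at in docs.items() if expires_at <= now)


# Hypothetical data: a 7-day period stands in for the Period setting.
period = timedelta(days=7)
now = datetime(2024, 1, 8)
docs = {
    "http://example.com/old.html": datetime(2024, 1, 1) + period,
    "http://example.com/new.html": datetime(2024, 1, 5) + period,
}

print(select_for_reindex(docs, now))                  # default run
print(select_for_reindex(docs, now, force_all=True))  # models -a
```

In this model the default run returns only the document whose expiry time has passed, while the forced run returns every known URL.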

When retrieving documents, indexer sends the If-Modified-Since HTTP header for documents that are already stored in the database. When indexer receives a document, it calculates the document's checksum. If the checksum matches the old checksum stored in the database, indexer does not parse the document again. The -m command line option prevents indexer from sending If-Modified-Since headers and makes it parse documents even if the checksum is unchanged. This is useful, for example, when you have changed your Allow/Disallow rules in indexer.conf and pages that were previously disallowed need to be added.
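The two checks described above (the conditional request and the checksum comparison) can be sketched in Python as follows. This is an illustration under assumptions, not the indexer's implementation: the function names are hypothetical, and CRC32 is used only as a stand-in for whatever checksum indexer computes.

```python
import zlib


def build_request_headers(last_modified, force=False):
    """Return conditional request headers for a known document.

    force models the -m option: no If-Modified-Since header is sent.
    """
    if force or last_modified is None:
        return {}
    return {"If-Modified-Since": last_modified}


def should_parse(body, stored_checksum, force=False):
    """Decide whether a fetched body has to be parsed again.

    CRC32 is an assumption made for this sketch.
    """
    if force:
        return True
    return zlib.crc32(body) != stored_checksum


body = b"<html>hello</html>"
stored = zlib.crc32(body)

print(build_request_headers("Mon, 01 Jan 2024 00:00:00 GMT"))
print(should_parse(body, stored))              # unchanged body
print(should_parse(body, stored, force=True))  # models -m
```

With force set, both shortcuts are bypassed, which is why a forced run can pick up pages affected by changed Allow/Disallow rules.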

If mnoGoSearch retrieves a URL that returns a redirect status (HTTP 301, 302, or 303), it indexes the URL given in the Location: field of the HTTP header instead.

How to create SQL table structure

To create the SQL tables that mnoGoSearch requires, use indexer -Ecreate. When run with this argument, indexer looks up a file containing the SQL statements needed to create all tables for the database type and storage mode given in the DBAddr command of indexer.conf. The files are looked up in the /share directory of the mnoGoSearch installation, which is usually /usr/local/mnogosearch/share/mnogosearch/.

How to drop SQL table structure

To drop all SQL tables created by mnoGoSearch, use indexer -Edrop. The file with the SQL statements required to drop the tables is looked up in the /share directory of the mnoGoSearch installation.

Subsection control

indexer has the -t, -u, and -s options to limit its action to only a part of the database. -t corresponds to the 'Tag' limitation, -u is a URL substring limitation (SQL LIKE wildcards are supported), and -s limits the action to URLs with the given HTTP status. Limit options in the same group (for example, several -u options) are OR-ed, and options from different groups (for example, -u and -s) are AND-ed.
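The OR/AND combination rule can be illustrated by building a SQL WHERE clause in Python. This is a sketch of the rule only: the column names tag, url, and status, the use of equality for tags, and the function name are assumptions, not the statements indexer actually issues.

```python
def build_where(tags=(), url_patterns=(), statuses=()):
    """Build a parameterized WHERE clause from limit options.

    Values within one group are OR-ed; the groups are AND-ed.
    Column names and operators are hypothetical.
    """
    groups = []
    params = []
    for column, operator, values in (
        ("tag", "=", tags),            # models -t
        ("url", "LIKE", url_patterns), # models -u
        ("status", "=", statuses),     # models -s
    ):
        if values:
            conditions = " OR ".join(f"{column} {operator} ?" for _ in values)
            groups.append(f"({conditions})")
            params.extend(values)
    return " AND ".join(groups), params


clause, params = build_where(
    url_patterns=["http://a.example/%", "http://b.example/%"],
    statuses=[404],
)
print(clause)
print(params)
```

Two URL patterns plus one status therefore select documents that match either pattern and also have that status.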

How to clear database

To clear the whole database, use 'indexer -C'. You may also delete only a part of the database by using the -t, -u, and -s subsection control options.

Database Statistics

If you run indexer -S, it shows database statistics, including the total and expired document counts for each status. The -t, -u, and -s filters can be used in this mode too.

The meaning of status is:

  • 0 - new (not indexed yet) URL

If the status is not 0, it is an HTTP response code. Some of the HTTP codes are:

  • 200 - "OK" (URL is successfully indexed)

  • 301 - "Moved Permanently" (redirect to another URL)

  • 302 - "Moved Temporarily" (redirect to another URL)

  • 303 - "See Other" (redirect to another URL)

  • 304 - "Not modified" (URL has not been modified since last indexing)

  • 401 - "Authorization required" (use login/password for given URL)

  • 403 - "Forbidden" (you have no access to this URL(s))

  • 404 - "Not found" (there were references to URLs that did not exist)

  • 500 - "Internal Server Error" (error in cgi, etc)

  • 503 - "Service Unavailable" (host is down, connection timed out)

  • 504 - "Gateway Timeout" (read timeout when retrieving document)

HTTP 401 means that the URL is password protected. You can use the AuthBasic command in indexer.conf to set a login:password for such URLs.

HTTP 404 means that one of your documents contains an incorrect reference (a reference to a resource that does not exist).

See the HTTP specification for further explanation of the different HTTP status codes.
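As an illustration of how the status values listed above can be read, the following Python sketch tallies a list of stored statuses and attaches the meanings given in this section. It does not reproduce the output format of indexer -S; the function name and the sample data are hypothetical.

```python
from collections import Counter

# Meanings as listed in this section; 0 is the only non-HTTP value.
STATUS_MEANING = {
    0: "new (not indexed yet)",
    200: "OK",
    301: "Moved Permanently",
    302: "Moved Temporarily",
    303: "See Other",
    304: "Not modified",
    401: "Authorization required",
    403: "Forbidden",
    404: "Not found",
    500: "Internal Server Error",
    503: "Service Unavailable",
    504: "Gateway Timeout",
}


def summarize(statuses):
    """Return (status, meaning, count) tuples ordered by status."""
    counts = Counter(statuses)
    return [
        (code, STATUS_MEANING.get(code, "other HTTP code"), count)
        for code, count in sorted(counts.items())
    ]


# Hypothetical statuses for four stored URLs.
for code, meaning, count in summarize([200, 404, 200, 0]):
    print(f"{code:>3}  {count:>5}  {meaning}")
```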

Link validation

When started with the -I command line argument, indexer displays each URL together with its referrer. This is very useful for finding bad links on your site. Do not use the HoldBadHrefs 0 command in indexer.conf in this mode. You may use the subsection control options -t, -u, and -s in this mode. For example, indexer -I -s 404 displays all 'Not found' URLs with their referrers, that is, the documents in which links to those bad URLs were found. By setting the relevant indexer.conf commands and command line options, you may use mnoGoSearch specifically for site validation.
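Once URL and referrer pairs are available, grouping the bad links by the page that contains them makes the repair work easier. The Python sketch below assumes the pairs have already been collected as (url, referrer, status) tuples; that tuple shape and the function name are hypothetical and do not describe the output format of indexer -I.

```python
from collections import defaultdict


def broken_links_by_referrer(rows, status=404):
    """Group URLs with the given status under the page that links to them.

    rows is an iterable of (url, referrer, status) tuples (assumed shape).
    """
    report = defaultdict(list)
    for url, referrer, code in rows:
        if code == status:
            report[referrer].append(url)
    return dict(report)


# Hypothetical sample data.
rows = [
    ("http://example.com/missing.html", "http://example.com/index.html", 404),
    ("http://example.com/about.html", "http://example.com/index.html", 200),
    ("http://example.com/gone.html", "http://example.com/news.html", 404),
]

for referrer, urls in broken_links_by_referrer(rows).items():
    print(referrer, "links to", urls)
```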

Parallel indexing

MySQL and PostgreSQL users may run several indexers simultaneously with the same indexer.conf file. We have successfully tested 30 simultaneous indexers with a MySQL database. indexer uses the MySQL and PostgreSQL locking mechanisms to avoid double indexing of the same URL by different indexer processes. Parallel indexing in the same database is not yet implemented for other back-ends. You may use the multi-threaded version of indexer with any SQL back-end that supports several simultaneous connections. The multi-threaded version of indexer uses its own locking mechanism.

Using the same database with different indexer.conf files is not recommended! One process could add documents that another deletes, and the cycle may never stop.
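The goal of the locking, namely that no URL is handled by two workers, can be shown with a toy in-process analogue in Python. This is not how mnoGoSearch locks: it replaces the database locks with a threading.Lock around a shared list, and all names in it are hypothetical.

```python
import threading


class UrlQueue:
    """Hands out each pending URL to exactly one worker."""

    def __init__(self, urls):
        self._pending = list(urls)
        self._lock = threading.Lock()

    def claim(self):
        # Only one worker at a time may take a URL off the list.
        with self._lock:
            return self._pending.pop() if self._pending else None


def run_workers(urls, worker_count=4):
    """Process all URLs with several threads; return what was processed."""
    queue = UrlQueue(urls)
    processed = []
    processed_lock = threading.Lock()

    def worker():
        while True:
            url = queue.claim()
            if url is None:
                return
            with processed_lock:
                processed.append(url)  # stands in for fetching and indexing

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


urls = [f"http://example.com/page{i}.html" for i in range(100)]
result = run_workers(urls)
print(len(result), len(set(result)))
```

Every URL appears exactly once in the result, whichever worker claimed it.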

On the other hand, you may run several indexer processes with different databases with ANY supported SQL back-end.