Section

Name

Section -- defines a document section

indexer.conf search.htm

Synopsis

Section {name} {number} {maxlen} [when] [format] [cloneflag] [separator] [{expression} {replacement}]

Description

When used in search.htm, the "Section" command requires only the first three parameters and activates recognition of section name references in search queries. See the Section called Restrict searched words to a section in Chapter 8 for details. The "Section" command has no other purpose in search.htm. The rest of this article applies mostly to indexer.conf.

"name" is the section name and "number" is the section ID, between 0 and 255. Use 0 for sections that you don't want to index. It is better to use a different section ID for each part of a document: at search time you can then give each part a different weight, or even exclude some sections altogether. "maxlen" is the maximum length of the section content that will be stored in the database.
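
The constraints on the first three parameters can be sketched in Python as follows. This is only an illustration, not indexer's own configuration parser: the names parse_section and SectionDef, and the use of shlex to handle quoting, are assumptions made for this example.

```python
import shlex
from typing import NamedTuple


class SectionDef(NamedTuple):
    name: str      # section name, e.g. "body"
    number: int    # section ID, 0..255
    maxlen: int    # maximum length stored in the database
    indexed: bool  # False when the ID is 0


def parse_section(line: str) -> SectionDef:
    """Parse the three mandatory parameters of a Section line (hypothetical helper)."""
    tokens = shlex.split(line)
    if len(tokens) < 4 or tokens[0] != "Section":
        raise ValueError("expected: Section {name} {number} {maxlen} ...")
    name, number, maxlen = tokens[1], int(tokens[2]), int(tokens[3])
    if not 0 <= number <= 255:
        raise ValueError("section ID must be between 0 and 255")
    # An ID of 0 means the section is not indexed.
    return SectionDef(name, number, maxlen, indexed=(number != 0))
```

For instance, parse_section("Section User.Date 0 10") yields a definition whose indexed field is False, while an ID of 256 is rejected.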

"when" is an optional parameter defining when the section should be created. Three values are possible:

"format" is a flag telling indexer which parser to use for the section. Two values are understood: "text" and "html".

This flag is designed for use in combination with the simple type of HTDBDoc queries (i.e. queries consisting of a list of data columns, without full HTTP headers). The default value is "text". If your SQL table contains data in HTML format, you can specify the "html" option to force removal of HTML tags. See the Section called Indexing SQL database tables (htdb: virtual URL scheme) in Chapter 3 for details about simple HTDBDoc queries.

"cloneflag" is a flag describing whether the section should affect clone detection. It can be "DetectClone" or "cdon", or "NoDetectClone" or "cdoff". By default, url.* section values are not taken into account for clone detection, while all other sections take part in clone detection.

"separator" is a string used to separate the parts of a section. This is useful for attribute sections.

"expression" and "replacement" can be used to extract user defined sections.
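
The idea behind an expression/replacement pair can be illustrated with a short Python sketch. It uses Python's re module rather than indexer's own regular expression engine, so matching details may differ. The function name extract_section and the conversion of $1-style references are assumptions made for this example.

```python
import re
from typing import Optional


def extract_section(document: str, expression: str, replacement: str) -> Optional[str]:
    """Return the section value extracted from document, or None if nothing matches."""
    match = re.search(expression, document)
    if match is None:
        return None
    # Rewrite "$1"-style references into Python's "\g<1>" template syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)
```

Called with the expression "<h1>(.*)</h1>" and the replacement "$1", the function returns the text between the first <h1> and the last </h1> found on the same line of the document.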

There is a special "User.Date" section. It makes it possible to use a user-defined meta tag (or even any other document part) as an alternative "Last-Modified" value. A number of widespread formats are understood:


Sun, 06 Nov 1994 08:49:37 GMT
Sun, 6 Nov 1994 08:49:37 GMT
Sunday, 06-Nov-94 08:49:37 GMT
Sun Nov 6 08:49:37 1994
1994-11-06
06.11.1994
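
To check in advance whether a date string falls into one of the formats listed above, a rough Python equivalent can be used. This is a sketch only and not indexer's actual date parser: the function name parse_user_date and the strptime patterns are assumptions chosen to match the six sample values.

```python
from datetime import date, datetime
from typing import Optional

# One strptime pattern per listed format; the first pattern covers both
# "06 Nov" and "6 Nov" because %d accepts one or two digits.
_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",   # Sun, 06 Nov 1994 08:49:37 GMT
    "%A, %d-%b-%y %H:%M:%S GMT",   # Sunday, 06-Nov-94 08:49:37 GMT
    "%a %b %d %H:%M:%S %Y",        # Sun Nov 6 08:49:37 1994
    "%Y-%m-%d",                    # 1994-11-06
    "%d.%m.%Y",                    # 06.11.1994
)


def parse_user_date(value: str) -> Optional[date]:
    """Return the calendar date for a recognized format, or None otherwise."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

All six sample values above parse to the same calendar date, 6 November 1994. Any other string yields None.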

"nobody" is another section with a special meaning. When parsing HTML documents, indexer by default ignores words outside the <body> and </body> tags. To index these words, define a special section "nobody" with the same ID and length as the section "body". This is useful when indexing a remote site with broken HTML pages that you cannot modify, or when indexing local HTML pages that use SSI (server side include) directives directly from disk through the file:/// scheme. In the latter case the <body> and </body> tags may not be in the HTML pages themselves but in shared files included with SSI directives such as <!--#include virtual="../include/top.html"-->. For example:


Section body   1 256
Section nobody 1 256

Examples


Section body                    1       256
Section title                   2       128
Section meta.keywords           3       128
Section meta.description        4       128
Section header.server           5       64
Section url.file                6       0
Section url.path                7       0
Section url.host                8       0
Section url.proto               9       0
Section crosswords              10      0
Section Charset                 11      32
Section Content-Type            12      64
Section Content-Language        13      16
Section attribute.alt           14      128
Section attribute.label         15      128
Section attribute.summary       16      128
Section attribute.title         17      128
Section References              18      0
Section Message-ID              19      0
Section Parent-ID               20      0
Section MP3.Song                21      128
Section MP3.Album               22      128
Section MP3.Artist              23      128
Section MP3.Year                24      128
Section CachedCopy              25      64000
Section attribute.face          27      0
Section attribute.title         28      0 "."

# A user-defined section
Section h1                      29      128 "<h1>(.*)</h1>" $1

# User-defined date extracted from the "Date" meta-tag
Section User.Date               0       10 '<META NAME="Date" +CONTENT="([^"]*)">' "$1"

# Replacing Content-Type with application/msword
Section Content-Type            0       64 afterheaders cdoff "" "${URL}" "http://site/*.doc" "application/msword"

# Using "afterguesser" in conjunction with ${HTTP.LocalCharsetContent}
Section HTTP.LocalCharsetContent 0      0
Section h1lcs                   30      128 afterguesser cdoff "" "${HTTP.LocalCharsetContent}" "<h1>(.*)</h1>" $1

# Using a simple HTDBDoc query for a SQL table with text and HTML columns
Section column1 1 256 text
Section column2 2 256 html

See also

MaxDocSize, MaxWordLength, MinWordLength.