
CNSearch

The search engine for web-sites

User Manual / CNSearch 1.5.1

4.1 Indexer

4.1.1 search.conf

All indexer settings are contained in search.conf. This file has the following structure:

[Job name_of_task]
[Index]
Parameter1	Value1
Parameter2	Value2
Parameter3	Value3
[Index]
Parameter1	Value1
Parameter2	Value2
Parameter3	Value3

Parameters are set separately for each action. Each parameter name is separated from its value by spaces or tabs.

Note: Single-line comments can be used in the configuration file. Each comment starts with the "#" symbol.
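
The structure described above can be read with a few lines of code. The following Python sketch is not part of CNSearch; it only illustrates the file layout. It assumes that section headers occupy their own lines in square brackets and that a comment is a line whose first non-blank character is "#". The function name parse_search_conf is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_search_conf(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Parse search.conf-style text into a list of (section header, parameters)."""
    sections: List[Tuple[str, Dict[str, str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: a comment is a line whose first non-blank character is "#".
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new section such as "[Job name_of_task]" or "[Index]".
            sections.append((line[1:-1].strip(), {}))
            continue
        if not sections:
            raise ValueError(f"parameter outside of any section: {line!r}")
        # Name and value are separated by spaces or tabs; split on the first run.
        parts = line.split(None, 1)
        value = parts[1] if len(parts) > 1 else ""
        sections[-1][1][parts[0]] = value
    return sections


sample = "# nightly run\n[Job nightly]\n[Index]\nURL\thttp://www.novgorod.ru/frisbee/\nMaxFiles 50\n"
for header, params in parse_search_conf(sample):
    print(header, params)
```

Each [Index] section keeps its own dictionary, which mirrors the rule that parameters are set per section. A repeated parameter name inside one section overwrites the earlier value in this sketch.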

The following parameters can be used to optimize indexing:

4.1.1.1 Index Parameters

URL <url>

URL	url

An address starting with 'http://...' in HTTP indexing mode, or the path to a copy of the site in local disk mode.

For example:

For HTTP:

URL	http://www.novgorod.ru/frisbee/

For a local disk (Windows):

URL	c:/pub/home/frisbee/

For a local disk (Unix):

URL	/pub/home/frisbee/

Extensions <ext>

Extensions ext1,ext2,ext3

The parameter sets the list of file extensions to be indexed. It can be used in local disk mode only and is ignored in HTTP indexing mode. Extensions are separated by commas (",").

For example:

Extensions htm,html,shtml,shtm
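
The effect of the Extensions list in local disk mode can be pictured with a short Python sketch. It is an illustration only, not the indexer's code; the function name matches_extensions is hypothetical, and case-insensitive comparison is an assumption.

```python
from pathlib import PurePosixPath


def matches_extensions(path: str, extensions: str) -> bool:
    """Return True if the file extension appears in the comma-separated list."""
    allowed = {e.strip().lower() for e in extensions.split(",") if e.strip()}
    # Assumption: extensions are compared without regard to letter case.
    suffix = PurePosixPath(path).suffix.lstrip(".").lower()
    return suffix in allowed


print(matches_extensions("/pub/home/frisbee/index.shtml", "htm,html,shtml,shtm"))
print(matches_extensions("/pub/home/frisbee/logo.gif", "htm,html,shtml,shtm"))
```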

Type <typ>

Type typ

The parameter sets the type of the search index:

Default value is normal.

For example:

Type Strict

Path <path>

Path path

The parameter defines the path to the directory containing the index and log files.

For example:

Path c:\www\site.com

or

Path /home/www/site.com

CharSet <cset>

CharSet cset

The parameter sets the method used to identify the character encoding. The following methods are possible:

For example:

CharSet ByHTTPHeader

MaxFiles <num>

MaxFiles num

The parameter sets the maximum number of files to be indexed (10000 by default). Be careful: many web servers contain a huge number of looped links.

For example:

MaxFiles 50

MinWords <num>

MinWords num

The parameter sets the minimum number of words in an indexed document. Documents with fewer words are not added to the search index. This improves the quality of search results by filtering out small, insignificant documents. Default value is 1.

For example:

MinWords 30
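
The MinWords rule amounts to a simple threshold check. The Python sketch below illustrates it under the assumption that a word is a run of letters or digits; the indexer's actual tokenizer is not described in this manual, and the function name has_enough_words is hypothetical.

```python
import re


def has_enough_words(text: str, min_words: int = 1) -> bool:
    """Return True if the text contains at least min_words words."""
    # Assumption: a word is a run of letters, digits or underscores.
    return len(re.findall(r"\w+", text)) >= min_words


print(has_enough_words("Frisbee club of Novgorod", 30))
```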

Statistic <stat>

Statistic stat

The parameter sets how the report generated at the end of the indexing process is saved to stats.log. Available options:

For example:

Statistic Append

Exclude <excl>

Exclude excl1,excl2,excl3

The parameter sets a list of words to be excluded from indexing. Addresses containing at least one of the excluded words are also left out of the indexing queue. Words are separated by commas (",").

For example:

Exclude editpost.php?,reply.php?,admin/
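
The address check described above can be sketched in Python as a substring test. This is an illustration only; the function name is_excluded is hypothetical, and case-sensitive matching is an assumption.

```python
def is_excluded(url: str, exclude: str) -> bool:
    """Return True if the URL contains at least one word from the Exclude list."""
    words = [w.strip() for w in exclude.split(",") if w.strip()]
    # Assumption: a plain, case-sensitive substring match.
    return any(word in url for word in words)


rule = "editpost.php?,reply.php?,admin/"
print(is_excluded("http://site.com/forum/reply.php?id=3", rule))
print(is_excluded("http://site.com/forum/view.php?id=3", rule))
```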

ExcludeVar <var>

ExcludeVar var1,var2,var3

The parameter sets a list of variables to be removed from the site URLs. Variables are separated by commas (",").

For example:

ExcludeVar PHPSESSID,order
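
Removing variables such as session identifiers keeps the same page from being queued under many different addresses. A minimal Python sketch of this idea follows; it is not the indexer's implementation, the function name strip_vars is hypothetical, and the sample URL is invented for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_vars(url: str, exclude_var: str) -> str:
    """Remove the listed query variables from a URL."""
    names = {v.strip() for v in exclude_var.split(",") if v.strip()}
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in names]
    # urlencode re-encodes the remaining pairs; their order is preserved.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_vars("http://site.com/list.php?PHPSESSID=abc123&page=2&order=asc",
                 "PHPSESSID,order"))
```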

AddOption <opt>

AddOption opt

The parameter sets the indexing method and can be used in HTTP indexing mode only. The following values are available:

For example:

AddOption SubPages

StopWordsFile <file>

StopWordsFile file

The parameter sets the name of the file containing stop-words (see Stop-words).

For example:

StopWordsFile stop.txt

Language <lng>

The parameter sets the language. If this parameter is specified, the 'Accept-Language' field is included in the HTTP request header. This may affect the document contents returned by some sites.

For example:

Language ru
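
The header in question can be seen in a short Python sketch that builds a request with the standard urllib module. It only illustrates the HTTP mechanism and does not reflect how the indexer is written; the function name build_request is hypothetical.

```python
import urllib.request
from typing import Optional


def build_request(url: str, language: Optional[str] = None) -> urllib.request.Request:
    """Build an HTTP request, adding Accept-Language only when a language is set."""
    headers = {}
    if language:
        headers["Accept-Language"] = language
    return urllib.request.Request(url, headers=headers)


request = build_request("http://www.novgorod.ru/frisbee/", "ru")
# urllib stores header names in capitalized form ("Accept-language").
print(request.get_header("Accept-language"))
```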

AFrom <path>

AFrom path

The parameter sets a substring that will be replaced in the URL by the string specified in the ATo parameter.

For example:

AFrom  /home/dir/mysite/
ATo    http://search.codenet.ru/

ATo <url>

ATo url

The parameter sets a substring which will replace AFrom in the URL; it is used together with AFrom.

For example:

AFrom http://127.0.0.1/
ATo   http://www.codenet.ru/

or

AFrom c:/documents/www/www.codenet.ru/
ATo   http://www.codenet.ru/
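
The AFrom and ATo pair behaves like a string substitution applied to each address. The Python sketch below illustrates this with the values from the example; the function name rewrite_address is hypothetical, and replacing only the first occurrence is an assumption.

```python
def rewrite_address(address: str, a_from: str, a_to: str) -> str:
    """Replace the AFrom substring of an address with the ATo string."""
    # Assumption: only the first occurrence is replaced.
    return address.replace(a_from, a_to, 1)


print(rewrite_address("c:/documents/www/www.codenet.ru/docs/index.html",
                      "c:/documents/www/www.codenet.ru/",
                      "http://www.codenet.ru/"))
```

This is how a site indexed from a local copy can still produce search results that point to the public address.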

StartWord <word>

StartWord word

The parameter sets the word from which indexing starts. The page description is composed of the words that follow the starting word, which makes it possible to keep menus and similar elements out of the description.

For example:

StartWord about
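
How a description might be cut from the text after the starting word can be sketched in Python. The sketch is illustrative only: the function name description_after, the limit of 20 words, the case-insensitive match, and the fallback to the beginning of the text when the word is absent are all assumptions.

```python
def description_after(text: str, start_word: str, max_words: int = 20) -> str:
    """Compose a description from the words that follow start_word."""
    words = text.split()
    lowered = [word.lower() for word in words]
    try:
        position = lowered.index(start_word.lower())
    except ValueError:
        # Assumption: if the word is absent, start from the beginning.
        position = -1
    return " ".join(words[position + 1:position + 1 + max_words])


page_text = "Home News Contacts About CNSearch is a search engine"
print(description_after(page_text, "about", 3))
```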

Sleep <seconds>

Sleep seconds

The parameter defines the pause, in seconds, between indexing successive pages of the site.

Example:

Sleep 5

ShowURL <yesno>

ShowURL yesno

Display page addresses during indexing. Default value is "yes".

Example:

ShowURL no

ShowEmail <yesno>

ShowEmail yesno

Display e-mail addresses (mailto) found during indexing. Default value is "no".

Example:

ShowEmail no

ShowFTP <yesno>

ShowFTP yesno

Display FTP addresses found during indexing. Default value is "no".

Example:

ShowFTP no

Compress <yesno>

Compress yesno

Request a compressed response from the server (if the server supports this feature). Default value is "yes". Incorrectly compressed pages may cause indexing to fail.

Example:

Compress no

MetaDescription <yesno>

MetaDescription yesno

The parameter defines how the page description is obtained. The description can be displayed in search results with the help of the special symbol %E. Available values are "Yes" and "No"; the default is "No". If "Yes" is used, the system attempts to take the description from the '<META name="description...' tag. If the tag cannot be found, or the value is "No", the description is composed of the first words of the document.

For example:

MetaDescription Yes
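
The choice between the META tag and the first words of the document can be sketched in Python with the standard html.parser module. This is an illustration under stated assumptions, not the indexer's code: the names page_description and _MetaDescription, the limit of 20 words, and the crude tag stripping in the fallback are all hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class _MetaDescription(HTMLParser):
    """Collect the content of the first <meta name="description"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.description: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description" and attributes.get("content"):
            self.description = attributes["content"].strip()


def page_description(html: str, use_meta: bool, max_words: int = 20) -> str:
    """Return the META description if requested and present, else the first words."""
    if use_meta:
        parser = _MetaDescription()
        parser.feed(html)
        if parser.description:
            return parser.description
    # Fallback: strip tags crudely and take the first words of the document.
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split()[:max_words])


page = ('<html><head><meta name="description" content="Frisbee club"></head>'
        '<body><p>Welcome to the club page</p></body></html>')
print(page_description(page, use_meta=True))
print(page_description(page, use_meta=False, max_words=3))
```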

MetaRobots <yesno>

MetaRobots yesno

If the parameter is set to "No", the 'META name="robots"...' tag is ignored; otherwise the tag is checked for the values NOINDEX, NOFOLLOW and NONE. More details can be found in the section The use of "Robots" META-tags. Default value is "Yes".

For example:

MetaRobots No
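
The analysis of the robots META tag can be sketched in Python. The sketch assumes the conventional meaning of the values (NOINDEX forbids indexing the page, NOFOLLOW forbids following its links, NONE forbids both); the names robots_permissions and _RobotsMeta are hypothetical and the code is not taken from the indexer.

```python
from html.parser import HTMLParser
from typing import Set, Tuple


class _RobotsMeta(HTMLParser):
    """Collect the upper-cased values of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.tokens.update(t.strip().upper() for t in content.split(",") if t.strip())


def robots_permissions(html: str, meta_robots: bool = True) -> Tuple[bool, bool]:
    """Return (may_index, may_follow) for a page."""
    if not meta_robots:
        # MetaRobots No: the tag is ignored entirely.
        return True, True
    parser = _RobotsMeta()
    parser.feed(html)
    blocked = "NONE" in parser.tokens
    may_index = not blocked and "NOINDEX" not in parser.tokens
    may_follow = not blocked and "NOFOLLOW" not in parser.tokens
    return may_index, may_follow


print(robots_permissions('<meta name="robots" content="NOINDEX, follow">'))
```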

UseRobotsTxt <yesno>

UseRobotsTxt yesno

If the parameter is set to "Yes", indexing rules are taken from the 'robots.txt' file stored in the web server's root directory. Default value is "No". More information about working with 'robots.txt' is available in the section Search robots exclusion standard. The robot's name is "CNSearch".

For example:

UseRobotsTxt yes
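
The effect of 'robots.txt' rules addressed to the robot name "CNSearch" can be checked with Python's standard urllib.robotparser module. This sketch only illustrates the exclusion standard; it is not the indexer's own parser, and the function name allowed_by_robots and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, agent: str = "CNSearch") -> bool:
    """Return True if the robots.txt text allows the agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


rules = "User-agent: CNSearch\nDisallow: /private/\n\nUser-agent: *\nDisallow:\n"
print(allowed_by_robots(rules, "http://site.com/private/a.html"))
print(allowed_by_robots(rules, "http://site.com/public/a.html"))
```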

ConnectCount <num>

ConnectCount num

The parameter sets the number of remote file requests; the default value is 5.

For example:

ConnectCount 10

4.1.1.2 Working through proxy-server

Starting with version 0.91, the system can work through a proxy server. Four directives were added for this purpose: ProxyServer, ProxyPort, ProxyLogin, and ProxyPassword.

ProxyServer <serv>

ProxyServer server

The parameter specifies the proxy server address. The indexer connects to the proxy server using ProxyPort (see below).

For example:

ProxyServer proxy.domain.ru

ProxyPort <port>

ProxyPort port

The parameter specifies the port of the proxy-server.

For example:

ProxyPort 8080

ProxyLogin <login>

ProxyLogin login

The parameter sets the login for connecting to the proxy server. It is used only if the proxy server requires authorization, and works together with ProxyPassword (see below).

For example:

ProxyLogin alex

ProxyPassword <password>

ProxyPassword password

The parameter sets the password for connecting to the proxy server. It is used only if the proxy server requires authorization.

For example:

ProxyPassword qwerty
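
How the four directives combine into one proxy connection can be sketched with Python's standard urllib module. The sketch is an illustration, not a description of the indexer's internals; the function names proxy_url and build_proxy_opener are hypothetical, and no network request is made.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote


def proxy_url(server: str, port: int,
              login: Optional[str] = None, password: Optional[str] = None) -> str:
    """Combine ProxyServer, ProxyPort, ProxyLogin and ProxyPassword into one URL."""
    credentials = ""
    if login:
        # Login and password matter only when the proxy requires authorization.
        credentials = quote(login, safe="")
        if password:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    return f"http://{credentials}{server}:{port}"


def build_proxy_opener(server: str, port: int,
                       login: Optional[str] = None,
                       password: Optional[str] = None) -> urllib.request.OpenerDirector:
    """Create an opener that sends HTTP requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url(server, port, login, password)})
    return urllib.request.build_opener(handler)


print(proxy_url("proxy.domain.ru", 8080, "alex", "qwerty"))
```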

4.1.2 Morphology Support

To take morphological word forms into account during search, create the file 'lang.cns' and save it in the directory where the index files will be stored. This file is not included in the distribution because of its size (16 Mb).

If the file 'lang.cns' is not found, searching and indexing are performed without taking morphology into account.

The system includes a special utility that builds 'lang.cns' from ispell dictionaries. The necessary dictionaries can be found at http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html. An ispell dictionary consists of two files: a list of words (lang.dict) and a set of word formation rules (lang.aff). In downloaded archives these files may have other names; in that case, rename them to 'lang.dict' and 'lang.aff'.

Note: If the index was built with morphology support, search requests must also be processed with morphology support, using the same dictionary.

4.1.3 Stop-words

Starting with version 1.3, frequently used words (articles, pronouns, prepositions) can be left out of the index to increase search speed and reduce the volume of information stored in the search index. These words are called 'stop-words'.

Stop-words are defined at the indexing stage with the help of the special file containing the list of stop-words. For example:

- file: stopwords.txt ---------------
a
an
is
the
this
-------------------------------------

The name of the file containing stop-words is given in the StopWordsFile option of the indexer configuration file, for example:

StopWordsFile	stopwords.txt

Web-site visitors can be informed about the words ignored in their search phrase with the help of the special symbol "%P", which displays the stop-words.

The label "Stop Words" can be replaced with other text (for example, when translating into another language) by changing the StopWords parameter in the configuration file of the search module (see cnsearch.conf).
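
The handling of stop-words in a search phrase can be sketched in Python. The sketch assumes one stop-word per line in the file, as in the listing in this section, and a case-insensitive comparison; the function names load_stop_words and split_query are hypothetical and the code is not taken from CNSearch.

```python
from typing import List, Set, Tuple


def load_stop_words(text: str) -> Set[str]:
    """Read stop-words from file contents, one word per line."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def split_query(query: str, stop_words: Set[str]) -> Tuple[List[str], List[str]]:
    """Split a search phrase into the words that are searched and those ignored."""
    kept: List[str] = []
    ignored: List[str] = []
    for word in query.split():
        (ignored if word.lower() in stop_words else kept).append(word)
    return kept, ignored


stop_words = load_stop_words("a\nan\nis\nthe\nthis\n")
kept, ignored = split_query("the history of the frisbee", stop_words)
print(kept)
print(ignored)
```

The second list corresponds to the words that would be reported back to the visitor as ignored.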

