User Manual / CNSearch 1.5.1
All indexer settings are contained in search.conf. This file has the following structure:
[Job name_of_task]
[Index]
Parameter1 Value1
Parameter2 Value2
Parameter3 Value3
[Index]
Parameter1 Value1
Parameter2 Value2
Parameter3 Value3
Parameters and their values are set for each action; a parameter and its value are separated by spaces or tabs.
Note: The configuration file may contain single-line comments. Each comment starts with the "#" symbol.
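The layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not the indexer's own parser: the function name parse_search_conf is hypothetical, and it assumes one "Parameter Value" pair per line and that a comment occupies a whole line.

```python
def parse_search_conf(text):
    """Parse text in the [Job]/[Index] layout into a list of jobs.

    Assumptions (not confirmed by the manual): one parameter per line,
    and a comment line has "#" as its first non-blank character.
    """
    jobs = []
    current_index = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and single-line comments
        if line.startswith("[") and line.endswith("]"):
            header = line[1:-1].split(None, 1)
            if header and header[0] == "Job":
                name = header[1] if len(header) > 1 else ""
                jobs.append({"name": name, "indexes": []})
                current_index = None
            elif header and header[0] == "Index" and jobs:
                current_index = {}
                jobs[-1]["indexes"].append(current_index)
            else:
                raise ValueError("unexpected section: " + line)
            continue
        if current_index is None:
            raise ValueError("parameter outside of [Index]: " + line)
        # Parameter and value are separated by spaces or tabs.
        parts = line.split(None, 1)
        current_index[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return jobs
```

Each job is returned as a dictionary with its name and a list of [Index] sections, each section being a dictionary of parameters.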
Parameters Index
Address starting with 'http://...' in HTTP indexing mode, or path to the site copy in local disk mode.
For a local disk (Windows):
For a local disk (Unix):
The parameter sets a list of extensions of files to be indexed; it can be used in local disk mode only, and is ignored in HTTP indexing mode. Extensions are separated by "," (comma).
The parameter sets type of the search index:
- Abridged - an index file of a smaller size, which does not allow showing part of the text containing highlighted search words. (See Search module)
Default value - normal
The parameter defines a path to the directory containing index and log-files.
The parameter sets the method used to identify the character encoding of a document. The following values are possible:
- ByMetaTag - identifies character set by means of META tag (default);
- ByHTTPHeader - identifies character set by HTTP header; in case the identification cannot be carried out by HTTP header, the system attempts to define it with the help of META tag. If both variants fail, the system assumes that a document has Windows-1251 character set;
- win-1251 - does not identify character set: win-1251 is default.
- koi8-r - does not identify character set: koi8-r is default.
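The fallback order described for ByHTTPHeader (HTTP header, then META tag, then Windows-1251) can be sketched as follows. This is an illustration in Python, not the indexer's code; the function names and the regular expressions are assumptions, and the behaviour of ByMetaTag when no META tag is present is not stated in this manual, so the sketch returns None in that case.

```python
import re

# Hypothetical patterns; real pages may need more careful parsing.
META_CHARSET_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
HEADER_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_meta(html_bytes):
    """Return the charset named in a META tag, or None."""
    match = META_CHARSET_RE.search(html_bytes)
    return match.group(1).decode("ascii").lower() if match else None


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = HEADER_CHARSET_RE.search(content_type or "")
    return match.group(1).lower() if match else None


def identify_charset(method, content_type, html_bytes):
    """Apply one of the identification methods listed above."""
    if method == "ByHTTPHeader":
        # Header first, then META tag, then the Windows-1251 fallback.
        return (charset_from_header(content_type)
                or charset_from_meta(html_bytes)
                or "windows-1251")
    if method == "ByMetaTag":
        # Fallback when the tag is missing is not documented: return None.
        return charset_from_meta(html_bytes)
    # win-1251 / koi8-r: no identification, the configured value is used.
    return method
```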
The parameter sets maximum number of files to be indexed (10000 by default). Be careful: many web-servers contain a huge number of looped links.
The parameter sets the minimum number of words in an indexed document. Documents with fewer words will not be added to the search index. This parameter improves the quality of search results by filtering out small and insignificant documents. Default value is 1.
The parameter sets the method of saving reports which are generated at the end of indexing process and are saved to stats.log. Available options:
- No - do not save report.
- Append - append to existing file (by default).
- Overwrite - replace existing file.
The parameter sets a list of words to be excluded from the indexing. Addresses containing at least one of excluded words are not included in indexing queue as well. Words are separated by "," (comma).
The parameter sets a list of variables to be excluded from the site URLs. Variables are separated by "," (comma).
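Excluding variables from a URL can be pictured as removing the matching query-string fields before the address is queued. The sketch below uses Python's standard urllib.parse; the function names and the variable name PHPSESSID are illustrative assumptions, not taken from the indexer.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def parse_variable_list(value):
    """Split a comma-separated parameter value into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def strip_variables(url, excluded):
    """Return url with the excluded query-string variables removed."""
    parts = urlsplit(url)
    kept = [(key, val)
            for key, val in parse_qsl(parts.query, keep_blank_values=True)
            if key not in excluded]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Stripping session identifiers in this way keeps the same page from entering the queue under many different addresses.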
The parameter sets indexing method and can be used in HTTP indexing mode only. The following values are available:
- Page - only current page is indexed;
- SubPages - all pages, which contain address of the starting page in their URL;
- Server - the whole server is indexed.
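The three values can be read as a test applied to every link found during indexing. The sketch below is one possible interpretation in Python; the function name in_scope is hypothetical, and treating "Server" as "same host name" is an assumption.

```python
from urllib.parse import urlsplit


def in_scope(start_url, candidate_url, mode):
    """Decide whether candidate_url should be indexed for the given mode."""
    if mode == "Page":
        # Only the starting page itself.
        return candidate_url == start_url
    if mode == "SubPages":
        # Pages whose URL contains the address of the starting page.
        return start_url in candidate_url
    if mode == "Server":
        # Assumption: "the whole server" means the same host name.
        start_host = urlsplit(start_url).netloc.lower()
        return urlsplit(candidate_url).netloc.lower() == start_host
    raise ValueError("unknown indexing method: " + mode)
```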
The parameter sets the name of the file containing stop-words (see Stop-words).
The parameter sets the language. If this parameter is specified, the 'Accept-Language' field is included in the HTTP request header. This may affect the document contents on some sites.
The parameter sets a substring which will be replaced in URL by the string specified in the parameter ATo.
AFrom /home/dir/mysite/ ATo http://search.codenet.ru/
The parameter sets a substring which will replace AFrom in the URL; it is used together with AFrom.
AFrom http://127.0.0.1/ ATo http://www.codenet.ru/
AFrom c:/documents/www/www.codenet.ru/ ATo http://www.codenet.ru/
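The effect of the AFrom and ATo pair is a plain substring replacement in the stored address. A minimal sketch in Python, assuming only the first occurrence is replaced (the manual does not say how repeated occurrences are handled); the function name is hypothetical:

```python
def rewrite_address(address, a_from, a_to):
    """Replace the AFrom substring of an address with the ATo string."""
    # Assumption: only the first occurrence is replaced.
    return address.replace(a_from, a_to, 1)
```

With the first example above, a file indexed as /home/dir/mysite/docs/index.html would appear in search results as http://search.codenet.ru/docs/index.html.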
The parameter sets a word to start the indexing from. Page description will be composed of words following the starting one. Hence, it is possible to exclude menus and the like from description.
The parameter defines the delay, in seconds, between indexing successive pages of the site.
Display the page addresses during indexing. Default value is "yes".
Display found e-mail addresses (mailto) during indexing. Default value is "no".
Display found FTP-addresses during indexing. Default value is "no".
Request a compressed response from the server (if the server supports this feature). Default value is "yes". Incorrect page compression by the server may cause indexing to fail.
The parameter defines the page description method. The description can be displayed in search results with the help of the special symbol %E. Available values are "Yes" and "No"; the default is "No". If "Yes" is used, the system attempts to get the description from the '<META name="description...' tag. If the tag cannot be found, or the value is "No", the description is composed of the first words of the document.
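The choice between the META description and the first words of the document can be sketched with Python's standard html.parser. This is an illustration only: the names describe and max_words are hypothetical, and the number of words the indexer actually takes is not stated in this manual.

```python
from html.parser import HTMLParser


class _DescriptionParser(HTMLParser):
    """Collect the META description and the visible words of a page."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.words = []
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and self.meta_description is None:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "description":
                self.meta_description = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def describe(html, use_meta, max_words=20):
    """Return a page description; max_words is an assumed limit."""
    parser = _DescriptionParser()
    parser.feed(html)
    parser.close()
    if use_meta and parser.meta_description:
        return parser.meta_description
    # Tag missing, or the option is "No": use the first words of the page.
    return " ".join(parser.words[:max_words])
```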
If the parameter has the value "No", the tag 'META name="robots"...' is ignored; otherwise the tag is analyzed for the presence of NOINDEX, NOFOLLOW and NONE. More details can be found in the section The use of "Robots" META-tags. Default value is "Yes".
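How the three values might be interpreted can be sketched as follows. The code follows the common convention that NONE is equivalent to NOINDEX plus NOFOLLOW; the function name is hypothetical and the indexer's exact handling is described in the section referenced above.

```python
def robots_meta_flags(content):
    """Interpret the content attribute of a META robots tag.

    Returns a dict telling whether the page may be indexed and whether
    its links may be followed.
    """
    tokens = {token.strip().upper() for token in content.split(",")}
    none = "NONE" in tokens  # conventionally NOINDEX and NOFOLLOW together
    return {
        "index": not (none or "NOINDEX" in tokens),
        "follow": not (none or "NOFOLLOW" in tokens),
    }
```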
If the parameter is set to "Yes", indexing rules are taken from the file 'robots.txt' stored in the web-server root directory. Default value is "No". More information about working with 'robots.txt' is available in the section Search robots exclusion standard. The robot's name is "CNSearch".
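To check in advance how a given 'robots.txt' would apply to a robot named "CNSearch", the rules can be evaluated with Python's standard urllib.robotparser. This uses Python's parser, not the indexer's own implementation, so edge cases may differ; the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_index(rules, url, agent="CNSearch"):
    """Return True if the rules allow the given robot to fetch url."""
    return rules.can_fetch(agent, url)
```

A record addressed to "User-agent: CNSearch" takes precedence over the "User-agent: *" record for this robot.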
The parameter sets the number of requests made for a remote file; the default value is 5.
Working through proxy-server
The parameter specifies the proxy-server address. The indexer connects to the proxy-server using ProxyPort (see later).
The parameter specifies the port of the proxy-server.
The parameter sets the login for connection to the proxy-server; it is used only if the proxy-server requires authorization, and works together with ProxyPassword (see later).
The parameter sets the password for connection to the proxy-server; it is used only if the proxy-server requires authorization.
4.1.2 Morphology Support
To take morphological forms into account during search, create the file 'lang.cns' and save it in the directory where the index files will be stored. This file is not included in the distribution because of its size (16 Mb).
If file 'lang.cns' is not found, the search and indexing process will be performed without taking morphology into account.
The system includes a utility that builds 'lang.cns' from ispell dictionaries. The necessary dictionaries can be found at http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html. An ispell dictionary consists of two files: a list of words (lang.dict) and a set of word formation rules (lang.aff). In downloaded archives these files may have other names; in that case they should be renamed to 'lang.dict' and 'lang.aff'.
Starting with version 1.3, frequently used words (articles, pronouns, prepositions) can be left out of the index to increase search speed and reduce the volume of information stored in the search index. These words are called 'stop-words'.
Stop-words are defined at the indexing stage with the help of the special file containing the list of stop-words. For example:
----- file: stopwords.txt -----
a
an
is
the
this
-------------------------------
Name of the file containing stop-words is indicated in the Indexer configuration file in the option StopWordsFile, for example:
Site visitors can be informed which words in their search phrase were ignored by means of the special symbol "%P", which displays the stop-words found in the query.
The label "Stop Words" may be changed to some other text (for example, when translating into another language) by changing the parameter StopWords in the configuration file of the search module (see cnsearch.conf).
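The way stop-words separate a search phrase into searched and ignored words can be sketched as follows. This is a Python illustration, not the search module's code; the function names are hypothetical, and the sketch assumes the stop-words file holds words separated by whitespace or line breaks.

```python
def load_stop_words(text):
    """Build a set of stop-words from the contents of a stop-words file.

    Assumption: words are separated by whitespace or line breaks.
    """
    return {word.lower() for word in text.split()}


def split_query(query, stop_words):
    """Split a search phrase into (kept words, ignored stop-words)."""
    kept, ignored = [], []
    for word in query.split():
        if word.lower() in stop_words:
            ignored.append(word)
        else:
            kept.append(word)
    return kept, ignored
```

The second list corresponds to the words that would be reported to the visitor as ignored.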