User Manual / CNSearch 1.5.1
5 Additional Options
5.1 Search Optimization
It is possible to optimize two parameters:
- Search speed should be optimized if searches are performed often or search results are displayed slowly;
- Search index size should be optimized if the hosting provider limits disk space.
5.1.1 Search Speed Optimization
To optimize the search process, use the following options:
- Use a defragmented index (see Defragmentation);
- Use the Stop-words option (see Stop-words);
- Use the "And" search logic (see Configuration settings): this reduces the number of disk requests, but only if a defragmented index is used;
- Disable morphology support: search is faster without the lang.cns dictionary.
5.1.2 Index Size Optimization
To optimize index size, use the following options:
- Use an abridged index, that is, cancel generation of "fulltxt.cns"; this reduces index size by a factor of 1.5 to 2. The option is set with the Type parameter in the indexer configuration file (see search.conf);
- Disable morphology support;
- Use the Stop-words option.
5.2 Statistics
CNSearch performs full-text search and provides statistical reports for analysis of the site contents and the popularity of its sections.
Statistical data is stored in the 'stats.cns' file, which must be located in the same directory as the search index files. If the CGI script performing a search cannot access this file or does not have permission to create it (the typical case), statistical information is not saved.
You may create the 'stats.cns' file manually and set access rights for it.
Access to statistics data is password-protected. The password is set in the search module configuration file with the Stats parameter (see Configuration settings), for example:
-- cnsearch.conf -----------------------------
::CONFIG
stats = thisispass
::HTMLTOP
<HTML>
<HEAD>
<TITLE>Search results - %Q</TITLE>
</HEAD>
...
-- end of cnsearch.conf ----------------------
To view statistical data, pass the 'stats' parameter together with the password in the search script URL, for example:
http://www.site.com/cgi-bin/search.cgi?stats=1&password=thisispass
or:
http://www.site.com/cgi-bin/search.exe?stats=1&password=thisispass
At present, two reports are available in the system:
- Search Requests
This report shows the search phrases most frequently used by site visitors and the number of results found. It helps to analyze what visitors search for most often. Statistics can be viewed for any period of time.
- Time Distribution of Search Requests
This report shows the distribution of search requests by day; it provides statistics for any period.
If you need other reports, please contact us; they will be included in upcoming versions.
5.3 Plugins
Plugins are special modules that extend the program's functionality. CNSearch uses plugins to index files of different types.
Plugins must be stored in the indexer directory. UNIX and Linux versions have the extension .so, Windows versions have the extension .dll. To disable a plugin, simply move it to another directory.
The plugins included in the current distribution allow indexing files of the following types:
| File name in UNIX/Linux version | File name in Windows version | Type of the processed document |
|---|---|---|
| libtxt.so | libtxt.dll | *.TXT - text files |
| librtf.so | librtf.dll | *.RTF - Rich Text Format files |
| libdoc.so | libdoc.dll | *.DOC - Microsoft Word files |
| libxls.so | libxls.dll | *.XLS - Microsoft Excel files |
| libmp3.so | libmp3.dll | *.MP3 - MPEG Layer 3 audio-files |
Plugins of version 0.92 do not define character sets, because this is not necessary for most files.
The 'encoding' field of documents processed by plugins is replaced by the text set in the plugin, which makes it possible to compose templates that display the type of the found document.
During start-up the indexer launches all active plugins, for example:
F:\1\bin\indexer>searchctl.exe localhost
CNSearch ver.0.92 [build 2073]
Compiled 07.04.2002 under MS Windows 2000 [Version 5.00.2195]
Rebuilding URL list...Ok.
Loading library: RTF (Rich text format)
Loading library: TXT (Plain text)
Loading library: DOC (Microsoft Word document format)
http://www.test.ru/
The main advantage of plugins is that new ones can be developed to index files of specific formats (for example, images).
5.3.1 Plugin Creation
To create a plugin, unpack 'plugin.zip', located in the '/manual' folder of the distribution kit. This file contains the source code of a plugin that processes text files.
To work correctly, a plugin must have the correct file extension and provide the following set of functions:
| Name of a function | Description of a function |
|---|---|
| char *get_info(void) | The function returns a string - information about plugin (its name) |
| char *get_mime(void) | The function returns a string - list of MIME types processed by this plugin, separated by a vertical bar "\|" |
| char* get_shortdesc(void) | The function returns a string - short name of a file type |
| char* get_range(void) | The function returns a string - field "Range" of HTTP header (see RFC2068); if field "Range" is not used, the function returns NULL. |
| char* get_title(void) | The function returns a string - document title. If the value is NULL, the URL of the document is displayed. |
| TPluginWord* get_word(unsigned char *d, unsigned long filesize) | The main function. It returns a pointer to a 'TPluginWord' structure containing a word that should be added to the search index. This function must return the words contained in a document one after another, in series. |
The TPluginWord structure looks as follows:
typedef struct {
    char word[32];
    int rel;
    bool end;
} TPluginWord;
where:
- word - a word padded on the right with null bytes (0x00). Thus, the maximum length of a word is 32 symbols.
- rel - word relevancy; may range from 1 to 256. Recommended values are 1 to 4. In the example plugin each word has a relevancy of 1, except for words written in CAPITAL LETTERS, whose relevancy is 2.
- end - is 'true' if there are no more words in the document. In this case 'word' and 'rel' are ignored.
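The field rules above can be sketched in C as follows. This is only an illustration under stated assumptions: `fill_word` is a hypothetical helper, not a function from 'plugin.zip', and the "all capitals gets relevancy 2" rule is taken from the description of the example plugin rather than from its source code.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    char word[32];
    int  rel;
    bool end;
} TPluginWord;

/* Hypothetical helper (not part of the CNSearch API): stores one token in
 * a TPluginWord and pads the unused bytes of 'word' with 0x00. */
void fill_word(TPluginWord *out, const char *token, size_t len)
{
    bool has_alpha = false, all_upper = true;
    size_t i;

    if (len > sizeof out->word)
        len = sizeof out->word;             /* truncate to 32 bytes */

    memset(out->word, 0, sizeof out->word); /* null padding on the right */
    memcpy(out->word, token, len);

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)token[i];
        if (isalpha(c)) {
            has_alpha = true;
            if (!isupper(c))
                all_upper = false;
        }
    }
    /* Assumed rule: words in CAPITAL LETTERS get 2, all others get 1. */
    out->rel = (has_alpha && all_upper) ? 2 : 1;
    out->end = false;
}
```

A word of exactly 32 symbols fills the whole array and leaves no terminating null byte, so code reading 'word' should not assume it is a C string.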
The system calls the plugin functions as follows:
- get_info(), get_mime(), and get_shortdesc() are called once when the plugin is loaded;
- get_title() is called once for each document; afterwards get_word() is called for that document repeatedly until the 'end' field of the TPluginWord structure becomes 'true'.
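The get_word() part of this call sequence might look like the sketch below. It is a toy whitespace tokenizer, not the code shipped in 'plugin.zip'. How a real plugin remembers its position between calls is not described in this manual, so the static cursor is an assumption, and `count_words` is a hypothetical stand-in for the indexer's loop.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    char word[32];
    int  rel;
    bool end;
} TPluginWord;

/* Toy plugin function with the documented signature. ASSUMPTION: the
 * position is kept in a static cursor that restarts when the buffer
 * pointer changes or after the end of a document has been reported. */
TPluginWord *get_word(unsigned char *d, unsigned long filesize)
{
    static TPluginWord result;
    static unsigned char *current = NULL;
    static unsigned long pos = 0;
    unsigned long start, len;

    if (d != current) {          /* new document: restart the scan */
        current = d;
        pos = 0;
    }
    while (pos < filesize && isspace(d[pos]))
        pos++;                   /* skip separators */

    memset(&result, 0, sizeof result);
    if (pos >= filesize) {
        result.end = true;       /* no more words; word and rel are ignored */
        current = NULL;
        return &result;
    }
    start = pos;
    while (pos < filesize && !isspace(d[pos]))
        pos++;
    len = pos - start;
    if (len > sizeof result.word)
        len = sizeof result.word;
    memcpy(result.word, d + start, len);
    result.rel = 1;
    return &result;
}

/* Hypothetical host-side loop: calls get_word() until 'end' is true and
 * returns how many words the plugin produced. */
unsigned long count_words(unsigned char *doc, unsigned long filesize)
{
    unsigned long n = 0;
    for (;;) {
        TPluginWord *w = get_word(doc, filesize);
        if (w->end)
            break;
        n++;                     /* a real indexer would store w->word here */
    }
    return n;
}
```

Returning a pointer to a static structure keeps the sketch short, but it means each call overwrites the previous word, so the caller has to copy anything it wants to keep.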
5.4 Search Robots
5.4.1 Introduction
Search robots are programs that index web documents on the Internet.
In 1993-94 it became clear that search robots often indexed documents against the will of web-site owners. Sometimes robots interfered with ordinary users, and the same files were indexed several times. In some cases robots indexed the wrong documents: deep virtual directories, temporary information, or CGI scripts. The Exclusion Standard was designed to solve such problems.
5.4.2 Function
To solve this problem, create a file that tells robots which parts of a web server, or the whole server, they must not request. This file must be located in the server's root directory.
This approach lets a robot find the rules describing its required behavior by requesting a single file. A file named '/robots.txt' can easily be created on any existing web server.
The choice of such particular name is dictated by several circumstances:
- Name of file must be the same for any operating system;
- File extension should not require any server re-configuration;
- Name of file should be easy to remember and descriptive;
- Possibility of coincidence with existing files should be minimal.
5.4.3 Structure
The structure and semantics of '/robots.txt' are as follows:
The file must contain one or more records separated by one or more blank lines (lines end with CR, CR/NL, or NL). Each record consists of lines of the form "<field>:<optional_space><value><optional_space>".
The <field> name is case-insensitive.
Comments may be included in the usual UNIX way: the '#' character starts a comment, and the end of the line ends it.
A record should start with one or more 'User-Agent' lines followed by one or more 'Disallow' lines. Unrecognized lines are ignored.
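As an illustration of the line format just described, the following C sketch splits one line into field and value, strips a trailing comment, and compares field names without regard to case. Both function names are hypothetical; this is not CNSearch's actual parser, and leading whitespace before the field name is not handled.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: splits one line of the form
 * "<field>:<optional_space><value><optional_space>" in place.
 * A '#' starts a comment. Returns false for blank, comment-only,
 * or malformed lines. */
bool parse_robots_line(char *line, char **field, char **value)
{
    char *colon, *end;

    line[strcspn(line, "#\r\n")] = '\0';    /* drop comment and line ending */

    colon = strchr(line, ':');
    if (colon == NULL || colon == line)
        return false;
    *colon = '\0';
    *field = line;

    *value = colon + 1;
    while (isspace((unsigned char)**value))
        (*value)++;                          /* optional leading space */

    end = *value + strlen(*value);
    while (end > *value && isspace((unsigned char)end[-1]))
        end--;                               /* optional trailing space */
    *end = '\0';
    return true;
}

/* Field names are case-insensitive, so compare them ignoring case. */
bool field_is(const char *field, const char *name)
{
    while (*field && *name) {
        if (tolower((unsigned char)*field) != tolower((unsigned char)*name))
            return false;
        field++;
        name++;
    }
    return *field == '\0' && *name == '\0';
}
```

The function modifies the buffer it is given, so the caller should pass a writable copy of each line.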
User-Agent:
- The value of this field must be the name of a search robot. The record sets access rights for that robot;
- Although the standard allows naming several robots, CNSearch recognizes only one, because the method of separating robot names described in the standard is not implemented in the system;
- Upper-case and lower-case letters are treated as equal;
- If the value of this field is '*', the access rights set in the record apply to any search robot that requests the '/robots.txt' file.
Disallow:
- The value of this field must be a full or partial URL path that should not be indexed. For example, 'Disallow: /help' denies access to both '/help.html' and '/help/index.html', while 'Disallow: /help/' denies access to '/help/index.html' but not to '/help.html'.
- Any record must contain at least one 'User-Agent' line and one 'Disallow' line.
If '/robots.txt' is empty, does not conform to the structure and semantics described above, or is missing, search robots act according to their own settings.
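The 'Disallow' rule amounts to a prefix comparison, which a few lines of C can demonstrate. The function name is hypothetical and the sketch only mirrors the behavior described above; it is not taken from the CNSearch source.

```c
#include <stdbool.h>
#include <string.h>

/* Minimal sketch of the Disallow rule: a path is blocked when it starts
 * with the Disallow value. An empty value blocks nothing. */
bool is_disallowed(const char *path, const char *disallow)
{
    size_t n = strlen(disallow);
    if (n == 0)
        return false;
    return strncmp(path, disallow, n) == 0;
}
```

With this rule, is_disallowed("/help.html", "/help") is true, while is_disallowed("/help.html", "/help/") is false, which matches the '/help' versus '/help/' example.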
5.4.4 Examples
Example 1:
# robots.txt for http://www.site.com
User-Agent: *
# this is an infinite virtual URL space
Disallow: /cyberworld/map/
Disallow: /tmp/ # these will soon disappear
The contents of '/cyberworld/map/' and '/tmp/' are protected in this example.
Example 2:
# robots.txt for http://www.site.com
User-Agent: *
# this is an infinite virtual URL space
Disallow: /cyberworld/map/

# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:
In this example the search robot 'cybermapper' is granted full access, while all other robots have no access to the contents of '/cyberworld/map/'.
Example 3:
# robots.txt for http://www.site.com
User-Agent: *
Disallow: /
Access to the server is denied to any search robot in this example.
5.5 The Use of "Robots" META-tags
Besides the Exclusion Standard described above, search robot behavior can also be managed with the 'META' HTML tag.
Unlike the 'robots.txt' file, which describes indexing of the site as a whole, the 'META' tag manages indexing of a single web page. In addition, it can cancel indexing not only of the document itself but also of its links.
Indexing parameters are defined in the 'content' field of the tag in the source code of each page. The following parameters can be used:
- NOINDEX - cancel document indexing;
- NOFOLLOW - cancel indexing of links contained in the document;
- INDEX - allow document indexing;
- FOLLOW - allow indexing of links contained in the document;
- ALL - equal to INDEX, FOLLOW;
- NONE - equal to NOINDEX, NOFOLLOW.
Default value: <meta name="Robots" content="ALL">.
Note: Values should not be separated by a comma.
Incorrect variant:
<META name="ROBOTS" content="noindex, nofollow">
Correct variant:
<META name="ROBOTS" content="none">
In this example the indexer indexes the document itself but does not index its links:
<META name="ROBOTS" content="nofollow">
The tag name, as well as the names and values of its fields, are case-insensitive. In fact, the indexer checks for only three values: NOINDEX, NOFOLLOW, and NONE, because INDEX and FOLLOW are the defaults.
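The check for these three values could be sketched in C as shown below. The type and function names are hypothetical. The sketch assumes the whole 'content' value is compared as a single word, which is one possible reading of why comma-separated values are listed as incorrect; the real indexer may behave differently.

```c
#include <ctype.h>
#include <stdbool.h>

typedef struct {
    bool index;   /* may the document be indexed? */
    bool follow;  /* may its links be indexed? */
} RobotsMeta;

/* Case-insensitive comparison of two whole strings. */
static bool equals_ci(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return false;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Hypothetical helper: maps the 'content' value to indexing flags.
 * Only NOINDEX, NOFOLLOW and NONE change the default (ALL). */
RobotsMeta parse_robots_meta(const char *content)
{
    RobotsMeta m = { true, true };
    if (equals_ci(content, "noindex"))
        m.index = false;
    else if (equals_ci(content, "nofollow"))
        m.follow = false;
    else if (equals_ci(content, "none"))
        m.index = m.follow = false;
    return m;
}
```

Any other value, including INDEX, FOLLOW, and ALL, leaves both flags at their defaults.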