
CNSearch

The search engine for web sites

Current version: CNSearch 2.0.1

User Manual / CNSearch 1.5.1

5 Additional Options

5.1 Search Optimization

Two parameters can be optimized:

5.1.1 Search Speed Optimization

To optimize the search process, one should use the following options:

5.1.2 Index Size Optimization

To optimize the index size, one should use the following options:

5.2 Statistics

CNSearch performs full-text search and provides statistical reports for analyzing the site's content and the popularity of its sections.

Statistical data is stored in the 'stats.cns' file, which must be located in the same directory as the search index files. If the CGI script performing the search cannot access this file or does not have permission to create it (a typical situation), statistical information will not be saved.

You can create the 'stats.cns' file manually and set the appropriate access rights for it.

Access to the statistics is password-protected. The password is set in the search module configuration file with the 'stats' parameter (see Configuration settings), for example:

-- cnsearch.conf -----------------------------
::CONFIG stats = thisispass

::HTMLTOP
<HTML>
<HEAD>
<TITLE>Search results - %Q</TITLE>
</HEAD>
...
-- end of cnsearch.conf ----------------------

To view the statistical data, one should add the 'stats' parameter to the site URL, for example:

http://www.site.com/cgi-bin/search.cgi?stats=1&password=thisispass

or:

http://www.site.com/cgi-bin/search.exe?stats=1&password=thisispass

At present, two reports are available in the system.

The first report shows the search phrases most frequently used by site visitors and the number of results found for each. It helps to analyze what visitors search for most often. Statistics can be viewed for any period of time.

The second report shows the distribution of search requests by day; it also provides statistics for any period.

If you need other reports, please contact us; they will be included in upcoming versions.

5.3 Plugins

Plugins are separate modules that extend the program's functionality. CNSearch uses plugins to index files of different types.

Plugins must be stored in the indexer directory. In the UNIX and Linux versions they have the extension .so; in the Windows versions, .dll. To disable a plugin, simply move it to another directory.

The current distribution includes plugins for indexing files of the following types:

File name (UNIX/Linux)   File name (Windows)   Type of the processed document
libtxt.so                libtxt.dll            *.TXT - plain text files
librtf.so                librtf.dll            *.RTF - Rich Text Format files
libdoc.so                libdoc.dll            *.DOC - Microsoft Word files
libxls.so                libxls.dll            *.XLS - Microsoft Excel files
libmp3.so                libmp3.dll            *.MP3 - MPEG Layer 3 audio files

The version 0.92 plugins do not detect character sets, because this is unnecessary for most files.

The 'encoding' field of documents processed by plugins is replaced by the text set in the plugin, which makes it possible to compose templates that display the type of the found document.

During start-up the indexer loads all active plugins, for example:

F:\1\bin\indexer>searchctl.exe localhost

CNSearch ver.0.92 [build 2073]
Compiled 07.04.2002 under MS Windows 2000 [Version 5.00.2195]

Rebuilding URL list...Ok.
Loading library: RTF (Rich text format)
Loading library: TXT (Plain text)
Loading library: DOC (Microsoft Word document format)
http://www.test.ru/

The main advantage of the plugin mechanism is the possibility of developing new plugins to index files of specific formats (for example, images).

5.3.1 Plugin Creation

To create a plugin, unpack 'plugin.zip', located in the '/manual' folder of the distribution kit. The archive contains the source code of a plugin that processes text files.

To function correctly, a plugin must have the correct file extension and export the following set of functions:

char *get_info(void)
    Returns a string with information about the plugin (its name).

char *get_mime(void)
    Returns a string with the list of MIME types processed by this plugin, separated by the vertical bar "|".

char *get_shortdesc(void)
    Returns a string with the short name of the file type.

char *get_range(void)
    Returns a string with the "Range" field of the HTTP header (see RFC 2068); if the "Range" field is not used, the function returns NULL.

char *get_title(void)
    Returns a string with the document title; if the value is NULL, the URL of the document is displayed instead.

TPluginWord *get_word(unsigned char *d, unsigned long filesize)
    The main function; returns a pointer to a TPluginWord structure containing a word that should be added to the search index. The function must return the words contained in the document one by one, in order.

  • d - pointer to the document being indexed; the document is terminated with a zero byte (\0);
  • filesize - size of the document being indexed; it is used when the document itself contains zero bytes (for example, a Microsoft Word document).

The TPluginWord structure is defined as follows:

typedef struct {
    char word[32];
    int rel;
    bool end;
} TPluginWord;

where:

  • word - the word to be added to the search index (at most 31 characters plus the terminating zero);
  • rel - the relevance (weight) of the word;
  • end - a flag marking the last word of the document.

Methods used by the system for calling the plugin functions:
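As an illustration, the exported functions listed above can be sketched as a minimal plain-text plugin. Note that the tokenization rules and the exact iteration contract of get_word are assumptions made for this sketch, not the verified plugin SDK:

```c
/* Minimal sketch of a plain-text CNSearch plugin.
   ASSUMPTIONS: the tokenisation rules and the iteration contract of
   get_word() are illustrative guesses based on the function table
   above, not the verified plugin SDK. */
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    char word[32];
    int rel;
    bool end;
} TPluginWord;

char *get_info(void)      { return "TXT (Plain text)"; }
char *get_mime(void)      { return "text/plain"; }
char *get_shortdesc(void) { return "Text file"; }
char *get_range(void)     { return NULL; }  /* the "Range" field is not used */
char *get_title(void)     { return NULL; }  /* let the document URL be shown */

/* Returns the next word of the document on every call and sets 'end'
   on the last word.  (A real plugin would need per-document state;
   the static position here is a simplification.) */
TPluginWord *get_word(unsigned char *d, unsigned long filesize)
{
    static TPluginWord w;
    static unsigned long pos = 0;
    unsigned long i = 0;

    while (pos < filesize && !isalnum(d[pos]))        /* skip separators */
        pos++;
    while (pos < filesize && isalnum(d[pos]) && i < sizeof(w.word) - 1)
        w.word[i++] = (char)d[pos++];
    w.word[i] = '\0';
    w.rel = 1;                    /* uniform relevance for plain text */
    w.end = (pos >= filesize);    /* no more words after this one */
    return i ? &w : NULL;
}
```

Compiled as a shared library (.so on UNIX/Linux, .dll on Windows) and placed in the indexer directory, such a plugin would be picked up at start-up as shown in the indexer log above.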

5.4 Search Robots

5.4.1 Introduction

Search robots are programs that index web documents on the Internet.

In 1993-94 it was discovered that search robots often indexed documents against the will of web-site owners. Robots sometimes interfered with ordinary users, and the same files could be indexed several times. In some cases robots indexed the wrong documents: deep virtual directories, temporary information, or CGI scripts. The Exclusions Standard was designed to solve these problems.

5.4.2 Function

To solve this problem, a file is created that tells robots how to behave and which parts of the web server they must not request. This file must be located in the root directory of the server.

This solution allows a robot to discover the rules describing its required behavior by requesting a single file. A file named '/robots.txt' can easily be created on any existing web server.

The choice of this particular name is dictated by several circumstances:

5.4.3 Structure

The structure and semantics of '/robots.txt' are as follows:

The file must contain one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record consists of lines of the form: "<field>:<optional_space><value><optional_space>".

The <field> name is case-insensitive.

Comments may be included in the usual UNIX way: the '#' symbol marks the start of a comment, and the end of the line marks its end.

A record should start with one or more 'User-Agent' lines, followed by one or more 'Disallow' lines. Unrecognized lines are ignored.

User-Agent: the value of this field is the name of the robot the record applies to; the value '*' matches any robot that is not matched by another record.

Disallow: the value of this field specifies a partial URL that must not be visited; an empty value means that any URL may be retrieved.

If '/robots.txt' is empty, does not conform to the structure and semantics described above, or is missing, search robots act according to their own settings.
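The line format described above can be sketched in code. The helper below is hypothetical (it is not part of CNSearch) and simply applies the stated rules: '#' starts a comment, the field name is case-insensitive, and optional spaces around the value are ignored:

```c
/* Sketch of parsing a single robots.txt line into <field> and <value>.
   parse_robots_line() is a hypothetical helper, not part of CNSearch. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills field/value on success; 0 for blank or
   comment-only lines. */
int parse_robots_line(const char *line, char *field, size_t flen,
                      char *value, size_t vlen)
{
    char buf[512];
    snprintf(buf, sizeof(buf), "%s", line);

    char *hash = strchr(buf, '#');            /* strip the comment */
    if (hash) *hash = '\0';

    char *colon = strchr(buf, ':');
    if (!colon) return 0;
    *colon = '\0';

    size_t i;                      /* field name: lower-cased, since it is
                                      case-insensitive */
    for (i = 0; buf[i] && i < flen - 1; i++)
        field[i] = (char)tolower((unsigned char)buf[i]);
    field[i] = '\0';

    char *v = colon + 1;                      /* trim optional spaces */
    while (*v == ' ' || *v == '\t') v++;
    char *end = v + strlen(v);
    while (end > v && strchr(" \t\r\n", end[-1]))
        end--;
    *end = '\0';
    snprintf(value, vlen, "%s", v);
    return 1;
}
```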

5.4.4 Examples

Example 1:

# robots.txt for http://www.site.com
User-Agent: *
# this is an infinite virtual URL space
Disallow: /cyberworld/map/ 
Disallow: /tmp/ # these will soon disappear

The contents of '/cyberworld/map/' and '/tmp/' are protected in this example.

Example 2:

# robots.txt for http://www.site.com
User-Agent: *
# this is an infinite virtual URL space
Disallow: /cyberworld/map/
# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:

In this example the search robot 'cybermapper' is granted full access, while all other robots are denied access to the contents of '/cyberworld/map/'.

Example 3:

# robots.txt for http://www.site.com
User-Agent: *
Disallow: /

Access to the server is denied to any search robot in this example.

5.5 The Use of "Robots" META-tags

Besides the Exclusions Standard described above, it is also possible to control search robot behavior by means of the 'META' HTML tag.

Unlike the 'robots.txt' file, which describes indexing of the site as a whole, the 'META' tag controls indexing of an individual web page. Moreover, it can exclude from indexing not only the document itself but also the documents it links to.

Indexing parameters are defined in the 'content' attribute of the tag in the source code of each page. The following parameters can be used:

Default value: <meta name="Robots" content="ALL">.

Note: Values should not be separated by a comma.

Incorrect variant:

<META name="ROBOTS" content="noindex, nofollow">

Correct variant:

<META name="ROBOTS" content="none">

In the following example the indexer is allowed to index the document but not to follow its links:

<META name="ROBOTS" content="nofollow">

The name of the tag, as well as the names and values of its fields, is not case-sensitive. In fact, the indexer checks for only three values: NOINDEX, NOFOLLOW, and NONE, because FOLLOW and INDEX are the default values.
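The check described above can be sketched as follows. The helper names robots_meta and eq_ci are hypothetical, and the treatment of the three values as exact (non-comma-separated) matches is an assumption based on the note about commas:

```c
/* Sketch of the check described above: only the three exact values
   NOINDEX, NOFOLLOW and NONE are recognised, case-insensitively, and
   comma-separated lists are NOT understood (hence the "none" form).
   robots_meta() and eq_ci() are hypothetical helpers, not CNSearch API. */
#include <ctype.h>
#include <stdbool.h>

/* Case-insensitive string equality. */
static bool eq_ci(const char *a, const char *b)
{
    while (*a && *b &&
           tolower((unsigned char)*a) == tolower((unsigned char)*b)) {
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Derive the indexing decision from the META "content" value. */
void robots_meta(const char *content, bool *do_index, bool *do_follow)
{
    *do_index = true;                 /* INDEX and FOLLOW are the defaults */
    *do_follow = true;
    if (eq_ci(content, "none")) {
        *do_index = false;
        *do_follow = false;
    } else if (eq_ci(content, "noindex")) {
        *do_index = false;
    } else if (eq_ci(content, "nofollow")) {
        *do_follow = false;
    }
}
```

Note that under this model the incorrect variant "noindex, nofollow" matches none of the three values, so both defaults stay in effect, which is why the single value "none" must be used instead.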

