
 

Software For Full Text Information Retrieval:

Home


Home | Installation | Usage | FAQs | Utilities | Architecture | QA | Mailing List | License | Author | Download | Thanks




Rel is a suite of programs and tools for building wide area full text information retrieval systems. The search mechanisms are capable of sorting documents by relevance to keyword search criteria. Boolean operations (and, or, not, and grouping operators) on multiple keywords (which can be interior keywords) are fully supported, and the programs are capable of phonetic keyword search. The programs find application in enterprise-wide information retrieval (knowledge-based) systems.

This is the home page for the rel programs. The rel.tar.gz tape archive contains the C sources for the programs and applications. The archive contains the following programs:

  • rel, which orders text documents by relevance to search criteria.

  • rels, which orders text documents by relevance to simple phonetic search criteria.

  • relx, which orders text documents by relevance to phonetic search criteria.

  • htmlrel, which orders html text documents by relevance to search criteria.

  • htmlrels, which orders html text documents by relevance to simple phonetic search criteria.

  • htmlrelx, which orders html text documents by relevance to phonetic search criteria.

  • wgetrel, a shell script which searches Internet Web pages for documents that are relevant to search criteria.

  • wgetrels, a shell script which searches Internet Web pages for documents that are relevant to simple phonetic search criteria.

  • wgetrelx, a shell script which searches Internet Web pages for documents that are relevant to phonetic search criteria.

Included in the sources is an example application of an organization/corporate-wide knowledge-based system that uses Procmail/SmartList to maintain corporate e-mail archives. (As an example, this system is used for the Stochastic UCE Detection, E-mail "Received: " Header IP Address Auditing, and Quarantine Attachments database maintenance.) Sophisticated queries can be submitted to the system using any mail user agent as the user interface. Documents are returned, in order of relevance, in standard digest (RFC 1153) format. Boolean operations (and, or, not, and grouping operators) on multiple keywords (which can be interior keywords) are fully supported, as is phonetic keyword search. The searching, digest formatting, and assemblage of the returned documents are controlled by a single SmartList script for flexibility and extensibility.

The wgetrel suite of programs is an example application of the htmlrel programs, in which the relevance of documents found at a site controls the direction of search across the Internet; it finds use in business intelligence operations. The suite uses the Wget program as a "page fetch engine" in a shell script for flexibility and extensibility.

The rel program suite was first announced in 1994. There is additional information regarding the concepts of document relevance searching on the Architecture page, which also contains a suitable bibliography.


Historical Perspective

The concept of document relevance searching is not new. It was first proposed by Vannevar Bush (Bush, V. (1941) "Memorandum regarding Memex," [Vannevar Bush Papers, Library of Congress], Box 50, General Correspondence File, Eric Hodgins). There is a discussion of the Memex concept in the FAQs.

The Rel suite was designed in 1993, following discussions:

  1. 1993-08-12 IT uses
  2. 1993-08-17 Re: IT uses
  3. 1993-08-24 Re: IT uses
  4. 1993-08-27 Re: IT uses
  5. 1993-09-14 Re: IT uses
  6. 1993-09-14 Re: IT uses

in the Learning Organization (LO, re: Peter M. Senge) and Business Process Reengineering (BPR, re: Michael Hammer & James Champy) conference/mailing lists, which proposed an operative framework for knowledge-based systems and competitive business intelligence. The design of the Rel suite was heavily influenced by the concepts of Vannevar Bush's Memex Machine. (The 1993-09-14 correspondence contains theoretical implications of using a Boolean not operator in search criteria, an implication that Bush was, intuitively, well aware of; see item 1996-10-22 in the FAQs page.) The design was also influenced by the Qt text information retrieval system Unix shell script from 1991.


Mathematical Analysis

Most full text information retrieval systems determine the relevance of documents using word incident rate search criteria. This actually has formal merit: it is a heuristic implementation of Bayesian inference, which is based on formal probability and combinatorial theory, and is often used in natural language processing.

Let a be the incident rate of a specific word in a document, i.e., the number of times the word appears in the document divided by the total number of words in the document. Let b be the incident rate of another word in the document, and so on. Then, by Bayes' Chain Rule (see also Paul Graham's comments for a tutorial), the combined probability, P, that the document is relevant to the search criteria, based on the incident rates of the words, is:



                              (a * b * ...)
            P = -----------------------------------------
                (a * b * ...) + ((1 - a) * (1 - b) * ...)

        

where the terms (1 - a), (1 - b), ... are the complements of the incident rates of the words (assuming that the incident rates are statistically independent; the complements carry information about how many words in the document were not relevant to the search criteria).

If there is no incidence of any one word in the search criteria, then the probability that the document is relevant is zero; essentially, a logical and operation, which has an intuitive esthetic appeal.

But for search criteria word incident rates that are small, which is a reasonable assumption for most text (a << 1, b << 1, ...), the product (1 - a) * (1 - b) * ... is approximately unity, the term a * b * ... in the denominator is approximately zero in relation to unity, and P can be approximated as a * b * ... with reasonable precision.

Taking the logarithm of a * b * ... for reasons of computational expediency (i.e., just summing ln (a) + ln (b) + ...) will not alter the order of relevance of multiple documents, since the logarithm is monotonic; sorting on ln (P) gives the same order as sorting on P, allowing word incident rate and integer arithmetic to be used in the determination of document relevance.

Most full text information retrieval systems drop the formal requirement of working with word incident rates in documents, and simply use the word incident counts in a document, rejecting the document as non-relevant if any word in the search criteria is absent.

As an aside, Bayes' Chain Rule expresses a general strategy for constructing search criteria in electronically automated document searching: choose words that will appear fairly uniquely in only relevant documents; and choose many of them. The first is intuitive; the significance of the second is often underestimated.


As a closing aside regarding implementation, the standard usage of Bayes' Chain Rule is problematic as an algorithmic document search methodology when the word incident rates of most searches are a fraction of a percent per document. An alternative implementation is to use Bayes' Chain Rule to calculate false positive rates; for example, assuming the document to be relevant, and calculating the probability that the declaration is a false positive using the Chain Rule, where a smaller chance of a false positive is, intuitively, more relevant. There are more sophisticated alternatives, too. For example, using a word incident rate in a complete archive for the complement, instead of the incident rates per document. Document search algorithms are, indeed, a fine art.


Availability

The rel suite of programs is freely available for Download, distributed as source code at no charge, and provided under License.


Mailing list

There is a mailing list available for users of the rel program suite. To join see the Mailing List page.


Archive

The NformatiX archive is at http://www.johncon.com/nformatix/archive/.






Copyright © 1994-2011 John Conover, john@email.johncon.com. All Rights Reserved.
Last modified: Tue Mar 1 18:16:52 PST 2011 $Id: index.html,v 1.0 2011/03/02 02:19:56 conover Exp $