NformatiX: The Rels Program

Software For Full Text Information Retrieval:

The Rels Program - Order The Relevance Of Text Documents To A Simple Phonetic Search Criteria

rels [options] patterns paths ...

DESCRIPTION

Rels is a program that determines the relevance of text documents to a set of keywords expressed in boolean infix notation. The relevance is determined by comparing the phonetic representation of the keywords with the phonetic representation of every word in a document. (Phonetic searching has some degree of tolerance to misspelled words.) The list of file names that are relevant are printed to the standard output, in order of relevance. The boolean operators supported are logical or, logical and, and logical not. These operators are represented by the symbols, "|", "&", and, "!", respectively, and left and right parenthesis, "(" and ")", are used as the grouping operators. The paths can be files and/or directories-if it is a directory, the program will recursively descend into the directory, searching all files and directories contained in the directory.

For example, the command:


  rels "(directory & listing)" /usr/share/man/cat1

(ie., find the order of relevance of all files that contain both of the words "directory" and "listing" in the catman directory) will list a few tens of files, out of the hundreds of catman files, of which "ls.1" is the among the most relevant-meaning that to find the command that lists directories in a Unix system, the "literature search" was reduced, on average, by about 98%, which is a considerable expediency in relation to browsing through the files in the directory. Although this example is remedial, a similar expediency can be demonstrated in searching for documents in email repositories and text archives.

Additional applications include information robots, (ie., "mailbots," or "infobots,") where the disposition (ie., delivery, filing, or viewing,) of text documents can be determined dynamically, based on the relevance of the document to a set of criteria, framed in boolean infix notation. Or, in other words, the program can be used to order, or rank, text documents based on a "context," specified in a general mathematical language, similar to that used in calculators.

The words in the query are case insensitive, and either upper or lower case can be used.

Associativity of operators is left to right, and the precedence of operators is identical to 'C':

precedence	operator
high	! = not
middle	& = and
lowest	\| = or

The operator symbols can be escaped with the "\" character to include the symbol in a search pattern. The "escape space" character sequence represents one or more instances of space character(s) in search patterns, and each instance will match one or more consecutive whitespace characters, (as defined by isspace(3) in ctype.h and/or locale.h,) and allows phrases to be searched for. The "many to one" whitespace character translation occurs in both the keyword arguments and the text document(s). Multiple consecutive instances of the "escape space" character sequence in keyword search phrases should not be used, and single instances are appropriate only when necessary to specify a consecutive sequence of keywords-the logical and operator is the preferred searching construct when searching documents that contain set(s) of keywords.

Note that the logical or operator, (|) is useful in conjunction with a thesaurus. For example, the thesaurus entry for the word "complexity" is:

Complexity. -- N. complexity; complexness &c. adj.; complexus; complication, implication; intricacy, intrication; perplexity; network, labyrinth; wilderness, jungle; involution, raveling, entanglement; coil &c. (convolution) 248; sleave, tangled skein, knot, Gordian knot, wheels within wheels; kink, knarl; webwork.
Adj. knarled. complex, complexed; intricate, complicated, perplexed, involved, raveled, entangled, knotted, tangled, inextricable; irreducible.

implying that a reasonable context for a search for things that are complex would be:


  rels '(complex | complic | implicat | intric | perplex | \
        labyrinth | involut | convolut | involv | tangl | \
        inextric | irreduc)' ...

which would probably return too many document names. The number of documents can be reduced with the logical and (&) and not (!) operator in an iterative fashion to reject documents of little interest.

DOCUMENT FORMAT ISSUES

Hyphenation issues are addressed by deleting hyphens and any following sequence of instances of whitespace characters, (as defined by isspace(3),) in both the keyword arguments and the text document(s).

Backspace character issues are addressed by overwriting the character before the backspace with the character after the backspace, which will instantiate the character of the last instance of of consecutive backspace/character combinations. This is specifically for catman pages which utilize underscore/backspace/character combinations for underlining, in addition to backspace/character combinations for bold (overstrike,) representation-note that for this process to be successful, a single underscore (used for underlining,) must preceed a single character in the sequence.

PHONETIC TRANSLATION

This program is a derivative works based on the rel(1) program, available from http://www.johncon.com/nformatix/archive/rel.tar.gz. The sources were modified to include a soundex search algorithm.

The soundex algorithm is a mechanical phonetic translation system for the English language, and converts English words into a corresponding phonetic code for the word. The algorithm is as follows:

for each character in a word:

if the character is the first character of a word, do nothing

else:
replace consecutive sequences of the labials, (ie., the characters, B, F, P, V,) with the character '1'
replace consecutive sequences of the gutterals and sibilants, (ie., the characters, C, G, J, K, Q, S, X, Z,) with the character '2'
replace consecutive sequences of the dentals, (ie., the characters, D, T,) with the character '3'
replace consecutive sequences of the longliquids, (ie., the character, L,) with the character '4'
replace consecutive sequences of the nasals, (ie., the characters, M, N,) with the character '5'
replace consecutive sequences of the shortliquids, (ie., the character, R,) with the character '6'
and, omit all other characters, (ie., the characters, A, E, H, I, O, U, W, Y,)
if the soundex translation of the word is larger than 4 characters, truncate to 4 characters.

For example, the soundex translation of the word "conover" is C516. Unfortunately, there are two related issues in using the soundex algorithm as a search mechanism; interior keyword search is impossible, and, there is no practical strategy to handle hyphenation.

As a heuristic, simply eliminating 1), above, would permit interior keyword searches and hyphenation through concatenation of characters on each side of a '-' character, at the expense of erroneous matches. In practice, the expense is small-depending on the point of view-particularly if the requirement in 9), above, is removed, permitting soundex keyword translations of more syllables.

Note that this heuristic returns soundex translated words that consist only of numbers. Since numerical data can be a valid search criteria, the ambiguity can be avoided by using letters from the alphabet for the numbers, making the algorithm as follows:

replace consecutive sequences of the labials, (ie., the characters, B, F, P, V,) with the character 'B'
replace consecutive sequences of the gutterals and sibilants, (ie., the characters, C, G, J, K, Q, S, X, Z,) with the character 'G'
replace consecutive sequences of the dentals, (ie., the characters, D, T,) with the character 'D'
replace consecutive sequences of the longliquids, (ie., the character, L,) with the character 'L'
replace consecutive sequences of the nasals, (ie., the characters, M, N,) with the character 'N'
replace consecutive sequences of the shortliquids, (ie., the character, R,) with the character 'R'
and, omit all other characters, (ie., the characters, A, E, H, I, O, U, W, Y,)

which turns out to be implementable as a direct, many-to-one, and on-to simple character mapping. It is, also, a very fast phonetic search methodology-there is no speed penalty. (Note, also, that the soundex code generated is a one-to-one and on-to mapping, ie, the soundex code of a soundex code will be the soundex code itself, eg., you can specify a soundex code on the command line as a keyword.)

Comparing the two methodologies, (standard soundex vs. modified soundex,) on a text version of the Webster's dictionary, (mine has 234,932 words,) as to the number of different words recognized, with both unlimited soundex word length, and a word length of 4:

standard soundex	modified soundex

length = 4	unlimited	length = 4	unlimited
4,335	61,408	932	31,983

Although the modified soundex with unlimited length is inferior to the standard soundex with unlimited word length in capability of recognizing differences in words, it is superior to the standard soundex with a word length of 4, which is the way the algorithm is usually used. It would seem that the modified soundex algorithm is a reasonable, (depending on the point of view,) compromise for implementing a phonetic search algorithm.

There are additional issues with the soundex algorithm for phonetic keyword searches:

it only works for the English language
a syntax error will be returned for keywords made up of ONLY the characters A, E, I, H, O, U, W, and Y, (there is nothing to search for-these characters are ignored by the soundex algorithm)
Extreme care must be exercised when using the algorithm to reject documents with the logical not operator (!) since it will reject more documents than probably expected.

meaning that the algorithm should be considered as an adjunct to, instead of a replacement for, a strict keyword search.

Tests on large email archives, and the HTML pages from WWW servers (each about 15 Mbytes,) tend to indicate that, in practice, the algorithm returns not quite twice as many keyword matches as a strict keyword search. (The output of this program was compared to the output of the rel(1) program.)

OPTIONS

-v: Print the version and copyright banner of the program.

WARNINGS

In the interest of performance, Memory is allocated to hold the entire file to be searched. Large files may create resource issues.

The "not" boolean operator, '!', can NOT be used to find the list of documents that do NOT contain a keyword or phrase, (unless used in conjunction with a preceeding boolean construct that will syntactically define an intermediate accept criteria for the documents.) The rationale is that the relevance of a set of documents that do NOT contain a phrase or keyword is ambiguous, and has no meaning-ie., how can documents be ordered that do not contain something? Whether this is a bug, or not, depends on one's point of view.

The phonetic translation only works for the English language.

A syntax error will be returned for keywords made up of ONLY the characters A, E, I, H, O, U, W, and Y, (there is nothing to search for-these characters are ignored by the soundex algorithm).

Extreme care must be exercised when using the algorithm to reject documents with the logical not operator (!) since it will reject more documents than probably expected.

DIAGNOSTICS

Error messages for illegal or incompatible search patterns, for non-regular, missing or inaccessible files and directories, or for (unlikely) memory allocation failure, and signal errors.

AUTHORS

A license is hereby granted to reproduce this software source code and to create executable versions from this source code for personal, non-commercial use. The copyright notice included with the software must be maintained in all copies produced.

THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE. THE AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE THE INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.

So there.

Comments and/or bug reports should be addressed to:

john@email.johncon.com

http://www.johncon.com/

http://www.johncon.com/ntropix/

http://www.johncon.com/ndustrix/

http://www.johncon.com/nformatix/

http://www.johncon.com/ndex/