From: John Conover <john@email.johncon.com>
Subject: [john@email.johncon.com: Re: Let's get practical LO545]
Date: Wed, 29 Mar 95 21:21 PST
Michael McMaster writes in LO539:
> Replying to LO417 --
> The "Corporate Knowledge Repository" has come up a number of times in
> recent conversations. In past efforts I've seen, the idea is a great
> library register for "author/source" and content info (by some heading or
> key word). This hasn't worked out too well in the instances I'm familiar
> with and doesn't have much promise - same design flaws - in the beginnings
> that I've heard of.
FYI, attached please find a brief synopsis of an asynchronous
conferencing system (also known as an information retrieval system,
electronic literature search system, or corporate repository) that I
used in cross-functional program management, in another life, a long
time ago. The objective was to find a methodology to relate the
corporate information repository to the management structure (we did
not consider the technical issues to be significant). The general
concept was to add sufficient functionality to the Unix email system
to turn it into an electronic literature search system.
The attached is a "cut and stick" from some of the reports on the
system's development. The project/program team supported by this
system consisted of a little over a hundred professionals, from
approximately 20 specialties and 4 core corporate functions. They
were geographically, and ethnically, dispersed.
Also attached is a description of a program, rel.c, which was used to
perform complex literature search queries on the full text database
and return documents in order of relevance to the query (note that
hypertext methodologies are incapable of operating in this fashion).
These documents were returned as an email digest, which could be
"burst" into constituent documents, allowing the most relevant
document to be reviewed first (and then, if necessary, the specific
email could be responded to for further clarification, etc.). Since
most email readers (elm, pine, mail, etc.) are capable of sorting a
mailbox by different criteria, one could move "orthogonally in
information space" during the review process, i.e., move through the
documents by relevance, then sort by author to find out what he/she
had to say about things, then sort by date to find out the
chronology of events in the discussion, etc. The database system was
a distributed environment, with each segment of the database
consisting of less than 10 Mbyte, so queries were done in parallel,
using all network resources, and were thus very fast.
In point of fact, all of the attachments are a "cut and stick" from
documents fetched from the full text database system with the command "rel
((information & retrieval) | (literature & search) | (corporate &
repository) & management)".
John
--
John Conover, john@email.johncon.com, http://www.johncon.com/
Attachments:
______________________________________________________________________________
Various "cut and sticks" from the development reports.
Information systems are used in program management, which must
coordinate the various activities of the corporate functions (i.e.,
engineering, marketing, sales, etc.) involved in development
projects. After researching the issues (see below), we concluded that
a distributed full text system using the mail (MTA) system as the
communication medium was the desirable direction to pursue. Our
reasoning was as follows:
1) The Unix MTA is almost universal, and will operate
effectively over uucp and/or ethernet connections in a
non-homogeneous hardware environment.
2) Each transaction is logged with a date/time stamp and a record
of who created the transaction.
3) The MTA already has rudimentary file storage capabilities,
which can be used to query/respond to transactions at a later
date.
4) Most(?) computers are already connected together, and users
are familiar with how to use the system.
5) The MTA database can be NFS'ed to conserve machine
resources.
6) It is a text-based system.
We discounted the "hypertext" type of systems, because the links must be
established before the document is stored, which is fine if you know what
you are going to query for. In a general management application, this is
seldom the case. We set up a prototype system, using the following
(readily available) programs:
1) elm, because it has a slightly more sophisticated file
storage structure, and a very powerful aliasing capability
that can alias team members as a group. Additionally, it has
limited query capabilities, and can, through its forms
capability, send mail transactions in a structured format.
(Which is advantageous if the transactions are used for
notification of schedule milestone completion, etc.) Eudora
was used on the PCs and Macs, using POP3 as the
communications environment between the PCs and the Unix MTA.
2) The dbm library, to build an extensible hash query system
into the file storage structure made by elm. This was
operated in two ways: by a direct RPC call, and by a mail daemon
that "read" incoming mail (to a query "account") and returned
(via mail) all transactions that satisfied boolean
conditionals on the requested words. (A data dictionary was added
later, so that the dictionary could be scanned for matches to
regular expressions, which were then passed to the extensible
hash system, but for some reason, this was seldom used.) The
query was made through a very simple natural language
interface, i.e.,
send john and c.*r not January
would return all transactions containing "john" and a word
matching "c.*r", excluding those containing "January". (We did
not attempt phrases; it looked complicated, and attempting it
is advised against by Tenopir, below.)
This program contained approximately 350 lines of C code. A
soundex algorithm was added later to overcome spelling
errors: the full text database contained the soundex of the
words in a document, and any words searched for were converted
to soundex prior to the query. (See the works by Knuth for
details of the soundex algorithm.) Also, a parser was added so
that the boolean search words could be grouped in infix
expressions, e.g., ((john & conover) ! (January | march)). The
documents were returned in order of relevance.
This prototype was well received, and was used as follows:
1) Management "decreed" that the system would be used as a
management tool, and all data had to be entered, or
transcribed, into the system (including the minutes of
meetings, etc.). If it didn't exist in the system, it did not
exist. All discussions, and reasons for decisions, had to be
placed in the system. ALL team members and upper management
had identical access to ALL transactions. (Mail could be used
for private correspondence, such as politicking, etc., but all
decisions, and the reasons for the decisions, had to be placed
in the system.) The guiding rule was that at the end of the
project, the system contained a complete play-by-play
chronology and history of all decisions, and the reasoning
concerning the project, and, by the way, who was responsible
for the decisions. On each Monday, everyone entered into the
system his/her objectives for the week, and when each
objective was finished, she/he mailed the milestone into the
system; i.e., all group members and management could thus find
out the exact status of the project at any time (i.e., a
"social contract" was made with management and the rest of the
members of the team). In some sense, it is really nothing more
than an automated, real-time MBO system. At any time, a
discussion could be initiated on problems/decisions in the
system by anyone. The project manager was assigned the
responsibility of "moderator," or chairperson, for his/her
section of the project. Each Friday, the system was queried
for project status, and the status piped to TeX for
formatting, and printed for official documentation. This
document was discussed at a late Friday face-to-face staff
meeting. (The reason for setting things up this way can be
found in Davidow, below.)
2) Marketing was responsible for acquiring all market data on
magnetic media (from services like Dataquest, the Department
of Commerce, etc.), and each document was "mailed" into the
system so that the information was available for retrieval by
anyone. All had access to the progress made by engineering,
and could contribute information on issues as the program
developed; i.e., this was a "concurrent engineering" environment.
3) Engineering was responsible for maintaining schedules, and
reflecting those schedules in the system; if slippages occurred,
the situation could be addressed immediately by management,
and a suitable cross functional resolution could be arrived
at.
4) Sales was responsible for adding customer inputs,
concerning the project, into the system, so customer
definitions could be retrieved by all project members. This
included customer data, such as who has buying authority
in the customer's organization, who has signature authority, etc.
The results were very impressive, not only by productivity standards but
also by "correctness to fit and form" standards (i.e., the right product
was in the market at the right time, the first time). This has become a
central agenda, as outlined in Davidow, below.
Bibliography:
"Computer-Supported Cooperative Work," Irene Greif
"A model for Distributed Campus Computing," George A. Champine
"Enterprise Networking," Ray Grenier and George Metes
"Connections," Lee Sproull and Sara Kiesler
"5th Generation Management," Charles M. Savage
"Intellectual Teamwork," Jolene Galegher, Robert E. Kraut and Carmen Egido
"In the Age of the Smart Machine," Shoshana Zuboff
"The Virtual Corporation," William H. Davidow and Michael S. Malone
"Accelerating Innovation," Marvin L. Patterson
"Paradigm Shift," Don Tapscott and Art Caston
"Developing Products in Half the Time," Preston G. Smith and Donald G. Reinertsen
"Full Text Databases," Carol Tenopir and Jung Soon Ro
"Text and Context," Susan Jones
"The Art of Computer Programming, Vol. 3: Sorting and Searching," Donald E. Knuth
______________________________________________________________________________
Rel is a program that determines the relevance of text documents to a
set of keywords expressed in boolean infix notation. The names of the
relevant files are printed to the standard output, in order
of relevance.
For example, the command:
rel "(directory & listing)" /usr/share/man/cat1
(i.e., find the relevance of all files that contain both of the words
"directory" and "listing" in the catman directory) will list 21 files,
out of the 782 catman files, of which "ls.1" is the fifth most
relevant; meaning that, to find the command that lists directories in a
Unix system, the "literature search" was cut from 359 to 5 files, a
reduction of approximately 98%. The command took 1 minute and 26
seconds to execute on a System V rel. 4.2 machine (20MHz 386
with an 18ms ESDI drive), which is a considerable saving in
relation to browsing through the files in the directory, since ls.1 is
the 359'th file in the directory. Although this example is elementary, a
similar saving can be demonstrated in searching for documents in
email repositories and text archives.
General description of the program:
This program is an experiment in using infix boolean
operations as a heuristic to determine the relevance of text files in
electronic literature searches. The operators supported are "&" for
logical "and," "|" for logical "or," and "!" for logical "not."
Parentheses are used as grouping operators, and "partial key" searches
are fully supported (meaning that the words can be abbreviated). For
example, the command:
example, the command:
rel "(((these & those) | (them & us)) ! we)" file1 file2 ...
would print a list of the filenames that contain either the words "these"
and "those", or "them" and "us", but do not contain the word "we",
from the list of filenames file1, file2, ... The file names are
printed in order of relevance, where relevance is determined by
the number of occurrences of the words "these", "those", "them", and
"us" in each file. The general concept is to "narrow down" the number
of files to be browsed when doing electronic literature searches for
specific words and phrases in a group of files, using a command similar
to:
more `rel "(((these & those) | (them & us)) ! we)" file1 file2`
Although regular expressions were supported in the prototype versions
of the program, the capability was removed in the release versions for
reasons of syntactic consistency. For example, the command:
rel "((john & conover) & (joh.*over))" files
has a logical contradiction, since the first group specifies all files
which contain "john" anyplace and "conover" anyplace in the files, and the
second grouping specifies all files that contain "john" followed by
"conover". If the last group of operators takes precedence, the first
is redundant. Additionally, it is not clear whether wild card
expressions should span multiple records in a literature
search (which the first group of operators in this example does), or
exactly what a wild card expression that spans multiple records means,
i.e., how many records are to be spanned, without writing a string of
EOLs in the infix expression. Since the two groups of operators in
this example are very close, operationally (at least for practical
purposes), it was decided that support for regular expressions should
be abandoned, and such operations left to the grep(1) suite.
Applicability:
Applicability of rel varies with the complexity of the search, the size
of the database, the speed of the host environment, etc.; however, some
general guidelines are:
1) For text files with a total size of less than 5 MB, rel, and
standard egrep(1) queries of the text files will probably prove
adequate.
2) For text files with a total size of 5 MB to 50 MB, qt seems
adequate for most queries. The significant issue is that, although
the retrieval execution times are probably adequate with qt, the
database write times are not impressive. Qt is listed in "Related
information retrieval software," below.
3) For text files with a total size that is larger than 50 MB, or
where concurrency is an issue, it would be appropriate to consider
one of the other alternatives listed in "Related information
retrieval software," below.
References:
1) "Information Retrieval, Data Structures & Algorithms," William
B. Frakes, Ricardo Baeza-Yates, Editors, Prentice Hall, Englewood
Cliffs, New Jersey 07632, 1992, ISBN 0-13-463837-9.
The sources for many of the algorithms presented in 1) are
available by ftp: ftp.vt.edu:/pub/reuse/ircode.tar.Z
2) "Text Information Retrieval Systems," Charles T. Meadow,
Academic Press, Inc, San Diego, 1992, ISBN 0-12-487410-X.
3) "Full Text Databases," Carol Tenopir, Jung Soon Ro, Greenwood
Press, New York, 1990, ISBN 0-313-26303-5.
4) "Text and Context, Document Processing and Storage," Susan
Jones, Springer-Verlag, New York, 1991, ISBN 0-387-19604-8.
5) ftp think.com:/wais/wais-corporate-paper.text
6) ftp cs.toronto.edu:/pub/lq-text.README.1.10
Related information retrieval software:
1) Wais, available by ftp, think.com:/wais/wais-8-b5.1.tar.Z.
2) Lq-text, available by ftp,
cs.toronto.edu:/pub/lq-text1.10.tar.Z.
3) Qt, available by ftp,
ftp.uu.net:/usenet/comp.sources/unix/volume27.
______________________________________________________________________________