QT(1)                    USER COMMANDS                      QT(1)


NAME
     qt - Text information retrieval system.

SYNOPSIS
     qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ...
     qt -w [-f index_name] [-1 | ... | -8] file1 file2 ...
     qt -w [-f index_name] [-1 | ... | -8] < file_list
     qt -w -dw [-f index_name] word1 word2 ...
     qt -w -df [-f index_name] file1 file2 ...
     qt -v
     qt -h

DESCRIPTION
                          Q T  version 0.1

     Qt stands for Query Text, a text information retrieval system. Qt
     creates, maintains, and queries a full text database. The database
     file system is organized as an inverted index. The program is written
     as a single script, in Bourne Shell, and permits simple natural
     language queries.

     This program will create and query inverted index files that index
     the words in text files. These indices are useful in information
     retrieval systems. The inverted index files are typically about the
     same size as the text files, and do not require the text files to be
     present for query operations. The query functions typically consist
     of boolean operations on word searches. The output of a query is
     typically a list of the names of the files that contain the queried
     word(s).

     The read synopsis is:

        qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ...

     where word1 word2 ... are the words to be queried in the inverted
     index, and op1 op2 ... are the operations to be performed on the
     sets of file names that contain these words. The arguments alternate
     between search words and boolean operators, and are evaluated from
     left to right.

     Thus if A, B, C, and D are words, then the query:

        A and B or C not D

     would specify that all file names containing word A should be found,
     then all file names containing word B should be found, and only
     those file names that contain both A and B kept. The file names
     containing word C are then added to that set, and finally only those
     file names in the result that do not contain word D are output.
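
     As an illustration only, the left to right evaluation can be
     sketched in Bourne shell using sorted lists of file names. The index
     layout (one "word file" pair per line) and the function names
     files_for and query are assumptions made for this sketch; they are
     not taken from the qt script itself.

```shell
#!/bin/sh
# Sketch only: the index layout and the function names are hypothetical.
LC_ALL=C; export LC_ALL     # keep sort(1) and comm(1) collation consistent

# Toy inverted index, one "word file" pair per line.
index=$(mktemp)
cat > "$index" <<'EOF'
A f1
A f2
B f2
B f3
C f4
D f4
EOF

# files_for WORD: print the sorted, unique file names listed for WORD.
files_for() {
    awk -v w="$1" '$1 == w { print $2 }' "$index" | sort -u
}

# query WORD [OP WORD]...: apply each operator to the running result,
# strictly left to right, with no grouping.
query() {
    acc=$(mktemp); rhs=$(mktemp); out=$(mktemp)
    files_for "$1" > "$acc"
    shift
    while [ $# -ge 2 ]; do
        files_for "$2" > "$rhs"
        case $1 in
            and) comm -12 "$acc" "$rhs" > "$out" ;;   # in both lists
            or)  sort -u "$acc" "$rhs" > "$out" ;;    # in either list
            not) comm -23 "$acc" "$rhs" > "$out" ;;   # in result only
            *)   echo "unknown operator: $1" >&2
                 rm -f "$acc" "$rhs" "$out"
                 return 1 ;;
        esac
        mv "$out" "$acc"
        shift 2
    done
    cat "$acc"
    rm -f "$acc" "$rhs" "$out"
}

query A and B or C not D    # prints: f2
```

     With the toy index above, "A and B" leaves f2, "or C" adds f4, and
     "not D" removes f4 again, so only f2 is output.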

     Logical "or'ing" is implicit, thus:

        A B C

     is identical to:


QT(1)                       7 Jan 92                            1


        A or B or C

     Consequently, the keywords "and," "not," and "or" may not themselves
     be queried for when using the implicit "or" query construct.
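
     A minimal sketch of how an implicit "or" could be made explicit is
     given here. The function name expand_or is hypothetical, and the
     qt script may handle this differently. The sketch also shows why
     the three keywords cannot be searched for: they are always taken to
     be operators.

```shell
#!/bin/sh
# Sketch only: expand_or is a hypothetical helper, not part of qt.

# expand_or ARG...: print the arguments with an explicit "or" inserted
# between any two adjacent search words.
expand_or() {
    prev_is_word=0
    out=
    for a in "$@"; do
        case $a in
            and|or|not)
                # Always treated as an operator, never as a search word.
                out="$out $a"
                prev_is_word=0 ;;
            *)
                if [ "$prev_is_word" -eq 1 ]; then
                    out="$out or"
                fi
                out="$out $a"
                prev_is_word=1 ;;
        esac
    done
    printf '%s\n' "${out# }"
}

expand_or A B C          # prints: A or B or C
```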

     If the "-e" option is specified, then "exact match" queries will be
     performed; otherwise a "partial key" search, which is the default,
     will be performed. It is recommended that the "-e" option be used if
     the query involves any boolean operations.
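
     The difference between the two kinds of search can be sketched as
     follows. This manual does not define "partial key"; the sketch
     assumes it means that the query is matched as a leading prefix of
     the indexed words, as look(1) does. The index layout and the
     function names exact_match and partial_match are likewise
     assumptions.

```shell
#!/bin/sh
# Sketch only: assumes "partial key" means a prefix match; the index
# layout and the function names are hypothetical.
LC_ALL=C; export LC_ALL

# Toy inverted index, one "word file" pair per line.
index=$(mktemp)
cat > "$index" <<'EOF'
index f1
indexes f2
inverted f3
EOF

# exact_match WORD: files whose indexed word is exactly WORD.
exact_match() {
    awk -v w="$1" '$1 == w { print $2 }' "$index" | sort -u
}

# partial_match KEY: files whose indexed word begins with KEY.
partial_match() {
    awk -v k="$1" 'index($1, k) == 1 { print $2 }' "$index" | sort -u
}

exact_match index        # prints: f1
partial_match index      # prints: f1 and f2, one per line
```

     Under this assumption a partial key widens the set of files found
     for each word, which is one reason the results of boolean
     operations can be easier to predict with "-e".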

     If the "-r" option is specified, then the words being queried for
     will be output before the list of file names that contain the
     queried words. This output format is compatible with egrep(1), and
     is useful in doing "relevance feedback" searches.

     If the "-rc" option is specified, then the count of records in a file
     that contain matches will be output with the name of that file. This
     provides the system with a rudimentary "relevance feedback"
     capability. The original text files that were used to construct the
     inverted index file must be available in the system to use this
     option.

     If the "-rp" option is specified, then the records in a file that
     contain matches will be output after the name of that file. This
     output format provides the system with a rudimentary "permuted
     index" type of "proximity retrieval." The original text files that
     were used to construct the inverted index must be available in the
     system to use this option.

     The write synopsis is:

        qt -w [-f index_name] [-1 | ... | -8] file1 file2 ...

     where file1 file2 ... are the file names that contain words that are
     to be added to the inverted index, or:

        qt -w [-f index_name] [-1 | ... | -8] < file_list

     where file_list is the name of a file that contains a list of file
     names, one file name per record, that contain words that are to be
     added to the inverted index.

     It is recommended that file names be given as absolute paths, that
     is, starting at the system's root directory.

     If the inverted index file does not exist, then it will be created,
     and will contain an index to all of the words in the input files. If
     the inverted index file exists, then the indices of all words in the
     input files will be added incrementally. Each word and file name
     pair will appear only once in the inverted index.
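
     The create-or-add behavior can be sketched with standard tools.
     This is an illustration under stated assumptions, not the qt
     implementation: the index layout (one "word file" pair per line),
     the function name add_files, and the restriction to file names
     without embedded white space are all hypothetical.

```shell
#!/bin/sh
# Sketch only: index layout and function name are hypothetical. File
# names are assumed to contain no white space.
LC_ALL=C; export LC_ALL

# add_files INDEX FILE...: add every "word file" pair found in the FILEs
# to INDEX, creating INDEX if needed and never storing a pair twice.
add_files() {
    idx=$1
    shift
    new=$(mktemp)
    touch "$idx"                       # create an empty index if missing
    for f in "$@"; do
        # One word per line, then tag each word with its file name.
        tr -cs '[:alnum:]' '\n' < "$f" | awk -v f="$f" 'NF { print $1, f }'
    done | sort -u - "$idx" > "$new"   # merge with the existing pairs
    mv "$new" "$idx"
}
```

     Running add_files a second time on the same file leaves the index
     unchanged, since sort -u discards pairs that are already present.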

     The "-w" option specifies that write operations will be performed,
     and is a mandatory option, to be used if and only if write operations
     are desired.

     The "-f index_name" option specifies the inverted index file's name.
     If the "-f" option is not specified, the inverted index file name
     will default to "qt.index".

     The lexical analyzer level is specified by "-1" through "-8". If
     none is specified, the default, "-4", will be used. The lexical
     analyzers with larger numbers are generally more sophisticated
     about the words that are placed in the inverted index. The lexical
     analyzers available are:

        1) Parses words and numbers. All other characters are omitted.
        Capitalization is preserved. Probably the best choice if
        non-word searches are important.

        2) Like 1) above, but the '_' character is recognized. This
        parser seems to work well with "C" program source files.

        3) Like 1) above, but only words of more than two characters
        are placed in the inverted index file. If capitalization is
        considered important in the search criteria, then this seems
        to be the best choice.

        4) Like 1) above, but capitalization is ignored. For general
        text where all words and numbers are considered significant,
        this seems to be the best choice. It also seems a good choice
        for "catman" pages. Queries should be in lowercase.

        5) Like 3) above, but capitalization is ignored. For general
        text, this seems to be the best choice. Queries should be in
        lowercase.

        6) Like 4) above, but words containing only numbers are
        omitted from the inverted index file. For text containing only
        words, this seems to be the best choice. Queries should be in
        lowercase.

        7) Like 4) above, but does not include Unix mail headers in
        the inverted index file. Each email should be in a separate
        file, as opposed to concatenated into folders. This seems to
        be the best choice for Unix mail files, if the header
        information is not desirable.

        8) Like 4) above, but deletes TeX and/or LaTeX commands from
        the inverted index file. This seems to be the best choice for
        TeX and LaTeX documents.
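
     Several of the analyzer levels above can be approximated with short
     filters, as in this sketch. The function names lex1, lex2, and so
     on are hypothetical, and the filters are simplified readings of the
     descriptions above, not the analyzers in the qt script. Levels 7)
     and 8) are not sketched.

```shell
#!/bin/sh
# Sketch only: simplified filters with hypothetical names. Each reads
# text on standard input and prints one indexable word per line.
LC_ALL=C; export LC_ALL

# 1) Words and numbers only; capitalization preserved.
lex1() { tr -cs '[:alnum:]' '\n' | awk 'NF'; }

# 2) Like 1), but '_' is kept as part of a word.
lex2() { tr -cs '[:alnum:]_' '\n' | awk 'NF'; }

# 3) Like 1), but only words of more than two characters.
lex3() { lex1 | awk 'length($0) > 2'; }

# 4) Like 1), but capitalization is ignored.
lex4() { lex1 | tr '[:upper:]' '[:lower:]'; }

# 5) Like 3), but capitalization is ignored.
lex5() { lex3 | tr '[:upper:]' '[:lower:]'; }

# 6) Like 4), but words containing only digits are omitted.
lex6() { lex4 | awk '$0 !~ /^[0-9]+$/'; }

printf 'An Index of 42 Files\n' | lex5    # prints: index, files
```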

     The more sophisticated the parser, the smaller the size of the
     inverted index file. Multiple runs can be made, using the different
     parsers, to store words in the inverted index. For example, using
     parsers 3) and 4) would place both the capitalized and
     non-capitalized words in the index. This would not duplicate any
     words already in the index; only the words that were different
     would be added.

     The remove words synopsis is:

        qt -w -dw [-f index_name] word1 word2 ...

     where word1 word2 ... are the words that are to be deleted from the
     inverted index file. The words may be regular expressions, but no
     '^' or '$' characters should be used unless they are escaped. The
     "-w" option is mandatory.

     The remove files synopsis is:

        qt -w -df [-f index_name] file1 file2 ...

     where file1 file2 ... are the file names that are to be deleted from
     the inverted index file. The file names may be regular expressions,
     but no '^' or '$' characters should be used unless they are escaped.
     The "-w" option is mandatory.
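
     One plausible reading of the restriction on '^' and '$' is that the
     pattern is placed inside a larger expression that already supplies
     its own anchors. The sketch below works that way. It is an
     assumption for illustration only: the index layout (one "word file"
     pair per line) and the function names delete_words and delete_files
     are hypothetical, and the qt script may differ.

```shell
#!/bin/sh
# Sketch only: index layout and function names are hypothetical.
LC_ALL=C; export LC_ALL

# delete_words INDEX PATTERN...: remove every pair whose word matches
# PATTERN in full. The anchors are supplied here, so PATTERN itself
# must not contain an unescaped '^' or '$'.
delete_words() {
    idx=$1
    shift
    for pat in "$@"; do
        tmp=$(mktemp)
        # grep exits 1 when nothing is left to print; that is not an error.
        grep -v -E "^($pat) " "$idx" > "$tmp" || [ $? -eq 1 ]
        mv "$tmp" "$idx"
    done
}

# delete_files INDEX PATTERN...: remove every pair whose file name
# matches PATTERN in full.
delete_files() {
    idx=$1
    shift
    for pat in "$@"; do
        tmp=$(mktemp)
        grep -v -E " ($pat)\$" "$idx" > "$tmp" || [ $? -eq 1 ]
        mv "$tmp" "$idx"
    done
}
```

     With an index holding the words "foo" and "food", the pattern 'fo+'
     removes only "foo", because the whole word must match.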

     The version synopsis is:

        qt -v

     which will print the version number of qt.

     The help synopsis is:

        qt -h

     which will list a synopsis of the command semantics.

     A common example of writing an inverted index file would be:

        find /dir1/dir2 -type f -print | qt -w

     which would recursively descend through the directory hierarchy, and
     create an inverted index of all of the words in all of the files in
     all of the directories, starting with /dir1/dir2.

     A common example of retrieving information from an inverted index
     file would be:

        more +/word `qt word`

     where the "more" program would page through the documents that
     contain "word," advancing to the next instance every time the 'n' key
     is pressed.


     A common example of relevance determination in retrieving information
     from an inverted index file would be:

        egrep -ic `qt -r word` | sort -n -r -t: -k 2

     which would print the file(s) that contain "word," with the count of
     the instances of records that contain "word" in each of the file(s).
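
     The same ranking can be sketched without an index, as a way of
     seeing what the pipeline computes. The function name rank is
     hypothetical. The sketch prints the file name explicitly, because
     grep(1) omits the file name prefix from its count when it is given
     only a single file. File names are assumed to contain no ':'
     character.

```shell
#!/bin/sh
# Sketch only: rank is a hypothetical helper, not part of qt.
LC_ALL=C; export LC_ALL

# rank WORD FILE...: print "file:count" for each FILE, where count is
# the number of records (lines) containing WORD, ignoring case, with
# the highest count first.
rank() {
    word=$1
    shift
    for f in "$@"; do
        printf '%s:%s\n' "$f" "$(grep -ic -e "$word" "$f")"
    done | sort -t: -k2,2nr
}
```

     For example, "rank word file1 file2" lists whichever of the two
     files has more matching records first.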

     As a simple application example, this program can be used to search
     the "catman" pages for a command that performs a specific function,
     even though the command's name is not known. That is, if you know
     what you want to do, you can find the command that does it.

     Comments and/or bug reports should be addressed to:

        john@email.johncon.com (John Conover)

     Known caveats: There is no concurrency control, so it would be
     ill-advised to use this program as a concurrent application.
     Additionally, the natural language query does not support grouping
     operators.

     Applicability:

     The applicability of qt depends on the complexity of the search, the
     size of the database, the speed of the host environment, etc.
     However, some general guidelines are:

        1) For text files with a total size of less than 5 MB,
        standard egrep(1) queries of the text files will probably
        prove adequate.

        2) For text files with a total size of 5 MB to 50 MB, qt seems
        adequate for most queries. The significant issue is that,
        although the retrieval execution times are probably adequate
        with qt, the database write times are not impressive.

        3) For text files with a total size that is larger than 50 MB,
        or where concurrency is an issue, it would be appropriate to
        consider one of the alternatives listed under "Related
        information retrieval software," below.

     References:

        1) "Information Retrieval, Data Structures & Algorithms,"
        William B. Frakes, Ricardo Baeza-Yates, Editors, Prentice
        Hall, Englewood Cliffs, New Jersey 07632, 1992, ISBN
        0-13-463837-9.

        The sources for many of the algorithms presented in 1) are
        available by ftp, ftp.vt.edu:/pub/reuse/ircode.tar.Z


        2) "Text Information Retrieval Systems," Charles T. Meadow,
        Academic Press, Inc, San Diego, 1992, ISBN 0-12-487410-X.

        3) "Full Text Databases," Carol Tenopir, Jung Soon Ro,
        Greenwood Press, New York, 1990, ISBN 0-313-26303-5.

        4) "Text and Context, Document Processing and Storage," Susan
        Jones, Springer-Verlag, New York, 1991, ISBN 0-387-19604-8.

        5) ftp think.com:/wais/wais-corporate-paper.text

        6) ftp cs.toronto.edu:/pub/lq-text.README.1.10

        7) "Unix Shell Programming," Lowell Jay Arthur, John Wiley &
        Sons, Inc., New York, 1990, ISBN 0-471-51820-4.

     Related information retrieval software:

        1) Wais, available by ftp, think.com:/wais/wais-8-b5.1.tar.Z

        2) lq-text, available by ftp,
        cs.toronto.edu:/pub/lq-text1.10.tar.Z

     The program, qt, is free software, and can be redistributed and/or
     modified, without any restrictions. It is distributed with no
     warranty of any kind, implied or otherwise.  Specifically, there is
     no warranty of fitness for any particular purpose and/or
     merchantability.

SEE ALSO
     cat(1), cp(1), echo(1), egrep(1), join(1), look(1), mv(1), rm(1),
     sed(1), sort(1), sync(1), tr(1), uniq(1).

