QT(1)                        USER COMMANDS                        QT(1)

NAME
     qt - Text information retrieval system.

SYNOPSIS
     qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ...
     qt -w [-f index_name] [-1 | ... | -8] file1 file2 ...
     qt -w [-f index_name] [-1 | ... | -8] < file_list
     qt -w -dw [-f index_name] word1 word2 ...
     qt -w -df [-f index_name] file1 file2 ...
     qt -v
     qt -h

DESCRIPTION
     QT version 0.1

     Qt stands for Query Text, a text information retrieval system.
     Qt creates, maintains, and queries a full text database. The
     database file system is organized as an inverted index. The
     program is written as a single script, in Bourne Shell, and
     permits simple natural language queries.

     This program will create and query inverted index files that
     index the words in text files. These indices are useful in
     information retrieval systems. The inverted index files are
     typically about the same size as the text files, and do not
     require the text files to be present for query operations. The
     query functions typically consist of boolean operations on word
     searches. The output of the query is typically a list of the
     file names that contain the queried word(s).

     The read synopsis is:

          qt [-e] [-r | -rc | -rp] [-f index_name] word1 [op1] word2 [op2] ...

     where word1 word2 ... are the words to be queried in the
     inverted index, and op1 op2 ... are the operations to be
     performed on the set of file names that contain these words.

     The word/operation arguments consist of pairs of search words
     and boolean operators, with a left to right operational
     precedence. Thus if A, B, C, and D are words, then the query:

          A and B or C not D

     specifies that the file names containing word A are found, then
     the file names containing word B, and only the file names that
     contain both A and B are kept. These are added to the file
     names containing word C, and finally any file names that
     contain word D are dropped. The remaining file names are
     output.
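
     The left to right evaluation described above can be sketched
     with ordinary set operations on lists of file names. This is
     only an illustration under stated assumptions, not qt's own
     code: the lists A, B, C, and D and the helpers set_and, set_or,
     and set_not are hypothetical names, and each list is assumed to
     be newline-separated with no duplicates.

```shell
#!/bin/sh
# Hypothetical posting lists: the file names that contain each word.
A='f1
f2
f3'
B='f2
f3
f4'
C='f5'
D='f3'

# Set operations on newline-separated, duplicate-free lists.
# A name present in both lists appears twice after sorting: uniq -d keeps it.
set_and() { printf '%s\n%s\n' "$1" "$2" | sed '/^$/d' | sort | uniq -d; }
# Union: merge both lists and drop duplicates.
set_or()  { printf '%s\n%s\n' "$1" "$2" | sed '/^$/d' | sort -u; }
# Difference: the second list is fed twice, so none of its names can be
# unique; uniq -u keeps only names found in the first list alone.
set_not() { printf '%s\n%s\n%s\n' "$1" "$2" "$2" | sed '/^$/d' | sort | uniq -u; }

# "A and B or C not D", applied strictly left to right to an accumulator.
acc=$A
acc=$(set_and "$acc" "$B")   # f2 f3
acc=$(set_or  "$acc" "$C")   # f2 f3 f5
acc=$(set_not "$acc" "$D")   # f2 f5
printf '%s\n' "$acc"
```

     Because each operator is applied to the accumulated result,
     "A and B or C not D" behaves like ((A and B) or C) not D.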

     Logical "or'ing" is implicit, thus:

          A B C

     is identical to:

          A or B or C

     Obviously, the keywords "and," "not," and "or" may not be
     queried for when using the implicit "or" query constructs.

     If the "-e" option is specified, then "exact match" queries
     will be performed; otherwise, a "partial key" type of search
     will be performed, which is the default. It is recommended that
     the "-e" option be used if the query involves any boolean
     operations.

     If the "-r" option is specified, then the words being queried
     for will be output before the list of file names that contain
     the queried words. This output format is compatible with
     egrep(1), and is useful in doing "relevance feedback" searches.

     If the "-rc" option is specified, then the count of records in
     a file that contain matches will be output with the file name
     containing the matches. This provides the system with a
     rudimentary "relevance feedback" capability. The original text
     files that were used to construct the inverted index file must
     be available in the system to use this option.

     If the "-rp" option is specified, then the records in a file
     that contain matches will be output after the file name
     containing the matches. This output format provides the system
     with a rudimentary "permuted index" type of "proximity
     retrieval." The original text files that were used to construct
     the inverted index must be available in the system to use this
     option.

     The write synopsis is:

          qt -w [-f index_name] [-1 | ... | -8] file1 file2 ...

     where file1 file2 ... are the file names that contain words
     that are to be added to the inverted index, or:

          qt -w [-f index_name] [-1 | ... | -8] < file_list

     where file_list is the name of a file that contains a list of
     file names, one file name per record, that contain words that
     are to be added to the inverted index. It is recommended that
     file names be given as absolute paths, starting from the
     system's root directory.
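
     One way to produce such a list of absolute file names is
     sketched here. The helper name list_abs and the scratch
     directory are assumptions made for the illustration; they are
     not part of qt.

```shell
#!/bin/sh
# list_abs DIR: print every regular file under DIR as an absolute path,
# one name per line. "list_abs" is a hypothetical helper, not part of qt.
list_abs() {
    # cd + pwd (in a subshell) turns a relative DIR into an absolute one;
    # find then prints absolute names because its starting point is absolute.
    abs=$(cd "$1" && pwd) || return 1
    find "$abs" -type f -print
}

# Demo on a small scratch directory (hypothetical layout).
demo=${TMPDIR:-/tmp}/qt_list_demo.$$
mkdir -p "$demo/sub" || exit 1
: > "$demo/sub/note.txt"

# Even when called with the relative name ".", the output is absolute.
file_list=$(cd "$demo" && list_abs .)
printf '%s\n' "$file_list"

rm -rf "$demo"
```

     Output saved this way (for example, list_abs /dir1/dir2 >
     file_list) has the one name per record layout that the second
     write synopsis expects. Keeping file_list outside the directory
     being listed keeps the list itself out of the index.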

     If the inverted index file does not exist, then it will be
     created, and will contain an index to all of the words in the
     input files. If the inverted index file exists, then the
     indices of all words in the input files will be added
     incrementally. Each word and file name pair will appear only
     once in the inverted index.

     The "-w" option specifies that write operations will be
     performed. It is mandatory for write operations, and must be
     used if and only if write operations are desired.

     The optional "-f index_name" specifies the inverted index
     file's name. If the "-f" option is not specified, the inverted
     index file name will default to "qt.index".

     The lexical analyzer level is specified by "-1" through "-8".
     If none is specified, the default, "-4", will be used. The
     lexical analyzers with larger numbers are, generally, more
     selective about the words that are placed in the inverted
     index. The lexical analyzers available are:

     1) Parses words and numbers. All other characters are omitted.
        Capitalization is preserved. Probably the best choice if
        non-word searches are important.

     2) Like 1) above, but the '_' character is recognized. This
        parser seems to work well with "C" program source files.

     3) Like 1) above, but only words of more than two characters
        are placed in the inverted index file. If capitalization is
        considered important in the search criteria, then this seems
        to be the best choice.

     4) Like 1) above, but capitalization is ignored. For general
        text where all words and numbers are considered significant,
        this seems to be the best choice. Also seems a good choice
        for "catman" pages. Queries should be in lowercase.

     5) Like 3) above, but capitalization is ignored. For general
        text, this seems to be the best choice. Queries should be in
        lowercase.

     6) Like 4) above, but words containing only numbers are omitted
        from the inverted index file. For text containing only
        words, this seems to be the best choice.
        Queries should be in lowercase.

     7) Like 4) above, but does not include Unix mail headers in the
        inverted index file. Each email should be in a separate
        file, as opposed to concatenated into folders. This seems to
        be the best choice for Unix mail files, if the header
        information is not desirable.

     8) Like 4) above, but deletes TeX and/or LaTeX commands from
        the inverted index file. This seems to be the best choice
        for TeX and LaTeX documents.

     The more selective the parser, the smaller the inverted index
     file. Multiple runs can be made, using the different parsers,
     to store words in the inverted index. For example, using
     parsers 3) and 4) would place both the capitalized and
     non-capitalized words in the index. This would not duplicate
     any words already in the index; it would only add the words
     that were different.

     The remove words synopsis is:

          qt -w -dw [-f index_name] word1 word2 ...

     where word1 word2 ... are the words that are to be deleted from
     the inverted index file. The words may be regular expressions;
     no '^' or '$' characters should be used, unless they are
     escaped. The "-w" option is mandatory.

     The remove files synopsis is:

          qt -w -df [-f index_name] file1 file2 ...

     where file1 file2 ... are the file names that are to be deleted
     from the inverted index file. The file names may be regular
     expressions; no '^' or '$' characters should be used, unless
     they are escaped. The "-w" option is mandatory.

     The version synopsis is:

          qt -v

     which will print the version number of qt.

     The help synopsis is:

          qt -h

     which will list a synopsis of the command semantics.

     A common example of writing an inverted index file would be:

          find /dir1/dir2 -type f -print | qt -w

     which would recursively descend through the directory
     hierarchy, and create an inverted index of all of the words in
     all of the files in all of the directories, starting with
     /dir1/dir2.
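
     To make the idea of an inverted index of unique word and file
     name pairs more concrete, here is a minimal sketch built from
     standard tools. It is not qt's implementation and does not
     reproduce qt's index file format: the tab-separated layout, the
     helper names index_file and lookup, and the sample files are
     all assumptions made for illustration. The word splitting only
     approximates what is described for lexical analyzer 4 (words
     and numbers, capitalization ignored).

```shell
#!/bin/sh
# index_file FILE: print one "word<TAB>FILE" line per distinct word in FILE.
# Words are runs of letters and digits, folded to lowercase.
index_file() {
    tr -cs '[:alnum:]' '[\n*]' < "$1" |   # non-alphanumeric runs become newlines
    tr '[:upper:]' '[:lower:]' |          # ignore capitalization
    sed '/^$/d' |                         # drop empty lines
    sort -u |                             # one entry per word
    while IFS= read -r word; do
        printf '%s\t%s\n' "$word" "$1"
    done
}

# lookup WORD INDEX: print the file names paired with exactly WORD.
lookup() {
    awk -F '\t' -v w="$1" '$1 == w { print $2 }' "$2"
}

# Demo on two hypothetical files in a scratch directory.
tmp=${TMPDIR:-/tmp}/qt_index_demo.$$
mkdir -p "$tmp" || exit 1
printf 'The cat sat.\n'   > "$tmp/a.txt"
printf 'A cat, THE dog\n' > "$tmp/b.txt"

# Merging with sort -u keeps every word/file pair unique, so indexing
# a.txt a second time adds nothing to the index.
for f in "$tmp/a.txt" "$tmp/b.txt" "$tmp/a.txt"; do
    index_file "$f"
done | sort -u > "$tmp/index"

cat_hits=$(lookup cat "$tmp/index")   # a.txt and b.txt
dog_hits=$(lookup dog "$tmp/index")   # b.txt only
printf '%s\n' "$cat_hits"

rm -rf "$tmp"
```

     A query for a word then reduces to reading the file names
     stored beside it, which is why the original text files do not
     need to be present for plain queries.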

     A common example of retrieving information from an inverted
     index file would be:

          more +/word `qt word`

     where the "more" program would page through the documents that
     contain "word," advancing to the next instance every time the
     'n' key is pressed.

     A common example of relevance determination in retrieving
     information from an inverted index file would be:

          egrep -ic `qt -r word` | sort -n -r -t: -k 2

     which would print the file(s) that contain "word," with the
     count of the records that contain "word" in each of the
     file(s), sorted with the highest counts first.

     As a simple application example, this program can be used to
     search the "catman" pages for a command that performs a
     specific function, even though the command's name is not known;
     that is, if you knew what you wanted to do, you could find the
     command that would do it.

     Comments and/or bug reports should be addressed to:

          john@email.johncon.com (John Conover)

     Known caveats:

     There is no concurrency control; it would be ill-advised to use
     this program as a concurrent application. Additionally, the
     natural language query does not support grouping operators.

     Applicability:

     The applicability of qt depends on the complexity of the
     search, the size of the database, the speed of the host
     environment, etc. As some general guidelines:

     1) For text files with a total size of less than 5 MB, standard
        egrep(1) queries of the text files will probably prove
        adequate.

     2) For text files with a total size of 5 MB to 50 MB, qt seems
        adequate for most queries. The significant issue is that,
        although the retrieval execution times are probably adequate
        with qt, the database write times are not impressive.

     3) For text files with a total size that is larger than 50 MB,
        or where concurrency is an issue, it would be appropriate to
        consider one of the alternatives listed in "Related
        information retrieval software:," below.

     References:

     1) "Information Retrieval, Data Structures & Algorithms,"
        William B.
        Frakes, Ricardo Baeza-Yates, Editors, Prentice Hall,
        Englewood Cliffs, New Jersey 07632, 1992, ISBN
        0-13-463837-9. The sources for many of the algorithms
        presented in 1) are available by ftp,
        ftp.vt.edu:/pub/reuse/ircode.tar.Z

     2) "Text Information Retrieval Systems," Charles T. Meadow,
        Academic Press, Inc, San Diego, 1992, ISBN 0-12-487410-X.

     3) "Full Text Databases," Carol Tenopir, Jung Soon Ro,
        Greenwood Press, New York, 1990, ISBN 0-313-26303-5.

     4) "Text and Context, Document Processing and Storage," Susan
        Jones, Springer-Verlag, New York, 1991, ISBN 0-387-19604-8.

     5) ftp think.com:/wais/wais-corporate-paper.text

     6) ftp cs.toronto.edu:/pub/lq-text.README.1.10

     7) "Unix Shell Programming," Lowell Jay Arthur, John Wiley &
        Sons, Inc., New York, 1990, ISBN 0-471-51820-4.

     Related information retrieval software:

     1) Wais, available by ftp, think.com:/wais/wais-8-b5.1.tar.Z

     2) lq-text, available by ftp,
        cs.toronto.edu:/pub/lq-text1.10.tar.Z

     The program, qt, is free software, and can be redistributed
     and/or modified, without any restrictions. It is distributed
     with no warranty of any kind, implied or otherwise.
     Specifically, there is no warranty of fitness for any
     particular purpose and/or merchantability.

SEE ALSO
     cat(1), cp(1), echo(1), egrep(1), join(1), look(1), mv(1),
     rm(1), sed(1), sort(1), sync(1), tr(1), uniq(1).

QT(1)                           7 Jan 92