
 

Software For Full Text Information Retrieval:

The Wgetrelx Program - Search Internet Web Pages For Documents That Are Relevant To Phonetic Search Criteria






SYNOPSIS

wgetrelx [-l N] [-n N] criteria <htmlfile | url1 url2...>

DESCRIPTION

Wgetrelx is an Internet Web page search engine, built on the programs htmlrelx(1) and wget(1). (The program wget(1) is available via anonymous ftp from ftp://prep.ai.mit.edu/pub/gnu/.) The direction of the search is controlled through determination of the relevance of the documents to the search criteria.

See the man page for htmlrelx(1) for additional information on the syntax of the criteria argument. The boolean operators supported are logical OR, logical AND, and logical NOT, represented by the symbols "|", "&", and "!", respectively; left and right parentheses, "(" and ")", are used as the grouping operators.
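
For example, a criteria of the following form, a hypothetical sketch only (the keywords are illustrative; see htmlrelx(1) for the exact syntax), would accept documents containing either "game" or "gaming", together with "theory"; the criteria must be quoted so that the shell does not interpret the "|", "&", and parenthesis characters:


  wgetrelx '(game | gaming) & theory' http://my.favorite.url
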

The documents searched are stored in a directory, www, residing in the directory where wgetrelx(1) was invoked. The output of the wgetrelx(1) script is a file, www.html, in the same directory. The structure and syntax of the file are compatible with the Netscape level 1 bookmark file specification; it can be browsed with Netscape, Mosaic, or Lynx, for example:


  lynx www.html
  Mosaic www.html
  Netscape www.html

        

EXAMPLE USAGE


  wgetrelx mycriteria http://my.favorite.url
  wgetrelx mycriteria www.html
      .
      .
      .
  wgetrelx mycriteria www.html
  netscape www.html
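
The iteration can also be automated with a small shell loop, a sketch only (the three passes and the criteria are illustrative; each pass descends one more level of links):


  # seed the search, then iterate, one link level per pass
  wgetrelx mycriteria http://my.favorite.url
  for pass in 1 2 3
  do
      wgetrelx mycriteria www.html
  done
  # browse the final, relevance ordered, list of URLs
  netscape www.html
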

        

SEARCH STRATEGIES

The shell script wgetrelx(1) uses the wget(1) program in conjunction with htmlrelx(1) to provide a flexible and extensible Internet HTML Web page search tool. The script may be altered to optimize search strategies. One of the advantages of relevance searching on the Internet is that the search of HTML links can be controlled by the relevance of the information contained in the HTML pages. This can be an iterative process, for example:


  wget -nc -r -l1 -H -x -t3 -T15 -P www -Ahtml,htm http://my.favorite.url

        

would "seed" the html page directory, www, with pages from the URL, http://my.favorite.url. Note that a search "context" has already been specified; doing a search, for example on game theory, by specifying a keyword of "game" to an Internet search engine would not produce the desired results. However, if the URL, http://my.favorite.url, was the Web pages for an economics department at a university, the "context" would be entirely different.

The next iteration of the search, going down another level in the hierarchy of links, might be:


  cd www
  htmlrelx criteria * > ../www.html
  cd ..

        

and the search iterated:


  wget -nc -r -l1 -H -x -t3 -T15 -P www -Ahtml,htm -i www.html

        

where the file www.html is a list of the URLs containing the information, in order of relevance, as specified by the criteria argument to htmlrelx(1). Since the URLs are ordered by relevance, with the most "promising" documents (i.e., those with the best probability of containing the information being searched for) listed first, the file www.html can be trimmed to, say, 10 URLs:


  cd www
  htmlrelx -n 10 criteria * > ../www.html
  cd ..

        

and the search iterated:


  wget -nc -r -l1 -H -x -t3 -T15 -P www -Ahtml,htm -i www.html

        

which would descend the search another level in the link hierarchy from http://my.favorite.url.

Alternatively, the file www.html can be edited and reordered, in an interactive fashion with each search, with any popular browser to enhance the search direction and capability. Note that the search criteria can be altered in the process, and, since the Web pages are stored on the local machine, the pages can be viewed "off line." Note, also, that the programs wget(1) and htmlrelx(1) are "portable," so the actual search can be run on a host that has a direct high speed connection to the Internet, and the file www.html transferred back to the local machine.
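
For example, a minimal sketch of running the search on a remote, well-connected host and copying the result back (the host name, remote directory, and criteria are hypothetical):


  # run the search on the remote host, then copy the result back
  ssh searchhost 'cd work && wgetrelx mycriteria http://my.favorite.url'
  scp searchhost:work/www.html .
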

One of the issues in searching the Internet is that the number of HTTP links that need to be searched increases exponentially with the number of HTTP pages that have already been searched; if the number of pages in the directory, www, is increasing exponentially, it is probably appropriate to constrain the search through alteration of the search criteria used for htmlrelx(1). (There are about three links, on average, on every HTML page.)
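
As a rough illustration of the growth, three links per page implies on the order of 3^N new pages at link depth N, e.g., roughly 27 pages three levels below a single seed page. One way to monitor the growth between iterations, a sketch only, is to count the pages accumulated in the www directory:


  # count the html pages fetched so far
  find www -type f \( -name '*.html' -o -name '*.htm' \) | wc -l
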

For exhaustive searches, the depth (the -l argument to both wget(1) and htmlrelx(1)) can be increased. For general searching, a depth of 3 will usually suffice, and only one iteration will be required. Typically, this will reduce the search time for specific information by approximately an order of magnitude.
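
For instance, a single invocation of the following form, a sketch only (the criteria and URL are illustrative), would search three levels of links deep and keep the 10 most relevant descriptors:


  wgetrelx -l 3 -n 10 mycriteria http://my.favorite.url
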


OPTIONS

-l N
Search to a depth of N links.

-n N
Output a maximum of N HTTP descriptors.

-v
Print the version and copyright banner of the program.


WARNINGS

In the interest of performance, memory is allocated to hold each entire file to be searched. Very large files may create resource issues.
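
A sketch of one way to identify unusually large files in the www directory before searching (the one megabyte threshold is arbitrary, and the size suffix assumes GNU find(1)):


  # list fetched files larger than about one megabyte
  find www -type f -size +1M
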

The "not" boolean operator, '!', can NOT be used to find the list of documents that do NOT contain a keyword or phrase, (unless used in conjunction with a preceeding boolean construct that will syntactically define an intermediate accept criteria for the documents.) The rationale is that the relevance of a set of documents that do NOT contain a phrase or keyword is ambiguous, and has no meaning-ie., how can documents be ordered that do not contain something? Whether this is a bug, or not, depends on one's point of view.


SEE ALSO

egrep(1), agrep(1), rel(1), htmlrelx(1)


DIAGNOSTICS

Error messages are printed for illegal or incompatible search patterns; for non-regular, missing, or inaccessible files and directories; and for (unlikely) memory allocation failures and signal errors.


AUTHORS


A license is hereby granted to reproduce this software source code and to create executable versions from this source code for personal, non-commercial use. The copyright notice included with the software must be maintained in all copies produced.

THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE. THE AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE THE INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.

So there.

Copyright © 1994-2011, John Conover, All Rights Reserved.


Comments and/or bug reports should be addressed to:

john@email.johncon.com

http://www.johncon.com/
http://www.johncon.com/ntropix/
http://www.johncon.com/ndustrix/
http://www.johncon.com/nformatix/
http://www.johncon.com/ndex/




