Prepared in 2003 at a client's request. Please notify us if you believe this comparison is in any way inaccurate or is now outdated.
Project Fundamentals
| | Last Modified | Known Security Holes | Security features | Program language | Specification of Files to Index | File Types Supported | Code availability, Licensing | Tech Support |
|---|---|---|---|---|---|---|---|---|
| HtDig | 2002-02-01 | Yes-beta* | Not known | C only | Site* | HTML, text * | GPL | mailing list |
| Webglimpse | 2003-05-16 | No | perl -T | C for indexing and search; | Site, Directory | HTML, text, PDF, MSWord, .gz, .zip * | open code, | Yes. Guaranteed successful install |
User Interface
| | Boolean queries | Phrase searches | Fuzzy/approx matching | Easy search interface with | Wildcard searches | Language Templates Available | Limit search by... | Re-Rank Hits (user choice of criteria) | Keyword Highlighting | Combine results from multiple archives |
|---|---|---|---|---|---|---|---|---|---|---|
| HtDig | Yes | No | Yes | No | No | English | URL pattern | No | No | No |
| Webglimpse | Yes | Yes | Yes | Yes | Yes | Hebrew, German, Spanish, Italian, French, Finnish, Norwegian, Portuguese and Estonian (Russian just received 6/02/03) | URL pattern or Subdirectory | Yes | Yes | Yes |
Administration
| | Web administration interface | Customizable Search output | Customizable Ranking formulas | Query Log (what are users searching for?) | Statistics on gathered pages | Email to administrator on index failure | Meta tag support |
|---|---|---|---|---|---|---|---|
| HtDig | No | Yes | No | No | No | Not known | Not known |
| Webglimpse | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
Technical details
| | Indexing algorithm | Options for Speed | Options for Size | Options for returned text | Platforms tested on |
|---|---|---|---|---|---|
| HtDig | Not known | Not known | Not known | Not known; but depends on retrieving all files, even local ones. According to the htdig site, the database may frequently be larger than the actual files indexed. | Linux, Solaris, SunOS, HP-UX, IRIX, FreeBSD, Mac OS X |
| Webglimpse | block-level inverted index | caching of search results; ability to limit number of hits returned for extremely fast search (under 1 s on 2 GB of data) | Tiny, medium or large index; pre-filtering of files. Index typically takes 5-15% of total file size. Local files do not need to be gathered. | find sentences; limit by chars; limit by lines | Linux, Solaris, SunOS, HP-UX, FreeBSD, AIX, IRIX, OSF, Mach, Mac OS X |
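The "block-level inverted index" entry above can be illustrated with a small sketch. This is only an illustration of the general technique, not Webglimpse's actual code: the function names, the `files_per_block` parameter and the in-memory `files` dictionary are all hypothetical. The idea is that the index records which block (group of files) contains each word rather than every exact position, which keeps the index small; a search then scans only the candidate blocks.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of letters and digits.
WORD = re.compile(r"[a-z0-9]+")


def build_block_index(files, files_per_block):
    """Group files into blocks and map each word to the block ids containing it.

    `files` is a dict of file name -> text (an in-memory stand-in for real files).
    """
    names = sorted(files)
    blocks = [names[i:i + files_per_block]
              for i in range(0, len(names), files_per_block)]
    index = defaultdict(set)
    for block_id, block in enumerate(blocks):
        for name in block:
            for word in WORD.findall(files[name].lower()):
                index[word].add(block_id)  # store the block, not the position
    return blocks, index


def search(word, files, blocks, index):
    """Look up candidate blocks in the index, then scan only those blocks."""
    word = word.lower()
    hits = []
    for block_id in sorted(index.get(word, ())):
        for name in blocks[block_id]:
            if word in WORD.findall(files[name].lower()):
                hits.append(name)
    return hits
```

Larger blocks give a smaller index at the cost of more scanning per query, which is one plausible reading of the "tiny, medium or large index" option in the table.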
Some Users
| | |
|---|---|
| HtDig | NASA, Tennessee Valley Authority, Valley Internet, Together Networks, many Linux and GNU-related sites, many universities. |
| Webglimpse | NASA, Los Alamos Natl Labs, Altohiway, Texas Workforce Commission, Baystate Health System, Intel, Hewlett-Packard, AT&T, many small businesses, universities, and government agencies |
Notes
HtDig has a known security hole in the latest beta version, 3.2.0b3, which is currently downloadable from http://htdig.org. The hole is fixed in the latest stable version, 3.1.6, and in the code snapshot. The previous stable version, 3.1.5, also had the security hole. The beta version with the known security problem has apparently been available for download since 2001-10-15. According to the HtDig notes on this problem, "This hole can allow remote users to read any file on your system that the UID running your webserver can read."
HtDig selects the files to index by gathering links from one or more starting URLs. It will gather links that are on the same site as the starting ones by matching a simple set of string patterns.
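As a rough illustration of Site-style gathering, the sketch below extracts links from one page and keeps only those that match a list of string prefixes. It is not HtDig's code; `LinkExtractor`, `same_site_links` and the `allowed_prefixes` parameter are hypothetical names, and a simple "starts with" test is assumed to stand in for the string patterns mentioned above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html, allowed_prefixes):
    """Return absolute, de-duplicated links that start with an allowed prefix."""
    parser = LinkExtractor()
    parser.feed(html)
    kept = []
    for href in parser.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        on_site = any(absolute.startswith(p) for p in allowed_prefixes)
        if on_site and absolute not in kept:
            kept.append(absolute)
    return kept
```

A gatherer would apply this to each fetched page and queue the returned links, so off-site links are never followed.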
Webglimpse can index files by Site, essentially the same as HtDig; by Directory (all files within a specified directory on the server, whether or not they are linked); and by Tree (all files within a certain number of 'mouse clicks' or 'hops' of one or more starting points). Webglimpse can also include or exclude files by regexp patterns, and can accept information about synonymous virtual domains and alias directories so that it does not gather duplicate links.
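The Tree mode, the regexp filters and the alias handling described above can be sketched together as a hop-limited breadth-first traversal. This is a minimal sketch under stated assumptions, not Webglimpse's implementation: `gather_tree`, `canonical`, the `aliases` mapping and the `get_links` callback (which returns the links found on a page) are all hypothetical.

```python
import re
from collections import deque


def canonical(url, aliases):
    """Rewrite an alias prefix to its canonical prefix so duplicates collapse."""
    for alias, target in aliases.items():
        if url.startswith(alias):
            return target + url[len(alias):]
    return url


def gather_tree(start_urls, get_links, max_hops,
                include=None, exclude=None, aliases=None):
    """Return URLs reachable within `max_hops` links of the starting points.

    URLs rejected by the include/exclude regexps are neither kept nor followed.
    """
    aliases = aliases or {}
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    seen = set()
    gathered = []
    queue = deque((canonical(u, aliases), 0) for u in start_urls)
    while queue:
        url, hops = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        if include_re and not include_re.search(url):
            continue
        if exclude_re and exclude_re.search(url):
            continue
        gathered.append(url)
        if hops < max_hops:  # only follow links while inside the hop limit
            for link in get_links(url):
                queue.append((canonical(link, aliases), hops + 1))
    return gathered
```

Because every URL is canonicalised before the `seen` check, a page reached through a synonymous domain is gathered only once.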
According to the 'Features and Requirements' page on the http://htdig.org website, "Both HTML documents and plain text files can be searched. Searching of other file types will be supported in future versions." However, the FAQ area contains references to searching PDF files; these may apply only to the beta version, which is currently released with a security hole. It may be possible to index PDF files by getting the new beta code snapshot and using the xpdf add-on.
Webglimpse supports indexing any file that can be filtered to text by an external program. Free and reliable external
programs are known for PDF, MSWord, and all compressed file formats. By pre-filtering files before indexing (and
filtering on download) searches are quite fast even on these filetypes. Pre-filtering also saves a great deal of space
when indexing remote files. Several scripts to filter HTML tags are provided, including ones which convert HTML
character codes such as á =