BibGlimpse

BibGlimpse Supplement

BibGlimpse is a light-weight reprint manager for distributed literature research. For low-overhead ease of use, it features simple record addition by upload or copy-to-file (e.g., via NFS, SAMBA, or scp), followed by automatic retrieval of a matching bibliographic record from Medline for most PDF reprints (95%). BibGlimpse supports personal annotation of reprints and allows structured full-text queries (cf. BibGlimpse unique feature list and a feature comparison with other popular tools). BibGlimpse has been published in BMC Bioinformatics:

BibGlimpse: The case for a light-weight reprint manager in distributed literature research
Thomas Tuechler, Golda Velez, Alexandra Graf and David P Kreil
BMC Bioinformatics 2008, 9:406

You can test-drive BibGlimpse using a demo repository containing more than one hundred PubMed listed papers from open-access journals. To access the demo, follow the link below and log in with the user name and password provided:

BibGlimpse Demo Repository

user:	`guest`
pwd:	`guest`

We emphasize that we do not endorse or encourage redistribution of reprints beyond what is permitted for collaborative research. Note also that the reprints and annotations in this repository serve only for demonstration; they are meant as examples of the system's technical capabilities and nothing more.

Please do not hesitate to write with suggestions or if you have any difficulties. Contact: Thomas Tüchler, bibglimpse08 [at] boku.ac.at.

BibGlimpse Queries

The BibGlimpse literature manager incorporates the Webglimpse search engine, so BibGlimpse queries use the Webglimpse query syntax. Extensive documentation on Webglimpse is available, including a general description of the Webglimpse query syntax and a special section on structured queries. A more detailed description of the entire supported query syntax is given in the manual pages of the underlying glimpse search engine. Consider the following example:

campto# AND bibf=Prives

This will search for all papers that have words starting with campto anywhere in the full-text, bibliography, or annotation, and which contain Prives in the bibliographic record (e.g., as author). Here, the wildcard symbol is #, and the fields to which search terms can be restricted are

name: filename of the reprint PDF,
full: full text of the reprint PDF,
suppl: full text of supplementary material,
bibf: bibliographic record, i.e. either Medline or BibTeX,
anno: annotation created by users.
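
The query pattern above can be composed programmatically. The following is a minimal sketch in Python that only assembles query strings of the form shown in the example; the helper names field_term and and_query are hypothetical and not part of BibGlimpse or Webglimpse, and only the field names and the AND / field=term pattern are taken from this page.

```python
# Field names as listed above; the helper functions are illustrative only.
FIELDS = {"name", "full", "suppl", "bibf", "anno"}


def field_term(term, field=None):
    """Return a search term, optionally restricted to one of the fields above."""
    if field is None:
        return term  # unrestricted term
    if field not in FIELDS:
        raise ValueError(f"unknown field: {field!r}")
    return f"{field}={term}"


def and_query(*terms):
    """Join terms with AND, following the pattern of the example query."""
    return " AND ".join(terms)


# Rebuild the example: words starting with "campto" plus "Prives" in the record.
query = and_query(field_term("campto#"), field_term("Prives", "bibf"))
print(query)  # campto# AND bibf=Prives
```

The resulting string would be pasted into the BibGlimpse search box as-is; the sketch does not contact any server.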

When no field is specified, all fields are searched. In the search results, a query hit is displayed as follows:

R A P M T p73 induction after DNA damage is regulated by checkpoint kinases Chk1 and Chk2.

Urist M, Tanaka T, Poyurovsky MV, Prives C.

Genes Dev. 2004 Dec 15;18(24):3041-54.

... conflicting results on the existence of p73 mRNA regulation by genotoxicity ...

Here, the query was for genotoxicity. For each hit, the search result shows the paper's title, its authors, and the journal citation line. Links are provided to the full repository 'R'ecord and, when available, to the personal 'A'nnotation file and to the bibliographic information in 'M'edline and Bib'T'eX format. A link to the 'P'ubMed entry is also given.

When viewing a repository record, feel free to try editing the annotation field. Changes are indexed in the background and become searchable within a few minutes. You can also upload your own files. If you upload a PDF file with extractable text (some papers are scanned in as images and contain no text), the system will attempt to retrieve the corresponding Medline entry from PubMed automatically. For PubMed listed papers, this works about 95% of the time.

In a realistic setting of a group of collaborating researchers, individual researchers are likely to have access to the file system where the reprints are stored, either via NFS or SMB/SAMBA mounts, or via ssh/scp. In that case, instead of using a web upload, files can simply be dropped or copied into any directory in the repository tree. They will be picked up during the next indexing run (on our system, this runs at least once per hour).

Local installation of BibGlimpse

BibGlimpse is designed as an integral extension to Webglimpse. Detailed installation instructions for BibGlimpse are available from the installation documentation. In brief, just download and unpack the BibGlimpse distribution package, define the installation path and run the installation script ./BibGlimpse.SETUP.

BibGlimpse was tested on different Linux distributions, including Ubuntu 8.04.1, Fedora 9 and openSUSE 11.0. For these, we also provide dedicated scripts for failsafe installation of all packages required for compiling the glimpse sources. See the setup examples pages for more details.

For test-driving BibGlimpse on Windows, a package installing Cygwin and running BibGlimpse on it is also available. Please refer to the setup with Cygwin page for detailed instructions.

If, after a 30-day trial, you wish to continue using the Webglimpse engine, you need to obtain a license, which is free of charge for non-profit or academic use.

Automated retrieval of bibliographic records from Medline

A key feature of BibGlimpse is the automated retrieval of bibliographic records from Medline. To achieve satisfactory performance, a test corpus of 1005 PubMed listed PDF reprints from 194 different journals was compiled from real-world reprint collections. The scope of this test corpus ranges from Bioinformatics to Drosophila Genetics, from microarray production to human malnutrition. BibGlimpse successfully matches the correct Medline records to 95% (955) of these 1005 reprints. Only 0.5% (5) spurious Medline mappings were observed, while the remaining 4.5% (45) were tagged as not-found. We provide an annotated table of contents of this corpus that lists, for all papers tested, the PubMed ID (PMID), article title, journal name, publication date, MD5 checksum of the PDF file, and our retrieval results.
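
As a quick consistency check of the figures above, the following short Python sketch recomputes the percentages from the reported counts. The counts are those stated in the text; the dictionary keys and the function name are merely illustrative.

```python
TOTAL = 1005  # reprints in the test corpus
counts = {"matched": 955, "spurious": 5, "not_found": 45}


def percentages(counts, total):
    """Return each count as a percentage of total, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# The three outcomes account for every reprint in the corpus.
assert sum(counts.values()) == TOTAL

for label, pct in percentages(counts, TOTAL).items():
    print(f"{label}: {counts[label]} ({pct}%)")
```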

For Medline retrieval, a support-vector-machine (SVM) classifier selects text lines as candidates for author queries. To train and evaluate this SVM, an appropriate dataset was constructed, consisting of 2328 randomly selected text lines extracted from PDF reprints. Of these text lines, 185 were manually tagged as 'author' lines. This dataset is also available for download: SVMdataset.txt

The SVM classification relies on 8 characteristic features that were extracted for each line using regular expressions, including: number of commas (',') per word, number of initials per word, footnotes per word, min(6, distance to the next empty line), all capital letters (true or false), contains an asterisk (true or false), and contains a colon (true or false). We provide our mapping of the dataset into feature space: SVMfeatureset.oct
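
To illustrate how such per-line features might be computed, here is a minimal Python sketch covering the features named above. It is not the BibGlimpse implementation: the regular expressions for initials and footnote markers, the treatment of the end of the text as "no empty line", and all function and key names are assumptions made for this example.

```python
import re

# Hypothetical patterns; the actual expressions used by BibGlimpse are not given here.
INITIAL_RE = re.compile(r"\b[A-Z]\.?(?=\s|,|$)")  # a lone capital letter, e.g. "P" or "C."
FOOTNOTE_RE = re.compile(r"(?<=[A-Za-z])\d+")     # digits glued to a word, e.g. "Kreil1"


def line_features(lines, i):
    """Compute illustrative features for lines[i] within a list of text lines."""
    line = lines[i]
    n_words = max(len(line.split()), 1)  # guard against division by zero

    # Distance (in lines) to the next empty line, capped at 6.
    # Assumption: if no empty line follows within reach, the cap applies.
    distance = 6
    for offset in range(1, 6):
        j = i + offset
        if j < len(lines) and not lines[j].strip():
            distance = offset
            break

    return {
        "commas_per_word": line.count(",") / n_words,
        "initials_per_word": len(INITIAL_RE.findall(line)) / n_words,
        "footnotes_per_word": len(FOOTNOTE_RE.findall(line)) / n_words,
        "distance_to_empty": distance,
        "all_caps": line.isupper(),
        "has_asterisk": "*" in line,
        "has_colon": ":" in line,
    }


page = ["Urist M, Tanaka T, Poyurovsky MV, Prives C.", "Genes Dev. 2004", "", "Abstract"]
print(line_features(page, 0))
```

For the author line in this toy input, the comma and initial densities are both 3 per 8 words, whereas the citation line below it scores zero on both, which is the kind of contrast a classifier can exploit.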

To train the SVM, 20 of the author lines and 20 of the remaining text lines were chosen at random. Classification performance, in terms of true and false positive author lines, was evaluated on the remaining data points.

In a first step, the two tuning parameters of the SVM, the Radial Basis Function (RBF) kernel parameter b and the slack variable weighting parameter C, were optimized on such randomly chosen training samples. While maximizing classification performance, we minimized the number of support vectors to avoid over-fitting. An optimal parameter set was obtained as b = 25 and C = 250.

In a second step, an optimal set of support vectors was determined. Again, random sets of 20 author and 20 non-author lines were used for training, and each resulting SVM was evaluated against the remaining dataset. From 2000 such randomly trained SVMs, we selected one with 23 support vectors that achieved 95% true positives and 8% false positives. Reassuringly, typical other SVMs achieve a true positive rate of 88% at 10% false positives. This indicates a good choice of feature space and SVM parameters and provides reasonable evidence that the classifier is not over-fitting the data. The distribution of these 2000 randomly trained SVMs is shown in this ROC plot.

Note that for the particular application of constructing queries for the retrieval of bibliographic records, false positives in flagging 'author' lines only affect search time, not accuracy, because all hits are verified by reverse search: if a PubMed hit and the PDF text in question do not match, the hit is excluded, which avoids incorrect bibliography assignments.
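
The random-subsample evaluation described above can be sketched as follows. This Python example covers only the data split and the computation of true and false positive rates; the SVM itself is omitted, and any classifier's predictions could be plugged in. The function names are hypothetical, and the label list is a toy stand-in that merely reproduces the class sizes of the dataset (185 author lines among 2328).

```python
import random


def split_indices(labels, n_per_class=20, rng=None):
    """Draw n_per_class author and non-author lines for training; hold out the rest."""
    rng = rng or random.Random()
    author = [i for i, is_author in enumerate(labels) if is_author]
    other = [i for i, is_author in enumerate(labels) if not is_author]
    train = set(rng.sample(author, n_per_class) + rng.sample(other, n_per_class))
    held_out = [i for i in range(len(labels)) if i not in train]
    return sorted(train), held_out


def rates(labels, predictions, indices):
    """Return (true positive rate, false positive rate) over the given indices."""
    positives = sum(1 for i in indices if labels[i])
    negatives = len(indices) - positives
    tp = sum(1 for i in indices if labels[i] and predictions[i])
    fp = sum(1 for i in indices if not labels[i] and predictions[i])
    return tp / positives, fp / negatives


# Toy labels with the class sizes reported above (True marks an 'author' line).
labels = [True] * 185 + [False] * (2328 - 185)
train, held_out = split_indices(labels, rng=random.Random(0))
print(len(train), len(held_out))  # 40 2288
```

Repeating the split with different random seeds, training a classifier on each training set, and collecting the resulting rate pairs would yield a distribution of operating points like the one summarized in the ROC plot.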
