BibGlimpse is a light-weight reprint manager for distributed
literature research. For low-overhead ease of use, it features simple
record addition by upload or copy-to-file (e.g.,
via NFS, SAMBA, or scp), followed by automatic retrieval of a
matching bibliographic record from Medline for most PDF reprints
(95%). BibGlimpse supports personal annotation of reprints and allows
structured full-text queries (cf. BibGlimpse unique feature
list and a feature
comparison with other popular tools). BibGlimpse has been
published in BMC Bioinformatics:
BibGlimpse: The
case for a light-weight reprint manager in distributed literature
research
Thomas Tuechler, Golda Velez, Alexandra Graf and David P Kreil
BMC Bioinformatics 2008, 9:406
You can test-drive BibGlimpse using a demo repository containing more than one hundred PubMed-listed papers from open-access journals. To access the demo, follow the link below and log in with the user name and password provided:
user: guest
pwd: guest
We wish to emphasize that we do not endorse or encourage redistribution of reprints beyond what is permitted for collaborative research. Also note that both the reprints and the annotations in this repository serve only for demonstration and are meant as nothing more than examples of the technical capabilities of the system.
Please do not hesitate to write with suggestions or if you have any
difficulties. Contact: Thomas Tüchler,
bibglimpse08 [at] boku.ac.at.
The BibGlimpse literature manager incorporates the Webglimpse search engine, so BibGlimpse queries use the Webglimpse query syntax. Extensive documentation on Webglimpse is available, including a general description of the Webglimpse query syntax. A special section deals with structured queries. A more detailed description of the entire supported query syntax is given in the manual pages of the underlying glimpse search engine. Consider the example below:
campto# AND bibf=Prives
This will search for all papers that have words starting with campto anywhere in the full-text, bibliography, or annotation, and which contain Prives in the bibliographic record (e.g., as author). Here, the wildcard symbol is #, and the fields to which search terms can be restricted are
When no field is specified, all fields are searched. In displaying search results, a query hit is shown as below:
R A P M T p73 induction after DNA damage is regulated by checkpoint kinases Chk1 and Chk2.
Urist M, Tanaka T, Poyurovsky MV, Prives C.
Genes Dev. 2004 Dec 15;18(24):3041-54.
... conflicting results on the existence of p73 mRNA regulation by genotoxicity ...
Here, the query was for genotoxicity and the search result
shows, for each hit, the paper's title, its authors, and the
journal citation line. Links are provided to the full repository
'R'ecord and, when available, the personal
'A'nnotation file and bibliographical information in
'M'edline and Bib'T'eX format. A link to the
'P'ubMed entry is also given. When viewing a repository record, feel free
to try editing the annotation field. Changes will be indexed in the
background and are available in a few minutes. You can also upload
your own files. If you upload a PDF file with extractable text (some
papers are scanned in as images and do not contain any text), the
system will attempt to automatically retrieve the corresponding
Medline entry from PubMed. For PubMed listed papers, this works about
95% of the time. In a realistic setting of a group of collaborating
researchers, individual researchers are likely to have access to the
file system where the reprints are stored, either via NFS or SMB/SAMBA
mounts, or via ssh/scp. In that case, instead of a web upload, files
can simply be dropped or copied to any directory in the repository
tree. They will be picked up during the next indexing process (on our
system, this runs at least once per hour).
BibGlimpse is designed as an integral extension to Webglimpse. Detailed installation instructions for BibGlimpse are available from the installation documentation. In brief, just download and unpack the BibGlimpse distribution package, define the installation path and run the installation script ./BibGlimpse.SETUP.
BibGlimpse was tested on different Linux distributions, including Ubuntu 8.04.1, Fedora 9, and openSUSE 11.0. For these, we also provide dedicated scripts for failsafe installation of all packages required for compiling the glimpse sources. See the setup examples pages for more details.
For test-driving BibGlimpse on Windows, a package installing Cygwin and running BibGlimpse on it is also available. Please refer to the setup with Cygwin page for detailed instructions.
If, after a 30-day trial, you wish to continue using the Webglimpse engine, you need to obtain a license, which is free of charge for non-profit or academic use.
A key feature of BibGlimpse is the automated retrieval of bibliographic records from Medline. To achieve satisfactory performance, a test corpus of 1005 PubMed-listed PDF reprints from 194 different journals was compiled from real-world reprint collections. The scope of this test corpus ranges from bioinformatics to Drosophila genetics, and from microarray production to human malnutrition. BibGlimpse successfully matches the correct Medline records to 95% (955) of these 1005 reprints. Only 0.5% (5) spurious Medline mappings were observed, while the remaining 4.5% (45) were tagged as not-found. We provide an annotated table of contents of this corpus that lists, for all papers tested, the PubMed ID (PMID), article title, journal name, publication date, MD5 checksum of the PDF file, and our retrieval results.
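As a quick arithmetic check of the figures above, the following sketch recomputes the percentages from the reported counts. The dictionary keys are illustrative labels chosen here, not names used by BibGlimpse.

```python
# Counts reported for the 1005-reprint test corpus.
total = 1005
counts = {"matched": 955, "spurious": 5, "not_found": 45}

# The three outcomes should account for every reprint in the corpus.
assert sum(counts.values()) == total

# Percentages rounded to one decimal place.
rates = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(rates)
```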
For Medline retrieval, a support-vector-machine (SVM) classifier selects text lines as candidates for author queries. To train and evaluate this SVM, an appropriate dataset was constructed, consisting of 2328 randomly selected text lines extracted from PDF reprints. 185 of these text lines were manually tagged as 'author' lines. This dataset is also available for download: SVMdataset.txt
The SVM classification relies on 8 characteristic features that were extracted for each line using regular expressions: number of commas (',') per word, number of initials per word, footnotes per word, min(6, distance to the next empty line), all capital letters (true or false), contains an asterisk (true or false), contains a colon (true or false). We provide our mapping of the dataset into feature space here: SVMfeatureset.oct
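To make the feature mapping more concrete, here is a minimal Python sketch that computes the seven features enumerated above for one line of text. It is not the BibGlimpse implementation: the function name `extract_features` and both regular expressions (what counts as an "initial" or a "footnote" marker) are assumptions made for illustration only.

```python
import re

# Assumption: an "initial" is a stand-alone token of one or two capital letters.
INITIAL_RE = re.compile(r"\b[A-Z]{1,2}\b")
# Assumption: a "footnote" is a digit run glued to a lowercase letter, or a dagger-like symbol.
FOOTNOTE_RE = re.compile(r"(?<=[a-z])\d+|[†‡§]")


def extract_features(lines, i):
    """Return the seven per-line features for lines[i] as a tuple."""
    line = lines[i]
    n_words = max(1, len(line.split()))  # avoid division by zero on empty lines

    # min(6, distance to the next empty line): look at most 5 lines ahead.
    dist = 6
    for j in range(i + 1, min(i + 6, len(lines))):
        if not lines[j].strip():
            dist = j - i
            break

    return (
        line.count(",") / n_words,                    # commas per word
        len(INITIAL_RE.findall(line)) / n_words,      # initials per word
        len(FOOTNOTE_RE.findall(line)) / n_words,     # footnotes per word
        dist,                                         # capped distance to empty line
        line.isupper(),                               # all capital letters
        "*" in line,                                  # contains an asterisk
        ":" in line,                                  # contains a colon
    )


sample = [
    "Urist M, Tanaka T, Poyurovsky MV, Prives C*",
    "Department of Biology",
    "",
    "ABSTRACT",
]
print(extract_features(sample, 0))
```

For the sample author line, three commas and four initials over eight words give 0.375 and 0.5, and the next empty line is two lines away.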
To train the SVM, 20 of the author lines and 20 of the remaining other
text lines were randomly chosen. Classification performance in terms
of true and false positive author lines was evaluated on the remaining
data points. In a first step, the two tuning parameters of the SVM,
the Radial Basis Function (RBF) kernel parameter b and the
slack variable weighting parameter C, were optimized based on
such randomly chosen training samples. While maximizing classification
performance, we minimized the number of support vectors to avoid
over-fitting. An optimal parameter set was obtained as b = 25
and C = 250. In a second step, an optimal set of support
vectors was determined. Again, random sets of 20 author and 20 non-author
lines were used for training and then evaluated against the remaining
dataset. From 2000 such randomly trained support vector machines,
we selected an SVM with 23 support vectors that achieved 95% true
positives and 8% false positives. Reassuringly, a typical SVM from this
set achieves a true-positive rate of 88% at 10% false positives. This
indicates a good choice of feature space and SVM parameters and
provides reasonable evidence that the classifier is not over-fitting
the data. The distribution of these 2000 randomly trained SVMs is
shown in this ROC plot.
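The sampling and scoring steps of this evaluation protocol can be sketched in plain Python as follows. The sketch deliberately leaves out the SVM itself: it only draws 20 author and 20 non-author lines for training and computes true and false positive rates from predictions on held-out data. The function names `split_indices` and `tpr_fpr` are hypothetical and do not come from the BibGlimpse sources.

```python
import random


def split_indices(labels, n_per_class=20, rng=None):
    """Draw n_per_class training indices from each class; the rest is held out."""
    rng = rng or random.Random()
    pos = [i for i, is_author in enumerate(labels) if is_author]
    neg = [i for i, is_author in enumerate(labels) if not is_author]
    train = set(rng.sample(pos, n_per_class)) | set(rng.sample(neg, n_per_class))
    held_out = [i for i in range(len(labels)) if i not in train]
    return sorted(train), held_out


def tpr_fpr(y_true, y_pred):
    """True and false positive rates; assumes both classes occur in y_true."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    n_pos = sum(1 for t in y_true if t)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, fp / n_neg


# Label vector with the class sizes of the dataset: 185 author lines out of 2328.
labels = [True] * 185 + [False] * (2328 - 185)
train, held_out = split_indices(labels, rng=random.Random(0))
print(len(train), len(held_out))
```

Repeating the split with different random seeds, training a classifier on each training set, and collecting the `tpr_fpr` results on the held-out lines would yield one point per classifier in a plot of true versus false positive rates.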
Note that for the particular application of constructing queries for
the retrieval of bibliographic records, false positives in flagging
'author' lines are only problematic with respect to search
time, but not with respect to accuracy because all hits are verified
by reverse search. Through this cross-check, a PubMed hit that does not
match the PDF text in question is excluded, which avoids incorrect
bibliography assignments.
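The idea of such a cross-check can be illustrated with a short Python sketch. The matching criterion used here, that a minimum fraction of the title words of the PubMed hit must occur in the PDF text, is an assumption chosen for illustration; the criterion actually used by BibGlimpse is not described on this page. The names `hit_matches_pdf` and `min_fraction` are likewise hypothetical.

```python
import re


def words(text):
    """Lower-cased set of alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def hit_matches_pdf(title, pdf_text, min_fraction=0.8):
    """Accept a candidate hit only if enough of its title words occur in the PDF text.

    Assumed criterion for illustration, not the BibGlimpse rule.
    """
    title_words = words(title)
    if not title_words:
        return False
    found = title_words & words(pdf_text)
    return len(found) / len(title_words) >= min_fraction


pdf_text = (
    "p73 induction after DNA damage is regulated by checkpoint kinases "
    "Chk1 and Chk2. Urist M, Tanaka T, Poyurovsky MV, Prives C."
)
good = "p73 induction after DNA damage is regulated by checkpoint kinases Chk1 and Chk2."
bad = "Drosophila wing development"
print(hit_matches_pdf(good, pdf_text), hit_matches_pdf(bad, pdf_text))
```

With a check of this kind, a false-positive 'author' line costs only an extra query, because a candidate record that does not match the PDF text is rejected rather than assigned.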