This chapter describes the technical side of the TULIP project, which was the major focus of at least the first half of the project. The lessons learned here have already had an important impact on the directions Elsevier Science is taking, as well as on the implementation of the digital library (components) at the participating universities. Because we think that some of these lessons could be equally valuable for other institutions looking to get started in this field, we have summarized them below and prepared some general guidelines and a checklist (appendix X).
To give a better feel for the size of the TULIP project, we have listed some key numbers in the box below.
Key Numbers in TULIP program
Number of journal titles: 43
Additional number of journal titles in 1995: 40
Number of Datasets: 99
- in original titles: 86
- in additional journal titles: 13
Number of issues: 2784
- in original titles: 2427
- in additional journal titles: 357
Number of articles: 74096
- in original titles: 66569
- in additional journal titles: 7527
Number of pages: 536946
- in original titles: 470040
- in additional journal titles: 66906
Storage space per Dataset
- Average Dataset: 380 Megabytes
- Biggest Dataset: 530 Megabytes
Storage space per issue
- Average page image: 72 Kilobytes
- Average issue: 14 Megabytes (+/- 193 pages)
- Largest issue: 86 Megabytes (1200 pages)
Total storage space (approximate): 39 Gigabytes
Average transmission time per Dataset *): 6h 42m
- Quickest school average: 4h 21m
- Slowest school average: 11h 2m
*) raw throughput time; failures and overhead excluded
Number of TULIP universities: 9
Number of different solutions (including abandoned and prototype versions)
- Number of MS Windows implementations: 2
- Number of X-Windows implementations: 6
- Number of Telnet implementations: 3
- Number of Gopher implementations: 1
- Number of WWW implementations: 4
II 1. Production at Elsevier Science
This section describes the phases of production at Elsevier Science and its partners to generate and deliver electronic journal issues for the TULIP project. Production can be broken down into three main phases:
The (traditional) production from manuscript to journal issue.
The scanning and capturing of journal issues in full TULIP Datasets.
The customization and delivery of Datasets to the libraries involved in TULIP.
II 1.1. Description of paper journal production
“Traditional” paper journal production is done in several phases. We briefly describe the steps from the manuscript written by the author to the journal issue received by the librarian. Variations exist for particular types of publications, but most articles are handled in the following order:
1. The author creates the manuscript and submits it to the editor of a journal.
2. The editor distributes the manuscript to a small number of reviewers to assess its scientific value. Based on their reports and possibly subsequent modifications by the author, the editor decides to accept or reject the manuscript for publication in the journal. Accepted manuscripts are submitted to the publisher.
3. The accepted manuscript is received by the publisher, where the details are entered in the tracking system which monitors production flow. Electronically delivered manuscripts (nicknamed “compuscripts”) are converted to a generic markup format (SGML).
4. The text is marked up according to the style of the journal, and spelling and other checks are done. “Anchors” are inserted for the artwork.
5. Artwork (figures, charts, photographs, etc.) is prepared and converted to electronic files by conversion, redrawing or high quality scanning.
6. The resulting manuscript including artwork is proofed by the typesetter to obtain a draft version of the article, which is mailed to the author for approval.
7. The corrections returned by the author are inserted in the article text or artwork files. The article is ready for issue assembly.
8. Based on the publication scheme (weekly, biweekly, monthly, etc.) of the journal, finished articles are assembled in a journal issue. Page numbers are assigned; a table of contents, author and subject index, and other editorial material are added; and the full issue is sent to the typesetter.
9. The typesetter produces high quality Camera Ready Copy (CRC) on film or print plate for further processing.
10. The CRC material is printed in “quires” (sixteen pages on one piece of paper, to be folded and cut) and these are collected and bound into journal issues. Packages of printed issues are dispatched to the publisher’s distribution center.
11. The distribution center packages the issues into envelopes with address labels. The issues are bundled by geographical area and dispatched to customers by mail, airmail or courier service (based on subscription arrangements).
II 1.2. Description of production of electronic version
II 1.2.1 Introduction
As a result of historical technological differences and more recent acquisitions, there was no single production flow mechanism covering all journal titles within the Elsevier Science organization in 1992. Elsevier Science consists of several larger or smaller publishing offices, each with its own publishing portfolio, working procedures and technologies as well as its own third-party suppliers such as typesetting, artwork preparation and printing companies. For instance, the 43 initial journals in TULIP were published by four different offices in Amsterdam (The Netherlands), Oxford (UK), Lausanne (Switzerland) and New York (USA). These 43 titles are typeset by some 18 different typesetting companies. In 1992, PostScript had not gained the popularity it currently has for scientific typesetting, hence both PostScript and non-PostScript routines were used.
At present, Elsevier Science is consolidating all these different production methods to streamline the output into standard electronic formats, such as SGML, PostScript, PDF, JPEG and TIFF. These formats then become the basic material not only to produce paper versions of the journals in their most appropriate typeset form, but also to provide “real” electronic versions of the journals based on SGML. In the future, paper should become the derivative of the electronic journal, the reverse of the situation in TULIP.
As an intermediate step, however, for the TULIP project we used the paper version of the journals to produce scanned images as the electronic form of the journals. It was decided that it would be preferable to use and digitize the finished paper product (end of phase 10, see above) rather than trying to obtain production files in a diversity of (sometimes proprietary) electronic formats from these different sources. This was based on the following observations:
The material from the production files is not a “clean cut” version, i.e. the pages have an extra margin containing crop marks and production remarks, which is cut away later.
The cover is not the final one. In a number of cases the “blank” color cover is held at the printer in a large stock on which only the volume, issue and cover date information is added per issue.
Some of the material is in the form of plastic film with transparent text on a black background, impossible to process with currently used plain paper page scanners.
The typesetter and printer are changing to electronic delivery of CRC. This exchange is based on PostScript, but the resulting files are totally unusable for TULIP, since these are imposition prints with e.g. a complete one sided quire with half of the pages upside down in one large print file.
Typesetters and printers work with tight production schedules. There was a hazard that the introduction of new requirements would disturb their production flow.
II 1.2.2 The Dataset
TULIP is based on electronic subscription-based regular delivery of large volumes of journal information. No standard existed for this kind of delivery. Available standards for document delivery dealt with the concept of demand-driven single document requests, which appeared to be inapplicable for supply-driven electronic delivery of entire journal issues. Therefore, in TULIP’s first year the concept of the TULIP Dataset was created in close cooperation with the different internal production and delivery partners, and with the technical coordinators at the universities.
A Dataset holds a number of page image and text files from several journal issues, collected biweekly in this particular case. The structure and format of a Dataset follow the ISO 9660 Mode 1 standard for CD-ROM mastering, which is more or less similar to regular MS-DOS conventions for file names and directory structures. This convention can easily be used on other computer platforms, such as Apple Macintosh and UNIX operating systems.
The directory-structure of the Dataset directly reflects the division into journals (identified by their International Standard Serial Number - ISSN), journal issues and pages/articles. The files in the Dataset are page images, “raw” ASCII files, SGML-coded citation files and a master index containing bibliographic information and pointers connecting the files.
Every page from a journal issue (cover-to-cover) corresponds with a page image file. These are standard black/white single-page Tagged Image File Format (TIFF) files with a resolution of 300 dots per inch. The page sizes differ from journal to journal. The maximum size is European A4, i.e. 21 × 29.7 cm or 8.27 × 11.69 inches. The compression method used is the International Telecommunication Union (ITU; formerly known as CCITT) Fax Group IV encoding scheme, with which it is possible to reduce the average 1 megabyte image to a TIFF file of 80 kilobytes (Kb).
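The figure of roughly 1 megabyte per uncompressed page can be checked with a short calculation. The sketch below assumes a full A4 page, one bit per pixel, and decimal kilobytes (1 Kb = 1000 bytes); the variable names are illustrative only.

```python
# Rough size of an uncompressed bi-level A4 page scanned at 300 dpi.
DPI = 300
WIDTH_IN, HEIGHT_IN = 8.27, 11.69   # European A4 in inches, as given in the text
COMPRESSED_KB = 80                  # typical Group IV TIFF size given in the text

width_px = round(WIDTH_IN * DPI)        # pixels across the page
height_px = round(HEIGHT_IN * DPI)      # pixels down the page
raw_bytes = width_px * height_px // 8   # bi-level: one bit per pixel
raw_kb = raw_bytes / 1000               # assumption: decimal kilobytes
ratio = raw_kb / COMPRESSED_KB          # approximate compression ratio

print(f"{width_px} x {height_px} pixels, {raw_kb:.0f} Kb raw, ratio about {ratio:.1f}:1")
```

Under these assumptions the raw bitmap is a little over 1 megabyte, so the 80 Kb compressed file corresponds to a compression ratio somewhat above 10:1.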
Every page has a corresponding text file with the full text. This ASCII file is the result of Optical Character Recognition (OCR). These files are provided in unedited (“raw”) format, since no further keyboarding/editing/spell checking is performed on them.
Every editorial item (e.g. full length scientific article, product review, correspondence letter, editorial note, etc.) has a corresponding Standard Generalized Markup Language (SGML) file, holding the full bibliographic information (e.g. title, authors, abstract, keyword, page range, etc.). The Document Type Definition (DTD) for these SGML files is Elsevier’s Full Length Article DTD. These SGML files were initially not part of the TULIP Datasets, but were added in 1995 when TULIP production became part of the Elsevier Electronic Subscriptions production.
Each Dataset has one master index file, the so-called DATASET.TOC file, which holds complete bibliographic information as well as all relevant cross reference data, i.e. which page images are related to which articles and which articles are in a particular journal issue.
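The cross references held in the master index (journal issue to articles, article to page images) can be pictured as a small nested data structure. The sketch below is not the DATASET.TOC file format, which is not reproduced here; the class names, field names, file names and identifier values are all hypothetical and serve only to illustrate the relationships.

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    item_id: str                 # unique item identifier (hypothetical value)
    title: str
    page_files: list = field(default_factory=list)  # page image file names

@dataclass
class Issue:
    issn: str                    # journal identifier
    issue_seq: int               # generic issue sequence number
    articles: list = field(default_factory=list)

def pages_for_article(issues, item_id):
    """Return the page image files that belong to one article."""
    for issue in issues:
        for article in issue.articles:
            if article.item_id == item_id:
                return article.page_files
    return []

# Placeholder data, not taken from a real Dataset.
sample = [
    Issue(issn="0000-0000", issue_seq=1, articles=[
        Article("ITEM-0001", "First article", ["p0001.tif", "p0002.tif"]),
        Article("ITEM-0002", "Second article", ["p0003.tif"]),
    ])
]
found = pages_for_article(sample, "ITEM-0002")
```

A lookup such as `pages_for_article(sample, "ITEM-0002")` then answers the question the index is meant to answer: which page images make up a given article.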
An important characteristic of the TULIP project is the large storage requirement. Traditional text oriented database systems normally deal with small, plain-text bibliographic records, averaging 2 to 3 Kb (with abstract). The information delivered in TULIP averages 840 Kb per article, roughly 400 times this size.
The storage requirements for a typical journal issue, holding 20 articles plus cover pages, editorial notes, table of contents, etc. on 200 pages, are approximately 17 megabytes (Mb):
A single page TIFF image file takes about 80 Kb, 200 pages are approx. 16 Mb.
The corresponding unedited (“raw”) OCR-quality ASCII file takes up 4 Kb, that is 800 Kb for all 200 pages.
The SGML file, which holds the bibliographic data of a given article, is on average 4 Kb. For 20 articles this amounts to 80 Kb.
The index data comprises some 4 Kb per article, adding up to 80 Kb for an average issue.
The TULIP titles publish on average 14 issues per year, which amounts to 238 Mb per journal per year. However, there is a wide variation in volume and frequency between the journal titles.
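The storage estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text and assumes decimal units (1 Mb = 1000 Kb); the constant names are illustrative.

```python
# Storage estimate for a typical issue, using the figures quoted in the text.
PAGES_PER_ISSUE = 200
ARTICLES_PER_ISSUE = 20
TIFF_KB_PER_PAGE = 80        # page image
OCR_KB_PER_PAGE = 4          # raw ASCII text
SGML_KB_PER_ARTICLE = 4      # bibliographic SGML file
INDEX_KB_PER_ARTICLE = 4     # share of the master index
ISSUES_PER_YEAR = 14

issue_kb = (PAGES_PER_ISSUE * (TIFF_KB_PER_PAGE + OCR_KB_PER_PAGE)
            + ARTICLES_PER_ISSUE * (SGML_KB_PER_ARTICLE + INDEX_KB_PER_ARTICLE))
issue_mb = issue_kb / 1000                      # assumption: 1 Mb = 1000 Kb
per_article_kb = issue_kb / ARTICLES_PER_ISSUE
year_mb = round(issue_mb) * ISSUES_PER_YEAR     # issue size rounded to 17 Mb

print(issue_kb, issue_mb, per_article_kb, year_mb)
```

The page images account for almost all of the total, which is why image compression and cropping matter far more for storage than the text and index files do.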
II 1.2.3 Overview of production steps
TULIP operational units transform printed journal issues into their electronic equivalents as follows:
Dispatch of journal issues to the scanning offices:
Journal issues for TULIP follow the usual internal production routines for typesetting, printing and binding. The Elsevier distribution centers ship the journal issues via air mail delivery to the TULIP scanning offices.
Receipt and verification of journal issues:
As soon as the journal issue arrives at the scanning office, it is checked for completeness and registered in the tracking system at the scanning office.
Page image scanning:
The spine of the issue is cut and the pages are fed into a high-volume (40 pages per minute) double-sided scanner. Care is taken that the order of the pages is correct, that the pages are not fed skewed and that the printed page numbers correspond with the actual pages, including verification of pages without page numbers (such as advertisements) and oddly numbered pages (such as roman-numbered pages iii, iv, xi). The pages are scanned at 300 dots per inch (dpi) and directly compressed by means of hardware-based compression boards in order to keep storage space manageable.
At the start of the project in 1992 the decision was made to base page images on 300 dots per inch, black/white TIFF according to the ITU/CCITT Fax Group IV compression scheme. This decision was based on a good ratio of image quality versus storage consumption and production costs. No affordable high-volume 600 dpi scanners were available at that time and the quality of text and line art was regarded as good, somewhere between office laser printer and photocopier quality. The implication of this choice is that color is not possible with this format and that the quality of photographs can be unsatisfactory, especially for photographs with little contrast such as electron micrographs.
Optical character recognition (OCR):
After the page images are scanned and cropped they are processed in a background process on a separate machine in the same network. This machine automatically picks up a page image and performs optical character recognition to generate the corresponding ASCII file. The OCR process takes longer per page image than the actual scanning. At TULIP’s start, hardware OCR equipment was used, but with the advent of faster machines and better algorithms OCR is now performed entirely in software and takes thirty seconds to a full minute per page. The quality of the resulting files has improved greatly in recent years, but mathematical symbols and other typographic codes remain a problem. Even now, there is no software available to handle complicated scientific texts successfully.
Editing of OCR texts to obtain SGML files and bibliographic records:
Page images and OCR texts are used in a production editor environment to generate the bibliographic records and the SGML files for each article. Because each journal title has a different presentation and layout for article elements such as titles, abstracts, article texts, etc., it is not possible to fully automate this process. Despite efforts in artificial intelligence, only a human being can comprehend the whole scope of a scientific article. Associating page images with articles, i.e. identifying the pages which “belong” to an article, is an integral part of the production editor process.
Collecting material in Datasets:
The TULIP production schedule results in biweekly delivery of Datasets. All material that has passed the final quality control step is collected into the directory structure of a Dataset. All edited bibliographic information and relevant cross reference data are compiled into the DATASET.TOC file, which is the primary index for each Dataset. A final quality control step is performed on the entire Dataset and checksum files are generated (for a description of this facility, see below). The resulting Dataset is finally copied onto a CD-Recordable disk. This disk was originally shipped by courier to the customizing and delivery office, located in the USA. Due to the problems in Internet file transfer procedures (see below), it was decided in 1994 to discontinue network delivery and to send CD-ROMs directly to the universities.
II 1.3. Description of customizing and Internet distribution
Elsevier Science worked together with Article Express*) to customize Datasets to reflect universities’ subscriptions and to deliver those customized Datasets via the Internet. The original implementation, agreed on by Elsevier and the collaborating universities in early 1993, was a so-called “push” model: after a Dataset had been received and customized, it was “pushed” from Article Express to all universities by means of a series of FTP scripts. Tracking and validation were based on the assumption that FTP could be used for the “push” model of delivering large Datasets with very many pieces.
*) Article Express was originally a combined operation of Engineering Information, Inc. and Dialog Inc. It was later run by Dialog alone, after Engineering Information was bought out.
II 1.3.1 Customizing Datasets
Upon receipt of a new Dataset on CD-ROM from the scanning office, the entire content of the CD-ROM is copied to large-capacity magnetic disk. Article Express holds a database recording, among other things, which universities subscribe to which journal titles, since not every university subscribes to all journal titles. Universities only receive electronic equivalents of journals to which they hold a subscription. A number of different “logical” Datasets (based on pointers to relevant files, not on duplicating entire Datasets) are derived, one for each university. Also, a different DATASET.TOC file is generated for each university. Subsequently a number of FTP scripts are generated, also one for each university. As soon as this preparatory work is done, the different logical Datasets are ready for dispatch.
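The derivation of “logical” Datasets can be sketched as a simple filter over the issue directories of the master Dataset. This is only an illustration of the idea of pointing at files rather than copying them; the function name, the directory paths, the university names and the placeholder ISSN values are all hypothetical, not the Article Express implementation.

```python
def logical_datasets(dataset_issues, subscriptions):
    """Return, per university, pointers to the issue directories it may receive.

    dataset_issues: list of (issn, directory_path) pairs in the master Dataset.
    subscriptions:  dict mapping university name to a set of subscribed ISSNs.
    """
    return {
        university: [path for issn, path in dataset_issues if issn in titles]
        for university, titles in subscriptions.items()
    }

# Placeholder data for illustration only.
master = [
    ("ISSN-A", "/dataset/ISSN-A/0001"),
    ("ISSN-B", "/dataset/ISSN-B/0001"),
    ("ISSN-A", "/dataset/ISSN-A/0002"),
]
subs = {
    "University X": {"ISSN-A", "ISSN-B"},
    "University Y": {"ISSN-B"},
}
per_university = logical_datasets(master, subs)
```

Because each logical Dataset is just a list of paths into the single master copy, adding a university costs no extra disk space at the dispatch site.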
II 1.3.2 Internet Delivery
The original delivery process worked as follows:
In a series of FTP sessions (one per directory holding a journal issue) the entire content of the customized Datasets was transferred.
The following step was an FTP session in which all transferred directories were checked against the original data.
For any mismatch (missing files or ones which had a different file size than intended) the files were retransmitted and revalidated. If the validation still failed, an automatic message was generated for the service provider.
As a final step, after all validation had been accomplished, the customized DATASET.TOC file with all bibliographic data and the relevant cross reference data was transferred to indicate the successful completion of Dataset delivery.
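The delivery steps above amount to a transfer, compare, retransmit loop keyed on file names and sizes. The sketch below simulates that logic in memory rather than using real FTP sessions; the function names, file names and sizes are hypothetical, and the single retry is an assumption based on the description.

```python
def find_mismatches(expected_sizes, received_sizes):
    """Names of files that are missing or whose size differs from the original."""
    return sorted(name for name, size in expected_sizes.items()
                  if received_sizes.get(name) != size)

def deliver(expected_sizes, transfer, list_received, max_retries=1):
    """Transfer all files, validate by size, retransmit failures.

    Returns the files that still fail validation; an empty list means the
    DATASET.TOC file may be sent to signal successful completion.
    """
    transfer(list(expected_sizes))
    mismatches = find_mismatches(expected_sizes, list_received())
    for _ in range(max_retries):
        if not mismatches:
            break
        transfer(mismatches)                      # retransmit only the bad files
        mismatches = find_mismatches(expected_sizes, list_received())
    return mismatches

# Simulated channel: the second file arrives truncated on the first attempt.
expected = {"page001.tif": 81234, "page002.tif": 79500}
remote = {}
attempts = {"count": 0}

def fake_transfer(names):
    attempts["count"] += 1
    for name in names:
        truncated = attempts["count"] == 1 and name == "page002.tif"
        remote[name] = expected[name] - 100 if truncated else expected[name]

failed = deliver(expected, fake_transfer, lambda: dict(remote))
```

Note that a size comparison only detects missing or truncated files; a file of the right length with a wrong bit passes this check, which is one reason checksums were introduced later in the project.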
On average Datasets are between 200 and 300 Mb in size. FTP deliveries tended to take between two and 14 hours, with an average of 6.5 hours, depending on a number of more or less visible factors such as time of day, Internet rerouting, type of connection (T1 = 1.5 Mbit/second or T3 = 45 Mbit/second), high user load, etc.
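To put these delivery times in perspective, the effective throughput can be compared with the nominal line speed. The sketch below assumes a 250 Mb Dataset (the midpoint of the quoted range), decimal megabytes, and the 6.5 hour average; the function name is illustrative.

```python
def effective_mbit_per_s(size_mb, hours):
    """Average throughput in Mbit/second for a transfer of size_mb megabytes."""
    return size_mb * 8 / (hours * 3600)

T1_MBIT_PER_S = 1.5                       # nominal T1 speed quoted in the text

rate = effective_mbit_per_s(250, 6.5)     # assumption: 250 Mb midpoint Dataset
share_of_t1 = rate / T1_MBIT_PER_S        # fraction of nominal T1 capacity used

print(f"{rate:.3f} Mbit/s, {share_of_t1:.1%} of a T1 line")
```

Under these assumptions the average delivery used well under a tenth of the nominal capacity of even a T1 connection, which suggests that load, routing and protocol overhead rather than raw line speed dominated the transfer times.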
The initial “push” strategy was based on the assumption that all universities would be able to start receiving TULIP Datasets within the same time frame and that there would be “enough” storage space available at the universities’ end to receive the information. The idea was that after receipt of a Dataset from the scanning office it would be transmitted without delay from Article Express to all universities. However, because not all universities were ready to receive at the same time due to differences in their implementation schedules, this did not work very well.
Therefore during the course of the project the “push” strategy was changed to a “push on demand” strategy. From time to time (at least weekly), the local TULIP coordinator at a university connects via Telnet to the dedicated dispatch machine at Article Express to see if there are any new Datasets available and what the estimated size of the Dataset is in Mb. If there is a new Dataset, he/she indicates which one(s) to deliver (in TULIP terms: “kicking it off”) and then exits the Telnet session.
Implementation of this “kick-off” facility allowed the university staff to initiate the transfer of a Dataset when they were ready to accommodate the data. This helped considerably because it enabled the universities to process Datasets at their convenience, when they had sufficient disk resources. It proved very successful in the case where one late starter began receiving Datasets about one year after other sites and had a sizable backlog. A complete year of Datasets was transmitted within a few weeks.
Nevertheless, the number of errors (detailed below) remained high. Therefore, it was decided to cease the delivery of Datasets via the Internet and to revert to CD-ROM distribution.
II 1.3.3 Single article delivery
At TULIP’s start, a facility was designed at Engineering Information for single article delivery over the Internet. Universities were entitled to receive all bibliographic records of the TULIP titles, even if no subscription was kept on certain titles. If a researcher found an article of interest in one of the non-subscribed titles, it was possible to request this article by means of a formatted electronic mail message. To accomplish this facility a separate WORM-based optical storage system was developed by Engineering Information staff. It was the intention that this multi-gigabyte storage system would hold all Datasets after these were dispatched to the universities.
The development of this system suffered from several technical problems, most of them due to difficulties in connecting optical devices to the UNIX systems at Engineering Information. Therefore this system never became fully operational. The solution was to use a manual procedure, but this resulted in long delivery delays, and those universities that tried this system found it did not work satisfactorily. The demand for this system turned out to be very low. Most of the universities subscribed to nearly all journals and so did not need to request single documents. The only exception was the University of Tennessee, which did not receive and store page images and relied on this facility to order page images. However, apart from some troublesome experimenting, there was no demand for single articles and so the facility was practically abandoned in 1994.
II 1.4. Lessons learned concerning production
Below is an overview of some of the standards that have been adopted in the course of the TULIP project, of some of the problems that were faced during the four years of producing TULIP, and of the solutions found and implemented to solve these problems.
II 1.4.1 Dataset structure
The Dataset structure with directories and a single “DATASET.TOC” master index file has proven to be a stable and robust “envelope” format to collect and transmit large quantities of electronic material, independent of medium. It is simple to generate, load and convert in different systems and it is open to add formats like full text SGML files and MPEG videos, without violating the original structure. This Dataset structure format, nicknamed EFFECT, Exchange Format For Electronic Components and Texts, has been offered to the Internet Engineering Task Force (IETF) as a possible Internet standard. The format has been applied in other projects, such as the EASE project in which Elsevier Science is cooperating with Tilburg University in the Netherlands, and in the JSTOR project at the University of Michigan.
II 1.4.2 Page image cropping
There is a large variation in journal heights and widths. Scanning is performed at maximum page size regardless of journal size. The resulting page images therefore initially had black borders, meaning wasted disk storage space and high laser printer toner consumption.
The first cut at this problem was to manually measure each journal issue before it was fed into the scanner and to have the scanner operator enter height and width. This proved to be too laborious and error prone to work satisfactorily.
Image enhancement programs, which automatically crop pages based on visual aids on the page image, have also been investigated, but proved too inaccurate. Sometimes fine horizontal or vertical lines in a table or page footer were taken as the page margin, resulting in too much cropping. Automatic cropping also results in different dimensions for each page. This proved unsatisfactory for screen display purposes as the (left and right) page images “jump” on the screen when browsing through an article.
The following procedure, which is based on a separate cropping step, was finally adopted. Every journal title typically has its own fixed dimensions. In the cropping step a different “mask” is applied per journal title, which removes a fixed number of pixels from the right and bottom sides of odd pages and the same number of pixels from the left and bottom sides of even pages (due to the dual page scanning method both sides of the page are scanned at the same time). The result is equal-size page images per journal issue, with minimal black borders.
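The per-title mask can be expressed as a small function that returns the pixel rectangle to keep. This is a sketch under stated assumptions: the function name, the coordinate convention (left, top, right, bottom, with the origin at the top-left corner) and the trim values in the example are hypothetical, and odd or even is taken from the page's position in the scan order.

```python
def crop_box(page_number, width, height, trim_side, trim_bottom):
    """Return (left, top, right, bottom) of the area kept after cropping.

    Odd pages lose trim_side pixels on the right, even pages on the left;
    both lose trim_bottom pixels at the bottom.
    """
    if page_number % 2 == 1:
        return (0, 0, width - trim_side, height - trim_bottom)
    return (trim_side, 0, width, height - trim_bottom)

def box_size(box):
    """Width and height of a crop box in pixels."""
    left, top, right, bottom = box
    return (right - left, bottom - top)

# Example: a full A4 scan at 300 dpi with a hypothetical mask for one title.
odd_box = crop_box(1, 2481, 3507, trim_side=300, trim_bottom=400)
even_box = crop_box(2, 2481, 3507, trim_side=300, trim_bottom=400)
```

Because the mask is fixed per title, odd and even pages end up with identical dimensions, which avoids the “jumping” display that content-based automatic cropping produced.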
No good solution has been found yet for fold-out pages, which do not appear very often in TULIP journals but are frequently present in medical and geological journals for large maps, charts or tables. Fold-out pages are simply cut into several page images, losing the intended overview of the fold-out. One option would be to add the entire fold-out at 70% size, enabling users to get an overview of the entire chart, scheme or table.
II 1.4.3 Halftone quality
Page image scanning quality has improved considerably in the past years. Image enhancement technology improved and scanning staff became more experienced, resulting in crisp text and good artwork. The only exception is the quality of scanned photographs, which is less than satisfactory. This holds especially for photographs with little contrast, such as electron microscope images or other micrographs. This is inherent in the image format chosen, which is the bi-level bitmap. Each pixel in a bi-level file denotes either black or white, as opposed to halftone or color bitmaps. Each pixel in a bi-level bitmap is only one bit, while pixels in halftone or color files can take many bits to represent different colors or gray values. Bi-level bitmaps are therefore relatively small and can furthermore be compressed very well with the Fax Group IV compression scheme. In addition, the TIFF format is well established and software/hardware tools are readily available.
However, when scanning a halftone photograph into a bi-level bitmap, an image scanner has to decide for each pixel (typically 1/300 inch on a side) whether the gray value is above or below a certain threshold, and so whether it should be represented as a black or a white dot. In some cases, photographs with large gray areas result in smudgy black rectangles on the page images. This problem is further complicated by “moiré patterns”, caused by interference between the print screening (the angled tiny raster which is visible when observing printed photographs with a magnifying glass) and the scanner.
New scanning technology came onto the market recently to tackle this particular problem; however, it came “too late” for the TULIP project. A new line of high-volume image scanners recognizes halftone areas on the page and performs a sophisticated dithering technique on these areas, leaving the text areas untouched. The result is a considerably better scanned photograph quality, although still not equivalent to the original photograph. Scanners of this type will be used for future electronic projects.
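The difference between a fixed threshold and dithering can be illustrated on a small uniform gray patch. The dithering method used by the newer scanners is not specified here; the sketch below substitutes ordered dithering with a standard 4 × 4 Bayer matrix purely as an example of the general technique, and the gray value 96 and the cutoff 128 are arbitrary choices.

```python
# 4 x 4 Bayer matrix (values 0..15), a standard ordered-dithering pattern.
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]

def threshold(gray, cutoff=128):
    """Bi-level conversion with one fixed cutoff: 1 = black, 0 = white."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]

def ordered_dither(gray):
    """Bi-level conversion with a cutoff that varies with pixel position."""
    result = []
    for y, row in enumerate(gray):
        out_row = []
        for x, value in enumerate(row):
            cutoff = (BAYER_4X4[y % 4][x % 4] + 0.5) * 16   # ranges 8 .. 248
            out_row.append(1 if value < cutoff else 0)
        result.append(out_row)
    return result

def black_fraction(bitmap):
    """Share of black pixels in a bi-level bitmap."""
    pixels = [p for row in bitmap for p in row]
    return sum(pixels) / len(pixels)

# A uniform dark-gray 8 x 8 patch (0 = black, 255 = white).
patch = [[96] * 8 for _ in range(8)]
flat = threshold(patch)
dithered = ordered_dither(patch)
print(black_fraction(flat), black_fraction(dithered))
```

With the fixed cutoff the whole patch turns black, which is the “smudgy black rectangle” effect described above, while the dithered version keeps a share of white pixels that approximates the original gray level.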
II 1.4.4 CD-ROM mastering
In 1992, CD-ROM mastering equipment based on CD-Recordable write-once (golden) disks was not readily available and had its teething problems. Some of these first generation problems were encountered in 1993, when it was necessary to produce a large quantity of CD-ROMs due to the 1992 backlog. The situation has now become more stable, but needs continued attention because the technology is still not 100% error proof.
II 1.4.5 Checksums
TIFF page images are very sensitive to “corruption”. One single incorrect bit in a page image makes the file useless. Errors occasionally occur in CD-ROM mastering as well as in Internet file transfers, although at the outset these two technologies were expected to have sufficient error recovery facilities. On average we have encountered one incorrect bit, resulting in a fully incorrect image, per roughly 20,000 correct images, which equals one wrong bit per 1.6 gigabytes.
It is impossible to determine a pattern in the occurrence of erroneous files. To detect possible problems a checksum file is generated for each subdirectory as a final step in the quality control phase. These checksum files are checked after the CD-ROM is written. Only validated CD-ROMs are shipped to the universities. As a precautionary measure the receiving universities also validate all incoming Datasets against the checksum files. They have indicated that this is of high relevance to safeguard the integrity of their electronic holdings.
Checksums were introduced at the beginning of 1994. The first checksums were based on the “sum” routine, available as a UNIX command. However, this routine was dependent on the byte order of the central processing unit and on the UNIX “dialect” (Berkeley BSD or AT&T System V), and therefore not universally usable. Since mid-1995, the checksums have been based on the publicly available MD5 signature algorithm, developed by RSA Data Security, Inc., which is independent of byte order and CPU and is more robust.
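A per-directory MD5 checksum file of the kind described above can be sketched with Python's standard `hashlib` module. The checksum file name and its line layout (digest, two spaces, file name) are assumptions made for this illustration, not the format used in the TULIP Datasets.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """MD5 digest of a file, read in chunks to keep memory use small."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksums(directory, checksum_name="CHECKSUM.MD5"):
    """Write one 'digest  filename' line per file in the directory."""
    directory = Path(directory)
    lines = [f"{md5_of_file(p)}  {p.name}"
             for p in sorted(directory.iterdir())
             if p.is_file() and p.name != checksum_name]
    (directory / checksum_name).write_text("\n".join(lines) + "\n")

def verify_checksums(directory, checksum_name="CHECKSUM.MD5"):
    """Return the names of files that are missing or fail the checksum."""
    directory = Path(directory)
    bad = []
    for line in (directory / checksum_name).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        path = directory / name
        if not path.is_file() or md5_of_file(path) != expected:
            bad.append(name)
    return bad

# Demonstration: a single flipped bit is detected.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page001.tif"
    page.write_bytes(bytes(1000))          # stand-in for a page image
    write_checksums(tmp)
    clean_result = verify_checksums(tmp)
    data = bytearray(page.read_bytes())
    data[500] ^= 0x01                      # flip one bit
    page.write_bytes(bytes(data))
    corrupted_result = verify_checksums(tmp)
```

The same verification routine can be run by the producer after writing the disk and by the receiving site after loading it, since MD5 operates on the byte stream and gives the same digest on every platform.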
II 1.4.6 Unique identifiers
One of the things we have learned in TULIP is that in a large-scale database environment it is necessary to have a unique and unambiguous way of identifying journals, issues and articles.
Journal identifiers: As a unique journal identifier, the International Standard Serial Number (ISSN) has been chosen. This, however, posed problems with journal titles which are renamed, split up or joined. Nevertheless, the ISSN proved standard “enough” for the majority of journals, thereby avoiding the need for a “new” standard.
Issue identifiers: No short and simple unambiguous scheme existed for identifying journal issues. Normally a journal issue is identified by a volume and issue number (e.g. Volume 193, Issue 4). However, there turned out to be many troublesome exceptions to this rule, such as combined issues (e.g. Issues 1-4), combined volumes (e.g. Volumes 192-194) and special issues (such as indexes, supplements and proceedings issues). To avoid any inconsistencies, a simple generic sequence number, unrelated to the printed volume and issue numbers, has been adopted for TULIP.
Item identifiers: The standard adopted in this project is the Standard Serial Document Identifier (SSDI) (previously known as the ADONIS numbering scheme). The proposed NISO Z39.56 standard, also known as SICI (Serial Item and Contribution Identifier), has also been considered. While Z39.56 is an excellent format for retrospectively assigning a unique and unambiguous identifier to paper-based information, and it is very easy for a librarian to assign a Z39.56 code to an article in a collection (even if this article was published centuries ago), it has two major disadvantages in the electronic era:
It is restricted to paper-based material. It uses volume, issue and page numbers, which are relevant for paper forms but could be irrelevant in electronic environments where a page paradigm is not applicable, for instance a hierarchy of HTML files published as part of a World Wide Web service.
It is restricted to material that is ready for publication. This means that it is not possible to denote an article with a Z39.56 code before it is certain what the volume, issue, page number and publication date exactly are. For instance, publishers often provide information in “current awareness” or “pipeline” services about articles in forthcoming publications, when all the above specifications are not yet finally available.
The SSDI scheme chosen in TULIP could be regarded as a “social security number” for documents. The SSDI is assigned at the moment that the article is accepted for publication; it is used as a reference number for the authors during production phases and it is printed on each page of the article in the issue. As it is relatively short (a fixed 16 digit number), the SSDI is very usable as a primary key in computer environments. Since this number is based on the ISSN, it is not restricted to Elsevier, but allows every publisher in the world the ability to assign SSDI’s.
In 1995, the (slightly adapted) SSDI was incorporated in the Publisher Item Identifier (PII) initiative, encouraged by a cooperation of major publishers and societies. This cooperation includes the American Chemical Society (ACS), the American Institute of Physics (AIP), the American Physical Society (APS), the Institute of Electrical and Electronics Engineers (IEEE) and Elsevier Science.
The Z39.56 Standard Committee has apparently taken the concerns mentioned above into consideration and will incorporate publisher-assigned identifiers in a forthcoming release of Z39.56.
Section identifiers One issue that TULIP did not attempt to solve is the need to divide articles into groups or sections within a journal issue. A few journal titles (especially the larger ones) have an editorial setup that divides journal issues into several subject areas or sections. Articles about a given topic within the scope of the journal are collected together and identified by means of special separation pages and/or a categorization in the table of contents. Future projects could consider adopting a sectioning strategy.
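Returning to the journal identifier discussed above: since the ISSN serves as the key for journals (and as the basis for the SSDI), loading software may want to check it before use. The following is a minimal sketch in Python of the standard ISSN check-digit test (weights 8 down to 2, modulo 11, with X standing for 10). The function name is illustrative only and is not taken from any TULIP software.

```python
def is_valid_issn(issn: str) -> bool:
    """Return True if `issn` (with or without hyphen) has a correct check digit."""
    chars = issn.replace("-", "").upper()
    if len(chars) != 8 or not chars.isascii() or not chars[:7].isdigit():
        return False
    if not (chars[7].isdigit() or chars[7] == "X"):
        return False
    # Weighted sum of the first seven digits, using weights 8, 7, ..., 2.
    total = sum(int(d) * w for d, w in zip(chars[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[7] == expected
```

A check of this kind catches single-digit typing errors; it says nothing about whether a title has been renamed, split or merged, which remains the harder problem noted above.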
II 1.4.7 Production backlogs
When the TULIP project started in 1992, it was based on rather new, uncommon technologies such as high-volume scanning, optical character recognition, CD-ROM mastering, Internet file transfer, etc. It took almost a year to develop and test the procedures. Therefore reliable, stable production was only possible beginning in January 1993. Since the decision had been made to begin the TULIP journal data with the 1992 subscription year, it was necessary to produce two years of data in 1993, beginning with the 1992 backlog. The 1992 material was done retrospectively as much as possible (more recent journal issues were done first, going backward in time). Due to several teething problems, it took the full year 1993 to clear the 1992 backlog. In February 1994, the production systems became stable with a regular biweekly frequency.
In the beginning of 1995, a smaller backlog arose when SGML files were added to TULIP Datasets. It took until the end of the summer to catch up and return to the regular schedule. The lesson here is to be extremely careful when introducing pioneering technologies into smoothly operating environments. Procedures which perform nicely in a small-scale situation without deadlines are not easily transferable to large-scale, tightly scheduled operations. New technologies require extra administration, operator training, good feedback of teething errors to developers, motivation of staff, etc. All those aspects are easily overlooked in the laboratory stages, in which only a “proof of principle” has to be delivered.
II 1.4.8 Logistics in the scanning offices
During the TULIP project electronic material was lagging behind the paper journal issues, due to the decision to not intercept intermediate (possibly incomplete) articles, but instead to scan final journal issues after these were printed and bound. There were efforts during the TULIP project to accelerate the electronic delivery of material to correspond as much as possible with the printed journal. However, two to three weeks has been the practical minimum lag time.
II 1.5. Lessons learned about customizing and Internet distribution
II 1.5.1 Lessons learned on customizing
From a technical viewpoint, customizing a Dataset to include only those journals a university subscribes to is not difficult: a simple database table records which journal titles each university is supposed to get. A few UNIX command scripts performed the task of constructing logical directory sub-trees and removing un-subscribed material from the DATASET.TOC master index file.
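The table lookup described above can be sketched as follows. This is a Python stand-in for the original UNIX scripts, not a reproduction of them: the subscription table, the university names and the dictionary layout of the index entries are all hypothetical, and the real DATASET.TOC format is not modelled.

```python
from typing import Dict, List, Set

# Hypothetical subscription table: university -> ISSNs it subscribes to.
SUBSCRIPTIONS: Dict[str, Set[str]] = {
    "univ_a": {"0040-4039", "0378-5955"},
    "univ_b": {"0040-4039"},
}


def customize(entries: List[dict], university: str,
              subscriptions: Dict[str, Set[str]] = SUBSCRIPTIONS) -> List[dict]:
    """Keep only the index entries whose journal the university subscribes to."""
    allowed = subscriptions.get(university, set())
    return [entry for entry in entries if entry["issn"] in allowed]
```

An unknown university simply receives an empty list, which seems a safer default for subscription control than sending everything.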
One difficulty is that there is no recognizable link between the cover date printed on the journal issues and the subscription year based on volume numbers. At the end of a year, subscription cancellations can pose problems when “late” issues (part of last year’s subscription) are published with cover dates of the following year. These should be sent to the universities, even though there is no current subscription. In TULIP, there were no cancellations during the project period. However, in the reverse situation where a university requested “older” material to complete their holdings retrospectively, a few minor complications arose.
II 1.5.2 Lessons learned on Internet delivery
Lack of experience with delivering large Datasets with many pieces led to the original design of an FTP “push” model which proved unreliable. Near the end of the project several of the technical collaborators discussed a design for an FTP “pull” system that would still allow subscription control, but would be a better match for FTP’s design strengths.
Different types of problems occurred with Internet delivery, ranging from operational problems at the sending or receiving side to more generic problems with the Internet and the FTP protocol (see appendix XI for more detail). The large number of FTP problems made the delivery process unmanageable. It was decided to temporarily discontinue FTP deliveries and to revert to CD-ROM delivery of Datasets. Most universities have adapted to this change in procedures without problems, although they have expressed their long-term preference for the Internet as a delivery method.
Studies are currently under way to develop a “full pull” strategy based on FTP mirroring technology with automatic E-mail functionality, FTP and Perl scripts. This would work as follows: As soon as a Dataset is available for dispatch, a structured E-mail message is sent to a dedicated E-mail address. An automated process at each university picks up these messages and starts a standard series of pull-and-validate scripts. All transfer errors would be recorded and checked by the TULIP operator at the university.
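The validation half of such a pull-and-validate cycle might look like the sketch below. It assumes, purely for illustration, that the notification message lists one file per line as a relative path, a size in bytes and a SHA-256 checksum; the actual message format under study is not described in this report, and the transfer step itself (for example with an FTP client) is deliberately left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def parse_notification(body: str) -> Dict[str, Tuple[int, str]]:
    """Parse lines of the assumed form '<relative path> <size> <sha256>'."""
    manifest: Dict[str, Tuple[int, str]] = {}
    for line in body.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip blank lines and free text
        name, size, digest = parts
        manifest[name] = (int(size), digest.lower())
    return manifest


def validate(root: Path, manifest: Dict[str, Tuple[int, str]]) -> List[str]:
    """Return a list of error messages; an empty list means the pull is complete."""
    errors: List[str] = []
    for name, (size, digest) in sorted(manifest.items()):
        path = root / name
        if not path.is_file():
            errors.append(f"missing: {name}")
            continue
        data = path.read_bytes()
        if len(data) != size:
            errors.append(f"size mismatch: {name}")
        elif hashlib.sha256(data).hexdigest() != digest:
            errors.append(f"checksum mismatch: {name}")
    return errors
```

The returned error list is what the TULIP operator at the university would review; files reported as missing or damaged could be pulled again without repeating the whole Dataset.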
II 2. Technical implementation at the universities
In this part, a brief overview will be given of the major similarities and differences between the TULIP implementations, as well as of the major experiences and lessons learned by the universities as described in their final reports. All detailed information on these implementations can be found in these final reports (appendices I-IX).
II 2.1. Organization of IT/Information services
The TULIP universities are medium to large, or very large organizations with relatively high-tech environments, with good campus networks and highly skilled personnel. The library and the computer center normally are separate organizational units. Some libraries have their own development staff who operate independently from the computer center. In one case (Michigan) staff from the College of Engineering collaborated on the project.
II 2.2. TULIP systems development
The approaches to TULIP are all different; no two systems are the same. Some of the components used to build the TULIP implementations are the same, but the actual implementations differ greatly. BRS Search, Kerberos, Adabas, Notis and Newton are mentioned more than once as a component, but the way these components are used in the implementations differs remarkably. The development phases are also hard to compare. It is noteworthy that nearly every university has tried more than one alternative or prototype, except the University of California and Virginia Tech, but even these are considering a Web implementation as an alternative to their current solutions.
Also, TULIP-like systems are very different from traditional library systems. At several universities, where there has been a change of personnel during the project period, this has meant a very steep learning curve for those getting involved in the project, due to the fact that the (development of the) TULIP system was completely uncharted territory.
II 2.2.1 Development process
Development was much harder than expected, and different in nature. It was not necessarily the technology that was the major problem, but more the scale and infrastructure of the project. “Perhaps, the major general lesson that has come out of doing TULIP is that systems like TULIP are a lot harder to develop and deploy than might be apparent. And the reason for these difficulties has less to do with the technologies used to build such systems than with the infrastructures needed to support them.” Some of the characteristics of the development process are:
Most of the systems were assembled from different components resulting in rather proprietary solutions, not easily transferable to other organizations.
Cheap or free-of-charge public domain or shareware software components were used wherever possible, and there also seemed to be a preference for building something bespoke rather than purchasing off-the-shelf software.
In most cases the TULIP development was done by one person or the combined effort of a small group.
II 2.2.2 Migration to production
None of the TULIP systems became “mature” in the sense that prototypes were handed over to production departments. All systems remained prototypes, with all typical evidence of prototype systems, such as lack of proper documentation, backup-restore procedures, management tools, etc.
At a few universities plans are being made to move towards more production-type operations.
II 2.3. TULIP functionality
II 2.3.1 Searching vs browsing
Most of the universities, with the exception of the University of California, implemented a browsing facility, that is, the ability to choose a journal, then an issue, then an article from the table of contents.
All of the universities developed a searching facility, mostly restricted to boolean logic and proximity searching.
II 2.3.2 Separate system vs based on/integrated with OPAC or A&I services
Three basic types of implementations were built:
1. a separate database and user interface
2. a separate database but using a common/known interface
3. integration with an existing information service, i.e. access to the image files through comprehensive secondary databases.
Re 1.
A number of solutions were stand-alone implementations, where TULIP was not only offered as a new database, but also had a “new” (proprietary) interface, mostly based on X-Windows. However, a number of Web implementations have also been developed, for which the client (interface) is obviously very well known, even though the TULIP files are in stand-alone databases. In all these cases, systems based on RS/6000/AIX, DECstation/Ultrix and Sun SPARCstation/Solaris were most popular.
Re 2.
All universities had an existing OPAC (library catalogue), most of which were IBM mainframe or UNIX based. In two cases, TULIP was implemented as an adaptation of the existing OPAC (GTEL at GT, MIRLYN at the University of Michigan). In the OPAC-based environments, the OPAC was used for searching and browsing bibliographic records only. The page images were stored separately from the OPAC on one of the UNIX systems mentioned above.
Re 3.
At the University of California, the initial access to the TULIP files was through the existing Inspec or Current Contents databases on the MELVYL system, after which page images could be displayed and printed through a proprietary X-Windows based viewer.
II 2.4. Client systems
II 2.4.1 Clients
In TULIP, efforts have been made to build several client systems, usable on a multitude of client platforms, such as MS-Windows PCs, Apple Macintoshes, a diversity of UNIX systems (IBM AIX, Dec Ultrix, Sun Solaris, Motif, SCO Unix, Linux, Indigo IRIX, etc.) and several terminal-oriented mainframe systems, notably VT100 and IBM3270-based. Upscaling and broadening would create huge maintenance problems, because all client systems would have to be kept in line with upcoming new facilities and emerging new processors (e.g. PowerPC) and operating systems (e.g. OS/2, Windows95, Windows NT).
In the early days of the project many different approaches, reflecting this variety, were tried with more or less satisfactory results:
Telnet
X-Windows-based proprietary developments
MS-Windows/Visual Basic-based developments
MS-Windows/Proprietary developments such as OCLC’s Guidon
Gopher
It became clear, however, that there are basically two ways to support an application on many different client platforms:
One is to develop, test, install and upgrade a client system on these different platforms in synchronization with each other.
The other is to develop an application which works on the lowest common denominator platform, i.e., a terminal emulation which could be done on all client platforms.
Neither solution was satisfactory: in the first case excessive amounts of development time are required, and in the second case the functionality that can be offered is very low, whereas one wants to offer a system that is an improvement over existing systems, with high sophistication and functionality.
The advent of the World Wide Web provided a solution for this dilemma. One of the crucial advantages is the availability of ready-to-use, publicly available, user friendly, graphical Web browsers on all prevalent platforms. The Web environment allows developers to concentrate fully on the server part and not to bother any further with the client part.
The standard which has been set with HTML and HTTP, together with portable WWW clients such as NCSA Mosaic and Netscape Navigator, solves the maintenance problem, freeing time to concentrate on server developments. The current WWW tools are somewhat unstable and rather restricted in functionality, but it is expected that these limitations will quickly diminish and that new functions (such as HotJava and embedded viewers) will be added. Another major advantage is that, since these clients are so generally available and easy to use, the need for support and training is minimal. A disadvantage, however, is that WWW clients offer limited programmability when functions are needed that the client itself does not provide. For instance, there is no satisfactory solution as yet within the Web environment for printing high quality images. At the University of Michigan, a so-called helper application takes care of the printing. It is expected that future possibilities such as Java will alleviate current constraints. Also, the possible future pricing and licensing strategy for these products is a major concern for academic institutions.
There was a general consensus among TULIP developers that writing good, stable client software is a major, often underestimated task, and that maintaining this software in different computer environments is not very rewarding.
So while the Web browsers have some limitations with regard to the required TULIP functionality, both the benefits for development and the popularity of the Web are so pervasive, that most developers feel that the Web is the only way to go at the moment.
II 2.4.2 Viewing
The majority of the universities have implemented page image viewing for their TULIP systems. A few observations of those who implemented this facility follow:
The utilized Group IV compression algorithm is a straightforward method, but it demands high resources for scanning, viewing and printing. At the start of TULIP, common technology to deal sufficiently fast with page images was not readily available. For scanning and printing, special acceleration boards were used, and only high-end UNIX workstations were able to perform page viewing. In the past four years, computer and printer technology has evolved to the extent that no special purpose equipment is needed to achieve near-instantaneous image viewing and printing.
The first page image viewing applications were based on X-Windows, which lacked image compression. In those cases where network links between imaging application and user screens were slow, this resulted in relatively long response times because entire uncompressed page images were transmitted over the networks. This was true in particular for the University of California system. However, in other cases the response times were well under two seconds, which is what users have said they want.
In practice, X-Windows was restricted to UNIX machines. PCs and Macs were considered too slow or too cumbersome for X-Windows emulation. Also, X-Windows requires a rather difficult installation procedure compared to what is “usual” for PC applications, for which any user can typically run a foolproof Install or Setup program, without user-hostile parameter settings.
In most page image viewing applications, a technique known as “gray scaling” or “anti-aliasing” is applied to enhance page images (see appendix XII for an elaboration on these techniques).
Page viewing, also known as page flipping, needs to be very fast in order to avoid user annoyance. Typical flip times must be below two seconds. To speed this up, the faster implementations apply technologies such as pre-fetching (i.e. requesting - in the background - the most probable “next” page in advance so that it is readily available when the user asks for it), and caching (i.e. keeping a few page images temporarily stored “at hand” in case the user wants to see them again).
Experience shows that, even with anti-aliasing technology, page images displayed on the computer screen are not really used for reading. As ascertained by the log files, the average duration a page image is shown is far below one minute. This allows for a brief scan of the page to discern the relevance of the article, but not for exhaustive reading. None of the implementations allows for online highlighting and/or annotations, which is “less” than the functional equivalent of the paper.
Most implementations did not allow viewing of pages outside the scope of “full articles”. Page viewing starts as the result of either browsing or searching. After selection of an article, the first page of the article is displayed. Buttons enable navigation to the previous, next, first or last page (within the same article), or to another article. However, none of the implementations, except for the University of California’s, allowed for easy browsing through an entire journal issue, including the pages with advertisements, announcements, obituaries, etc., which were not represented in the table of contents.
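The pre-fetching and caching techniques mentioned above for fast page flipping can be illustrated with a small sketch. It is not taken from any of the TULIP viewers: the class name, the `fetch` callback and the default capacity are assumptions, and the prefetch is done synchronously here, where a real viewer would run it in the background.

```python
from collections import OrderedDict
from typing import Callable


class PageCache:
    """Least-recently-used cache of page images that also prefetches the next page."""

    def __init__(self, fetch: Callable[[int], bytes], last_page: int, capacity: int = 4):
        self._fetch = fetch          # function that retrieves one page image from the server
        self._last = last_page       # last page number of the article
        self._capacity = capacity    # number of page images kept "at hand"
        self._pages: "OrderedDict[int, bytes]" = OrderedDict()

    def _load(self, page: int) -> bytes:
        if page in self._pages:
            self._pages.move_to_end(page)        # mark as most recently used
            return self._pages[page]
        data = self._fetch(page)
        self._pages[page] = data
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)      # evict the least recently used page
        return data

    def get(self, page: int) -> bytes:
        data = self._load(page)
        if page < self._last:
            self._load(page + 1)  # synchronous stand-in for a background prefetch
        return data
```

With this arrangement, flipping forward through an article costs one server request per page, and each request is issued before the user asks for the page.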
II 2.4.3 Printing
Printing of page images has proven to be a major concern for all universities. The large size of the 300 dpi page images, in combination with the compression scheme used, makes printing highly demanding for older laser printers, and even the newer high-capacity laser printers can show exceptionally slow throughput compared to regular print jobs if not properly set up. Even when properly set up, page images are large compared to regular office print jobs and can easily drain network resources and congest print queues. Printing of images on laser printers other than directly IP-attached ones should be avoided. However, keeping track of local laser printers attached to the network is troublesome. In large organizations there is too much change (moving printers, modification of local sub-nets) to allow for proper and easy central print management.
Early in the project, printing of non-compressed page images could easily take between 15 and 30 minutes per page. The advent of affordable PostScript Level 2 printers, supporting compressed image printing, prompted the shared development by the TULIP participants of a TULIP-endorsed subroutine to print page images (available from the EFFECT home page) within 30 seconds on HP LaserJet 4MX laser printers or compatibles. This printing code was the product of a concerted collaborative effort. Distributed printing is a key infrastructure issue for all campuses. This collaboration underscores the shared nature of this problem.
To prevent network congestion, most universities have central printing facilities in which a high capacity laser printer is directly connected to the image server. However, these implementations did not seem very popular in the TULIP context. Users seemed to prefer a quickly available local print in lower quality over high-quality central prints which took time to arrive at their desk.
II 2.4.4 Exporting/faxing
Export of material other than for viewing or printing, for instance for supplying texts in word processor formats, has not been implemented by any of the universities. In Web implementations limited cutting and pasting of text is possible by using the source “code”.
Georgia Tech has developed a fax service supplementing the print service, with which it is possible to provide page images directly to fax machines.
II 2.5. Server systems
II 2.5.1 Search engine
Each university uses a different system or approach for searching. Universities using OPAC-based TULIP systems use the native OPAC search engine. Non-OPAC based systems use a full text search system. Choices range from proprietary “home-grown” systems (e.g. FTL), to public domain or experimental software (e.g. WAIS, Clarit), to commercially available systems (e.g. BRS, Newton, SiteSearch). All those systems provided basic search possibilities such as Boolean logic and word proximity. A few sites have tentatively investigated natural language systems, but were unable to implement this functionality within the life span of this project.
Most universities chose to implement fielded search on the bibliographic data provided in the DATASET.TOC file only. This information is divided into well-defined structured fields such as article title, authors, keywords, journal name, publication date, and abstract. Searching can be done on all of these fields, including on individual words in, for instance, the title and abstract.
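To make the terms concrete, the sketch below shows fielded indexing with Boolean combination and a simple word-proximity test. It is an illustration in Python under stated assumptions, not a description of BRS, Newton, WAIS or any other engine named above; the record layout and function names are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Build an inverted index: field -> word -> set of record identifiers."""
    index: Dict[str, Dict[str, Set[str]]] = {}
    for rec_id, fields in records.items():
        for field, text in fields.items():
            for word in tokenize(text):
                index.setdefault(field, {}).setdefault(word, set()).add(rec_id)
    return index


def lookup(index: Dict[str, Dict[str, Set[str]]], field: str, word: str) -> Set[str]:
    """Return the records containing `word` in the given field."""
    return index.get(field, {}).get(word.lower(), set())


def near(text: str, a: str, b: str, distance: int) -> bool:
    """True if words a and b occur within `distance` word positions of each other."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)
```

Boolean AND, OR and NOT then correspond to the set operators `&`, `|` and `-` applied to the results of `lookup`, for example `lookup(index, "title", "magnetic") & lookup(index, "abstract", "raid")`.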
“Raw” ASCII files were available for additional searching. The quality of these OCR-ed files has improved considerably in the course of the project. Two of the universities (University of Michigan and Virginia Tech), have implemented full text searching on the basis of the raw ASCII files. The main reasons for not implementing searching of the full text ASCII for the other sites were:
the relatively low quality of these ASCII files;
keyword-Boolean retrieval does not perform very well on these files (Carnegie Mellon University plans to start using the full text once Claritech natural language retrieval, which performs better on the raw ASCII files, is put into production);
the added complexity/system load of loading these large files into their full text database systems.
The implementation at the University of California was different from the others:
All universities except the University of California implemented TULIP as a closed collection, that is, the TULIP system only gave access to the limited set of journals in the TULIP project. For other information (e.g. other journals from other publishers), users had to rely on their traditional ways of accessing those collections. The University of California used a different approach for TULIP to overcome the lack of comprehensiveness. A procedure of semi-automatic matching between their Inspec and Current Contents (CC) databases and TULIP was developed, in which every incoming TULIP Dataset was checked against Inspec and CC, and matching records were marked. When searching in Inspec or CC, users find TULIP records with an indication that the full article is available. Subsequently, issuing the “display” command starts the page image viewer. The users’ regular searching/browsing in Inspec or CC leads them to the TULIP pointer, and they can access the page images directly.
Another feature worth mentioning here is the “profiling” feature implemented at the University of Michigan, which means the ability to store queries that are automatically run against new TULIP Datasets when they arrive. For this facility, end users specified a profile with a predefined set of keywords based on their interests. Each incoming Dataset was scanned using this profile and users were notified about articles of potential interest to them by means of electronic mail messages containing the abstracts of matching articles.
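A profiling run of this kind can be sketched as below. This is an assumption-laden illustration, not the University of Michigan implementation: the profile and article structures are hypothetical, matching is plain keyword overlap, and the electronic mail step is omitted.

```python
import re
from typing import Dict, List, Set


def words(text: str) -> Set[str]:
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_profiles(profiles: Dict[str, Set[str]],
                   articles: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each user to the titles of new articles containing any profile keyword."""
    hits: Dict[str, List[str]] = {}
    for user, keywords in profiles.items():
        wanted = {k.lower() for k in keywords}
        matched = [a["title"] for a in articles
                   if wanted & words(a["title"] + " " + a["abstract"])]
        if matched:
            hits[user] = matched
    return hits
```

Each incoming Dataset would be passed through `match_profiles` once, after which one message per user in the result could be composed from the abstracts of the matching articles.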
II 2.5.2 Data loading
As described earlier, the File Transfer Protocol (FTP) as a mechanism for large-scale bulk delivery was an aggravating experience. Even if a better and more stable mechanism were developed (e.g. pull as opposed to the employed push method), FTP is considered not to be scalable to larger collections and to more customers with the current technologies and network bandwidth. However, TULIP participants think that in the more distant future these restrictions will disappear, which will make network delivery the preferred method again. Therefore development should continue.
One reason most universities prefer network delivery over CD-ROM delivery is that completely automated delivery is possible, whereas with CD-ROM manual loading procedures remain necessary.
Because all TULIP systems are different, loading new data is also different for all universities. New Datasets mostly are verified, separated into several image and full text databases, the text is indexed and images prepared. Sometimes loading errors are encountered which necessitate human intervention. Especially in the beginning of the project, this process needed constant oversight.
II 2.5.3 Storage
In the first years of TULIP, magnetic media prices were very high. Most universities have investigated optical storage technologies, in view of the lower price. However, a number of disadvantages became evident.
Optical media are slow compared to magnetic ones, especially when applied with jukeboxes. This is an important disadvantage in an operation geared for quick response such as near-instantaneous page viewing. Magnetic disk caches relieve this somewhat but result in variable delays, which could annoy users (less frequently requested or older material takes longer to fetch, but the user doesn’t know why some material takes longer to display than other material).
The optical media used are basically read-only. This was counter-productive in an experimental environment in which sometimes rearrangements or modifications of files were needed.
Optical media are relatively new and unknown in combination with server systems, resulting in a considerable amount of incompatibility problems.
In the past four years magnetic disk prices have decreased dramatically (by a factor of 1.5 to 2 per year). The new RAID (Redundant Array of Inexpensive Disks) technology in particular is favored as a good option for applications with massive storage requirements, combining a good price/performance ratio with reliability and compatibility with any server system.
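Taken literally, a yearly factor of 1.5 to 2 compounds considerably over four years. The small calculation below works this out, assuming (as a simplification not stated in the text) that the same factor applied uniformly in each of the four years.

```python
def cumulative_factor(per_year: float, years: int) -> float:
    """Overall price reduction factor after compounding a yearly factor."""
    return per_year ** years


low = cumulative_factor(1.5, 4)   # lower bound of the quoted range
high = cumulative_factor(2.0, 4)  # upper bound of the quoted range
print(f"Overall price reduction over four years: {low:.1f}x to {high:.1f}x")
```

Under that assumption, the same budget buys roughly five to sixteen times as much magnetic storage at the end of the period as at the start.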
The very large volume of TULIP data also presented some problems with backing up, which apparently have not been solved at all universities. A few times requests were made to Elsevier to borrow its spare CDs to reload data lost due to disk crashes or other calamities. New ways will have to be found to routinely back up the page images and other data files. Delivery on CD-ROM obviously alleviates this problem somewhat, as the discs contain all the data, but not the indexes etc. generated by the university’s application.
New data formats such as SGML, HTML and Acrobat PDF do not appear to reduce the need for large storage space. There are strong indications that data in these formats is equally large in file size, but has better information search, retrieval and presentation possibilities.
II 2.5.4 Security and authentication
The TULIP license allowed for unlimited on-site electronic distribution and use of the data, but restricted off-site delivery. Also, use should be registered and reports on use generated.
User name/password schemes restricting usage to legitimate users were the obvious way to do this. For some institutions unaccustomed to authorized usage this proved problematic. Management of user names and passwords in large, non-constant environments such as a university with many students, departments, locations, etc., and with continuous changes, is troublesome at best.
There was initial concern regarding logfile delivery to Elsevier because of possible breaches of privacy. This was overcome by implementing anonymous code schemes, which make it impossible to trace particular usage back to an individual user.
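The report does not describe how the anonymous codes were generated. One possible approach, sketched here in Python purely as an assumption, is a keyed hash of the user name, with the key kept at the university: records of the same user remain linkable for usage statistics, but the recipient of the log cannot recover the user name without the key. The log line layout (user name as the first field) is also hypothetical.

```python
import hashlib
import hmac


def anonymous_code(user_id: str, secret_key: bytes, length: int = 12) -> str:
    """Derive a stable pseudonym from a user name with a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def anonymize_log_line(line: str, secret_key: bytes) -> str:
    """Replace the first space-separated field (assumed to be the user name) by its code."""
    user, sep, rest = line.partition(" ")
    return anonymous_code(user, secret_key) + sep + rest
```

A plain unkeyed hash would be a weaker choice, since anyone with a list of campus user names could recompute the codes and match them.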
The advent of the World Wide Web complicated this even further because of its inherent limitations. Restrictions can be set on machines and IP addresses, but not easily on authorized persons without user-hostile password procedures. Nevertheless, Web server logging facilities provided possibilities to generate usage information, if care was taken to include this in the design of the particular Web system. New generic WWW authentication, security and encryption technologies to address this problem are emerging.
II 2.6. Network delivery to end users
TULIP page images, which are much more voluminous than plain bibliographic records, take longer to transmit from a central storage and retrieval system to the end user’s desktop. A large scale implementation involving massive transfer of page images in a typical network infrastructure can easily drain available resources.
Users expect that a system which allows them to view page images on their screens and to print them locally, has fast response times. Especially image viewing should be almost instantaneous to prevent user rejection, which puts a considerable strain on available network bandwidth and on the capacity to retrieve from the storage media described above.
It became clear during the TULIP project that, for “usable” page image viewing and printing, a local area network should at least support “normal” Ethernet speeds of 10 megabit/second. Modem or ISDN based SLIP or PPP connections are too slow to provide fast response times.
In the early days of the TULIP project, printing contributed significantly to network congestion, as the image files had to be sent uncompressed to local printers (see also paragraph II 2.4.3. on printing). This improved with the arrival of PostScript Level 2 printers supporting the Fax Group IV compression scheme.
Unlike the other TULIP universities, the University of California has campuses spread across the entire state, interconnected with (fractional) T1 lines of 1.5 Megabit per second. The computer center at the Office of the President in Oakland held the TULIP data; there was no local storage on the campuses. The limited network capacity proved detrimental to system response, leaving users waiting a long time for images to appear on their screens. This experience provided support for an initiative to increase the bandwidth of the University of California’s network, since it is expected that more image and multimedia projects will be launched in the foreseeable future, which will require bandwidth similar to the TULIP project.
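A rough calculation makes the bandwidth figures in this section tangible. The sketch below uses the average compressed page image of 72 Kilobytes from the key numbers and the link speeds quoted above; the uncompressed size (an 8.5 x 11 inch page at 300 dpi, one bit per pixel), the 28.8 kbit/s modem speed and the reading of a Kilobyte as 1024 bytes are assumptions. Protocol overhead and sharing of the link among users are ignored, so real times would be longer.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time of one file over a link, ignoring overhead and contention."""
    return size_bytes * 8 / link_bits_per_second


COMPRESSED_PAGE = 72 * 1024          # average page image from the key numbers, in bytes
UNCOMPRESSED_PAGE = 2550 * 3300 / 8  # assumed 8.5 x 11 inch page at 300 dpi, 1 bit per pixel

LINKS = {
    "Ethernet, 10 Mbit/s": 10_000_000,
    "T1, 1.5 Mbit/s": 1_500_000,
    "modem, 28.8 kbit/s (assumed)": 28_800,
}

for name, rate in LINKS.items():
    print(f"{name}: compressed {transfer_seconds(COMPRESSED_PAGE, rate):.2f} s, "
          f"uncompressed {transfer_seconds(UNCOMPRESSED_PAGE, rate):.2f} s")
```

Under these assumptions a compressed page takes about 0.4 seconds on an otherwise idle T1 line but more than 5 seconds when sent uncompressed, which is consistent with the slow response reported for X-Windows viewing over the University of California network. Over a modem even the compressed page takes around 20 seconds.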
II 3. Conclusions and recommendations
Some of the main problem areas identified are:
1. Maintaining suites of client software
2. FTP and the Internet as a means of large scale delivery
3. Storage
4. (Infrastructure for) viewing and printing.
Re 1.
Most universities decided to “shift to the Web” sooner or later, thereby abandoning X-Windows and MS Windows applications. For developers this brings easier cross-platform portability, because developing, maintaining and upgrading several software clients in a distributed environment (Mac, PC, UNIX) is a cumbersome task. In the Web environment, development work is concentrated on the server side. Furthermore, support and training efforts decrease, and user documentation needs minimal emphasis. However, Web applications have fewer possibilities to provide “real-time” functionality such as image zooming.
In short, the Web cannot be ignored. The advantages plus the sheer user pull outweigh any disadvantages.
Re 2.
Large-scale Internet FTP transfer is not scalable with the current transmission schemes and restricted bandwidth. Suggestions for a robust “full pull” strategy have been discussed, but no conclusion has been reached.
Re 3.
Scalability of TULIP-like systems will also be problematic. Current mass storage technology and network bandwidth practically limit electronic collections to a small percentage of the total library collections. It is expected that a “staged” approach to electronic collection-building will emerge, composed of local servers for material of primary relevance, and remote (for instance regional) servers for material of secondary importance to the particular institution.
Re 4.
Image viewing on the screen requires a high-speed infrastructure, as (perceived) speed to the user’s desktop is crucial. The components of the system influencing this are:
Server/storage speed (optical storage on jukeboxes will be slower than magnetic storage)
Network speed (“real” LAN is a minimum)
Client machine speed
Application software image caching “smarts”.
Printing page images is an important concern. With the advent of printers that understand compressed Group IV fax images, and with careful attention to setup, this aspect becomes viable. The older laser printers are not equipped to deal with the large size of the 300 dpi page images in combination with the compression scheme used. Printing images on laser printers that are not directly IP-attached should be avoided.
When sent as compressed Group IV fax, the images are actually smaller than the PostScript files that you would normally see as output from a word processor. Even so, network congestion can be an issue and in some cases a central printing facility was used in which a high capacity laser printer is directly connected to the image server.
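To give a feel for why uncompressed printing was so costly, the size of an uncompressed bi-tonal page bitmap can be estimated and compared with the average compressed page image of 72 kilobytes from the key numbers. This is a sketch under stated assumptions: the page dimensions (US Letter, 8.5 by 11 inches) are chosen for illustration only, and actual journal page sizes differ.

```python
# Estimate the size of an uncompressed 1-bit-per-pixel page bitmap.
DPI = 300                          # scanning resolution named in the text
WIDTH_IN, HEIGHT_IN = 8.5, 11.0    # assumption: US Letter page
AVG_COMPRESSED_BYTES = 72 * 1024   # average page image, from the key numbers


def bitonal_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Bytes needed for a bi-tonal bitmap, each row padded to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


raw_bytes = bitonal_bytes(WIDTH_IN, HEIGHT_IN, DPI)
ratio = raw_bytes / AVG_COMPRESSED_BYTES

if __name__ == "__main__":
    print(f"Uncompressed page: {raw_bytes} bytes")
    print(f"Roughly {ratio:.0f} times the average compressed page image")
```

With these assumed dimensions the uncompressed bitmap is a little over one megabyte, roughly 14 times the average compressed page image, which is consistent with the congestion seen when images had to be sent to printers uncompressed.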
Other:
Bi-tonal (black/white) scanned page images are good for text and line art, but the quality is unsatisfactory for gray scale images and color artwork.
A well documented data structure is important for large scale delivery of electronic files. This structure must be medium independent in order to allow for different transfer methods and/or media.
Data shipments as large as those in the TULIP project need to be checked for corruption by some checksum facility. Ultimately a platform-independent, portable and robust checksum facility was adopted for TULIP: the “MD5” algorithm of RSA Data Security, Inc.
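As an illustration of such a check, the sketch below verifies received files against a list of expected MD5 digests using Python's standard hashlib module. It is not the tooling used in TULIP. The function names and the manifest layout (a mapping from relative file name to hexadecimal digest) are hypothetical, since the actual TULIP checksum file format is not described here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shipment(manifest: dict, root) -> list:
    """Return the names of files that are missing or whose digest differs.

    manifest maps a relative file name to its expected MD5 hex digest
    (a hypothetical layout chosen for this example).
    """
    failed = []
    for name, expected in manifest.items():
        path = Path(root) / name
        if not path.is_file() or md5_of_file(path) != expected.lower():
            failed.append(name)
    return failed
```

A receiving site could call verify_shipment after each transfer and request retransmission of only the files it returns, rather than of the whole dataset. Reading in chunks keeps memory use constant even for very large files.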