Appendix I to IX: University reports
We regret that the Web versions of Appendices I to IX are not available electronically.
Appendix X: Checklist of aspects to be considered for the implementation of a “digital library”
It is hard to specify exactly which aspects need to be considered when implementing a “digital library” at an institution, as this depends heavily on the installed base and on the desired functionality. We have therefore chosen to discuss both solutions that require a substantial investment of time and/or money and solutions that require (relatively!) less investment, in order to give a sense of the impact of these alternatives. A “low cost” solution could have a menu-driven, character-based interface that runs on any terminal emulation, typically with remote printing. Although this approach is not very sophisticated, the TULIP project has proven that it can work. Alternatively, as Web browsers have become commonplace and implementing Web-based systems has become less problematic, the Web route has emerged as a good solution at relatively low cost, providing much of the more sophisticated functionality usually associated with the higher investment solutions. A “high investment” solution could be characterized as having a (proprietary) graphical user interface that allows flexible page image viewing, sophisticated browsing and searching, and printing to a local printer.
The basic infrastructure needed will differ for each solution, and is described first, as it is the foundation for what follows. The rest of the paper lists the additional resources (hardware, software, infrastructure, organization) needed to implement a digital library system. Wherever appropriate and possible, some explanation is given of the requirements listed.
1. Lower cost solution
1.1. Basic infrastructure needed
It is assumed that a basic infrastructure is in place as described in the following points:
1.1.1 Existing systems
A local area network for locally attached PCs, Apple Macintoshes and other decentrally located workstations. Most common is a network based on Ethernet and TCP/IP. With such a network it is possible to give external users access via Dial-in SLIP or PPP protocols. Similar possibilities exist for Token Ring based networks with Novell Netware, Lan Manager or other network protocols.
Central computer center facilities with mainframe or minicomputers.
Locally placed desktop computers with network possibilities. In the case of a TCP/IP based network this requires Ethernet boards, network drivers and terminal emulation software and/or World Wide Web client software, such as NCSA Mosaic or Netscape Navigator.
Some form of high level printing will have to be possible. Printing facilities could range from centrally located high volume laser printers, such as Xerox Docutech type machines, to decentral office laser printers connected directly to the LAN.
1.1.2 Organization
IT production environment
As the description of systems infrastructure above indicates, the implementation of a digital library system requires a certain systems infrastructure. Similarly, it presupposes a certain capacity in the organization to maintain systems once these have been installed.
Development staff
If there is no development staff in house, a “turn key” solution, such as the OCLC system, which consists of a search engine plus a user interface, could be considered.
Library staff/relations with IT
A digital library system has to be developed in close cooperation between the library and the IT department. The library will have to involve end users to get feedback and will also have to train/promote.
1.2. Functionalities needed
To be developed by institution
1.2.1 User interface
There are basically two options for the development of the EJS on a restricted scale:
Based on an existing library system interface, most probably the OPAC, which is available to at least Telnet-VT100 based clients.
If there is already a library system in place, this probably needs adaptation. Most “older” systems are based on terminal emulation (“Telnet” or similar) to let users type in queries and browse the results. The bibliographic data format as, for instance, provided by Elsevier Science, is based on common practice in secondary databases and provides typical information units such as titles, authors, keywords, abstracts, etc. Systems already aimed at providing this type of information will probably need minimal change. Users are already accustomed to the interface and need no special training for it.
Main concerns are:
The development for converting and loading the EES data format to the internal format
The development/implementation of a file or database system to hold the large amount of page images. A possible solution is to purchase a standard hierarchical storage manager system, composed of an image server (a dedicated processor) and a “staged” storage area of magnetic disk, jukeboxed optical disk storage, and a tape library
Adequate linking mechanisms to connect bibliographic records in the library system to the associated pages in the page image collection
The printing infrastructure needs to be defined since the volume of page images is different from traditional office applications, for instance:
local print of bibliographic data on “ordinary” office laser printers;
central printing facility; or
dedicated decentral office laser printers especially set up for image printing (e.g. PostScript level 2 with CCITT Fax Group IV support), controlled by central server.
WWW interface
If there is no library system available or if it is decided to develop an entirely new system, the best option at this moment seems to be to use the very popular World Wide Web as user interface. This brings the following additional points into play:
The look-and-feel of a WWW based user interface should be developed and proper HTML files and HTML-generating procedures need to be developed
A choice needs to be made whether to develop full text search software or to purchase this software and adapt it to the other preferred components. A multitude of products is commercially available, such as Information Dynamics Basis+, BRS Search, Verity Topic, Oracle SQL*Text and others. The choice depends on the desired functionality. For instance, the sophistication of Boolean search, truncation/wildcard operators and presentation needs consideration. The integration aspects also need attention, e.g. is a complete turn-key full text database preferred, or are a number of different toolkits from several suppliers favored?
Online viewing of page images on computer displays should be avoided if sufficient performance cannot be guaranteed.
1.2.2 Searching
Information searched:
Bibliographic data only.
Searching methods:
Boolean search
Truncation/wildcard operators
Presentation of hits in reverse chronological order.
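The three searching methods listed above can be illustrated with a minimal sketch. It is not taken from any TULIP implementation: the `Record` fields, the function names and the convention that a trailing `*` marks truncation are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical bibliographic record; the field names are illustrative."""
    title: str
    authors: str
    abstract: str
    published: date


def term_to_regex(term):
    """Whole-word, case-insensitive match; a trailing '*' truncates the term."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.compile(pattern, re.IGNORECASE)


def matches(record, term):
    """True if the term occurs in the title, authors or abstract."""
    text = " ".join((record.title, record.authors, record.abstract))
    return term_to_regex(term).search(text) is not None


def search(records, all_of=(), any_of=(), none_of=()):
    """Boolean search: AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for rec in records:
        if not all(matches(rec, t) for t in all_of):
            continue
        if any_of and not any(matches(rec, t) for t in any_of):
            continue
        if any(matches(rec, t) for t in none_of):
            continue
        hits.append(rec)
    # Present hits in reverse chronological order (newest first).
    return sorted(hits, key=lambda r: r.published, reverse=True)
```

A query such as `search(records, all_of=["ceram*"], none_of=["polymer*"])` would return the records mentioning “ceramic” or “ceramics” but not “polymer”, newest first. A production system would use an index rather than scanning every record, which is one reason the purchase of a text management database is listed among the options.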
1.2.3 Browsing
Selection by choosing journal title -> issue -> article.
1.2.4 Viewing
Bibliographic data only in the case of a character based solution. Viewing of (downgraded) page images is possible in the Web solution, but performance is a consideration.
1.2.5 Printing
Local print of bibliographic data
Central printing facility for page images
Decentral (identified) office laser printers for page images controlled by server.
1.3. Additional infrastructure needed
To be acquired.
1.3.1 Text management database
1.3.2 Standard Hierarchical Storage Management system (HSM) for page images
Image server
Magnetic disk storage, optical disk storage, tape library.
1.3.3 Printing
High volume central printing facility
Dedicated office laser printers, especially set up for image printing.
1.3.4 Network
No upgrades assumed.
1.4. Organization
Attention points:
Project management
Introduction
Promotion
Training
Support.
2. Higher investment solution
2.1. Basic infrastructure needed
2.1.1 Existing systems
Additional to the lower cost solution described above, the following points need to be considered for development/implementation of this solution; the main difference lies in the more sophisticated display of page images on computer displays.
High capacity network
The network needs to have sufficient bandwidth for fast delivery of (large) page images from the central server to the desktop machines. Dial-in connections should use the highest speeds available: at least 14,400 bits/sec, but preferably 28,800 bits/sec or higher (e.g. 64 Kbits/sec ISDN connections).
Desktop computers need to be high end PCs (486 or Pentium), Power Macintoshes and/or workstations, able to provide real-time viewing, browsing and zooming of (anti-aliased) page images in World Wide Web software (Mosaic, Netscape, etc.) or in dedicated page viewing helper applications.
Printing facilities
Fast office laserprinters, running PostScript level 2 with CCITT Fax Group IV support.
2.1.2 Organization
IT production environment
Development staff
Library staff/relations with IT.
2.2. Functionalities needed
2.2.1 User interface
Graphical User Interface, WorldWide Web based (Netscape, Mosaic, etc.).
Integration with other collections in same user interface. Formats to be supported:
Page images
PDF files
Word processors
Spreadsheets
SGML files
HTML files
Graphics formats (JPEG, TIFF, GIF, EPS)
2.2.2 Searching
Information searched:
Bibliographic data
Full text search by using raw ASCII files
Searching through different collections in one go (requires deduplication)
Searching methods:
Boolean search
Truncation/wildcard operators
Phonetic search
Presentation of hits in reverse chronological order
Relevance ranking of hits
E-mail notification by means of user profile.
2.2.3 Browsing
Selection by choosing journal title -> issue -> article.
2.2.4 Viewing
Bibliographic data
Anti-aliased 100 dpi or 75 dpi page images.
2.2.5 Printing
Local print of bibliographic data
Central printing facility for page images
User-designated, decentralized laser printers for page images controlled by server.
2.3. Additional infrastructure needed
To be acquired
2.3.1 Server + software
2.3.2 Sophisticated text management database
2.3.3 Hierarchical Storage Manager (HSM) with large magnetic cache for page images
Image server
Magnetic disk storage, optical disk storage, tape library.
2.3.4 Printing
High volume central printing facility
Dedicated office laserprinters, especially set up for image printing.
2.3.5 High capacity network
No upgrades assumed.
2.4. Organization
Attention points:
Project management
Introduction
Promotion
Training
Support
--------------------------------------------------------------------------------
Appendix XI: Internet delivery problems
The different problems concerning Internet delivery ranged from operational problems at the sending or the receiving side, to more generic problems with the Internet and the FTP protocol:
For one, problems occurred at the receiving end of the Internet delivery process, such as magnetic disk space limitations, account quotas, invalid passwords, account restrictions, invalid permissions and university hosts off-line. All of these situations occurred on a frequent basis with most of the participating sites. When one of these situations occurred, the delivery of the Dataset could not proceed. Since these responses were not anticipated, they were not “programmed” into the original system and investigative work was needed each time to determine the reason why the Dataset delivery aborted or did not take place. This in turn would lead to the delayed delivery of a Dataset.
There have been unanticipated bottlenecks, not only on the receiving side, but also on the delivery side. The setup at Article Express was developed and implemented in 1992, based on certain assumptions regarding automated procedures, human resource capacity, the stability of Internet transfer, and the universities all receiving each Dataset together, that is, in one batch. For instance, the idea was to have only a fairly small number of Datasets online to fulfil deliveries.
At the point in the project where the need to keep more Datasets online became apparent, in order to deal with the different speeds of Dataset loading at the universities, Article Express started adding more disks. However, the SCO UNIX-based 386/486 systems at Article Express had unforeseen restrictions in dealing with multi-gigabyte hard disks.
Even though the Internet service providers claim that they provide continuous connection to the Internet, difficulties were frequent. The TULIP delivery mechanism was set up to transfer material in a continuous stream between two UNIX hosts. Any interruption of this process resulted in incomplete deliveries and deadlock situations.
A manifestation of intermittent FTP connections is a condition known as “text file busy”. This is the error message which FTP reports, when an FTP process attempts to overwrite or replace a file which has been previously locked by an FTP transfer, which was in turn aborted due to communication difficulties on the Internet. When this condition was reported by the host universities, a deadlock situation occurred, which could only be resolved through manual intervention at either end.
Since this was an unanticipated condition when the delivery system was implemented, there was no automated facility for alerting the university. Therefore, the approach taken was to wait for the university to clear up or delete any pending FTP processes that had “aged” more than 24 hours or so. This, of course, prevented a successful validation step in the delivery system procedures, which in turn resulted in delayed delivery of the Dataset.
The implementation of the FTP program operates in batch mode. A series of FTP commands is written to a local file and the file is submitted to the FTP program for execution. The results of the FTP program, i.e. the output status, are then written to a local disk file. After all of the commands are executed, the status file is parsed and searched for success and error codes. For most error conditions, this mechanism works well. It also works quite well when there are no error conditions. It does not work well when Internet communications are interrupted and no status is returned from the FTP program, particularly in deadlock situations.
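The parsing step described above can be sketched as follows. This is a minimal illustration, not the Article Express implementation: the function name, the log layout (one server reply per line, starting with a three-digit code) and the `expected_transfers` parameter are assumptions. It relies only on the standard FTP reply-code classes (4xx and 5xx are failures, 226 reports a completed transfer, 221 answers QUIT), and it assumes the batch contains only file transfers, since 226 is also returned after a directory listing.

```python
import re

# A server reply line starts with a three-digit code followed by ' ' or '-'.
REPLY = re.compile(r"^(\d{3})[ -]")


def classify_ftp_log(log_text, expected_transfers):
    """Classify a batch FTP status log as 'ok', 'error' or 'incomplete'."""
    codes = []
    for line in log_text.splitlines():
        match = REPLY.match(line)
        if match:
            codes.append(int(match.group(1)))

    # 4xx = transient failure, 5xx = permanent failure.
    if any(code >= 400 for code in codes):
        return "error"

    # 226 marks a completed transfer; 221 is the reply to QUIT.
    # A log that stops early (or is empty) must not count as success.
    if codes.count(226) < expected_transfers or 221 not in codes:
        return "incomplete"
    return "ok"
```

The design choice here is that missing status is reported as “incomplete” instead of being treated as the absence of errors, which is the case the original mechanism did not handle. Detecting a session that never returns at all would additionally require running the FTP program under a time limit.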
--------------------------------------------------------------------------------
Appendix XII: Gray scaling and anti-aliasing
300 dpi page images are too large to fit on most common computer displays. Consider that an average 300 dpi page image is 7" wide and 10" high, which is 2,100 pixels wide and 3,000 pixels high. A typical 14" computer display running in SVGA is 600 pixels high and 800 wide. At least the full width of the page should be entirely visible for good readability. This means that the 2,100 pixels should be down-scaled by two thirds to 700 pixels, to allow for space for window borders and scroll bars, giving a total width of 800 pixels.
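The arithmetic in this paragraph can be checked with a few lines of code. The function and parameter names are illustrative only; the numbers are those given in the text.

```python
def scaled_page(width_in, height_in, scan_dpi, target_width_px):
    """Return the source size, the scale divisor and the scaled size (pixels)."""
    src_w = round(width_in * scan_dpi)
    src_h = round(height_in * scan_dpi)
    divisor = src_w / target_width_px
    return (src_w, src_h), divisor, (target_width_px, round(src_h / divisor))


src, divisor, scaled = scaled_page(7, 10, 300, 700)
print(src, divisor, scaled)  # (2100, 3000) 3.0 (700, 1000)
```

A divisor of 3 corresponds to an effective resolution of 100 dpi. The scaled page is still 1,000 pixels high, so on a display that is 600 pixels high the page has to be scrolled vertically even though its full width is visible.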
The following 300 dpi input (enlarged 10 times) is taken as an example:
When simple down scaling to one third (i.e. 300 dpi to 100 dpi) is applied, every rectangular area of 3 × 3 pixels is considered for each pixel in the result. If only 0-4 pixels are black, the resulting pixel is white. If 5-9 pixels are black, the resulting pixel is black. The end result (enlarged 30 times) is as follows:
However, applying the anti-aliasing technique yields better images. For every rectangle a gray-value percentage is calculated, based on the number of white and black pixels. For instance, zero black pixels yields 0%, one black pixel means 11%, two pixels equals 22%, etc. and nine black pixels signifies 100%. Because computer displays are able to show different grades of gray, the anti-aliased result of the down scaled example (enlarged 30 times) looks like this:
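Both down scaling rules described in this appendix can be written as one short routine. This is a sketch under stated assumptions, not the code used in the project: the bitmap is represented as a list of rows of 0/1 values (1 = black), the output is a gray percentage per pixel (0 = white, 100 = black), and partial blocks at the right and bottom edges are simply dropped.

```python
def downscale(bitmap, factor=3, anti_alias=True):
    """Downscale a 1-bit bitmap by an integer factor.

    With anti_alias=False each block becomes black or white by majority;
    with anti_alias=True each block becomes a gray percentage.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    area = factor * factor
    out = []
    for top in range(0, rows - rows % factor, factor):
        out_row = []
        for left in range(0, cols - cols % factor, factor):
            # Count the black pixels in this factor x factor block.
            black = sum(bitmap[r][c]
                        for r in range(top, top + factor)
                        for c in range(left, left + factor))
            if anti_alias:
                out_row.append(round(100 * black / area))
            else:
                # For 3 x 3 blocks: 0-4 black gives white, 5-9 gives black.
                out_row.append(100 if 2 * black > area else 0)
        out.append(out_row)
    return out
```

For a 3 × 3 block with five black pixels, simple down scaling yields a fully black pixel while anti-aliasing yields 56% gray; a block with a single black pixel disappears under simple down scaling but survives as 11% gray. This retained partial coverage is what makes thin strokes remain legible in the anti-aliased image.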
--------------------------------------------------------------------------------
Appendix XIII: Interview Guide
We regret that the Web version of this appendix is not available electronically.
--------------------------------------------------------------------------------
Appendix XIV: Detailed findings of the qualitative research
The interview guides used for the focus groups and the one-on-one interviews have been adapted for each site in order to fit the particular implementations of TULIP and/or the specific situations on each campus (see Appendix XIII for the interview guide used at MIT). Nevertheless, all interview guides covered the same issues, so we were able to compare the results.
The major findings are reported according to the sequence of the interview guides, which first addressed a few generic topics regarding information behavior (problems encountered, expectations for the future, electronic information sources, etc.), before discussing the specific TULIP topics. In short:
1. Current information and search behavior
2. Electronic information sources
3. Evaluation of TULIP.
When reading these findings it should be kept in mind that only three TULIP sites have been involved in the qualitative research. Most of the findings of the research activities conducted by other universities (to the extent that they are comparable) concur with these results.
Ad 1. Current information and search behavior
Patterns used to review journals
A first selection of relevant articles is made by looking through the table of contents. The next step is reading the abstract and skimming the article to see if any pictures or figures catch the eye.
Although abstracts (in this discipline, Materials Science) are considered to be a helpful scanning device, their quality could be improved: only somewhat more than half of the abstracts (60-75%) are considered to describe the full article adequately.
In general, respondents do not read most articles in-depth: an estimated 5-30% is read thoroughly.
Since scientific papers require considerable time to read in-depth, photocopies are made to bring back to the office/lab or home. Faculty members seem to make fewer photocopies than students.
Most students read journal articles in their office, while faculty members also read journals at home.
Graduate students almost always have a particular subject in mind when looking for information. Therefore, most visits to the library begin with a database search (INSPEC, Current Contents).
Faculty also make use of online searches, but are more likely to know what they are looking for, for instance because their network of personal contacts has pointed them to relevant sources. On the other hand, they are more likely to browse a journal or read it for pleasure.
A high percentage of both faculty and students indicated that they subscribe to one or more core journals from which they read a high percentage of the articles.
Physical access to the libraries is frequently mentioned as a problem, especially when more than one library has to be visited (these are sometimes scattered around campus), which is considered a cumbersome task. Distance to the library, as well as things like traffic and finding a parking space, appears to directly influence the likelihood/frequency of library visits.
The lack of availability of information sources is another problem mentioned frequently. Canceled subscriptions, journals not subscribed to, journals out to be bound or placed in storage (archival material) and many other barriers occur when looking for information.
Cost
In particular, the costs related to the use of electronic databases have been mentioned as prohibitive (e.g. Inspec, Current Contents). Meanwhile, cost for both primary and secondary information is rarely “out of pocket”, and therefore plays a minor role on a day-to-day basis, either lost in the overall university structure or covered by grants.
Recent changes and expectations for the future
According to many respondents, their current search capabilities have improved compared to the recent past as a result of access to such services as Inspec and the Internet. Respondents not only expect more and faster online information to become available in the future, but also strongly desire this.
Ad 2. Electronic information sources
The time spent on a computer varies considerably: from 1 hour per day to “all day”. The computer is used for a wide variety of activities: word processing, communication (e-mail), simulations and other scientific “number-crunching” applications, electronic searching, “surfing” the Internet, etc. Electronic information sources are used “when needed”, which can range from several times per week to a few times per year.
Use of sources
The most frequently used electronic information sources are:
INSPEC
Science Citation Index
Current Contents: considered to be less useful (because abstracts are not included) and very expensive.
Within UC, INSPEC and Current Contents are used far more frequently than other electronic sources. Appreciated elements of INSPEC are the online abstracts, back issues as far as 1968, and the available information on conference proceedings. The greater scope of materials and the automatic background searches (mailed every couple of weeks), are considered as positive points of Current Contents.
The Internet is not widely used for gathering information: it is considered to be time consuming and therefore not very productive.
Ad 3. Evaluation of TULIP
Launching TULIP
The general feeling at the sites involved in the qualitative user research is that getting started on TULIP did not require a great deal of training, because it is quite simple to use. Users had different expectations of TULIP, ranging from a feeling that it would be a “toy”, to being intrigued about its ability to provide full text. Some were pleasantly surprised to find out what TULIP did offer, others were disappointed by its slow speed or technical problems.
Frequency of use
The use of TULIP by the individual researchers is rather infrequent. Most access is sporadic, as needed. Users believe that TULIP would be used more often if it were expanded to include more journals. TULIP usage is, of course, very much influenced by the user friendliness of the implementation: at UC several students tried TULIP only once or twice, because they never succeeded in installing the necessary X-Windows software to run TULIP on their PCs.
Advantages/benefits
Overall, there is enthusiasm about the concept of TULIP: desktop access to full text articles. The possible time-savings and convenience offered by desktop access to articles is considered very favorably. It replaces going to the library and it can be used any time of the day.
Disadvantages/limitations
The following disadvantages and/or limitations were mentioned. These were sometimes reasons for stopping or reducing the use of TULIP:
Insufficient journal coverage: this is generally considered to be the greatest limitation; several core journals are not Elsevier publications and the time coverage is limited; it does not go back far enough in time.
Difficulties in accessing TULIP: no X-Windows available on the desktop, no connection to printers in the lab or office, and no knowledge of how to access TULIP with a Mac are a few of the difficulties mentioned. At UC, virtually all respondents had difficulties installing the system due to equipment and software problems.
Printing problems: at all sites printing is considered to be too slow (“every page is like agony”). At UC, most respondents could not figure out how to print from TULIP. After spending (too) much time, nearly all abandoned their attempts.
The print quality of text is considered as acceptable, but opinions about the quality of the photographs and graphs are much less favorable.
The quality of the image on screen is strongly dependent on the implementation. Users at MIT felt that the TULIP image quality was quite good, while those at UM complained about poor image quality of the Web implementation. However, many respondents do not like to read information on a computer screen.
Speed: mostly at UC, complaints have been heard about TULIP’s response times.
A particular problem at UC is switching between TULIP and MELVYL: respondents consider loading X-Windows time consuming and a task that should not have to be done by the user. The fact that each time the user re-enters MELVYL from the TULIP image viewer he must begin a new search and reset the defaults, is also considered to be a serious drawback.
The effect of TULIP on readership of paper journals
The general feeling is that TULIP represents an addition to, rather than replacement of, the paper journals. The serendipity of paper journals is missed when doing an electronic search. People still like to spend some time with the printed version, which is easier to browse and where there is a greater likelihood of “finding without looking”. Many respondents simply enjoy browsing through a journal or book: there is an “emotional” relation with paper products. The “fun” aspect of reading a book or a journal should not be underestimated: “Seeing material on the computer is work, holding a book is fun”.
According to the users, usage of TULIP does not appear to have a significant impact on readership of the journals involved. (However, there is some evidence that the publicity and communication efforts around TULIP have increased the awareness of the TULIP titles and, as a consequence, their usage.)
Evaluation of specific features
Searching: This appears to be one of the most valuable features of TULIP. Providing full text makes TULIP superior to a service like INSPEC. Despite these positive perceptions, many users have difficulty in using the search feature. This is sometimes related to more general difficulties with keyword searching.
Profiling: Most users were unfamiliar with this feature, but were quite enthusiastic about the concept. Those who tried the profiling feature (at UM), however, had minimal success.
The help feature: Most respondents did not use this feature.
Browsing: Most respondents did not (like to) use the browsing feature. Browsing through a hard copy is preferred (emotional touch, quicker, easier).
Timeliness of materials: Most respondents felt that the information on TULIP is timely, but many of them did not appear to be aware of the actual updating of the TULIP database. When asked specifically, a few respondents commented that there seems to be a delay in adding new issues.
Response time: There were numerous complaints at UC, and to a lesser extent at UM, about TULIP’s processing speed; in particular, printing documents was felt to be much too slow. At MIT, TULIP’s response time for articles and printed output was considered to be acceptable.
Missing items/suggestions for improvement
As noted above, the major shortcoming of TULIP is its limited coverage.
TULIP could be improved by expanding it to:
all the core journals
conference proceedings, books, etc.
back issues
secondary journals and trade journals (i.e. not scientific journals)
visual images (video, sound, 3D).
However, respondents envisage some other improvements as well:
The linking of TULIP information to other information sources is frequently suggested by the respondents. Mentioned are links to:
references
other publications of an author
company profiles
Science Citation Index.
There have been a few suggestions regarding alternative search methods. Allowing for keyword searching in the body of the text has been mentioned, as well as backward and forward “chaining”: linkages to references mentioned in an article (backward) and linkages to other articles in which the article has been referenced (forward).
Simplifying and making the access to TULIP faster and easier has already been mentioned. One of the suggestions in this context was to put TULIP on the Web.
Adding a bibliographic formatting feature that would allow users to “grab” certain bibliographic citations and put these directly into a paper, would also be appreciated by several users.
Cost/payment
Cost is of moderate importance to students, because they do not personally subscribe to journals. Faculty members are more concerned: they fear that the cost of using the final product may be prohibitive. Respondents indicated avoidance of various databases currently available to them because of expense.
At UC respondents were asked how electronic information should be billed. All concurred that it should be charged for availability only, not on a “per use” basis.
--------------------------------------------------------------------------------