Using SGML and XML in Scientific Publishing
Elsevier has a long tradition of using SGML (Standard Generalized Markup Language) for its products. In the 1980s, the CAPCAS DTD (Document Type Definition) was created to capture article frontmatters. In 1992, the first DTD for full-length scientific articles was developed.
SGML documents are structured (“tagged”) independently of the presentation. This enables Elsevier to base its workflow on the “SGML-first” principle and publish the journal articles from a single SGML source in printed form and electronically on ScienceDirect.
A DTD describes which tags and special characters may be used in the files and which rules apply to these tags. Since 1994, Elsevier has developed several DTDs for scientific journal articles. These DTDs have been made publicly available, which has enabled several other Scientific, Technical and Medical (STM) publishers to use them as the basis of their own DTDs.
As a result of the CAP (Computer-Aided Production) workflow, implemented in the 1990s, the vast majority of Elsevier’s 1,800+ primary and tertiary journals are now produced using SGML. SGML files, artwork files and PDF files (Portable Document Format from Adobe) are stored in Elsevier’s Electronic Warehouse and delivered to a variety of electronic platforms. Metadata for the articles is stored in a proprietary format called EFFECT as well as in an XML-based format called CONTRAST.
Large-scale implementation of the CAP workflow began with the release of the full-length article DTD version 3.0 in November 1995 and continued with the implementation of DTD 4.1, released in November 1997. Updates of the DTD followed in February 2000 (DTD 4.2) and in January and March 2001 (DTD 4.3).
Together with XML, the new DTD adopts several other recent standards:
- Unicode, the character set of XML.
- CALS tables, enhancing interoperability of tables in journal articles and existing tools.
- MathML, making mathematical formulae accessible to existing and newly developed tools for the publication and exchange of mathematical information.
- XLink, used to link to documents and resources on the web.
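As a rough illustration of how these standards appear together in one document, the sketch below parses a small XML fragment that contains a Unicode character, a MathML expression and an XLink reference. The element names `para` and `ref`, the sample text and the example.org URL are assumptions made for this sketch and are not taken from any Elsevier DTD; CALS tables are left out for brevity. Only the two namespace URIs come from the MathML and XLink specifications.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the MathML and XLink specifications.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical fragment: "para" and "ref" are illustrative names only.
SAMPLE = """<para xmlns:mml="http://www.w3.org/1998/Math/MathML"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  The angle \u03b1 satisfies
  <mml:math><mml:mi>\u03b1</mml:mi><mml:mo>&lt;</mml:mo><mml:mn>1</mml:mn></mml:math>
  (see <ref xlink:type="simple" xlink:href="https://example.org/fig1">Fig. 1</ref>).
</para>"""


def summarize(xml_text):
    """Return the MathML identifiers and the XLink targets found in xml_text."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses namespaced names as "{namespace-uri}local-name".
    identifiers = [mi.text for mi in root.iter(f"{{{MATHML_NS}}}mi")]
    href = f"{{{XLINK_NS}}}href"
    links = [el.get(href) for el in root.iter() if el.get(href) is not None]
    return identifiers, links


if __name__ == "__main__":
    identifiers, links = summarize(SAMPLE)
    print(identifiers, links)
```

Because the formula and the link are carried in standard namespaces, a generic XML tool can locate them without knowing anything about the surrounding article structure.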
To keep this family manageable, we have developed the concept of the “Common Element Pool” (CEP), which contains a large set of elements shared between the various DTDs. The actual product DTDs only describe the top level structure of the product, which is filled in with CEP elements.
Early in the 1990s, Elsevier developed the EFFECT format (Exchange Format For Electronic Components and Texts). It is a private Electronic Data Interchange format for the transport and interchange of published articles in SGML format, their associated files and their metadata. With the advent of XML there is no longer a need to use a private format for such data exchanges. We have developed W3C Schemas for content transport, which will replace the EFFECT format. (See also the Elsevier DTDs and transport schemas page.)
Starting with DTD 4.2, Elsevier has developed the “Tag by Tag” format for its DTD documentation. Tag by Tag documentation has been published for DTD 4.2 and 4.3 and is also available for the XML Journal Article DTD 5.0.
Good documentation should go along with good validation, both to capture errors efficiently and consistently, and to enforce quality requirements with business partners.
Recently we have reimplemented our validation with our new Vtool. It is a configurable rules-based tool allowing us to check many aspects that go beyond the validation by a parser. The rules file is in XML format. The tool is able to check not only SGML or XML files, but any tag-based file. In addition, it contains libraries to create tag-based files from non-tag-based files, such as PDF and artwork files.
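The general idea of a rules file in XML driving checks that a parser alone would not perform can be sketched as follows. This is not the Vtool and does not use its rules format, which is not described here: the rule names `require-child`, `require-attribute` and `max-occurs`, their attributes and the sample documents are all hypothetical. Unlike the tool described above, this sketch only handles well-formed XML input.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file; every element and attribute name is invented for this sketch.
RULES_XML = """<rules>
  <require-child parent="article" child="title"/>
  <require-attribute element="link" attribute="href"/>
  <max-occurs element="abstract" count="1"/>
</rules>"""


def load_rules(rules_text):
    """Turn the XML rules file into a list of (rule kind, arguments) pairs."""
    return [(rule.tag, dict(rule.attrib)) for rule in ET.fromstring(rules_text)]


def check(document_text, rules):
    """Apply each rule to the document and return a list of error messages."""
    root = ET.fromstring(document_text)
    errors = []
    for kind, args in rules:
        if kind == "require-child":
            for parent in root.iter(args["parent"]):
                if parent.find(args["child"]) is None:
                    errors.append(f"<{args['parent']}> lacks required child <{args['child']}>")
        elif kind == "require-attribute":
            for element in root.iter(args["element"]):
                if args["attribute"] not in element.attrib:
                    errors.append(f"<{args['element']}> lacks attribute '{args['attribute']}'")
        elif kind == "max-occurs":
            found = sum(1 for _ in root.iter(args["element"]))
            if found > int(args["count"]):
                errors.append(f"<{args['element']}> occurs {found} times, limit is {args['count']}")
    return errors


if __name__ == "__main__":
    rules = load_rules(RULES_XML)
    for message in check("<article><abstract/><abstract/><link/></article>", rules):
        print(message)
```

Keeping the rules in a separate file means that quality requirements can be changed without touching the checking code.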
Schema parsers promise to take parser-based validation a step further, but we expect that they will at best only partly take over the work of our Vtool.
See the Elsevier DTDs and transport schemas page. Another useful resource is the ScienceDirect site.
We make a metadata package available to our readers on the metadata page. The package extracts metadata from a journal article or a book chapter and writes it as RDF data.
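To give an impression of what extracting metadata and writing it as RDF can look like, here is a minimal sketch. It is not the package mentioned above. The input element names (`article`, `title`, `author`), the `id` attribute, the sample values and the choice of Dublin Core properties are assumptions made for illustration; only the RDF and Dublin Core namespace URIs are standard.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and the Dublin Core element set.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical input: these element names are not taken from any Elsevier DTD.
ARTICLE = """<article id="https://example.org/article/1">
  <title>A sample title</title>
  <author>A. Author</author>
  <author>B. Author</author>
</article>"""


def article_to_rdf(xml_text):
    """Extract title and authors and serialize them as RDF/XML."""
    source = ET.fromstring(xml_text)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": source.get("id")}
    )
    ET.SubElement(description, f"{{{DC_NS}}}title").text = source.findtext("title")
    for author in source.findall("author"):
        ET.SubElement(description, f"{{{DC_NS}}}creator").text = author.text
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    print(article_to_rdf(ARTICLE))
```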
In 1997 Elsevier, together with other STM publishers, initiated the STIX workgroup, a workgroup of STIPUB. The aim of the workgroup is to provide a proper solution for the special characters used in STM publications. To that end a comprehensive overview of special characters used in publications of its member companies was compiled.
The first goal of the workgroup is to have unique, universally standardized computer codes for the special characters. To that end the workgroup has collaborated with the MathML workgroup and submitted documented lists of special characters for inclusion in Unicode. As a result many characters used in STM publications are now part of Unicode 3.2.
The second goal of the workgroup is the provision of a comprehensive set of fonts, to be made available under a royalty-free license to anyone. See the mission statement and the progress reports on the STIXfonts website.