Using SGML and XML in Scientific Publishing

SGML DTDs
XML DTDs
Other document types
Quality Control: Documentation and Validation
Available DTDs and their documentation
Metadata for XML articles and chapters
Characters for STM publishing

SGML DTDs

Elsevier has a long tradition of using SGML (Standard Generalized Markup Language) for its products. In the 1980s, the CAPCAS DTD (Document Type Definition) was created to capture the front matter of articles. In 1992, the first DTD for full-length scientific articles was developed.

SGML documents are structured (“tagged”) independently of the presentation. This enables Elsevier to base its workflow on the “SGML-first” principle and publish the journal articles from a single SGML source in printed form and electronically on ScienceDirect.

A DTD describes which tags and special characters may be used in the files and which rules apply to these tags. Since 1994, Elsevier has developed several DTDs for scientific journal articles. These DTDs have been made publicly available, which has enabled several other Scientific, Technical and Medical (STM) publishers to use them as the basis of their own DTDs.

As a result of the CAP (Computer-Aided Production) workflow, implemented in the 1990s, the vast majority of Elsevier’s 1,800+ primary and tertiary journals are now produced using SGML. SGML files, artwork files and PDF files (Portable Document Format from Adobe) are stored in Elsevier’s Electronic Warehouse and delivered to a variety of electronic platforms. Metadata for the articles is held in a proprietary format called EFFECT as well as in an XML-based format called CONTRAST.

Large-scale implementation of the CAP workflow began with the release of the full-length article DTD version 3.0 in November 1995 and continued with the implementation of DTD 4.1, released in November 1997. Updates of the DTD followed in February 2000 (DTD 4.2) and January and March 2001 (DTD 4.3).

 

XML DTDs

Because a large-scale workflow based on SGML was already in place, Elsevier, like many other STM publishers, had no need to switch rapidly to the new XML standard for generalized markup. In October 2002, Elsevier published its new XML DTD for journal articles, called the Journal Article DTD version 5.0 (JA DTD 5.0). This DTD has been in use since 2004.

Together with XML, the new DTD adopts several other recent standards:


Other document types

Until 2002, Elsevier applied its SGML-based workflow only to journal articles. Since 2005, the workflow has been extended to Elsevier’s book publication program. While the Full Length Article DTDs up to version 4 were each an only child, the Journal Article DTD version 5.0 is a member of a family of DTDs. The Elsevier Books DTD, a successor of the Health Sciences Books DTD, is another member of this family.

To keep this family manageable, we have developed the concept of the “Common Element Pool” (CEP), which contains a large set of elements shared between the various DTDs. The actual product DTDs only describe the top level structure of the product, which is filled in with CEP elements.
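The relationship between a product DTD and the shared pool can be illustrated with a minimal sketch. It models a document type as a mapping from element names to allowed child elements. All element names here are hypothetical and are not taken from the Elsevier DTDs; the sketch only shows the idea of product-specific top-level declarations combined with one common set of elements.

```python
# Hypothetical element names; each entry maps an element to its allowed children.
COMMON_ELEMENT_POOL = {
    "section": {"section-title", "para", "section"},
    "section-title": set(),
    "para": {"italic", "bold"},
    "italic": set(),
    "bold": set(),
}

# Product-specific declarations describe only the top-level structure.
JOURNAL_ARTICLE_TOP = {"article": {"section"}}
BOOK_CHAPTER_TOP = {"chapter": {"section"}}


def build_document_type(top_level):
    """Combine a product's top-level declarations with the shared pool."""
    overlap = top_level.keys() & COMMON_ELEMENT_POOL.keys()
    if overlap:
        # A product declaration must not redefine a pool element.
        raise ValueError(f"redeclared pool elements: {sorted(overlap)}")
    return {**COMMON_ELEMENT_POOL, **top_level}


journal_article = build_document_type(JOURNAL_ARTICLE_TOP)
book_chapter = build_document_type(BOOK_CHAPTER_TOP)

# Elements that both document types have in common: exactly the pool.
shared = journal_article.keys() & book_chapter.keys()
```

In this sketch, a change to a pool element is picked up by every product that is built from it, which is the maintenance benefit the pool concept aims for.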

Early in the 1990s, Elsevier developed the EFFECT format (Exchange Format For Electronic Components and Texts). It is a private Electronic Data Interchange format for the transport and interchange of published articles in SGML format, their associated files and their metadata. With the advent of XML, there is no longer a need to use a private format for such data exchanges. We have developed W3C Schemas for content transport, which will replace the EFFECT format. (See also the Elsevier DTDs and transport schemas page.)

 

Quality Control: Documentation and Validation

Our vast experience with workflows based on SGML has made it clear that a DTD alone is insufficient. Good documentation must clarify the interpretation of the tags and specify the ways in which they are used.

Starting with DTD 4.2, Elsevier developed the “Tag by Tag” format for its DTD documentation. Tag by Tag documentation has been published for DTD 4.2 and 4.3, and is also available for the XML Journal Article DTD 5.0.

Good documentation should go along with good validation, both to capture errors efficiently and consistently, and to enforce quality requirements with business partners.

We recently reimplemented our validation in a new tool, Vtool. It is a configurable, rules-based tool that lets us check many aspects beyond what a parser can validate. The rules file is in XML format. The tool is able to check not only SGML or XML files, but any tag-based file. In addition, it contains libraries to create tag-based files from non-tag-based files, such as PDF and artwork files.
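The idea of rules that are read from an XML file and applied to a tagged document can be sketched as follows. This is only an illustration: the rule format, the attribute names `requires-child` and `requires-attribute`, and the sample document are all hypothetical, and nothing here reflects the actual Vtool rule syntax or implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file; the real Vtool rule format is not described here.
RULES_XML = """
<rules>
  <rule id="R1" element="article" requires-child="title"/>
  <rule id="R2" element="author" requires-attribute="id"/>
</rules>
"""

# Hypothetical document to be checked.
DOCUMENT_XML = """
<article>
  <title>Sample</title>
  <author id="a1">A. Writer</author>
  <author>B. Writer</author>
</article>
"""


def load_rules(rules_xml):
    """Read every <rule> element into a plain dictionary of its attributes."""
    return [dict(rule.attrib) for rule in ET.fromstring(rules_xml).iter("rule")]


def check(document_xml, rules):
    """Apply each rule to every matching element and collect messages."""
    root = ET.fromstring(document_xml)
    messages = []
    for rule in rules:
        for element in root.iter(rule["element"]):
            child = rule.get("requires-child")
            if child is not None and element.find(child) is None:
                messages.append(
                    f"{rule['id']}: <{element.tag}> lacks child <{child}>"
                )
            attribute = rule.get("requires-attribute")
            if attribute is not None and attribute not in element.attrib:
                messages.append(
                    f"{rule['id']}: <{element.tag}> lacks attribute '{attribute}'"
                )
    return messages


violations = check(DOCUMENT_XML, load_rules(RULES_XML))
for message in violations:
    print(message)
```

Keeping the rules in a separate file, as in this sketch, means that quality requirements can be changed without changing the checking code.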

Schema parsers promise to take parser-based validation a step further, but we expect that they will at best only partly take over the work of our Vtool.

 

Available DTDs and their documentation

The Elsevier XML DTDs, together with accompanying documentation, are available on the Elsevier DTDs and transport schemas page. Another useful resource is the ScienceDirect site.

 

Metadata for XML articles and chapters

We make a metadata package available to our readers on the metadata page. The package extracts metadata from a journal article or a book chapter and writes it as RDF data.
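The kind of transformation involved can be sketched in a few lines. This is not the metadata package itself: the element names, the `http://example.org/article/` base URI and the choice of Dublin Core properties are assumptions made for illustration, and the output is written as simple N-Triples lines.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace, used here as an assumed vocabulary.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical article; element names are not taken from the Elsevier DTDs.
ARTICLE_XML = """
<article id="example-1">
  <title>Sample article</title>
  <author>A. Writer</author>
  <author>B. Writer</author>
</article>
"""


def literal(text):
    """Quote a string as an RDF literal (simplified: escapes \\ and " only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(article_xml, base_uri):
    """Extract title and authors and return them as N-Triples lines."""
    root = ET.fromstring(article_xml)
    subject = f"<{base_uri}{root.get('id')}>"
    triples = []
    title = root.findtext("title")
    if title is not None:
        triples.append(f"{subject} <{DC}title> {literal(title)} .")
    for author in root.findall("author"):
        triples.append(f"{subject} <{DC}creator> {literal(author.text or '')} .")
    return triples


triples = to_ntriples(ARTICLE_XML, "http://example.org/article/")
for line in triples:
    print(line)
```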

 

Characters for STM publishing

Together with the DTDs we developed the so-called “special character grid,” i.e. a table of special characters for use in Scientific, Technical and Medical (STM) and linguistic publications. Each character from this grid can be used in an SGML file via a so-called SDATA entity, which acts as its name.

In 1997 Elsevier, together with other STM publishers, initiated the STIX workgroup, a workgroup of STIPUB. The aim of the workgroup is to provide a proper solution for the special characters used in STM publications. To that end a comprehensive overview of special characters used in publications of its member companies was compiled.

The first goal of the workgroup is to have unique, universally standardized computer codes for the special characters. For this purpose, the workgroup has collaborated with the MathML workgroup and submitted documented lists of special characters for inclusion in Unicode. As a result, many characters used in STM publications are now part of Unicode 3.2.
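Once a character has a Unicode code point, a named entity reference can be resolved mechanically. The sketch below assumes a small lookup table from entity names to code points; the three entries use familiar entity names for illustration and are not a copy of the Elsevier special character grid, and the function name `resolve_entities` is hypothetical.

```python
import re

# Tiny illustrative table; not the Elsevier special character grid.
ENTITY_TO_CODE_POINT = {
    "alpha": 0x03B1,  # GREEK SMALL LETTER ALPHA
    "beta": 0x03B2,   # GREEK SMALL LETTER BETA
    "infin": 0x221E,  # INFINITY
}

# Matches references of the form &name;
ENTITY_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.]*);")


def resolve_entities(text):
    """Replace known entity references by their Unicode characters."""
    def replace(match):
        code_point = ENTITY_TO_CODE_POINT.get(match.group(1))
        # Unknown references are left untouched so they can be reported later.
        return chr(code_point) if code_point is not None else match.group(0)

    return ENTITY_REFERENCE.sub(replace, text)


resolved = resolve_entities("&alpha; and &beta; tend to &infin;; &unknown; stays")
print(resolved)
```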

The second goal of the workgroup is the provision of a comprehensive set of fonts, to be made available under a royalty-free license to anyone. See the mission statement and the progress reports on the STIXfonts website.