Thursday, January 5, 2017

JATS Journal Archiving and Interchange Tag Library, tools to convert from DOC and PDF, and JATS editors

The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online.
It is a technical standard developed by the National Information Standards Organization (NISO) and approved by the American National Standards Institute with the code Z39.96-2012.

The NISO project continued work done by NLM/NCBI and popularized by the NLM's PubMed Central as a de facto standard for archiving and interchanging scientific open-access journals and their contents in XML.
With the NISO standardization the NLM initiative has gained a wider reach, and several other repositories, such as SciELO, adopted the XML formatting for scientific articles.

JATS provides a set of XML elements and attributes for describing the textual and graphical content of journal articles, as well as some non-article material such as letters, editorials, and book and product reviews. JATS allows for descriptions of the full article content or just the article header metadata.
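To make the split between header metadata and full content concrete, here is a minimal Python sketch. The sample document is hypothetical and heavily trimmed (it is not validated against any JATS grammar), and the helper name `header_metadata` is an assumption made for illustration. It only relies on the usual JATS layout in which `<front>` holds the metadata and `<body>` holds the full text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JATS-style article (not schema-valid):
# <front> carries the header metadata, <body> carries the full text.
SAMPLE = """<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>Full text goes here.</p></sec>
  </body>
</article>"""


def header_metadata(xml_text):
    """Read a few header fields from <front> without touching <body>."""
    root = ET.fromstring(xml_text)
    front = root.find("front")
    # Join given names and surname for every <name> found in the header.
    authors = [
        f"{n.findtext('given-names', '')} {n.findtext('surname', '')}".strip()
        for n in front.iter("name")
    ]
    return {
        "journal": front.findtext(".//journal-title"),
        "title": front.findtext(".//article-title"),
        "authors": authors,
        # A header-only record would simply have no <body> element.
        "has_body": root.find("body") is not None,
    }


meta = header_metadata(SAMPLE)
print(meta)
```

A repository that stores only header metadata could run the same function on records without a `<body>`; `has_body` would then be false.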
Journal Archiving and Interchange Tag Library
NISO JATS Version 1.1 (ANSI/NISO Z39.96-2015)
December 2015
National Center for Biotechnology Information (NCBI)
National Library of Medicine (NLM)

As NISO JATS became the de facto and de jure standard for open-access journals, the scientific community adopted JATS repositories as a kind of legal deposit, more valuable than traditional digital libraries where only a PDF version is stored.
Open knowledge needs richer, structured formats such as JATS: the PDF and the JATS file must be certified as having the same content, with the pair "PDF+JATS" forming the unit of legal deposit.

List of JATS repositories and their contents:

PubMed Central (US): ~3.8 million articles in 2016
SciELO: ~0.6 million articles in 2016

So you want to adopt JATS. What decisions do you need to make?

JATS users must decide which model to use (Archiving, Publishing, Authoring), which table model to adopt, and how to handle math. In addition, they should select a citation model and style, an approach to encoding contributor names and affiliations, and an approach to multi-language names.

An intimidating number of variations are available for JATS version 1.1[1]. Each of the following models is available in three grammars: DTD, XSD (W3C XML Schema), and RNG (RELAX NG):

  • JATS Archiving with XHTML tables and MathML2
  • JATS Archiving with XHTML tables and MathML3
  • JATS Archiving with OASIS and XHTML tables and MathML2
  • JATS Archiving with OASIS and XHTML tables and MathML3
  • JATS Publishing with XHTML tables and MathML2
  • JATS Publishing with XHTML tables and MathML3
  • JATS Publishing with OASIS and XHTML tables and MathML2
  • JATS Publishing with OASIS and XHTML tables and MathML3
  • JATS Authoring with XHTML tables and MathML2
  • JATS Authoring with XHTML tables and MathML3
  • JATS Authoring with OASIS and XHTML tables and MathML2
  • JATS Authoring with OASIS and XHTML tables and MathML3
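The twelve models listed above are the combinations of three independent choices (tag set, table model, MathML version), each then available in three grammars. A small Python sketch that reproduces the list; the variable names are illustrative only and are not part of any JATS tooling:

```python
from itertools import product

# The three independent choices behind the list above.
TAG_SETS = ["Archiving", "Publishing", "Authoring"]
TABLE_MODELS = ["XHTML tables", "OASIS and XHTML tables"]
MATHML_VERSIONS = ["MathML2", "MathML3"]
# Each model is distributed in these three grammars.
GRAMMARS = ["DTD", "XSD", "RNG"]

# 3 tag sets x 2 table models x 2 MathML versions = 12 models.
models = [
    f"JATS {tag_set} with {tables} and {mathml}"
    for tag_set, tables, mathml in product(TAG_SETS, TABLE_MODELS, MATHML_VERSIONS)
]

# Every model in every grammar: 12 x 3 = 36 schema files to choose from.
schemas = [(model, grammar) for model in models for grammar in GRAMMARS]

for model in models:
    print(model)
print(len(models), "models,", len(schemas), "model/grammar combinations")
```

Seen this way, the choice is less intimidating: pick a tag set, decide whether OASIS tables are needed, pick a MathML version, and then pick the grammar your tools support.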

The first question a new JATS user must answer is “What Color”, meaning which of the JATS tag sets to adopt. There are three JATS tag sets:

Journal Archiving and Interchange — usually called ‘Archiving’ and nicknamed: Green
Journal Publishing — usually called ‘Publishing’ and nicknamed: Blue
Article Authoring — usually called ‘Authoring’ and nicknamed: Pumpkin

And there is an associated tag set for books and book-like material:
Book Interchange Tag Set — usually called ‘BITS’ and nicknamed: Chocolate
(The nicknames are the colors of the documentation for each of the Tag Sets, and are used as shorthand for the names of the Tag Sets.)


If existing documents, especially existing XML documents, are to be converted to JATS, Archiving is the most likely target. This is because Archiving is the most flexible, and it is likely to require less reorganization and regrouping of the existing content to convert it to XML according to Archiving than it would take to get the same material into the Publishing tag set. The archiving model even provides an element to record processing that may have been done in the pre-JATS XML (<x>), which may be useful to retain information from existing non-JATS XML documents. Archiving is intended for libraries and archives who must accept input from a wide variety of sources and convert that input to JATS (as expeditiously as possible).


The Publishing tag set is a good choice for publishers converting content from non-XML source material, such as proprietary typesetting formats and word processing files. It enables the tagging of metadata that is created during the publishing process, such as publication date, journal, volume, issue, and page numbers, and is flexible enough to accommodate most publisher styles. Publishing is more restrictive than Archiving, which reduces the variation in the files, reduces the options available for tagging, and makes editing the XML files more comfortable.

If the user is going to use the JATS content to create new journal articles for search and display or for repurposing such as combining fragments of multiple documents into new documents, Publishing may be the best choice because the content will be in a more predictable and thus tractable form than if it were tagged using Archiving.


A publisher soliciting new material from article authors in JATS probably wants to have them submit the Article Authoring tag set for incoming content. (Similarly, an author in some utopian future where XML tools are common on general purpose desktops, would use the Authoring tag set to create articles because it is not specific to any particular publisher or journal.) There are few other scenarios in which Authoring is appropriate.

The Authoring tag set was designed to allow as few tagging options as possible while enabling the expression of the full content of a journal article. Authoring does not provide tagging for metadata that will not be known at the time an article is authored, such as publication date, page number, history, or journal name.


JATS provides several ways to encode mathematical expressions including: as a graphic, as text, as TeX or LaTeX, and in MathML2 or MathML3. It also allows the user to provide the same expression in two or more of these formats and identify them as alternative versions of the same expression.

Because the MathML2 and MathML3 models cannot coexist in the same tag set, users must choose one or the other, even if they have no math in their documents or plan not to use MathML to encode the math they have (perhaps choosing graphics, TeX, or LaTeX instead).
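As an illustration of providing one expression in two formats, the sketch below builds a display formula that carries both a TeX and a MathML version. It assumes the JATS element names `disp-formula`, `alternatives`, and `tex-math`, and the `mml` prefix for the MathML namespace; check them against the tag set you actually adopt. The function name and the `eq1` id are hypothetical.

```python
import xml.etree.ElementTree as ET

# MathML namespace, conventionally bound to the "mml" prefix.
MML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("mml", MML_NS)


def formula_with_alternatives(tex, identifier):
    """Build a display formula holding TeX and MathML versions of one symbol."""
    formula = ET.Element("disp-formula", {"id": "eq1"})
    # <alternatives> groups equivalent versions of the same expression.
    alternatives = ET.SubElement(formula, "alternatives")

    tex_math = ET.SubElement(alternatives, "tex-math")
    tex_math.text = tex

    # Minimal MathML: a single identifier inside <mml:math>.
    math = ET.SubElement(alternatives, f"{{{MML_NS}}}math")
    mi = ET.SubElement(math, f"{{{MML_NS}}}mi")
    mi.text = identifier
    return formula


formula = formula_with_alternatives("x", "x")
print(ET.tostring(formula, encoding="unicode"))
```

A renderer can then pick whichever version it supports, for example MathML in a browser and TeX in a print pipeline.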


There are a variety of tools for creating, editing, converting, and transforming JATS. They range from simple forms to complete conversion automation.


Convert to JATS

Take as input a scientific document, and, with some human support, produce a JATS output.

OpenOffice (LibreOffice) and MS Word documents to JATS:
  • eXtyles
    automates time-consuming aspects of document editing in Microsoft Word and exports to JATS XML (as well as many other DTDs).
  • Article Authoring Add-in for Word (NLM JATS)
    According to its description, "this add-in for Microsoft Word enables authors/editors to use Microsoft Office Word to create, edit, and save files in the National Library of Medicine's NLM DTD (article or book) format."
    Though released in 2012, it was developed many years ago.
  • meTypeset
    "is a fork of the OxGarage stack" "to convert from Microsoft Word .docx format to NLM/JATS-XML". It is implemented as part of the PKP XML parsing stack.
  • OxGarage can convert documents from various formats into "National Library of Medicine (NLM) DTD 3.0" (as well as various other formats)
  • Open Typesetting Stack (PKP XML Parsing Service)
    The Open Typesetting Stack (formerly the PKP XML Parsing Service) converts Word/PDF documents to NLM DTD 3.0 XML, HTML, and PDF using meTypeset and other tools. Source code is available, along with a beta site offering free SaaS conversion from .doc or .pdf. PDFs must have embedded text (i.e. be created from an already-digital document); scanned PDFs must be OCRed before being uploaded to the service. As of April 2015, it's in beta, with metadata parsing disabled because of poor results. A plugin is being developed for Open Journal Systems (OJS) that will call this service.
    For more information, see a paper presented at JATS-Con 2015.
  • Migrate JATS
    A cloud service that enables journal, academic and book publishers to automate the conversion of MS Word documents to the Journal Article Tag Suite (JATS) specification.

PDF to JATS: this is a very difficult problem to solve. Success depends on how well structured your PDFs are and, for batch conversion, how consistently structured they are.

  • Shabash Merops
  • The Public Knowledge Project is developing a pipeline for converting PDF to JATS. It will include use of pdfx.

Markdown to JATS: pandoc's "pandoc-jats" plugin


  • JATS Framework for oXygen XML Editor: users of oXygen XML Editor and oXygen XML Author can now install support for current versions of NISO JATS (and as a bonus, NLM BITS). Based on an identifier given in a DOCTYPE declaration, oXygen will detect that you are editing a JATS document and provide stylesheets and utilities.
  • FontoXML for JATS: WYSIWYS editor for editing and reviewing JATS content
  • PubRef "Pipeline": Browser-based realtime-preview JATS editor
  • Annotum:[30] a WordPress theme that contains WYSIWYG authoring in JATS (Kipling subset), peer-review and editorial management, and publishing.
  • JATS edition for web-based XML editor Xeditor.
  • Texture Editor of the Substance Consortium
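The oXygen framework above is described as recognizing JATS documents from an identifier in the DOCTYPE declaration. The following Python sketch shows the general idea only; it is not how oXygen is implemented. The keyword table and the sample public identifier are assumptions written to resemble the usual JATS pattern, so take the real identifiers from the documentation of the version you use.

```python
import re

# Assumed keyword-to-tag-set mapping, for illustration only.
TAG_SET_KEYWORDS = {
    "Journal Archiving": "Archiving (Green)",
    "Journal Publishing": "Publishing (Blue)",
    "Article Authoring": "Authoring (Pumpkin)",
}

# Captures the public identifier of a DOCTYPE declaration.
DOCTYPE_PUBLIC = re.compile(r'<!DOCTYPE\s+\S+\s+PUBLIC\s+"([^"]+)"')


def detect_tag_set(xml_text):
    """Return the tag set named in the DOCTYPE public identifier, or None."""
    match = DOCTYPE_PUBLIC.search(xml_text)
    if match is None:
        return None
    public_id = match.group(1)
    for keyword, tag_set in TAG_SET_KEYWORDS.items():
        if keyword in public_id:
            return tag_set
    return None


# Hypothetical document head; the identifier string is an assumption.
sample = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE article PUBLIC '
    '"-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" '
    '"JATS-journalpublishing1.dtd">\n'
    '<article/>'
)
print(detect_tag_set(sample))
```

Documents validated with XSD or RELAX NG usually carry no DOCTYPE, so a real tool would need another signal for those (the function simply returns `None`).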


Tools that render JATS as HTML, usually on the fly.

  • JATS Preview Stylesheets: a series of .xsl, .xpl, .css, and .sch files that create .html or .pdf versions of valid NISO Z39.96-2012 JATS 1.0 files. They are primarily intended for internal use by publishers and as a basis for customization.
  • PubReader – "The PubReader view is an alternative web presentation ... Designed particularly for enhancing readability on tablet and other small screen devices, PubReader can also be used on desktops and laptops and from multiple web browsers".


Jatsdoc produces documentation for any particular JATS customization. Jatsdoc is integrated with NCBI's DtdAnalyzer.


Conference JATS

JATS-Con 2015 Schedule with Abstracts
Automating Complex High-Volume Technical Paper and Journal Article Page Composition with NLM XML and InDesign
Becky Fadik, SAE International
Brian Trombley, Data Conversion Laboratory
SAE International is a global association of more than 138,000 engineers and related technical experts in the aerospace, automotive and commercial-vehicle industries. Annually, SAE organizes and manages an industry conference, its World Congress and Exhibition, where thousands of technical papers and journal articles are presented as part of the conference program. Leading up to the World Congress, the technical papers and journal articles are reviewed for compliance to SAE publishing requirements and published for print and made available online in a very short time-frame. This paper describes how SAE evolved the production cycle from a less than efficient XSL-FO based process to a highly automated process leveraging NLM XML, XSLT and Adobe InDesign resulting in productivity gains and higher quality output. This paper will take you through the evolution of this project and talk to future enhancements aimed at driving additional benefits.
The Public Knowledge Project XML Publishing Service and meTypeset: Don't call it "Yet Another Word to JATS conversion kit"
Alex Garnett, Simon Fraser University
Juan Pablo Alperin, Simon Fraser University
John Willinsky, Stanford University
The Public Knowledge Project's Open Journal Systems, which provides a robust workflow and interface for editing, publishing, and indexing scholarly journal content, has always been somewhat agnostic as to the format of the content itself. Most authors, unsurprisingly, use Microsoft Word for its familiarity and ubiquity during the actual writing process, and the act of getting content from Word format into something that can be easily consumed on the web after that has always been something of a mystery among publishers. Those who have the means will typically outsource or partially automate the markup of an article in XML or an XML-like format, and this XML can then be transformed into HTML on the fly for viewing (OJS does include some stylesheets for this purpose); others, including many smaller open source journals, convert directly to the printable PDF format as a path of least effort, leaving them with something that looks nice on a printed page but potentially not so nice and not too flexible otherwise. For the past two years, PKP has been working on a web service (which integrates into OJS' workflow via a provided plugin) to fully automate the conversion of Word/compatible documents into the National Library of Medicine's standard JATS XML format (the same format which underlies PubMed Central), using fuzzy parsing and machine learning heuristics, and transform documents from there into matching human-readable HTML and PDF. This development, broadly speaking, takes two parts: one, the core OpenOffice/Word 2007 "docx" XML to JATS XML conversion engine, called "meTypeset" developed jointly with Martin Eve of the Open Library of the Humanities, and two, the web service pipeline, which unites meTypeset and other open-source libraries (including LibreOffice, ParsCit, ExifTool, and others). This service provides citation parsing, XMP metadata, and other industry-standard features. Improvement of various parsing features and automated evaluation is ongoing.
All Aboard! Round-tripping JATS in an HTML-based online CMS and editing platform
Wendell Piez, Piez Consulting Services
In a project just starting up, we are endeavoring to convert JATS (or actually near-NLM-Book) data into HTML for an HTML-based CMS, where the documents (structured reference material in short articles) will be edited (in the usual sort of web-based HTML editor) -- and then must be siphoned back up into the JATS-like textbase, for further processing using JATS-based tools.
Going both ways presents many interesting theoretical and technical challenges. This paper will describe the design principles we are following, with our solutions and findings by next April.
Improving the reusability of JATS
John Chodacki, PLoS
Alf Eaton, PeerJ
Michael Evans, F1000
Rupert Gatti, Open Book Publishers
James Gilbert, eLife
Melissa Harrison, eLife
Christopher Maloney, NCBI/NLM/NIH
Daniel Mietchen, Museum für Naturkunde Berlin; NCBI/NLM/NIH
Tom Mowlam, Ubiquity Press
JATS4R Working Group
Despite a fair degree of standardization, JATS is still not used consistently across publishers, which inhibits harvesting and reuse of JATS-tagged materials. To address these issues, a working group has been formed to evaluate existing flavours of JATS and to harmonize them by issuing recommendations on how reuse-relevant tags should be used and on how documentation and guidelines can be clarified. Besides establishing communication channels and working out procedures, the group has already tackled some specific tagging issues, most notably the machine readability of license statements and of mathematics in journal articles.
The Long Road to JATS
Paul Donohoe, Macmillan Science and Scholarly
Jenny Sherman, Macmillan Science and Scholarly
Ashwin Mistry, Macmillan Science and Scholarly
Nature Publishing Group/Palgrave has over a million articles, 180 journals, three in-house DTDs, numerous workflows and production systems, as well as teams based across the world. The challenge: How do we move to using JATS as our single DTD, introduce a streamlined production process and increase the number of journals we publish? This paper describes our journey so far, the challenges we faced, the XML tools we have used, the decisions we made and the reasons for them, and the work still to be done.
Adapting JATS to support data citation
Daniel Mietchen, Museum für Naturkunde Berlin; NCBI/NLM/NIH
Johanna McEntyre, EBI
Christopher Maloney, NCBI/NLM/NIH
Force11 Data Citation Implementation Group
Data are often not cited in a consistent fashion. To address this, Force 11 have developed the Data Citation Principles. JATS 1.1d1 has provisions for citing articles and other sources, but does not offer straightforward ways of expressing some of the concepts needed for data citation. In order to facilitate the citation of data in JATS-tagged documents in a way that is compliant with the Data Citation Principles, the Force11 Data Citation Implementation Group held a meeting in June, at which several new elements, attributes and values for attributes have been suggested to be added to JATS. These have since been submitted to the JATS Standing Committee, which largely accepted them, so they are now included in JATS 1.1d2. This talk will explain the decision criteria behind the elements that were proposed, and how they were selected for JATS 1.1d2. It will in addition provide suggested examples for use of the new tags.

JATS-Con 2016 Schedule with Abstracts

No full papers, materials, or videos are available for 2016; abstracts only.
A Quality Assurance Tool for JATS/BITS with Schematron and HTML reporting
Martin Kraetke, le-tex publishing services GmbH
Franziska Bühring, De Gruyter
De Gruyter adopted the JATS/BITS schema for journal content and established De Gruyter specific XML guidelines for creating XML metadata and full text data. Together with le-tex, De Gruyter developed a submission checker to validate the data quality of book and journal packages delivered from their service vendors. The tool is based on the Open Source software Talend and Transpect.
The submission checker verifies the consistency of metadata, validates against the JATS schema and De Gruyter's business rules, which are specified with Schematron. An HTML report provides a rendering of the source files with the error messages. The messages are displayed at the error location and are grouped by their severity. Content passing the check is forwarded for archiving and publication. It guarantees a technically correct rendering of the content on and facilitates the retrieval and processing for future purposes.
So You Want to Adopt JATS. What Decisions Do You Need To Make?
B. Tommie Usdin, Mulberry Technologies, Inc.
Newcomers to JATS need to make decisions about which tag set to use (Authoring, Publishing, or Archiving), which table model to adopt, and how to handle math. In addition, they should consider citation model and style, contributor names and affiliations, alternative languages and encodings, and adoption of tagging guidelines from PMC, JATS4R, and/or their publishing partners.
Collecting XML at article submission at eLife: Two steps forward, one step back?
Melissa Harrison, eLife
When eLife was launched in 2012, almost all article metadata was collected from what the corresponding author entered into the submission form. These data were converted to JATS XML at acceptance. We will describe the benefits and limitations of this approach, and how and why we have reverted to a more traditional method of using an author's Word file to generate elements of the XML metadata within the production process for the final full-text XML/HTML version of record. However, we also publish the accepted manuscript PDF for approximately 60% of accepted articles, and this process still relies on the metadata entered via the submission system to generate the HTML heading and metadata information online. We will discuss aspects of the peer-review process and submission system that affect the acquisition and conversion of article metadata for both accepted article PDFs and the final version of record, and the challenges we encountered in our efforts to streamline the production process and improve the end-to-end author experience. We will describe our new production process and the conversion of Word to HTML at the point of acceptance, and how we could extend this to the peer-review workflow to minimise duplication of effort in the process of metadata acquisition.
An implementation of BITS: The Cambridge University Press experience
Mike Eden, Cambridge University Press
Tom Cleghorn, Cambridge University Press
Cambridge University Press's history of using mark-up for academic book content resulted in a proprietary model, loosely based on NLM DTDs. With new requirements and new types of content, the DTD and related business rules were added to. These changes became frequent and unpredictable, creating pain-points in production workflows and resulting in a heavy burden, both internally, in keeping automated processes up-to-date, and externally, in maintaining suppliers' knowledge.
A decision was taken to review the situation. Outside consultants were engaged to perform the review. It became clear that the choice of model and its manner of use constituted only one part of the picture, and that a review of the entire process would be beneficial.
BITS was chosen as the DTD to use, rather than redefining proprietary DTDs. As an emerging industry standard, it is closely aligned with the Journals NISO standard (NISO Z39.96-2012), already in use at CUP. It was also felt that as the standard grew, benefits would arise from the input and requirements of other publishers.
During implementation, consideration was given to, among other aspects:

  • control of use (sub-setting or Schematron)
  • use of MathML and other similar standards
  • schemes for persistent element identification
  • approaches to metadata encoding
  • output intentions and specifications

Workflows were amended from copy editorial through to final delivery, with an emphasis on control of mark-up throughout. This resulted in a considerably better-defined and predictable process for external providers and internal stakeholders, and has resulted in the creation of a robust foundation for book production into the future.
Wrangling Math from Microsoft Word into JATS XML Workflows
Caitlin Gebhard, Inera, Inc.
Bruce Rosenblum, Inera, Inc.
Mathematics is a fundamental building block of modern technology, research, and industry, and yet the technological means of communicating mathematics is still surprisingly primitive. As a result, anyone involved in producing, publishing, or reading mathematical equations electronically knows that it is not a simple process.
The majority of scholarly papers are authored today in Microsoft Word. Some of those papers include simple and/or complex math. Authors have multiple means at their disposal to insert equations in Word documents including several of Word's native equation editors and third-party applications like Design Science's MathType. Building workflows that smoothly and accurately transform all of these formats into the appropriate XML markup for use in multiple rendering environments has many challenges.
This paper will clarify the different forms of equations that can be encountered in Word documents, and discuss the issues and idiosyncrasies of converting these various forms to MathML, LaTeX, and/or images in the JATS XML model. It will also touch on workflow alternatives for handling of equations in various rendering environments and how those downstream requirements may affect the means of equation extraction from Word documents.
JATSKit: An oXygen framework for JATS, BITS, and kindred XML formats
Wendell Piez, Piez Consulting
JATSKit is the newly improved oXygen XML Editor framework supporting NISO JATS and now NLM/NCBI BITS XML, with support for conversions of JATS and BITS data into HTML, EPUB and PDF. In addition to one-button publication, new features in the oXygen Author user interface include fielded entry for structured data elements, Schematron validation and QuickFixes, collapsible display of structural elements, more button controls and more. Everything is customizable and extensible.
