Int J Digit Libr
DOI 10.1007/s00799-016-0178-2

Key components of data publishing: using current best practices to develop a reference model for data publishing

Claire C. Austin1,2 · Theodora Bloom3 · Sünje Dallmeier-Tiessen4 · Varsha K. Khodiyar5 · Fiona Murphy6 · Amy Nurnberger7 · Lisa Raymond8 · Martina Stockhause9 · Jonathan Tedds10 · Mary Vardigan11 · Angus Whyte12

Received: 30 June 2015 / Revised: 13 May 2016 / Accepted: 25 May 2016
© Springer-Verlag Berlin Heidelberg 2016

Abstract  The availability of workflows for data publishing could have an enormous impact on researchers, research practices and publishing paradigms, as well as on funding strategies and career and research evaluations. We present the generic components of such workflows to provide a reference model for these stakeholders. The RDA-WDS Data Publishing Workflows group set out to study the current data-publishing workflow landscape across disciplines and institutions. A diverse set of workflows were examined to identify common components and standard practices, including basic self-publishing services, institutional data repositories, long-term projects, curated data repositories, and joint data journal and repository arrangements. The results of this examination have been used to derive a data-publishing reference model comprising generic components. From an assessment of the current data-publishing landscape, we highlight important gaps and challenges to consider, especially when dealing with more complex workflows and their integration into wider community frameworks. It is clear that the data-publishing landscape is varied and dynamic and that there are important gaps and challenges. The different components of a data-publishing system need to work, to the greatest extent possible, in a seamless and integrated way to support the evolution of commonly understood and utilized standards and—eventually—increased reproducibility. We therefore advocate the implementation of existing standards for repositories and all parts of the data-publishing process, and the development of new standards where necessary. Effective and trustworthy data publishing should be embedded in documented workflows. As more research communities seek to publish the data associated with their research, they can build on one or more of the components identified in this reference model.

Keywords  Data publishing · Open data · Open Science · World Data System · Research Data Alliance

Author statement: All authors affirm that they have no undeclared conflicts of interest. Opinions expressed in this paper are those of the authors and do not necessarily reflect the policies of the organizations with which they are affiliated. Authors contributed to the writing of the article itself and significantly to the analysis. Contributors Timothy Clark, Eleni Castro, Elizabeth Newbold, Samuel Moore and Brian Hole shared their workflows with the group (for the analysis). The authors are listed in alphabetical order. Theodora Bloom is a member of the Board of Dryad Digital Repository, and works for BMJ, which publishes medical research and has policies on data sharing.

Corresponding author: Fiona Murphy, [email protected]

1 Research Data Canada, Toronto, Canada
2 Carleton University, Ottawa, Canada
3 BMJ, London, UK
4 CERN, Geneva, Switzerland
5 Nature Publishing Group, London, UK
6 University of Reading, Reading, UK
7 Columbia University, New York, USA
8 Woods Hole Oceanographic Institution, Woods Hole, USA
9 German Climate Computing Centre (DKRZ), Hamburg, Germany
10 University of Leicester, Leicester, UK
11 University of Michigan/ICPSR, Ann Arbor, USA
12 Digital Curation Centre, Edinburgh, Scotland, UK

1 Data availability

Data from the analysis presented in this article are available in:

Bloom, T., Dallmeier-Tiessen, S., Murphy, F., Khodiyar, V.K., Austin, C.C., Whyte, A., Tedds, J., Nurnberger, A., Raymond, L., Stockhause, M., Vardigan, M. Zenodo doi:10.5281/zenodo.33899 (2015)
2 Introduction

Various data-publishing workflows have emerged in recent years to allow researchers to publish data through repositories and dedicated journals. While some disciplines, such as the social sciences, genomics, astronomy, geosciences, and multidisciplinary fields such as Polar science, have established cultures of sharing research data1 via repositories,2 it has traditionally not been common practice in all fields for researchers to deposit data for discovery and reuse by others. Typically, data sharing has only taken place when a community has committed itself towards open sharing (e.g. Bermuda Principles and Fort Lauderdale meeting agreements for genomic data),3 or there is a legal4 requirement to do so, or where large research communities have access to discipline-specific facilities, instrumentation, or archives.

A significant barrier to moving forward is the wide variation in best practices and standards between and within disciplines. Examples of good practice include standardized data archiving in the geosciences, astronomy, and genomics. Archiving for many other kinds of data is only just beginning to emerge or is non-existent [1]. A major disincentive for sharing data via repositories is the amount of time required to prepare data for publishing, time that may be perceived as being better spent on activities for which researchers receive credit (such as traditional research publications, obtaining funding, etc.). Unfortunately, when data are sequestered by researchers and their institutions, the likelihood of retrieval declines rapidly over time [2].

The advent of publisher and funding agency mandates to make accessible the data underlying publications is shifting the conversation from "Should researchers publish their data?" to "How can we publish data in a reliable manner?". We now see requirements for openness and transparency, and a drive towards regarding data as a first-class research output. Data publishing can provide significant incentives for researchers to share their data by providing measurable and citable output, thereby accelerating an emerging paradigm shift. Data release is not yet considered in a comprehensive manner in research evaluations and promotions, but enhancements and initiatives are under way within various funding and other research spaces to make such evaluations more comprehensive [3]. While there is still a prevailing sense that data carry less weight than published journal articles in the context of tenure and promotion decisions, recent studies demonstrate that when data are publicly available, a higher number of publications results [4,5].

The rationale for sharing data is based on assumptions of reuse—if data are shared, then users will come. However, the ability to share, reuse, and repurpose data depends upon the availability of appropriate knowledge infrastructures. Unfortunately, many attempts to build infrastructure have failed because they are too difficult to adopt. The solution may be to enable infrastructure to develop around the way scientists and scholars actually work, rather than expecting them to work in ways that the data center, organizational managers, publishers, or funders would wish them to [6]. Some surveys have found that researchers' use of repositories ranks a distant third—after responding to individual requests and posting data on local websites [7].

Traditionally, independent replication of published research findings has been a cornerstone of scientific validation. However, there is increasing concern surrounding the reproducibility of published research, i.e. that a researcher's published results can be reproduced using the data, code, and methods employed by the researcher [8–10]. Here, too, a profound culture change is needed if reproducibility is to be integrated into the research process [11–13].

1 When we use the term 'research data' we mean data that are used as primary sources to support technical or scientific enquiry, research, scholarship, or artistic activity, and that are used as evidence in the research process and/or are commonly accepted in the research community as necessary to validate research findings and results. All digital and non-digital outputs of a research project have the potential to become research data. Research data may be experimental, observational, operational, data from a third party, from the public sector, monitoring data, processed data, or repurposed data (Research Data Canada (2015), Glossary of terms and definitions, http://dictionary.casrai.org/Category:Research_Data_Domain).

2 A repository (also referred to as a data repository or digital data repository) is a searchable and queryable interfacing entity that is able to store, manage, maintain, and curate Data/Digital Objects. A repository is a managed location (destination, directory or 'bucket') where digital data objects are registered, permanently stored, made accessible and retrievable, and curated (Research Data Alliance, Data Foundations and Terminology Working Group, http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page). Repositories preserve, manage, and provide access to many types of digital material in a variety of formats. Materials in online repositories are curated to enable search, discovery, and reuse. There must be sufficient control for the digital material to be authentic, reliable, accessible, and usable on a continuing basis (Research Data Canada (2015), Glossary of terms and definitions, http://dictionary.casrai.org/Category:Research_Data_Domain). Similarly, 'data services' assist organizations in the capture, storage, curation, long-term preservation, discovery, access, retrieval, aggregation, analysis, and/or visualization of scientific data, as well as in the associated legal frameworks, to support disciplinary and multidisciplinary scientific research.

3 http://www.genome.gov/10506376.

4 For example, the Antarctic Treaty Article III states that "scientific observations and results from Antarctica shall be exchanged and made freely available". http://www.ats.aq/e/ats_science.html.
Data availability is key to reproducible research and essential to safeguarding trust in science.

As a result of the move towards increased data availability, a community conversation has begun about the standards, workflows, and quality assurance practices used by data repositories and data journals. Discussions and potential solutions are primarily concerned with how best to handle the vast amounts of data and associated metadata in all their various formats. Standards at various levels are being developed by stakeholder groups and endorsed through international bodies such as the Research Data Alliance (RDA), the World Data System of the International Council for Science (ICSU-WDS), and within disciplinary communities. For example, in astronomy there has been a long process of developing metadata standards through the International Virtual Observatory Alliance (IVOA),5 while in the climate sciences the netCDF/CF convention was developed as a standard format including metadata for gridded data. Even in highly diverse fields such as the life sciences, the BioSharing6 initiative is attempting to coordinate community use of standards. Increasingly, there is a new understanding that data publishing ensures long-term data preservation and hence produces reliable scholarship, demonstrates reproducible research, facilitates new findings, enables repurposing, and thereby realizes benefits and maximizes returns on research investments.

But what exactly is data publishing? Parsons and Fox [14] question whether publishing is the correct term when dealing with digital information. They suggest that the notion of data publishing can be limiting and simplistic and recommend that we explore alternative paradigms such as the models for software release and refinement, rather than one-time publication [14]. Certainly, version control7 does need to be an integral part of data publishing, and this can distinguish it from the traditional journal article. Dynamic data citation is an important feature of many research datasets which will evolve over time, e.g. monitoring data and longitudinal studies [15]. The data journal Earth System Science Data is addressing this challenge with its approach to 'living data'.8 The RDA Dynamic Citation Working Group has also developed a comprehensive specification for citing everything from a subset of a dataset to data generated dynamically, 'on-the-fly' [16]. International scientific facilities typically plan periodic scientifically processed data releases through the lifetime of a mission (e.g. the XMM-Newton X-ray Telescope source catalogue [17]), in addition to making underlying datasets available through archives according to embargo policies.

In 2011, Lawrence et al. [18] defined the act of 'publishing data' as: "to make data as permanently available as possible on the Internet." Published data will have been through a process guaranteeing easily digestible information as to its trustworthiness, reliability, format, and content. Callaghan et al. [19] elaborate on this idea, arguing that formal publication of data provides a service over and above the simple act of posting a dataset on a website, in that it includes a series of checks on the dataset of either a technical (format, metadata) or a more content-based nature (e.g. are the data accurate?). Formal data publication also provides the data user with associated metadata, assurances about data persistence, and a platform for the dataset to be found and evaluated—all of which are essential to data reuse. An important consideration for our study is that support for 'normal' data curation falls short of best practice standards. For example, having conducted a survey of 32 international online data platforms [20], the Standards & Interoperability Committee of Research Data Canada (RDC)9 concluded that there is still a great deal of work to be done to ensure that online data platforms meet minimum standards for reliable curation and sharing of data, and developed guidelines for the deposit and preservation aspects of publishing research data.

With the present study, a first step is taken towards a reference model comprising generic components for data publishing—which should help in establishing standards across disciplines. We describe selected data-publishing solutions, the roles of repositories and data journals, and characterize workflows currently in use. Our analysis involved the identification and description of a diverse set of workflows, including basic self-publishing services, long-term projects, curated data repositories, and joint data journal and repository arrangements. Key common components and standard practices were then identified as part of a reference model for data publishing. These could help with standardizing data-publishing activities in the future (while leaving enough room for disciplinary or institutional practices). It is worth noting that there is continued discussion about many of the key definitions. The working group presents core data-publishing terms (definitions) based on the analysis. We compare, contrast, and evaluate the key components, and identify and assess their utility and value-enhancing capabilities. We discuss the challenges inherent in citing and disseminating data, and then give context to already existing initiatives in this space. We outline continuing gaps and challenges—themselves opportunities for further research—and finally include a practical, modular set of recommendations as part of our conclusions.

3 Methods and materials

The RDA-WDS Publishing Data Workflows Working Group (WG) was formed to provide an analysis of a reasonably representative range of existing and emerging workflows and standards for data publishing, including deposit and citation, and to provide components of reference models and implementations for application in new workflows. The present work was specifically focused on articulating a draft reference model comprising generic components for data-publishing workflows that others can build upon. We also recognize the need for the reference model to promote workflows that researchers find usable and attractive.

To achieve this, the working group followed the OASIS definition of a reference model as: "…an abstract framework for understanding significant relationships among the entities of some environment, and for the development of consistent standards or specifications supporting that environment. A reference model is based on a small number of unifying concepts and may be used as a basis for education and explaining standards to a non-specialist. A reference model is not directly tied to any standards, technologies or other concrete implementation details, but it does seek to provide a common semantics that can be used unambiguously across and between different implementations".10

A particularly relevant example is the OAIS Reference Model for an Open Archival Information System.11 This model has shaped the Trusted Digital Repository (TDR) standard which frames repository best practice for ingesting, managing, and accessing archived digital objects. These have recently been exemplified by the DSA-WDS Catalogue of Requirements12 and are particularly relevant for their emphasis on making workflows explicit. Our specific concerns in the working group build on such standards, to guide implementation of quality assurance and peer review of research data objects, their citation, and linking with other digital objects in the research and scholarly communication environment.

A case study approach was in keeping with this aim. Case studies explore phenomena in their context and generalize to theory rather than to populations [21]. Similarly, drafting a conceptual model does not require us to make generalizable claims to the repository population as a whole, but it does commit us to testing its relevance to repositories, and other stakeholders, through community review and amendment.

As the membership of the RDA-WDS Publishing Data Workflows WG was reasonably diverse in terms of disciplinary and stakeholder participation, we drew upon that group's knowledge and contacts, and issued calls to participate under the auspices of the RDA and WDS, in collaboration with the Force11 Implementation Group13 to identify best practices and case studies in data-publishing workflows. Presentations and workshops at RDA plenary meetings were used to validate the approach and progress. With this iterative approach, we identified an initial set of repositories, projects, and publishing platforms which were thought to be reasonably representative of institutional affiliation and domain-specific or cross-disciplinary focus. These workflows served as a case study for the analysis to identify likely examples of 'data publishing' from repositories, projects, and publishing platforms, whether institutional, domain-specific, or cross-disciplinary.

Publicly available information was used to describe the workflows on a common set of terms. In addition, repository representatives were invited to present and discuss their workflows via videoconference and face-to-face meetings. Emphasis was given to workflows facilitating data citation, and the provision of 'metrics' for data was added as a consideration. Information was organized into a comparison matrix and circulated to the group for review, whereupon a number of annotations and corrections were made. Empty fields were populated, where possible, and terms were cross-checked and harmonized across the overall matrix. Twenty-six examples were used for comparison of characteristics and workflows. However, one workflow (Arkivum) was judged not to qualify for the definition of 'data publishing' as it emerged in the course of the research, so the final table consists of 25 entities (Table 1).

5 http://www.ivoa.net.

6 http://biosharing.org.

7 Version control (also known as 'revision control' or 'versioning') is control over a time period of changes to data, computer code, software, and documents that allows for the ability to revert to a previous revision, which is critical for data traceability, tracking edits, and correcting errors. TeD-T: Term definition tool. Research Data Alliance, Data Foundations and Terminology Working Group. http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page.

8 http://www.earth-system-science-data.net/living_data_process.html.

9 Research Data Canada (RDC) is an organizational member of Research Data Alliance (RDA) and from the beginning has worked very closely with RDA. See: "Guidelines for the deposit and preservation of research data in Canada", http://www.rdc-drc.ca/wp-content/uploads/Guidelines-for-Deposit-of-Research-Data-in-Canada-2015.pdf and "Research Data Repository Requirements and Features Review", http://hdl.handle.net/10864/10892.

10 Source: OASIS, http://www.oasis-open.org/committees/soa-rm/faq.php.

11 "Recommendation for Space Data System Practices: Reference Model for an Open Archival Information System (OAIS), CCSDS 650.0-M-2." http://public.ccsds.org/publications/archive/650x0m2.pdf. DataCite (2015). "DataCite Metadata Schema for the Publication and Citation of Research Data". http://dx.doi.org/10.5438/0010.

12 Draft available at: http://rd-alliance.org/group/repository-audit-and-certification-dsa-wds-partnership-wg/outcomes/dsa-wds-partership.

13 Force11 (2015). Future Of Research Communications and e-Scholarship. http://www.force11.org/group/data-citation-implementation-group.
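A comparison matrix of this kind lends itself to a simple machine-readable form. The sketch below is purely illustrative: the field names are our own and do not reproduce the WG's actual matrix schema, and the example entry is hypothetical rather than taken from the published dataset. It shows one way to record, for a single workflow, which of the examined data-publishing characteristics are present:

```python
# Illustrative representation of one row of a workflow comparison matrix.
# Field names are hypothetical; they do not reproduce the WG's actual schema.

CHARACTERISTICS = [
    "assigns_pid",           # persistent identifiers assigned to datasets
    "peer_review",           # peer review of data
    "curatorial_review",     # curatorial review of metadata
    "technical_review",      # technical review and checks on ingest
    "indexed",               # discoverability through indexing
    "linked_products",       # links to data papers or other journal articles
    "grant_links",           # links to grant information and author PIDs
    "data_citation",         # facilitation of data citation
    "lifecycle_model",       # reference to a data life cycle model
    "standards_compliance",  # compliance with relevant standards
]

def exhibited(row):
    """Return the characteristics a workflow row marks as present."""
    return [c for c in CHARACTERISTICS if row.get(c)]

# A hypothetical curated-repository entry (not taken from the actual matrix).
example_row = {
    "provider": "Example Data Repository",
    "provider_type": "Repository",
    "deposit_initiator": "Researcher led",
    "assigns_pid": True,
    "technical_review": True,
    "data_citation": True,
    "peer_review": False,
}

print(exhibited(example_row))  # ['assigns_pid', 'technical_review', 'data_citation']
```

Encoding the matrix this way makes the cross-checking and harmonization steps described above a mechanical comparison across entries.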
Future Of Research Communications and pdf DataCite (2015). “DataCite Metadata Schema for the Publication e-Scholarship http://www.force11.org/group/data-citation-implement and Citation of Research Data”. http://dx.doi.org/10.5438/0010. ation-group. 123 Key components of data publishing… Table 1 Repositories, projects, and publishing platforms selected for analysis of workflows and other characteristics Workflow provider name Workflow provider type Workflow provider specialist Deposit initiator research area, if any ENVRI reference model Guidelines Environmental sciences Project-led PREPARDE Guidelines Earth sciences Researcher led (for Geoscience Data Journal) Ocean Data Publication Cookbook Guidelines Marine sciences Researcher-led Scientific Data, Nature Publishing Journal Researcher (author) led Group F1000Research Journal Life sciences Researcher led; editorial team does a check Ubiquity Press OHDJ Journal Life, health and social sciences Researcher led GigaScience Journal Life and biomedical sciences Researcher (author) led Data in Brief Journal Author led Earth System Science Data Journal Earth sciences Researcher led for data article. 
Journal, Copernicus Publications Researcher led for data submission to repository Science and Technology Facilities Repository Physics and space sciences Researcher led as part of project Council Data Centre deliverables National Snow and Ice Data Center Repository Polar Sciences Project or researcher led INSPIRE Digital library Repository High energy Physics Researcher led UK Data Archive (ODIN) Repository Social sciences Researcher led PURR Institutional Repository Repository Researcher-/librarian led ICPSR Repository Social and behavioural sciences Researcher, acquisitions officer, and funder led Edinburgh Datashare Repository Researcher led, librarian assists PANGAEA Repository Earth sciences Researcher led WDC Climate Repository Earth sciences Researcher or project led CMIP/IPCC-DDC Repository Climate sciences Project-leda Dryad Digital Repository Repository Life sciences Researcher led Stanford Digital Repository Repository Researcher led Academic Commons Columbia Repository Researcher and repository staff Data Repository for the University Repository Researchers from institution of Minnesota (DRUM) ARKIVUM and Figshare Repository Researcher led OJS/ Dataverse data repository, Repository Researcher led; part of journal all disciplines article publication process a Data Citation concept for CMIP6/AR6 is available as draft at: http://www.earthsystemcog.org/projects/wip/resources/ course of the research, so the final table consists of 25 entities • Technical review and checks (e.g. for data integrity at (Table 1). repository/data centre on ingest). Workflows were characterized in terms of the discipline, • Discoverability: was there indexing of the data and, if so, function, data formats, and roles involved. We also described where? the extent to which each exhibited the following ten charac- • Links to additional data products (data paper; review; teristics associated with data publishing: other journal articles) or “standalone” product. 
• Links to grant information, where relevant, and usage of author PIDs. • The assignment of persistent identifiers (PIDs) to datasets, • Facilitation of data citation. and the PID type used—e.g. DOI, ARK, etc. • Reference to a data life cycle model. • Peer review of data (e.g. by researcher and by editorial • Standards compliance. review). • Curatorial review of metadata (e.g. by institutional or The detailed information and categorization can be found in subject repository). the analysis dataset comprising the comparison matrix [22]. 123 C.C. Austin et al. 4 Analysis and results: towards a reference model data-publishing product, and/or the host institution of in data publishing the workflow (e.g. individual publisher/journal, insti- tutional repository, discipline-specific repository). 4.1 Definitions for data-publishing workflows and outputs Data article A data article is a ‘data-publishing’ product, also known The review of the comparison matrix of data-publishing as a ‘data descriptor’ that may appear in a data journal or workflows produced by the RDA-WDS Publishing Data any other journal. When publishers refer to ‘data pub- Workflows WG [22] revealed a need for standardization of lishing’, they usually mean a data article rather than the terminology. We therefore propose definitions for six key underlying dataset. Data articles focus on making data terms: research data publishing, research data-publishing discoverable, interpretable, and reusable, rather than workflows, data journal, data article, data review, and data testing hypotheses or presenting new interpretations repository entry. (by contrast with traditional journal articles). Whether Research data publishing linked to a dataset in a separate repository, or submitted “Research data publishing is the release of research in tandem with the data, the aim of the data article is to data, associated metadata, accompanying documenta- provide a formal route to data sharing. 
The parent jour- tion, and software code (in cases where the raw data nal may choose whether or how standards of curation, have been processed or manipulated) for re-use and formatting, availability, persistence, or peer review of analysis in such a manner that they can be discovered the dataset are described. By definition, the data article on the Web and referred to in a unique and persis- provides a vehicle to describe these qualities, as well tent way. Data publishing occurs via dedicated data as some incentive to do so. The length of such articles repositories and/or (data) journals which ensure that can vary from micro papers (focused on one table or the published research objects are well documented, plot) to very detailed presentation of complex datasets. curated, archived for the long term, interoperable, Data journal citable, quality assured and discoverable – all aspects of data publishing that are important for future reuse of A data journal is a journal (invariably open access) data by third party end-users.” that publishes data articles. The data journal usu- ally provides templates for data description and offers This definition applies also to the publication of confiden- researchers guidance on where to deposit and how tial and sensitive data with the appropriate safeguards and to describe and present their data. Depending on accessible metadata. A concrete example of such a workflow the journal, such templates can be generic or dis- may be a published journal article that includes discover- cipline focused. Some journals or their publishers ability and citation of a dataset by identifying access criteria maintain their own repositories. 
…for reuse.14 Harvard University is currently developing a tool that will eventually be integrated with Dataverse to share and use confidential and sensitive data in a responsible manner.15

Research data-publishing workflows

Research data-publishing workflows are activities and processes that lead to the publication of research data, associated metadata, and accompanying documentation and software code on the Web. In contrast to interim or final published products, workflows are the means to curate, document, and review, and thus to ensure and enhance the value of the published product. Workflows can involve both humans and machines, and often humans are supported by technology as they perform steps in the workflow. Similar workflows may vary in their details, depending on the research discipline.

As well as supporting bi-directional linking between a data article and its corresponding dataset(s), and facilitating persistent identification practices, data journals provide workflows for quality assurance (i.e. data peer review) and should also provide editorial guidelines on data quality assessment.

Data review

Data review comprises a broad range of quality assessment workflows, which may extend from a technical review of metadata accuracy to a double-blind peer review of the adequacy of data files and documentation and the accuracy of calculations and analyses. Multiple variations of review processes exist and are dependent upon factors such as publisher requirements, researcher expectations, or data sensitivity. Some workflows may be similar to traditional journal workflows, in which specific roles and responsibilities are assigned to editors and reviewers to assess and ensure the quality of a data publication. The data review process may therefore encompass a peer review that is conducted by invited domain experts external to the data journal or the repository, a technical data review conducted by repository curation experts to ensure data are suitable for preservation, and/or a content review by repository subject domain experts.

Data repository entry

A data repository entry is the basic component of data publishing, consisting of a persistent, unique identifier pointing to a landing page that contains a data description and details regarding data availability and the means to access the actual data [22].

Fig. 1 Data-publishing key components. Elements that are required to constitute data publication are shown in the left panel, and optional services and functions in the right panel

4.2 Key components of data publishing

Analysis of workflows by the RDA-WDS Data Publishing WG identified the components that contribute to a generic reference model for data publishing. We distinguish basic and add-on services. The basic set of services consists of entries in a trusted data repository, including a persistent identifier, standardized metadata, and basic curation (Fig. 1). Optional add-ons could include components such as contextualization through additional embedding into data papers or links to traditional papers. Some authors and solutions make a distinction between metadata publication and data publication. We would argue that data and their associated metadata must at least be bi-directionally linked in a persistent manner, and that they need to be published together and viewed as a package, since metadata are essential for the correct use, understanding, and interpretation of the data.

Important add-ons are quality assurance/quality control (QA/QC)16 and peer review services. Different variations of such services exist, ranging from author-led, editor-driven, and librarian-supported solutions to (open) peer review. Such components are crucial enablers of future data reuse and reproducible research. Our analysis found that many providers offer or are considering offering such services. The third group of add-ons aims to improve visibility, as shown in the right panel of Fig. 1. This set of services is not currently well established, and this hampers data reuse. Other emerging services include connection of the data-publishing workflow with indexing services, research information services (CRIS), or metrics aggregators.

To ensure the possibility of data reuse, data publishing should contain at least the basic elements of curation, QA/QC, and referencing, plus additional elements appropriate for the use case (Fig. 1). Depending on the use case, it might be appropriate to select a specific set of elements from the key components (following some best practices). In the light of future reuse, we would argue that the basic elements of curation, QA/QC, and referencing should always be included.

4.3 Detailed workflows and dependencies

We present a traditional article publication workflow (Fig. 2-1), a reproducible research workflow (Fig. 2-2), and a data publication workflow (Fig. 2-3).

The workflow comparison found that it is usually the researcher who initiates the publication process, once data have been collected and are in a suitable state for publication or meet the repository requirements for submission. Datasets may be published in a repository with or without an associated data article. However, there are examples for which there is a direct 'pipe' from a data production 'machine' to a data repository (genome sequencing is one such example).

14 Indirect linkage or restricted access—see e.g. Open Health Data Journal, http://openhealthdata.metajnl.com.
15 http://privacytools.seas.harvard.edu/datatags.
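The basic service set described above (a trusted repository entry carrying a persistent identifier and standardized metadata) can be made concrete with a small sketch. The field names below loosely follow the mandatory DataCite metadata properties (Identifier, Creator, Title, Publisher, PublicationYear); the DOI and every other value are hypothetical placeholders, and this is not an official DataCite serialization.

```python
# Sketch of a minimal repository-entry metadata record. Field names
# loosely follow the mandatory DataCite properties; all values,
# including the DOI, are hypothetical placeholders rather than a
# real registered record.
def make_minimal_record(doi, creator, title, publisher, year):
    """Assemble the smallest metadata package that still supports
    persistent identification and citation of a dataset."""
    return {
        "identifier": {"identifierType": "DOI", "identifier": doi},
        "creators": [{"creatorName": creator}],
        "titles": [{"title": title}],
        "publisher": publisher,
        "publicationYear": str(year),
    }

record = make_minimal_record(
    "10.1234/example-dataset",   # placeholder DOI
    "Doe, Jane",
    "Example survey dataset",
    "Example Data Repository",
    2016,
)
```

A repository would typically validate such a record against its schema before minting the identifier and generating the landing page; discipline-specific fields (methodology, instrument, code version) would be layered on top.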
Depending on the data repository, there are both scientific and technical [18,23] quality assurance activities regarding dataset content, description, format, and metadata quality before data are archived for the long term. The typical data repository creates an entry for a specific dataset or a collection thereof. Most repositories invest in standardized dissemination for datasets, i.e. a landing page for each published item, as recommended by the Force11 Data Citation Implementation Group [24].17 Some repositories facilitate third-party access for discoverability or metrics services.

Fig. 2 Research data publication workflows. We present a traditional article publication workflow (2-1), a reproducible research workflow (2-2), and—as a more dynamic version of Fig. 1—a data publication workflow (2-3)

As shown in Fig. 2, researchers can and do follow a number of different pathways to communicate about their data. Traditionally, research results are published in journals, and readers (end user 1) interested in the data would need to contact authors to access underlying data or attempt to access it from a researcher-supported website (Fig. 2-1). Emerging processes supporting greater reproducibility in research include some form of data publication (Fig. 2-2). This includes the special case of standalone18 data publications with no direct connection to a paper. These are common in multiple domain areas (e.g. the large climate data intercomparison study CMIP).19 Figure 2-3 illustrates the two predominant data publication workflows emerging from our analysis: (a) submission of a dataset to a repository; and (b) submission of a data article to a data journal. Both workflows require that datasets are submitted to a data repository.

The data publication process shown in Fig. 2-3 may be initiated at any time during research once the data are sufficiently complete and documented, and may follow a variety of paths. A repository will typically provide specific templates for metadata and additional documentation (e.g. methodology or code-specific metadata). The submission may then be reviewed from a variety of perspectives depending on the policies and practices of the repository. These review processes may cover formatting issues, content, metadata, or other technical details. Some repositories may also require version control of the dataset. There is a great deal of variability between repositories in the type of data accepted, available resources, the extent of services offered, and workflows. Figure 2-3 illustrates the elements common to the workflows of the data repositories selected for the present study; these are consistent with those shown in Fig. 1.

A researcher may also choose to initiate the data publication process by submitting a data article for publication in a data journal. This workflow is also illustrated in Fig. 2-3, and while it is in part dependent on data repositories (data journals typically identify approved repositories),20 the data article publication process has the opportunity to more consistently provide some of the advantages of data publication as represented in the 'Additional elements' of Fig. 1. Data journals are similar to the traditional research journal (Fig. 2-1), in that their core processes consist of peer review and dissemination of the datasets. Naturally, reviewers must have pre-publication access to the dataset in a data repository, and there need to be version control solutions for datasets and data papers. Whether publishing data via a data article or a data repository, both workflows have the potential to be incorporated into the current system of academic assessment and reward in an evolutionary process rather than a disruptive departure from previous systems.

Data publication workflows supporting reproducible research give end users access to managed and curated data, code, and supporting metadata that have been reviewed and uploaded to a trusted repository (Fig. 2, end-user 2a). If an associated data article is published, end users will also have further contextual information (Fig. 2, end-user 2b). The traditional journal article may be published as usual and may be linked to the published data and/or data article as well. There are some hard-wired automated workflows for data publishing (e.g. with the Open Journal Systems-Dataverse integration [25]), or there can be alternate automated or manual workflows in place to support the researcher (e.g. Dryad).

4.4 Data deposit

We found that a majority of data deposit mechanisms underlying data-publishing workflows are initiated by researchers, but that their involvement beyond the initial step of deposition varied across repositories and journals. Platform purpose (e.g. data journal vs. repository) and the ultimate perceived purpose and motivation of the depositor of the data all affect the process. For example, a subject-specialist repository, such as is found at the Science and Technology Facilities Council (STFC) or the National Snow and Ice Data Center (NSIDC), screens submissions and assesses the levels of metadata and support required. Data journals, however, typically adopt a 'hands-off' approach: the journal is the 'publication' outlet, but the data are housed elsewhere. Hence, the journal publishing team often relies on external parties—repository managers and the research community in general21—to manage data deposit and to assess whether basic standards for data deposition or quality standards are met (see details below).

4.5 Ingest

We found that discipline-specific repositories had the most rigorous ingest and review processes and that more general repositories, e.g. institutional repositories (IRs) or Dryad, had a lighter touch, given the greater diversity of use cases and practice around data from diverse disciplines. Some discipline-specific repositories have multiple-stage processes including several QA/QC processes and workflows based on OAIS. Many IRs have adopted a broader approach to ingest necessitated by their missions, which involves archiving research products generated across their campuses, especially those found in the long tail of research data, including historical data that may have been managed in diverse ways. As data standards are developed and implemented, and as researchers are provided with the tools, training, and incentives needed to engage in modern data management practices, ingest practices will no doubt improve.

When data journals rely on external data repositories to handle the actual data curation, there needs to be a strong collaboration between the journal and repository staff beyond trust that the repository will pursue data management and ingestion according to acceptable standard procedures. Data journals and data repositories are encouraged to make public and transparent any such agreements (e.g. service-level agreements). Ultimately, however, this level of one-to-one interaction is not scalable, and automated procedures and repository standards will be needed.

4.6 Quality assurance (QA) and quality control (QC)

We found that QA/QC typically occurs at three points during the data-publishing workflow: (1) during data collection and data processing, prior to submission of the data to a repository; (2) during submission and archiving of the data; and (3) during a review or the editorial procedure. We distinguish between traditionally understood peer review and the internal reviews that repositories and journals also generally conduct (Fig. 2), which may touch on content, format, description, documentation, metadata, or other technical details.

QA/QC procedures vary widely and may involve authors/reviewers for QA of the content and documentation, and data managers/curators, librarians, and editors for technical QA. Quality criteria can include checks on data, metadata, and documentation against repository, discipline,22 and project standards.

Most repositories and all of the data journals that we reviewed had some QA/QC workflows, but the level and type of services varied. Established data repositories (e.g. ICPSR or Dataverse [22]) tended to have dedicated data curation personnel to help in standardizing and reviewing data upon submission and ingestion, especially in the area of metadata. Some domain repositories (e.g. ICPSR) go farther and conduct in-depth quality control checks on the data, revising the data if necessary in consultation with the original investigator. Other repositories responsible for the long-term archiving of project data (e.g. the IPCC-DDC23) document their QA results. Some data repositories rely on researchers for the QA/QC workflows to validate the scientific aspects of data, metadata, and documentation. Technical support, data validation, or QA/QC was also done by some repositories, but the level of engagement varied with the service and the individual institutions: some checked file integrity, while others offered more complex preservation actions, such as on-the-fly data format conversions. Some multi-purpose repositories provided support to researchers for QA/QC workflows, but this was not a standard practice. Overall, QA/QC in data publishing is a 'hot-button' topic and is debated heavily and continuously within the community. Mayernik et al. describe a range of practice in technical and academic peer review for publishing data [26].

The journal workflows we examined typically straddled the dual processes of reviewing the dataset itself and the data papers, which were carried out separately and then checked to ensure that the relationship between the two was valid. Such QA/QC workflows for data journals demand a strong collaboration with the research community and their peer reviewers, and also between publisher and data repository in workflow co-ordination, versioning, and consistency.

Given the wide range of QA/QC services currently offered, future recommendations should consider the following:

• Repositories which put significant effort into high levels of QA/QC benefit researchers whose materials match the repository's portfolio by making sure their materials are fit for reuse. This also simplifies the peer review process for associated data journals and lowers barriers to uptake by researchers.
• General research data repositories which must accommodate a wide variety of data may have some limitations in QA/QC workflows, and these should be made explicit.
• Information about quality level definitions and quality assessment procedures and results should be explicit and readily available to users (and also possibly to third parties, such as aggregators or metric services).

There appears to be a trend towards data being shared earlier in the research workflow, at a stage where the data are still dynamic (see for example Meehl et al. [27]). There is a need, therefore, for QA/QC procedures that can handle dynamic data.

4.7 Data administration and long-term archiving

Data administration and curation activities may include dealing with a variety of file types and formats, creation of access-level restrictions, the establishment and implementation of embargo procedures, and assignment of identifiers. We found an assortment of practices in each of these areas. These vary from providing file format guidelines alone to active file conversions; from supporting access restrictions to supporting only open access; administering flexible or standardized embargo periods; and employing different types of identifiers. Several discipline-specific repositories already have a long track record of preserving data and have detailed workflows for archival preservation. Other repositories are fairly new to this discussion and continue to explore potential solutions.

Most repositories in our sample have indicated a commitment to persistence and the use of standards. The adoption of best practices and standards would increase the likelihood that published data will be maintained over time and lead to interoperable and sustainable data publishing. Repository certification systems have been gaining momentum in recent years and could help facilitate data publishing through collaboration with data-publishing partners such as funders, publishers, and data repositories. The range of certification schemes24 includes those being implemented by organizations such as the Data Seal of Approval (DSA)25 and the World Data System (ICSU-WDS).26 Improved adoption of such standards would have a big impact on interoperable and sustainable data publishing.

4.8 Dissemination, access, and citation

Data packages in most repositories we analyzed were summarized on a single landing page that generally offered some basic or enriched (if not quality assured) metadata. This usually included a DOI and sometimes another unique identifier as well or instead. We found widespread use of persistent identifiers and a recognition that data must be citable if they are to be optimally useful.27 It should be noted that dissemination of data-publishing products was, in some cases, enhanced through linking and exposure (e.g. embedded visualization) in traditional journals. This is important, especially given the culture shift needed within research communities, to make data publishing the norm.

Dissemination practices varied widely. Many repositories supported publicly accessible data, but diverged in how optimally they were indexed for discovery. As would be expected, data journals tended to be connected with search engines and with abstracting and indexing services. However, these often (if not always) related to the data article rather than to the dataset per se. The launch of the Data Citation Index28 by Thomson Reuters and projects such as the Data Discovery Index29 are working on addressing the important challenge of data discovery and could serve as an accelerator to a paradigm shift for establishing data publishing within research communities.

One example of such a paradigm shift occurred in 2014 when the Resource Identifier Initiative (RII) launched a new registry within the biomedical literature. The project covered antibodies, model organisms (mice, zebrafish, flies), and tools (i.e. software and databases), providing a fairly comprehensive combination of data, metadata, and platforms to work with. Eighteen months later, the project was able to report both a cultural shift in researcher behaviour and a significant increase in the potential reproducibility of relevant research. As discussed in Bandrowski et al. [28], the critical factor in this initiative's success in gaining acceptance and uptake was the integrated way in which it was rolled out. A group of stakeholders including researchers, journal editors, subject community leaders, and publishers—within a specific discipline, neuroscience—worked together to ensure a consistent message. This provided a compelling rationale, coherent journal policies (which necessitated compliance for would-be authors to publish), and a specific workflow for the registration process (complete with skilled, human support if required). Further work is needed to determine exactly how this use case can be leveraged across the wider gamut of subjects, communities, and other players.

FAIR principles30 and other policy documents [10] explicitly mention that data should be accessible. Data-publishing solutions ensure that this is the case, but some workflows allow only specific users to access sensitive data.

16 Quality assurance: the process or set of processes used to measure and assure the quality of a product. Quality control: the process of meeting products and services to consumer expectations (Research Data Canada, 2015, Glossary of terms and definitions, http://dictionary.casrai.org/Category:Research_Data_Domain).
17 http://www.force11.org/datacitationimplementation.
18 Defined in e.g. [18].
19 Program for Climate Model Diagnosis and Intercomparison (n.d.). Coupled Model Intercomparison Project (CMIP). Retrieved November 11, 2015, from http://www-pcmdi.llnl.gov/projects/cmip/.
20 Approved by the data journal.
21 Post-publication peer review is becoming more prevalent and may ultimately strengthen the Parsons–Fox continual release paradigm. See, for instance, F1000 Research and Earth System Science Data and the latter journal's website: http://www.earth-system-science-data.net/peer_review/interactive_review_process.html.
22 An example of a discipline standard is the format and metadata standard NetCDF/CF used in the Earth system sciences: http://cfconventions.org/.
23 Intergovernmental Panel on Climate Change Data Distribution Centre (IPCC-DDC): http://ipcc-data.org.
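The open-metadata/restricted-data pattern just mentioned can be sketched as a minimal access rule. The rule, field names, and user attributes below are hypothetical illustrations of the pattern, not the policy of any particular repository:

```python
# Hypothetical sketch: metadata are openly readable, while the
# sensitive dataset itself is released only to registered users who
# have signed a data use agreement (DUA). Names and rules are
# illustrative, not any real repository's policy.
def can_access(part, user):
    """Return True if `user` may retrieve the given part of an entry."""
    if part == "metadata":
        return True  # landing page and description are always open
    if part == "dataset":
        return bool(user.get("registered")) and bool(user.get("signed_dua"))
    raise ValueError("unknown part: " + part)

anonymous_user = {}
approved_user = {"registered": True, "signed_dua": True}
```

Real implementations would layer authentication, embargo dates, and licence terms on top, but the essential point is the separation of openly published metadata from controlled access to the data themselves.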
An exam- 24 Data Seal of Approval (DSA); Network of Expertise in long- term Storage and Accessibility of Digital Resources in Germany 27 (NESTOR) seal/German Institute for Standardization (DIN) standard Among the analyzed workflows, it was generally understood that data 31644; Trustworthy Repositories Audit and Certification (TRAC) cri- citation which properly attributes datasets to originating researchers teria / International Organization for Standardization (ISO) standard can be an incentive for deposit of data in a form that makes the data 16363; and the International Council for Science World Data System accessible and reusable, a key to changing the culture around scholarly (ICSU-WDS) certification. credit for research data. 25 28 http://wokinfo.com/products_tools/multidisciplinary/dci/. Data Seal of Approval: http://datasealofapproval.org/en/. 26 29 http://grants.nih.gov/grants/guide/rfa-files/RFA-HL-14-031.html. World Data System certification. http://www.icsu-wds.org/files/ wds-certification-summary-11-june-2012.pdf. 30 http://www.force11.org/group/fairgroup/fairprinciples. 123 C.C. Austin et al. ple is survey data containing information that could lead to important for establishing author metrics. During our inves- the identification of respondents. In such cases, a prospective tigation, many data repository and data journal providers data user could access the detailed survey metadata to deter- expressed an interest in new metrics for datasets and related mine if meets his/her research needs, but a data use agreement objects. Tracking usage, impact, and reuse of the shared would need to be signed before access to the dataset would materials can enrich the content on the original platforms be granted. The metadata, data article, or descriptor could be and encourage users to engage in further data sharing or published openly, perhaps with a Creative Commons license, curation activities. 
Such information is certainly of interest but the underlying dataset would be unavailable except via to infrastructure and research funders.35 registration or other authorization processes. In such a case, the data paper would allow contributing researchers to gain 4.10 Diversity in workflows due credit, and it would facilitate data discovery and reuse.31 Citation policies and practice also vary by community and While workflows may appear to be fairly straightforward culture. Increasingly, journals and publishers are including and somewhat similar to traditional static publication proce- data citation guidelines in their author support services. In dures, the underlying processes are, in fact, quite complex terms of a best practice or standard, the Joint Declaration and diverse. The diversity was most striking in the area of of Data Citation Principles32 is gathering critical mass and curation. Repositories that offered self-publishing options becoming generally recognized and endorsed. Discussions without curation had abridged procedures, requiring fewer concerning more detailed community practices are emerg- resources but also potentially providing less contextual ing: for example, whether or not publishing datasets and information and fewer assurances of quality. Disciplinary data papers—which can then be cited separately from related repositories that performed extensive curation and QA had primary research papers—is a fair practice in a system that more complex workflows with additional steps, possibly con- rewards higher citation rates. However, sensible practices can secutive. They might facilitate more collaborative work at the be formulated.33 beginning of the process, or include standardized preserva- tion steps. 4.9 Other potential value-added services and metrics There was metadata heterogeneity across discipline- specific repositories. 
Highly specialized repositories fre- Many repository or journal providers look beyond work- quently focused on specific metadata schemas and pursued flows that gather information about the research data and curation accordingly. Some disciplines have established also want to make this information visible to other infor- metadata standards, similar to the social sciences’ use of the mation providers in the field. This can add value to the Data Documentation Initiative standard.36 In contrast, more data being published. If the information is exposed in a general repositories tended to converge on domain-agnostic standardized fashion, data can be indexed and be made metadata schemas with fields common across disciplines, discoverable by third-party providers, e.g. data aggregators e.g. the mandatory DataCite fields.37 (Fig. 1). Considering that such data aggregators often work Data journals are similar in overall workflows, but differ beyond the original data provider’s subject or institutional in terms of levels of support, review, and curation. As with focus, some data providers enrich their metadata (e.g. with repositories, the more specialized the journal (e.g. a disci- data-publication links, keywords, or more granular subject pline in the earth sciences with pre-established data-sharing matter) to enable better cross-disciplinary retrieval. Ideally, practices), the more prescriptive are the author guidelines information about how others download or use the data would and the more specialized the review and QA processes. With be fed back to the researcher. In addition, services such the rise of open or post-publication peer review, some data as ORCID.34 are being integrated to allow researchers to journals are also inviting the wider community to participate connect their materials across platforms. This gives more vis- in the publication process. 
ibility to the data through the different registries and allows The broader research community and some discipline- for global author disambiguation. The latter is particularly based communities are currently developing criteria and practices for standardized release of research data. The 31See e.g. Open Health Data journal http://openhealthdata.metajnl. services supporting these efforts, whether repositories or com/. 32 journals, also generally show signs of being works in progress Data Citation Synthesis Group, 2014. Accessed 17 November 2015: http://www.force11.org/group/joint-declaration-data-citation- principles-final. 35Funders have an interest in tracking Return on Investment to assess 33 See Sarah Callaghan’s blogpost: Cite what you use, 24 January which researchers/projects/fields are effective and whether the proposed 2014. Accessed 24 June 2015: http://citingbytes.blogspot.co.uk/2014/ new projects consist of new or repeated work. 01/cite-what-you-use.html. 36 Accessed 17 November 2015: http://www.ddialliance.org. 34 http://orcid.org/. 37 Accessed 17 November 2015: http://schema.datacite.org. 123 Key components of data publishing… or proof-of-concept exercises rather than finished products. • Bi-directional linking How do we link data and publica- This is reflected in our analysis dataset [22]. Depending tions persistently in an automated way? Several organi- partly on their state of progress during our review period zations, including RDA and WDS,40 are now working on (1 February–30 June 2015), and also on the specificity of the this problem. A related issue is the persistence of links subject area, some workflow entries were rather vague. themselves.41 • Software management Solutions are needed to manage, preserve, publish, and cite software. 
Basic workflows 5 Discussion exist (involving code sharing platforms, repositories, and aggregators), but much more work is needed to establish Although the results of our analysis show wide diversity in a wider framework, including community updating and data-publishing workflows, the key components were fairly initiatives involving linking to associated data . similar across providers. The common components were • Version control In general, we found that repositories grouped and charted in a reference model for data publish- handle version control in different ways, which is poten- ing. Given the rapid developments in this field and in light of tially confusing. While some version control solutions the disciplinary differences, diversity of workflows might be might be tailored to discipline-specific challenges, there expected to grow even further. Through the RDA Working is a need to standardize. This issue also applies to prove- Group we will seek further community review and endorse- nance information. ment of the generic reference model components and carry • Sharing restricted-use data Repositories and journals are out further analyses of such disciplinary variations. How- generally not yet equipped to handle confidential data. ever, the results of our study suggest that new solutions (e.g. It is important that the mechanism for data sharing be for underrepresented disciplines) could build on the identi- appropriate to the level of sensitivity of the data. The fied key components that best match their use case. Some time is ripe for the exchange of expertise in this area. evident gaps and challenges (described below) hinder global • Role clarity Data publishing relies on collaboration. interoperability and adoption of a common model. For better user guidance and greater confidence in the services, an improved understanding of roles, responsi- 5.1 Gaps and challenges bilities, and collaboration is needed. 
Documentation of ‘who does what’ in the current, mid and long term would Whilst our analysis extended across all the data-publishing ensure a smoother provision of service. entities we studied (repositories, journals, and projects), • Business models There is strong interest in establish- many of the most obvious gaps and challenges were observed ing the value and sustainability of repositories. Beagrie amongst the repository category. and Houghton42 produced a synthesis of data centre stud- While there are still many disciplines for which no specific ies combining quantitative and qualitative approaches to domain repositories exist, we are seeing a greater number quantify value in economic terms and present other, non- of repositories of different types (re3data.org indexes over economic, impacts and benefits. A recent Sloan-funded 1200 repositories). In addition to the disciplinary reposi- meeting of 22 data repositories led to a white paper tories, there are many new repositories designed to house on Sustaining Domain Repositories for Digital Data.43 broader collections, e.g. Zenodo, Figshare, Dryad, Dataverse, However, much more work is needed to understand viable and the institutional repositories at colleges and universi- financial models for publishing data44 and to distinguish ties. “Staging” repositories are also being established that trustworthy collaborations. extend traditional workflows into the collaborative working • Data citation support Although there appears to be space—e.g. Open Science Framework38 which has a pub- widespread awareness, there is only partial implemen- lishing workflow with Dataverse. Another example is the tation of the practices and procedures recommended by Sustainable Environment Actionable Data (SEAD)39 project, which provides project spaces in which scientists manage, 40 RDA/WDS Publishing Data Services WG: http://rd-alliance. 
find, and share data, and which also connects researchers to repositories that will provide long-term access and preservation of data.

Despite much recent data-publishing activity, our analysis of the case studies found that challenges remain, in particular when considering more complex workflows. These include:

• …the Data Citation Implementation Group. There is a wide range of PIDs emerging, including ORCID, DOI, FundRef, RRID, IGSN, ARK, and many more. Clarity and ease of use need to be brought to this landscape.45
• Metrics Creators of data and their institutions and funders need to know how, and how often, their data are being reused.
• Incentives Data publishing offers potential incentives to researchers, e.g. a citable data product, persistent data documentation, and information about the impact of the research. Also, many repositories offer support for data submission. The benefits of data publishing need to be better communicated to researchers. In addition, stakeholders should disseminate the fact that formal data archiving results in greater numbers of papers and thus more science, as Piwowar and Vision [4] and Pienta et al. [5] have shown. There should also be increased clarity with respect to institutional and funder recognition of the impact of research data.

38 https://osf.io/.
39 http://sead-data.net/.
…org/groups/rdawds-publishing-data-services-wg.html and http://www.icsu-wds.org/community/working-groups/data-publication/services.
41 See the hiberlink Project for information on this problem and work being done to solve it: http://hiberlink.org/dissemination.html.
42 http://blog.beagrie.com/2014/04/02/new-research-the-value-and-impact-of-data-curation-and-sharing/.
43 http://datacommunity.icpsr.umich.edu/sites/default/files/WhitePaper_ICPSR_SDRDD_121113.pdf.
44 The RDA/WDS Publishing Data Costs IG addresses this topic: http://rd-alliance.org/groups/rdawds-publishing-data-ig.html.
45 http://project-thor.eu/.

The challenges of more complex data—in particular, big data and dynamic data—also need to be addressed. Whereas processes from the past 10 years focus on irrevocable, fully documented data for unrestricted (research) use, data publishing needs to be 'future proof' (Brase et al. [29]). There is a requirement from research communities46 to cite data before it has reached an overall irrevocable state and before it has been archived. This particularly holds true for communities with high-volume data (e.g. high-energy physics; climate sciences), and for data citation entities that include multiple individual datasets, for which the time needed to reach an overall stable data collection is long. Even though our case study analysis found that data citation workflows are implemented or considered by many stakeholder groups involved in data publishing, dynamic data citation challenges have not been widely addressed. Version control and keeping a good provenance record47 of datasets are also critical for citation of such data collections and are indispensable parts of the data-publishing workflow.

With respect to gaps and challenges, we recognize that the case studies we analyzed are limited in scope. This relates to an overall challenge we encountered during the project: it is difficult to find clear and consistent human-readable workflow representations for repositories. The trust standards (e.g. Data Seal of Approval,48 Nestor, ISO 16363 and World Data System) require that repositories document their processes, so this may change in the future, but we would add our recommendation that repositories publish their workflows in a standard way for greater transparency. This would bolster confidence in repositories and also increase user engagement.

The diversity we found is not surprising, nor is it necessarily undesirable. Case studies and ethnographies of data practices have found that workflows for dealing with data 'upstream' of repositories are highly diverse. Data sharing practices vary considerably at the sub-disciplinary level in many cases (e.g. Cragin et al. [30]), so there is likely to be a continued need to support diverse approaches and informed choice rather than unified or monolithic models (Pryor [31]). Our analysis shows that a variety of workflows has evolved, and more are emerging, so researchers may be able to choose their best fit on the basis of guidance that distinguishes relevant features, such as QA/QC and different service or support levels.

5.2 Best practice recommendations and conclusions

Based on selected case studies, key components in data publishing have been identified, leading to a reference model for data publishing. The analysis, and in particular the conversations with the key stakeholders involved in data-publishing workflows, highlighted best practices which might be helpful as recommendations for organizations establishing new workflows and for those seeking to transform or standardize existing procedures:

• Start small and build components one by one in a modular way, with a good understanding of how each building block fits into the overall workflow and what the final objective is. These building blocks should be open-source/shareable components.
• Follow standards whenever available to facilitate interoperability and to permit extensions based on the work of others using the same standards. For example, Dublin Core is a widely used metadata standard, making it relatively easy to share metadata with other systems. Use disciplinary standards where/when applicable.
• It is especially important to implement and adhere to standards for data citation, including the use of persistent identifiers (PIDs).
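The Dublin Core recommendation above can be made concrete with a minimal sketch: describing a dataset with a handful of Dublin Core elements so that its metadata can be exchanged with other systems. The dataset, creator, and DOI below are invented for illustration; only the element names come from the Dublin Core Metadata Element Set.

```python
# Sketch: a hypothetical dataset described with simple Dublin Core elements
# (dc:title, dc:creator, dc:identifier, ...) serialized with the Python
# standard library. The record content is invented for illustration.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

def dublin_core_record(fields: dict) -> str:
    """Serialize an {element: value} mapping as a flat Dublin Core XML record."""
    root = ET.Element("metadata")
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

record = dublin_core_record({
    "title": "Example ocean temperature dataset",
    "creator": "Doe, Jane",
    "identifier": "doi:10.1234/example.5678",  # hypothetical DOI
    "type": "Dataset",
    "date": "2015-11-10",
})
print(record)
```

Because the element set is so widely implemented, a flat record like this is easy for aggregators and other repositories to harvest, which is exactly the interoperability argument made above.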
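The data-citation recommendation can likewise be sketched in code: building a human-readable citation and a set of typed links from persistent identifiers. All identifiers below are hypothetical, and the citation pattern (creator, year, title, publisher, identifier) is one common convention rather than a single mandated standard.

```python
# Sketch: assembling a data citation and machine-actionable links from PIDs.
# The DatasetRecord class, the identifiers, and the link relation names are
# illustrative assumptions, not part of any specific repository's API.
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    creators: list      # e.g. [("Doe, Jane", "0000-0002-1825-0097")]
    year: int
    title: str
    repository: str
    doi: str            # dataset DOI, e.g. "10.1234/example.5678"
    article_dois: list  # papers that use or describe the dataset

    def citation(self) -> str:
        """Human-readable citation in a common creator-year-title pattern."""
        names = "; ".join(name for name, _orcid in self.creators)
        return (f"{names} ({self.year}): {self.title}. "
                f"{self.repository}. https://doi.org/{self.doi}")

    def links(self) -> list:
        """Typed (relation, URL) pairs a repository could expose for harvesting."""
        out = [("creator", f"https://orcid.org/{orcid}")
               for _name, orcid in self.creators]
        out += [("isReferencedBy", f"https://doi.org/{d}")
                for d in self.article_dois]
        return out

record = DatasetRecord(
    creators=[("Doe, Jane", "0000-0002-1825-0097")],
    year=2015,
    title="Example ocean temperature dataset",
    repository="Example Data Repository",
    doi="10.1234/example.5678",
    article_dois=["10.1234/journal.article.1"],
)
print(record.citation())
print(record.links())
```

Because every connection here is expressed as a resolvable identifier rather than free text, the same record supports both the human-readable citation and automated harvesting of data-paper-person linkages.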
Linkages between data and publications can be automatically harvested if DOIs for data are used routinely in papers. The use of researcher PIDs such as ORCID can also establish connections between data and papers or other research entities such as software. The use of PIDs can also enable linked open data functionality.49
• Document roles, workflows and services. A key difficulty we had in conducting the analysis of the workflows was the lack of complete, standardized and up-to-date information about the processes and services provided by the platforms themselves. This impacts potential users of the services as well. Part of developing a trusted repository reputation should be a system that clarifies ingest support levels, long-term sustainability guarantees, subject expertise resources, and so forth.

In summary, following the idea of the presented reference model and the best practices, we would like to see a workflow that results in all scholarly objects being connected, linked, citable, and persistent, to allow researchers to navigate smoothly and to enable reproducible research. This includes linkages between documentation, code, data, and journal articles in an integrated environment. Furthermore, in the ideal workflow, all of these objects need to be well documented to enable other researchers (or citizen scientists, etc.) to reuse the data for new discoveries. We would like to see information standardized and exposed via APIs and other mechanisms so that metrics on data usage can be captured. We note, however, that funding and academic reward systems need to value data-driven secondary analysis and reuse of existing data, as well as data publishing as a first-class object. More attention (i.e. more perceived value) from funders will be key to changing this paradigm.

One big challenge is the need for more intensive collaboration among the stakeholder groups. For example, repositories and higher education institutions (which hold a critical mass of research data) and the large journal publishers (which host the critical mass of discoverable, published research) have not yet fully engaged with each other. Although new journal formats are being developed that link data to papers and enrich the reading experience, progress is still being impeded by cultural, technical, and business-model issues.

We have demonstrated that the different components of a data-publishing system need to work, where possible, in a seamless fashion and in an integrated environment. We therefore advocate the implementation of standards, and the development of new standards where necessary, for repositories and all parts of the data-publishing process. Data publishing should be embedded in documented workflows, to help establish collaborations with potential partners and to guide researchers, enabling and encouraging the deposit of reusable research data that will be persistent while preserving provenance.

46 For example, in genomics, there is the idea of numbered "releases" of, for example, a particular animal genome, so that while refinement is ongoing it is also possible to refer to a reference dataset.
47 For scientific communities with high-volume data, the storage of every dataset version is often too expensive. Versioning and keeping a good provenance record of the datasets are crucial for citations of such data collections. Technical solutions are being developed, e.g. by the European Persistent Identifier Consortium (EPIC).
48 http://datasealofapproval.org.
49 At the time of writing, CrossRef had recently announced the concept and approximate launch date for a 'DOI Event Tracker', which could also have considerable implications for the perceived value of data publishing as well as for the issues around the associated metrics (reference: http://crosstech.crossref.org/2015/03/crossrefs-doi-event-tracker-pilot.html by Geoffrey Bilder, accessed 26 October 2015).

References

1. Schmidt, B., Gemeinholzer, B., Treloar, A.: Open Data in Global Environmental Research: The Belmont Forum's Open Data Survey (2015). http://docs.google.com/document/d/1jRM5ZlJ9o4KWIP1GaW3vOzVkXjIIBYONFcd985qTeXE/ed
2. Vines, T.H., Albert, A.Y.K., Andrew, R.L., DeBarre, F., Bock, D.G., Franklin, M.T., Gilbert, K.J., Moore, J.S., Renaut, S., Rennison, D.J.: The availability of research data declines rapidly with article age. Curr. Biol. 24(1), 94–97 (2014)
3. Hicks, D., Wouters, P., Waltman, L., De Rijcke, S., Rafols, I.: Bibliometrics: The Leiden Manifesto for research metrics. Nature 520, 429–431 (2015). http://www.nature.com/news/bibliometrics-the-leiden-manifesto-for-research-metrics-1.17351. Accessed 10 November 2015
4. Piwowar, H., Vision, T.: Data reuse and the open data citation advantage. PeerJ Comput. Sci. (2013). http://peerj.com/articles/175/. Accessed 10 November 2015
5. Pienta, A.M., Alter, G.C., Lyle, J.A.: The enduring value of social science research: the use and reuse of primary research data (2010). http://hdl.handle.net/2027.42/78307. Accessed 10 November 2015
6. Borgman, C.L.: Big Data, Little Data, No Data: Scholarship in the Networked World. MIT Press, Cambridge (2015)
7. Wallis, J.C., Rolando, E., Borgman, C.L.: If we share data, will anyone use them? Data sharing and reuse in the long tail of science and technology. PLoS One 8(7), e67332 (2013). doi:10.1371/journal.pone.0067332
8. Peng, R.D.: Reproducible research in computational science. Science 334(6060), 1226–1227 (2011)
9. Thayer, K.A., Wolfe, M.S., Rooney, A.A., Boyles, A.L., Bucher, J.R., Birnbaum, L.S.: Intersection of systematic review methodology with the NIH reproducibility initiative. Environ. Health Perspect. 122, A176–A177 (2014). http://ehp.niehs.nih.gov/wp-content/uploads/122/7/ehp.1408671.pdf. Accessed 10 November 2015
10. George, B.J., Sobus, J.R., Phelps, L.P., Rashleigh, B., Simmons, J.E., Hines, R.N.: Raising the bar for reproducible science at the US Environmental Protection Agency Office of Research and Development. Toxicol. Sci. 145(1), 16–22 (2015). http://toxsci.oxfordjournals.org/content/145/1/16.full.pdf+html
11. Boulton, G., et al.: Science as an open enterprise. R. Soc. Lond. (2012). https://royalsociety.org/policy/projects/science-public-enterprise/Report/. Accessed 10 November 2015
12. Stodden, V., Bailey, D.H., Borwein, J., LeVeque, R.J., Rider, W., Stein, W.: Setting the default to reproducible: reproducibility in computational and experimental mathematics. Workshop report. Institute for Computational and Experimental Research in Mathematics (2013). http://icerm.brown.edu/tw12-5-rcem/icerm_report.pdf. Accessed 10 November 2015
13. Whyte, A., Tedds, J.: Making the case for research data management. DCC Briefing Papers. Digital Curation Centre, Edinburgh (2011). http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm. Accessed 10 November 2015
14. Parsons, M., Fox, P.: Is data publication the right metaphor? Data Sci. J. 12 (2013). doi:10.2481/dsj.WDS-042. Accessed 10 November 2015
15. Rauber, A., Pröll, S.: Scalable dynamic data citation approaches, reference architectures and applications. RDA WG Data Citation position paper. Draft version (2015). http://rd-alliance.org/groups/data-citation-wg/wiki/scalable-dynamic-data-citation-rda-wg-dc-position-paper.html. Accessed 13 November 2015
16. Rauber, A., Asmi, A., van Uytvanck, D., Pröll, S.: Data citation of evolving data: recommendations of the Working Group on Data Citation (WGDC). Draft—request for comments (2015). Revision of 24 September 2015. http://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150924.pdf. Accessed 6 November 2015
17. Watson, et al.: The XMM-Newton serendipitous survey. V. The Second XMM-Newton serendipitous source catalogue. Astron. Astrophys. 493(1), 339–373 (2009). doi:10.1051/0004-6361:200810534
18. Lawrence, B., Jones, C., Matthews, B., Pepler, S., Callaghan, S.: Citation and peer review of data: moving toward formal data publication. Int. J. Digital Curation (2011). doi:10.2218/ijdc.v6i2.205
19. Callaghan, S., Murphy, F., Tedds, J., Allan, R., Kunze, J., Lawrence, R., Mayernik, M.S., Whyte, A.: Processes and procedures for data publication: a case study in the geosciences. Int. J. Digital Curation 8(1) (2013). doi:10.2218/ijdc.v8i1.253
20. Austin, C.C., Brown, S., Fong, N., Humphrey, C., Leahey, L., Webster, P.: Research data repositories: review of current features, gap analysis, and recommendations for minimum requirements. Presented at the IASSIST Annual Conference, Minneapolis. IASSIST Quarterly preprint. International Association for Social Science Information Services and Technology (2015). http://drive.google.com/file/d/0B_SRWahCB9rpRF96RkhsUnh1a00/view. Accessed 13 November 2015
21. Yin, R.: Case Study Research: Design and Methods, 5th edn. Sage Publications, Thousand Oaks (2003)
22. Murphy, F., Bloom, T., Dallmeier-Tiessen, S., Austin, C.C., Whyte, A., Tedds, J., Nurnberger, A., Raymond, L., Stockhause, M., Vardigan, M.: WDS-RDA-F11 Publishing Data Workflows WG Synthesis FINAL CORRECTED. Zenodo (2015). doi:10.5281/zenodo.33899. Accessed 17 November 2015
23. Stockhause, M., Höck, H., Toussaint, F., Lautenschlager, M.: Quality assessment concept of the World Data Center for Climate and its application to the CMIP5 data. Geosci. Model Dev. 5(4), 1023–1032 (2012). doi:10.5194/gmd-5-1023-2012
24. Starr, J., Castro, E., Crosas, M., Dumontier, M., Downs, R.R., Duerr, R., Haak, L.L., Haendel, M., Herman, I., Hodson, S., Hourclé, J., Kratz, J.E., Lin, J., Nielsen, L.H., Nurnberger, A., Proell, S., Rauber, A., Sacchi, S., Smith, A., Taylor, M., Clark, T.: Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Comput. Sci. 1(e1) (2015). doi:10.7717/peerj-cs.1
25. Castro, E., Garnett, A.: Building a bridge between journal articles and research data: The PKP-Dataverse Integration Project. Int. J. Digital Curation 9(1), 176–184 (2014). doi:10.2218/ijdc.v9i1.311
26. Mayernik, M.S., Callaghan, S., Leigh, R., Tedds, J.A., Worley, S.: Peer review of datasets: when, why, and how. Bull. Am. Meteorol. Soc. 96(2), 191–201 (2015). doi:10.1175/BAMS-D-13-00083.1
27. Meehl, G.A., Moss, R., Taylor, K.E., Eyring, V., Stouffer, R.J., Bony, S., Stevens, B.: Climate Model Intercomparisons: preparing for the next phase. Eos Trans. AGU 95(9), 77 (2014). doi:10.1002/2014EO090001
28. Bandrowski, A., Brush, M., Grethe, J.S., Haendel, M.A., Kennedy, D.N., Hill, S., Hof, P.R., Martone, M.E., Pols, M., Tan, S., Washington, N., Zudilova-Seinstra, E., Vasilevsky, N.: The Resource Identification Initiative: a cultural shift in publishing [version 1; referees: 2 approved]. F1000Research 4, 134 (2015). doi:10.12688/f1000research.6555.1
29. Brase, J., Lautenschlager, M., Sens, I.: The Tenth Anniversary of Assigning DOI Names to Scientific Data and a Five Year History of DataCite. D-Lib Mag. 21(1/2) (2015). doi:10.1045/january2015-brase
30. Cragin, M.H., Palmer, C.L., Carlson, J.R., Witt, M.: Data sharing, small science and institutional repositories. Philos. Trans. R. Soc. A 368(1926), 4023–4038 (2010)
31. Pryor, G.: Multi-scale data sharing in the life sciences: some lessons for policy makers. Int. J. Digital Curation 4(3), 71–82 (2009). doi:10.2218/ijdc.v4i3.115