Data on the Web Best Practices Use Cases & Requirements
W3C Working Group Note, 24 February 2015
Editors:
Deirdre Lee, Derilinx (formerly at Insight@NUIG, Ireland)
Bernadette Farias Lóscio, Centro de Informática - Universidade Federal de Pernambuco, Brazil
Phil Archer, W3C/ERCIM

W3C (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
Abstract
This document lists use cases, compiled by the Data on the
Web Best Practices Working Group, that represent scenarios of how data
is commonly published on the Web and how it is used. This document also
provides a set of requirements derived from these use cases that will be
used to guide the development of the set of Data on the Web Best
Practices and the development of two new vocabularies: Quality and
Granularity Description Vocabulary and Data Usage Description
Vocabulary.
Status of This Document
This section describes the status of this document at the time of its publication.
Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index.
This document is considered to be approaching its final version. Only one or, at most, two further iterations are expected during the lifetime of the working group.
This document was published by the
Data on the Web Best Practices Working Group
as a Working Group Note.
If you wish to make comments regarding this document, please send them to public-dwbp-comments@w3.org (archives). All comments are welcome.
Publication as a Working Group Note does not imply endorsement by the
W3C
Membership. This is a draft document and may be updated, replaced or obsoleted by other
documents at any time. It is inappropriate to cite this document as other than work in
progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
Table of Contents
1. Introduction
2. Use Cases
2.1 ASO: Airborne Snow Observatory
2.2 BBC
2.3 Bio2RDF
2.4 BuildingEye: SME use of public data
2.5 Dados.gov.br
2.6 Digital archiving of Linked Data
2.7 Dutch Base Registers
2.8 GS1 Digital
2.9 ISO GEO Story
2.10 The Land Portal
2.11 LA Times' Reporting of Ron Galperin's Infographic
2.12 LusTRE: Linked Thesaurus fRamework for Environment
2.13 Machine-readability and Interoperability of Licenses
2.14 Mass Spectrometry Imaging (MSI)
2.15 OKFN Transport WG
2.16 Open City Data Pipeline
2.17 Open Experimental Field Studies
2.18 Resource Discovery for Extreme Scale Collaboration (RDESC)
2.19 Recife Open Data Portal
2.20 Retrato da Violência (Violence Map)
2.21 Share-PSI 2.0: Uses of Open Data Within Government for Innovation and Efficiency
2.22 Tabulae - how to get value out of data
2.23 UK Open Research Data Forum
2.24 Uruguay Open Data Catalog
2.25 Web Observatory
2.26 Wind Characterization Scientific Study
3. General Challenges
3.1 A Word on Open and Closed Data
3.2 Requirements by Challenge
4. Requirements
4.1 Requirements for Data on the Web Best Practices
4.2 Requirements for Quality and Granularity Description Vocabulary
4.3 Requirements for Data Usage Description Vocabulary
5. Reading Material
5.1 General Resources
5.2 Relevant Vocabularies
5.3 Communities of Interest
A. Acknowledgements
B. Change history
1. Introduction
This section is non-normative.
There is a growing interest in publishing and consuming data on the
Web. Both government and non-government organizations already make a
variety of data available on the Web, some openly, some with access
restrictions, covering many domains like education, the economy,
security, cultural heritage, eCommerce and scientific data. Developers,
journalists and others manipulate this data to create visualizations and
to perform data analysis. Experience in this field shows that several
important issues need to be addressed in order to meet the requirements
of both data publishers and data consumers.
To address these issues, the Data on the Web Best Practices Working
Group seeks to provide guidance to all stakeholders that will improve
consistency in the way data is published, managed, referenced and used
on the Web. The guidance will take two forms: a set of best practices
that apply to multiple technologies, and vocabularies that are currently
missing but that are needed to support the data ecosystem on the Web.
In order to determine the scope of the best practices and the
requirements for the new vocabularies, a set of use cases has been
compiled. Each use case provides a narrative describing an experience of
publishing and using Data on the Web. The use cases cover different
domains and illustrate some of the main challenges faced by data
publishers and data consumers. A set of requirements, used to guide the
development of the set of best practices as well as the development of
the vocabularies, has been derived from the compiled use cases.
Interpretations of each use case could lead to an unmanageably large
number of requirements and so before including them, each potential
requirement has been assessed against three specific criteria:
Is the requirement specifically relevant to data published on the
Web?
Does the requirement encourage reuse or publication of data on the
Web?
Is the requirement testable?
Only requirements meeting those three criteria have been included.
2. Use Cases
A use case illustrates an experience of publishing and using Data on
the Web. The information gathered from the use cases should be helpful
for the identification of the best practices that will guide the
publishing and usage of Data on the Web. In general, a use case will be
described at least by a statement and a discussion of how the use case
is currently implemented. Use case descriptions demonstrate some of the
main challenges faced by publishers or developers. Information about
challenges will be helpful to identify areas where Best Practices are
necessary. From these challenges, a set of requirements is abstracted in such a way that each requirement motivates the creation of one or more best practices.
2.1 ASO: Airborne Snow Observatory
(Contributed by Lewis John McGibbney, NASA Jet
Propulsion Laboratory/California Institute of Technology)
URL:
The two most critical properties for understanding snowmelt runoff
and timing are the spatial and temporal distributions of snow water
equivalent (SWE) and snow albedo. Despite their importance in
controlling volume and timing of runoff, snowpack albedo and SWE are
still largely unquantified in the US and not at all in most of the
globe, leaving runoff models poorly constrained. NASA/JPL, in
partnership with the California Department of Water Resources, has
developed the Airborne Snow Observatory (ASO), an imaging spectrometer
and scanning Lidar system, to quantify SWE and snow albedo, generate
unprecedented knowledge of snow properties for cutting edge
cryospheric science, and provide complete, robust inputs to water
management models and systems of the future.
Elements:
Domains:
Digital Earth Modeling, Digital Surface
Modeling, Spatial Distribution Measurement, Snow Depth, Snow Water
Equivalent, Snow Albedo.
Obligation/motivation:
Funding provided by NASA
Terrestrial Hydrology, NASA Applied Sciences, and California
Department of Water Resources.
Usage:
Examples of data usage include <24 hr turnaround of flight data, which is passed on to numerous Water Resource Managers, aiding water conservation usage, policy and decision-making processes. Accurate and weekly spatially distributed
SWE has never been produced before, and is highly informative to
reservoir managers who must make tradeoffs between storing water for
summer water supply versus using water before snowmelt recedes for
generation of clean hydropower. Accurate SWE information, when
coupled with runoff forecasting models, can also have ecological
benefits through avoidance of late-spring high flows released from
reservoirs that are not part of the natural seasonal variability.
Quality:
Available in a number of scientific
formats to customers and stakeholders based on customer
requirements.
Lineage:
All ASO data stems directly from
on-board imaging spectrometer and scanning Lidar system instruments.
Size:
Many TB in size. Raw data acquisition depends on the basin/survey size. Recent individual flights generate on the order of ~500 GB, which includes imaging spectrometer and Lidar data. This shrinks considerably, however, if we consider only the data that we would distribute.
Type/format:
Digital Elevation Model / binary image (not currently public), Lidar (raw point clouds) / LAS (not currently public), Raster Zonal Stats / text (not currently public), Snow Water Equivalent / TIFF, Snow Albedo / TIFF
Rate of change:
Recent weekly flights have
provided information on a scale and timing that has never occurred
before. Distributed SWE increases after storms, and decreases during
melt events in patterns that have never before been measured and
will be studied by snow hydrologists for years to come. Once data is captured it is not updated; however, subsequent data is generated from the original data within processing pipelines that include screening for data quality control and assurance.
Data lifespan:
For immediate operational
purposes, the last flight's data become obsolete when a new flight
is made. However, the annual sequence of data sets will be leveraged
by snow hydrologists and runoff forecasters during the next decade
as they are used to improve models and understanding of the spatial
nature of the mountain snowpack.
Potential audience:
(snow) hydrologists,
hydrologic modelers, runoff forecasters, and reservoir operators and
reservoir managers.
Positive aspects:
This use case provides insight into what a NASA funded demonstration
mission looks like (from a data provenance, archival point of view).
It is an excellent opportunity to delve into an earth science mission
which is actively addressing the global problem of water resource
management. Recently senior officials have declared a statewide (CA)
drought emergency and are asking all Californians to reduce their
water use by 20 percent. California, and other U.S. states are
experiencing a serious drought and the state will be challenged to
meet its water needs in the upcoming year. Calendar year 2013 was the
driest year in recorded history for many areas of California, and
current conditions suggest no change is in sight for 2014. ASO is at
the front line of cutting edge scientific research meaning that the
data that backs the mission, as well as the practices adopted within
the project execution, are extremely important to addressing this
issue.
Project collaborators and stakeholders are sent data and information
when it is produced and curated. For some stakeholders, the data (in
an operational sense) they require is very small in size and in such
cases ASO emphasizes speed. It's more like a sharing of information
than delivering a product for the short-term turnaround of
information.
Negative aspects:
Demonstration missions of this caliber also have downsides. With
regards to data best practices, more work is required in the following
areas:
Documentation of processes including data acquisition, provenance
tracking, curation of data products such as bare earth digital earth
models (DEM), full surface digital surface models (DSM), snow
products, snow water equivalents (SWE), etc.
Currently data is not searchable, which makes retrieval of specific data difficult as data volumes grow to this size and nature
There is no publicly available guidance regarding suggested tools
which can be used to interact with the data sources.
Quick turnarounds of operational data may be compromised when ASO
moves beyond a demonstration mission and picks up new customers etc.
This will most likely be due to the time required for the generation and distribution of science-grade products.
Challenges:
Data volumes are large and will grow year on year. The volume of generated data grew by 50% between 2013 and 2014.
On many occasions we require a very quick turn around on
inferences which can be made from the data. This sometimes (but not
always) comes at the cost of reducing the emphasis of best practices
for the generation, storage and archival of projects data.
The data takes the form of science-oriented representation formats, which are atypical of the data most people publish on the Web. Considerable thought needs to be put into how this data can be better accessed.
Requires:
R-AccessUpToDate
R-Citable
R-DataIrreproducibility
R-DataMissingIncomplete
R-FormatMachineRead
R-GeographicalContext
R-GranularityLevels
R-LicenseLiability
R-MetadataAvailable
R-ProvAvailable
R-QualityCompleteness
R-QualityMetrics
R-TrackDataUsage
R-UsageFeedback
and
R-VocabDocum
2.2 BBC
Contributors:
Ghislain
Atemezing (EURECOM)
URL:
Overview:
the BBC provides a
list
of the ontologies
they implement and use for their Linked Data
platform. The site provides access to the ontologies the BBC is using
to support its audience using their applications, such as
BBC
Sport
or
BBC Education.
Each ontology has a short description with metadata information, an introduction, sample data, an ontology diagram and the terms used in the ontology. The metadata includes six fields that are generally filled: authors (as mailto links), creation date, version (current version number), prior version (decimal), license (a link to the license) and a link for downloading the RDF version. For example, see the description of the “Core concepts ontology.” However, the metadata that is available in the HTML page is NOT present in a machine-readable format, i.e. in the ontology itself.
Versioning:
each ontology uses a decimal notation for the version, and the URL for accessing each version file of the ontology is constructed as {BASE-URI}/{ONTO-PREFIX}/{VERSION}.ttl, where {BASE-URI} is
For example, the file of version 1.9 of the “core concepts” ontology is located at
However, between different versions, the URI of the ontology is the same and is of the form {BASE-URI}/{ONTO-PREFIX}/.
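The versioning pattern above can be sketched as follows; the base URI value below is an assumption for illustration, not the BBC's actual base URI:

```python
# Sketch of the BBC ontology versioning URL pattern described above.
# BASE_URI is a placeholder assumption; the real base URI is given on
# the BBC ontologies site.
BASE_URI = "https://example.org/ontologies"  # assumed for illustration


def ontology_uri(prefix: str) -> str:
    """Version-independent URI of an ontology: {BASE-URI}/{ONTO-PREFIX}/"""
    return f"{BASE_URI}/{prefix}/"


def version_file_url(prefix: str, version: str) -> str:
    """URL of one version file: {BASE-URI}/{ONTO-PREFIX}/{VERSION}.ttl"""
    return f"{BASE_URI}/{prefix}/{version}.ttl"


print(ontology_uri("coreconcepts"))             # same URI across versions
print(version_file_url("coreconcepts", "1.9"))  # one file per version
```

The point of the pattern is that the ontology URI stays stable while each version remains individually retrievable.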
Elements:
Domains:
vocabulary catalog, versioning, metadata
Obligation/motivation:
Provide a unique point of access to the vocabularies built within the BBC
Usage:
The site provides access to the ontologies
the BBC is using to support its audience using their applications.
Quality:
High level and domain vocabularies
adapted to BBC applications.
Size:
currently, there are 12 ontologies of
different sizes, from 40 triples to 750 triples.
Type/format:
RDF
/TURTLE, and html pages
describing each ontology
Rate of change:
Depends on the vocabulary and may vary between versions, although no such metadata information is provided
Data lifespan:
n/a
Potential audience:
BBC applications and any user
interested in the domains of the vocabularies (publishers,
researchers or developers)
Challenges:
It would be more consistent to systematically add the metadata provided in the HTML pages describing each BBC ontology to the RDF vocabulary itself.
How to dereference, from a unique URI, different versions of the ontology in different flavors of RDF (XML, Turtle, etc.).
Need to add the modified date along with the version of each ontology.
Requires:
R-MetadataDocum
R-MetadataMachineRead
R-FormatMultiple
R-MetadataStandardized
and
R-VocabVersion
2.3 Bio2RDF
(Contributed by Carlos Laufer)
URL:
Bio2RDF
is an open source project that uses Semantic Web technologies to make
possible the distributed querying of integrated life sciences data.
Since its inception, Bio2RDF has made
use of the Resource Description Framework (
RDF
) and the
RDF
Schema
(RDFS) to unify the representation of data obtained from diverse
fields (molecules, enzymes, pathways, diseases, etc.) and
heterogeneously formatted biological data (e.g. flat-files,
tab-delimited files, SQL, dataset specific formats, XML etc.). Once
converted to
RDF
, this biological data can be queried using the SPARQL
Protocol and
RDF
Query Language (SPARQL), which can be used to
federate queries across multiple SPARQL endpoints.
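A federated query of the kind described above can be sketched as follows; the endpoint URLs and the graph pattern are illustrative assumptions, not actual Bio2RDF service addresses:

```python
# Sketch of a SPARQL 1.1 federated query of the kind described above.
# The endpoint URLs and triple patterns are assumptions for illustration;
# real Bio2RDF endpoints are dataset-specific.
FEDERATED_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?gene ?pathway WHERE {
  # Pattern evaluated at a first (hypothetical) dataset endpoint
  SERVICE <http://example.org/sparql/genes> {
    ?gene rdfs:label ?label .
  }
  # Pattern evaluated at a second (hypothetical) dataset endpoint
  SERVICE <http://example.org/sparql/pathways> {
    ?pathway rdfs:seeAlso ?gene .
  }
}
"""

# The SERVICE keyword (SPARQL 1.1 Federated Query) is what lets a single
# query span multiple endpoints once the data is unified as RDF.
print(FEDERATED_QUERY)
```

Once heterogeneous sources are converted to RDF, one query can join results across them without any prior manual integration.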
Elements:
Domains:
Biological data
Obligation/motivation:
Biological researchers are
often confronted with the inevitable and unenviable task of having
to integrate their experimental results with those of others. This
task usually involves a tedious manual search and assimilation of
often isolated and diverse collections of life sciences data hosted
by multiple independent providers including organizations such as
the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), which provide dozens of user-submitted and curated datasets, as well as smaller institutions such as the Donaldson group, which publishes iRefIndex, a database of molecular interactions aggregated from 13 data
sources. While these mostly isolated silos of biological information
occasionally provide links between their records (e.g.
UniProt
links its entries to hundreds of other datasets), they are typically
serialized in either HTML elements or in flat file data dumps that
lack the semantic richness required to serialize the intent of the
linkage between data records. With thousands of biological databases
and hundreds of thousands of datasets, the ability to find relevant
data is hampered by non-standard database interfaces and an enormous
number of haphazard data formats.
Moreover, metadata about these biological data providers (dataset
source data information, dataset versioning, licensing information,
date of creation, etc.) is often difficult to obtain. Taken
together, the inability to easily navigate through available data
presents an overwhelming barrier to their reuse.
Usage:
Biological research
Quality:
Bio2RDF scripts generate provenance records
using the
W3C
Vocabulary of Interlinked Datasets (
VoID
),
the Provenance vocabulary (PROV) and the Dublin Core vocabulary.
Each data item is linked to a provenance object that indicates the
source of the data, the time at which the
RDF
was generated,
licensing (if available from the data source provider), the SPARQL
endpoint in which the resource can be found, and the downloadable
RDF
file where the data item is located. Each dataset provenance
object has a unique IRI and label based on the dataset name and
creation date. The date-specific dataset IRI is linked to a unique
dataset IRI using the PROV predicate
wasDerivedFrom
such that one can query the dataset SPARQL endpoint to retrieve all
provenance records for datasets created on different dates. Each
resource in the dataset is linked to the date-unique dataset IRI that
is part of the provenance record using the VoID
inDataset
predicate. Other important features of the provenance record include
the use of the Dublin Core
creator
term to link a
dataset to the script on Github that was used to generate it, the
VoID predicate
sparqlEndpoint
to point to the dataset
SPARQL endpoint, and VoID predicate
dataDump
to point
to the data download URL.
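Assembled as Turtle, a provenance record following the description above might look like the following sketch; all IRIs, the dataset name and the date are assumptions for illustration:

```python
# Sketch of a Bio2RDF-style provenance record, assembled as Turtle.
# All IRIs, names and dates below are illustrative assumptions.
dataset = "exampledb"
date = "2015-02-24"

provenance_ttl = f"""\
@prefix void: <http://rdfs.org/ns/void#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Date-specific dataset IRI, derived from the version-independent one
<http://example.org/dataset/{dataset}-{date}>
    prov:wasDerivedFrom <http://example.org/dataset/{dataset}> ;
    dct:creator <https://github.com/example/{dataset}-rdfizer> ;
    void:sparqlEndpoint <http://example.org/{dataset}/sparql> ;
    void:dataDump <http://example.org/download/{dataset}.nt.gz> .

# Each resource links back to the date-specific dataset IRI
<http://example.org/resource/1234>
    void:inDataset <http://example.org/dataset/{dataset}-{date}> .
"""
print(provenance_ttl)
```

Querying on `prov:wasDerivedFrom` then retrieves all dated provenance records for a dataset, as the text describes.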
Dataset metrics
total number of triples
number of unique subjects
number of unique predicates
number of unique objects
number of unique types
unique predicate-object links and their frequencies
unique predicate-literal links and their frequencies
unique subject type-predicate-object type links and their
frequencies
unique subject type-predicate-literal links and their
frequencies
total number of references to a namespace
total number of inter-namespace references
total number of inter-namespace-predicate references
Size:
At the time of writing, thirty-five datasets
have been generated as part of the
Bio2RDF
3 release
. Several of the datasets are themselves collections
of datasets that are now available as one resource. Each dataset has
been loaded into a dataset-specific SPARQL endpoint using Openlink
Virtuoso. All updated Bio2RDF linked data and their corresponding
Virtuoso DB files are available for
Type/format:
RDF
Rate of change: depends on data source
Data lifespan: depends on data source
Potential audience: Biological researchers
References:
Callahan A, Cruz-Toledo J, Ansell P, Klassen D,
Tumarello G, Dumontier M:
Improved
dataset coverage and interoperability with Bio2RDF Release 2
(PDF). SWAT4LS 2012, Proceedings of the 5th International Workshop
on Semantic Web Applications and Tools for Life Sciences, Paris,
France, November 28-30, 2012.
Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette
J: Bio2RDF: towards a mashup to build bioinformatics knowledge
systems. J Biomed Inform 2008, 41(5):706-716.
Razick S, Magklaras G, Donaldson IM: iRefIndex: a
consolidated protein interaction database with provenance. BMC
Bioinformatics 2008, 9:405.
Goble C, Stevens R: State of the nation in data
integration for bioinformatics. J Biomed Inform 2008, 41(5):687-693.
Challenges:
Lack of human-readable metadata.
Data variability (models, sources, etc.).
RDFizations of Datasets.
Wide variety of formats and technologies.
Potential Requirements:
Dataset versioning and updating mechanisms
Standardization of schemas
Integration with other platforms/services
Data Persistence
Requires:
R-AccessLevel
R-AccessUpToDate
R-DataLifecyclePrivacy
R-FormatMultiple
R-FormatStandardized
R-PersistentIdentification
and
R-VocabReference
2.4 BuildingEye: SME use of public data
(Contributed by Deirdre Lee)
URL:
Buildingeye.com makes building and planning information easier to
find and understand by mapping what's happening in your city. In
Ireland local authorities handle planning applications and usually
provide some customized views of the data (PDFs, maps, etc.) on their
own Web site. However there isn't an easy way to get a nationwide view
of the data. BuildingEye, an independent SME, built
to achieve this. However as each local authority didn't have an Open
Data portal, BuildingEye had to directly ask each local authority for
its data. It was granted access to some authorities, but not all. The
data it did receive was in different formats and of varying
quality/detail. BuildingEye harmonized this data for its own system.
However, if another SME wanted to use this data, they would have to go
through the same process and again go to each local authority asking
for the data.
Elements:
Domains:
Planning data
Obligation/motivation:
demand from SME
Usage:
Commercial usage
Quality:
standardized, interoperable across local
authorities
Size:
medium
Type/format:
structured according to legacy system
schema
Rate of change:
daily
Potential audience:
Business, citizens
Governance:
local authorities
Challenges:
Access to data is currently a manual process, on a case-by-case basis
Data is provided in different formats, e.g. database dumps, spreadsheets
Data is structured differently depending on the legacy system schema; concepts and terms are not interoperable
No official Open license associated with the data
Data is not available for further reuse by other parties
Potential Requirements:
Creation of top-down policy on open data to ensure common
understanding and approach
Top-down guidance on recommended Open license usage
Standardized, non-proprietary formats
Availability of recommended domain-specific vocabularies.
Requires:
R-AccessBulk
R-AccessRealTime
R-DataLifecyclePrivacy
R-DataMissingIncomplete
R-DataProductionContext
R-AccessLevel
R-FormatMachineRead
R-FormatOpen
R-FormatStandardized
R-GeographicalContext
R-LicenseAvailable
R-MetadataAvailable
R-MetadataDocum
R-QualityCompleteness
R-QualityComparable
R-SensitivePrivacy
R-SensitiveSecurity
and
R-VocabDocum
2.5 Dados.gov.br
(Contributed by Yasodara)
URL:
Dados.gov.br is the open data portal of Brazil's Federal Government.
The site was built by a community network pulled together by three
technicians from the Ministry of Planning. They managed the group through INDA, the "National Infrastructure for Open Data." CKAN was chosen because it is free software and provides an independent solution for publishing the Federal Government's data catalog on the internet.
Elements:
Domains:
federal budget, addresses, Infrastructure
information, e-gov tools usage, social data, geographic information,
political information, Transport information.
Obligation/motivation:
Data that must be provided to
the public under a legal obligation, the so-called LAI or Brazilian Information Access Act, enacted in 2012.
Usage:
Data that is the basis for services to the
public; Data that has commercial reuse potential.
Quality:
Authoritative, clean data, vetted and
guaranteed.
Lineage/Derivation:
Data came from various
publishers. As a catalog, the site has faced several challenges, one
of them was to integrate the various technologies and formulas used
by publishers to provide datasets in the portal.
Type/format:
Tabular data, text data.
Rate of change:
There is fixed data and data with
high rate of change.
Challenges:
Data integration (lack of vocabularies).
Collaborative construction of the portal: managing online sprints and balancing public expectations.
Licensing the data of the portal: most of the data in the portal does not have a specific license, so different types of license are applied to different datasets.
Requires:
R-AccessLevel
R-DataLifecyclePrivacy
R-DataLifecycleStage
R-DataMissingIncomplete
R-FormatStandardized
R-LicenseAvailable
R-MetadataAvailable
R-GeographicalContext
R-MetadataDocum
R-ProvAvailable
R-QualityOpinions
R-UsageFeedback
R-VocabReference
and
R-VocabVersion
2.6 Digital archiving of Linked Data
(Contributed by Christophe Guéret)
URL:
Digital archives, such as
DANS
in the Netherlands,
have so far been concerned with the preservation of what could be
defined as "frozen" datasets. A frozen dataset is a finished,
self-contained set of data that does not evolve after it has been
constituted. The goal of the preserving institution is to ensure this
dataset remains available and readable for as many years as possible.
This can for example concern an audio recording, a digitized image,
e-books or database dumps. Consumers of the data are expected to look
for specific content based on its associated identifier, download it
from the archive and use it. Now comes the question of the
preservation of Linked Open Data. In contrast to "frozen" datasets, linked data can be qualified as "live" data. The resources it contains are part of a larger entity to which third parties contribute; one of the design principles indicates that other data producers and consumers should be able to point to the data. When LD publishers stop offering their data (e.g. at the end of a project), taking the LD off-line as a dump and putting it in an archive effectively turns it into a frozen dataset, just as with SQL dumps and other kinds of databases. The question then is to what extent this is an issue.
Challenges:
The archive has to think about whether
dereferencing for resources found in preserved datasets is required or
not, also to think about providing a SPARQL endpoint or not. If data
consumers and publishers are fine with having
RDF
data dumps to be
downloaded from the archive prior to its usage - just like any other
digital item so far - the technical challenges could be limited to
handling the size of the dumps and taking care of serialization
evolution over time (e.g. from N-Triples to TriG, or from
RDF
/XML to
HDT
) as the preference for these
formats evolves. Turning a live dataset into a frozen dump also raises
the question of scope. Considering that LD items are only part of a much larger graph that gives them meaning through context, the only valid dump would be a complete snapshot of the entire connected component of the Web of Data graph that the target dataset is part of.
Potential Requirements:
Decide on the importance
of the de-referencability of resources and the potential implications
for domain names and naming of resources. Decide on the scope of the
step that will turn a connected sub-graph into an isolated data dump.
Requires:
R-AccessLevel
R-PersistentIdentification
R-UniqueIdentifier
and
R-VocabReference
2.7 Dutch Base Registers
(Contributed by Christophe Guéret)
URL:
The Netherlands has a
set of registers
that are under consideration for exposure as
Linked (Open) Data in the context of the
"PiLOD"
project. The registers contain information about buildings, people,
businesses that other individual public bodies may want to refer to
for they daily activities. One of them is, for instance, the service
of public taxes ("BelastingDienst") which regularly pulls out data
from several registers, stores this data in a big Oracle instance and
curates it. This costly and time consuming process could be optimized
by providing on-demand access to up-to-date descriptions provided by
the register owners.
Challenges:
In terms of challenges, linking is for once not much of an issue, as registers already cross-reference unique identifiers. A URI scheme with predictable and persistent URIs is being considered for implementation. Actual challenges include:
Capacity: at this point, it is considered unreasonable to ask
every register to publish its own data. Some of them export what
they have on the national open data portal. This data has been used
to do some testing with third-party publications from PiLOD project
members but this is rather sensitive as a long term strategy
(governmental data has to be traceable/trustable as such). The middle
ground solution currently deployed is the PiLOD platform, a
(semi)-official platform for publishing register data.
Privacy: some of the register data is personal or may become so
when linked to others (e.g. when addresses are used to disambiguate
personal data). Some registers will require secure access, limiting some of their data to some people only (an example of non-open Linked Data).
Some others can go along with open data as long as they get a
precise log of who is using what.
Revenue: institutions working under mixed
government/non-government funding generate part of their revenue by
selling some of the data they curate. Switching to an open data
model will cause a direct loss in revenue that has to be compensated
for by other means. This does not have to mean closing the data,
e.g. a model of open dereferencing plus paid dumps can be
considered, as well as other indirect revenue streams.
Requires:
R-AccessLevel
R-FormatMultiple
R-PersistentIdentification
R-SensitivePrivacy
R-UniqueIdentifier
and
R-VocabReference
2.8 GS1 Digital
(Contributed by Mark Harrison, University of
Cambridge & Eric Kauz, GS1).
Retailers and Manufacturers / Brand Owners are beginning to
understand that there can be benefits to openly publishing structured
data about products and product offerings on the Web as Linked Open
Data. Some of the initial benefits may be enhanced search listing
results (e.g. Google Rich Snippets) that improve the likelihood of
consumers choosing such a product or product offer over an alternative
product that lacks the enhanced search results. However, the longer
term vision is that an ecosystem of new product-related services can
be enabled if such data is available. Many of these will be
consumer-facing and might be accessed via smartphones and other mobile
devices, to help consumers to find the products and product offers
that best match their search criteria and personal preferences or
needs — and to alert them if a particular product is incompatible with
their dietary preferences or other criteria such as ethical /
environmental impact considerations — and to suggest an alternative
product that may be a more suitable match. A more
complete
description
of this use case is available.
Elements:
Domains:
Product master data (e.g. technical specifications,
ingredients, nutritional information, dimensions, weight,
packaging).
Product offerings (e.g. sales price, availability (online,
locally), payment options, delivery/collection options).
Ethical / environmental claims about a product and its
production process.
Obligation/motivation:
initially, enhanced search result listings (e.g. Google Rich
Snippets);
vision is to enable an ecosystem of new digital apps around
product data;
the food sector in the
EU
is already obliged under new food
labelling legislation (
EU
1169 / 2011, Article 14) to provide
the same amount of information about a food product that is sold
online to consumers as the information that would be available
to them from the product packaging if they picked up the product
in-store. Although the legislation does not suggest that Linked
Open Data technology should be used to make the same information
available in a machine-readable format, there is currently
significant investment and effort to upgrade Web sites to
provide accurate and detailed information about food products;
the GS1 Digital team consider that for a relatively small amount
of effort, these companies could gain some tangible benefits
(e.g. enhanced search results) from such compliance efforts by
using Linked Open Data technology within their Web pages.
Usage:
data providing transparency about product characteristics
data used to help consumers make informed choices about which
products to buy/consume
Quality:
Very important to have trustworthy
authoritative data from respective organizations.
Size:
Typically 20+ factual claims per product -
probably 40+ RDF triples.
Type/format:
HTML + RDFa / JSON-LD / Microdata.
Rate of change:
mostly static data initially — but
subject to some variation over time.
Data lifespan:
data should remain accessible until
products are no longer considered to be in circulation; this
represents a challenge for deprecated product lines. Data that is
stated authoritatively by one organization might be embedded /
referenced in the data asserted by another organization; this raises
concerns that embedded data becomes stale if it is
inadequately synchronized, and that referenced data is not dereferenced
(and therefore not discovered / gathered) by consumers of the data.
From a liability perspective, there also needs to be clarity about
which organization asserted which factual information — and about
which organization has the authority to assert
specific factual claims.
Potential audience:
machine-readable (search
engines, data aggregators, mobile apps etc.)
Challenges:
Linked Open Data about products is likely to be highly distributed
in nature and various parties have authority over specific claims.
Accreditation agencies have authority over ethical/environmental
claims.
Brand owners / manufacturers have authority over product master
data.
Retailers have authority over facts related to product offerings
(price, availability etc.).
An organization (e.g. retailer) might embed authoritative data
asserted by another organization (e.g. brand owner) and there is the
risk that such embedded information becomes stale if it is not
continuously synchronized.
An organization (e.g. retailer) might reference a graph of
authoritative data that can be retrieved via an HTTP request to a
remote HTTP URI. There is a risk that software or search engines
consuming Linked Open Data containing such references may fail to
dereference such HTTP URIs and in doing so may fail to gather all of
the relevant data.
Organizations are currently faced with a choice of whether to
embed machine-readable structured data in their Web pages using a
block approach (e.g. using JSON-LD) or an inline approach
(e.g. using RDFa, RDFa Lite or Microdata). A block approach
(JSON-LD) may be simpler and less brittle than inline annotation,
especially as it can easily be decoupled from structural changes to
the body of the Web page that may happen over time in the redesign
of a Web site. At present, tool support for the 3 major markup
approaches for embedded Linked Open Data (RDFa, JSON-LD, Microdata)
is unequal across the three formats: some tools may not export or
import / ingest all 3 formats, and some tools even fail to extract data
from JSON-LD markup created by their corresponding export tool.
There are some significant challenges to ensure that the structured
data embedded within a Web page is correctly linked to form coherent
RDF
triples, without any dangling nodes that should be connected to
the Subject or other nodes.
Only through the provision of best-in-class tool support that
recognizes all three major formats on a completely equal footing can
organizations have any confidence that they can use any of the 3
major markup formats and verify / validate that their
own markup does result in the correct RDF triples.
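As a minimal illustration of the block approach discussed above, the following sketch builds a schema.org Product description as a standalone JSON-LD object of the kind that could be embedded in a script element. The product name, GTIN and offer values are invented placeholders, not data from this use case.

```python
import json

# Sketch of the "block approach": a schema.org Product description
# serialized as a self-contained JSON-LD object, decoupled from the
# page's HTML structure. All values below are placeholders.
product = {
    "@context": "http://schema.org/",
    "@type": "Product",
    "name": "Example Breakfast Cereal",  # hypothetical product
    "gtin13": "0000000000000",           # placeholder GTIN
    "offers": {
        "@type": "Offer",
        "price": "3.49",
        "priceCurrency": "EUR",
        "availability": "http://schema.org/InStock",
    },
}

block = json.dumps(product, indent=2)
print(block)
```

Because such a block is independent of the surrounding markup, a site redesign does not require re-annotating the page body, which is the robustness argument made above for JSON-LD over inline annotation.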
Potential Requirements:
The ability to determine who asserted various facts — and whether
they are the organization that can assert those facts
authoritatively.
Where data from other sources is embedded, there is a risk that
the embedded data might be stale. It is therefore helpful to
indicate which graph of triples is a snapshot in time of data from
another source - and to provide a link to the original source, so
that the consumer of the data has the opportunity to obtain a fresh
version of the live data rather than relying on a potentially stale
snapshot graph of data.
DWBP
could provide guidance about how to indicate which graph of data is
a snapshot and where it came from.
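One possible shape such guidance could take, sketched here as an assumption rather than a DWBP recommendation, is to annotate the snapshot graph's identifier with PROV-O terms (prov:wasDerivedFrom for the original source, prov:generatedAtTime for when the snapshot was taken). The graph URIs below are hypothetical placeholders.

```python
# Sketch: marking a graph as a snapshot of remote data using PROV-O
# terms, emitted as a TriG-style annotation. URIs are placeholders.
SNAPSHOT_GRAPH = "http://retailer.example/graphs/product-123-snapshot"
SOURCE_GRAPH = "http://brandowner.example/products/123"
TAKEN_AT = "2015-02-24T00:00:00Z"

trig = f"""@prefix prov: <http://www.w3.org/ns/prov#> .

<{SNAPSHOT_GRAPH}>
    prov:wasDerivedFrom <{SOURCE_GRAPH}> ;
    prov:generatedAtTime "{TAKEN_AT}"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
"""
print(trig)
```

A consumer finding such an annotation can follow the prov:wasDerivedFrom link to obtain a fresh version of the live data rather than relying on the potentially stale snapshot.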
Consumers of Linked Open Data about products might rely on it for
making decisions — not only about purchase but even consumption. If
the data about a product is inaccurate or out-of-date, we might need
to provide some guidance about how liability terms and disclaimers
can be expressed in Linked Open Data. We’re not suggesting that we
define such terms from a legal perspective, but perhaps there is an
existing framework in a similar way that there is an existing
framework for expressing various licences of the data? If not,
perhaps such a framework needs to be developed - but outside of the
DWBP
group? Licensing generally says what you’re allowed to do with
the data - but I don’t think it says anything about liability for
using the data or making decisions based on that data. This area
probably needs some clarification, particularly if there is a risk
of injury or death (due to inaccurate information about allergens in
a food product).
Requires:
R-AccessUpToDate
R-Citable
R-FormatMultiple
R-FormatStandardized
R-LicenseLiability
R-PersistentIdentification
and
R-ProvAvailable
2.9
ISO GEO Story
(Contributed by Ghislain Atemezing)
ISO GEO manages catalog records of geographic information in XML
that conform to ISO-19139, a French adaptation of ISO-19115 (
data
sample
). They currently export thousands of such records, but
they need to manage them better. In their platform, they store the
information in a more conventional manner and use this standard to
export datasets compliant with the
INSPIRE
standards or via the
OGC
's
CSW
protocol.
Sometimes, they have to enrich their metadata using tools like
GeoSource, accessed through an SDI, with their own metadata records.
ISO GEO wants to be able to integrate
all the different implementations of ISO-19139 in different tools in a
single framework to better understand the thousands of metadata
records they use in their day-to-day business. Types of information
recorded in each file include: contact info (metadata) [data issued],
spatial representation, reference system info [code space], spatial
resolution, geographic extension of the data, file distribution, data
quality and process step (
example
).
Challenges:
Achieve interoperability between supporting applications, e.g.
validation and discovery services built over a metadata repository.
Capture the semantics of the current metadata records with respect
to ISO-19139.
A unified way to have access to each record within the catalog at
different levels: local, regional, national or
EU
level.
Requires:
R-AccessUpToDate
R-DataEnrichment
R-FormatLocalize
R-FormatMachineRead
R-GranularityLevels
R-LicenseAvailable
R-MetadataMachineRead
R-MetadataStandardized
R-PersistentIdentification
R-ProvAvailable
and
R-VocabReference
2.10
The Land Portal
(Contributed by Carlos Iglesias)
URL:
The IFAD Land Portal platform has been completely rebuilt as an Open
Data collaborative platform for the Land Governance community. Among
the new features the Land Portal provides access to more than 100
indicators from more than 25 different sources on land governance
issues for more than 200 countries over the world, as well as a
repository of land-related content and documentation. Thanks to the
new platform, people can:
curate and incorporate new data and metadata by means of different
data importers and making use of the underlying common data model;
search, explore and compare the data through countries and
indicators; and
consume and reuse the data by different means (i.e. raw data
download at the data catalog; linked data and SPARQL endpoint at
RDF
triplestore; RESTful API; and built-in graphic visualization
framework).
Elements:
Domains:
Land Governance; Development
Obligation/motivation:
To find reliable data-driven
indicators on land governance and put them all together to
facilitate access, study, analysis, comparison and data gaps
detection.
Usage:
Research; Policy Making, Journalism;
Development; Investments; Governance; Food security; Poverty; Gender
issues.
Quality:
Every sort of data, from high quality to
unverified.
Size:
Varies, but low-medium in general.
Type/format:
Varies: APIs; JSON; spreadsheets; CSV;
HTML; XML; PDF...
Rate of change:
Usually yearly, but also higher
rates (monthly, quarterly...).
Data lifespan:
Unlimited.
Potential audience:
Practitioners; Policy makers;
Activists; Researchers; Journalists.
Challenges:
Data coverage.
Quality of data and metadata.
Lack of machine-readable metadata.
Inconsistency between different data sources.
Wide variety of formats and technologies.
Some non machine-readable formats.
Data variability (models, sources, etc.).
Data provenance.
Diversity and (sometimes) complexity of Licenses.
Internationalization issues (e.g. different formats for numbers,
dates, etc.) and multilingualism.
Potential Requirements:
Availability of general use taxonomies (countries, topics, etc.).
Data interoperability i.e. domain-specific vocabularies for a
common data model with reference formats and protocols.
Data persistence.
Versioning mechanisms.
Requires:
R-AccessBulk
R-AccessRealTime
R-DataEnrichment
R-DataVersion
R-FormatLocalize
R-FormatMachineRead
R-FormatMultiple
R-FormatStandardized
R-GeographicalContext
R-GranularityLevels
R-MetadataAvailable
R-MetadataMachineRead
R-MetadataStandardized
R-ProvAvailable
R-PersistentIdentification
R-QualityCompleteness
R-QualityMetrics
R-TrackDataUsage
R-UniqueIdentifier
R-VocabDocum
R-VocabOpen
R-VocabReference
and
R-VocabVersion
2.11
LA Times' Reporting
of Ron Galperin's Infographic
(Contributed by Phil Archer)
URL:
On 27 March 2014, the LA Times published a story
Women earn 83
cents for every $1 men earn in L.A. city government
. It was
based on an Infographic released by LA's City Controller, Ron
Galperin. The Infographic was based on a dataset published on LA's
open data portal,
Control Panel LA
. That portal uses the
Socrata
platform which offers a number of spreadsheet-like tools for examining
the data, the ability to download it as CSV, embed it in a Web page
and see its metadata.
Positive aspects:
The LA Times story makes its sources clear (it also links to a
related
Pew Research Center article
).
It offers readers a commentary on the particular issue raised and
is easy for anyone to digest.
Data sources are cited directly and can be followed up on by
(human) readers.
Negative aspects:
The Infographic itself only cites the
data
portal
, not the
specific
dataset
The
metadata
provided on the data portal is very sparse with many
fields left empty.
The dataset is itself the result of an analysis (there are only 8
lines in the table); the raw data on which it is based is not cited,
let alone made available, and the methods used are not described.
Challenges:
Data Citation - how could Ron Galperin have referred to the source
data in the Infographic? (the URI is way too long). QR code? Short
PURL?
How could the publisher of the data link to the Infographic as a
visualization of it?
In this case, the creator of the underlying data is the same as
the creator of the Infographic, but if they were different, how
could the data creator discover the Infographic, still less the
media report about it?
The methodology used is not explained - making it hard to assess
trustworthiness. How can provenance be described?
The metadata is incomplete and does not use a recognized standard
vocabulary, making automated discovery and use by anyone other than
the data creator difficult.
Other Data Journalism blogs:
FiveThirtyEight
Wall Street Journal’s
Number Guy column
Guardian’s
data blog
Requires:
R-Citable
R-DataMissingIncomplete
R-DataProductionContext
R-FormatMultiple
R-FormatOpen
R-GeographicalContext
R-LicenseAvailable
R-MetadataAvailable
R-MetadataStandardized
R-QualityMetrics
R-UniqueIdentifier
and
R-TrackDataUsage
2.12
LusTRE: Linked
Thesaurus fRamework for Environment
(Contributed by Riccardo Albertoni, CNR-IMATI,
Genoa, Italy)
URL:
LusTRE is a framework that combines existing thesauri to support the
management of environmental resources. It considers the heterogeneity
in scope and levels of abstraction of existing environmental thesauri
as an asset when managing environmental data; thus it exploits Linked
Data (SKOS, RDF, etc.) in order
to provide a multi-thesauri solution for
INSPIRE
data themes
related to nature conservation.
LusTRE is intended to support metadata compilation and data/service
discovery according to ISO 19115/19119. The development of LusTRE
includes:
a review of existing environmental thesauri and their
characteristics in terms of multilingualism, openness and quality;
the publication of environmental thesauri as Linked Data;
the creation of linksets among published thesauri as well as
well-known thesauri exposed as Linked Data by third-parties;
the exploitation of aforementioned linksets to take advantage of
thesaurus complementarities in terms of domain specificity and
multilingualism.
Quality of thesauri and linksets is an issue that is not necessarily
limited to the initial review of thesauri; it should be monitored and
promptly documented.
In this respect, a standardized vocabulary for expressing dataset and
linkset quality would be needed to make the quality assessment of
thesauri included in LusTRE accessible. Considering the importance
of linkset quality in the achievement of an effective cross-walking
among thesauri, further services for assessing the quality of linksets
are going to be investigated. Such services might be developed by
extending the measure proposed in
Albertoni et al, 2013
(PDF), so that linksets among thesauri can be
assessed considering their potential when exploiting interlinks for
thesaurus complementarities.
LusTRE is currently under development within the
EU
project eENVplus
(CIP-ICT-PSP grant No. 325232); it extends the common thesaurus
framework
De Martino et al. 2011
previously resulting from the
EU
project NatureSDIplus
(ECP-2007-GEO-317007).
Elements:
Domains:
Geographic information. Thesauri and
controlled vocabularies provided within LusTRE are meant to ease
the management of Geographical Data and Services.
Obligation/motivation:
Activity foreseen in
EU
project which encourages the adoption of
INSPIRE
metadata
implementation rules.
Usage:
Data that is the basis for services to the
public.
Quality:
Largely variable.
Lineage:
Thesauri and controlled vocabulary
provided come from third parties.
Size:
Small, most of the thesauri size is less
than 100MB.
Type/format:
LusTRE publishes
SKOS
RDF
, but the
thesauri considered for inclusion in LusTRE are not necessarily in
that format.
Rate of change:
Depends on the thesaurus; on
average it is a low rate of change.
Data lifespan:
Beyond the lifespan of eENVPlus
project (2013 – 2015).
Potential audience:
Public administrations
involved in the cataloguing of geographical information and Spatial
Data Infrastructure. Decision makers searching in Spatial Data
Infrastructure.
Positive aspects:
The use case includes publication
as well as consumption of data.
Challenges:
Diversity and (sometimes) complexity of Licenses.
Issues pertaining to multilingualism.
Assessment and documentation of dataset and linkset quality with
domain-dependent quality metrics.
Requires:
R-AccessBulk
R-Citable
R-DataEnrichment
R-DataVersion
R-FormatMachineRead
R-FormatMultiple
R-FormatOpen
R-FormatStandardized
R-LicenseAvailable
R-MetadataAvailable
R-MetadataDocum
R-MetadataMachineRead
R-MetadataStandardized
R-PersistentIdentification
R-ProvAvailable
R-QualityComparable
R-QualityCompleteness
R-QualityMetrics
R-QualityOpinions
R-TrackDataUsage
R-UniqueIdentifier
R-UsageFeedback
R-VocabDocum
R-VocabOpen
R-VocabReference
and
R-VocabVersion
2.13
Machine-readability
and Interoperability of Licenses
(Contributed by Deirdre Lee, based on
post
by Leigh Dodds).
There are many different
licenses
available under which data can be published on the Web, e.g.
Creative
Commons
Open Data
Commons
, national licenses, etc. It is important that the
license is available in a machine-readable format. Leigh Dodds has
done some work towards this with the
Open
Data Rights Statement Vocabulary
including guides for
publishers
and
reusers.
Another issue is that when data under different licenses are combined, the
license terms under which the data is available also have to be
merged. This interoperability of licenses is a challenge.
Challenges:
Standard vocabulary for data licenses.
Machine-readability of data licenses.
Interoperability of data licenses.
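A machine-readable licence statement of the kind called for above can be sketched as dataset metadata that points to its licence with dct:license, expressed in JSON-LD. The dataset URI below is a hypothetical placeholder; the licence URI is the real Creative Commons Attribution 4.0 deed.

```python
import json

# Sketch of machine-readable licensing: dataset metadata linking to
# its licence by URI rather than describing it in free text.
# The dataset URI is a placeholder; the licence URI is CC BY 4.0.
dataset = {
    "@context": {"dct": "http://purl.org/dc/terms/"},
    "@id": "http://data.example.org/dataset/transport-stops",  # placeholder
    "dct:title": "Transport stops",
    "dct:license": {
        "@id": "https://creativecommons.org/licenses/by/4.0/"
    },
}

print(json.dumps(dataset, indent=2))
```

A consumer can then read the licence URI programmatically instead of parsing free-text terms; how the terms of several such licences compose when datasets are combined remains the open interoperability challenge noted above.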
Requires:
R-LicenseAvailable
NB
there is also a requirement for licenses to be
interoperable, but this is out of scope as defined by the Working
Group's
charter
2.14
Mass Spectrometry
Imaging (MSI)
(Contributed by Annette Greiner, Lawrence
Berkeley National Laboratory, California)
URL:
Mass spectrometry imaging (MSI) is widely applied to image complex
samples for applications spanning health, microbial ecology, and high
throughput screening of high-density arrays. MSI has emerged as a
technique suited to resolving metabolism within complex cellular
systems, where understanding the spatial variation of metabolism is
vital for making a transformative impact on science. Unfortunately,
the scale of MSI data and the complexity of analysis present an
insurmountable barrier to scientists: a single 2D image may be
many gigabytes, and comparison of multiple images is beyond the
capabilities available to most scientists. The OpenMSI project will
overcome these challenges, allowing broad use of MSI by researchers by
providing a Web-based gateway for the management and storage of MSI data,
the visualization of the hyper-dimensional contents of the data, and
statistical analysis.
Elements:
Domains:
imaging mass spectrometry, life
sciences, microscopy, analytical chemistry.
Obligation/motivation:
scientific analysis,
reporting results, collaboration
Usage:
Data sets can be contributed by
researchers anywhere in the world and perused/analyzed by anyone.
Users can share their data with individuals and the public using a
familiar group and users view/edit/own permission scheme. Once their
dataset is in the system, a researcher can select subsets of the
data for viewing as an image or spectrum. Researchers can perform
statistical analysis of their data, e.g. via non-negative matrix
factorization, while the API and online viewers enable users to
interact with derived analytics in the same way as with raw data.
Users can also download individual images. A REST API provides
programmatic access to enable custom remote data analytics and
retrieval of data subsets.
Quality:
varies with mass spectrometry instrument
used, preparation of sample.
Size:
Average sizes typically range from 10-50 GB
per sample (before compression). Larger images of 50 - 500GB can
already be generated today. Each lab with an OpenMSI account
generates typically 2-5 samples per week.
Type/format:
Multiscale, multimodal, and
multidimensional data stored using the OpenMSI file format based on
HDF5.
Rate of change:
Underlying data for an experiment
does not generally change, though new analyses and metadata will be
added over time.
Data lifespan:
years to decades.
Potential audience:
working scientists interested
in obtaining spatially resolved chemical information about samples
including scientists researching cancer, agriculture, and synthetic
biology.
Positive aspects:
huge improvement in ease of
analysis over traditional methods, ability to readily share results
with other researchers, ability to download relevant subsets of data,
provides metadata for each sample, self-describing data format, fast
and flexible Web API, interactive Web-based exploration that enables
user to view data that cannot be opened using standard MSI tools.
Negative aspects:
submission of metadata should be
easier and automated. As it scales, we'll need to facilitate discovery
of datasets of interest via search.
Challenges:
Project is largely unfunded and
resources are vitally needed for project to succeed.
Requires:
R-AccessRealTime
R-APIDocumented
R-DataEnrichment
R-FormatMachineRead
R-FormatOpen
R-FormatStandardized
R-MetadataAvailable
R-MetadataDocum
R-MetadataMachineRead
and
R-SensitiveSecurity
2.15
OKFN
Transport WG
(Contributed by Deirdre Lee based on the
2012
ePSI Open Transport Data Manifesto).
The Context: Transportation is an important contemporary issue that
has a direct impact on economic strength, environmental sustainability
and social equity. Accordingly, transport data — largely produced or
gathered by public sector organisations or semi-private entities, quite
often locally — represents one of the most valuable sources of public
sector information (PSI, also called ‘open data’), a key policy area
for many, including the European Commission.
The Challenge: Combined with the advancement of Web technologies and
the increasing use of smart phones, the demand for high quality
machine-readable and openly licensed transport data, allowing for
reuse in commercial and non-commercial products and services, is
rising rapidly. Unfortunately this demand is not met by current
supply: many transport data producers and holders (from the public and
private sectors) have not managed to respond adequately to these new
challenges set by society and technology.
So what do we need?
Access to any transport data of any operator, of high quality, in
real time, against free or at least fair standard conditions.
An inclusive infrastructure, based on common open,
non-discriminatory and interoperable standards and APIs, to which
operators, service providers, developers and users can connect.
An ecosystem wherein universal access and re-usability of transport
data is the rule, not the exception.
Why is this not happening?
Data that is necessary for integrated personal transportation
solutions is rich and encompasses several domains (geospatial data,
environmental data, private service provider data), involving a wide
array of data holders from the public and private sectors. Because
of its very nature, transport data is often held locally.
Legacies create lock-ins that prevent adoption of open standards
and hamper interoperability.
Many operators and incumbent service providers, in particular
those relying on income from sales of data, still regard selective
and exclusive access to transport data as a competitive advantage,
restricting access and reuse through the exercise of intellectual
property rights.
Perceived liability risks, often associated with data quality
issues, prevent operators from opening up their data.
Significant differences between countries, regions and transport
modalities in terms of level of development, market maturity and
associated business models prevent a ‘one size fits all’ solution.
A lack of leadership in the value chain, either by the industry or
from the authorities (whatever the level), limits governance
capabilities as to establishment of access, accessibility and other
framework conditions, creating a need for a subtle mix of mostly
bottom-up instruments and a dash of top-down measures.
Existing market players with associated interests turn
governmental actions into a delicate matter, in particular as to the
question of where the role of the government should start and end
within the value chain and where the market parties should take over
and become the driving factor.
Where market parties need to step in, the lack of a clear and
predictable environment prevents businesses from establishing a
long-term perspective, whereby fair competition needs to be
safeguarded.
Requires:
R-AccessBulk
R-AccessLevel
R-AccessRealTime
R-AccessUpToDate
R-APIDocumented
R-DataMissingIncomplete
R-DataProductionContext
R-FormatLocalize
R-FormatMachineRead
R-FormatOpen
R-GeographicalContext
R-LicenseAvailable
R-LicenseLiability
R-MetadataAvailable
R-QualityComparable
R-QualityCompleteness
R-QualityMetrics
R-SLAAvailable
R-UsageFeedback
and
R-VocabOpen
2.16
Open
City Data Pipeline
(Contributed by Deirdre Lee, based on a
presentation
by Axel Polleres
at
EDF
2014
).
The Open City Data Pipeline aims to provide an extensible platform
to support citizens and city administrators by providing city key
performance indicators (KPIs), leveraging open data sources. An
assumption of open data is that “Added value comes from comparable
open datasets being combined.” Open data needs stronger standards to
be useful, in particular for industrial uptake. Industrial usage has
different requirements than those of an application-building hobbyist
or civil society, so it is important to think about how open data can
be used by industry at the time of publication. The Open City Data Pipeline
project has developed a data pipeline to:
(semi-)automatically collect and integrate various open data
sources in different formats;
compose and calculate complex city KPIs from the collected data.
Current Data Summary
Ca. 475 different indicators
Categories: Demography, Geography, Social Aspects, Economy,
Environment, etc.
from 32 sources (HTML, CSV, RDF, …).
Wikipedia, urbanaudit.org, Statistics from City homepages, country
Statistics, iea.org.
Covering 350+ cities in 28 European countries.
District data for selected cities (Vienna, Berlin).
Mostly snapshots, partially covering timelines.
On average ca. 285 facts per city.
Base assumption (for our use case): added value comes from comparable
open datasets being combined.
Challenges & Lessons Learnt:
Incomplete data: can be partially overcome by:
ontological reasoning (RDF & OWL), by aggregation, or by
rules & equations, e.g. :populationDensity = :population/:area;
statistical methods or Multi-dimensional Matrix
Decomposition (unfortunately only partially successful, because
these algorithms assume normally-distributed data).
Incomparable data: e.g. dbpedia:populationTotal
vs. dbpedia:populationCensus.
Heterogeneity
across open government data efforts:
different
indicators
, different temporal and spatial
granularity;
different
licenses
of open data: e.g. CC-BY, country
specific licences, etc.
Heterogeneous
formats
and heterogeneity within
formats, especially CSV.
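The rule-based gap-filling mentioned above (e.g. :populationDensity = :population/:area) can be sketched in a few lines; the city names and figures below are invented placeholders, and the rule is applied only where both inputs are present.

```python
# Sketch of rule-based completion of an incomplete city-indicator table:
# derive populationDensity = population / area wherever both inputs
# exist. The figures are invented placeholders, not real statistics.
cities = {
    "CityA": {"population": 1_800_000, "area": 415.0},
    "CityB": {"population": 900_000},             # area missing: cannot derive
    "CityC": {"population": 500_000, "area": 100.0},
}

for name, facts in cities.items():
    if "population" in facts and "area" in facts:
        facts["populationDensity"] = facts["population"] / facts["area"]

print(cities["CityC"]["populationDensity"])  # 5000.0
```

Indicators that cannot be derived this way (CityB above) are left missing, which is where the statistical methods mentioned above would be tried next.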
Challenges:
Incomplete data (can be overcome using semantic technologies
and/or statistical methods).
Heterogeneity (indicators, licenses, formats).
Open data needs stronger standards to be useful (in particular for
industrial uptake), at a metadata level, and dataset level.
Metadata is not always uniform, not only titles of columns, but
standardization about units, etc.
Requires:
R-DataMissingIncomplete
R-DataProductionContext
R-FormatMachineRead
R-FormatOpen
R-FormatStandardized
R-FormatLocalize
R-GeographicalContext
R-LicenseAvailable
R-MetadataAvailable
R-MetadataStandardized
R-QualityComparable
R-QualityCompleteness
R-VocabDocum
R-VocabOpen
and
R-VocabReference
2.17
Open Experimental
Field Studies
(Contributed by Eric Stephan)
In 2013 the United States White House published an executive order on
Open Data to help make publicly available data understandable,
accessible, and searchable. A number of historical and on-going
atmospheric studies fall into this category but are not currently
open. This use case describes characteristics of laboratory
experiments and field studies that could be published as open data.
For measurements to be considered useful and comparable to other
findings scientists need to track every aspect of their laboratory and
field experiments. This can include: background describing the purpose
of the experiment, field site selected, instrumentation deployed,
configuration settings, housekeeping data, types of measurements that
need to be taken, work performed on field visits, processing the raw
measurements, intermediate processing data, value added data products,
quality assurance, problem reporting, and standards relied upon for
disseminating the study results including selected data formats,
quality control codes selected, engineering units selected, and
metadata vocabularies relied upon for describing the measurements.
Traditionally knowledge and data about the studies have either been
kept in separate local databases, file systems and spreadsheets, or in
non-record keeping systems. If kept electronically, the experiment in
its entirety may be kept in bulk by way of archive files (tar, zip,
etc.). Measurements from the study may be shared along with background
information in the form of a summarized report or publication, content
management system or wiki site and the bulk of knowledge is largely
retained internally by data providers.
Elements:
Domains:
Open scientific experimental research
relying upon in situ and remote sensing instruments. E.g. wind
studies that may use anemometers and Lidar to study wind
measurements.
Obligation/motivation:
Answer scientific
questions about the characteristics and behavior of the physical
system being studied.
Usage:
Data may be analyzed and visualized by
applications, used in computational models or combined in larger
data sets for larger studies.
Quality:
Housekeeping data, problem reporting,
maintenance history, calibration history.
Size:
Dependent on the length of the study,
measurement rate, and the size of each sample. Size can vary from
kilobytes to tens of gigabytes daily for a single instrument.
Type/format:
raw data is dictated by the
instrument producing the measurements. Intermediate results and
value added products can be in binary, delimited text file, NetCDF,
or stored in other formats.
Rate of change:
depends on the measurement rate.
Data lifespan:
This may vary between scientific
communities. For atmosphere field studies data cannot be reproduced
and may be retained forever. If a laboratory experiment can be
repeated, it may have a limited lifespan. In cases where data is
cited even repeatable experiments will be available to back up the
published research findings.
Potential audience:
domain experts and scientific
peers, science teachers and students. Other domains will use these
results.
Positive aspects:
The Web of Things (instruments),
Linked Services (processing software), and Linked Data communities
offer an opportunity to enhance field or laboratory experiments by coupling
all the elements of the experiment into one composite product.
Leveraging these technologies it is possible to construct a catalog
that acts as a concierge to any collaborator giving them perspectives
on things, services, and data.
Negative aspects:
When data is published on the Web there is no mechanism for users
to rate and review data.
Data providers usually are unaware of new user communities using
measurements.
Challenges:
Publishing experiments to publicly accessible Web-based
archives.
Advertising experiments in catalogs that include comprehensive
information about the things and services used in the experiment.
Providing the composite experiment in such a way that it is useful to
users that are not fellow collaborators.
Identifying new emerging target user communities.
Without specific best practices guidance, data may not be published,
and irreproducible data risks being lost.
Policies need to be defined during experimental design stating when
it is acceptable to publish data and when to keep it initially
private.
Requires:
R-AccessRealTime
R-DataIrreproducibility
R-DataLifecycleStage
R-DataProductionContext
R-FormatMachineRead
R-FormatMultiple
R-FormatStandardized
R-TrackDataUsage
R-UsageFeedback
R-VocabOpen
and
R-VocabReference
2.18
Resource Discovery
for Extreme Scale Collaboration (RDESC)
(Contributed by Sumit Purohit)
URL:
RDESC's objective is to develop a capability for describing, linking,
searching and discovering scientific resources used in collaborative
science. For the purpose of capturing semantic context, RDESC adopts
sets of existing ontologies where possible such as
FOAF
BIBO
and
schema.org
RDESC also introduces new concepts in order to provide a semantically
integrated view of the data. Such concepts have two distinct
functions. The first is to preserve the semantics of the source that
are more specific than what already existed in the ontology. The
second is to provide broad categorization of existing concepts as it
becomes clear that concepts are forming general groups. These
generalizations enable users to work with concepts they understand,
rather than needing to understand the semantics of many different
systems. RDESC strives to provide a framework lightweight enough to be
used as a component in any software system, such as desktop user
environments or dashboards, yet scalable to millions of
resources.
Elements:
Domains:
Scientific Resources: Instruments, Organizations, People
Bibliographic Resources: Publications, Citations
Physical Properties: soil moisture
Digital Data Curation.
Obligation/motivation:
Show value to data publishers in publishing high-quality
Linked Data resources.
Search, browse, and discover semantically tagged data.
Recommend "similar" data.
Use of RDFa and schema.org in HTML to let standard Web search
engines index published pages.
Usage:
Users can pose more expressive queries to search data.
Users can get as close as possible to the source of the
data.
Users can find "similar" data.
Quality:
Important for maintaining the correctness and
quality of search results.
Size:
On the order of 1-2 billion triples as of 19 September
2014.
Type/format:
RDF
LinkedData
Rate of change:
No formal update cycle as of now,
but data has been updated quarterly.
Potential audience:
Scientific Community,
Decision Makers
Positive aspects:
Persistent URI with content negotiation. RDESC uses persistent URI
to describe all the entities in the system.
Use of existing ontologies such as FOAF, BIBO, and schema.org.
Published a specialized RDESC ontology for scientific resources:
RDESC
Ontology
(Turtle)
Enables application developers to use any kind of user interface
suitable for their users' needs.
The provision of
examples
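The first positive aspect above, a persistent URI with content negotiation, typically means the server inspects the HTTP Accept header to decide whether to return Turtle, RDF/XML, or HTML for the same URI. The sketch below shows one simplified way to do that in Python; the supported media types and tie-breaking order are illustrative assumptions, not RDESC's actual implementation.

```python
# Minimal sketch of content negotiation behind a persistent URI.
# The supported types and their precedence are illustrative assumptions.

SUPPORTED = ["text/turtle", "application/rdf+xml", "text/html"]

def negotiate(accept_header, default="text/html"):
    """Pick the best supported media type from an HTTP Accept header."""
    candidates = []
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip()
        q = 1.0
        for f in fields[1:]:
            name, _, value = f.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if media in SUPPORTED:
            candidates.append((q, media))
        elif media == "*/*":
            candidates.append((q, default))
    if not candidates:
        return default
    # Highest q-value wins; ties are resolved by SUPPORTED order.
    best_q = max(q for q, _ in candidates)
    for media in SUPPORTED + [default]:
        if (best_q, media) in candidates:
            return media
    return default

print(negotiate("text/turtle;q=0.9, text/html;q=0.5"))  # text/turtle
```

A client asking for `application/json` (unsupported here) would fall back to the HTML representation.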
Negative aspects:
Difficulties in data curation.
Challenges:
Scalability of such systems.
Automated data curation pipelines.
Metadata about the quality of published data.
Frequency of data updates.
User feedback for data correction/annotation.
Potential Requirements:
Use of Persistent URIs.
Recommending abstract and domain specific ontologies/vocabularies.
Requirements to publish quality of published data.
Requires:
R-AccessLevel
R-AccessRealTime
R-Citable
R-DataLifecyclePrivacy
R-DataMissingIncomplete
R-FormatStandardized
R-PersistentIdentification
R-ProvAvailable
R-SensitiveSecurity
R-SLAAvailable
R-TrackDataUsage
R-UniqueIdentifier
R-VocabOpen
R-VocabReference
and
R-UsageFeedback
2.19
Recife Open
Data Portal
(Contributed by Bernadette Lóscio)
URL:
Recife is a city situated in the Northeast of Brazil and is famous
for being one of Brazil's biggest tech hubs. Recife is also one of
the first Brazilian cities to release data generated by public sector
organizations for public use as open data. The
Open
Data Portal Recife
was created to offer access to a repository
of governmental machine-readable data about several domains,
including: finances, health, education and tourism. Data is available
in CSV and GeoJSON formats and every dataset has metadata that helps
in the understanding and usage of the data. However, the metadata is
not provided using standard vocabularies or taxonomies. In general,
data is created in a static way, where data from relational databases
are exported in a CSV format and then published in the data catalog.
Currently, work is under way to dynamically generate data from
relational databases so that data will be available as soon as it is
created. The main phases of the development of this initiative were:
educating people with appropriate knowledge concerning open data;
identifying the sources of data that potential consumers could find
useful; extracting and transforming data from the original sources
into open formats; configuring and installing the open data catalog
tool; and publishing the data and releasing the portal.
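The static publishing step described above, exporting tables from a relational database into CSV for the catalog, can be sketched as follows. The table and column names are invented for illustration; the portal's real schema is not documented here.

```python
# Toy sketch of static CSV publication: rows are exported from a
# relational database into CSV for the data catalog.
# The table and its columns are hypothetical examples.
import csv
import io
import sqlite3

def export_table_to_csv(conn, table):
    """Dump one database table to a CSV string, header row included."""
    # Note: the table name is trusted here; a real exporter would
    # validate it against the catalog's known tables.
    cur = conn.execute(f"SELECT * FROM {table}")
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur.fetchall())
    return out.getvalue()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health_units (name TEXT, district TEXT)")
conn.execute("INSERT INTO health_units VALUES ('Unit A', 'Boa Viagem')")
print(export_table_to_csv(conn, "health_units"))
```

Moving from this static export to the dynamic generation the portal is working towards would mean running the same query on demand rather than publishing a snapshot file.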
Elements:
Domains:
Base registers, cultural heritage
information, geographic information, infrastructure information,
social data and tourism information
Obligation/motivation:
Data that must be provided to
the public under a legal obligation (Brazilian Information Access
Act, edited in 2012); Provide public data to citizens.
Usage:
Data that supports democracy and
transparency; data used by application developers.
Quality:
Verified and clean data.
Size:
in general small to medium CSV files.
Type/format:
CSV, GeoJSON
Rate of change:
different rates of change depending
on the data source.
Potential audience:
application developers,
startups, government organizations.
Challenges:
Use of common vocabularies to facilitate data integration.
Provide structural metadata to help understanding and usage.
Automate the data publishing process to keep data up to date and
accurate.
Requires:
R-MetadataMachineRead
R-MetadataDocum
R-MetadataStandardized
R-QualityComparable
R-QualityCompleteness
R-VocabDocum
R-VocabOpen
and
R-VocabReference
2.20
Retrato
da Violência (Violence Map)
(Contributed by Yasodara)
URL:
This
is a data
visualization made in 2012 by
Vitor
Batista
Léo Tartari
and
Thiago Bueno
for a
W3C
Brazil Office challenge about data from Rio Grande do Sul (a Brazilian
state). The data was released in a .zip package; the original format
was .csv. The code and the documentation of the project are
available in its GitHub repository.
Elements:
Domains:
political information, regional security
information.
Obligation/motivation:
Data that must be provided to
the public under a legal obligation, the LAI or Brazilian
Information Access Act, edited in 2012.
Quality:
not guaranteed.
Type/format:
Tabular data.
Rate of change:
There are no new releases of the
data; this was a one-off.
Positive Aspects:
the decision to transform the
CSV into JSON was based on the need for hierarchical data.
The ability to map the CSV structure to XML or JSON was considered
a positive since JSON can cover more complex structures.
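The CSV-to-JSON transformation mentioned above amounts to grouping flat rows into a nested structure that JSON expresses naturally. The column names below are hypothetical, not those of the released dataset.

```python
# Sketch of turning flat CSV rows into hierarchical JSON, the step
# the project applied to the released data. Columns are invented.
import csv
import io
import json
from collections import defaultdict

FLAT = (
    "city,year,count\n"
    "Porto Alegre,2011,120\n"
    "Porto Alegre,2012,98\n"
    "Pelotas,2011,40\n"
)

def csv_to_nested_json(text):
    """Group rows by city, then year, producing a nested JSON object."""
    tree = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text)):
        tree[row["city"]][row["year"]] = int(row["count"])
    return json.dumps(tree, sort_keys=True)

print(csv_to_nested_json(FLAT))
```

The same grouping could target XML instead; JSON was preferred here because the visualization consumed it directly in the browser.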
Negative Aspects:
the data is already outdated (as of
2014), there is no provision for new releases, and there is no
associated metadata.
Requires:
R-AccessUpToDate
R-MetadataAvailable
R-MetadataStandardized
R-PersistentIdentification
R-QualityCompleteness
and
R-SensitiveSecurity
2.21
Share-PSI 2.0:
Uses of Open Data Within Government for Innovation and Efficiency
(Contributed by Phil Archer on behalf of
Share-PSI 2.0)
URL:
The
Share-PSI 2.0
Thematic Network
, co-funded by the European Commission, is
running a series of workshops throughout 2014 and 2015 examining
different aspects of how to share Public Sector Information (PSI).
This is in the context of the
revised
European Directive on the Public Sector Information
. The
network's focus is therefore slightly different from Data on the Web
Best Practices, as it covers a number of policy issues that are out of
scope for
W3C
, and only covers public sector information. However, the
overlap is substantial. There are more than 40 partners in the
Share-PSI 2.0 network from 25 countries including many government
departments as well as academics, consultants, citizens' organizations
and standards bodies involved directly with PSI provision.
The
report
from the first Share-PSI 2.0 workshop, held as part of the
Samos
Summit
30 June - 1 July 2014, summarizes the many papers and
discussions held at that event. From it, we can derive a long list of
requirements.
Elements
and
challenges
are not
included here as the report summarizes many use cases.
Requires:
R-AccessRealTime
R-AccessUpToDate
R-Citable
R-GeographicalContext
R-MetadataDocum
R-MetadataStandardized
R-ProvAvailable
R-QualityComparable
R-QualityOpinions
R-SensitivePrivacy
R-SensitiveSecurity
R-UsageFeedback
and
R-VocabReference
2.22
Tabulae - how to
get value out of data
(Contributed by Luis Polo)
URL:
Tabul.ae is a framework to publish and visually explore data. It can
be used to deploy powerful and easy-to-exploit open data platforms,
allowing organizations to unleash the potential of their data. The aim
is to enable data owners (public organizations) and consumers
(citizens and business reusers) to transform the information they
manage into added-value knowledge, empowering them to easily create
data-centric Web applications. These applications are built upon
interactive and powerful graphs, and take the shape of interactive
charts, dashboards, infographics and reports. Tabulae provides a high
degree of assistance to create these apps and also automates several
data visualization tasks (e.g. recognition of geographical entities to
automatically generate a map). In addition, the charts and maps are
portable outside the platform and can be smartly integrated with any
Web content, enhancing the reusability of the information.
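The automatic recognition of geographical entities mentioned above could, for example, rest on a simple column-inspection heuristic like the one below. The gazetteer, column names, and threshold are invented; Tabulae's actual mechanism is not described in this document.

```python
# Illustrative sketch of guessing which columns of a table hold
# geographical entities, the kind of inspection map assistance implies.
# The gazetteer and threshold are assumptions for this example.

GAZETTEER = {"asturias", "oviedo", "gijon", "madrid"}

def geographic_columns(header, rows, threshold=0.6):
    """Return header names whose values mostly match known place names."""
    result = []
    for i, name in enumerate(header):
        values = [row[i].strip().lower() for row in rows if row[i].strip()]
        if not values:
            continue
        hits = sum(1 for v in values if v in GAZETTEER)
        if hits / len(values) >= threshold:
            result.append(name)
    return result

header = ["municipality", "population"]
rows = [["Oviedo", "220000"], ["Gijon", "270000"], ["Madrid", "3200000"]]
print(geographic_columns(header, rows))  # ['municipality']
```

Once a column is classified as geographical, the platform can join it against map geometries and generate a map without user configuration.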
Elements:
Domains:
Quantitative and geographical information:
stats, biodiversity, socio-economic indicators, environment,
security, etc.
Obligation/motivation:
to help citizens and
companies (especially, consultancy firms) to understand and create
value from open data by means of reusable, user-made visualizations.
Usage:
Data used by citizens, public employees and
companies.
Quality:
The information must be at least
semi-structured (for instance, a spreadsheet).
Size:
Medium and large datasets (hundreds of
thousands to millions of rows).
Type/format:
Tabulae can manage relational
databases, GeoJSON, CSV files and spreadsheets, and provides an API
for programmatic access.
Rate of change:
depending on the original datasets.
The platform enables automatic update from original sources.
Data lifespan:
depending on the original datasets.
Potential audience:
Organizations that want to
publish their catalogue of datasets and aim to maximize their impact
and consumption.
Challenges:
Quality of data and metadata.
Inconsistency between different data sources.
Wide variety of formats and technologies.
Different data schemas that complicate the integration of data
sources.
Diversity and (sometimes) complexity of Licenses.
Data persistence.
Internationalization and format issues (e.g., languages, numbers,
dates, etc.)
Potential Requirements:
Dataset versioning and updating mechanisms.
Standardization of schemas.
Integration with other platforms/services.
Requires:
R-AccessUpToDate
R-FormatLocalize
R-FormatMachineRead
R-FormatMultiple
R-FormatStandardized
R-ProvAvailable
R-QualityComparable
R-QualityCompleteness
R-SensitiveSecurity
R-VocabReference
and
R-VocabVersion
2.23
UK Open
Research Data Forum
(Contributed by Phil Archer)
URL:
(PDF)
In 2013, the
Royal Society
led the formation of the
UK
Open Research Data Forum
. This effort is a national reflection
of a global trend towards the open publication of research data; see,
for instance, the work of the
Research
Data Alliance
DataCite
and
the US National Institutes of Health as
described
in a talk
by its Associate Director for Data Science, Philip
Bourne. Following a workshop in April 2014, the UK Open Research Data
Forum and US Committee on Coherence at Scale issued a
joint
statement
(PDF) of the principles of open research data.
The data that provide the evidence for the concepts in a published
paper or its equivalent, together with the relevant metadata and
computer code must be concurrently available for scrutiny and
consistent with the criteria of “intelligent openness”. The data
must be:
discoverable – readily found to exist by online search;
accessible – when discovered they can be interrogated;
intelligible – they can be understood;
assessable – e.g. the provenance and reliability of data;
reusable – they can be reused and re-combined with other
data.
The data generated by publicly or charitably funded research
that is not used as evidence for a published scientific concept
should also be made intelligently open after a pre-specified period
in which originators have exclusive access.
Those who reuse data but were not their originators must formally
acknowledge their originators.
The cost of creating intelligently open data from a research
project is an intrinsic part of the cost of research, and should not
be considered as an optional extra.
Although the default position for data generated by publicly or
charitably funded research should be one of “intelligent
openness”, there are justifiable limits to openness. These are where
commercial exploitation is in the public interest and the sectoral
business model requires limitations on openness; in preserving the
privacy of individuals whose personal information is contained in
databases; where data release would endanger safety (unintended
accidents) or security (deliberate attack). However, these instances
do not provide justification for blanket exceptions to the default
position for those researchers or research institutions whose role
is to disseminate their findings openly, and should be argued on a
case-by-case basis.
Existing processes, reward structures and norms of behavior that
inhibit or prevent data sharing or new forms of open collaboration
should, wherever possible, be reformed so that data sharing and
collaboration are encouraged, facilitated and rewarded.
At the time of writing, these principles are undergoing review and refinement
but the aims are clear. In the context of the Data on the Web Best
Practices Working Group, many requirements stem from this list.
Challenges:
Each principle listed here represents one or more challenges, with
points 1, 3 and 5 being particularly relevant to Data on the Web Best
Practices. Matters of policy and culture within any domain, whilst
certainly challenging, are out of scope for the current work.
Elements:
Domains:
Research data.
Obligation/motivation:
Cultural/professional
obligation.
Usage:
Data that supports the scientific method.
Quality:
Variable - often empirical, often messy.
Some of the data may not be repeatable.
Size:
Highly variable but it's noteworthy that
research data can be very large (e.g. genomics).
Type/format:
variable including some specialist
formats, XML dialects etc. but often CSV.
Rate of change:
Usually the data is static.
Data lifespan:
Publication often associated with
a journal publication that marks the end of the cycle.
Potential audience:
Research peers.
Requires:
R-AccessBulk
R-Citable
R-FormatMachineRead
R-FormatOpen
R-FormatStandardized
R-LicenseAvailable
R-MetadataAvailable
R-MetadataDocum
R-MetadataMachineRead
R-MetadataStandardized
R-PersistentIdentification
R-ProvAvailable
R-SensitivePrivacy
R-SensitiveSecurity
R-TrackDataUsage
and
R-UniqueIdentifier
2.24
Uruguay
Open Data Catalog
(Contributed by AGESIC)
URL:
Uruguay's open data portal
was
launched in December 2012 and at the time of writing holds 85 datasets
containing 114 resources. The open data initiative prioritizes the
“use of data” rather than the “quantity of data”, which is why the catalog
also promotes a number of applications using data resources in some
way (in common with many other data portals). It is important for the
project to keep a 1:3 ratio between applications and datasets. Most
of the resources are CSV and ESRI Shapefiles making this a catalog of
2 and 3 star resources according to the
Stars of Linked Open Data
scheme. AGESIC does not have
sufficient resources at government agencies to implement an open data
liberation strategy and go to the next level. So when asked
about opening data, the answer is to keep it simple, and CSV is by far
the easiest and smartest way to start. Uruguay has an access to public
information law but doesn't have legislation about open data. The open
data initiative is led by AGESIC with the support of an open data
working group drawn from multiple government agencies.
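The 2- and 3-star ratings mentioned above can be approximated by mapping file formats onto the 5-star Linked Open Data scheme. The mapping below is a simplification for illustration; in the real scheme the 4th and 5th stars depend on URI use and linking, not on the file extension alone.

```python
# Rough sketch of rating catalog resources against the 5-star
# Linked Open Data scheme. The format-to-star mapping is a
# deliberate simplification for illustration.

STARS = {
    "pdf": 1, "html": 1,             # on the Web under an open licence
    "xls": 2, "shp": 2,              # machine-readable, proprietary format
    "csv": 3, "xml": 3, "json": 3,   # non-proprietary format
    "rdf": 5, "ttl": 5,              # linked RDF (collapses 4 and 5 stars)
}

def star_rating(file_extension):
    """Return an approximate star level for a resource format."""
    return STARS.get(file_extension.lower().strip("."), 0)

print([star_rating(ext) for ext in ("csv", "shp", "pdf")])  # [3, 2, 1]
```

Under this mapping the catalog's CSV and ESRI Shapefile resources land at 3 and 2 stars respectively, matching the description above.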
Elements:
Domains:
Infrastructure: Most of the datasets are shapefiles
Transportation: Shapefiles and CSV, containing information
about public transportation (stops and frequency), roads,
accidents, etc.
Tourism: data about regional events, cultural agenda, hotels,
camp sites, statistics
Economics: budget, consumer price declarations, etc.
Social development
Environment
Health
Education
Culture
Obligation/motivation:
There is no obligation for
the government agencies to publish open data. All initiatives were
carried on by agencies that want to support the initiative.
Usage:
Develop applications and new services for
citizens, agencies interoperability (exchange of information in open
data formats), transparency.
Quality:
Most of the data is released properly, with
complete or near-complete metadata.
Size:
Small; most of the datasets are less than 1 GB.
Type/format:
At the time of writing: ESRI Shapefile
(35), CSV (26), TXT (19), ZIP (12), HTML (7), XLS (6), PDF (4), XML
(3), RAR (2)
Rate of change:
Depends on the dataset.
Data lifespan:
Depends on the dataset: some change
in real time, others monthly, every six months, or annually, while
some are static.
Potential audience:
Developers, journalists, civil
society, entrepreneurs.
Challenges:
Consolidation of tools to manage
datasets, improve visualizations and transform resources to higher
level (4 – 5 stars). Automated publication process using harvesting or
similar tools. Alerts or control panels to keep data updated.
Requires:
R-DataMissingIncomplete
R-AccessLevel
R-VocabReference
and
R-TrackDataUsage
2.25
Web
Observatory
Contributors:
Adriano C.
Machado Pereira, Adriano Veloso, Gisele Pappa, Wagner Meira Jr.
City/country:
Belo Horizonte, Brazil
URL:
Overview:
There are almost 65 million Brazilians
connected to the Internet, 36% of the Brazilian population,
according to Comitê Gestor da Internet no Brasil. As a consequence,
events such as the Brazilian elections have become popular
topics on the Web, mainly in Online Social Networks. Our goal is to
understand this new reality and present new ways to watch facts,
events and entities on the fly using the Web and user-generated
content available in Online Social Networks and Blogs. The Web
Observatory is a research project part of the Instituto Nacional de
Ciência e Tecnologia para a Web (INWEB), sponsored by CNPq and
Fapemig. There are over 30 experts involved in the project, from four
different federal universities: Universidade Federal de Minas Gerais
(UFMG), Centro Federal de Educação Tecnológica de Minas Gerais
(CEFET-MG), Universidade Federal do Amazonas (UFAM) and Universidade
Federal do Rio Grande do Sul (UFRGS). The INWEB researchers use a set
of new techniques related to information recovery, data mining and
data visualization to understand and summarize what the media and
users are talking about on the Web. This provides the fundamental
basis for an evaluation of the impact of the Olympic Campaigns and how
users react to news and discussions. One new feature in this project
is the possibility to see the propagation of the Tweets.
Elements:
Domains:
Different contexts or domains, related
to data from the Web. For example: Health (for example, diseases);
Tourism; Sports (for example, soccer championship and Olympic
games); Politics; Finance; etc.
Obligation/motivation:
Data must be obtained from
different public data sources from the Web.
Usage:
Provide different data analyses,
indicators or visualizations to allow a better understanding of a
context.
Quality:
Variable; depends on the data source; data can
be structured or not.
Size:
Variable, from small data instances to a
huge amount of data, depending on the context under investigation.
In general, the amount of data is huge.
Type/format:
Diverse, like CSV, HTML, JSON, XML,
etc.
Rate of change:
Different rates of change,
usually very dynamic.
Data lifespan:
n/a
Potential audience:
Diverse, different Web users.
Challenges:
Data volume;
Data velocity;
Data variety;
Data value;
Complexity.
Requires:
R-DataEnrichment
R-GranularityLevels
R-MetadataDocum
R-MetadataMachineRead
R-MetadataStandardized
R-ProvAvailable
R-VocabDocum
R-VocabOpen
and
R-VocabReference
2.26
Wind
Characterization Scientific Study
(Contributed by Eric Stephan)
This use case describes a data management facility being constructed
to support scientific offshore wind energy research for the U.S.
Department of Energy’s Office of Energy Efficiency and Renewable
Energy (EERE) Wind and Water Power Program. The Reference Facility for
Renewable Energy (RFORE) project is responsible for collecting wind
characterization data from remote sensing and in-situ instruments
located on an offshore platform. This raw data is collected by the
Data Management Facility (DMF) and processed into a standardized NetCDF
format. Both the raw measurements and processed data are archived in
the
PNNL
Institutional Computing (PIC) petascale computing facility. The DMF
will record all processing history, quality assurance work, problem
reporting, and maintenance activities for both instrumentation and
data. All datasets, instrumentation, and activities are cataloged
providing a seamless knowledge representation of the scientific study.
The DMF catalog relies on linked open vocabularies and domain
vocabularies to make the study data searchable. Scientists will be
able to use the catalog for faceted browsing, ad-hoc searches, and query
by example. For accessing individual datasets a REST GET interface to
the archive will be provided.
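A per-dataset REST GET request might look like the URL-building sketch below. The base URL, path layout, and query parameters are assumptions for illustration, not the DMF's actual interface.

```python
# Hypothetical sketch of the per-dataset REST GET access described
# above. The endpoint layout and parameters are invented.
from urllib.parse import urlencode

def dataset_request_url(base, instrument, date, fmt="netcdf"):
    """Build a GET URL for one archived dataset."""
    query = urlencode({"date": date, "format": fmt})
    return f"{base}/datasets/{instrument}?{query}"

url = dataset_request_url("https://dmf.example.org", "lidar01", "2014-09-19")
print(url)
```

Bulk retrieval of many such datasets is where the non-HTTP protocols listed under Challenges (SFTP, rsync, SCP) come in.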
Challenges:
For accessing numerous datasets, scientists will access the
archive directly using other protocols such as SFTP, rsync, and SCP,
and access techniques such as
HPN-SSH
Requires:
R-AccessRealTime
R-FormatStandardized
R-VocabOpen
and
R-VocabReference
3.
General Challenges
The use cases presented in the previous section illustrate a number of
challenges faced by data publishers and data consumers. These challenges
show that some guidance is required on specific areas and therefore best
practices should be provided. According to the challenges, a set of
requirements were defined in such a way that a requirement motivates the
creation of one or more best practices. Challenges related to Data
Quality and Data Usage motivated the definition of specific requirements
for the Quality and Granularity Description Vocabulary and the Data
Usage Vocabulary.
3.1
A Word on Open and
Closed Data
The Open Knowledge Foundation
defines
open data
most succinctly as
data that can be freely used,
modified, and shared by anyone for any purpose
. Data on the Web
may be open but Web technologies are equally applicable to data that
is not open, or to scenarios where open and closed data are combined.
There are a number of areas where data may be on the Web but not open.
Behind The Firewall
Closed data may be generated in an organization that then blocks
general access using a firewall or other access control system.
Generated data may have links to other "open" data hosted
elsewhere and it may be represented using open Web standards but
this cannot be considered "open data."
Proprietary data by policy
Data can be closed through the policies of the data publisher and
data provider. Business-sensitive data that is not made accessible
to the rest of the world is an example of closed data. Data controlled
by law or government policies are further examples of closed data
e.g. national security data, law enforcement, health care etc.
Lifecycle state of data
There is often a period between the generation of data and its
publication as open data and data in this state should be
considered as "closed." The data may remain in a closed state for
an indefinite period of time while it is validated and analyzed,
and insights and discoveries are published. It may also remain
closed because the data publisher prefers to maximize their
advantage gained by availability of data before they publish it
openly. This is currently common practice in scientific research.
Non HTTP protocol based data
Historically data has been exposed using various non-HTTP IETF
protocol based end points including, but not limited to FTP, SFTP,
SCP, and Rsync. While these protocols are considered "open," their
interoperability with the HTTP-based Web is currently a
limiting factor. From an open data perspective, data only
available using these non-HTTP protocols should be
considered closed data and, by definition, is not on the Web.
It follows that data accessible only by private or
application-specific proprietary access protocol endpoints is
also deemed both closed data and out of scope for data on the
Web.
In the following section we summarize the
requirements derived from all the use cases, grouped according to
theme. Closed data cuts across those themes (it's all data on the Web)
but it's worth highlighting
R-AccessLevel
R-DataMissingIncomplete
R-DataLifecyclePrivacy
and
R-SensitiveSecurity
as being
of particular relevance to closed data.
3.2
Requirements by
Challenge
The table below groups the requirements derived from the use cases
according to the challenges faced by producers and users of data on
the Web.
Challenge
Requirements
Data Access
Requirements for Data Access
R-AccessBulk
R-AccessLevel
R-AccessRealTime
R-AccessUpToDate
R-APIDocumented
Data Enrichment
Requirements for Data Enrichment
R-DataEnrichment
Data Formats
Requirements for Data Formats
R-FormatLocalize
R-FormatMachineRead
R-FormatMultiple
R-FormatStandardized
R-FormatOpen
Data Granularity
Requirements for Data
Granularity
R-GranularityLevels
Data Identification
Requirements for Data
Identification
R-UniqueIdentifier
Data Quality
Requirements for Data Quality
R-DataMissingIncomplete
R-QualityComparable
R-QualityCompleteness
R-QualityMetrics
R-QualityOpinions
Data Selection
Requirements for Data
Selection
R-DataIrreproducibility
R-DataLifecyclePrivacy
R-DataLifecycleStage
Data Usage
Requirements for Data Usage
R-TrackDataUsage
R-UsageFeedback
R-Citable
Data Vocabularies
Requirements for Data
Vocabularies
R-VocabDocum
R-VocabOpen
R-VocabReference
R-VocabVersion
Licenses
Requirements for Licenses
R-LicenseAvailable
R-LicenseLiability
Metadata
Requirements for Metadata
R-DataProductionContext
R-GeographicalContext
R-MetadataAvailable
R-MetadataDocum
R-MetadataMachineRead
R-MetadataStandardized
R-SLAAvailable
Preservation
Requirements for
Preservation
R-PersistentIdentification
Provenance
Requirements for Provenance
R-ProvAvailable
R-DataVersion
Sensitive Data
Requirements for Sensitive Data
R-SensitivePrivacy
R-SensitiveSecurity
4.
Requirements
4.1
Requirements for Data on the
Web Best Practices
4.1.1
Requirements
for Data Access
R-AccessBulk
Data should be available for bulk download
Motivation:
BuildingEye
LandPortal
LusTRE
OKFNTransport
and
UKOpenResearchForum
R-AccessLevel
The access level of the data should be
provided along with conditions of access, for example, open,
restricted, or closed.
Motivation:
Bio2RDF
DadosGovBr
DigitalArchiving
DutchBaseReg
OKFNTransport
RDESC
and
UruguayOpenData
R-AccessRealTime
Where data is produced in real-time,
it should be available on the Web in real-time
Motivation:
BuildingEye
LandPortal
MSI
OKFNTransport
OEFS
RDESC
SharePSI
Share-PSI Gijon
and
WindCharacterization
R-AccessUpToDate
Data should be available in an
up-to-date manner and the update cycle made explicit
Motivation:
ASO
Bio2RDF
GS1Digital
ISO GEO Story
OKFNTransport
RetratoDaViolencia
SharePSI
Share-PSI Emergency Response
and
Tabulae
R-APIDocumented
If the data is available via an API, the API should be documented.
Motivation:
MSI
and
OKFNTransport
4.1.2
Requirements
for Data Enrichment
R-DataEnrichment
It should be possible to perform data enrichment tasks in order to add value to the data, thereby providing more value for user applications and services.
Motivation:
ISO GEO Story
LandPortal
LusTRE
MSI
and
WebObservatory
4.1.3
Requirements
for Data Formats
R-FormatLocalize
Information about locale parameters
(date and number formats, language) should be made available
Motivation:
ISO GEO Story
LandPortal
OKFNTransport
and
Tabulae
R-FormatMachineRead
Data should be available in a
machine-readable format that is adequate for its intended or
potential use
Motivation:
ASO
BuildingEye
ISO GEO Story
LandPortal
LusTRE
MSI
OKFNTransport
OpenCityDataPipeline
OEFS
Tabulae
and
UKOpenResearchForum
R-FormatMultiple
Data should be available in multiple formats
Motivation:
BBC
Bio2RDF
DutchBaseReg
GS1Digital
LandPortal
LATimes
LusTRE
OEFS
and
Tabulae
R-FormatOpen
Data should be available in an open
format
Motivation:
BuildingEye
LATimes
LusTRE
MSI
OKFNTransport
OpenCityDataPipeline
and
UKOpenResearchForum
R-FormatStandardized
Data should be available in a
standardized format. Through standardization, interoperability
is also expected.
Motivation:
Bio2RDF
BuildingEye
DadosGovBr
GS1Digital
LandPortal
LusTRE
MSI
OpenCityDataPipeline
OEFS
Tabulae
UKOpenResearchForum
and
WindCharacterization
4.1.4
Requirements
for Data Identification
R-UniqueIdentifier
Each data resource should be
associated with a unique identifier
Motivation:
DigitalArchiving
DutchBaseReg
LandPortal
LATimes
LusTRE
RDESC
and
UKOpenResearchForum
4.1.5
Requirements
for Data Selection
R-DataIrreproducibility
Data should be designated if it is irreproducible.
Motivation:
ASO
and
OEFS
R-DataLifecyclePrivacy
Preliminary steps in the data
lifecycle should not infringe upon individuals' intellectual
property rights
Motivation:
Bio2RDF
BuildingEye
DadosGovBr
and
RDESC
R-DataLifecycleStage
Data should be identified by a designated lifecycle stage
Motivation:
DadosGovBr
and
OEFS
4.1.6
Requirements
for Data Vocabularies
R-VocabDocum
Vocabularies should be clearly documented
Motivation:
ASO
BuildingEye
LandPortal
LusTRE
OpenCityDataPipeline
RecifeOpenData
and
WebObservatory
R-VocabOpen
Vocabularies should be shared in an open way
Motivation:
LandPortal
LusTRE
OKFNTransport
OpenCityDataPipeline
OpenExperimentalFieldStudies
RDESC
RecifeOpenData
WebObservatory
and
WindCharacterization
R-VocabReference
Existing reference vocabularies should be reused where possible
Motivation:
Bio2RDF
BuildingEye
DadosGovBr
DigitalArchiving
DutchBaseReg
ISO GEO Story
LandPortal
LusTRE
OpenCityDataPipeline
OpenExperimentalFieldStudies
RDESC
RecifeOpenData
SharePSI
Tabulae
UruguayOpenData
WebObservatory
and
WindCharacterization
Issue 1
This requirement needs further discussion wrt.
Issue-48
as this seems to cover code lists/enumerated values but not explicitly.
R-VocabVersion
Vocabularies should include versioning information
Motivation:
BBC
DadosGovBr
LandPortal
LusTRE
and
Tabulae
4.1.7
Requirements
for Industry Reuse
Note:
SLA
s are a form of metadata and so inherit metadata requirements
R-SLAAvailable
Service Level Agreements (SLAs) for
industry reuse of the data should be available on request
(via a defined contact point).
Motivation:
OKFNTransport
and
RDESC
4.1.8
Requirements
for Licenses
Note: Licenses are a form of metadata and so inherit metadata requirements.
R-LicenseAvailable
Data should be associated with a license.
Motivation:
BuildingEye
DadosGovBr
ISO GEO Story
LATimes
LusTRE
OKFNTransport
OpenCityDataPipeline
and
UKOpenResearchForum
R-LicenseLiability
Liability terms associated with usage
of Data on the Web should be clearly outlined
Motivation:
ASO
GS1Digital
and
OKFNTransport
4.1.9
Requirements
for Metadata
R-DataProductionContext
Production context
information should be associated with data if relevant, e.g.
service/process descriptions. Production context is a type
of metadata, so all metadata requirements also apply here.
Motivation:
BuildingEye
LATimes
OKFNTransport
OpenCityDataPipeline
and
OpenExperimentalFieldStudies
R-GeographicalContext
GeographicalContext (countries,
regions, cities, etc.) must be referred to consistently.
GeographicalContext is a type of metadata, so all metadata
requirements also apply here.
Motivation:
ASO
BuildingEye
LandPortal
LATimes
OKFNTransport
OpenCityDataPipeline
SharePSI (Share-PSI Location and Share-PSI Emergency Response).
R-MetadataAvailable
Metadata should be available
Motivation:
ASO
BuildingEye
DadosGovBr
LandPortal
LATimes
LusTRE
MSI
OKFNTransport
OpenCityDataPipeline
RetratoDaViolencia
and
UKOpenResearchForum
R-MetadataDocum
Metadata vocabulary, or values if
vocabulary is not standardized, should be well-documented
Motivation:
BBC
BuildingEye
DadosGovBr
LusTRE
MSI
RecifeOpenData
SharePSI (Share-PSI Austria),
UKOpenResearchForum
and
WebObservatory
R-MetadataMachineRead
Metadata should be machine-readable
Motivation:
BBC
ISO GEO Story
LandPortal
LATimes
LusTRE
MSI
RecifeOpenData
UKOpenResearchForum
and
WebObservatory
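To illustrate what machine-readable metadata can look like in practice, the sketch below builds a minimal dataset description using the DCAT and Dublin Core vocabularies and serializes it as JSON-LD. The dataset itself and all its values are invented for the example; only the vocabulary namespaces are real.

```python
import json

# Illustrative dataset description; identifiers and values are invented.
metadata = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://data.example.org/id/dataset/air-quality",
    "@type": "dcat:Dataset",
    "dct:title": "Air quality measurements",
    "dct:issued": "2015-01-15",
    "dct:license": {"@id": "http://creativecommons.org/licenses/by/4.0/"},
    "dcat:keyword": ["air quality", "environment"],
}

# JSON-LD serialization is readable by both humans and machines.
print(json.dumps(metadata, indent=2))
```

Because the description reuses standardized terms, a consumer can process it without prior agreement with the publisher, which is the point of the requirement.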
R-MetadataStandardized
Metadata should be
standardized; standardization also promotes interoperability.
Motivation:
BBC
ISO GEO Story
LandPortal
LATimes
LusTRE
OpenCityDataPipeline
RecifeOpenData
RetratoDaViolencia
SharePSI (Share-PSI Federation),
UKOpenResearchForum
and
WebObservatory
4.1.10
Requirements
for Preservation
R-PersistentIdentification
An identifier for a particular resource
should be resolvable on the Web and associated for the
foreseeable future with a single resource or with information
about why the resource is no longer available.
Motivation:
Bio2RDF
DigitalArchiving
DutchBaseReg
GS1Digital
ISO GEO Story
LandPortal
LusTRE
RDESC
RetratoDaViolencia
and
UKOpenResearchForum
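One way to read this requirement is that a resolver must keep a record for every identifier it has ever minted: either the resource's current location, or a "tombstone" explaining why the resource is gone. The sketch below models that behaviour with an in-memory table; all identifiers, locations, and reasons are illustrative assumptions.

```python
# Illustrative resolver table: every identifier ever minted keeps an entry,
# either pointing at the current location or recording why it is gone.
registry = {
    "http://data.example.org/id/doc/42": {
        "status": "active",
        "location": "http://files.example.org/doc42.csv",
    },
    "http://data.example.org/id/doc/17": {
        "status": "gone",
        "reason": "withdrawn 2014-06-01 for legal review",
    },
}

def resolve(identifier):
    """Resolve an identifier to its location, or to an explanation."""
    entry = registry.get(identifier)
    if entry is None:
        return ("unknown", None)
    if entry["status"] == "active":
        return ("redirect", entry["location"])
    return ("gone", entry["reason"])

print(resolve("http://data.example.org/id/doc/42"))
print(resolve("http://data.example.org/id/doc/17"))
```

On the Web the same behaviour would typically be delivered with HTTP redirects for active resources and a 410 Gone response carrying the explanation for withdrawn ones.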
4.1.11
Requirements
for Provenance
Note: Provenance data is a form of metadata and so inherits
metadata requirements.
R-DataVersion
If different versions of data exist, data
versioning should be provided.
Motivation:
BBC
LandPortal
and
LusTRE
R-ProvAvailable
Data provenance information should be
available. Provenance data is a type of metadata, so all
metadata requirements also apply here.
Motivation:
ASO
DadosGovBr
GS1Digital
ISO GEO Story
LandPortal
LusTRE
RDESC
SharePSI (Share-PSI 270a),
Tabulae
UKOpenResearchForum
and
WebObservatory
4.1.12
Requirements
for Sensitive Data
R-SensitivePrivacy
Data should not infringe a
person's right to privacy
Motivation:
BuildingEye
DutchBaseReg
SharePSI (Share-PSI T Lights and Share-PSI Snap), and
UKOpenResearchForum.
R-SensitiveSecurity
Data should not infringe an
organization's security (local government, national
government, business)
Motivation:
BuildingEye
MSI
RDESC
RetratoDaViolencia
SharePSI (Share-PSI Emergency Response),
Tabulae
and
UKOpenResearchForum
4.2
Requirements for Quality and
Granularity Description Vocabulary
4.2.1
Requirements for Data Quality
R-DataMissingIncomplete
Publishers should indicate if
data is partially missing or if the dataset is incomplete
Motivation:
ASO
BuildingEye
DadosGovBr
LATimes
OKFNTransport
OpenCityDataPipeline
RDESC
and
UruguayOpenData
R-QualityComparable
Data should be comparable with other datasets
Motivation:
BuildingEye
LusTRE
OKFNTransport
OpenCityDataPipeline
RecifeOpenData
SharePSI (Share-PSI Emergency Response), and
Tabulae
R-QualityCompleteness
Data should be complete
Motivation:
ASO
BuildingEye
LandPortal
LusTRE
OKFNTransport
OpenCityDataPipeline
RecifeOpenData
RetratoDaViolencia
and
Tabulae
R-QualityMetrics
Data should be associated with a set
of documented, objective and, if available, standardized
quality metrics. This set of quality metrics may include
user-defined or domain-specific metrics.
Motivation:
ASO
LandPortal
LATimes
LusTRE
and
OKFNTransport
R-QualityOpinions
Subjective quality opinions on the data should be supported
Motivation:
DadosGovBr
LusTRE
SharePSI (Share-PSI Feedback, Share-PSI Feedback 2 and Share-PSI France).
4.2.2
Requirements
for Data Granularity
R-GranularityLevels
Data available at different
levels of granularity should be accessible and modelled in a
common way
Motivation:
ASO
ISO GEO Story
and
LandPortal
4.3
Requirements for Data Usage
Description Vocabulary
4.3.1
Requirements
for Data Usage
R-Citable
It should be possible to cite data on the Web
Motivation:
ASO
GS1Digital
LATimes
LusTRE
RDESC
SharePSI (Share-PSI Emergency Albania), and
UKOpenResearchForum
R-TrackDataUsage
It should be possible to track the usage of data
Motivation:
ASO
LandPortal
LusTRE
OEFS
RDESC
and
UKOpenResearchForum
R-UsageFeedback
Data consumers should have a way of sharing feedback and rating data.
Motivation:
ASO
DadosGovBr
LusTRE
OKFNTransport
OEFS
SharePSI (Share-PSI Feedback, Share-PSI Feedback 2 and Share-PSI France).
5.
Reading Material
5.1
General
Resources
Government Linked Data (GLD) Glossary of Terms
Open Knowledge Foundation (OKFN) Linked Open Vocabulary Browser
Best Practices for Publishing Linked Data
10 Rules for Persistent URIs
Linked Data Platform Working Group
The
PRELIDA
and
Diachron
projects are concerned with preserving
LOD.
5.2
Relevant
Vocabularies
The Organization Ontology (ORG)
Data Catalog Vocabulary (DCAT)
The
RDF
Data Cube Vocabulary (QB)
The Provenance (PROV) Ontology
Simple
Knowledge Organization System Reference (SKOS)
5.3
Communities
of Interest
W3C
Data Activity
W3C
Comma Separated
Values (CSV) On the Web Working Group
CSV
On the Web Use Cases
W3C
Government Linked
Data Working Group
(This WG is now closed but in some respects
is the forerunner of the
DWBP.)
W3C
Privacy on the
Web (PING) Working Group
A.
Acknowledgements
The editors wish to thank all those who have contributed use cases or
commented on those provided by others.
B.
Change history
Changes since the
previous version
include a re-ordering
of the use cases as well as the following:
R-IndustryReuse, R-PotentialRevenue, R-Archive, R-SynchronizedData
and R-CoreRegister removed as deemed out of scope for a technical
requirements document.
New use cases added
Removed use-case: Documented Support and Release of Data (too generic)
Removed use-case: Feedback Loop for Corrections (too generic)
Removed use-case: Datasets required for Natural Hazards Management (too generic)
Removed use-case: Tracking of Data Usage (too generic)
Removed use-case: Machine-readability of SLAs (too generic)
Removed use-case: of Data via APIs (too generic)
Renamed req R-DataUnavailabilityReference as R-AccessLevel
Added req R-DataVersion
Renamed req R-Designatedthingsserviceproviders as R-DataProductionContext
Renamed req R-Location as R-GeographicalContext
Removed req R-IncorporateFeedback, rephrased description of R-UsageFeedback to ‘It should be possible to provide feedback on and/or rate the data.’
Removed req R-MultipleRepresentations (duplicate of R-FormatMultiple)
US