RDF Spaces and Datasets
This specification introduces the notion of RDF spaces—places to store RDF triples—and defines a set of mechanisms for expressing and manipulating information about them. Examples of RDF spaces include: an HTML page with embedded RDFa or microdata, a file containing RDF/XML or Turtle data, and a SQL database viewable as RDF using R2RML. RDF spaces are a generalization of SPARQL's named graphs, providing a standard model with formal semantics for systems which manage multiple collections of RDF data.
Editor's Draft Status
Closing in on FPWD IMHO, but not there yet. The "@@@" flags mark the places where I'm pretty sure something is needed before FPWD.
This text might be re-factored into the other RDF documents. The Use Cases and Example would probably end up in a WG Note.
Introduction
The Resource Description Framework (RDF) provides a simple declarative way to store and transmit information. It also provides a trivial but effective way to combine information from multiple sources, with graph merging. This allows information from different people, different organizations, different units within an organization, different servers, different algorithms, etc., to all be combined and used together, without any special processing or understanding of the relationships among the providers.
For some applications, the basic RDF merge operation is overly simplistic, as extra processing and an understanding of the relationships among the providers may be useful. This document specifies a way to conveniently handle information coming from multiple sources, by modeling each one as a separate space, and using RDF to express information about these spaces. In addition to this important concept, we provide a pair of languages—extensions to existing RDF syntaxes—which can be used to store or transmit in one document the contents of multiple spaces as well as information about them. This approach allows a variety of use cases (immediately below) to be addressed in a straightforward manner, as shown in
Use Cases
Each of these use cases is initially described in terms of the following scenario. Details of how each use case might be addressed using the technologies specified in this document are in
The Example Foundation is a large organization with more than ten thousand employees and volunteers, spread out over five continents. It has branches in 25 different countries, and those divisions have considerable autonomy; they are only loosely controlled by the parent organization (called "headquarters" or "HQ") in Geneva.
HQ wants to help the divisions work together better. It decides a first step is to provide a simple but complete directory of all the Example personnel. Until now, each division has maintained its own directory, using its own technology. HQ wants to gather them all together, building a federated phonebook. They want to be able to find someone's phone number, mailing address, and job title, knowing only their name or email address. Later, they hope to extend the system to allow finding people based on their areas of interest and expertise.
HQ understands that people will want access to the phonebook in many different computing environments and with different languages, social norms, and application styles. Users are going to want at least one Web-based user interface (UI), but they will also want mobile UIs for different platforms, desktop UIs for different platforms, and even to look up information via text messaging. HQ does not have the resources to build all of these, so they intend to provide direct access to the data so that the divisions can do it themselves as needed.
Each of the sections below, after the first, contains a new requirement, something additional that users in this scenario want the system to do. Each of these will motivate the features of the technologies specified in the rest of this document.
Baseline Solution (Just Triples)
As a starting point, HQ needs to gather data from each division and re-publish it, in one place, for use by the different UIs.
This is a general use case for RDF, with no specific need for using spaces or datasets. It simply involves divisions publishing RDF data on the web (with some common vocabulary and with access control), then HQ merging it and putting it on their website (with access control).
For an example of how this baseline could be implemented, see
Showing Provenance
A user says: I'm looking at an incorrect phonebook entry. It
has the name of the person I'm looking for, but it's missing
most of the record. I can't even tell which division the person
works for. I need to know who is responsible for this
information, so I can get it corrected.
While this might be addressed by including a "report-errors-to" field in each phonebook entry, HQ is looking ahead to the day when other information is in the phonebook — like which projects the person has worked on — which might come from a variety of other sources, possibly other divisions.
For a discussion of how this use case could be addressed, see
Maintaining Derived Data
It turns out different divisions are using somewhat different
vocabularies for publishing their data. HQ writes a program to
translate, but they need the output of that program to be
correctly attributed, in case it turns out to be wrong.
This use case motivates sharing of blank nodes between named
graphs, as seen in the example.
For a discussion of how this use case could be addressed, see
Distributed Harvesting
It turns out some divisions do not have centralized phonebooks. Division 3 has twelve different departments, each with its own phonebook. Division 3 can do the harvesting from its departments, but it does not want to be in the loop for corrections; it wants those to go straight back to the relevant department.
For a discussion of how this use case could be addressed, see
Loading Untrusted Datasets
A user reports: There's information here that says it's from
our department, but it's not. Somehow your provenance
information is wrong. We need to see the provenance of the
provenance!
For a discussion of how this use case could be addressed, see
Showing Revision History
Division 14's legal department says: "We're doing an investigation and we need to be able to connect people's names and phone numbers as they used to be. Can you include archival data in the data feed, so we can search the phonebook as it was on each day of September, last year?"
For a discussion of how this use case could be addressed, see
Expressing Past or Future States
Division 5 says: "We're planning a major move in three months, to a neighboring city. Everybody's office and phone number will have to change. Can we start putting that information in the phonebook now, but mark it as not effective until 20 July? After the move, we'll also need to see the old (no-longer-in-effect) data for a while, until we get everything straightened out."
This use case, contrasted with the previous one, shows the difference between Transaction Time and Valid Time in bitemporal databases. After Division 5's move, the "old" phone numbers are not just the old state of the database; they reflect the old state of the world. It is possible that some time after the move, an error in the pre-move data might need to be corrected. This would require a new transaction time, even though the valid time has already ended.
Use case sightings:
Temporal Scope for RDF Triples, Jeni Tennison's report of attempting to solve this problem in UK Government data.
Vocab terms for owner, validFrom and validUntil, Manu Sporny reports PaySwarm wants to record ownership information for particular time ranges.
For a discussion of how this use case could be addressed, see
Vendor-Neutral SPARQL Backup
@@@ we want to be able to dump the database and load it in a different system
@@@ This doesn't seem to belong here. Maybe we have Federated Phonebook use cases, and *other* ones, too?
Concepts
Space
The term "space" might change. The final
terminology has not yet been selected by the Working Group. Other
candidates include "g-box", "data space", "graph space", "(data)
surface", "(data) layer", "sheet", and "(data) page".
An RDF space is anything that can reasonably be said to explicitly contain zero or more RDF triples and has an identity distinct from the triples it contains. Examples include:
a human-readable Web page, such as an HTML page containing RDFa markup, microdata markup, or embedded Turtle.
a file, in a computer's filesystem, containing RDF data
expressed in RDF/XML, N-Triples, Turtle, etc.
a machine-readable Web page containing RDF data expressed in
RDF/XML, N-Triples, Turtle, etc.
a SQL database which provides an RDF view of its data,
perhaps using R2RML
the default graph or any of the named graphs available via a
SPARQL endpoint
Examples of things that are not spaces:
Natural language text. While it might be possible to extract some of the meaning of the text and express that meaning in RDF triples, those triples are not explicit and in practice might vary from one extractor to the next.
RDF Graphs. Since they are just mathematical sets of RDF
triples, they have no distinct identity apart from their
contents. For example, if two systems have in memory the RDF
graph { }, any metadata about the graph in
one system logically applies to the graph in the other system,
since technically it is the same graph. (If this seems
counter-intuitive, you may be among the many who have been using
the term "graph" to refer to what we now call a space. It may
help to think about a "graph space" (a place to put a graph) and
a "graph state" (the state of that space). That "graph state"
is what the existing specifications call an "RDF Graph").
Web pages containing embedded RDF but which do not contain a
well-defined set of triples at any given point in time. For
example: a Web Service which returns RDF data including the
client's IP address, or a site which customizes the data
presented based on client login cookies. Such resources might
be called "hyperspaces".
Quad and Quadset
We define an RDF quad as the 4-tuple (subject, predicate, object, space). Informally, a quad should be understood as a statement that the RDF triple (subject, predicate, object) is in the space space.
We define an RDF quadset as a set containing (zero or more) RDF Quads and (zero or more) RDF Triples. A quadset is thus an extension of the concept of an RDF Graph (a set containing zero or more RDF triples) to also potentially include statements about triples being in particular spaces.
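Under these definitions, a quadset can be modeled as a single set mixing 3-tuples (triples) and 4-tuples (quads). A minimal sketch; the tuple representation and the triples_in helper are illustrative, not part of this specification:

```python
# A quadset: a set containing zero or more triples (3-tuples) and zero
# or more quads (4-tuples). The quad (s, p, o, sp) states that the
# triple (s, p, o) is in the space sp.
quadset = {
    ("ex:a", "ex:b", "ex:c"),                  # a plain triple
    ("ex:a", "ex:b", "ex:c", "ex:space1"),     # the same triple, in space1
    ("ex:a", "ex:b", "ex:c", "ex:space2"),     # ... and also in space2
}

def triples_in(quadset, space):
    """The triples which a quadset asserts to be in a given space."""
    return {stmt[:3] for stmt in quadset
            if len(stmt) == 4 and stmt[3] == space}

assert triples_in(quadset, "ex:space1") == {("ex:a", "ex:b", "ex:c")}
```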
Dataset
A dataset is defined by SPARQL 1.1 as a structure consisting of:
A distinguished RDF Graph called the default graph
A set of (name, graph) pairs, where name is an IRI and graph is an RDF Graph. No two pairs in a dataset may have the same name.
This definition forms the basis of the SPARQL Query semantics;
each query is performed against the information in a specific
dataset.
Although the term is sometimes used more loosely, a dataset is
a pure mathematical structure, like an RDF Graph or a set of
integers, with no identity apart from its contents. Two datasets
with the same contents are in fact the same dataset, and one
dataset cannot change over time.
The word "default" in the term "default graph" refers to the fact that in SPARQL, this is the graph a server uses to perform a query when the client does not specify which graph to use. The term is not related to the idea of a graph containing default (overridable) information. The role and purpose of the default graph in a dataset vary with the application.
Named Graph
SPARQL formally defines a named graph, following [Carroll], to be any of the (name, graph) pairs in a dataset.
In practice, the term is often used to refer to the graph part of those pairs. This is the usage we follow in this document, saying that a graph is a named graph in some dataset if and only if it appears as the graph part of a (name, graph) pair in that dataset. Note that "named graph" is a relation, not a class: we say that something is a named graph of a dataset, not simply that it is a named graph.
The term is also sometimes used to refer to the slot part of the (name, slot) pairs in a graph store. For example, the text of SPARQL 1.1 Update says, "This example copies triples from one named graph to another named graph". For clarity, we avoid calling these "named graphs" and instead call them "named slots" of the graph store.
Quadset/Dataset Relationship
A quad-equivalent dataset is a dataset with no empty named graphs. A non-quad-equivalent dataset is a dataset in which one or more of its named graphs is empty. Every non-quad-equivalent dataset has a corresponding quad-equivalent dataset formed by removing the (name, graph) pairs where the graph is empty.
Quadsets and quad-equivalent datasets are isomorphic, and given identical declarative semantics in
. The isomorphism is:
the triples in the quadset correspond to the triples in the default graph of the dataset;
each quad corresponds to a triple in a named graph: the quad (S P O Sp) corresponds to the triple (S P O) in the graph paired with the name Sp.
The phrasing quads in a dataset is thus shorthand for: quads in some quadset which is isomorphic to a given dataset. If the dataset is a non-quad-equivalent dataset, then the isomorphism is to the dataset produced by removing all its empty named graphs.
In order to promote interoperability and flexibility in implementation techniques — to allow datasets and quadsets to be used interchangeably — systems which handle datasets SHOULD NOT give significance to empty named graphs.
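The isomorphism can be sketched in code, modeling a quadset as a set of 3- and 4-tuples and a dataset as a (default graph, name-to-graph mapping) pair; the representations and names here are illustrative only:

```python
def quadset_to_dataset(quadset):
    """Triples go to the default graph; the quad (s, p, o, sp) puts the
    triple (s, p, o) into the named graph paired with the name sp."""
    default, named = set(), {}
    for stmt in quadset:
        if len(stmt) == 3:
            default.add(stmt)
        else:
            s, p, o, sp = stmt
            named.setdefault(sp, set()).add((s, p, o))
    return default, named

def dataset_to_quadset(default, named):
    """The inverse mapping. A quadset cannot represent an empty named
    graph, so an empty graph simply contributes no quads here -- matching
    the conversion from a non-quad-equivalent to a quad-equivalent dataset."""
    quadset = set(default)
    for name, graph in named.items():
        for (s, p, o) in graph:
            quadset.add((s, p, o, name))
    return quadset

qs = {("ex:a", "ex:b", "ex:c"),
      ("ex:a", "ex:b", "ex:c", "ex:space1")}
default, named = quadset_to_dataset(qs)
assert dataset_to_quadset(default, named) == qs   # round-trip
```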
Can we take a stronger stand against non-quad-equivalent
datasets? Maybe we can use the terms "proper" and "improper",
or something like that. Improper datasets might also include
ones which use the same name in more than one pair. Combining
these, like removing empty named graphs, is how you convert an
improper dataset to a proper one.
Graph Store
SPARQL 1.1 Update defines a mutable (time-dependent) structure corresponding to a dataset, called a graph store. It is defined as:
A distinguished slot for an RDF Graph
A set of (name, slot) pairs, where the slot holds an RDF Graph and the name is an IRI. No two pairs in a graph store may have the same name.
A "slot" in this definition is an RDF space.
A dataset can be thought of as the state of a graph store, just like an RDF graph can be thought of as the state of a space.
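This state relationship can be sketched as follows; the class and method names are illustrative, not drawn from any specification:

```python
class GraphStore:
    """A distinguished default slot plus named slots; each slot is an
    RDF space whose state at any moment is an RDF graph."""
    def __init__(self):
        self.default = set()          # the distinguished slot
        self.slots = {}               # name (IRI) -> slot; names are unique

    def insert(self, triple, name=None):
        target = self.default if name is None else self.slots.setdefault(name, set())
        target.add(triple)

    def state(self):
        """The dataset which is the current state of this graph store.
        Frozen copies: a dataset, unlike a store, cannot change over time."""
        return (frozenset(self.default),
                {n: frozenset(g) for n, g in self.slots.items()})

gs = GraphStore()
gs.insert(("ex:a", "ex:b", "ex:c"), name="ex:g1")
before = gs.state()
gs.insert(("ex:a", "ex:b", "ex:d"), name="ex:g1")
assert before != gs.state()   # the store changed; the old dataset did not
```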
Merge and Union
RDF graphs are usually combined in one of two ways:
The union of two graphs is the set-union of the sets of triples in each graph.
The merge of two graphs is the set-union of the sets of triples in each graph, after any blank nodes that occur in both graphs are "renamed apart".
This difference is not noticeable when graphs are being expressed in an ordinary RDF syntax, like RDF/XML, RDFa, or Turtle, because those syntaxes provide no mechanism for transmitting two graphs which have a blank node in common. The difference can appear, however, in systems and languages which handle datasets or in APIs which allow blank nodes to be shared between graphs.
We define a union dataset to be a dataset in which the default graph is the union of all its named graphs. Some systems provide special, simplified handling of union datasets.
We define a merge dataset to be a dataset in which the default graph is the merge of all its named graphs.
We define the union and merge of quadsets (and thus datasets) as the set union of their constituent triples and quads; in the case of a merge, this is done after any shared blank nodes have been renamed apart.
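The union/merge distinction can be illustrated in code. This sketch represents blank nodes as strings beginning with "_:" (an illustrative convention only); renaming every blank node apart, consistently within each graph, is a safe over-approximation of renaming only the shared ones:

```python
import itertools

_counter = itertools.count()

def _rename_bnodes(graph):
    """Replace every blank node in the graph with a fresh one,
    consistently within the graph ("renaming apart")."""
    mapping = {}
    def fresh(term):
        if isinstance(term, str) and term.startswith("_:"):
            if term not in mapping:
                mapping[term] = "_:b%d" % next(_counter)
            return mapping[term]
        return term
    return {tuple(fresh(t) for t in triple) for triple in graph}

def union(g1, g2):
    return g1 | g2            # plain set-union: shared bnodes stay shared

def merge(g1, g2):
    return _rename_bnodes(g1) | _rename_bnodes(g2)   # bnodes renamed apart

g1 = {("_:x", "ex:name", '"Alice"')}
g2 = {("_:x", "ex:name", '"Bob"')}
# The union identifies the two blank nodes; the merge keeps them distinct.
assert len({s for (s, p, o) in union(g1, g2)}) == 1
assert len({s for (s, p, o) in merge(g1, g2)}) == 2
```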
Untrusting Merge
The act of renaming the graphs in a dataset is to create another dataset which differs from the first only in that all the IRIs used as graph names are replaced by fresh "Skolem" IRIs. This replacement occurs in the name slot of the (name, graph) pairs, and in the triples in the default graph, but not in the triples in the named graphs.
Logically, this operation is equivalent to partially un-labeling an RDF Graph (turning some IRIs into blank nodes), then Skolemizing those blank nodes. As an operation, it discards some of the information and adds more true information; it is a sound but not complete reasoning step. It can be made complete by recording the relationship between the old graph names and the new ones, using some vocabulary such as owl:sameAs.
For example, a recording graph_rename operation might take as input:
@prefix :
:g1 { :a :b :c }
:d :e :f
and produce:
@prefix :
:fe2b9765-ba1d-4644-a335-80a8c3786c8d { :a :b :c }
:d :e :f
:fe2b9765-ba1d-4644-a335-80a8c3786c8d owl:sameAs :g1
Given the semantics of datasets, informally described above and formally stated in
, and the semantics of OWL, where { ?a owl:sameAs ?b } means that the terms ?a and ?b both denote the same thing, the second dataset above entails the first and includes only additional information that is known to be true. (Slight caveat: the new information is only true if the assumptions of the name-generation function are correct: that the name is previously unused and this naming agent has the right to claim it.)
A related operation, sequestering the default graph, is to create a new dataset which differs from the first only in that the triples in the default graph of the input appear instead in a new, freshly-named, named graph of the output. Sequestering returns both the new dataset and the name generated for the new graph: sequester(D1) -> (D2, generatedIRI)
Used together, the operations of renaming the graphs, sequestering the default graphs, and then merging datasets constitute an untrusting merge of datasets. This operation provides the functionality required for addressing the use case described in
and is illustrated in
. It uses quads to address some—perhaps all—of the need for quints or nested graphs.
More precisely:
function untrusted_merge(D1, ... Dn):
for i in 1..n:
RDi = rename_graphs(Di)
(SRDi, DGNi) = sequester(RDi)
return (merge(SRD1, ... SRDn), (DGN1, ... DGNn))
Here, untrusted_merge returns a single dataset and a list of the names of the graphs (in that dataset) which contain the triples that were in the default graphs, possibly augmented with recording triples. Whether recording is done or not is hidden inside the rename_graphs function, and is application-dependent.
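The pseudocode above can be made runnable under a (default graph, name-to-graph dict) dataset representation; uuid-based URNs stand in for the fresh Skolem IRIs, and recording is done with owl:sameAs triples as described earlier (all names in this sketch are illustrative):

```python
import uuid

OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

def _skolem_iri():
    return "urn:uuid:" + str(uuid.uuid4())

def rename_graphs(dataset, record=True):
    """Replace each graph name with a fresh Skolem IRI, in the (name, graph)
    pairs and in default-graph triples -- but NOT inside the named graphs.
    Optionally record the old->new correspondence with owl:sameAs."""
    default, named = dataset
    mapping = {name: _skolem_iri() for name in named}
    swap = lambda term: mapping.get(term, term)
    new_default = {tuple(swap(t) for t in triple) for triple in default}
    if record:
        for old, new in mapping.items():
            new_default.add((new, OWL_SAMEAS, old))
    return (new_default, {mapping[n]: g for n, g in named.items()})

def sequester(dataset):
    """Move the default graph into a fresh named graph; return the new
    dataset and the generated name."""
    default, named = dataset
    name = _skolem_iri()
    return (set(), dict(named, **{name: set(default)})), name

def untrusted_merge(*datasets):
    merged_default, merged_named, dg_names = set(), {}, []
    for d in datasets:
        (default, named), name = sequester(rename_graphs(d))
        dg_names.append(name)
        merged_default |= default          # empty after sequestering
        merged_named.update(named)         # all names are fresh, no clashes
    return (merged_default, merged_named), dg_names
```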
Semantics
This section specifies a declarative semantics for quads, quadsets, and datasets, allowing them to be used to express knowledge, especially knowledge about spaces. This makes the languages defined in
suitable for conveying knowledge about spaces and providing a foundation for addressing the challenges described in
@@@ the section needs some revision by someone with a good ear for formal semantics, and probably some references to the old and/or new versions of RDF Semantics.
The fundamental notion of RDF spaces is that they can contain triples. This is formalized with the relation CT(S, T), which is informally understood to hold true for any triple T and space S such that S explicitly contains T.
The basic declarative meaning (that is, the truth condition) of RDF quads is this:
The RDF quad (s, p, o, sp) is true in I if and only if CT(I(sp), triple(s, p, o)).
The declarative meaning of a quadset is to simply read the quadset as a conjunction of its quads and its triples. Given the structural mapping between quadsets and datasets, the truth condition for datasets follows:
The RDF dataset (DG, (N1,G1), ... (Ni,Gi), ... (Nn,Gn)) is true in I if and only if:
DG is true in I, and
for every (Ni,Gi) (1<=i<=n), for every triple T in Gi: CT(I(Ni), T).
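These truth conditions can be restated operationally. The sketch below models the CT relation of an interpretation as a set of (space, triple) pairs, takes I to be the identity on IRIs, and delegates the default graph's truth to a supplied predicate; this is a simplification for illustration, not a full model theory:

```python
def dataset_is_true(dataset, ct, graph_is_true):
    """A dataset (DG, {Ni: Gi}) is true iff DG is true and, for every
    named graph Gi, every triple T in Gi satisfies CT(I(Ni), T).
    Here I is taken to be the identity on IRIs (a simplification)."""
    default, named = dataset
    if not graph_is_true(default):
        return False
    return all((name, triple) in ct
               for name, graph in named.items()
               for triple in graph)

# World: space ex:s1 explicitly contains exactly one triple.
ct = {("ex:s1", ("ex:a", "ex:b", "ex:c"))}
true_in_world = lambda graph: True   # assume the default graph's triples hold

d1 = (set(), {"ex:s1": {("ex:a", "ex:b", "ex:c")}})
d2 = (set(), {"ex:s1": {("ex:a", "ex:b", "ex:d")}})
assert dataset_is_true(d1, ct, true_in_world)
assert not dataset_is_true(d2, ct, true_in_world)
```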
Some implications of these truth conditions:
A dataset with no named graphs has the same declarative meaning as its default graph. A quadset with no quads has the same declarative meaning as the RDF graph consisting of the triples in the quadset. This fits the intuition that datasets and quadsets are extensions of RDF Graphs and applies to the syntax as well: a Trig document without any named graphs is syntactically and semantically a Turtle document; an N-Quads document without any quads is syntactically and semantically an N-Triples document.
The empty named graphs in a non-quad-equivalent dataset have no effect on its meaning. Replacing such a dataset with its equivalent without the empty named graphs does not change its meaning.
We say nothing here about the fact that the truth value of a quad
is likely to change over time. Time is orthogonal to RDF
semantics, and quads present no fundamentally different issue
here. When the world changes state, the truth value of RDF
triples or quads might change. This occurs when a triple is put
in or taken out of a space, but it also occurs with "normal" RDF
when, for instance, someone changes their address and different
vcard triples about them become true. Some approaches to handling
change-over-time are discussed in
and
@@@ explain why we use partial-graph semantics, and how in most applications it's bad to drop information, but sometimes it's necessary, and sometimes you only have incomplete information.
Dataset Languages
This section contains specifications of languages for serializing quad-equivalent datasets. N-Quads documents and Trig documents have identical semantics, since they each serialize the same structure and follow
Dataset information may also be conveyed and manipulated using SPARQL or using RDF triple-based tools and languages as per
N-Quads
The syntax of N-Quads is the same as the syntax of N-Triples, except that a fourth term, identifying an RDF space, may optionally be included on each line, after the "object" term. Formally, the N-Quads grammar is the N-Triples Grammar modified by removing productions [1] and [2], and adding the following productions:
[1q] nquadsDoc ::= statement? (EOL statement)* EOL
[2q] statement ::= subject predicate object space? "."
[3q] space ::= IRIREF
The grammar symbols EOL, subject, predicate, object, and IRIREF are defined in the N-Triples Grammar.
The following example shows a quadset consisting of two triples and two quads. The quads both use the same triple, but express the fact that it is in two spaces, "space1" and "space2".
.
.
.
.
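A minimal reader for the IRI-only subset of this grammar can be sketched as follows; full N-Quads term syntax (literals, blank nodes, escapes) is deliberately omitted:

```python
import re

# <iri> <iri> <iri> [<iri>] .   -- IRI-only subset, one statement per line
_LINE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>(?:\s+<([^>]*)>)?\s*\.\s*$')

def parse_nquads(text):
    """Return a set of 3-tuples (triples) and 4-tuples (quads)."""
    statements = set()
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith('#'):
            continue                       # skip blank and comment lines
        m = _LINE.match(line)
        if m is None:
            raise ValueError("cannot parse line: %r" % line)
        s, p, o, space = m.groups()
        statements.add((s, p, o) if space is None else (s, p, o, space))
    return statements

doc = """
<http://example.org/s> <http://example.org/p> <http://example.org/o> .
<http://example.org/s> <http://example.org/p> <http://example.org/o> <http://example.org/space1> .
"""
result = parse_nquads(doc)
assert ("http://example.org/s", "http://example.org/p",
        "http://example.org/o") in result
assert len(result) == 2
```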
Trig
The syntax of Trig is the same as the syntax of Turtle except that (name, graph) pairs can be specified by giving an optional GRAPH keyword, a "name" term, and a nested Turtle graph expression in curly braces. Formally, the Trig grammar is the Turtle Grammar modified by removing productions [1] and [2], and adding the following productions:
[1g] trigDoc ::= statement*
[2g] statement ::= directive "." | triples "." | naming | wrappedDefault
[3g] naming ::= "GRAPH"? spaceName ("," spaceName)* "{" triples "." "}"
[4g] spaceName ::= iri | "DEFAULT"
[5g] wrappedDefault ::= "{" triples "." "}"
The grammar symbols directive, triples, and iri are defined in the Turtle Grammar.
Parsing a Trig document is like parsing a Turtle document except:
The result is a dataset, not an RDF Graph.
The triples generated during parsing of the naming production go into each named graph and/or the default graph as given in the spaceName productions.
The triples generated during other parsing go into the default graph.
Note that the grammar forbids directives between curly braces and empty curly-brace expressions. Also, note that blank node processing is not affected by curly braces, so conceptually blank node identifiers are scoped to the entire document.
There is no requirement that named graph names be unique in a document, or that triples in the default graph be contiguous. For example, these two Trig documents parse to exactly the same dataset:
# Trig Example 1
@prefix : .
:a :b 1.
:s1 { :a :b 10 }
:s2 { :a :b 20 }
:s1 { :a :b 11 }
:s2 { :a :b 21 }
:a :b 2.
# Trig Example 2
@prefix : .
:a :b 1,2.
:s1 { :a :b 10,11. }
:s2 { :a :b 20,21. }
The same dataset could be expressed in N-Quads as:
# N-Quads for TriG Example 1 and 2
"1"^^.
"2"^^.
"10"^^ .
"11"^^ .
"20"^^ .
"21"^^ .
There are several open issues concerning Trig syntax:
Should we call this something other than Trig, since it's a bit different? Qurtle? Mugr (multi-graph-rdf)? Turtle2?
Are braces around default-graph triples required, optional, or disallowed? Assuming "optional" for now.
Is the name prefixed by a keyword? If so, is the keyword "@graph" or "GRAPH"? Assuming optional "GRAPH" for now.
Are blank node labels scoped to the document, the curly-brace expression, or the graph name? Assuming document-scope for now. This is
Issue-21
Can blank node labels be used as space names? Assuming not, for now.
Can we provide a way to say a graph is in multiple spaces without repeating it? Something like [GRAPH] g1, g2, DEFAULT { ... } (where DEFAULT is a keyword stand-in for the default graph). Assuming yes.
Can we allow people to re-use the subject, like: g1 { ... }; :lastModified ... ? Assuming no; it interacts a bit confusingly with repeated spaceName, and it's not clear what it means for spaceName DEFAULT.
Conformance
@@@ what to say here? What kind of thing might conform or not conform to this spec?
Detailed Example
This section presents a design for using spaces in constructing a federated information system. It is intended to help explain and motivate the designs specified in this document. The example covers the same federated phonebook scenario used in
, with each specific use case having an example here.
@@@ An obsolete but complete version was in the
May 10 Version
Showing Triples (v1)
@@@ Shows the baseline in
Showing Web Provenance (v2)
@@@ Shows how to address
Showing Process Provenance (v3)
@@@ Shows how to address
Showing Reported Provenance (v4)
@@@ Shows how to address
Showing Untrusted Quads (v5)
@@@ Show how to address
@@@ uses
renaming the graphs
Showing Change History (v6)
To keep versions, as required by
, we simply copy the old data into a new named graph and record some metadata about it.
In this example, we handle this by defining the following vocabulary:
@@@ tbd: can we define each property separately with any sense, or just the block, together?
If Marvin, rather absurdly, changes his email address every day to include the date, we might have a dataset like this:
@prefix transt: .
@prefix hq: .
@prefix v: .
@prefix : <>.

:g32201 {
#... various data, then:
[] a v:VCard;
v:fn "Marvin Mover" ;
v:email "marvin-0101@example.org".
#... more data from other people
}
[] a transt:Snapshot;
transt:source ;
transt:result :g32201;
transt:starts "2012-01-01T00:00:00"^^xs:dateTime;
transt:ends "2012-01-02T00:00:00"^^xs:dateTime.

:g32202 {
#... various data, then:
[] a v:VCard;
v:fn "Marvin Mover" ;
v:email "marvin-0102@example.org".
#... more data from other people
}
[] a transt:Snapshot;
transt:source ;
transt:result :g32202;
transt:starts "2012-01-02T00:00:00"^^xs:dateTime;
transt:ends "2012-01-03T00:00:00"^^xs:dateTime.

# the current data
{
#... various data, then:
[] a v:VCard;
v:fn "Marvin Mover" ;
v:email "marvin-0103@example.org".
#... more data from other people
}
@@@ or should we put the data directly into a genid graph, so that
metadata about it is less likely to change or be wrong...? On the other hand, there's ALSO some nice potential for metadata about the feed space.
Showing Past and Future States (v7)
The challenge expressed in
is to segregate some of the triples,
marking them as being in-effect only at certain times. The study
of how to do this is part of the field of temporal databases.
In this example, we handle this by defining the following vocabulary:
This "valid-time" vocabulary allows a data publisher to
express a time range during which the triples in some space are
considered valid. This acts like a time-dependent version of
owl:import, where the import is only made during the given time
range.
(rdf:space Sp) vt:starts (xs:dateTime T1)
Claims that all the triples in Sp are valid starting at T1, and ending at some unspecified time.
(rdf:space Sp) vt:ends (xs:dateTime T2)
Claims that all the triples in Sp are valid until just before T2, starting at some unspecified time.
In general, these two predicates need to be used together, providing both vt:starts and vt:ends values for a space. In this case, { ?sp vt:starts ?t1; vt:ends ?t2 } claims that all the triples in ?sp are in effect for all points in time t such that t1 <= t < t2. A consumer who only knows one of the two times is unable to make use of the data; there are no default values.
These predicates say nothing about the validity (or "truth") of the triples in Sp outside of the valid-time range. Each of the triples might or might not hold outside of the range — these vt triples simply make no claim about them.
Given this definition, it is almost trivial for Division 5 to share their "before" and "after" phonebooks:
@prefix vt: .
@prefix hq: .
@prefix : <>.

:pre-move {
# all the pre-move data
...
}
:post-move {
# all the post-move data
...
}

:pre-move vt:starts "2010-01-01T00:00:00"^^xs:dateTime;
vt:ends "2012-07-12T00:00:00"^^xs:dateTime.
:post-move vt:starts "2012-07-12T00:00:00"^^xs:dateTime;
vt:ends "2020-01-01T00:00:00"^^xs:dateTime.
This design requires every client to be modified to understand
and use the valid-time vocabulary. There may be designs that do
not require this.
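A client that understands the valid-time vocabulary might select the graphs in effect at a given instant as sketched below, with the dataset modeled as a (default graph, name-to-graph dict) pair and vt metadata read from the default graph; the term strings and helper name are illustrative:

```python
from datetime import datetime

VT_STARTS = "vt:starts"   # illustrative term strings for the vt vocabulary
VT_ENDS = "vt:ends"

def graphs_in_effect(dataset, at):
    """Return the named graphs whose triples are claimed valid at time `at`:
    those with vt:starts <= at < vt:ends in the default graph. A space with
    only one of the two bounds is unusable -- there are no defaults."""
    default, named = dataset
    starts, ends = {}, {}
    for (s, p, o) in default:
        if p == VT_STARTS:
            starts[s] = datetime.fromisoformat(o)
        elif p == VT_ENDS:
            ends[s] = datetime.fromisoformat(o)
    return {name: graph for name, graph in named.items()
            if name in starts and name in ends
            and starts[name] <= at < ends[name]}

pre = {("ex:alice", "v:tel", '"+41 1 234"')}
post = {("ex:alice", "v:tel", '"+41 1 999"')}
d = ({(":pre-move", VT_STARTS, "2010-01-01T00:00:00"),
      (":pre-move", VT_ENDS,   "2012-07-12T00:00:00"),
      (":post-move", VT_STARTS, "2012-07-12T00:00:00"),
      (":post-move", VT_ENDS,   "2020-01-01T00:00:00")},
     {":pre-move": pre, ":post-move": post})
assert graphs_in_effect(d, datetime(2012, 1, 1)) == {":pre-move": pre}
```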
Folding
This section is experimental.
This section specifies a mechanism and an RDF vocabulary for conveying quads/datasets using ordinary RDF Graphs instead of special syntaxes and/or interfaces. The mechanism is somewhat similar to reflection or reification. The idea is to express each quad as a set of triples using a specialized vocabulary.
Folding allows quads and thus datasets to be conveyed and
manipulated using normal triple-based RDF machinery, including
RDF/XML, Turtle, and RDFa, but at the cost of some complexity,
storage space, and performance. In general, in systems where
languages or APIs are available which directly support datasets,
folding is neither required nor useful.
As an example, the dataset
@prefix : .
:space { eg:subject eg:predicate eg:object }
would fold to these triples:
@prefix : .
:space rdf:containsTriple [
a rdf:Triple;
rdf:subjectIRI "http://example.org/subject";
rdf:predicateIRI "http://example.org/predicate";
rdf:objectIRI "http://example.org/object"
] .
The terms in the triple are encoded (turned into literal strings, in this example), to provide referential opacity. In the semantics of quads, it does not follow from (a b c d) and a=aa that (aa b c d). Without this encoding of terms as strings, that conclusion would erroneously follow from the folded quad.
Terms in this vocabulary:
rdf:Triple
The class of RDF Triples, each of which is just a triple (3-tuple) of three RDF terms, called its "subject", "predicate", and "object". Triples have no identity apart from their three components.
rdf:subjectIRI
A predicate expressing the relationship to the triple's subject term,
when the subject term is an IRI. The value is the IRI (a string)
which is the subject-term part of the triple.
rdf:subjectNode
A predicate expressing the relationship to the triple's
subject term, when the subject term is a blank node. The value
is any RDF Resource; it simply serves as a placeholder,
representing the blank node which serves as the subject-term part
of the triple.
rdf:predicateIRI
A predicate expressing the relationship to the triple's predicate term. The value is the IRI (a string) which serves as the predicate-term part of the triple.
rdf:objectIRI
A predicate expressing the relationship to the triple's
object term, when the object term is an IRI. The value is the
IRI (a string) which is the object-term part of the triple.
rdf:objectNode
A predicate expressing the relationship to the triple's
object term, when the object term is a blank node. The value
is any RDF Resource; it simply serves as a placeholder,
representing the blank node which serves as the object-term part
of the triple.
rdf:objectValue
A predicate expressing the relationship to the triple's
object term, when the object term is literal. The value is the
value which serves as the object-term part of the triple.
rdf:containsTriple
A predicate expressing the relationship between an RDF
space
and a triple which it contains.
This vocabulary is used in a specific template form, always matching this SPARQL graph pattern:
?sp rdf:containsTriple [
a rdf:Triple;
rdf:subjectIRI|rdf:subjectNode ?s;
rdf:predicateIRI ?p;
rdf:objectIRI|rdf:objectNode|rdf:objectValue ?o
]
This template uses SPARQL 1.1 property paths, with alternation using the "|" character. It could also be expressed as six different SPARQL 1.0 (non-property-path) graph patterns.
The terms in this vocabulary only have fully-defined meaning when they occur in the template pattern. When they do, the set of triples matching the template has the same meaning as the quad [ ?s ?p ?o ?sp ].
Folding a dataset is the act of completely conveying the facts in a dataset in RDF triples, using this vocabulary. The procedure is: (1) check for occurrences of the fold template in the default graph -- if they occur, abort, since folding is not defined for this dataset; (2) copy the triples in the default graph of the input to the output; (3) for each quad in the input, generate a matching instance of the fold template and put the resulting five triples in the output.
Unfolding a dataset is the act of turning an RDF graph into a dataset, using this vocabulary. The procedure is: (1) make a mutable copy of the input graph; (2) for each match of the fold template, add the resulting quad to the output dataset and delete the five triples which matched the template; (3) copy the remaining triples to the output as the default graph of the dataset.
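The fold and unfold procedures can be sketched for the IRI-only case (the rdf:subjectNode/rdf:objectNode/rdf:objectValue variants for blank nodes and literals are omitted); the tuple representation is illustrative:

```python
import itertools

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_ids = itertools.count()

def fold(dataset):
    """Convey a dataset as plain triples: copy the default graph, then
    emit five triples per quad using the fold template (IRI terms only)."""
    default, named = dataset
    out = set(default)
    for space, graph in named.items():
        for (s, p, o) in graph:
            t = "_:t%d" % next(_ids)        # placeholder for the rdf:Triple node
            out |= {(space, RDF + "containsTriple", t),
                    (t, RDF + "type", RDF + "Triple"),
                    (t, RDF + "subjectIRI", s),     # terms encoded as strings
                    (t, RDF + "predicateIRI", p),
                    (t, RDF + "objectIRI", o)}
    return out

def unfold(graph):
    """Inverse: match the fold template; move each match back to a quad,
    deleting the five matched triples. The rest becomes the default graph."""
    remaining = set(graph)
    named = {}
    for (space, p, t) in list(remaining):
        if p != RDF + "containsTriple":
            continue
        about = {(s2, p2, o2) for (s2, p2, o2) in remaining if s2 == t}
        parts = {p2: o2 for (_, p2, o2) in about}
        triple = (parts[RDF + "subjectIRI"],
                  parts[RDF + "predicateIRI"],
                  parts[RDF + "objectIRI"])
        named.setdefault(space, set()).add(triple)
        remaining -= about | {(space, p, t)}
    return remaining, named

d = ({("ex:x", "ex:y", "ex:z")}, {"ex:sp": {("ex:a", "ex:b", "ex:c")}})
assert unfold(fold(d)) == d   # round-trip, per the inverse property below
```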
The fold and unfold functions are inverses of each other. That is, for all datasets D on which fold is defined, D = unfold(fold(D)), and for all graphs G, G = fold(unfold(G)).
The functions cannot be composed with themselves (called recursively), since for each of them the domain and range are disjoint. If we were to implicitly convert graphs to datasets (with the graph as the default graph), then fold(fold(D)) would either be an error (if D had any named graphs) or be the same as fold(D). If we were to define unfold2 as an unfold operating on datasets using their default graphs, unfold2(D) = union(D, unfold(default_graph(D))), then unfold2 would be idempotent: unfold2(D) = unfold2(unfold2(D)).
@@@ tbd
Changes
2012-05-15: Added section on "Untrusting Merge".
2012-05-14: Fill in the use cases, removing some of the text that was there and which can go into the example. Redid the trig grammar, adding spaceName, changing formatting. Added valid-time example. Added some of transaction-time example.
2012-05-13: Fill in the example's skeleton, add a few issues/ideas on trig
2012-05-11: Rewriting and reorganizing Concepts; some more work on Use Cases and Example; removed the Detailed Example since it needs to be so re-written; renamed 'reflection' to 'folding'; reworked the Semantics
2012-05-10: Wrote a short intro. Started writing the Use Cases section for real. Added grammar for N-Quads and Trig. Did a first draft of the semantics.
2012-05-09: Renamed "layers" as "spaces"; some word-smithing in Concepts and the Abstract; removed "Turtle in HTML" as a dataset syntax; added some text about trig and nquads; added a note about change-over-time; added an appendix with a reflection vocabulary
2012-05-02: Removed obsolete text from the introduction, removed the section on datasets borrowed from RDF Concepts, and added many entries to Concepts (and renamed it from Terminology).
2012-05-01: Starting with a little text from RDF Concepts, a few ideas, and the text from Layers