www.tbray.org

4. Representations
A representation is data that represents the state of a resource.
It consists of:
Data about the resource, conveyed by formats (e.g., XHTML, CSS,
PNG, XLink, RDF/XML, and SMIL animation) used separately or in
combination.
Metadata about the representation, such as the Internet Media
Type (defined in RFC 2046 [RFC2046]). The Internet Media Type is the key
to the correct interpretation of a resource representation, and
governs the handling of fragment identifiers. When transferred by a Web
protocol, a representation often includes
metadata about both the representation and the message bearing the
representation (for example, some HTTP headers).
Web agents may use representations to modify as well as read resource
state.
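As a rough illustration of the two parts listed above, the following Python
sketch models a representation as a byte payload plus metadata, and reads the
Internet Media Type from an HTTP-style Content-Type header value. The class
name Representation and the sample header value are assumptions made for this
example; they are not defined by any specification.

```python
from dataclasses import dataclass
from email.message import Message


@dataclass
class Representation:
    """Hypothetical model: payload bytes plus metadata about them."""
    data: bytes
    content_type: str  # raw Content-Type header value (assumed input)

    def _header(self) -> Message:
        # Reuse the standard library's header parser for the metadata.
        msg = Message()
        msg["Content-Type"] = self.content_type
        return msg

    @property
    def media_type(self) -> str:
        # Lowercased type/subtype, without parameters.
        return self._header().get_content_type()

    @property
    def charset(self):
        # None when the header carries no charset parameter.
        return self._header().get_param("charset")


rep = Representation(b"<p>Hola</p>", "text/html; charset=UTF-8")
print(rep.media_type, rep.charset)
```

The media type, not the payload, is what tells a receiver how to interpret
the bytes.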
4.1 Open-Endedness
The Web can be used to interchange resource representations in any format.
This is a good thing, since there is continuing progress in the development
of new data formats for new applications and the refinement of existing
ones.
Clearly, for a format to be usefully interoperable between two parties, they
must have a shared understanding of its syntax and semantics.
This is not to imply that a sender of data can count on constraining
its treatment by a receiver; simply that making good use of electronic data
usually requires knowledge of its designers' intentions.
For a format to be widely interoperable across the Web, the following must
obtain:
There SHOULD be a normative specification which is a stable,
widely-accessible Web resource.
The data format SHOULD have an officially registered Internet
media-type.
The invention of new data formats is expensive,
and the Web-wide deployment of software able to handle them
is immensely expensive.
Thus, before inventing a new data format, careful consideration should be
given to re-using one that is already available.
For example, if a format is required to contain human-readable text with
embedded hyperlinks, it is almost certainly better to use HTML for this
purpose than to invent a new format.
4.1.1 Desirable Characteristics of Format Specifications
As noted above, the utility of data formats depends on an accessible
normative specification.
Some of the desirable characteristics of these specifications include:
Attention to Programmers' Needs
The usefulness of a data format depends on the availability of software
which is able to process it.
Such software is more likely to be written if the data format's specification
is aimed at the needs of programmers.
In particular, the specification SHOULD be in part formal and mathematical,
rather than relying exclusively on narrative.
Attention to Error-Handling
Given that representations are generated by humans, either directly or
intermediated by software, and then transmitted in a heterogeneous network,
it is inevitable that errors will occur.
Specifications of data formats SHOULD be clear about
behavior in the presence of errors. It is reasonable to specify that
errors should be worked around, or should result in the termination of a
transaction or session.
It is not acceptable for the behavior in the face of errors to be left
unspecified.
Use of Examples
One important lesson of the Web is that people learn rapidly and well by
example; this is the "View Source" effect.
The quality of data format specifications is improved by the inclusion of
working examples.
4.2 Taxonomic Categorization of Data Formats
This section discusses important characteristics of data formats which can
together be used to describe and understand them.
4.2.1 Binary vs. Textual
A textual data format is one in which the data is specified as a linear
sequence of characters.
HTML, Internet e-mail, and all XML-based languages are textual.
In modern textual data formats, the characters are usually taken from the
Unicode repertoire.
Binary data formats are those in which portions of the data are encoded
for direct use by computer processors, for
example thirty-two bit little-endian two's-complement and sixty-four bit
IEEE double-precision floating-point.
The portions of data so represented include numeric values, pointers, and
compressed data of all sorts.
In principle, all data can be represented using textual formats.
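The distinction can be seen in a few lines of Python. This is only a sketch:
the sample values are arbitrary, and UTF-8 is assumed as the character
encoding for the textual form.

```python
import struct

value_int = -2
value_float = 0.1

# "<i": 32-bit little-endian two's-complement integer.
binary_int = struct.pack("<i", value_int)
# "<d": 64-bit little-endian IEEE double-precision floating-point.
binary_float = struct.pack("<d", value_float)

# The same values as text: a sequence of characters, here encoded as UTF-8.
text_int = str(value_int).encode("utf-8")
text_float = repr(value_float).encode("utf-8")

print(binary_int.hex(), text_int)
print(binary_float.hex(), text_float)

# Both forms carry the same information and round-trip to the original values.
round_trip_int = struct.unpack("<i", binary_int)[0]
round_trip_float = float(text_float.decode("utf-8"))
```

The binary forms are meaningful only to a reader who already knows the width
and byte order; the textual forms can be read directly.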
The trade-offs between binary and textual data formats are complex and
application-dependent.
Binary formats can be substantially more compact, particularly for complex
pointer-rich data structures.
Also, they can be consumed more rapidly by software in those cases where they
can be loaded into memory and used with little or no conversion.
Textual formats are often more portable and interoperable, since there are
fewer choices for representation of the basic units (characters), and those
choices are well-understood and widely implemented.
Textual formats also have the considerable advantage that they can be
directly read and understood by human beings. This can simplify
the tasks of creating and maintaining processing software, and allow the
direct intervention of humans in the processing chain without recourse to
tools any more complex than the ubiquitous text editor.
Finally, it simplifies the necessary human task of learning about new data
formats.
All things being equal (a rare state of affairs) textual formats are
generally preferable to binary ones in Web applications.
It is important to emphasize that intuition as to such matters as data
size and processing speed is not a reliable guide in data format design;
quantitative studies are essential to a correct understanding of the
trade-offs.
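A measurement of this kind need not be elaborate. The sketch below compares
the size of one textual and one binary encoding of the same integers, with
and without general-purpose compression. The choice of JSON, fixed 32-bit
integers, and zlib is an assumption made for the example, not a
recommendation; real studies should use the formats and data actually under
consideration.

```python
import json
import struct
import zlib


def sizes(values):
    """Return byte counts for a textual and a binary encoding of a list of ints."""
    text = json.dumps(values, separators=(",", ":")).encode("utf-8")
    binary = struct.pack(f"<{len(values)}i", *values)  # 32-bit little-endian
    return {
        "text": len(text),
        "binary": len(binary),
        "text+zlib": len(zlib.compress(text)),
        "binary+zlib": len(zlib.compress(binary)),
    }


small = sizes(list(range(10)))                          # one-digit values
large = sizes([2_000_000_000 + i for i in range(10)])   # ten-digit values
print(small)
print(large)
```

For the one-digit values the textual form is the smaller of the two; for the
ten-digit values the binary form is. Which format is more compact depends on
the data, which is why measurement is needed.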
4.2.2 Final-form vs. Reusable
Final-form data formats are not designed to allow modification or uses
other than those intended by their designers.
An example would be PDF, which is designed to support the presentation of
page images on either screen or paper, and is not readily used in any other
way. XML Flow Objects share this characteristic.
XHTML, on the other hand, can be and is put to a variety of uses including
direct display (with highly flexible display semantics), processing by
network-sensitive Web spiders to support search and retrieval operations,
and reprocessing into a variety of derivative forms.
In general XML-based data formats are more re-usable and repurposable than
the alternatives, although the example of XML-FO shows that this is not an
absolute.
There are many cases where final-form is an application
requirement; representations which embody legally-binding transactions are
an obvious example.
In such cases, the use of digital signatures may be appropriate to achieve
immutability, whether the format is naturally final-form or some XML
vocabulary.
On the other hand, where such requirements are not in play,
representations that are
reusable and repurposable are in general higher in value,
particularly in the case where the information's utility may be
long-lived.
4.2.3 Composable vs. Standalone
Some data formats are explicitly designed to be used in combination with
others, while some are designed for standalone use.
An example of a standalone data format is PDF; it is typically neither
embedded in representations encoded in other formats nor is data in other
formats generally embeddable in it.
At the other extreme is SOAP, which is designed explicitly to contain a
"payload" in some non-SOAP vocabulary.
Another example is SVG, which is designed to be included in compound
documents, and which may in turn contain information encoded in other XML
vocabularies.
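A compound document of this kind can be sketched with Python's standard XML
library. The document content and the prefixes "h" and "svg" are assumptions
made for the example; only the two namespace names are fixed by the XHTML and
SVG specifications.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Prefixes used when serializing; any prefixes would do.
ET.register_namespace("h", XHTML_NS)
ET.register_namespace("svg", SVG_NS)

# Host document in one vocabulary (XHTML).
html = ET.Element(f"{{{XHTML_NS}}}html")
body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
para = ET.SubElement(body, f"{{{XHTML_NS}}}p")
para.text = "A circle:"

# Embedded content in another vocabulary (SVG).
svg = ET.SubElement(body, f"{{{SVG_NS}}}svg", {"width": "40", "height": "40"})
ET.SubElement(svg, f"{{{SVG_NS}}}circle", {"cx": "20", "cy": "20", "r": "15"})

document = ET.tostring(html, encoding="unicode")
print(document)

# A consumer can pick out the embedded vocabulary by namespace alone.
reparsed = ET.fromstring(document)
circles = reparsed.findall(f".//{{{SVG_NS}}}circle")
```

Namespaces are what keep the two vocabularies distinguishable after they
have been combined.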
This characteristic is related to, but distinct from, the
final-form/reusable distinction discussed above.
For example, one can certainly imagine cases
where it is useful for a representation to include data in multiple
different formats, but be considered immutable and display-only.
4.3 Presentation, Content, and Interaction
In many cases, the information contained in a representation is logically
separable from the choice of ways in which it may be presented to a human,
and the modes of interaction it may support.
While such separation is, where possible, often advantageous, it is clearly
not always possible and in some cases not desirable either.
More incoming from C. Lilley
4.4 Embedding Hyperlinks in Representations
The Web's vast network of hyperlinks is one of its defining
characteristics, and resource representations are thus commonly required to
contain embedded links to other resources.
This section assumes that the other resources identified by hyperlinks are
represented by URI references, a basic requirement of Web Architecture.
There are, however, many syntactic options available for embedding such
URI-based hyperlinks in resource representations.
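As one illustration among the many syntactic options, the sketch below
collects the URI references carried by the href attribute of HTML "a"
elements and resolves them against a base URI. The class name LinkCollector,
the base URI, and the sample markup are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gather absolute URIs from <a href="..."> links."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                # Relative references are resolved against the base URI.
                self.links.append(urljoin(self.base_uri, value))


collector = LinkCollector("http://example.org/docs/index.html")
collector.feed(
    '<p>See <a href="intro.html">the intro</a> and '
    '<a href="http://example.com/other#part">another site</a>.</p>'
)
print(collector.links)
```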
More incoming from N. Walsh
4.5 XML-Based Representations
Many resource representations are encoded in formats which are XML
vocabularies.
This section discusses issues that are specific to such data formats.
Anyone seeking guidance in this area is urged to consult the IETF Best
Current Practice guidelines for the use of XML in Internet Protocols.
This document contains a very thorough discussion of the considerations that
govern whether or not XML ought to be used, as well as specific guidelines
on how it ought to be used.
While it is directed at Internet applications with specific
reference to protocols, the discussion is generally applicable to Web
scenarios as well.
The discussion here should be seen as ancillary to the content of the IETF
BCP.
4.5.1 When to Use an XML-Based Format
XML defines textual data formats that are naturally suited to describing
data objects which are hierarchical and processed in an in-order sequence.
It is widely but not universally applicable for format
specifications. For example, an audio or video format is unlikely
to be well suited to representation in XML.
Design constraints that would suggest the use of XML include:
Explicit representation of a hierarchical structure.
The data's usefulness should outlive the tools currently used
to process it.
Ability to support internationalization in a self-describing way that
makes confusion over coding options unlikely.
Early detection of encoding errors with no requirement to "work around"
such errors.
A high proportion of human-readable textual content.
Potential composition of the data format with other XML-encoded
formats.
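Two of the constraints listed above, early detection of errors and
self-describing internationalization, can be observed with any conforming
XML parser. The sketch below uses Python's standard library; the function
name parse_or_reject and the sample documents are assumptions made for the
example.

```python
import xml.etree.ElementTree as ET


def parse_or_reject(payload: bytes):
    """Return the root element, or None if the payload is not well-formed."""
    try:
        return ET.fromstring(payload)
    except ET.ParseError as err:
        # No attempt is made to repair or work around the error.
        print(f"rejected: {err}")
        return None


good = parse_or_reject(
    b'<?xml version="1.0" encoding="UTF-8"?><menu><dish>mole</dish></menu>'
)
bad = parse_or_reject(b"<menu><dish>mole</menu>")  # mismatched end tag

# The encoding declaration lets the document describe its own encoding.
latin1 = parse_or_reject(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><dish>caf\xe9</dish>'
)
```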
4.5.2 Namespace Documents
It is often desired to place the markup in an XML vocabulary in one or more
namespaces with names which by definition are URIs.
These namespace names SHOULD be usable for retrieval of human-readable
material aimed at meeting the needs of those who are going to be using the
markup vocabulary.
The simplest way to achieve this is for the namespace name to be an HTTP URI
which may be dereferenced to access this material.
The resource identified by such a URI is called a "namespace document".
Ideally, a namespace document ought to be usable in support of automatic
retrieval of other Web resources useful in support of processing markup
from this vocabulary.
Such resources could include stylesheets, schemas, and executable code.
RDDL is a proposal under discussion in
the community for a variant of XHTML optimized for the construction of
namespace documents which meet the goals described in this section.
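A processor that wishes to look for a namespace document first needs the
namespace name and a way to decide whether it can be dereferenced. The
following sketch shows one way to do that without performing any network
access. The function names and the example.org namespace name are
assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def namespace_name(element):
    """Return the namespace name of an ElementTree element, or None."""
    tag = element.tag
    if tag.startswith("{"):
        # ElementTree stores qualified names as "{namespace-name}local".
        return tag[1:].split("}", 1)[0]
    return None


def is_http_uri(uri):
    """True if the URI uses a scheme that can be dereferenced over HTTP."""
    parts = urlsplit(uri)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


doc = ET.fromstring('<m:menu xmlns:m="http://example.org/ns/menu"/>')
ns = namespace_name(doc)
print(ns, is_http_uri(ns))
```

A namespace name such as "urn:example:menu" would fail the second check, and
no namespace document could be retrieved for it by this means.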
4.5.3 Fragment identifiers and ID semantics
Suppose that the URI
defines a
resource with representations encoded in XML. What, then, is the
interpretation of the
URI
RFC 2396bis makes it clear that the interpretation depends on the
context of the media-type of the representation.
It follows from this that designers of XML-based data formats SHOULD include
the semantics of fragment identifiers in their designs.
XPointer is a W3C Recommendation which provides a syntax designed for use in
such fragment identifiers, and it SHOULD be used for this purpose.
When a representation is provided whose media-type
is application/xml, there are no semantics defined for
fragment identifiers, and thus they SHOULD NOT be provided for such
representations.
This is also the case if the representation is known to be XML because the
media type has a suffix of +xml as described in RFC 3023, but
there is no normative specification of fragment semantics.
It is common practice to assume that when an element has an attribute that
is declared in a DTD to be of type ID, then the fragment
identifier #abc identifies the element which has an attribute
of that type whose value is "abc".
However, there is no normative support for this assumption and it is
problematic in practice, since the only defined way to establish that an
attribute is of type ID is via a DTD, which may not exist or may not be
available.
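The difficulty can be demonstrated in code. The sketch below implements the
common assumption: it splits off the fragment identifier and searches for an
element with a matching attribute value. Because no DTD is consulted, the
name of the attribute treated as an ID is a guess supplied by the caller.
The function name resolve_fragment, the default attribute name "id", and the
sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag


def resolve_fragment(uri_reference, document, id_attribute="id"):
    """Return (uri, element) under the assumed, non-normative ID convention."""
    uri, fragment = urldefrag(uri_reference)
    if not fragment:
        return uri, None
    for element in document.iter():
        # Without a DTD, nothing says this attribute is really of type ID.
        if element.get(id_attribute) == fragment:
            return uri, element
    return uri, None


doc = ET.fromstring(
    '<menu><dish id="abc">mole</dish><dish code="abc">tlayuda</dish></menu>'
)
uri, by_id = resolve_fragment("http://example.org/menu#abc", doc)
_, by_code = resolve_fragment("http://example.org/menu#abc", doc, "code")
print(uri, by_id.text, by_code.text)
```

The same fragment identifier selects different elements depending on which
attribute is guessed to be the ID, which is why a normative definition is
needed.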
4.6 Media-types For XML
RFC 3023 defines the media-types application/xml and text/xml, and
describes a convention whereby XML-based
data formats use media-types with a +xml suffix, for
example image/svg+xml.
In general, media-types beginning with text/ SHOULD NOT be
used for XML representations.
They create two problems. First, intermediate agents in the Web are allowed
to "transcode", i.e. convert one character encoding to another.
Since XML documents are designed to be self-describing as to their encoding,
and since this is a good and widely-followed practice, any such transcoding
will make the self-description false.
Secondly, representations whose media-types begin with text/
are required, unless the charset parameter is specified, to be
considered to be encoded in US-ASCII.
In the case of XML, since it is self-describing, it is good practice to omit
the charset parameter, and since XML is very often not encoded
in US-ASCII, the use of "text/" media-types effectively
precludes this good practice.
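The second problem can be expressed as a small decision procedure. This is a
sketch of the rule as described in this section, not a complete
implementation of RFC 3023; the function name charset_from_media_type is an
assumption made for the example.

```python
from email.message import Message


def charset_from_media_type(content_type):
    """Return the charset imposed by the media type, or None if it imposes none.

    None means the receiver is free to rely on the XML document's own
    encoding declaration.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is not None:
        return charset  # an explicit charset parameter takes precedence
    if msg.get_content_maintype() == "text":
        return "us-ascii"  # default for text/ types when charset is omitted
    return None


for header in ("text/xml", "text/xml; charset=utf-8",
               "application/xml", "image/svg+xml"):
    print(header, "->", charset_from_media_type(header))
```

Only the non-text/ media types leave the document's self-description in
control when the charset parameter is omitted.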