Speech Synthesis Markup Language (SSML) Version 1.1
W3C Recommendation 7 September 2010
This version:
    http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/
Latest version:
    http://www.w3.org/TR/speech-synthesis11/
Previous version:
    http://www.w3.org/TR/2010/PR-speech-synthesis11-20100223/
Editors:
Daniel C. Burnett, Voxeo (formerly of Vocalocity and Nuance)
双志伟 (Zhi Wei Shuang), IBM
Authors:
Paolo Baggia, Loquendo
Paul Bagshaw, France Telecom
Michael Bodell, Microsoft
黄德智 (De Zhi Huang), France Telecom
楼晓雁 (Lou Xiaoyan), Toshiba
Scott McGlashan, HP
陶建华 (Jianhua Tao), Chinese Academy of Sciences
严峻 (Yan Jun), iFLYTEK
胡方 (Hu Fang) (until 20 October 2009 while an Invited Expert)
康永国 (Yongguo Kang) (until 5 December 2007 while at Panasonic Corporation)
蒙美玲 (Helen Meng) (until 29 July 2009 while at Chinese University of Hong Kong)
王霞 (Wang Xia) (until 30 October 2006 while at Nokia)
夏海荣 (Xia Hairong) (until 2 August 2006 while at Panasonic Corporation)
吴志勇 (Zhiyong Wu) (until 29 July 2009 while at Chinese University of Hong Kong)
Please refer to the errata for this document, which may include some normative corrections. See also translations.

Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Abstract
The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.
Status of this Document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the Recommendation of "Speech Synthesis Markup Language (SSML) Version 1.1". It has been produced by the Voice Browser Working Group, which is part of the Voice Browser Activity.

Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.
The design of SSML 1.1 has been widely reviewed (see the disposition of comments) and satisfies the Working Group's technical requirements. A list of implementations is included in the SSML 1.1 Implementation Report, along with the associated test suite.

The Working Group made a few editorial changes to the 23 February 2010 Proposed Recommendation in response to comments. Changes from the Proposed Recommendation can be found in Appendix G. Changes from SSML 1.0, including a note on backwards compatibility with SSML 1.0, can be found in Appendix F.
This document enhances SSML 1.0 [SSML] to provide better support for a broader set of natural (human) languages. To determine in what ways, if any, SSML is limited by its design with respect to supporting languages that are in large commercial or emerging markets for speech synthesis technologies but for which there was limited or no participation by either native speakers or experts during the development of SSML 1.0, the W3C held three workshops on the Internationalization of SSML. The first workshop [WS], in Beijing, PRC, in October 2005, focused primarily on Chinese, Korean, and Japanese languages; the second [WS2], in Crete, Greece, in May 2006, focused primarily on Arabic, Indian, and Eastern European languages; and the third [WS3], in Hyderabad, India, in January 2007, focused heavily on Indian and Middle Eastern languages. Information collected during these workshops was used to develop a requirements document [REQS11]. Changes from SSML 1.0 are motivated by these requirements.
This document has been reviewed by W3C Members, by software
developers, and by other W3C groups and interested parties, and is
endorsed by the Director as a W3C Recommendation. It is a stable
document and may be used as reference material or cited from another
document. W3C's role in making the Recommendation is to draw
attention to the specification and to promote its widespread
deployment. This enhances the functionality and interoperability of
the Web.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
The sections in the main body of this document are normative
unless otherwise specified. The appendices in this document are
informative unless otherwise indicated explicitly.
Table of Contents

1. Introduction
1.1 Design Concepts
1.2 Speech Synthesis Process Steps
1.3 Document Generation, Applications and Contexts
1.4 Platform-Dependent Output Behavior of SSML Content
1.5 Terminology
2. SSML Documents
2.1 Document Form
2.2 Conformance
2.2.1 Conforming Speech Synthesis Markup Language Fragments
2.2.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents
2.2.3 Using SSML With Other Namespaces
2.2.4 Conforming Speech Synthesis Markup Language Processors
2.2.5 Profiles
2.2.6 Conforming User Agent
2.3 Integration With Other Markup Languages
2.3.1 SMIL
2.3.2 ACSS
2.3.3 VoiceXML
2.4 Fetching SSML Documents
3. Elements and Attributes
3.1 Document Structure, Text Processing and Pronunciation
3.1.1 "speak" Root Element
3.1.1.1 Trimming Attributes
3.1.2 Language: "xml:lang" Attribute
3.1.3 Base URI: "xml:base" Attribute
3.1.3.1 Resolving Relative URIs
3.1.4 Identifier: "xml:id" Attribute
3.1.5 Lexicon Documents
3.1.5.1 "lexicon" Element
3.1.5.2 "lookup" Element
3.1.6 "meta" Element
3.1.7 "metadata" Element
3.1.8 Text Structure
3.1.8.1 "p" and "s" Elements
3.1.8.2 "token" and "w" Elements
3.1.9 "say-as" Element
3.1.10 "phoneme" Element
3.1.10.1 Pronunciation Alphabet Registry
3.1.11 "sub" Element
3.1.12 "lang" Element
3.1.13 Language Speaking Failure: "onlangfailure" Attribute
3.2 Prosody and Style
3.2.1 "voice" Element
3.2.2 "emphasis" Element
3.2.3 "break" Element
3.2.4 "prosody" Element
3.3 Other Elements
3.3.1 "audio" Element
3.3.1.1 Trimming Attributes
3.3.1.2 "soundLevel" Attribute
3.3.1.3 "speed" Attribute
3.3.2 "mark" Element
3.3.3 "desc" Element
4. References
5. Acknowledgments
Appendix A. Audio File Formats (normative)
Appendix B. Internationalization (normative)
Appendix C. Media Types and File Suffix (normative)
Appendix D. Schema for the Speech Synthesis Markup Language (normative)
Appendix E. Example SSML (informative)
Appendix F. Changes since SSML 1.0 (informative)
Appendix G. Changes since last draft (informative)
1. Introduction

This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].
SSML is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [SABLE], which tried to integrate many different XML-based markups for speech synthesis into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS]. Since then, SABLE itself has not undergone any further development.
The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process (see Section 1.2). The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document (see Section 2.2.2) or as part of a fragment (see Section 2.2.1) embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like the phoneme and prosody elements (e.g. for speech contour design) may require specialized knowledge.
1.1 Design Concepts

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [REQS].
The following items were the key design criteria.
Consistency: provide predictable control of voice output across platforms and across speech synthesis implementations.

Interoperability: support use along with other W3C specifications including (but not limited to) VoiceXML, aural Cascading Style Sheets and SMIL.

Generality: support speech output for a wide range of applications with varied speech content.

Internationalization: enable speech output in a large number of languages within or across documents.

Generation and Readability: support automatic generation and hand authoring of documents. The documents should be human-readable.

Implementable: the specification should be implementable with existing, generally available technology, and the number of optional features should be minimal.
1.2 Speech Synthesis Process Steps

A Text-To-Speech system (a synthesis processor) that supports SSML is responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.
Document creation: A text document provided as input to the synthesis processor may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.
Document processing: The following are the six major processing steps undertaken by a synthesis processor to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output. Although each step below is divided into "markup support" and "non-markup behavior", actual behavior is usually a mix of the two and varies depending on the tag. The processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is then up to the processor to determine whether and in what way to use the information.
XML parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.
Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.

Markup support: The p and s elements defined in SSML explicitly indicate document structures that affect the speech output.

Non-markup behavior: In documents and parts of documents where these elements are not used, the synthesis processor is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.
Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the synthesis processor that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on. By the end of this step the text to be spoken has been converted completely into tokens. The exact details of what constitutes a token are language-specific. In English, tokens are usually separated by white space and are typically words. For languages with different tokenization behavior, the term "word" in this specification is intended to mean an appropriately comparable unit. Tokens in SSML cannot span markup tags except within the token and w elements. A simple English example is "cup<break/>board"; outside the token and w elements, the synthesis processor will treat this as the two tokens "cup" and "board" rather than as one token (word) with a pause in the middle. Breaking one token into multiple tokens this way will likely affect how the processor treats it.
Markup support: The say-as element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked has not yet been defined but might include dates, times, numbers, acronyms, currency amounts and more. Note that many acronyms and abbreviations can be handled by the author via direct text replacement or by use of the sub element, e.g. "BBC" can be written as "B B C" and "AAA" can be written as "triple A". These replacement written forms will likely be pronounced as one would want the original acronyms to be pronounced. In the case of Japanese text, if you have a synthesis processor that supports both Kanji and kana, you may be able to use the sub element to identify whether 今日は should be spoken as きょうは ("kyou wa" = "today") or こんにちは ("konnichiwa" = "hello").
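As an illustration, here is a minimal sketch of both approaches (the interpret-as value shown is drawn from the separate W3C Note on say-as attribute values rather than defined by this specification, and the sentence content is invented):

<s>
  <!-- Hint that the construct is a date, resolving the "1/2" ambiguity. -->
  The meeting is on <say-as interpret-as="date" format="mdy">1/2/2000</say-as>.
</s>
<s>
  <!-- Replace the written form with the desired spoken form. -->
  <sub alias="triple A">AAA</sub> members receive a discount.
</s>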
Non-markup behavior: For text content that is not marked with the say-as element, the synthesis processor is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different processors to render the same document differently.
Text-to-phoneme conversion: Once the synthesis processor has determined the set of tokens to be spoken, it must derive pronunciations for each token. Pronunciations may be conveniently described as sequences of phonemes, which are units of sound in a language that serve to distinguish one word from another. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes, Hawai'ian has between 12 and 18 (depending on who you ask), and some languages have more than 100! This conversion is made complex by a number of issues. One issue is that there are differences between written and spoken forms of a language, and these differences can lead to indeterminacy or ambiguity in the pronunciation of written words. For example, compared with their spoken form, words in Hebrew and Arabic are usually written with no vowels, or only a few vowels specified. In many languages the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Both human speakers and synthesis processors can pronounce these words correctly in context but may have difficulty without context (see "Non-markup behavior" below). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English synthesis processor will often have trouble determining how to speak some non-English-origin names, e.g. "Caius College" (pronounced "keys college") and President Tito (pronounced "sutto"), the president of the Republic of Kiribati (pronounced "kiribass").
Markup support: The phoneme element allows a phonemic sequence to be provided for any token or token sequence. This provides the content creator with explicit control over pronunciations. The say-as element might also be used to indicate that text is a proper name, which may allow a synthesis processor to apply special rules to determine a pronunciation. The lexicon and lookup elements can be used to reference external definitions of pronunciations. These elements can be particularly useful for acronyms and abbreviations that the processor is unable to resolve via its own text normalization and that are not addressable via direct text substitution or the sub element (see paragraph 3, above).
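For example, a minimal sketch using the Caius College name from above (the IPA string is an assumed pronunciation supplied by the author):

<!-- Supply the pronunciation "keys" explicitly rather than letting the
     processor guess from the spelling. -->
<phoneme alphabet="ipa" ph="kiːz">Caius</phoneme> College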
Non-markup behavior: In the absence of a phoneme element, the synthesis processor MUST apply automated capabilities to determine pronunciations. This is typically achieved by looking up tokens in a pronunciation dictionary (which may be language-dependent) and applying rules to determine other pronunciations. Synthesis processors are designed to perform text-to-phoneme conversions so most words of most documents can be handled automatically. As an alternative to relying upon the processor, authors may choose to perform some conversions themselves prior to encoding in SSML. Written words with indeterminate or ambiguous pronunciations could be replaced by words with an unambiguous pronunciation; for example, in the case of "read", "I will reed the book". Authors should be aware, however, that the resulting SSML document may not be optimal for visual display.
Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
Markup support: The emphasis element, break element and prosody element may all be used by document creators to guide the synthesis processor in generating appropriate prosodic features in the speech output.

Non-markup behavior: In the absence of these elements, synthesis processors are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.
While most of the elements of SSML can be considered high-level in that they provide either content to be spoken or logical descriptions of style, the break and prosody elements mentioned above operate at a later point in the process and thus must coexist both with uses of the emphasis element and with the processor's own determinations of prosodic behavior. Unless specified in the appropriate sections, details of the interactions between the processor's own determinations and those provided by the author at this level are processor-specific. Authors are encouraged not to casually or arbitrarily mix these two levels of control.
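As an illustration, a minimal sketch combining the three elements (the pause length and prosody values are arbitrary author choices):

<s>
  Please <emphasis level="strong">do not</emphasis> touch the exhibits.
  <break time="500ms"/>
  <prosody rate="slow" pitch="+5%">Thank you for your cooperation.</prosody>
</s>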
Waveform production: The phonemes and prosodic information are used by the synthesis processor in the production of the audio waveform. There are many approaches to this processing step, so there may be considerable processor-specific variation.
Markup support: The voice element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The audio element allows for insertion of recorded audio data into the output stream, with optional control over the duration, sound level and playback speed of the recording. Rendering can be restricted to a subset of the document by using the trimming attributes on the speak element.

Non-markup behavior: The default volume/sound level, speed, and pitch/frequency of both voices and recorded audio in the document are those of the unmodified waveforms, whether they be voices or recordings.
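As an illustration, a minimal sketch of both elements (prompt.wav is a hypothetical recording, and the voice attribute values are one possible request):

<voice gender="female" languages="en-US">
  Welcome!
  <!-- Recorded audio, with fallback text spoken if the recording is unavailable. -->
  <audio src="prompt.wav">Thank you for calling.</audio>
</voice>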
1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a synthesis processor. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.
The document creator has no access to information to mark up the text. All processing steps in the synthesis processor must be performed fully automatically on raw text. The document requires only the containing speak element to indicate that the content is to be spoken.
When marked text is generated programmatically, the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, text normalization, prosody and possibly text-to-phoneme conversion.
Some document creators make considerable effort to mark as many details of the document as possible to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and voice browser applications may be fine-tuned to maximize the effectiveness of the overall system.
The most advanced document creators may skip the higher-level markup (structure, text normalization, text-to-phoneme conversion, and prosody analysis) and produce low-level speech synthesis markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.
The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.
Dialog language: It is a requirement that it SHOULD be possible to include documents marked with SSML into the dialog description document to be produced by the Voice Browser Working Group.
Interoperability with aural CSS (ACSS): Any HTML processor that is aural CSS-enabled can produce SSML. ACSS is covered in Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification [CSS2 §19]. This usage of speech synthesis facilitates improved accessibility to existing HTML and XHTML content.
Application-specific style sheet processing: As mentioned above, there are classes of applications that have knowledge of the text content to be spoken, and that knowledge can be incorporated into the speech synthesis markup to enhance rendering of the document. In many cases, it is expected that the application will use style sheets to perform transformations of existing XML documents to SSML. This is equivalent to the use of ACSS with HTML, and once again SSML is the resulting representation to be passed to the synthesis processor. In this context, SSML may be viewed as a superset of ACSS [CSS2 §19] capabilities, excepting spatial audio.
1.4 Platform-Dependent Output Behavior of SSML Content

SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.

Unless otherwise specified, markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the synthesis processor cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.
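For example, in the following illustrative sketch (the duration values are invented), the two requested durations conflict, and the processor may adjust either one to produce a reasonable rendering:

<prosody duration="5s">
  This whole sentence is requested to take five seconds,
  <prosody duration="4s">while this inner span alone requests four seconds.</prosody>
</prosody>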
1.5 Terminology

Requirements terms
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

At user option
A conforming synthesis processor MAY or MUST (depending on the modal verb in the sentence) behave as described; if it does, it MUST provide users a means to enable or disable the behavior described.
Error
Results are undefined. A conforming synthesis processor MAY detect and report an error and MAY recover from it.

Media Type
A media type (defined in [RFC2045] and [RFC2046]) specifies the nature of a linked resource. Media types are case insensitive. A list of registered media types is available for download [TYPES]. See Appendix C for information on media types for SSML.
Speech Synthesis
The process of automatic generation of speech output from data input which may include plain text, marked up text or binary objects.
Synthesis Processor
A Text-To-Speech system that accepts SSML documents as input and renders them as spoken output.
Text-To-Speech
The process of automatic generation of speech output from text or annotated text input.
URI: Uniform Resource Identifier
A global identifier in the context of the World Wide Web [WEB-ARCH]. A URI is defined as any legal anyURI primitive as defined in XML Schema Part 2: Datatypes [SCHEMA2 §3.2.17]. For informational purposes only, [RFC3986] and [RFC2732] may be useful in understanding the structure, format, and use of URIs. Note that IRIs (see [RFC3987]) are permitted within the above definition of URI. Any relative URI reference MUST be resolved according to the rules given in Section 3.1.3.1. In this specification URIs are provided as attributes to elements, for example in the audio and lexicon elements.
Voice Browser
A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
2. SSML Documents

2.1 Document Form

A legal stand-alone Speech Synthesis Markup Language document MUST have a legal XML Prolog [XML 1.0 or XML 1.1, as appropriate, §2.8]. The XML prolog is followed by the root speak element. See Section 3.1.1 for details on this element.
The speak element MUST designate the SSML namespace. This can be achieved by declaring an xmlns attribute or an attribute with an "xmlns" prefix. See [XMLNS 1.0 or XMLNS 1.1, as appropriate, §2] for details. Note that when the xmlns attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.
It is RECOMMENDED that the speak element also indicate the location of the appropriate SSML schema (see Appendix D) via the xsi:schemaLocation attribute from [SCHEMA1 §2.6.3]. Although such indication is not required, to encourage it this document provides such indication on all of the examples. When this attribute is not given, the Core profile (see Section 2.2.5) MUST be assumed.
The following are two examples of legal SSML headers:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
The meta, metadata and lexicon elements MUST occur before all other elements and text contained within the root speak element. There are no other ordering constraints on the elements in this specification.
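For example, in the following sketch (the lexicon and metadata URIs are hypothetical), the lexicon and meta elements appear before any text to be spoken:

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.pls" xml:id="lex"/>
  <meta name="seeAlso" content="http://www.example.com/metadata.xml"/>
  Hello, world.
</speak>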
2.2 Conformance

2.2.1 Conforming Speech Synthesis Markup Language Fragments

2.2.1.1 Conforming Core Speech Synthesis Markup Language Fragments

A document fragment is a Conforming Core Speech Synthesis Markup Language Fragment if it conforms to the criteria for Conforming Stand-Alone Core Speech Synthesis Markup Language Documents after the following two transformations: first, with the exception of xml:lang and xml:base, all non-synthesis namespace elements and attributes and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document; and second, if the speak element does not already designate the synthesis namespace using the xmlns attribute, then xmlns="http://www.w3.org/2001/10/synthesis" is added to the element.
2.2.1.2 Conforming Extended Speech Synthesis Markup Language Fragments

A document fragment is a Conforming Extended Speech Synthesis Markup Language Fragment if it conforms to the criteria for Conforming Stand-Alone Extended Speech Synthesis Markup Language Documents after the following two transformations: first, with the exception of xml:lang and xml:base, all non-synthesis namespace elements and attributes and all xmlns attributes which refer to non-synthesis namespace elements are removed from the document; and second, if the speak element does not already designate the synthesis namespace using the xmlns attribute, then xmlns="http://www.w3.org/2001/10/synthesis" is added to the element.
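As an illustrative sketch of the transformation (the host language and its namespace are invented for the example), a fragment embedded in another markup language:

<host:page xmlns:host="http://www.example.com/host-language">
  <speak version="1.1" xml:lang="en-US">Hello, world.</speak>
</host:page>

becomes, after the non-synthesis content is removed and the namespace declaration is added, the stand-alone document:

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">Hello, world.</speak>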
2.2.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents

2.2.2.1 Conforming Stand-Alone Core Speech Synthesis Markup Language Documents

A document is a Conforming Stand-Alone Core Speech Synthesis Markup Language Document if it meets both the following conditions:

It is a well-formed XML document [XML 1.0 or XML 1.1 §2.1] conforming to Namespaces in XML (1.0 [XMLNS 1.0] or 1.1 [XMLNS 1.1], respectively).

It is a valid XML document [XML 1.0 or XML 1.1 §2.8] which adheres to the specification described in this document (Speech Synthesis Markup Language Specification), including the constraints expressed in the Core Schema (see Appendix D), and having an XML Prolog and speak root element as specified in Section 2.1.
2.2.2.2 Conforming Stand-Alone Extended Speech Synthesis Markup Language Documents

A document is a Conforming Stand-Alone Extended Speech Synthesis Markup Language Document if it meets both the following conditions:

It is a well-formed XML document [XML 1.0 or XML 1.1 §2.1] conforming to Namespaces in XML (1.0 [XMLNS 1.0] or 1.1 [XMLNS 1.1], respectively).

It is a valid XML document [XML 1.0 or XML 1.1 §2.8] which adheres to the specification described in this document (Speech Synthesis Markup Language Specification), including the constraints expressed in the Extended Schema (see Appendix D), and having an XML Prolog and speak root element as specified in Section 2.1.
The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.
2.2.3 Using SSML With Other Namespaces

The synthesis namespace MAY be used with other XML namespaces as per the appropriate Namespaces in XML Recommendation (1.0 [XMLNS 1.0] or 1.1 [XMLNS 1.1], depending on the version of XML being used). Future work by W3C is expected to address ways to specify conformance for documents involving multiple namespaces. Language-specific (i.e. non-SSML) elements and attributes may be inserted into SSML using an appropriate namespace. However, such content would only be rendered by a synthesis processor that supported the custom markup. Here is an example of how one might insert Ruby [RUBY] elements into SSML:

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xhtml="http://www.w3.org/1999/xhtml"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="ja">
  <s>
    今日は七月
    <xhtml:ruby>
      <xhtml:rb>二十日</xhtml:rb>
      <xhtml:rt>ハツカ</xhtml:rt>
    </xhtml:ruby>
    です。
  </s>
  <s>
    今日は七月
    <xhtml:ruby>
      <xhtml:rb>二十日</xhtml:rb>
      <xhtml:rt>ニジューニチ</xhtml:rt>
    </xhtml:ruby>
    です。
  </s>
</speak>
2.2.4 Conforming Speech Synthesis Markup Language Processors

In a Conforming Speech Synthesis Markup Language Processor, the XML parser MUST be able to parse and process all XML constructs defined by XML 1.0 [XML 1.0] and XML 1.1 [XML 1.1] and the corresponding versions of Namespaces in XML (1.0 [XMLNS 1.0] and 1.1 [XMLNS 1.1]). This XML parser is not required to perform validation of an SSML document as per its schema or DTD; this implies that during processing of an SSML document it is OPTIONAL to apply or expand external entity references defined in an external DTD.
A Conforming Speech Synthesis Markup Language Processor MUST meet the following requirements for handling of natural (human) languages:

A Conforming Speech Synthesis Markup Language Processor is REQUIRED to parse all legal natural language declarations successfully.

A Conforming Speech Synthesis Markup Language Processor may be able to apply the semantics of markup languages which refer to more than one natural language. When a processor is able to support each natural language in the set but is unable to handle them concurrently, it SHOULD inform the hosting environment. When the set includes one or more natural languages that are not supported by the processor, it SHOULD inform the hosting environment.

A Conforming Speech Synthesis Markup Language Processor MAY implement natural languages by approximate substitutions according to a documented, processor-specific behavior. For example, a US English synthesis processor could process British English input.
There is no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.
2.2.4.1 Conforming Core Speech Synthesis Markup Language Processors

A Core Speech Synthesis Markup Language processor is a Conforming Speech Synthesis Markup Language Processor that can parse and process Conforming Stand-Alone Core Speech Synthesis Markup Language Documents.

A Conforming Core Speech Synthesis Markup Language Processor MUST correctly understand and apply the semantics of the elements and attributes of the Core profile as described by this document.

When a Conforming Core Speech Synthesis Markup Language Processor encounters elements or attributes other than those included in the Core profile, it MAY:

ignore the non-standard elements and/or attributes,
or process the non-standard elements and/or attributes,
or reject the document containing those elements and/or attributes.
2.2.4.2 Conforming Extended Speech Synthesis Markup Language Processors

An Extended Speech Synthesis Markup Language processor is a Conforming Speech Synthesis Markup Language Processor that can parse and process Conforming Stand-Alone Extended Speech Synthesis Markup Language Documents.

A Conforming Extended Speech Synthesis Markup Language Processor MUST correctly understand and apply the semantics of the elements and attributes of the Extended profile as described by this document.

When a Conforming Extended Speech Synthesis Markup Language Processor encounters elements or attributes other than those included in the Extended profile, it MAY:

ignore the non-standard elements and/or attributes,
or process the non-standard elements and/or attributes,
or reject the document containing those elements and/or attributes.
2.2.5 Profiles

An SSML Profile is a collection of SSML elements and attributes. There are only two profiles defined in this document:

Core profile
The Core profile consists of all elements and attributes defined in this specification except for the clipBegin, clipEnd, repeatCount, repeatDur, soundLevel, and speed attributes on the audio element.
Extended profile
The Extended profile consists of all elements and attributes defined in this specification.
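For illustration, a sketch of an audio element using the attributes that distinguish the Extended profile (the file name and all attribute values are invented):

<!-- Play seconds 1-4 of the recording twice, 6 dB louder and at 150% speed. -->
<audio src="http://www.example.com/music.wav"
       clipBegin="1s" clipEnd="4s"
       repeatCount="2"
       soundLevel="+6dB"
       speed="150%"/>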
2.2.6 Conforming User Agent

A Conforming User Agent is a Conforming Speech Synthesis Markup Language Processor that is capable of accepting an SSML document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author. A Conforming User Agent MUST support at least one natural language.

Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input, there is no conformance requirement regarding accuracy. A conformance test MAY, however, require some examples of correct synthesis of a reference document to determine conformance.
2.3 Integration With Other Markup Languages

2.3.1 SMIL

The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [SMIL3] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in Appendix E.
2.3.2 ACSS

Aural Cascading Style Sheets [CSS2 §19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.
2.3.3 VoiceXML

The Voice Extensible Markup Language [VXML] enables Web-based development and content delivery for interactive voice response applications (see voice browser). VoiceXML supports speech synthesis, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML, see Appendix F.
2.4 Fetching SSML Documents

The fetching and caching behavior of SSML documents is defined by the environment in which the synthesis processor operates. In a VoiceXML interpreter context, for example, the caching policy is determined by the VoiceXML interpreter.
3. Elements and Attributes
The following elements and attributes are defined in this specification.
3.1 Document Structure, Text Processing and Pronunciation
3.1.1 "speak" Root Element
3.1.1.1 Trimming Attributes
3.1.2 Language: "xml:lang" Attribute
3.1.3 Base URI: "xml:base" Attribute
3.1.3.1 Resolving Relative URIs
3.1.4 Identifier: "xml:id" Attribute
3.1.5 Lexicon Documents
3.1.5.1 "lexicon" Element
3.1.5.2 "lookup" Element
3.1.6 "meta" Element
3.1.7 "metadata" Element
3.1.8 Text Structure
3.1.8.1 "p" and "s" Elements
3.1.8.2 "token" and "w" Elements
3.1.9 "say-as" Element
3.1.10 "phoneme" Element
3.1.10.1 Pronunciation Alphabet Registry
3.1.11 "sub" Element
3.1.12 "lang" Element
3.1.13 Language Speaking Failure: "onlangfailure" Attribute
3.2 Prosody and Style
3.2.1 "voice" Element
3.2.2 "emphasis" Element
3.2.3 "break" Element
3.2.4 "prosody" Element
3.3 Other Elements
3.3.1 "audio" Element
3.3.1.1 Trimming Attributes
3.3.1.2 "soundLevel" Attribute
3.3.1.3 "speed" Attribute
3.3.2 "mark" Element
3.3.3 "desc" Element
3.1 Document Structure, Text Processing and Pronunciation

3.1.1 "speak" Root Element

The Speech Synthesis Markup Language is an XML application. The root element is speak.

xml:lang is a REQUIRED attribute specifying the language of the root document. xml:base is an OPTIONAL attribute specifying the Base URI of the root document. onlangfailure is an OPTIONAL attribute specifying the desired behavior upon language speaking failure. The version attribute is a REQUIRED attribute that indicates the version of the specification to be used for the document and MUST have the value "1.1". The trimming attributes are specified in a subsection, below.
Before the speak element is executed, the synthesis processor MUST select a default voice. Note that a language speaking failure (see Section 3.1.13) will occur as soon as the first text is encountered if the language of the text is one that the default voice cannot speak. This assumes that the voice has not been changed before encountering the text, of course.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  ... the body ...
</speak>
The speak element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lexicon, lookup, mark, meta, metadata, p, phoneme, prosody, s, say-as, sub, token, voice, w.
3.1.1.1 Trimming Attributes

Trimming attributes define the span of the document to be rendered. Both the start and the end of the span within the speak content can be specified using marks. The following trimming attributes are defined for speak:
Name       Required   Type                          Default Value   Description
startmark  false      xsd:token [SCHEMA2 §3.3.2]    none            The mark used to determine when rendering starts.
endmark    false      xsd:token [SCHEMA2 §3.3.2]    none            The mark used to determine when rendering ends.
The startmark and endmark attributes specify a name that references a marker as assigned by the name attribute of the mark element. Only markers defined once in the document, i.e. that are unique, are permitted as the value of either startmark or endmark. The span of the document rendered is determined as follows:
If the startmark is specified, then rendering starts at the startmark. If startmark is not specified, rendering begins at the beginning of the document.

If the endmark is specified, then rendering ends at the endmark. If the endmark is not specified, rendering ends at the document end.

If the startmark is after the endmark, then no audio is generated.
It is an error if the value given for either startmark or endmark is not a valid mark in the document.
Examples

If no trimming attributes are specified, then the complete document is rendered:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <audio src="first.wav"/> <mark name="mark1"/>
  <audio src="middle.wav"/> <mark name="mark2"/>
  <audio src="last.wav"/>
</speak>

Here "first.wav", "middle.wav" and "last.wav" are rendered, and the mark "mark2" is the last mark rendered.
The startmark can be used to specify that rendering begins from a specific mark:

<?xml version="1.0"?>
<speak version="1.1" startmark="mark1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <audio src="first.wav"/> <mark name="mark1"/>
  <audio src="middle.wav"/> <mark name="mark2"/>
  <audio src="last.wav"/>
</speak>

"middle.wav" and "last.wav" are rendered, but not "first.wav", since it occurs before the startmark "mark1".
The end of rendering can be specified using the endmark:

<?xml version="1.0"?>
<speak version="1.1" endmark="mark2"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <audio src="first.wav"/> <mark name="mark1"/>
  <audio src="middle.wav"/> <mark name="mark2"/>
  <audio src="last.wav"/>
</speak>

Here "first.wav" and "middle.wav" are completely rendered, but none of "last.wav" is rendered.
Finally, these trimming attributes can be used to control both the start and end of rendering:

<?xml version="1.0"?>
<speak version="1.1" startmark="mark1" endmark="mark2"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <audio src="first.wav"/> <mark name="mark1"/>
  <audio src="middle.wav"/> <mark name="mark2"/>
  <audio src="last.wav"/>
</speak>

Here only "middle.wav" is played.
3.1.2 Language: "xml:lang" Attribute

The xml:lang attribute, as defined by XML [XML 1.0 or XML 1.1, as appropriate, §2.12], MAY be used in SSML to indicate the natural language of the written content of the element on which it occurs. BCP47 [BCP47] can help in understanding how to use this attribute.
Language information is inherited down the document hierarchy, i.e. it needs to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.
xml:lang is a defined attribute for the speak, lang, desc, p, s, token, and w elements. xml:lang is permitted on p, s, token, and w only because it is common to change the language at those levels.
The synthesis processor SHOULD use the value of the xml:lang attribute to assist it in determining the best way of rendering the content of the element on which it occurs. When the synthesis processor comes across text it does not know how to speak, it is the responsibility of the processor to decide what to do (see the onlangfailure attribute). One of the sources of information it can draw upon to make this decision is the value of the xml:lang attribute.
The synthesis processor may also use the value of the xml:lang attribute to help it to determine the language of the content, which may of course affect how the voice will speak the content. For example, "The French word for cat is chat, not chat", where the first "chat" is marked as French and the second is left in the surrounding English. If the document author requires a new voice that is better adapted to the new language, then the synthesis processor can be explicitly requested to select a new voice by using the voice element. Further information about voice selection appears in Section 3.2.1.
The text normalization processing step may be affected by the enclosing language. This is true for both markup support by the say-as element and non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <s>Today, 2/1/2000.</s>
  <s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>
3.1.3 Base URI: "xml:base" Attribute

Relative URIs are resolved according to a base URI, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See Section 3.1.3.1 for details on the resolution of relative URIs.
The base URI declaration is permitted but OPTIONAL. The two elements affected by it are:

audio: the OPTIONAL src attribute can specify a relative URI.
lexicon: the uri attribute can specify a relative URI.
The xml:base attribute

The base URI declaration follows [XML-BASE] and is indicated by an xml:base attribute on the root speak element.

<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"
       xml:base="http://www.example.com/base-file-path">

<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"
       xml:base="http://www.example.com/another-base-file-path">
3.1.3.1 Resolving Relative URIs

User agents MUST calculate the base URI for resolving relative URIs according to [RFC3986]. The following describes how RFC3986 applies to synthesis documents.

User agents MUST calculate the base URI according to the following precedences (highest priority to lowest):

1. The base URI is set by the xml:base attribute on the speak element (see Section 3.1.3).
2. The base URI is given by metadata discovered during a protocol interaction, such as an HTTP header (see [RFC2616]).
3. By default, the base URI is that of the current document. Not all synthesis documents have a base URI (e.g., a valid synthesis document may appear in an email and may not be designated by a URI). It is an error if such documents contain relative URIs.
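For example (the file names are hypothetical), given the following base URI declaration, the relative src below resolves to http://www.example.com/prompts/welcome.wav under the rules of [RFC3986]:

<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"
       xml:base="http://www.example.com/prompts/common.ssml">
  <audio src="welcome.wav"/>
</speak>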
3.1.4 Identifier: "xml:id" Attribute

The xml:id attribute [XML-ID] MAY be used in SSML to give an element an identifier that is unique to the document, allowing the element to be referenced from other documents. xml:id is a defined attribute for the lexicon, p, s, token, and w elements.
3.1.5 Lexicon Documents: "lexicon" and "lookup" Elements

An SSML document MAY reference one or more lexicon documents. A lexicon document is located by a URI with an OPTIONAL media type and is assigned a name that is unique in the SSML document.
3.1.5.1 "lexicon" Element

Any number of lexicon elements MAY occur as immediate children of the speak element.

The lexicon element MUST have a uri attribute specifying a URI that identifies the location of the lexicon document.

The lexicon element MUST have an xml:id attribute that assigns a name to the lexicon document. The name MUST be unique to the current SSML document. The scope of this name is the current SSML document.
The lexicon element MAY have a type attribute that specifies the media type of the lexicon document. The default value of the type attribute is application/pls+xml, the media type associated with Pronunciation Lexicon Specification [PLS] documents as defined in [RFC4267].
The lexicon element MAY have a fetchtimeout attribute that specifies the timeout for fetches. The value is a Time Designation. The default value is processor-specific.
The lexicon element MAY have a maxage attribute that indicates that the document is willing to use content whose age is no greater than the specified time (cf. 'max-age' in HTTP 1.1 [RFC2616]). The value is an xsd:nonNegativeInteger [SCHEMA2 §3.3.20]. The document is not willing to use stale content, unless maxstale is also provided.

The lexicon element MAY have a maxstale attribute that indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). The value is an xsd:nonNegativeInteger [SCHEMA2 §3.3.20]. If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified amount of time.
The lexicon element is an empty element.
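For illustration, a sketch of a lexicon element using the attributes above (the URI, name and timing values are invented):

<!-- Fetch within 3 seconds; accept cached copies up to a day old,
     or up to an hour past expiration. -->
<lexicon uri="http://www.example.com/names.pls"
         xml:id="names"
         type="application/pls+xml"
         fetchtimeout="3s"
         maxage="86400"
         maxstale="3600"/>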
If an error occurs in fetching or parsing a lexicon document, the synthesis processor MUST notify the hosting environment that such an error has occurred. The processor MAY notify the hosting environment immediately with an asynchronous event, or the processor MAY make the error notification through its logging system. The processor SHOULD include information about the error where possible; for example, if the lexicon couldn't be fetched due to an HTTP 404 error, that error code could be included with the notification. After notification, the processor MUST continue processing as if it had loaded an empty valid lexicon.
Details of the type attribute

Note: the description and examples that follow use an imaginary vendor-specific lexicon type of x-vnd.example.lexicon. This is intended to represent whatever format is returned/available, as appropriate.
A lexicon resource indicated by a URI reference may be available in one or more media types. The SSML author can specify the preferred media type via the type attribute. When the content represented by a URI is available in many data formats, a synthesis processor MAY use the preferred type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the type to order the preferences in the negotiation.
Upon delivery, the resource indicated by a URI reference may be considered in terms of two types. The declared media type is the alleged value for the resource, and the actual media type is the true format of its content. The actual type should be the same as the declared type, but this is not always the case (e.g. a misconfigured HTTP server might return text/plain for a document following the vendor-specific x-vnd.example.lexicon format). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. Whenever a type is returned, it is treated as authoritative. The declared media type is determined by the value returned by the resource owner or, if none is returned, by the preferred media type given in the SSML document.
Three special cases may arise. The declared type may not be supported by the processor; this is an error. The declared type may be supported but the actual type may not match; this is also an error. Finally, no media type may be declared; the behavior depends on the specific URI scheme and the capabilities of the synthesis processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616 §7.2.1]), the data scheme falls back to a default media type, and local file access defines no guidelines. The following examples are informative; in each case the actual media type of the lexicon resource is x-vnd.example.lexicon.

HTTP 1.1 request, server returns text/plain: the preferred media type from the SSML document is not applicable because the returned type is authoritative, so the declared media type is text/plain. The resource MUST be processed as text/plain. This will generate an error if text/plain is not supported or if the document does not follow the expected format.

HTTP 1.1 request, server returns x-vnd.example.lexicon: the returned type is again authoritative, so the declared media type is x-vnd.example.lexicon. The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor, otherwise an error.

Local file access, preferred media type x-vnd.example.lexicon from the SSML document: no type is returned, so the declared media type is x-vnd.example.lexicon. The declared and actual types match; success if x-vnd.example.lexicon is supported by the synthesis processor, otherwise an error.

Local file access, preferred media type application/pls+xml (the default): behavior is scheme specific; the synthesis processor might introspect the document to determine the type.
3.1.5.2 "lookup" Element

The lookup element MUST have a ref attribute. The ref attribute specifies a name that references a lexicon document as assigned by the xml:id attribute of the lexicon element.
The referenced lexicon document may contain information (e.g., pronunciation) for tokens that can appear in a text to be rendered. For PLS lexicon documents, the information contained within the PLS document MUST be used by the synthesis processor when rendering tokens that appear within the content of a lookup element. For non-PLS lexicon documents, the information contained within the lexicon document SHOULD be used by the synthesis processor when rendering tokens that appear within the content of a lookup element, although the processor MAY choose not to use the information if it is deemed incompatible with the content of the SSML document. For example, a vendor-specific lexicon may be used only for particular values of the interpret-as attribute of the say-as element, or for a particular set of voices. Vendors SHOULD document the expected behavior of the synthesis processor when SSML content refers to a non-PLS lexicon.
A lookup element MAY contain other lookup elements. When a lookup element contains other lookup elements, the child lookup elements have higher precedence. Precedence means that a token is first looked up in the lexicon with highest precedence. Only if the token is not found in that lexicon is it then looked up in the lexicon with the next lower precedence, and so on until the token is successfully found or until all lexicons have been used for lookup. It is assumed that the synthesis processor already has one or more built-in system lexicons, which will be treated as having a lower precedence than those specified using the lexicon and lookup elements. Note that if a token is not within the scope of at least one lookup element, then the token can only be looked up in the built-in system lexicons.
The lookup element can only contain text to be rendered and the following elements: audio, break, emphasis, lang, lookup, mark, p, phoneme, prosody, s, say-as, sub, token, voice, w.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.pls"
           xml:id="pls"/>
  <lexicon uri="http://www.example.com/strange-words.file"
           xml:id="sw"
           type="media-type"/>
  <lookup ref="pls">
    tokens here are looked up in lexicon.pls
    <lookup ref="sw">
      tokens here are looked up first in strange-words.file and then, if not found, in lexicon.pls
    </lookup>
    tokens here are looked up in lexicon.pls
  </lookup>
  tokens here are not looked up in lexicon documents
  ...
</speak>
3.1.6 "meta" Element

The metadata and meta elements are containers in which information about the document can be placed. The metadata element provides more general and powerful treatment of metadata information than meta by using a metadata schema.
A meta declaration associates a string with a declared meta property or declares "http-equiv" content. Either a name or http-equiv attribute is REQUIRED. It is an error to provide both name and http-equiv attributes. A content attribute is REQUIRED. The seeAlso property is the only defined meta property name. It is used to specify a resource that might provide additional metadata information about the content. This property is modeled on the seeAlso property of Resource Description Framework (RDF) Schema Specification 1.0 [RDF-SCHEMA §5.4.1]. The http-equiv attribute has a special significance when documents are retrieved via HTTP. Although the preferred method of providing HTTP header information is by using HTTP header fields, the "http-equiv" content MAY be used in situations where the SSML document author is unable to configure HTTP header fields associated with their document on the origin server, for example, cache control information. Note that HTTP servers and caches are not required to introspect the contents of meta in SSML documents and thereby override the header values they would send otherwise.
Informative: This is an example of how meta elements can be included in an SSML document to specify a resource that provides additional metadata information and also indicate that the document must not be cached.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/>
  <meta http-equiv="Cache-Control" content="no-cache"/>

  ... content to be rendered ...
</speak>
The meta element is an empty element.
3.1.7 metadata Element
The metadata element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with metadata, it is RECOMMENDED that the XML syntax of the Resource Description Framework (RDF) [RDF-XMLSYNTAX] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [DC].

The Resource Description Framework [RDF] is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [RDF-XMLSYNTAX] and [RDF-SCHEMA] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [DC], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).

Document properties declared with the metadata element can use any metadata schema.
Informative: This is an example of how metadata can be included in an SSML document using the Dublin Core version 1.0 RDF schema [DC] describing general document information such as title, description, date, and so on:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <metadata>
    <rdf:RDF
        xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
        xmlns:dc = "http://purl.org/dc/elements/1.1/">

      <!-- Metadata about the synthesis document -->
      <rdf:Description
          dc:title="Hamlet-like Soliloquy"
          dc:description="Aldine's Soliloquy in the style of Hamlet"
          dc:publisher="W3C"
          dc:language="en-US"
          dc:date="2002-11-29"
          dc:rights="Copyright 2002 Aldine Turnbet"
          dc:format="application/ssml+xml" >
        <dc:creator>
          <rdf:Seq>
            <rdf:li>William Shakespeare</rdf:li>
            <rdf:li>Aldine Turnbet</rdf:li>
          </rdf:Seq>
        </dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
</speak>
The metadata element can have arbitrary content, although none of the content will be rendered by the synthesis processor.
3.1.8 Text Structure
3.1.8.1 p and s Elements

A p element represents a paragraph. An s element represents a sentence.

xml:lang, xml:id, and onlangfailure are defined attributes on the p and s elements.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>
The use of p and s elements is OPTIONAL. Where text occurs without an enclosing p or s element the synthesis processor SHOULD attempt to determine the structure using language-specific knowledge of the format of plain text.
The p element can only contain text to be rendered and the following elements:
audio
break
emphasis
lang
lookup
mark
phoneme
prosody
s
say-as
sub
token
voice
w
The s element can only contain text to be rendered and the following elements:
audio
break
emphasis
lang
lookup
mark
phoneme
prosody
say-as
sub
token
voice
w
3.1.8.2 token and w Elements
The token element allows the author to indicate that its content is a token and to eliminate token (word) segmentation ambiguities of the synthesis processor.
The token element is necessary in order to render languages that do not use white space as a token boundary identifier, such as Chinese, Thai, and Japanese; that use white space for syllable segmentation, such as Vietnamese; and that use white space for other purposes, such as Urdu.
Use of this element can result in improved cues for prosodic control (e.g., pause) and may assist the synthesis processor in selection of the correct pronunciation for homographs. Other elements such as break, mark, and prosody are permitted within token to allow annotation at a sub-token level (e.g., syllable, mora, or whatever units are reasonable for the current language). Synthesis processors are REQUIRED to parse these annotations and MAY render them as they are able.
The text contents of the token element and its subelements are together considered to be one token for lexical lookup purposes as follows:
All markup within the token element is removed (leaving the contents of the markup).
All remaining text is concatenated together in the order in which it appears in the document.
Leading and trailing spaces are removed from this single block of text.
Multiple contiguous white space characters are converted into a single space.
The result is treated as a single token for lexical lookup purposes.
Thus, "happy" and " hap py" would refer to the tokens "happy" and "hap py", respectively. Note that this is different from how text and markup outside a token element are treated (see "Text normalization" in Section 1.2).
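Informative: the following sketch (not one of this specification's own examples) illustrates the steps above. The break annotation inside the token is removed in step 1 and the remaining text is concatenated, so the content is looked up as a single token:

  <token>hap<break time="100ms"/>py</token>
  <!-- the markup is removed and "hap" and "py" are concatenated,
       so lexical lookup uses the single token "happy" -->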
The use of token elements is OPTIONAL. Where text occurs without an enclosing token element the synthesis processor SHOULD attempt to determine the token segmentation using language-specific knowledge of the format of plain text.
xml:lang is a defined attribute on the token element to identify the written language of the content. xml:id is a defined attribute on the token element. onlangfailure is an OPTIONAL attribute specifying the desired behavior upon language speaking failure.
role is an OPTIONAL defined attribute on the token element. The role attribute takes as its value one or more white space separated QNames (as defined in Section 4 of Namespaces in XML (1.0 [XMLNS 1.0] or 1.1 [XMLNS 1.1], depending on the version of XML being used)). A QName in the attribute content is expanded into an expanded-name using the namespace declarations in scope for the containing token element. Thus, each QName provides a reference to a specific item in the designated namespace. In the second example below, the QName within the role attribute expands to the "VV0" item in the "http://www.example.com/claws7tags" namespace.

This mechanism allows for referencing defined taxonomies of word classes, with the expectation that they are documented at the specified namespace URI.

The role attribute is intended to be of use in synchronizing with other specifications, for example to describe additional information to help the selection of the most appropriate pronunciation for the contained text inside an external lexicon (see lexicon documents).
The token element can only contain text to be rendered and the following elements:
audio
break
emphasis
mark
phoneme
prosody
say-as
sub
The token element can only be contained in the following elements:
audio
emphasis
lang
lookup
p
prosody
s
speak
voice
The w element is an alias for the token element.
Here is an example showing the use of the token element.

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="zh-CN">

  <!-- Segmented as "Nanjing City | Yangtze River Bridge": -->
  <token>南京市</token><token>长江大桥</token>
  <!-- The ambiguous alternative, "Nanjing mayor | Jiang Daqiao": -->
  <token>南京市长</token><token>江大桥</token>
  <!-- "Shanghai is a metropolis" (大都会 = "metropolis"): -->
  上海是个<w>大都会</w>
  <!-- "Shanghai residents would mostly say so" (大都 = "mostly"): -->
  上海人<w>大都</w>会那么说
</speak>
The next example shows the use of the role attribute. The first document below is a sample lexicon (PLS) for the Chinese word "处". The second references this lexicon and shows how the role attribute may be used to select the appropriate pronunciation of the Chinese word "处" in the dialog.

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         xmlns:claws="http://www.example.com/claws7tags"
         alphabet="x-myorganization-pinyin"
         xml:lang="zh-CN">

  <lexeme role="claws:VV0">
    <grapheme>处</grapheme>
    <phoneme>chu3</phoneme>
  </lexeme>

  <lexeme role="claws:NN1">
    <grapheme>处</grapheme>
    <phoneme>chu4</phoneme>
  </lexeme>
</lexicon>

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xmlns:claws="http://www.example.com/claws7tags"
       xml:lang="zh-CN">

  <!-- the lexicon URI is a placeholder -->
  <lexicon uri="http://www.example.com/mylexicon.pls"
           type="application/pls+xml"
           xml:id="mylex"/>

  <lookup ref="mylex">
    <!-- "He is very hard to get along with" (相处, verb, chu3): -->
    他这个人很不好相<token role="claws:VV0">处</token>。
    <!-- "Photography is not allowed here" (此处, noun, chu4): -->
    此<token role="claws:NN1">处</token>不准照相。
  </lookup>
</speak>
3.1.9 say-as Element
The say-as element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.

Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.
The say-as element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always REQUIRED; the other two attributes are OPTIONAL. The legal values for the format attribute depend on the value of the interpret-as attribute.

The say-as element can only contain text to be rendered.
The interpret-as and format attributes
The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the OPTIONAL format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.

When specified, the interpret-as and format values are to be interpreted by the synthesis processor as hints provided by the markup document author to aid text normalization and pronunciation.
In all cases, the text enclosed by any say-as element is intended to be a standard, orthographic form of the language currently in context. A synthesis processor SHOULD be able to support the common, orthographic forms of the specified language for every content type that it supports.

When the value for the interpret-as attribute is unknown or unsupported by a processor, it MUST render the contained text as if no interpret-as value were specified.

When the value for the format attribute is unknown or unsupported by a processor, it MUST render the contained text as if no format value were specified, and SHOULD render it using the interpret-as value that is specified.
When the content of the say-as element contains additional text next to the content that is in the indicated format and interpret-as type, then this additional text MUST be rendered. The processor MAY make the rendering of the additional text dependent on the interpret-as type of the element in which it appears.

When the content of the say-as element contains no content in the indicated interpret-as type or format, the processor MUST render the content either as if the format attribute were not present, or as if the interpret-as attribute were not present, or as if neither the format nor interpret-as attributes were present. The processor SHOULD also notify the environment of the mismatch.

Indicating the content type or format does not necessarily affect the way the information is pronounced. A synthesis processor SHOULD pronounce the contained text in a manner in which such content is normally produced for the language.
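Informative: since this specification does not itself enumerate interpret-as or format values, the values "date" and "mdy" below are assumptions in the style of common usage, shown only to illustrate how the two hints combine:

  <say-as interpret-as="date" format="mdy">12/29/2002</say-as>
  <!-- hints that the contained text is a date in month-day-year
       order, which may guide text normalization of "12/29/2002" -->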
The detail attribute
The detail attribute is an OPTIONAL attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute MUST render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a synthesis processor will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.

The detail attribute can be used for all interpret-as types.

If the detail attribute is not specified, the level of detail that is produced by the synthesis processor depends on the text content and the language.

When the value for the detail attribute is unknown or unsupported by a processor, it MUST render the contained text as if no value were specified for the detail attribute.
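Informative: the following sketch uses an assumed interpret-as value ("characters") and an assumed vendor-defined detail value; neither is defined by this specification:

  <say-as interpret-as="characters" detail="punctuation">A-1.B</say-as>
  <!-- at this assumed higher level of detail the dash and period
       might be spoken explicitly, e.g. "A dash one point B" -->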
3.1.10 phoneme Element
The phoneme element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element MAY be empty. However, it is RECOMMENDED that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The ph attribute is a REQUIRED attribute that specifies the phoneme/phone string.

This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon (see Section 3.1.5), while values in say-as and sub may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.
The alphabet attribute is an OPTIONAL attribute that specifies the phonemic/phonetic pronunciation alphabet. A pronunciation alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" (see the next paragraph), values defined in the Pronunciation Alphabet Registry, and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". For example, the Japan Electronics and Information Technology Industries Association [JEITA] might wish to encourage the use of an alphabet such as "x-JEITA" or "x-JEITA-IT-4002" for their phoneme alphabet [JEIDAALPHABET].
Synthesis processors SHOULD support a value for alphabet of "ipa", corresponding to Unicode representations of the phonetic characters developed by the International Phonetic Association [IPA]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal ph values are strings of the values specified in Appendix 2 of [IPAHNDBK]; note that an IPA transcription may contain white space characters to assist readability, which have no implications for the pronunciation. Informative tables of the IPA-to-Unicode mappings can be found at [IPAUNICODE1] and [IPAUNICODE2]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet,
The processor MUST syntactically accept all legal ph values.
The processor SHOULD produce output when given Unicode IPA codes that can reasonably be considered to belong to the current language.
The production of output when given other codes is entirely at processor discretion.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- the ph value is an illustrative IPA transcription -->
  <phoneme alphabet="ipa" ph="təˈmeɪtoʊ"> tomato </phoneme>
</speak>
It is an error if a value for alphabet is specified that is not known or cannot be applied by a synthesis processor. The default behavior when the alphabet attribute is left unspecified is processor-specific.
The type attribute is an OPTIONAL attribute that indicates additional information about how the pronunciation information is to be interpreted. The only allowed values for this attribute are "default", which has no implications, and "ruby", which indicates that the pronunciation information is from ruby text [RUBY]. The default value of this attribute is "default".
The phoneme element itself can only contain text (no elements).
3.1.10.1 Pronunciation Alphabet Registry

Links to the Pronunciation Alphabet Registry can be found on the SSML namespace page at http://www.w3.org/2001/10/synthesis.
3.1.11 sub Element
The sub element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The REQUIRED alias attribute specifies the string to be spoken instead of the enclosed string. The processor SHOULD apply text normalization to the alias value.

The sub element can only contain text (no elements).

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <sub alias="World Wide Web Consortium">W3C</sub>
</speak>
3.1.12 lang Element
The lang element is used to specify the natural language of the content.

xml:lang is a REQUIRED attribute specifying the language of the content. onlangfailure is an OPTIONAL attribute specifying the desired behavior upon language speaking failure.

This element MAY be used when there is a change in the natural language. There is no text structure associated with the language change indicated by the lang element. It MAY be used to specify the language of the content at a level other than a paragraph, sentence or word level. When language change is to be associated with text structure, it is RECOMMENDED to use the xml:lang attribute on the respective p, s, token, or w element.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  The French word for cat is <lang xml:lang="fr">chat</lang>.
  He prefers to eat pasta that is <lang xml:lang="it">al dente</lang>.
</speak>
The lang element can only contain text to be rendered and the following elements:
audio
break
emphasis
lang
lookup
mark
p
phoneme
prosody
s
say-as
sub
token
voice
w
3.1.13 Language Speaking Failure: onlangfailure Attribute
The onlangfailure attribute is an OPTIONAL attribute that contains one value from the following enumerated list describing the desired behavior of the synthesis processor upon language speaking failure. A conforming synthesis processor MUST report a language speaking failure in addition to taking the action(s) below.
changevoice - if a voice exists that can speak the language, the synthesis processor will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang).
ignoretext - the synthesis processor will not attempt to render the text that is in the failed language.
ignorelang - the synthesis processor will ignore the change in language and speak as if the content were in the previous language.
processorchoice - the synthesis processor chooses the behavior (either changevoice, ignoretext, or ignorelang).
A language speaking failure occurs whenever the synthesis processor decides that the currently-selected voice (see Section 3.2.1) cannot speak the declared language of the text. This can occur when the synthesis processor encounters a new xml:lang value or characters or character sequences that the voice does not know how to process.

The value of this attribute is inherited down the document hierarchy, i.e. it needs to be given only once if the desired behavior for the whole document is the same, and settings of this value nest, i.e. inner attributes overwrite outer attributes. The top-level default value for this attribute is "processorchoice". Other languages which embed fragments of SSML (without a speak element) MUST declare the top-level default value for this attribute.
onlangfailure is permitted on all elements which can contain xml:lang, so it is a defined attribute for the speak, lang, desc, p, s, token, and w elements.
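Informative: a minimal sketch of the attribute in use; the Japanese phrase is illustrative:

  <lang xml:lang="ja" onlangfailure="ignoretext">こんにちは</lang>
  <!-- "Hello" in Japanese: if the current voice cannot speak
       Japanese, the phrase is skipped rather than mispronounced,
       and the language speaking failure is still reported -->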
3.2 Prosody and Style

3.2.1 voice Element

The voice element is a production element that requests a change in speaking voice. There are two kinds of attributes for the voice element: those that indicate desired features of a voice and those that control behavior. The voice feature attributes are:
gender: OPTIONAL attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "male", "female", "neutral", or the empty string "".
age: OPTIONAL attribute indicating the preferred age in years (since birth) of the voice to speak the contained text. Acceptable values are of type xsd:nonNegativeInteger [SCHEMA2 §3.3.20] or the empty string "".
variant: OPTIONAL attribute indicating a preferred variant of the other voice characteristics to speak the contained text (e.g. the second male child voice). Valid values of variant are of type xsd:positiveInteger [SCHEMA2 §3.3.25] or the empty string "".
name: OPTIONAL attribute indicating a processor-specific voice name to speak the contained text. The value MAY be a space-separated list of names ordered from top preference down, or the empty string "". As a result a name MUST NOT contain any white space.
languages: OPTIONAL attribute indicating the list of languages the voice is desired to speak. The value MUST be either the empty string "" or a space-separated list of languages, with OPTIONAL accent indication per language. Each language/accent pair is of the form "language" or "language:accent", where both language and accent MUST be an Extended Language Range [BCP47, Matching of Language Tags §2.2], except that the values "und" and "zxx" are disallowed. A voice satisfies the languages feature if, for each language/accent pair in the list,
the voice is documented (see Voice descriptions) as reading/speaking a language that matches the Extended Language Range given by language according to the Extended Filtering matching algorithm [BCP47, Matching of Language Tags §3.3.2], and
if an accent is given, the voice is documented (see Voice descriptions) as reading/speaking the language above with an accent that matches the Extended Language Range given by accent according to the Extended Filtering matching algorithm [BCP47, Matching of Language Tags §3.3.2], except that the script and extension subtags of the accent MUST be ignored by the synthesis processor. It is recommended that authors and voice providers do not use the script or extension subtags for accents because they are not relevant for speaking.
For example, a languages value of "en:pt fr:ja" can legally be matched by any voice that can both read English (speaking it with a Portuguese accent) and read French (speaking it with a Japanese accent). Thus, a voice that only supports "en-US" with a "pt-BR" accent and "fr-CA" with a "ja" accent would match. As another example, if we have <voice languages="fr:pt"> and there is no voice that supports French with a Portuguese accent, then a voice selection failure will occur. Note that if no accent indication is given for a language, then any voice that speaks the language is acceptable, regardless of accent. Also, note that author control over language support during voice selection is independent of any value of xml:lang in the text.
For the feature attributes above, an empty string value indicates that any voice will satisfy the feature. The top-level default value for all feature attributes is "", the empty string.
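Informative: a minimal sketch of the languages feature attribute; the contained sentence is illustrative:

  <voice languages="en fr:ja">
    This text requests a voice that can read English with any accent
    and French with a Japanese accent.
  </voice>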
The behavior control attributes of voice are:
required: OPTIONAL attribute that specifies a set of features by their respective attribute names. This set of features is used by the voice selection algorithm described below. Valid values of required are a space-separated list composed of values from the list of feature names: "name", "languages", "gender", "age", "variant", or the empty string "". The default value for this attribute is "languages".
ordering: OPTIONAL attribute that specifies the priority ordering of features. Valid values of ordering are a space-separated list composed of values from the list of feature names: "name", "languages", "gender", "age", "variant", or the empty string "", where features named earlier in the list have higher priority. The default value for this attribute is "languages". Features not listed in the ordering list have equal priority to each other but lower than that of the last feature in the list. Note that if the ordering attribute is set to the empty string then all features have the same priority.
onvoicefailure: OPTIONAL attribute containing one value from the following enumerated list describing the desired behavior of the synthesis processor upon voice selection failure. The default value for this attribute is "priorityselect".
priorityselect - the synthesis processor uses the values of all voice feature attributes to select a voice by feature priority, where the starting candidate set is the set of all available voices.
keepexisting - the voice does not change.
processorchoice - the synthesis processor chooses the behavior (either priorityselect or keepexisting).
The following voice selection algorithm MUST be used:
All available voices are identified for which the values of all voice feature attributes listed in the required attribute value are matched. When the value of the required attribute is the empty string "", any and all voices are considered successful matches. If one or more voices are identified, the selection is considered successful; otherwise there is voice selection failure.
If a successful selection identifies only one voice, the synthesis processor MUST use that voice.
If a successful selection identifies more than one voice, the remaining features (those not listed in the required attribute value) are used to choose a voice by feature priority, where the starting candidate set is the set of all voices identified.
If there is voice selection failure, a conforming synthesis processor MUST report the voice selection failure in addition to taking the action(s) expressed by the value of the onvoicefailure attribute.
To choose a voice by feature priority, each feature is taken in turn starting with the highest priority feature, as controlled by the ordering attribute.
If at least one voice matches the value of the current voice feature attribute then all voices not matching that value are removed from the candidate set. If a single voice remains in the candidate set the synthesis processor MUST use it. If more than one voice remains in the candidate set then the next priority feature is examined for the candidate set.
If no voices match the value of the current voice feature attribute then the next priority feature is examined for the candidate set.
After examining all feature attributes on the ordering list, if multiple voices remain in the candidate set, the synthesis processor MUST use any one of them.
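Informative: a hypothetical walkthrough of the algorithm; the voice inventory described in the comment is assumed, not taken from this specification:

  <voice gender="female" age="6" required="gender" ordering="age">
    Some text intended for a young female voice.
  </voice>
  <!-- Only gender is required: if no female voice exists at all,
       there is a voice selection failure and the onvoicefailure
       behavior applies. Otherwise, among the female voices, those
       matching age="6" are kept if any match; if several candidates
       remain after the ordering list is exhausted, the processor
       may use any one of them. -->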
Although each attribute individually is optional, it is an error if no attributes are specified when the voice element is used.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <voice gender="female">Mary had a little lamb,</voice>
  <!-- now request a different female voice -->
  <voice gender="female" variant="2">
    Its fleece was white as snow.
  </voice>
  <!-- processor-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
</speak>
Voice descriptions

For every voice made available to a synthesis processor, the vendor of the voice must document the following:
a list of language tags [BCP47, Tags for Identifying Languages] representing the languages the voice can read.
for each language, a language tag [BCP47, Tags for Identifying Languages] representing the accent the voice uses when reading the language.
Although indication of language (using xml:lang) and selection of voice (using voice) are independent, there is no requirement that a synthesis processor support every possible combination of values of the two. However, a synthesis processor MUST document expected rendering behavior for every possible combination. See the onlangfailure attribute for information on what happens when the processor encounters text content that the voice cannot speak.

voice attributes are inherited down the tree including to within elements that change the language. The defaults described for each attribute only apply at the top (document) level and are overridden by explicit author use of the voice element. In addition, changes in voice are scoped and apply only to the content of the element in which the change occurred. When processing reaches the end of a voice element's content, i.e. the closing </voice> tag, the voice in effect before the beginning tag is restored. Similarly, if a voice is changed by the processor as a result of a language speaking failure, the prior voice is restored when that voice is again able to speak the content. Note that there is always an active voice, since the synthesis processor is required to select a default voice before beginning execution of the document (see section 3.1.1).

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <voice gender="female">
    Any female voice here.
    <voice age="6">
      A female child voice here.
      <lang xml:lang="ja">
        <!-- A female child voice speaking Japanese here. -->
      </lang>
    </voice>
  </voice>
</speak>
Relative changes in prosodic parameters SHOULD be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice. The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.
The voice element can only contain text to be rendered and the following elements:
audio
break
emphasis
lang
lookup
mark
p
phoneme
prosody
s
say-as
sub
token
voice
w
3.2.2 emphasis Element
The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesis processor determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:
level: the OPTIONAL level attribute indicates the strength of emphasis to be applied. Defined values are "strong", "moderate", "none" and "reduced". The default level is "moderate". The meaning of "strong" and "moderate" emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The "reduced" level is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The "none" level is used to prevent the synthesis processor from emphasizing words that it might typically emphasize. The values "none", "moderate", and "strong" are monotonically non-decreasing in strength.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  That is a <emphasis>big</emphasis> car!
  That is a <emphasis level="strong">huge</emphasis>
  bank account!
</speak>
The emphasis element can only contain text to be rendered and the following elements:
audio
break
emphasis
lang
lookup
mark
phoneme
prosody
say-as
sub
token
voice
w
3.2.3 break Element
The break element is an empty element that controls the pausing or other prosodic boundaries between tokens. The use of the break element between any pair of tokens is OPTIONAL. If the element is not present between tokens, the synthesis processor is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:
strength: the strength attribute is an OPTIONAL attribute having one of the following values: "none", "x-weak", "weak", "medium" (default value), "strong", or "x-strong". This attribute is used to indicate the strength of the prosodic break in the speech output. The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break which the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses. "x-weak" and "x-strong" are mnemonics for "extra weak" and "extra strong", respectively.
time: the time attribute is an OPTIONAL attribute indicating the duration of a pause to be inserted in the output in seconds or milliseconds. It follows the time value format from the Cascading Style Sheets Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
The strength attribute is used to indicate the prosodic strength of the break. For example, the breaks between paragraphs are typically much stronger than the breaks between words within a sentence. The synthesis processor MAY insert a pause as part of its implementation of the prosodic break. A pause of a specific length can also be inserted by using the time attribute.

If a break element is used with neither strength nor time attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no break element was supplied.

If both strength and time attributes are supplied, the processor will insert a break with a duration as specified by the time attribute, with other prosodic changes in the output based on the value of the strength attribute.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  Take a deep breath <break/>
  then continue.
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you! <break strength="weak"/> Please repeat.
</speak>
3.2.4 prosody Element
The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all OPTIONAL, are:
pitch: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a number followed by "Hz", a relative change, or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch levels.
contour: sets the actual pitch contour for the contained text. The format is specified in Pitch contour below.
range: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a number followed by "Hz", a relative change, or "x-low", "low", "medium", "high", "x-high", or "default". Labels "x-low" through "x-high" represent a sequence of monotonically non-decreasing pitch ranges.
rate: a change in the speaking rate for the contained text. Legal values are: a non-negative percentage or "x-slow", "slow", "medium", "fast", "x-fast", or "default". Labels "x-slow" through "x-fast" represent a sequence of monotonically non-decreasing speaking rates. When the value is a non-negative percentage it acts as a multiplier of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice SHOULD be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.
duration: a value in seconds or milliseconds for the desired time to take to read the contained text. Follows the time value format from the Cascading Style Sheets Level 2 Recommendation [CSS2], e.g. "250ms", "3s".
volume: the volume for the contained text. Legal values are: a number preceded by "+" or "-" and immediately followed by "dB"; or "silent", "x-soft", "soft", "medium", "loud", "x-loud", or "default". The default is +0.0dB. Specifying a value of "silent" amounts to specifying minus infinity decibels (dB). Labels "silent" through "x-loud" represent a sequence of monotonically non-decreasing volume levels. When the value is a signed number (dB), it specifies the ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0), and is defined in terms of dB:

volume(dB) = 20 log10(a1 / a0)

Note that all numerical volume levels (in dB) are relative to the current level and that they are always signed (including zero). Also note that once the current volume level is set to "silent" all child relative changes also result in silence. A child prosody element MAY use the label "default" to reset the current volume level.
So that for a value of:
"silent", the contained text is read silently;
'-6.0dB', the contained text is read at approximately half the amplitude of the current signal amplitude (since 20 log10(0.5) ≈ -6.02 dB);
'-0dB', the contained text is read with no relative change in volume;
'+6.0dB', the contained text is read at approximately twice the amplitude of the current signal amplitude.
Note that the behavior of this attribute for label values may differ from that of numerical values. Use of a numerical value causes direct modification of the waveform, while use of a label value may result in prosodic modifications that more accurately reflect how a human being would increase or decrease the perceived loudness of his speech, e.g., adjusting frequency and power differently for different sound units.
Although each attribute individually is optional, it is an error if no attributes are specified when the prosody element is used. The "x-foo" attribute value names are intended to be mnemonics for "extra foo". All units ("Hz", "st") are case-sensitive. Note also that customary pitch levels and standard pitch ranges may vary significantly by language, as may the meanings of the labelled values for pitch targets and ranges.
Here is an example of how to use the volume attribute:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <prosody volume="+0dB">
    I am speaking this at the default volume for this voice.
  </prosody>
  <prosody volume="+6dB">
    I am speaking this at approximately twice the original signal amplitude.
  </prosody>
  <prosody volume="-6dB">
    I am speaking this at approximately half the original signal amplitude.
  </prosody>
</speak>
Number: A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.

Non-negative percentage: A non-negative percentage is an unsigned number immediately followed by "%".
Relative values: Relative changes for the attributes above can be specified
as a percentage (a number preceded by "+" or "-" and followed by "%"), e.g. "+15.2%", "-8.0%", or
as a relative number: for the pitch and range attributes, relative changes can be given in semitones (a number preceded by "+" or "-" and followed by "st") or in Hertz (a number preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz". A semitone is half of a tone (a half step) on the standard diatonic scale.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- the relative pitch change is an illustrative value -->
  The price of XYZ is <prosody pitch="-10%">$45</prosody>
</speak>
Pitch contour: The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form (time position,target), the first value is a percentage of the period of the contained text (a number followed by "%") and the second value is the value of the pitch attribute (a number followed by "Hz", a relative change, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
    good morning
  </prosody>
</speak>
The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.

The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.
The prosody element can only contain text to be rendered and the following elements:
audio
break
emphasis
lang
lookup
mark
p
phoneme
prosody
s
say-as
sub
token
voice
w
Limitations

All prosodic attribute values are indicative. If a synthesis processor is unable to accurately render a document as specified (e.g., trying to set the pitch to 1 MHz or the speaking rate to 1,000,000 words per minute), it MUST make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and MAY inform the host environment when such limits are exceeded.

In some cases, synthesis processors MAY elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units MAY reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.
3.3 Other Elements

3.3.1 audio Element
The audio element supports the insertion of recorded audio files (see Appendix A for REQUIRED formats) and the insertion of other audio formats in conjunction with synthesized speech output. The audio element MAY be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content MAY include text, speech markup, desc elements, or other audio elements. The alternate content MAY also be used when rendering the document to non-audible output and for accessibility (see the desc element). In addition to the OPTIONAL attributes described in subsections below, audio has the following attributes:
src
  Required: false. Type: URI. Default: none.
  The URI of a document with an appropriate media type. If absent, the audio element behaves as if src were present with a legal URI but the document could not be fetched.

fetchtimeout
  Required: false. Type: Time Designation. Default: processor-specific.
  The timeout for fetches.

fetchhint
  Required: false. Type: the value "prefetch" or the value "safe". Default: prefetch.
  This tells the synthesis processor whether or not it can attempt to optimize rendering by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before, or prefetch to permit, but not require, the processor to pre-fetch the audio.

maxage
  Required: false. Type: xsd:nonNegativeInteger. Default: none.
  Indicates that the document is willing to use content whose age is no greater than the specified time (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided.

maxstale
  Required: false. Type: xsd:nonNegativeInteger. Default: none.
  Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified amount of time.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- the src value is a placeholder -->
  Please say your name after the tone. <audio src="beep.wav"/>
</speak>
An audio element is successfully rendered by:
Playing the referenced audio source successfully
If the referenced audio source fails to play, rendering the alternative content
Additionally, if the processor can detect that text-only output is required then it MAY render the alternative content
When attempting to play the audio source a number of different issues may arise, such as mismatched media types or bad header information about the media. In general the synthesis processor makes a best effort to play the referenced media and, when unsuccessful, the processor MUST play the alternative content. Note the processor MUST NOT render both all or part of the referenced media and all or part of the referenced alternative content. If any of the referenced media is processed and rendered then the playback is considered a successful playback within the context of this section. If an error occurs that causes the alternative content to be rendered instead of the referenced media, the processor MUST notify the hosting environment that such an error has occurred. The processor MAY notify the hosting environment immediately with an asynchronous event, or the processor MAY notify the hosting environment only at the end of playback when it signals to the hosting environment that it has completed rendering the request, or the processor MAY make the error notification through its logging system. The processor SHOULD include information about the error where possible; for example, if the media resource couldn't be fetched due to an HTTP 404 error, that error code could be included with the notification.
The audio element can only contain text to be rendered and the following elements:
audio
break
desc
emphasis
lang
lookup
mark
p
phoneme
prosody
s
say-as
sub
token
voice
w
3.3.1.1 Trimming Attributes

Trimming attributes define the span of the audio to be rendered. Both the start and the end of the span within the audio content can be specified using time offsets. The duration of the span, including repetitions, can also be specified with repeat attributes. Synthesis processor support for these attributes is REQUIRED in the Extended profile.

The following trimming attributes are defined for audio:
clipBegin
  Required: false. Type: Time Designation. Default: 0s.
  Offset from start of media to begin rendering. This offset is measured in normal media playback time from the beginning of the media.

clipEnd
  Required: false. Type: Time Designation. Default: none.
  Offset from start of media to end rendering. This offset is measured in normal media playback time from the beginning of the media.

repeatCount
  Required: false. Type: a positive Real Number. Default: 1.
  Number of iterations of media to render. A fractional value describes a portion of the rendered media.

repeatDur
  Required: false. Type: Time Designation. Default: none.
  Total duration for repeatedly rendering media. This duration is measured in normal media playback time from the beginning of the media.
Calculations of rendered durations and interaction with other timing properties follow SMIL "Computing the active duration" [SMIL3], where audio is a time container. Time Designation values for clipBegin, clipEnd, and repeatDur are a subset of SMIL Clock-values.
If the length of an audio clip is not known in advance then it is treated as indefinite. Consequently repeatCount will have no effect.
If clipEnd is after the end of the audio, then rendering ends at the audio end.
If clipBegin is after clipEnd, no audio will be produced.
repeatDur takes precedence over repeatCount in determining the total time for rendering media.
Note that not all SMIL Timing features are supported.
Real Numbers

Real numbers and integers are specified in decimal notation only. An integer consists of one or more digits "0" to "9". A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign.

Time Designation

Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are:
ms: milliseconds
s: seconds
Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s".
Examples
In the following example, rendering of the media begins 10 seconds into the audio:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <audio src="radio.wav" clipBegin="10s"/>
</speak>
Here the rendering of the media ends after 20 seconds of audio:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <audio src="radio.wav" clipEnd="20s"/>
</speak>
Note that if the duration of "radio.wav" is less than 20 seconds, the clipEnd value is ignored, and the rendering end is set equal to the effective end of the media.
In the following example, the duration of the audio is constrained by repeatCount:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- assume a 3-second clip; the src value is a placeholder -->
  <audio src="clip.wav" repeatCount="0.5"/>
</speak>
Only the first half of the clip will play; the active duration will be 1.5 seconds.
In the following example, the audio will repeat for a total of 7 seconds. It will play fully two times, followed by a fractional part of 2 seconds. This is equivalent to a repeatCount of 2.8.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- assume a 2.5-second clip; the src value is a placeholder -->
  <audio src="clip.wav" repeatDur="7s"/>
</speak>
In the following example, the active duration of the audio will be 4 seconds. Playback will start 1 second into the audio (as specified by the clipBegin value) and then play for 1 second (since clipEnd is specified as 2 seconds), and then this span will be repeated so that the total duration is 4 seconds (as specified by repeatDur). Note that the value of repeatDur takes precedence over the value of repeatCount.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- the src and repeatCount values are placeholders -->
  <audio src="clip.wav" clipBegin="1s" clipEnd="2s"
         repeatDur="4s" repeatCount="10"/>
</speak>
These attributes can interact with the rendering specified by speak trimming attributes:
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"
       startmark="music_start" endmark="music_end">
  <!-- the mark names are reconstructed placeholders -->
  This introduction is outside the marks and is not rendered.
  <mark name="music_start"/>
  <audio src="15second_music.mp3" clipBegin="2s" clipEnd="7s"/>
  <mark name="music_end"/>
</speak>
The speak startmark and endmark allow only the "15second_music.mp3" clip to be played. The actual duration of the audio is 5 seconds: the clip begins at 2 seconds into the audio and ends after 7 seconds, hence a duration of 5 seconds.
3.3.1.2 soundLevel Attribute

The soundLevel attribute specifies the relative volume of the referenced audio. It is inspired by the similarly-named attribute in SMIL [SMIL3]. Synthesis processor support for this attribute is REQUIRED in the Extended profile.
soundLevel
  Required: false. Type: signed ("+" or "-") CSS2 number immediately followed by "dB". Default: +0.0dB.

Decibel values are interpreted as a ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0) and are defined in terms of dB:

soundLevel(dB) = 20 log10(a1 / a0)

A setting of a large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half the amplitude of its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice the amplitude of its current signal amplitude (subject to hardware limitations). The absolute sound level of media perceived is further subject to system volume settings, which cannot be controlled with this attribute.
Here is an example of how to use the soundLevel attribute:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- the src values are placeholders -->
  This is the original, unmodified waveform:
  <audio src="clip.wav"/>
  This is the same audio at approximately twice the signal amplitude:
  <audio src="clip.wav" soundLevel="+6dB"/>
  This is the same audio at approximately half the original signal amplitude:
  <audio src="clip.wav" soundLevel="-6dB"/>
</speak>
3.3.1.3 speed Attribute

The speed attribute controls the playback speed of the referenced audio, to speed up or slow down the effective rate of play relative to the original speed of the waveform. The argument value does not specify an absolute play speed, but rather is relative to the playback speed of the original waveform. Synthesis processor support for this attribute is REQUIRED in the Extended profile.
speed
  Required: false. Type: x% (where x is a positive real value). Default: 100%, which corresponds to the speed of an unmodified audio waveform.
  The speed at which to play the referenced audio, relative to the original speed. The speed is set to the requested percentage of the speed of the original waveform.

A change in the value of the speed attribute will change the rate at which recorded samples are played back. Note that this will affect the pitch.
Here is an example of how to use the speed attribute:

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- the src values are placeholders -->
  This is the original, unmodified waveform:
  <audio src="clip.wav"/>
  This is the same audio at twice the speed:
  <audio src="clip.wav" speed="200%"/>
  This is the same audio at half the original speed:
  <audio src="clip.wav" speed="50%"/>
</speak>
3.3.2 mark Element

The mark element is an empty element that places a marker into the text/tag sequence. It has one REQUIRED attribute, name, which is of type xsd:token [SCHEMA2 §3.3.2]. The mark element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a mark element, a synthesis processor MUST do one or both of the following:
inform the hosting environment with the value of the name attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.
when audio output of the SSML document reaches the mark, issue an event that includes the REQUIRED name attribute of the element. The hosting environment defines the destination of the event.

The mark element does not affect the speech output process.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  Go from <mark name="here"/> here, to <mark name="there"/> there!
</speak>
3.3.3 desc Element

The desc element can only occur within the content of the audio element. When the audio source referenced in audio is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a desc element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the synthesis processor, the content of the desc element(s) SHOULD be rendered instead of other alternative content in audio. The OPTIONAL xml:lang attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. The OPTIONAL onlangfailure attribute can be used to specify the desired behavior upon language speaking failure.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- the src values and fallback wording are reconstructed placeholders -->
  Heads of State often make mistakes when speaking in a foreign language.
  One of the most well-known examples is that of John F. Kennedy:
  <audio src="ichbineinberliner.wav">
    If you could hear it, this would be a recording of John F. Kennedy.
    <desc>Kennedy's famous German language gaffe</desc>
  </audio>
  Here's the same thing again but with a different fallback:
  <audio src="ichbineinberliner.wav">
    <lang xml:lang="de">Ich bin ein Berliner.</lang>
    <desc>Kennedy's famous German language gaffe</desc>
  </audio>
</speak>
The desc element can only contain descriptive text.
4. References

4.1 Normative References
[BCP47] Tags for Identifying Languages and Matching of Language Tags, A. Phillips and M. Davis, Editors. IETF, September 2009. Available at http://www.rfc-editor.org/bcp/bcp47.txt.

[CSS2] Cascading Style Sheets, level 2: CSS2 Specification, B. Bos, et al., Editors. World Wide Web Consortium, 12 May 1998. This version of the CSS2 Recommendation is http://www.w3.org/TR/1998/REC-CSS2-19980512/. The latest version of CSS2 is available at http://www.w3.org/TR/CSS2/. Note this reference may be revised when the CSS3 Speech Module becomes a W3C Recommendation.

[IPAHNDBK] Handbook of the International Phonetic Association, International Phonetic Association, Editors. Cambridge University Press, July 1999. Information on the Handbook is available at http://www.langsci.ucl.ac.uk/ipa/handbook.html.

[PLS] Pronunciation Lexicon Specification (PLS) Version 1.0, P. Baggia, Editor. World Wide Web Consortium, 14 October 2008. This version of the PLS Recommendation is http://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/. The latest version of PLS is available at http://www.w3.org/TR/pronunciation-lexicon/.

[RFC1521] MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies, N. Borenstein and N. Freed, Editors. IETF, September 1993. This RFC is available at http://www.ietf.org/rfc/rfc1521.txt.

[RFC2045] Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2045.txt.

[RFC2046] Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2046.txt.

[RFC2119] Key words for use in RFCs to Indicate Requirement Levels, S. Bradner, Editor. IETF, March 1997. This RFC is available at http://www.ietf.org/rfc/rfc2119.txt.

[RFC3986] Uniform Resource Identifier (URI): Generic Syntax, T. Berners-Lee et al., Editors. IETF, January 2005. This RFC is available at http://www.ietf.org/rfc/rfc3986.txt.

[RFC3987] Internationalized Resource Identifiers (IRIs), M. Duerst and M. Suignard, Editors. IETF, January 2005. This RFC is available at http://www.ietf.org/rfc/rfc3987.txt.

[RFC4267] The W3C Speech Interface Framework Media Types: application/voicexml+xml, application/ssml+xml, application/srgs, application/srgs+xml, application/ccxml+xml, and application/pls+xml, M. Froumentin, Editor. IETF, November 2005. This RFC is available at http://www.ietf.org/rfc/rfc4267.txt.

[SCHEMA1] XML Schema Part 1: Structures Second Edition, H. S. Thompson, et al., Editors. World Wide Web Consortium, 28 October 2004. This version of the XML Schema Part 1 Recommendation is http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/. The latest version of XML Schema 1 is available at http://www.w3.org/TR/xmlschema-1/.

[SCHEMA2] XML Schema Part 2: Datatypes Second Edition, P.V. Biron and A. Malhotra, Editors. World Wide Web Consortium, 28 October 2004. This version of the XML Schema Part 2 Recommendation is http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/. The latest version of XML Schema 2 is available at http://www.w3.org/TR/xmlschema-2/.

[SMIL3] Synchronized Multimedia Integration Language (SMIL 3.0), D. Bulterman, et al., Editors. World Wide Web Consortium, 1 December 2008. This version of the SMIL 3 Recommendation is http://www.w3.org/TR/2008/REC-SMIL3-20081201/. The latest version of SMIL3 is available at http://www.w3.org/TR/SMIL3/.

[TYPES] MIME Media Types, IANA. This continually-updated list of media types registered with IANA is available at http://www.iana.org/assignments/media-types/index.html.

[XML 1.0] Extensible Markup Language (XML) 1.0 (Fifth Edition), T. Bray et al., Editors. World Wide Web Consortium, 26 November 2008. This version of the XML 1.0 Recommendation is http://www.w3.org/TR/2008/REC-xml-20081126/. The latest version of XML 1.0 is available at http://www.w3.org/TR/xml/.

[XML 1.1] Extensible Markup Language (XML) 1.1 (Second Edition), T. Bray et al., Editors. World Wide Web Consortium, 16 August 2006. This version of the XML 1.1 Recommendation is http://www.w3.org/TR/2006/REC-xml11-20060816/. The latest version of XML 1.1 is available at http://www.w3.org/TR/xml11/.

[XML-BASE] XML Base (Second Edition), J. Marsh and R. Tobin, Editors. World Wide Web Consortium, 28 January 2009. This version of the XML Base Recommendation is http://www.w3.org/TR/2009/REC-xmlbase-20090128/. The latest version of XML Base is available at http://www.w3.org/TR/xmlbase/.

[XML-ID] xml:id Version 1.0, J. Marsh et al., Editors. World Wide Web Consortium, 9 September 2005. This version of the xml:id Recommendation is http://www.w3.org/TR/2005/REC-xml-id-20050909/. The latest version of xml:id is available at http://www.w3.org/TR/xml-id/.

[XMLNS 1.0] Namespaces in XML 1.0 (Third Edition), T. Bray et al., Editors. World Wide Web Consortium, 8 December 2009. This version of the XML Namespaces 1.0 Recommendation is http://www.w3.org/TR/2009/REC-xml-names-20091208/. The latest version of XML Namespaces 1.0 is available at http://www.w3.org/TR/REC-xml-names/.

[XMLNS 1.1] Namespaces in XML 1.1 (Second Edition), T. Bray et al., Editors. World Wide Web Consortium, 16 August 2006. This version of the XML Namespaces 1.1 Recommendation is http://www.w3.org/TR/2006/REC-xml-names11-20060816/. The latest version of XML Namespaces 1.1 is available at http://www.w3.org/TR/xml-names11/.
latest version of XML Namespaces 1.1
is available at http://www.w3.org/TR/xml-names11/.
4.2 Informative References
[DC]
Dublin Core Metadata Initiative. See http://dublincore.org/.
[HTML]
HTML 4.01 Specification, D. Raggett et al., Editors. World Wide Web Consortium, 24 December 1999. This version of the HTML 4 Recommendation is http://www.w3.org/TR/1999/REC-html401-19991224/. The latest version of HTML 4 is available at http://www.w3.org/TR/html4/.
[IPA]
International Phonetic Association. See http://www.langsci.ucl.ac.uk/ipa/ for the organization's website.
[IPAUNICODE1]
The International Phonetic Alphabet, J. Esling. This table of IPA characters in Unicode is available at http://web.uvic.ca/ling/resources/ipa/charts/unicode_ipa-chart.htm.
[IPAUNICODE2]
The International Phonetic Alphabet in Unicode, J. Wells. This table of Unicode values for IPA characters is available at http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.
[JEIDAALPHABET]
JEIDA-62-2000 Phoneme Alphabet, JEITA. An abstract of this document (in Japanese) is available at http://it.jeita.or.jp/document/publica/standard/summary/JEIDA-62-2000.pdf.
[JEITA]
Japan Electronics and Information Technology Industries Association. See http://www.jeita.or.jp/.
[JSML]
JSpeech Markup Language, A. Hunt, Editor. World Wide Web Consortium, 5 June 2000. Copyright ©2000 Sun Microsystems, Inc. This version of the JSML submission is http://www.w3.org/TR/2000/NOTE-jsml-20000605/. The latest W3C Note of JSML is available at http://www.w3.org/TR/jsml/.
[LEX]
Pronunciation Lexicon Markup Requirements, P. Baggia and F. Scahill, Editors. World Wide Web Consortium, 29 October 2004. This document is a work in progress. This version of the Lexicon Requirements is http://www.w3.org/TR/2004/WD-lexicon-reqs-20041029/. The latest version of the Lexicon Requirements is available at http://www.w3.org/TR/lexicon-reqs/.
[RDF]
RDF Primer, F. Manola and E. Miller, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Primer Recommendation is http://www.w3.org/TR/2004/REC-rdf-primer-20040210/. The latest version of the RDF Primer is available at http://www.w3.org/TR/rdf-primer/.
[RDF-XMLSYNTAX]
RDF/XML Syntax Specification, D. Beckett, Editor. World Wide Web Consortium, 10 February 2004. This version of the RDF/XML Syntax Recommendation is http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/. The latest version of the RDF XML Syntax is available at http://www.w3.org/TR/rdf-syntax-grammar/.
[RDF-SCHEMA]
RDF Vocabulary Description Language 1.0: RDF Schema, D. Brickley and R. Guha, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Schema Recommendation is http://www.w3.org/TR/2004/REC-rdf-schema-20040210/. The latest version of RDF Schema is available at http://www.w3.org/TR/rdf-schema/.
[REQS]
Speech Synthesis Markup Requirements for Voice Markup Languages, A. Hunt, Editor. World Wide Web Consortium, 23 December 1999. This document is a work in progress. This version of the Synthesis Requirements is http://www.w3.org/TR/1999/WD-voice-tts-reqs-19991223/. The latest version of the Synthesis Requirements is available at http://www.w3.org/TR/voice-tts-reqs/.
[REQS11]
Speech Synthesis Markup Language Version 1.1 Requirements, D. Burnett and Z. Shuang, Editors. World Wide Web Consortium, 11 June 2007. This document is a work in progress. This version of the SSML 1.1 Requirements is http://www.w3.org/TR/2007/WD-ssml11reqs-20070611/. The latest version of the SSML 1.1 Requirements is available at http://www.w3.org/TR/ssml11reqs/.
[RFC2616]
Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding, et al., Editors. IETF, June 1999. This RFC is available at http://www.ietf.org/rfc/rfc2616.txt.
[RFC2732]
Format for Literal IPv6 Addresses in URL's, R. Hinden, et al., Editors. IETF, December 1999. This RFC is available at http://www.ietf.org/rfc/rfc2732.txt.
[RUBY]
Ruby Annotation, Marcin Sawicki, et al., Editors. World Wide Web Consortium, 31 May 2001. This version of the Ruby Recommendation is http://www.w3.org/TR/2001/REC-ruby-20010531/. The latest version is available at http://www.w3.org/TR/ruby/.
[SABLE]
"SABLE: A Standard for TTS Markup", Richard Sproat, et al. Proceedings of the International Conference on Spoken Language Processing, R. Mannell and J. Robert-Ribes, Editors. Causal Productions Pty Ltd (Adelaide), 1998. Vol. 5, pp. 1719-1722. Conference proceedings are available from the publisher at http://www.causalproductions.com/.
[SSML]
Speech Synthesis Markup Language (SSML) Version 1.0, Daniel C. Burnett, et al., Editors. World Wide Web Consortium, 7 September 2004. This version of the SSML 1.0 Recommendation is http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/. The latest version is available at http://www.w3.org/TR/speech-synthesis/.
[UNICODE]
The Unicode Standard, The Unicode Consortium. Information about the Unicode Standard and its versions can be found at http://www.unicode.org/standard/standard.html.
[WEB-ARCH]
Architecture of the World Wide Web, Volume One, I. Jacobs and N. Walsh, Editors. World Wide Web Consortium, 15 December 2004. This version of the WWW Architecture is http://www.w3.org/TR/2004/REC-webarch-20041215/. The latest version of WWW Architecture is available at http://www.w3.org/TR/webarch/.
[VXML]
Voice Extensible Markup Language (VoiceXML) Version 2.0, S. McGlashan, et al., Editors. World Wide Web Consortium, 16 March 2004. This version of the VoiceXML 2.0 Recommendation is http://www.w3.org/TR/2004/REC-voicexml20-20040316/. The latest version of VoiceXML 2 is available at http://www.w3.org/TR/voicexml20/.
[WS]
Minutes, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 2-3 November 2005. The agenda and minutes are available from the W3C website.
[WS2]
Minutes, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 30-31 May 2006. The agenda and minutes are available from the W3C website.
[WS3]
Minutes, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 13-14 January 2007. The agenda and minutes are available from the W3C website.
5. Acknowledgments
This document was written with the participation of the following members of the W3C Voice Browser Working Group and other W3C Working Groups (listed in family name alphabetical order):
芦村 和幸 (Kazuyuki Ashimura), W3C
Max Froumentin, W3C (at the time of participation)
黄力行 (Lixing Huang), Chinese Academy of Sciences (at the time of participation)
Andrew Hunt, Speechworks (at the time of participation)
今竹 渉 (Wataru Imatake), Invited Expert
Richard Ishida, W3C
Jim Larson, Invited Expert (formerly of Intervoice)
Wai-Kit Lo, Chinese University of Hong Kong (at the time of participation)
Mark Walker, Intel (at the time of participation)
The editors also wish to thank the members of the W3C Internationalization Working Group, who have provided significant review and contributions to SSML 1.0 and 1.1.
Appendix A: Audio File Formats
This appendix is normative.
SSML requires that a platform support the playing of the audio formats specified below.
Required audio formats (Audio Format: Media Type):
- Raw (headerless) 8kHz 8-bit mono mu-law (PCM) single channel (G.711): audio/basic (from [RFC1521])
- Raw (headerless) 8kHz 8-bit mono A-law (PCM) single channel (G.711): audio/x-alaw-basic
- WAV (RIFF header) 8kHz 8-bit mono mu-law (PCM) single channel: audio/x-wav
- WAV (RIFF header) 8kHz 8-bit mono A-law (PCM) single channel: audio/x-wav
The 'audio/basic' media type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this media type is specified for playing, the mu-law format MUST be used. For playback with the 'audio/basic' media type, processors MUST support the mu-law format and MAY support the 'au' format.
Appendix B: Internationalization
This appendix is normative.
SSML is an application of XML [XML 1.0 or XML 1.1] and thus supports [UNICODE], which defines a standard universal character set.
SSML provides a mechanism for control of the spoken language via the use of the xml:lang attribute. Language changes can occur as frequently as per token (word), although excessive language changes can diminish the output audio quality. SSML also permits finer control over output pronunciations via the lexicon and phoneme elements, features that can help to mitigate poor quality default lexicons for languages with only minimal commercial support today.
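As an illustration (a minimal sketch, not taken from this specification; the sentence is invented), a document can switch language for a single word via the lang element while the surrounding text is spoken in the document language:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  The French word for cheese is <lang xml:lang="fr-FR">fromage</lang>.
</speak>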
Appendix C: Media Types and File Suffix
This appendix is normative.
The media type associated with the Speech Synthesis Markup Language specification is "application/ssml+xml" and the filename suffix is ".ssml", as defined in [RFC4267].
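For example, a host document that embeds SSML by reference can declare this media type explicitly. The following SMIL fragment is a minimal sketch (the file name is illustrative), using the type attribute of a SMIL media object:

<ref src="greetings.ssml" type="application/ssml+xml"/>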
Appendix D: Schema for the Speech Synthesis Markup Language
This appendix is normative.
The synthesis schema for the Core profile (Sec. 2.2.5) is located at http://www.w3.org/TR/speech-synthesis11/synthesis.xsd, and the schema for the Extended profile (Sec. 2.2.5) is located at http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd.
Note: the synthesis schemas include no-namespace schemas for the Core and Extended profiles, located respectively at http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd and http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace-extended.xsd, which MAY be used as a basis for specifying Speech Synthesis Markup Language Fragments (Sec. 2.2.1) embedded in non-synthesis namespace schemas.
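For example, a host-language schema can pull the no-namespace Core schema into its own target namespace with a so-called chameleon include. The following is a minimal sketch (the host target namespace is hypothetical):

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.org/host-language"
           xmlns="http://www.example.org/host-language"
           elementFormDefault="qualified">
  <!-- Chameleon include: the included no-namespace SSML components
       take on this schema's target namespace. -->
  <xs:include
      schemaLocation="http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd"/>
</xs:schema>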

Also for stability it is RECOMMENDED that you use the dated URIs for the above schema files, i.e. the corresponding files under http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/:
synthesis.xsd (for http://www.w3.org/TR/speech-synthesis11/synthesis.xsd)
synthesis-extended.xsd (for http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd)
synthesis-nonamespace.xsd (for http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd)
synthesis-nonamespace-extended.xsd (for http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace-extended.xsd)
Appendix E: Example SSML
This appendix is informative.
The following is an example of reading headers of email messages. The p and s elements are used to mark the text structure. The break element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.

xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
xml:lang="en-US">


You have 4 new messages.
The first is from Stephanie Williams and arrived at 3:45pm.


The subject is ski trip



The following example combines audio files and different spoken voices to provide information on a collection of music.

xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
xml:lang="en-US">



Today we preview the latest romantic music from Example.

Hear what the Software Reviews said about Example's newest hit.



He sings about issues that touch us all.



Here's a sample.


It is often the case that an author wishes to include a bit of foreign text (say, a movie title) in an application without having to switch languages (for example via the lang element). A simple way to do this is shown here. In this example the synthesis processor would render the movie name using the pronunciation rules of the container language ("en-US" in this case), similar to how a reader who doesn't know the foreign language might try to read (and pronounce) it.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
The title of the movie is:
"La vita è bella"
(Life is beautiful),
which is directed by Roberto Benigni.
</speak>

With some additional work the output quality can be improved tremendously either by creating a custom pronunciation in an external lexicon (see Section 3.1.5) or via the phoneme element as shown in the next example.
It is worth noting that IPA alphabet support is an OPTIONAL feature and that phonemes for an external language may be rendered with some approximation (see Section 3.1.5 for details). The following example only uses phonemes common to US English.

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
The title of the movie is:
<phoneme alphabet="ipa" ph="ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə">La vita è bella</phoneme>
(Life is beautiful),
which is directed by
<phoneme alphabet="ipa" ph="ɹəˈbɛːɹɾoʊ bɛˈniːnji">Roberto Benigni</phoneme>.
</speak>

SMIL Integration Example
The SMIL language [SMIL3] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.
File 'greetings.ssml' contains the following:

xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
xml:lang="en-US">



Greetings from the W3C!


SMIL Example 1:
W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0" baseProfile="Language">
  <body>
    <par>
      <img src="http://www.w3.org/Icons/w3c_home" begin="0s"/>
      <ref src="greetings.ssml" begin="1s"/>
    </par>
  </body>
</smil>

SMIL Example 2:
W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0" baseProfile="Language">
  <body>
    <seq>
      <img id="logo" src="http://www.w3.org/Icons/w3c_home"
           end="logo.activateEvent"/>
      <ref src="greetings.ssml"/>
    </seq>
  </body>
</smil>

VoiceXML Integration Example
The following is an example of SSML in VoiceXML (see Section 2.3.3) for voice browser applications. It is worth noting that the VoiceXML namespace includes the SSML namespace elements and attributes. See Appendix O of [VXML] for details.

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0"
      xmlns="http://www.w3.org/2001/vxml"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/vxml
                http://www.w3.org/TR/voicexml20/vxml.xsd"
      xml:lang="en-US">
  <form>
    <block>
      <prompt>
        Welcome to the Bird Seed Emporium.
      </prompt>
    </block>
  </form>
</vxml>

Appendix F: Changes since SSML 1.0
This appendix is informative.
In the event of modifying an SSML 1.0 conformant document for a synthesis processor that supports only SSML 1.1, document authors are informed of the following note on compatibility:
SSML 1.0 conformant elements requiring no changes for SSML 1.1 are: break, emphasis, mark, meta, metadata, phoneme, say-as, and sub, and partially prosody (excluding the rate and volume attributes).
Elements with attribute changes in SSML 1.1 are summarized in the list below; several elements and attributes were also added to enhance external references into SSML content.
In 3.1.5 (lexicon), added the lookup element to control which lexicons are currently in use; the lexicon element now only defines which lexicons are used in the document.
Removed general text describing how text may be mapped to entries in the lexicon.
The default value of the lexicon type attribute is now "application/pls+xml", as defined by the PLS 1.0 specification.
Introduced the notion of a Pronunciation Alphabet Registry that would maintain a list of registered values for the alphabet attribute of the phoneme element.
Removed the xml:lang attribute from the voice element to reduce confusion.
Added the lang element to allow setting xml:lang for arbitrary text content.
Clarified in the voice element description that indication of language and voice are independent, that no synthesis processor is required to support all combinations thereof, and that processors must document behavior for every combination thereof.
3.1.5: Now mandates that if a referenced lexicon is a PLS document, then the information in it must be used by the processor.
3.1.5.2: Clarified that the processor already has built-in system lexicons whose values are overridden by use of the lexicon and lookup elements.
Updated entire document to allow for XML and XMLNS 1.1 in addition to 1.0. Clarified in definition of URI that IRIs are allowed and added an informative reference to RFC3987.
Completely revamped how voice selection and language speaking control are done.
In 3.2.1, added "languages", "required", "ordering", and "onvoicefailure" attributes and introduced a new voice selection algorithm. Voice selection is now scoped.
Added new "onlangfailure" attribute (new section 3.1.13) on all elements that take the "xml:lang" attribute: speak, p, s, token, w, desc, and lang.
Added trimming attributes to speak (section 3.1.1) to accommodate expected VoiceXML 3 needs.
In 3.3.1, added trimming, soundLevel, and speed attributes to