The XHTML+Voice profile brings spoken interaction to standard web content by integrating the mature XHTML and XML-Events technologies with XML vocabularies developed as part of the W3C Speech Interface Framework. The profile includes voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. Voice handlers can be attached to XHTML elements and respond to specific DOM events, thereby reusing the event model familiar to web developers. Voice interaction features are integrated with XHTML and CSS and can consequently be used directly within XHTML content.
This section describes the status of this document at the time of its publication. Other documents may supersede this document.
Note that the language profile described in this specification re-uses W3C working drafts that are likely to change. This integration profile will be updated as needed to use the final stable versions of these specifications. This profile is an update to the XHTML+Voice 1.1 profile. XHTML+Voice 1.2 is current with the VoiceXML 2.0 Recommendation.
The list of known errors in this specification is available at xhtml-voice12-errata.html. Please report errors in this document to mccobb@us.ibm.com.
1 Introduction
1.1 Motivation And Applications
1.2 Design Principles
1.3 XHTML+Voice Processing Model
1.3.1 Processing within one Document
1.3.1.1 Language and Version
1.3.1.2 VoiceXML Scope within XHTML+Voice
1.3.1.3 VoiceXML Dialog Activation
1.3.1.4 Accessing Speech Dialog Results from XHTML
1.3.1.5 Accessing XHTML from a Speech Dialog
1.3.1.6 Returning from a VoiceXML Form
1.3.2 Cancel
1.3.3 Declarative Synchronization of Input Modes
1.3.4 Events and Event Handling
1.3.5 Document Linking with Voice
1.3.6 Aural Style Sheets
2 VoiceXML 2.0 Modules
2.1 Modularization Of VoiceXML 2.0
2.2 Speech Dialogs
2.3 Executable Content
2.4 Speech Grammars
2.5 Speech And Non-speech Audio Output
2.6 Event Handling
3 XHTML Modularization
3.1 Document Conformance
3.2 User Agent Conformance
3.3 XHTML Namespace Integration
3.4 XHTML+Voice Profile
3.5 XHTML+Voice Abstract Modules
3.5.1 Abstract Modules
3.5.2 Element content shorthands
3.5.3 Attribute list shorthands
4 XML Events Module
4.1 Listener
4.2 Event Types
4.2.1 DOMActivate
4.3 XHTML+Voice Event Propagation
5 XHTML+Voice Extension Module
5.1 Sync
5.1.1 Standard Grammars for XHTML Controls
5.2 Cancel
5.3 VoiceXML Field ID Attribute
5.4 VoiceXML Prompt SRC and EXPR Attributes
5.4.1 Styling External Prompt Resources
5.4.2 Invalid Prompt Resource
5.4.3 Prompt Resource Fetching Properties
A Reusable VoiceXML
B Examples
B.1 What You See Is What You Can Say
B.2 Mixed-initiative Conversational Interface
B.3 Speech-Enabled Mail Interface
B.4 Reusable VoiceXML Subdialogs
C FIA for XHTML+Voice
D DTD
D.1 xhtml+voice12.dtd
E Schema
E.1 xhtml+voice12.xsd
F VoiceXML Container for the XHTML+Voice Subset
F.1 vxml20-xvsubset.xsd
G Multimodal Auto Fill
H Changes from XHTML+Voice 1.1
H.1 Modified Elements
H.2 Clarifications
H.3 Miscellaneous
I References
I.1 Normative References
I.2 Informative References
This document defines version 1.2 of the XHTML+Voice profile. XHTML+Voice 1.2 is a member of the XHTML family of document types, as specified by XHTML Modularization [XHTML Modularization]. XHTML is extended with a modularized subset of VoiceXML 2.0, the XML Events module, and a module containing a small number of attribute extensions to both XHTML and VoiceXML. The latter module facilitates the sharing of multimodal input data between the VoiceXML dialog and XHTML input and text elements.
The XML Events module [XML Events] provides XML host languages the ability to uniformly integrate event listeners and associated event handlers with Document Object Model (DOM) Level 2 [DOM2 Events] event interfaces. The result is an event syntax for XHTML-based languages that enables an interoperable way of associating behaviors with document-level markup.
VoiceXML [VoiceXML 2.0] was designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. In this document, VoiceXML 2.0 is modularized to prepare it for integration into the XHTML family of languages using the XHTML modularization framework. The modules that together support speech dialogs for updating XHTML forms and form elements are selected for addition to XHTML. This document describes those modules and the issues raised by their integration. The modularization of VoiceXML 2.0 also specifies DOM event types specific to voice interaction for use with the XML Events module. Speech dialogs authored in VoiceXML 2.0 can then be treated as event handlers that add voice-interaction behaviors to XHTML documents. The language integration supports all of the modules defined in XHTML Modularization and adds speech interaction functionality to XHTML elements to enable multimodal applications. The document type defined by the XHTML+Voice profile is XHTML Host Language document type conformant.
Two mature technologies, XHTML 1.1 [XHTML 1.1] and VoiceXML 2.0 [VoiceXML 2.0] are integrated using [XHTML Modularization] to bring spoken interaction to the visual web. The design leverages open industry APIs like the W3C DOM to create interoperable web content that can be deployed across a variety of end-user devices. Multiple modes of interaction are synchronized and integrated using the DOM 2 Events model [DOM2 Events] and exposed to the content author via XML Events [XML Events].
Today, web applications are authored in XHTML with user interaction created via XHTML form elements. The W3C is presently working on XForms [XForms], the next generation of web forms that bring the power of XML to web application development. The combination of XHTML and Voice described in this document can leverage the semantic richness of web applications created using XForms, while providing a smooth transition for today's developers wishing to deploy multimodal applications by adding spoken interaction to present-day web content. Integrating the work of the W3C voice browser working group into mainstream XHTML content has the advantage of ensuring that future enhancements to the voice browser component such as natural language understanding will be incorporated. Thus, a smooth transition path for web developers wishing to deliver increasingly smart user interaction for their web applications is provided. Building on XHTML Basic [XHTML Basic] and XHTML modularization, content developers will be able to deploy their content to a wide variety of end-user clients ranging from mobile phones and small PDAs to desktop browsers.
XHTML+Voice is an XML application [XML 1.0].
XHTML+Voice is designed for creating multimodal dialogs that combine the visual input mode, represented by XHTML, and speech input and output, represented by a subset of VoiceXML. Here is a "Hello World" example of XHTML+Voice:
<?xml version="1.0"?>
<html
xmlns="http://www.w3.org/1999/xhtml"
xmlns:vxml="http://www.w3.org/2001/vxml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice"
>
<head>
<title>XHTML+Voice Example</title>
<!-- voice handler -->
<vxml:form id="sayHello">
<vxml:block><vxml:prompt xv:src="#hello"/>
</vxml:block>
</vxml:form>
</head>
<body>
<h1>XHTML+Voice Example</h1>
<p id="hello" ev:event="click" ev:handler="#sayHello">
Hello World!
</p>
</body>
</html>
The speech dialog identified by "sayHello" is activated when the user clicks anywhere on the paragraph identified by "hello". The speech dialog is a VoiceXML form that synthesizes the text obtained from the same paragraph that activated the form. The speech output is "Hello World!"
A speech dialog is defined within XHTML+Voice as a [VoiceXML 2.0] form with a unique ID. The VoiceXML form is activated by an XML Events event with an associated handler that references the form's unique ID. The XML Events event is generated from a user interaction with an XHTML element, generally a form control, or from a document event such as load or unload. Activating the VoiceXML form sets all form and field item variables to their initial values. This clears the guard conditions on all form items that don't have an initial value set with the expr attribute. The form is run according to the form interpretation algorithm (FIA) specified by VoiceXML.
A VoiceXML form requires language and VoiceXML version information. VoiceXML 2.0 includes language and version attributes on its root <vxml> element. XHTML+Voice obtains the language and VoiceXML version from XHTML as follows. The language is obtained from the HTML root element's xml:lang attribute, while the VoiceXML version can be derived from the value of the VoiceXML namespace. The language can be overridden by the xml:lang attribute on the VoiceXML grammar and prompt tags.
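The language rule above can be illustrated with a minimal Python sketch. It assumes a simple lookup in which an xml:lang attribute on a prompt or grammar element wins and the value on the html root is the fallback; the function name effective_language and the sample document are hypothetical, and derivation of the VoiceXML version is not modeled.

```python
import xml.etree.ElementTree as ET

# Namespace names as used in this document; xml:lang lives in the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
VXML_NS = "http://www.w3.org/2001/vxml"

def effective_language(root, element):
    """Hypothetical helper: xml:lang on the element, else xml:lang on the html root."""
    return element.get(XML_LANG) or root.get(XML_LANG)

# Illustrative document, not taken from the specification.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <head>
    <vxml:form id="greet">
      <vxml:block>
        <vxml:prompt>Hello</vxml:prompt>
        <vxml:prompt xml:lang="fr-FR">Bonjour</vxml:prompt>
      </vxml:block>
    </vxml:form>
  </head>
  <body/>
</html>"""

root = ET.fromstring(DOC)
prompts = root.findall(f".//{{{VXML_NS}}}prompt")
languages = [effective_language(root, prompt) for prompt in prompts]
print(languages)
```

The first prompt inherits the language of the html root, while the second uses its own xml:lang value.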
A VoiceXML form within an XHTML+Voice document does not have the session and document scopes defined by VoiceXML. It does not have these scopes for two reasons. First, <form> is the top level VoiceXML element in an XHTML+Voice document. Second, XHTML+Voice does not allow transitions from one voice handler to another. VoiceXML 2.0 allows a form to have either dialog or document scope. If the form's scope is document, as set by the scope attribute, the form is active while another form in the document is running. When the speech input matches the grammar of the form with document scope, there is a transition from the currently running form to the form with the document scope. XHTML+Voice does not allow this transition. Consequently, a form's scope is limited to dialog and the scope attribute is ignored. The grammar scope attribute is also ignored for the same reason. The remaining inner VoiceXML scopes, dialog and anonymous, are processed by XHTML+Voice, as required by the VoiceXML FIA.
XHTML+Voice supports only the default value of the scope attribute, which is "dialog". If the scope attribute is encountered on a voice handler form, the form is not invalidated and processing continues. The scope attribute on the <grammar> element is also ignored and its default value of "dialog" is maintained. More generally, XHTML+Voice document processing ignores any VoiceXML 2.0 attribute it does not support.
If XHTML+Voice document processing encounters a VoiceXML 2.0 element not supported by XHTML+Voice (e.g., <goto>), a "badfetch" error is thrown. This means that a VoiceXML 2.0 interpreter and an XHTML+Voice interpreter can run the same VoiceXML 2.0 source if all the source elements are supported by XHTML+Voice. Not every source attribute needs to be supported, however, because XHTML+Voice supports the default values of the attributes it ignores.
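The difference in treatment between unsupported elements and unsupported attributes can be sketched as follows. This is a simplified Python model under stated assumptions, not a description of any actual interpreter: the element set is transcribed from the modules this profile marks as included, BadFetchError is a hypothetical stand-in for the VoiceXML "badfetch" error, and only the scope attribute is modeled as ignored.

```python
# Element names transcribed from the VoiceXML modules included in XHTML+Voice.
SUPPORTED_ELEMENTS = {
    "catch", "help", "noinput", "nomatch", "error", "throw",
    "assign", "clear", "var", "log", "reprompt",
    "filled", "if", "else", "elseif", "return",
    "form", "field", "record", "subdialog", "block", "initial",
    "param", "property", "enumerate", "option",
    "prompt", "value", "audio", "desc", "emphasis", "lexicon", "mark",
    "voice", "break", "prosody", "say-as", "sub", "phoneme", "p", "s",
    "meta", "metadata",
    "grammar", "example", "tag", "token", "item", "one-of", "rule", "ruleref",
}

# Assumption: only the scope attribute on form and grammar is modeled here.
IGNORED_ATTRIBUTES = {("form", "scope"), ("grammar", "scope")}

class BadFetchError(Exception):
    """Hypothetical stand-in for the VoiceXML "badfetch" error."""

def process_element(name, attributes):
    """Reject unsupported elements; silently drop ignored attributes."""
    if name not in SUPPORTED_ELEMENTS:
        raise BadFetchError(f"unsupported VoiceXML element: <{name}>")
    return {key: value for key, value in attributes.items()
            if (name, key) not in IGNORED_ATTRIBUTES}

print(process_element("form", {"id": "vform", "scope": "document"}))
```

In this model, a form carrying scope="document" is still processed, with the attribute dropped, whereas an element such as goto stops processing with an error.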
XHTML+Voice allows a speech dialog to be referenced as a voice handler in an external file. Because the speech dialog has no scope outside of its enclosing form, only the form in the external file is processed when the form is activated. For example, the script elements in the external file will not be processed. This is because the visual browser only executes script in the current document, and the VoiceXML <script> element is not supported. Consequently, the external reference must contain a fragment identifier specifying the form in addition to an absolute or relative URI. This differs from VoiceXML, which specifies that when the fragment is absent, the form "invoked is the lexically first dialog in the document" [VoiceXML 2.0]. With this restriction, the speech dialog can reside in any external XML document, including VoiceXML. Only the calling document has to be an XHTML+Voice document.
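The fragment requirement on voice handler references can be checked mechanically. The following Python sketch uses the standard urllib.parse.urldefrag function; the helper name split_handler_reference is hypothetical, and treating a missing fragment as a ValueError is an assumption made for illustration only.

```python
from urllib.parse import urldefrag

def split_handler_reference(handler_uri):
    """Split a handler reference into (document URI, form id).

    Hypothetical helper: rejects references without a fragment, since
    XHTML+Voice does not fall back to the lexically first dialog.
    """
    document_uri, form_id = urldefrag(handler_uri)
    if not form_id:
        raise ValueError(f"handler reference lacks a form fragment: {handler_uri!r}")
    return document_uri, form_id

print(split_handler_reference("cityorairport.vxml#cityform"))
print(split_handler_reference("#sayHello"))
```

A same-document reference such as "#sayHello" yields an empty document URI, while "cityorairport.vxml" without a fragment is rejected.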
Because XHTML script placed in an external file is not processed, validation of VoiceXML results cannot be performed within an external subdialog by calling out to some ECMAScript contained within a VoiceXML script tag. ECMAScript validation of subdialog results can only be performed from the calling document. Validation methods must be included in the ECMAScript objects passed as parameters to the subdialog.
VoiceXML <field>, <subdialog>, and <var> elements do not have any visibility to the XHTML namespace as ECMAScript variables. Furthermore, there is no requirement to support the VoiceXML elements as nodes in the DOM object available to JavaScript. There are several problems with supporting the DOM object. Unlike XHTML form control elements, VoiceXML form item elements don't have a value attribute and consequently the DOM node value property is missing. A value attribute is not needed because the VoiceXML form item elements are their own ECMAScript variables, and they have defined values only while the enclosing form is active. At all other times their values are undefined.
When the browser loads the body of an XHTML+Voice document, a "load" event is generated. This begins the event cycle specified by the DOM Level 2 Events model. While the event cycle is running, events propagate through the HTML tree. An XML Events listener can observe an event either on the target HTML node or, if the event bubbles, on an ancestor of that node. An XML Events listener activates a handler in response to the observed event. The handler can be a voice dialog activated in response to a "click" event on an HTML input, for example.
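The observer rule described above (target node, or an ancestor when the event bubbles) can be modeled in a few lines of Python. This is a deliberately reduced sketch: the function name handlers_to_run, the tuple layout for listeners, and the node identifiers are hypothetical, and the capture phase of DOM Level 2 Events is not modeled.

```python
def handlers_to_run(target_path, event_type, bubbles, listeners):
    """Return handler references activated for one event.

    target_path: node ids from the event target up to the root.
    listeners:   (observer id, event type, handler reference) tuples.
    Assumption: capture-phase listeners are ignored in this sketch.
    """
    observed_nodes = target_path if bubbles else target_path[:1]
    return [handler
            for node in observed_nodes
            for observer, observed_event, handler in listeners
            if observer == node and observed_event == event_type]

# Illustrative tree path and listeners, not taken from the specification.
path = ["from_city", "main", "body", "html"]
listeners = [("main", "click", "#form_id"), ("from_city", "click", "#other")]

print(handlers_to_run(path, "click", True, listeners))
print(handlers_to_run(path, "click", False, listeners))
```

With bubbling, both the listener on the target and the listener on its ancestor are activated, target first; without bubbling, only the listener on the target runs.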
A voice dialog can also be activated by dispatching a DOMActivate event against it from XHTML script. The XML Events Module provides more details and an example.
Speech dialog results may be accessed from XHTML in one of the following ways:
The global JavaScript scope of an XHTML+Voice document is available to a speech dialog. For example, an XHTML form control element, such as input, can be accessed from within VoiceXML using the DOM object traversal notation available to JavaScript. For example, the value of an input field with name "from_city" is set from the VoiceXML assign tag as follows:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:ev="http://www.w3.org/2001/xml-events">
<head>
<form id="form_id" xmlns="http://www.w3.org/2001/vxml">
<field name="from_field">
<filled>
<assign name="document.main.from_city.value"
expr="from_field"/>
</filled>
</field>
</form>
</head>
<body>
<form name="main" action="cgi/city.jsp">
<input name="from_city" type="text"
ev:event="focus" ev:handler="#form_id"/>
</form>
</body>
</html>
The document keyword in XHTML+Voice refers to the JavaScript DOM object. This works because XHTML+Voice allows a voice dialog to share the global JavaScript scope with the XHTML container. XHTML+Voice also puts the VoiceXML application scope below the shared global scope.
When an event is captured within a voice dialog the author may choose to end the dialog and return to the XHTML container. XHTML+Voice uses the VoiceXML <return> element for this purpose. If the <return> element is run within executable content of a top level voice handler (i.e., one that is not called as a subdialog), the voice handler will end its execution and return to the XHTML. The following example shows how the <return> element can be used:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:vxml="http://www.w3.org/2001/vxml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice" >
<head><title>Find City or Airport</title>
<vxml:form id="vform">
<vxml:subdialog name="cityorairport" src="cityorairport.vxml#cityform">
<vxml:param name="paramPrompt" expr="'What city or airport?'"/>
<vxml:filled>
<vxml:assign name="document.xform.city.value"
expr="cityorairport.returnCityOrAirport"/>
</vxml:filled>
<vxml:catch event="error.badfetch">
Error fetching subdialog!
<vxml:return/>
</vxml:catch>
</vxml:subdialog>
</vxml:form>
</head>
<body bgcolor="#FFFFFF">
<h3>City or Airport</h3>
<form name="xform" action="cgi/cityorairport.jsp">
<p>Enter city or airport:<br/>
<input type="text" name="city" ev:event="focus" ev:handler="#vform"/>
</p>
</form>
</body>
</html>
When the <return> element is specified within a top-level voice form, its namelist attribute has no meaning and is ignored. However, either the event or eventexpr attribute can be used to return a VoiceXML event to the XHTML container.
XHTML+Voice does not allow multiple speech dialogs to run simultaneously. A speech dialog runs in its own thread and, for many devices, the audio subsystem can be owned by only one thread at a time. Also, other resources that are not guaranteed to be thread-safe may cause a voice handler to block indefinitely. Therefore, only one speech dialog can be running at one time per loaded XHTML+Voice document. Because of this restriction, activating a speech dialog must cancel the currently running dialog. This is the default behavior. The running dialog should also be canceled when the current XHTML+Voice document is unloaded.
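The one-dialog-at-a-time rule can be pictured as a small state machine. The Python sketch below is an illustrative model only: the DialogManager class, its method names, and its log are hypothetical, and real user agents must also manage threads and audio resources, which are not modeled.

```python
class DialogManager:
    """Hypothetical model: at most one running speech dialog per document."""

    def __init__(self):
        self.running = None   # id of the running voice handler, if any
        self.log = []         # (action, form id) pairs, for illustration

    def cancel(self):
        """Cancel the running dialog, if there is one."""
        if self.running is not None:
            self.log.append(("cancel", self.running))
            self.running = None

    def activate(self, form_id):
        """Activating a dialog first cancels the one already running."""
        self.cancel()
        self.running = form_id
        self.log.append(("start", form_id))

    def unload(self):
        """The running dialog is canceled when the document is unloaded."""
        self.cancel()

manager = DialogManager()
manager.activate("vform")
manager.activate("sayHello")
manager.unload()
print(manager.log)
```

The log shows that "vform" is canceled before "sayHello" starts, and that "sayHello" is canceled on unload.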
The document author can cancel the currently running speech dialog with the <cancel> element that can be specified by an XHTML element as a handler for an XML Events event. The XHTML+Voice Extension Module section provides more details.
Cancel is a message from the visual browser that must be handled by the VoiceXML FIA. It is separate from the cancel event supported by VoiceXML that cancels the currently running prompt. The cancel message from the visual browser modifies the FIA in the sense that it must be checked throughout the FIA, and if it is received then the FIA must terminate.
The XHTML+Voice <sync> element provides a declarative synchronization of XHTML form control elements and the VoiceXML <field> element. The <sync> element specifies the following behaviors. First, sync allows input from one speech or visual modality to set the field in the other modality. Second, setting the focus of an <input> element that is synchronized with a VoiceXML field updates the FIA to visit that VoiceXML field. This is useful when there are multiple fields within a VoiceXML form. Sync is both a message to the VoiceXML FIA from the visual browser, like cancel, and a message from the FIA to the visual browser. The XHTML+Voice Extension Module section provides more details.
The nomatch, noinput, help, and error VoiceXML event types are propagated as XML Events events to XHTML. They can be linked to an XML Events handler using the XML Events syntax for specifying target, observer, event, and handler. The events are propagated regardless of whether the event has already been caught and handled properly within the VoiceXML form. The VoiceXML event types nomatch, noinput, help, and error propagate to the XHTML container as the XHTML+Voice event types vxmlnomatch, vxmlnoinput, vxmlhelp, and vxmlerror, respectively.
Within VoiceXML a chain of events can be created, where one event is caught and another event is thrown, and so on. Because the entire chain of events is propagated to XHTML, the application author should be careful not to chain multiple events of the same type. The VoiceXML error event subtypes error.semantic, error.badfetch, error.unsupported.element, etc., are propagated as the vxmlerror event type to XHTML. This is in accordance with the VoiceXML specification. This allows for the application to define additional error subtypes that can be handled by the visual browser. More general application-defined event types are also supported. If an application-defined event type is defined within the VoiceXML form, such as "foo.bar", then when that event is thrown within the form, it is propagated to XHTML as an XML Events event. For the example below, both the vxmlnoinput and foo.bar events are handled by the visual browser via the XML Events listener tag. Note that the VoiceXML form exits because the foo.bar event is not handled within the form.
<vxml:form id="ex1">
<vxml:catch event="noinput">
<vxml:throw event="foo.bar"/>
</vxml:catch>
<vxml:field name="f1">
<vxml:grammar type="boolean"/>
<vxml:prompt>Say yes or no</vxml:prompt>
</vxml:field>
</vxml:form>
<ev:listener ev:observer="ex1" ev:event="vxmlnoinput" ev:handler="#h1"/>
<ev:listener ev:observer="ex1" ev:event="foo.bar" ev:handler="#h2"/>
In addition to the VoiceXML event types listed above, XHTML+Voice supports the vxmldone event type. The vxmldone event is generated when the currently running VoiceXML form completes without an error. All the event types that XHTML+Voice supports are listed in the XML Events Module.
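The event-name mapping described in this section can be summarized in a short Python sketch. The function name propagated_event_type is hypothetical, and the handling of subtypes other than error subtypes (for example a dotted name beginning with "noinput.") is not stated in the text above; this sketch assumes such names pass through unchanged, like application-defined types.

```python
# Mapping of VoiceXML event types to XHTML+Voice event types, as listed above.
BUILT_IN_EVENTS = {
    "nomatch": "vxmlnomatch",
    "noinput": "vxmlnoinput",
    "help": "vxmlhelp",
    "error": "vxmlerror",
}

def propagated_event_type(vxml_event):
    """Return the event type seen by XML Events listeners in XHTML."""
    # All error subtypes (error.semantic, error.badfetch, ...) collapse to vxmlerror.
    if vxml_event.startswith("error."):
        return "vxmlerror"
    # Application-defined types such as "foo.bar" keep their own name.
    return BUILT_IN_EVENTS.get(vxml_event, vxml_event)

for name in ("noinput", "error.badfetch", "foo.bar"):
    print(name, "->", propagated_event_type(name))
```

The vxmldone event is not part of this mapping because it is generated by form completion rather than translated from a VoiceXML event.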
Document linking with voice is available to the author. Given an XHTML+Voice document with the following <link> and <a> elements:
<link rel="glossary" title="Glossary" href="glossary.html"/>
<link rel="contents" title="Contents" href="contents.html"/>
<a href="chapter3.html" title="Next Page" rel="next">Next</a>
<a href="chapter1.html" title="Previous Page" rel="previous">Previous</a>
<a href="http://www.nytimes.com" title="New York Times">NY Times</a>
The following grammar can be produced, as shown below. The document author uses the rel attribute to enable document linking for a select set of <link> and <a> elements. For each element with a rel attribute, the rel and href attribute values are added to the grammar, where the rel value is what the user might say, and the href value is the corresponding URI. If the rel attribute is omitted the title attribute can be used for building a link activation grammar for all the <a> elements in the document.
#JSGF V1.0 iso-8859-1;
grammar document-links;
public <document-links> = Glossary {this.$value="glossary.html"}
| Contents {this.$value="contents.html"}
| Next Page {this.$value="chapter3.html"}
| Previous Page {this.$value="chapter1.html"}
| New York Times {this.$value="http://www.nytimes.com"};
The grammar scope of the grammar is document so that it is always active. While XHTML+Voice does not support authoring a grammar with document scope within a form, the multimodal browser should support grammars with document scope for document linking and command and control.
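How a browser might derive such a link grammar can be sketched in Python. This is an illustration under assumptions, not a normative algorithm: following the example grammar, the spoken phrase is taken from the title attribute, with the rel value used only when no title is present; the function names link_phrases and to_jsgf are hypothetical.

```python
import xml.etree.ElementTree as ET

def link_phrases(markup):
    """Collect (spoken phrase, href) pairs from <link> and <a> elements.

    Assumption: the phrase is the title attribute, else the rel attribute.
    """
    root = ET.fromstring(f"<root>{markup}</root>")
    pairs = []
    for element in root.iter():
        if element.tag not in ("link", "a"):
            continue
        href = element.get("href")
        phrase = element.get("title") or element.get("rel")
        if href and phrase:
            pairs.append((phrase, href))
    return pairs

def to_jsgf(pairs, name="document-links"):
    """Render the pairs in the JSGF layout used by the example grammar."""
    alternatives = "\n| ".join(
        f'{phrase} {{this.$value="{href}"}}' for phrase, href in pairs)
    return (f"#JSGF V1.0 iso-8859-1;\ngrammar {name};\n"
            f"public <{name}> = {alternatives};")

MARKUP = """
<link rel="glossary" title="Glossary" href="glossary.html"/>
<a href="chapter3.html" title="Next Page" rel="next">Next</a>
<a href="http://www.nytimes.com" title="New York Times">NY Times</a>
"""

pairs = link_phrases(MARKUP)
print(to_jsgf(pairs))
```

Each alternative returns the target URI as the grammar's semantic value, which the browser can then use to follow the link.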
With the addition of the src and expr attributes to the VoiceXML <prompt> element, XHTML+Voice is able to support Aural style sheets declared according to [CSS2]. Within XHTML, a paragraph with id set to "warnPara" can be styled with the CSS "warn" class:
<p id="warnPara" class="warn">warning</p>
The CSS has visual and aural rules for class "warn". When the VoiceXML <form> processes a prompt with the src attribute set to that paragraph, the aural style rules for "warn" are invoked. The VoiceXML Prompt SRC and EXPR Attributes section provides more details and a complete example.
This section first modularizes VoiceXML 2.0 and then specifies the various VoiceXML 2.0 modules used in the creation of the XHTML+Voice profile.
The files making up the modularization of the VoiceXML 2.0 SCHEMA are available as voice-xml-modules.zip and have been created to ease the process of integrating VoiceXML 2.0 and XHTML. These modules do not change the VoiceXML 2.0 language as specified by the voice browser working group of the W3C. This section gives a high-level overview of each module.
| Module | Purpose | Elements | XHTML+Voice? |
|---|---|---|---|
| Events | Events thrown by VoiceXML processor | catch, help, noinput, nomatch, error, throw | Y |
| Executable statements | Statements for use in voice handlers | assign, clear, var, log, reprompt | Y |
| Filled | Voice handlers invoked when a slot is filled | filled | Y |
| Flow control | Flow control constructs from VoiceXML | if, else, elseif, return | Y |
| Forms | Encapsulate voice dialogs | form, field, record, subdialog, block, initial | Y |
| Miscellaneous | Non-local transfers in VoiceXML | exit, goto, link, script, submit | N |
| Menus | VoiceXML menus | menu, choice | N |
| Object | Foreign objects for VoiceXML | object | N |
| Resources | Specifying resources for VoiceXML | param, property | Y |
| Root | VoiceXML stand-alone documents | vxml, meta, metadata | N |
| Enumerate | Enumerate choices or options available to user | enumerate | Y |
| Option | Specify option in a field | option | Y |
| Output | Speech and audio output | prompt, value, audio, desc, emphasis, lexicon, mark, voice, break, prosody, say-as, sub, phoneme, p, s, meta, metadata | Y |
| Telephony | Telephony control | transfer, disconnect | N |
| User Input | Speech input constructs from VoiceXML | grammar, lexicon, example, tag, token, item, meta, metadata, one-of, rule, ruleref | Y |
| Attributes | Common attributes used in VoiceXML | NA | Y |
| Datatypes | Common datatypes used in VoiceXML | NA | Y |
| Document Model | Defines content model for VoiceXML elements | NA | N |
Modules vxml-exec-1.xsd, vxml-filled-1.xsd, vxml-resource-1.xsd, vxml-flow-1.xsd, vxml-enumerate-1.xsd, vxml-option-1.xsd, and vxml-form-1.xsd support authoring handlers that implement speech dialogs.
Modules vxml-filled-1.xsd, vxml-flow-1.xsd, vxml-exec-1.xsd, and vxml-resource-1.xsd declare constructs for use within voice handlers. The semantics of these constructs are as defined in the VoiceXML 2.0 specification.
The speech grammar modules provide constructs for authoring speech grammars as specified in VoiceXML 2.0. The modules are provided by the normative VoiceXML 2.0 SCHEMA and are unchanged: grammar-core.xsd, grammar.xsd, vxml-grammar-restriction.xsd, and vxml-grammar-extension.xsd. The restriction and extension modules allow the elements and attributes normatively specified by the speech grammar specification [Speech Grammars] to be included within the VoiceXML 2.0 namespace.
The speech and audio output modules define constructs for producing spoken and non-spoken audio output. The modules are provided by the normative VoiceXML SCHEMA and are unchanged: synthesis-core.xsd, synthesis.xsd, vxml-synthesis-restriction.xsd, and vxml-synthesis-extension.xsd. As with the speech grammar modules, the elements and attributes normatively defined in the SSML specification [SSML 1.0] are included within the VoiceXML 2.0 namespace.
This section is normative.
A conforming XHTML+Voice document is a document that requires only the facilities described as mandatory in this specification. Such a document must meet all of the following criteria:
It must validate against the XML Schema provided in this document.
The root element of the document must be html.
The name of the default namespace on the root element must be the XHTML namespace name: http://www.w3.org/1999/xhtml.
If a DOCTYPE declaration is present and includes a public identifier, the DOCTYPE declaration must reference the DTD provided in this document using its Formal Public Identifier. The system identifier may be modified appropriately.
<!DOCTYPE html PUBLIC "-//VoiceXML Forum//DTD XHTML+Voice 1.2//EN" "http://www.voicexml.org/specs/multimodal/x+v/12/dtd/xhtml+voice12.dtd">
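Two of the criteria above, the root element name and its namespace, lend themselves to a quick mechanical check. The Python sketch below is a partial check only and makes several assumptions: the function name has_conforming_root is hypothetical, it inspects the resolved namespace of the root element without verifying that it was declared as the default namespace, and it performs neither schema validation nor DOCTYPE inspection.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

def has_conforming_root(document_text):
    """Check that the root element is html in the XHTML namespace.

    Partial check only: no schema validation, no DOCTYPE inspection.
    """
    root = ET.fromstring(document_text)
    return root.tag == f"{{{XHTML_NS}}}html"

GOOD = ('<html xmlns="http://www.w3.org/1999/xhtml">'
        '<head><title>t</title></head><body/></html>')
BAD = '<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0"/>'

print(has_conforming_root(GOOD), has_conforming_root(BAD))
```

A stand-alone VoiceXML document fails the check because its root element is vxml rather than html.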
The user agent must conform to the "User Agent Conformance" section of the XHTML specification [XHTML 1.0], section 3.2, and the conformance requirements detailed in the VoiceXML modules [VoiceXML 2.0] supported by the integration profile.
The user agent must conform to the following additional user agent rule:
When the user agent claims to support facilities defined within the VoiceXML 2.0 specifications or facilities required by this specification through normative reference, it must do so in ways consistent with the facilities' definition.
The default XML namespace of an XHTML+Voice document is XHTML. XHTML+Voice extends XHTML with VoiceXML, XML Events, and XHTML+Voice extensions. The VoiceXML, XML Events, and XHTML+Voice extension elements and attributes are included through additional namespace declarations:
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:vxml="http://www.w3.org/2001/vxml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
The name of the unique prefix identifier for the namespace within the document, for example, vxml for VoiceXML elements, is left to the document author's discretion.
The XHTML functionality in the XHTML+Voice document type is based upon the XHTML modules defined in [XHTML Modularization]. The XHTML+Voice profile includes the XHTML modules defined in [XHTML Basic], such as the basic XHTML forms and tables modules. Added to the XHTML Basic modules are the following modules:
The notation, terms and document conventions used here are borrowed from [XHTML 1.1].
The profile includes the XHTML basic module defined in [XHTML Basic], the XHTML scripting module defined in [XHTML 1.1], the XML Events module defined in [XML Events], the XHTML+Voice extension module defined in the XHTML+Voice Extension Module, and the following VoiceXML 2.0 modules:
The namespaces used in these modules are as follows:
| Element | Content | Attributes |
|---|---|---|
| Base Module (XHTML) | ||
| base | EMPTY | href* (URI) |
| Basic Forms Module (XHTML) | ||
| form | Heading | Block - form | Common, action* (URI), method ("get"* | "post"), enctype (ContentType) |
| input | EMPTY | Common, Access, checked ("checked"), maxlength (Number), name (CDATA), size (Number), src (URI), type ("text"* | "password" | "checkbox" | "radio" | "submit" | "reset" | "hidden" ), value (CDATA) |
| label | (PCDATA | Inline - label)* | Common, accesskey (Character), for (IDREF) |
| select | option+ | Common, multiple ("multiple"), name (CDATA), size (Number) |
| option | PCDATA | Common, selected ("selected"), value (CDATA) |
| textarea | PCDATA | Common, Access, cols* (Number), name (CDATA), rows* (Number) |
| Basic Tables Module (XHTML) | ||
| caption | (PCDATA | Inline)* | Common |
| table | caption?, tr+ | Common, summary (Text), width (Length ) |
| td | (PCDATA | Flow - table)* | Common, Cell, Align |
| th | (PCDATA | Flow - table)* | Common, Cell, Align |
| tr | td+ | Common, Align |
| Enumeration Module (VoiceXML) | ||
| enumerate | (Audio | TTS)* | - |
| Events Module (VoiceXML) | ||
| catch | Exec | VoiceHandler, event (NMTOKENS) |
| help | Exec | VoiceHandler |
| noinput | Exec | VoiceHandler |
| nomatch | Exec | VoiceHandler |
| error | Exec | VoiceHandler |
| throw | EMPTY | VoiceHandler, event (NMTOKEN), eventexpr (Script), message (CDATA), messageexpr (Script) |
| Executable Statements Module (VoiceXML) | ||
| assign | EMPTY | Expr |
| clear | EMPTY | namelist (CDATA) |
| var | EMPTY | Expr |
| log | (PCDATA | value)* | label (CDATA), expr (Script) |
| reprompt | EMPTY | - |
| Filled Module (VoiceXML) | ||
| filled | (Exec)* | mode("any" | "all"*), namelist (CDATA) |
| Flow Control Module (VoiceXML) | ||
| if | (Exec | elseif | else)* | cond (Script) |
| else | EMPTY | - |
| elseif | EMPTY | cond (Script) |
| return | EMPTY | namelist (CDATA), event (NMTOKEN), eventexpr (Script), message (CDATA), messageexpr (Script) |
| Forms Module (VoiceXML) | ||
| form | (Form)* | id (ID) |
| field | (Audio | EventHandler | filled | enumerate | grammar | link | vxml:option | prompt | property)* | Item, type (GrammarType), slot (NMTOKEN ), modal (Boolean), xv:id (ID) |
| record | (Audio | EventHandler | filled | grammar | prompt | property)* | Item, type (ContentType), beep (Boolean), maxtime (Duration), modal (Boolean), dtmfterm (Boolean), finalsilence (Duration) |
| subdialog | (Audio | filled | param | prompt | property)* | Item, Cache, Submit, src (URI), srcexpr (Script), fetchaudio (URI) |
| block | Exec | Item |
| initial | (Audio | EventHandler | link | prompt | property)* | Item |
| Hypertext Module (XHTML) | ||
| a | (PCDATA | Inline - a)* | Common, Access, Linking, hreflang (LanguageCode) |
| Image Module (XHTML) | ||
| img | EMPTY | Common, Dim, alt* (Text), longdesc (URI), src* (URI) |
| Link Module (XHTML) | ||
| link | EMPTY | Linking , media (MediaDesc) |
| List Module (XHTML) | ||
| dl | (dd | dt)+ | Common |
| dt | (PCDATA | Inline)* | Common |
| dd | (PCDATA | Flow)* | Common |
| ol | li+ | Common |
| ul | li+ | Common |
| li | (PCDATA |Flow)* | Common |
| Metainformation Module (XHTML) | ||
| meta | EMPTY | I18N, content* (CDATA), http-equiv (NMTOKEN), name (NMTOKEN), scheme (CDATA) |
| Object Module (XHTML) | ||
| object | (PCDATA | Flow | param)* | Common, Dim, archive (URI), classid (URI), codebase (URI), codetype (ContentType), data (URI), declare ("declare"), name (CDATA), standby (Text), tabindex (Number), type (ContentType) |
| param | EMPTY | id (IDREF), name* (CDATA), type (ContentType), value (CDATA), valuetype ("data"* | "ref" | "object") |
| Option Module (VoiceXML) | ||
| vxml:option | PCDATA | dtmf (CDATA), value (CDATA) |
| Output Module (VoiceXML) | ||
| prompt | (Audio | TTS | lexicon | meta | metadata)* | I18N, VoiceHandler, bargein (Boolean), bargeintype ("speech" | "hotword"), timeout (Duration), xml:base (URI), version ("1.0"), xv:src (URI), xv:expr (CDATA) |
| value | EMPTY | expr (Script) |
| audio | (Audio | TTS | desc)* | Cache, src (URI), expr (Script) |
| desc | PCDATA | xml:lang (NMTOKEN) |
| lexicon | EMPTY | uri (URI), type (ContentType) |
| emphasis | SentenceContent | level ("strong" | "moderate"* | "none" | "reduced") |
| voice | (SentenceContent | Structure)* | I18N, gender ("male" | "female" | "neutral"), age (Number), variant (Number), name (CDATA) |
| break | EMPTY | strength ("x-weak" | "weak" | "medium"* | "strong" | "x-strong" | "none"), time (Duration) |
| prosody | (SentenceContent | Structure)* | pitch (CDATA), contour (CDATA), range (CDATA), rate (CDATA), duration (Duration), volume (CDATA) |
| say-as | (PCDATA | value)* | interpret-as (NMTOKEN), format (NMTOKEN), detail (CDATA) |
| meta | EMPTY | name (NMTOKEN), content (CDATA), http-equiv (NMTOKEN) |
| metadata | ANY | |
| phoneme | PCDATA | ph (CDATA), alphabet (CDATA) |
| p | (SentenceContent | s)* | I18N |
| s | SentenceContent | I18N |
| sub | PCDATA | alias (CDATA) |
| mark | EMPTY | name (CDATA) |
| Resources Module (VoiceXML) | ||
| param | EMPTY | Expr, value (CDATA), valuetype ("data"* | "ref"), type (CDATA) |
| property | EMPTY | name (NMTOKEN), value (CDATA) |
| Scripting Module (XHTML) | ||
| script | PCDATA | charset (CharSet), defer ("defer"), src (URI), type* (ContentType), xml:space="preserve", declare ("declare") |
| noscript | (Heading | Block | List)+ | Common |
| Structure Module (XHTML) | ||
| body | (Heading | Block | List)* | Common |
| head | title, (meta | link | object | script | vxml:form | ev:listener | xv:sync | xv:cancel)* | I18N, profile (URI) |
| html | head, body | I18N, version (CDATA), xmlns (URI = "http://www.w3.org/1999/xhtml") |
| title | PCDATA | I18N |
| Text Module (XHTML) | ||
| abbr | (PCDATA | Inline)* | Common |
| acronym | (PCDATA | Inline)* | Common |
| address | (PCDATA | Inline)* | Common |
| blockquote | (PCDATA | Heading | Block | List)* | Common, cite (URI) |
| br | EMPTY | Core |
| cite | (PCDATA | Inline)* | Common |
| code | (PCDATA | Inline)* | Common |
| dfn | (PCDATA | Inline)* | Common |
| div | (PCDATA | Flow)* | Common |
| em | (PCDATA | Inline)* | Common |
| h1 | (PCDATA | Inline)* | Common |
| h2 | (PCDATA | Inline)* | Common |
| h3 | (PCDATA | Inline)* | Common |
| h4 | (PCDATA | Inline)* | Common |
| h5 | (PCDATA | Inline)* | Common |
| h6 | (PCDATA | Inline)* | Common |
| kbd | (PCDATA | Inline)* | Common |
| p | (PCDATA | Inline)* | Common |
| pre | (PCDATA | Inline)* | Common, xml:space="preserve" |
| q | (PCDATA | Inline)* | Common, cite (URI) |
| samp | (PCDATA | Inline)* | Common |
| span | (PCDATA | Inline)* | Common |
| strong | (PCDATA | Inline)* | Common |
| var | (PCDATA | Inline)* | Common |
| User Input Module (VoiceXML) | ||
| grammar | (PCDATA | meta | metadata | lexicon | tag | rule)* | Cache, I18N, version (NMTOKEN), root (IDREF), mode ("voice"* | "dtmf"), src (URI), scope ("document" | "dialog"), type (ContentType), weight (CDATA), tag-format (URI), xml:base (URI) |
| example | PCDATA | |
| lexicon | EMPTY | uri (URI), type (ContentType) |
| tag | PCDATA | |
| token | PCDATA | I18N |
| item | (RuleExpansion)* | I18N, weight (NMTOKEN), repeat (NMTOKEN), repeat-prob (NMTOKEN) |
| meta | EMPTY | name (NMTOKEN), content (CDATA), http-equiv (NMTOKEN) |
| metadata | ANY | |
| one-of | (item)+ | I18N |
| rule | (RuleExpansion | example)* | id (ID), scope ("private"* | "public") |
| ruleref | EMPTY | uri (URI), type (ContentType), special ("NULL" | "VOID" | "GARBAGE") |
| XML Events Module (XML Events) | ||
| listener | EMPTY | XEvents |
| XHTML+Voice Extension Module (XHTML+Voice) | ||
| sync | EMPTY | |
| cancel | EMPTY | |
| Elements | Attributes | |
| vxml:field& | id (ID) | |
| vxml:prompt& | src (URI) | expr (CDATA) | |
| Element Entities | Content |
|---|---|
| Audio (VoiceXML) | PCDATA | audio | value | enumerate |
| Block (XHTML) | address | blockquote | div | p | pre |
| EventHandler (VoiceXML) | catch | help | noinput | nomatch | error |
| Exec (VoiceXML) | Audio | assign | clear | if | log | prompt | reprompt | return | throw | var |
| Flow (XHTML) | Heading | List | Block | Inline |
| Form (VoiceXML) | EventHandler | grammar | filled | initial | property | record | subdialog | Variable |
| Heading (XHTML) | h1 | h2 | h3 | h4 | h5 | h6 |
| Inline (XHTML) | a | abbr | acronym | button | br | cite | code | dfn | em | img | input | kbd | label | object | q | samp | select | span | strong | textarea |
| RuleExpansion (VoiceXML) | PCDATA | token | ruleref | item | one-of | tag |
| SentenceContent (VoiceXML) | Audio | SentenceElements |
| SentenceElements (VoiceXML) | break | emphasis | phoneme | mark | prosody | say-as | voice | sub |
| Structure (VoiceXML) | s | p |
| TTS (VoiceXML) | SentenceElements | Structure |
| Variable (VoiceXML) | block | field | var |
| Attribute Entities | Content |
|---|---|
| Access (XHTML) | accesskey (Character), tabindex (Number) |
| Align (XHTML) | align ("left" | "center" | "right"), valign ("top" | "middle" | "bottom") |
| Cache (VoiceXML) | fetchhint ("prefetch" | "safe"), fetchtimeout (Duration), maxage (Number), maxstale (Number) |
| Cell (XHTML) | abbr (Text), axis (CDATA), colspan (Number), headers (IDREFS), rowspan (Number), scope ("row" | "col") |
| Common (XHTML) | Core, Events, XEvents |
| Core (XHTML) | class (NMTOKENS), id (ID), title (CDATA) |
| Dim (XHTML) | height (Length), width (Length) |
| Events (XHTML) | MouseEvents, KeyEvents |
| Expr (VoiceXML) | name (VarName), expr (Script) |
| I18N (XML) | xml:lang (NMTOKEN) |
| Item (VoiceXML) | name (VarName), cond (Script), expr (Script) |
| KeyEvents (XHTML) | onkeypress (Script), onkeydown (Script), onkeyup (Script) |
| Linking (XHTML) | charset (CharSet), href (URI), hreflang (LanguageCode), rel (LinkTypes), rev (LinkTypes), type (ContentType) |
| MouseEvents (XHTML) | onclick (Script), ondblclick (Script), onmousedown (Script), onmouseover (Script), onmousemove (Script), onmouseout (Script) |
| Style (XHTML) | style (CDATA) |
| VoiceHandler (VoiceXML) | count (Number), cond (Script) |
| XEvents (XML Events) | event, observer (IDREF), handler (URI), target (IDREF), phase ("capture" | "default"*), propagate ("stop" | "continue"*), defaultAction ("cancel" | "perform"*), id |
| Attribute Type | Description |
|---|---|
| Boolean | "true" | "false" |
| Duration | A positive real number followed by either 's' (seconds) or 'ms' (milliseconds) |
| GrammarType | CDATA |
| VarName | NMTOKEN or NMTOKEN with "$" appended |
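The attribute types above can be illustrated with a short fragment. This is a minimal sketch, not normative text: the form ID "leave_message", the variable name "msg", and all attribute values are assumptions chosen for illustration. The attributes themselves are those listed for the record and prompt elements.

```xml
<vxml:form id="leave_message" xmlns:vxml="http://www.w3.org/2001/vxml">
  <!-- Boolean attributes take "true" or "false";
       Duration attributes take a positive real number followed by s or ms. -->
  <vxml:record name="msg" beep="true" dtmfterm="false"
               maxtime="10s" finalsilence="2500ms">
    <!-- timeout is a Duration; a fractional value is allowed -->
    <vxml:prompt timeout="1.5s">Please leave a message after the beep.</vxml:prompt>
  </vxml:record>
</vxml:form>
```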
XHTML+Voice extends XHTML with the XML Events <listener> element and its attributes. The <listener> attributes are added to XHTML elements primarily for activating voice handlers. The <listener> element and attributes belong to the XML Events namespace:
xmlns:ev="http://www.w3.org/2001/xml-events"
For a given XML language extended with XML Events, a set of event types must be specified independently of the [XML Events] module. The XML Events event types supported by the XHTML+Voice profile include all event types defined for [HTML 4.01] intrinsic events. A VoiceXML handler is activated by giving an XHTML element two XML Events attributes: an event attribute that names one of these event types, and a handler attribute that holds an ID reference to the VoiceXML form.
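The following sketch shows one way such an activation might look. It assumes a hypothetical voice handler with the ID "askdrink" and a hypothetical input named "drink"; neither name comes from this specification. The "focus" event type corresponds to the HTML 4.01 intrinsic event onfocus.

```xml
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:vxml="http://www.w3.org/2001/vxml">
  <head>
    <title>Voice Handler Activation</title>
    <!-- Hypothetical voice handler: a VoiceXML form identified by its id -->
    <vxml:form id="askdrink">
      <vxml:block>What would you like to drink?</vxml:block>
    </vxml:form>
  </head>
  <body>
    <!-- ev:event names the event type; ev:handler refers to the form by ID -->
    <p><input type="text" name="drink"
              ev:event="focus" ev:handler="#askdrink"/></p>
  </body>
</html>
```

In this sketch, the voice handler is expected to run when the input receives focus, because the XML Events attributes are placed directly on the observing element.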
The XHTML+Voice profile supports the following VoiceXML 2.0 event types: nomatch, noinput, error, and help. These event types are emitted to the XHTML container as the XHTML+Voice event types vxmlnomatch, vxmlnoinput, vxmlerror, and vxmlhelp, respectively. The VoiceXML exit and cancel event types are supported within the VoiceXML form but are not propagated to the visual browser. Application-defined event types, that is, event types defined by the author within VoiceXML, are also propagated to the visual browser. However, the VoiceXML <form> element does not support adding the XML Events attributes.
An additional XHTML+Voice event type, "vxmldone", is supported. The vxmldone event is generated when the voice handler completes.
The XHTML+Voice profile extends the XHTML <script> element with XML Events. The <script> element does not generate any events of its own, so the observer attribute is required: it names the node in the XHTML tree whose XML Events event the script observes. The <script> element can observe any HTML 4.01 intrinsic event or VoiceXML event. The following example shows a <script> element acting as the handler for a "vxmldone" event. The value of the XHTML input "drink" is updated when the voice handler "fid" completes:
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:ev="http://www.w3.org/2001/xml-events"
xmlns:vxml="http://www.w3.org/2001/vxml"
xmlns:xv="http://www.voicexml.org/2002/xhtml+voice" >
<head><title>Script Event Handler</title>
<script type="text/javascript"
ev:event="vxmldone" ev:observer="fid" declare="declare">
document.xform.drink.value = application.lastresult$[0].utterance;
&