A multimodal interface is a system that combines two or more input modalities in a coordinated manner. Perceptual interfaces are inherently multimodal. In this section, we define more precisely what we mean by modes and channels, and discuss research in multimodal interfaces and how it relates to the more general concept of perceptual interfaces.
Humans interact with the world by way of information being sent and received, primarily through the five major senses of sight, hearing, touch, taste, and smell. A modality (informally, a mode) refers to a particular sense. A communication channel is a course or pathway through which information is transmitted. In typical HCI usage, a channel describes the interaction technique that utilizes a particular combination of user and computer communication—that is, the user output/computer input pair or the computer output/user input pair.[1] This can be based on a particular device, such as the keyboard channel or the mouse channel, or on a particular action, such as spoken language, written language, or dynamic gestures. In this view, the following are all channels: text (which may use multiple modalities when typing in text or reading text on a monitor), sound, speech recognition, images/video, and mouse pointing and clicking.
[1] Here, input means input to the computer and output means output from the computer.
Unfortunately, there is some ambiguity in the use of the word mode in HCI circles, as sometimes it means "modality" and at other times it means "channel." So, are multimodal interfaces "multimodality" or "multichannel"? Certainly, every command-line interface uses multiple modalities, as sight and touch (and sometimes sound) are vital to these systems. The same is true for GUIs, which in addition use multiple channels of keyboard text entry, mouse pointing and clicking, sound, images, and so on.
What, then, distinguishes multimodal interfaces from other HCI technologies? As a research field, multimodal interfaces focus on integrating sensor recognition-based input technologies, such as speech recognition, pen gesture recognition, and computer vision, into the user interface. The function of each technology is better thought of as a channel than as a sensing modality; hence, in our view, a multimodal interface is one that uses multiple modalities to implement multiple channels of communication. Using multiple modalities to produce a single interface channel (e.g., vision and sound to produce 3D user location) is multisensor fusion, not a multimodal interface. Similarly, using a single modality to produce multiple channels (e.g., a left-hand mouse to navigate and a right-hand mouse to select) is a multichannel (or multidevice) interface, not a multimodal interface.
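To make this taxonomy concrete, here is a minimal sketch (our own illustration, not drawn from the literature) that classifies an interface, described as a set of (modality, channel) pairs, according to the definitions above:

```python
# Illustrative sketch (ours): classify an interface described as a set
# of (modality, channel) pairs according to the taxonomy above.

def classify_interface(pairs):
    """pairs: iterable of (modality, channel) tuples."""
    modalities = {m for m, _ in pairs}
    channels = {c for _, c in pairs}
    if len(modalities) > 1 and len(channels) > 1:
        return "multimodal interface"      # multiple modalities, multiple channels
    if len(modalities) > 1:
        return "multisensor fusion"        # multiple modalities, one channel
    if len(channels) > 1:
        return "multichannel interface"    # one modality, multiple channels
    return "unimodal, single-channel interface"

# Vision and sound fused into a single user-location channel:
print(classify_interface([("vision", "3D user location"),
                          ("sound", "3D user location")]))   # multisensor fusion
# Two mice, one modality (hand motion), two channels:
print(classify_interface([("hand motion", "navigate"),
                          ("hand motion", "select")]))       # multichannel interface
# Speech plus pen gesture:
print(classify_interface([("speech", "spoken commands"),
                          ("pen", "gesture")]))              # multimodal interface
```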
An early prototypical multimodal interface was the "Put That There" system demonstrated at MIT in the early 1980s [10]. In this system, the user communicated via speech and pointing gestures in a "media room." The gestures served to disambiguate the speech (Which object does the word "this" refer to? What location is meant by "there"?) and effected other direct interactions with the system. More recently, the QuickSet architecture [18] is a good example of a multimodal system using speech and pen-based gesture to interact with map-based and 3D visualization systems. QuickSet is a wireless, handheld, agent-based, collaborative multimodal system for interacting with distributed applications. The system analyzes continuous speech and pen gestures in real time and produces a joint semantic interpretation using a statistical unification-based approach. The system supports unimodal speech or gesture as well as multimodal input.
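To illustrate the kind of integration "Put That There" performed, the following sketch (our own simplification; the data structures and time window are invented) binds deictic words in a recognized utterance to time-stamped pointing events:

```python
# Minimal sketch (our own construction, not the actual MIT system):
# resolve deictic words ("that", "there") in a recognized utterance
# against time-stamped pointing events.

def resolve_deixis(words, pointing_events, window=1.0):
    """words: list of (word, time); pointing_events: list of (time, target).
    Each deictic word is bound to the pointing event nearest in time,
    provided it falls within `window` seconds."""
    resolved = []
    for word, t in words:
        if word in ("this", "that", "here", "there"):
            dt, target = min((abs(t - pt), tgt) for pt, tgt in pointing_events)
            resolved.append(target if dt <= window else word)
        else:
            resolved.append(word)
    return resolved

# "Put that there" accompanied by two pointing gestures:
utterance = [("put", 0.0), ("that", 0.4), ("there", 1.2)]
pointing = [(0.5, "blue square"), (1.3, "upper left corner")]
print(resolve_deixis(utterance, pointing))
# ['put', 'blue square', 'upper left corner']
```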
Multimodal systems and architectures vary along several key dimensions or characteristics, including
- Number and type of input modalities;
- Number and type of communication channels;
- Ability to use modes in parallel, serially, or both;
- Size and type of recognition vocabularies;
- Methods of sensor and channel integration;
- Kinds of applications supported.
There are many potential advantages of multimodal interfaces, including the following [101]:
- They permit the flexible use of input modes, including alternation and integrated use.
- They support improved efficiency, especially when manipulating graphical information.
- They can support shorter and simpler speech utterances than a speech-only interface, which results in fewer disfluencies and more robust speech recognition.
- They can support greater precision of spatial information than a speech-only interface, since pen input can be quite precise.
- They give users alternatives in their interaction techniques.
- They lead to enhanced error avoidance and ease of error resolution.
- They accommodate a wider range of users, tasks, and environmental situations.
- They are adaptable during continuously changing environmental conditions.
- They accommodate individual differences, such as permanent or temporary handicaps.
- They can help prevent overuse of any individual mode during extended computer usage.
Oviatt and Cohen and their colleagues at the Oregon Health and Science University (formerly Oregon Graduate Institute) have been at the forefront of multimodal interface research, building and analyzing multimodal systems over a number of years for a variety of applications. Oviatt's "Ten Myths of Multimodal Interaction" [100] are enlightening for anyone trying to understand the area. We list Oviatt's myths in italics, with our accompanying comments:
Myth 1. If you build a multimodal system, users will interact multimodally. In fact, users tend to intermix unimodal and multimodal interactions; multimodal interactions are often predictable based on the type of action being performed.
Myth 2. Speech and pointing is the dominant multimodal integration pattern. This is only one of many interaction combinations, comprising perhaps 14% of all spontaneous multimodal utterances.
Myth 3. Multimodal input involves simultaneous signals. Multimodal signals often do not co-occur temporally.
Myth 4. Speech is the primary input mode in any multimodal system that includes it. Speech is not the exclusive carrier of important content in multimodal systems, nor does it necessarily have temporal precedence over other input modes.
Myth 5. Multimodal language does not differ linguistically from unimodal language. Multimodal language is different, and often much simplified, compared with unimodal language.
Myth 6. Multimodal integration involves redundancy of content between modes. Complementarity of content is probably more significant in multimodal systems than is redundancy.
Myth 7. Individual error-prone recognition technologies combine multimodally to produce even greater unreliability. In a flexible multimodal interface, people figure out how to use the available input modes effectively; in addition, there can be mutual disambiguation of signals that also contributes to a higher level of robustness.
Myth 8. All users' multimodal commands are integrated in a uniform way. Different users may have different dominant integration patterns.
Myth 9. Different input modes are capable of transmitting comparable content. Different modes vary in the type and content of their information, their functionality, the ways they are integrated, and their suitability for multimodal integration.
Myth 10. Enhanced efficiency is the main advantage of multimodal systems. While multimodal systems may increase efficiency, this may not always be the case. The advantages may reside elsewhere, such as in decreased errors, increased flexibility, or increased user satisfaction.
A key technical issue in multimodal interfaces is the level at which, and the technique(s) by which, information from the individual modes is integrated. Integration of multiple sources of information is generally characterized as "early," "late," or somewhere in between. In early integration (or feature fusion), the raw data from multiple sources (or data that has been processed somewhat, perhaps into component features) are combined, and recognition or classification proceeds in the joint multidimensional space. In late integration (or semantic fusion), individual sensor channels are processed through some level of classification before the results are integrated. Figure 10.3 shows these alternatives. In practice, integration schemes may combine elements of early and late integration, or even do both in parallel.
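The two extremes can be contrasted with a toy, self-contained example (our own; the nearest-centroid "recognizers" and the feature values are invented stand-ins for real classifiers):

```python
# Toy sketch (ours, not from the chapter): a nearest-centroid
# "recognizer" used two ways, contrasting early (feature-level)
# and late (decision-level) integration.

def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def classify(x, centroids):
    """Return the label whose centroid is closest to feature vector x."""
    return min(centroids, key=lambda lbl: sqdist(x, centroids[lbl]))

# Made-up per-channel features for one user command:
speech_x, gesture_x = [0.9, 0.1], [0.2, 0.8]

# Early integration (feature fusion): concatenate the raw features
# and classify once in the joint feature space.
joint = {"select": [1, 0, 0, 1], "move": [0, 1, 1, 0]}
print("early:", classify(speech_x + gesture_x, joint))    # early: select

# Late integration (semantic fusion): classify each channel on its
# own, then combine the per-channel evidence (here, summed distances).
speech_c = {"select": [1, 0], "move": [0, 1]}
gesture_c = {"select": [0, 1], "move": [1, 0]}
combined = {lbl: sqdist(speech_x, speech_c[lbl]) + sqdist(gesture_x, gesture_c[lbl])
            for lbl in ("select", "move")}
print("late:", min(combined, key=combined.get))           # late: select
```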
There are advantages to using late, semantic integration of multiple modalities in multimodal systems. For example, the input types can be recognized independently and therefore do not have to occur simultaneously. The training requirements are also smaller: O(2N) for two separately trained modes, as opposed to O(N²) for two modes trained jointly. The software development process is simpler as well in the late integration case, as exemplified by the QuickSet architecture [148]. QuickSet uses temporal and semantic filtering, unification as the fundamental integration technique, and a statistical ranking to decide among multiple consistent interpretations.
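The following sketch captures the flavor of unification-based semantic integration with statistical ranking (a simplification of our own, not QuickSet's actual implementation; the frames, slots, and scores are invented, and temporal filtering is omitted):

```python
# Simplified sketch in the spirit of unification-based semantic fusion
# (our own construction). Each recognizer emits partial "frames"
# (dicts) with a confidence score; two frames unify iff they agree on
# all shared slots.

def unify(a, b):
    """Merge two partial frames; return None if any shared slot conflicts."""
    if any(k in b and b[k] != v for k, v in a.items()):
        return None
    return {**a, **b}

def integrate(speech_hyps, gesture_hyps):
    """Cross-product unification with statistical ranking: keep every
    consistent speech+gesture pair, ranked by the product of scores."""
    joint = []
    for s_frame, s_score in speech_hyps:
        for g_frame, g_score in gesture_hyps:
            merged = unify(s_frame, g_frame)
            if merged is not None:
                joint.append((merged, s_score * g_score))
    return sorted(joint, key=lambda fs: fs[1], reverse=True)

speech = [({"action": "move", "object": "platoon"}, 0.8),
          ({"action": "remove", "object": "platoon"}, 0.2)]
gesture = [({"action": "move", "location": (41.2, -120.9)}, 0.7),
           ({"location": (41.3, -120.8)}, 0.3)]
print(integrate(speech, gesture)[0])
# Top-ranked joint interpretation: {'action': 'move', 'object': 'platoon',
# 'location': (41.2, -120.9)} with score 0.8 * 0.7 ≈ 0.56; the
# inconsistent "remove"/"move" pairing is rejected by unification.
```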
Multimodal interface systems have used a number of non-traditional modes and technologies. Some of the most common are the following: