10.3. Multimodal Interfaces

A multimodal interface is a system that combines two or more input modalities in a coordinated manner. Perceptual interfaces are inherently multimodal. In this section, we define more precisely what we mean by modes and channels, and discuss research in multimodal interfaces and how this relates to the more general concept of perceptual interfaces.

Humans interact with the world by way of information being sent and received, primarily through the five major senses of sight, hearing, touch, taste, and smell. A modality (informally, a mode) refers to a particular sense. A communication channel is a course or pathway through which information is transmitted. In typical HCI usage, a channel describes the interaction technique that utilizes a particular combination of user and computer communication—that is, the user output/computer input pair or the computer output/user input pair.[1] This can be based on a particular device, such as the keyboard channel or the mouse channel, or on a particular action, such as spoken language, written language, or dynamic gestures. In this view, the following are all channels: text (which may use multiple modalities when typing in text or reading text on a monitor), sound, speech recognition, images/video, and mouse pointing and clicking.

[1] Input means to the computer, output means from the computer.

Unfortunately, there is some ambiguity in the use of the word mode in HCI circles, as sometimes it means "modality" and at other times it means "channel." So, are multimodal interfaces "multimodality" or "multichannel"? Certainly, every command-line interface uses multiple modalities, as sight and touch (and sometimes sound) are vital to these systems. The same is true for GUIs, which in addition use multiple channels of keyboard text entry, mouse pointing and clicking, sound, images, and so on.

What, then, distinguishes multimodal interfaces from other HCI technologies? Research on multimodal interfaces focuses on integrating recognition-based input technologies (such as speech recognition, pen gesture recognition, and computer vision) into the user interface. The function of each technology is better thought of as a channel than as a sensing modality; hence, in our view, a multimodal interface is one that uses multiple modalities to implement multiple channels of communication. Using multiple modalities to produce a single interface channel (e.g., vision and sound to produce 3D user location) is multisensor fusion, not a multimodal interface. Similarly, using a single modality to produce multiple channels (e.g., a left-hand mouse to navigate and a right-hand mouse to select) is a multichannel (or multidevice) interface, not a multimodal interface.
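To make the distinction concrete, the following sketch contrasts the three cases with purely illustrative type signatures; every function and type name here is hypothetical and not drawn from any particular system:

```python
# Multisensor fusion: multiple modalities combined into ONE interface channel
# (e.g., vision plus sound used to estimate the user's 3D location).
def locate_user(video_frame: bytes, audio_frame: bytes) -> tuple[float, float, float]:
    ...  # fuse both signals into a single location estimate (one channel)

# Multichannel (multidevice) interface: ONE modality driving several channels
# (e.g., a left-hand mouse for navigation, a right-hand mouse for selection).
def on_left_mouse(event) -> None:
    ...  # navigation channel

def on_right_mouse(event) -> None:
    ...  # selection channel

# Multimodal interface: multiple modalities, each implementing its own
# channel of communication (e.g., spoken commands plus pen gestures).
def on_speech(utterance: str) -> None:
    ...  # spoken-command channel

def on_pen_gesture(stroke: list) -> None:
    ...  # pen-gesture channel
```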

An early prototypical multimodal interface was the "Put That There" system demonstrated at MIT in the early 1980s [10]. In this system, the user communicated via speech and pointing gestures in a "media room." The gestures served to disambiguate the speech (Which object does the word "this" refer to? What location is meant by "there"?) and effected other direct interactions with the system. More recently, the QuickSet architecture [18] is a good example of a multimodal system using speech and pen-based gesture to interact with map-based and 3D visualization systems. QuickSet is a wireless, handheld, agent-based, collaborative multimodal system for interacting with distributed applications. The system analyzes continuous speech and pen gesture in real time and produces a joint semantic interpretation using a statistical unification-based approach. The system supports unimodal speech or gesture as well as multimodal input.
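The core idea behind "Put That There"-style integration, using gesture to resolve deictic words in the speech stream, can be sketched as follows. This is a minimal illustration under simplifying assumptions (a single pointing stream and nearest-neighbor pairing in time); the names and the one-second alignment window are hypothetical, not taken from the MIT system or from QuickSet:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    time: float          # seconds from the start of the interaction

@dataclass
class PointingEvent:
    target: str          # object id or map coordinate under the pointer
    time: float

DEICTIC = {"this", "that", "here", "there"}

def resolve_deictics(words, points, max_skew=1.0):
    """Pair each deictic word with the pointing event closest to it in time.

    A word is left unresolved if no pointing event falls within max_skew
    seconds of it.
    """
    bindings = {}
    for w in words:
        if w.text.lower() not in DEICTIC:
            continue
        nearest = min(points, key=lambda p: abs(p.time - w.time), default=None)
        if nearest is not None and abs(nearest.time - w.time) <= max_skew:
            bindings[f"{w.text}@{w.time:.1f}s"] = nearest.target
    return bindings

# "Put that there": the two pointing gestures supply the referents.
words = [Word("put", 0.2), Word("that", 0.6), Word("there", 1.4)]
points = [PointingEvent("blue_square", 0.7), PointingEvent("(120, 45)", 1.5)]
print(resolve_deictics(words, points))
# {'that@0.6s': 'blue_square', 'there@1.4s': '(120, 45)'}
```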

Multimodal systems and architectures vary along several key dimensions or characteristics, including

- Number and type of input modalities;

- Number and type of communication channels;

- Ability to use modes in parallel, serially, or both;

- Size and type of recognition vocabularies;

- Methods of sensor and channel integration;

- Kinds of applications supported.

There are many potential advantages of multimodal interfaces, including the following [101]:

- They permit the flexible use of input modes, including alternation and integrated use.

- They support improved efficiency, especially when manipulating graphical information.

- They can support shorter and simpler speech utterances than a speech-only interface, which results in fewer disfluencies and more robust speech recognition.

- They can support greater precision of spatial information than a speech-only interface, since pen input can be quite precise.

- They give users alternatives in their interaction techniques.

- They lead to enhanced error avoidance and ease of error resolution.

- They accommodate a wider range of users, tasks, and environmental situations.

- They are adaptable during continuously changing environmental conditions.

- They accommodate individual differences, such as permanent or temporary handicaps.

- They can help prevent overuse of any individual mode during extended computer usage.

Oviatt and Cohen and their colleagues at the Oregon Health and Science University (formerly Oregon Graduate Institute) have been at the forefront of multimodal interface research, building and analyzing multimodal systems over a number of years for a variety of applications. Oviatt's "Ten Myths of Multimodal Interaction" [100] are enlightening for anyone trying to understand the area. We list each of Oviatt's myths below, followed by our accompanying comments:

Myth 1. If you build a multimodal system, users will interact multimodally. In fact, users tend to intermix unimodal and multimodal interactions; multimodal interactions are often predictable based on the type of action being performed.

Myth 2. Speech and pointing is the dominant multimodal integration pattern. This is only one of many interaction combinations, comprising perhaps 14% of all spontaneous multimodal utterances.

Myth 3. Multimodal input involves simultaneous signals. Multimodal signals often do not co-occur temporally.

Myth 4. Speech is the primary input mode in any multimodal system that includes it. Speech is not the exclusive carrier of important content in multimodal systems, nor does it necessarily have temporal precedence over other input modes.

Myth 5. Multimodal language does not differ linguistically from unimodal language. Multimodal language is different, and often much simplified, compared with unimodal language.

Myth 6. Multimodal integration involves redundancy of content between modes. Complementarity of content is probably more significant in multimodal systems than is redundancy.

Myth 7. Individual error-prone recognition technologies combine multimodally to produce even greater unreliability. In a flexible multimodal interface, people figure out how to use the available input modes effectively; in addition, there can be mutual disambiguation of signals that also contributes to a higher level of robustness.

Myth 8. All users' multimodal commands are integrated in a uniform way. Different users may have different dominant integration patterns.

Myth 9. Different input modes are capable of transmitting comparable content. Different modes vary in the type and content of their information, their functionality, the ways they are integrated, and their suitability for multimodal integration.

Myth 10. Enhanced efficiency is the main advantage of multimodal systems. While multimodal systems may increase efficiency, this may not always be the case. The advantages may reside elsewhere, such as in decreased errors, increased flexibility, or increased user satisfaction.

A technical key to multimodal interfaces is the specific integration level and technique(s) used. Integration of multiple sources of information is generally characterized as "early," "late," or somewhere in between. In early integration (or feature fusion), the raw data from multiple sources (or data that has been processed somewhat, perhaps into component features) are combined, and recognition or classification proceeds in the multidimensional space. In late integration (or semantic fusion), individual sensor channels are processed through some level of classification before the results are integrated. Figure 10.3 shows a view of these alternatives. In practice, integration schemes may combine elements of early and late integration, or even do both in parallel.

Figure 10.3. (a) Early integration: fusion at the feature level. (b) Late integration: fusion at the semantic level.
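Very schematically, the two alternatives in Figure 10.3 differ only in where the per-modality signals are merged. The sketch below assumes generic, pre-trained classifier callables and equal channel weights; these are hypothetical placeholders rather than any particular system's components:

```python
# Early integration (feature fusion): concatenate the per-modality feature
# vectors and classify once in the joint, multidimensional feature space.
def early_fusion(audio_feats, gesture_feats, joint_classifier):
    joint_vector = list(audio_feats) + list(gesture_feats)
    return joint_classifier(joint_vector)              # single joint decision

# Late integration (semantic fusion): classify each channel on its own,
# then merge the per-channel hypotheses (here, by weighted score averaging).
def late_fusion(audio_feats, gesture_feats,
                speech_recognizer, gesture_recognizer,
                w_speech=0.5, w_gesture=0.5):
    speech_scores = speech_recognizer(audio_feats)      # {label: score}
    gesture_scores = gesture_recognizer(gesture_feats)  # {label: score}
    labels = set(speech_scores) | set(gesture_scores)
    combined = {label: w_speech * speech_scores.get(label, 0.0)
                       + w_gesture * gesture_scores.get(label, 0.0)
                for label in labels}
    return max(combined, key=combined.get)              # best joint hypothesis
```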


There are advantages to using late, semantic integration of multiple modalities in multimodal systems. For example, the input types can be recognized independently and therefore do not have to occur simultaneously. The training requirements are also smaller: on the order of O(2N) for two separately trained modes, as opposed to O(N²) for the two modes trained together. The software development process is simpler in the late-integration case as well, as exemplified by the QuickSet architecture [148]. QuickSet uses temporal and semantic filtering, unification as the fundamental integration technique, and a statistical ranking to decide among multiple consistent interpretations.
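As a toy illustration of this style of late integration (in the spirit of, but not reproducing, QuickSet's actual implementation), the sketch below represents each recognizer hypothesis as a partial feature structure with a confidence score and a timestamp. A temporal filter discards implausible pairings, unification merges semantically compatible structures, and the surviving joint interpretations are ranked by combined score. The data structures and the four-second alignment window are assumptions made for the example:

```python
from itertools import product

def unify(a, b):
    """Merge two partial feature structures; fail on any conflicting slot."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None                      # conflicting values: no unifier
        merged[key] = value
    return merged

def integrate(speech_hyps, gesture_hyps, max_skew=4.0):
    """Each hypothesis is a (feature_struct, score, timestamp) triple."""
    joint = []
    for (s, s_score, s_t), (g, g_score, g_t) in product(speech_hyps, gesture_hyps):
        if abs(s_t - g_t) > max_skew:        # temporal filter
            continue
        merged = unify(s, g)                 # semantic filter via unification
        if merged is not None:
            joint.append((merged, s_score * g_score))
    return sorted(joint, key=lambda js: js[1], reverse=True)   # ranked list

# Spoken "create a landing zone" plus a pen gesture on the map:
speech = [({"act": "create", "object": "landing_zone"}, 0.8, 10.2)]
gesture = [({"location": (47.61, -122.33)}, 0.9, 10.5),
           ({"act": "delete"}, 0.3, 10.4)]
print(integrate(speech, gesture)[0])
# best joint interpretation: create landing_zone at (47.61, -122.33), score ≈ 0.72
```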

Multimodal interface systems have used a number of non-traditional modes and technologies. Some of the most common are the following:

- Speech recognition. Speech recognition has a long history of research and commercial deployment, and has been a popular component of multimodal systems for obvious reasons. Speech is a very important and flexible communication modality for humans and is much more natural than typing or any other way of expressing particular words, phrases, and longer utterances. Despite the decades of research in speech recognition and over a decade of commercially available speech recognition products, the technology is still far from perfect, due to the size, complexity, and subtlety of language; the limitations of microphone technology; the plethora of disfluencies in natural speech; and problems of noisy environments. Systems using speech recognition have to be able to recover from the inevitable errors produced by the system.

- Language understanding. Natural language processing attempts to model and understand human language, whether spoken or written. In multimodal interfaces, language understanding may go hand in hand with speech recognition (together forming a "speech understanding" component), or it may be separate, processing the user's typed or handwritten input. Typically, the more a system incorporates natural language, the more users will expect sophisticated semantic understanding from the system. Current systems are unable to deal with completely unconstrained language, but can do quite well with limited vocabularies and subject matter. Allowing for user feedback to clarify and disambiguate language input can help language understanding systems significantly.

- Pen-based gesture. Pen-based gesture has been popular in part because of computer form factors (PDAs and tablet computers) that include a pen or stylus as a primary input device. Pen input is particularly useful for deictic (pointing) gestures, defining lines, contours, and areas, and specially defined gesture commands (e.g., minimizing a window by drawing a large M on the screen). Pen-based systems are quite useful in mobile computing, where a small computer can be carried, but a keyboard is impractical.

- Sensors (such as magnetic and inertial) for body tracking. Sturman's 1991 thesis [131] thoroughly documented the early use of sensors worn on the hand for input to interactive systems. Magnetic tracking sensors such as the Ascension Flock of Birds[2] product, various instrumented gloves, and sensor- or marker-based motion capture devices have been used in multimodal interfaces, particularly in immersive environments (e.g., see [50]).

[2] http://www.ascension-tech.com

- Nonspeech sound. Nonspeech sounds have traditionally been used in HCI to provide signals to the user: for example, warnings, alarms, and status information. (Ironically, one of the most useful sounds for computer users is rather serendipitous: the noise made by many hard drives that lets a user know that the machine is still computing rather than hung.) However, nonspeech sound can also be a useful input channel, as sounds made by users can be meaningful events in human-to-human communication—utterances such as "uh-huh" used in backchannel communication (communication events that occur in the background of an interaction rather than being the main focus), a laugh, a sigh, or a clapping of hands.

- Haptic input and force feedback. Haptic, or touch-based, input devices measure pressure, velocity, and location—essentially perceiving aspects of a user's manipulative and explorative manual actions. These can be integrated into existing devices (e.g., keyboards and mice that know when they are being touched, and possibly by whom). Or they can exist as standalone devices, such as the well-known PHANTOM device by SensAble Technologies, Inc.[3] (see Figure 10.4), or the DELTA device by Force Dimension.[4] These and most other haptic devices integrate force feedback and allow the user to experience the "touch and feel" of simulated artifacts as if they were real. Through the medium of a handheld stylus or probe, haptic exploration can now be met with simulated feedback, including the rigid boundaries of virtual objects, soft tissue, and surface texture properties. A tempting goal is to simulate all haptic experiences and to be able to recreate objects with all their physical properties in virtual worlds so they can be touched and handled in a natural way. The tremendous dexterity of the human hand makes this very difficult. Yet astonishing results can already be achieved, for example with the CyberForce device, which can produce forces on each finger and the entire arm. The same company, Immersion Corp.,[5] also supplies iDrive, a hybrid rotary-knob and joystick input interface to the onboard computers of BMW's flagship cars. This is the first attempt outside the gaming industry to bring haptic and force-feedback interfaces to the general consumer.

[3] http://www.sensable.com

[4] http://www.forcedimension.com

[5] http://www.immersion.com

Figure 10.4. SensAble Technologies, Inc. PHANTOM haptic input/output device (reprinted with permission).

- Computer vision. Computer vision has many advantages as an input modality for multimodal or perceptual interfaces. Visual information is clearly important in human-human communication, as meaningful information is conveyed through identity, facial expression, posture, gestures, and other visually observable cues. Sensing and perceiving these visual cues from video cameras appropriately placed in the environment is the domain of computer vision. The following section describes relevant computer vision technologies in more detail.
