Chapter 10. Form Factors and Configurations

TO PROTECT SPACE TRAVELERS from the 500-degree temperature swing, radiation, and the unforgiving void of space, the original astronaut suits went through countless concepts and prototypes. What many people might not know is who made the handmade suits worn by Neil Armstrong and Buzz Aldrin when they first stepped on the surface of the moon. The company Playtex, at the time most well known for creating women’s undergarments like girdles and bras, lent its expertise and seamstresses to that groundbreaking mission.

Each of the 21 layers of the suit served a specific purpose, from managing pressurization to wicking away body moisture and sweat. The suit itself was modular, which protected the integrity of the suit and its life support systems while also allowing the astronauts to put it on, take it off, and maintain it. In an open competition to design the suit, the unlikely team from ILC (originally the International Latex Corporation) beat two other manufacturers who had already done a great deal of engineering work with the space program. The comfort and physical dexterity of ILC's soft suit, combined with its ability to meet the environmental requirements of extravehicular activity, gave it the edge.1

While outer space is a much harsher context than the average personal device is designed for, the same principles apply. Form factors, or the physical specifications of devices and their components, allow a device to withstand its usage environment as well as provide usability and delight. Configurations, on the other hand, are the types of components and software used to enable functionality that meets performance requirements. Together, these two sets of specifications drive the physical experience and interface modes of a device.

Creating Multimodal Properties

Form factors and configurations have a great impact on each other. Form factors determine the overall physical design of the product, which influences how much space is available for the componentry inside. Configurations determine what kinds of physical capabilities are available. Buttons have to be reachable; speakers and microphones have to be positioned correctly. Overall, components like processors create heat, and batteries add weight, which must all be accounted for in the user experience. Physical requirements (like water resistance and scratch proofing) have technical attributes, and they also contribute to the overall style and impression of quality of the device.

Over the course of humanity's history, we have figured out quite a bit about the physical phenomena of the senses, and we use them as inspiration. Acoustic guitars use hollow bodies the way the hollows of our chest and lungs amplify the sound from our vocal cords. Reed instruments emulate the way our vocal cords vibrate when air is passed over them. Old ear trumpets magnified the cup shape of our ears to concentrate sound to a single focal point—our ear canal (see Figure 10-1). Similar inspiration exists for more complex devices. Mercury tilt switches are sensors that use the movement of mercury across electrical contacts to detect tilt. This works much like the vestibular system within the inner ear, which is responsible for balance and the ability to detect our own physical rotation. These capabilities are termed biomimetic, because they mimic solutions found in biology.

Figure 10-1. A Victorian-era ear trumpet and a modern guitar: both emulate the properties of human bodies and how they enhance hearing (Source of top image: Wellcome Images)

In some cases, technologies reverse the ability to sense as a way to create new sensations. Speakers can be the reverse of an eardrum, causing the vibration of a membrane to emit sounds rather than gathering them. RGB pixels work similarly to how our eyes detect light: the cones in our eyes are usually differentiated by sensitivity to long, medium, and short light waves, which do roughly correspond to red, green, and blue. This phenomenon is known as trichromatic vision. Technologies simply developed along similar lines as human ability (see Figure 10-2).

Figure 10-2. Speakers reverse the way ears work, and screen pixels roughly follow the ways the eye responds to light (Source speaker diagram: Iain; pixel: Prateek Karandikar. Both Creative Commons Share Alike)

Visual arts can physically simulate surface textures and shapes that reflect or refract light into our eyes in the same way that real physical objects do. Film aspect ratios, like CinemaScope, were meant to capture the full range of the visual field in order to create the most visually immersive experience possible. Different schools of thought on what should count as the visual field account for much of the variation among those ratios.

Other interface properties are simulated mathematically via software visual or audio effects, like global lighting in 3D animation and rendering, or echo and reverberation within sound. These kinds of effects take advantage of the physics and mathematics of the stimulus, familiar environments, and the abilities and limitations of our senses. Binaural sound recording physically re-creates the dampening of sound by the presence of our head between our ears, and how it affects the behavior of a sound wave traveling through and around it. Dolby Atmos re-creates this mathematically. Convolution reverb uses audio samples taken from one location and then applies that same reverberation to other audio. Now more than ever, interfaces can harness incredibly convincing simulations of physical experience (see Figure 10-3).

Figure 10-3. Digital interfaces now closely approximate the physical properties of experiences and tools (Source: Digital images created with Wetbrush by artist Daniela Flamm Jackson; painting detail by Fritz Welch)
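
To make the convolution reverb idea from the preceding paragraph concrete, here is a minimal sketch of the technique. It assumes NumPy and SciPy are installed, that both WAV files are mono, and that the file names are hypothetical placeholders; a production implementation would handle stereo channels, levels, and streaming.

    # Minimal convolution reverb sketch (illustrative; mono files, hypothetical names).
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    rate, dry = wavfile.read("dry_voice.wav")          # the sound to be placed "in" a space
    _, impulse = wavfile.read("impulse_response.wav")  # a clap or sweep recorded in that space

    # Convolving the dry signal with the impulse response applies the room's
    # echoes and decay to every sample of the original recording.
    wet = fftconvolve(dry.astype(np.float64), impulse.astype(np.float64))

    # Normalize to avoid clipping, then save the reverberant result.
    wet = wet / np.max(np.abs(wet))
    wavfile.write("wet_voice.wav", rate, (wet * 32767).astype(np.int16))

The impulse response captures how a particular space smears a single sharp sound over time; convolution simply applies that same smearing to any other recording.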

Many product interfaces now live entirely within the realm of physical simulation but still draw from the real physical controls of the mechanical age. Tapping is similar to pressing spring-loaded buttons. Slider and rocker switch elements are common to most interactive form design. Our onscreen keyboards are reproductions of mechanical typewriters (see Figure 10-4). The technology within many devices has been profoundly transformed, but our senses, minds, and bodies develop in the physical world, and that shapes our internal logic and understanding of interaction. It's not a surprise, then, that interfaces stay rooted in the physical world along with us.

Figure 10-4. Following our existing understanding of how things work means that interfaces continue interaction paradigms from earlier products (Source, Ericsson phone: Alexandre Dulaunoy)

Multimodal product design focuses less on any one of the traditional design disciplines like industrial, graphic, and interaction design. Instead, an interface is the sum of the physical, mechanical, and computational interaction elements and how they work together. There are also more ways to directly integrate digital technology into existing physical interfaces and environments. Blurring the lines between physical and digital experiences opens up many new doors within the realm of interaction design.

Configuring Interface Modes

There are three important experience factors when selecting the technologies for creating multimodal interfaces. The first two are the building blocks required for the mode and the dimensions and resolution required for each of those building blocks. Different technologies vary by characteristics including mode, resolution, power consumption, and input/output capabilities. The last, but definitely not least, is the level of focus that is comfortable and safe for the user within the experience. This comprises several factors: the learning curve, the level of attention needed, the urgency and complexity of the information, and the context of the user experience.

For example, televisions allow us to watch shows and films. At a very high level, the viewing experience is deeply immersive, and the quality of the images and sound is really important to creating that immersion. Most televisions now have a remote control, with some form of on-screen and audio feedback during usage. In this case, the input is primarily a low-resolution/low-dimension haptic mode (buttons). Two of the most important forms of image and sound processing are buffering and rendering. Buffering is the ability to store enough immediate video and audio information to ensure smooth playback, and rendering is deciding which video or audio information to play at any moment (as well as how to display any additional interfaces with it). These require memory for storing large files and a moderate amount of processing for rendering the UI and video together. The output, both for using the remote control and for watching films and shows, is a high-resolution/low-dimension audiovisual mode (screen and speakers). Within the context of watching television or movies, there is generally little additional need to support other modalities or shifts of focus within the experience. The screen is designed to dominate the visual field, and home speaker systems can be custom configured to the size and layout of the room.
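
The buffering idea can be illustrated with a toy sketch. This is not how any particular television implements playback; it simply shows the principle of holding a few seconds of decoded frames ahead of the renderer so that momentary hiccups in the incoming stream do not interrupt what the viewer sees.

    # Toy playback-buffer sketch (illustrative only).
    from collections import deque

    class PlaybackBuffer:
        """Holds decoded frames ahead of the renderer so playback stays smooth."""

        def __init__(self, target_frames=150):  # e.g., ~5 seconds at 30 fps
            self.frames = deque()
            self.target_frames = target_frames

        def enqueue(self, frame):
            """Called by the network/decoder side as data arrives."""
            self.frames.append(frame)

        def ready(self):
            """Playback starts (or resumes) only once enough frames are stored."""
            return len(self.frames) >= self.target_frames

        def next_frame(self):
            """Called by the renderer once per display refresh."""
            return self.frames.popleft() if self.frames else None  # None -> rebuffer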

The controls and dashboard of a car offer similar modes to watching television: the input is primarily haptic with supporting visual and audio modes, and the output is also audiovisual. The big difference is the context of experience. The driver of a car must maintain a high degree of visual, auditory, and haptic focus on their driving. In this case, driving a car is a low-resolution/high-dimension visuo-haptic mode (steering wheel, pedals, gearshift, signals, and various mechanical knobs and sliders for music and environmental controls) with supporting visual and audio modes (various gauges, warning lights and alerts, as well as certain external cues). The information has been highly prioritized and simplified to minimize distraction from the primary visual focus, allowing the driver to keep their eyes on the road. The instrument panel sits just below the field of vision, requiring only a small shift in visual direction to read and minimizing the amount of time a driver's eyes are off the road. The field of vision is dominated by the windshield and mirrors, allowing for maximum visibility of the driving context.

When creating interfaces, several technology characteristics tie back to supporting individual human modalities. In many cases, these technologies—especially hardware components—can serve multiple functions and sometimes even multiple modalities within the same device. Of course, there is a range within each type of technology. For example, both OLED and e-ink screens are used for visual modes, but the slow refresh rate of e-ink screens does not allow them to support motion graphics very well. Speakers can range in capability from the tinny beeps of a portable alarm clock to those configured to deliver Dolby Atmos spatialized sound in a movie theater, with subwoofers so powerful that moviegoers can both feel and hear the rumble.

The following tables list common technologies used to create interface modes. Visual and audio technologies are highly developed, and there is a wide array of haptic input technologies; haptic output, however, remains limited compared to human tactile and proprioceptive capabilities. Additional technologies can be used to create interactions when physical information lies outside the range of human perception.

Table 10-1. Visual mode technologies
MONITOR (INPUT)

Cameras

Varying resolutions, low dimension

A blend of optical sensor arrays and lens technologies, for single images or image sequences

Types of cameras:

  • Stereoscopic cameras create two streams of images or videos (this can also be simulated digitally).
  • Photography and video cameras capture images or film for human viewing.
  • Infrared (IR) cameras have unique properties, like showing human bruises that might be invisible on the skin surface.
  • Laser cameras are used for 3D imaging and can often do cool things like “see” around corners.

Optical/Electromagnetic Sensors

Low resolution, low dimension

Used to detect a specific range of the electromagnetic spectrum, they can be customized across a wide array of applications, like optical heart rate sensors, telescopes, microscopes, and radiation detectors.

ANALYZE

Optical character recognition (OCR)

Medium resolution, low dimension

Converting scanned images of typed or handwritten text into text files that can be edited, searched, or analyzed.

Machine vision

High resolution, low dimension

Services like Microsoft Computer Vision, Google Cloud Vision, IBM Vision Recognition, Cloud Sight, and Clarifai are used to automate object recognition and tagging.

3D Scanning

High resolution, low dimension

Used to create spatial models of physical objects and environments.

Motion tracking

Varying resolution, low dimension

Used to track physical movement, there are a number of different technologies depending on the level of resolution required.

Lidar

High resolution and dimensions

Uses pulsed lasers and sensors to measure distance in 3D mapping for a variety of uses including autonomous vehicle navigation systems.

DECIDE

Image recognition/tagging

High resolution, low dimensions

Automated object, content, and location information, often an extension of machine vision.

Autonomous navigation

High resolution and dimensions

Often blending multiple modes to allow autonomous movement of vehicles, robots, and drones.

RESPOND AND CONTROL (OUTPUT)

Screens

Varying resolutions and dimensions

Screens convert electrical measurements into pixels, and there are many different kinds.

Types of screens:

  • Liquid Crystal Display (LCD) screens are perhaps the most common type of screen across many different types of devices.
  • Organic Light Emitting Diodes (OLED) work without the backlight required by LCDs, are more energy efficient, produce deeper blacks, and can be bendable.
  • E Ink is the name for a range of screens that emulate the look of paper, and so are frequently used for e-readers like the Kindle, but also for watches and public signage. They consume very little power compared to other types of screens. Though lower resolution and slower to refresh, they look better in sunlight and cause less eyestrain.
  • Virtual Reality (VR) or Augmented Reality (AR) goggles such as the Oculus Rift and HTC Vive use bicameral screens that must render two video streams—one per eye—at higher than HD resolution to create a person's 3D field of vision.

Lights

Low resolution and dimension

An LED simply emits a specific color and brightness of light. Adding the restful breathing pattern of brightening and dimming poetically indicates the sleep mode of a MacBook. The Nike FuelBand used an array of LEDs to playfully create a sports-inspired, scoreboard-style text display instead of a pixel-based screen. The Amazon Echo uses its ring of LEDs to create directionality—it almost seems to look at the person speaking. Traffic lights use three different colors to control the flow of traffic.

Table 10-2. Audio mode technologies
MONITOR (INPUT)

Microphones

Varying resolutions, low dimension

Microphones convert mechanical waveforms into electrical measurements.

Types of microphones:

  • Stereophonic microphones create two streams of sound (this can also be simulated digitally).
  • Binaural microphones match the acoustic properties of head positioning and separation, to simulate human hearing.
  • Microphone arrays are used to track positioning or to create spatialization effects.
  • Micro-electro-mechanical systems (MEMS) microphones have a very small form factor, allowing them to be used for noise cancellation, hearing aids, and beam forming.

ANALYZE

Natural language processing and speech synthesis

High resolution, low dimension

Two technologies often paired to understand and recreate human speech.

Real-time translation

High resolution, low dimension

Increasingly used to allow human speech to be translated across languages.

Speech-to-text

High resolution, low dimension

Allows speech to be captured to text files.

Spatialization

Varying resolutions and dimensions

Processes audio streams from multiple microphones to triangulate location, which can be cross-referenced with GPS. It can also be used to position the playback of sounds across an array of speakers, as in movie theaters.

DECIDE

Noise reduction/cancellation

Varying resolutions and dimensions

Often a precursor to audio analysis, filtering unnecessary sound during analysis or playback.

RESPOND AND CONTROL (OUTPUT)

Speakers

Varying resolutions and dimensions

Speakers convert electrical measurements into mechanical waveforms. Different kinds of speakers, like tweeters, woofers, and sub-woofers are used to create different ranges of sound wavelengths. Parabolic speakers can focus sound to a specific location in space.

Table 10-3. Haptic mode technologies
MONITOR (INPUT)

Motion/location tracking

Varying resolutions, low dimension

A blend of sensors including accelerometers, gyroscopes, GPS, and barometers (for altitude), used singly, together, or in arrays to detect the movement and/or location of devices and their users.

Multitouch

Low resolution and dimensions

An array of touch sensors in a screen, cross-referenced against each other to measure the position, direction, movement, and speed of contact with the screen, usually with fingers or a stylus.

Force Touch

High resolution and dimensions

This Apple technology extended multitouch by combining a touch-sensitive screen with force-sensitive resistors to measure the pressure of contact.

Temperature

Low resolution and dimensions

Used to detect the temperature of objects or environments.

Fingerprints

High resolution, varying dimensions

Can be sensed using optical, ultrasonic, and capacitance sensors. The last method is generally deemed more secure and is used on phones by Apple, Samsung, HTC, LG, and others.

ANALYZE

AR and VR

High resolution and dimensions

Use specialized hardware and software services to map and render 3D environments, tracking our field of vision, gaze, and head movement.

DECIDE

Activity recognition

Low resolution, low dimension

Used with motion tracking to determine the activity of a user, like when fitness trackers can detect walking, running, and biking, or a smartphone can detect driving.

Location recognition

Low resolution and dimensions

Cross-references motion tracking with geographic information systems (GIS) to determine the specific location of a device or user.

RESPOND AND CONTROL (OUTPUT)

Haptic motors

Varying resolution, low dimension

A form of actuator used to create precise vibrations, useful when a user needs more tactile information, as in smartphones, game controllers, and power steering.

Speakers

Low resolution and dimension

Can sometimes be used to create tactile vibrations such as alerts on pagers and GPS devices.

Actuators

Varying resolution and dimensions


Encompass a range of technologies like motors, muscle wire, and materials that change their mechanical properties when exposed to energy. These are often used to create autonomous movement for driverless cars, robots, and functional prosthetics.

Mapping Modal Behaviors to Modal Technologies

While these technical variations are important in creating products that deliver multimodal experiences, they also mean that there are endless ways to merge digital and physical experiences. That said, a voice interface on a lawnmower is probably a really terrible idea—well, at least until we can make their motors quieter and their blades safer. Product modes should be selected to support the preferred human modalities within the activity.

Some human behaviors are thought of as unimodal, meaning that one sensory modality is sufficient for that specific activity. For example, reading is considered unimodal with vision, though it can be enhanced when blended with other visual behaviors, like spatial mapping, and with other modalities. This is why reading a physical book can allow for better comprehension and retention than reading from a screen: we use the physical layout of the book itself to reinforce the narrative and sequential structure of what we are reading. Listening to music is also considered unimodal, but our sense of rhythm almost magically syncs what we hear to our other modalities. Some musicians can feel how a song is sung in their throats or how it is played on an instrument with their fingers while listening to it. People very commonly tap out the beat with their feet. On the other hand, spatial orientation is very obviously multimodal. For most people, it isn't possible to figure out where you are within a space without both vision and proprioception. If a room goes completely dark, a person almost immediately freezes, because they cannot see where they are going.

There are many different ways to categorize form factors and configurations for multimodal experiences, but there isn't really a neat and tidy system. Products can vary across a wide range of form factors and technology configurations, even when they fulfill the same purpose. Our modalities are equally flexible. For example, when searching for an item in the pantry, we may use vision to see which bottle we are trying to grab. But for shelves above eye level, we may rely more on touch, recognizing the item by shape, weight, or the material of the bottle. We shift modalities in an instant, without thinking about it or remembering later that we did so. Comparative case studies highlight the effect of different multimodal combinations on a product's capabilities and overall experience.

Vision Dominant Activities

Because vision is our dominant sense—even within multimodalities—we have developed a great deal of technology for it. The most fundamental are of course our written languages and image-based technologies, including visual and film-based media. We have been writing and drawing for millennia—by definition, all of recorded human history. Vision is also distinctive because, with stereoscopic vision, we can sense both 2D and 3D forms. Along with hearing and smell, it is one of the abilities we rely on for perception across distance. This is why the technology used to create visual experiences runs a very wide gamut, across multiple types of engineering and design. In addition, our eyes, heads, and bodies move, allowing us to focus on specific visual depth ranges or objects within our visual field, to track objects moving across it, or to steady what we see when we are moving ourselves. We can change our view, perspective, and proximity to objects and environments.

Many of the modalities within vision provide sufficient information to support understanding and motor skills. This is why they can also become building blocks of multimodal behaviors. Many phenomena that we cannot directly experience are mapped to visual technologies, using learnable visual languages and systems, or specialized ones requiring a legend or key for the user. Vision can often be a deciding factor in how we apply our other sensory abilities as well. The sight of a vibrant red apple increases our desire to taste it. Seeing and hearing water boil in a pot warns us not to touch it. Seeing a big pile of garbage spilled across the street perhaps convinces us to hold our breath while we walk past it.

Immersive Activities: Screen-Based Experiences and VR

Screen-based activities are among the most visually immersive experiences available, requiring that we focus our attention away from the physical world around us and toward the events and images on the other side of a piece of glass. When we are looking at screens, they fill our visual field and focus, and we may not spend much time perceiving visual stimuli outside that frame. We are most often reading, observing, or examining the visual information in front of us. When we are watching a film, the experience is largely passive: we are simply observing a story unfold before us. Filmmaking has a great many techniques for maintaining visual focus and for drawing the eye into the frame, where the movements of a camera simulate the way we ourselves would look at an event unfolding. Combined with sound and, in some cases, haptic vibration through subwoofers, we are immersed in a story, empathizing with the emotions and decisions of the characters in front of us. In this case, a screen need only be of sufficient size, distance, and resolution to engage our visual field and focus.

Experiencing the story for ourselves takes things to a deeper level of physical embodiment, engaging our proprioceptive and, in some cases, our haptic modalities. Like many advanced interactive technologies, VR was developed largely for gaming, though other applications are being rapidly explored. Gaming extends the world-building and character techniques of filmmaking, creating characters that the user can easily step into and embody, becoming an actor in what we see. (It would be a bit more challenging to step into the body of a Pegasus with four legs and learn how to flap our wings, though perhaps really fun!) The goggles use small stereoscopic screens a short distance from our eyes, making it easier to fill our visual field and provide 3D vision. They block out visual stimuli external to that screen experience, forcing us to rely on the screens for our visual reality. Proprioception and vision validate each other for spatial orientation and balance. When they do not align, because of timing or because they are sensing two incompatible experiences, people can suffer from motion sickness or dizziness. Even in 2D film, unsteady camera movement can cause motion sickness. This is aggravated when, for example, a screen shows that we are moving while our bodies are seated. When playing a video game, the stationary edges of the screen anchor our spatial reality, offsetting that physical reaction. In VR, that spatial anchor is deliberately suppressed, and so a great deal of technology is used to anchor spatial reality to the virtual environment instead.

The speed and accuracy of movement tracking is essential to VR experiences. The Oculus Rift uses a great deal of mathematical computation to build 3D virtual worlds, but it takes just as much computational effort to keep our bodies from rejecting them. To prevent motion sickness and to accurately sync user movement to 3D rendering, the Oculus team turned to motion capture techniques more commonly used in the visual effects realm of filmmaking. The Constellation motion tracking system uses the position of infrared LEDs on the headset to map the movements of our heads quickly into its 3D world building, stabilizing the movement of the images in front of us.

Augmented or Auxiliary Activities: Visual Indicators for Peripheral Information

We already “augment” our physical realities with graphics and text that don’t require head-worn gear. LEDs, printed labels and icons, and small displays are very common in physical devices. They are often used as indicators of power level or of other factors, like connectivity on a router, or letting you know which elevator in a bank has arrived. Lately, the usage of LEDs has become very sophisticated, especially in how the housing and material around them are used. A common technique is called counterboring, where small depressions are molded or drilled into the back of the housing material. This allows the light to be seen and focused when it is on, while the surface of the housing remains visually uninterrupted when it is off. This is related to punch-cut screens, which only allow light to pass through certain areas, using predetermined shapes to spell out numbers, letters, or special characters, as in old digital clocks and crosswalk signs. Examples include the indicator lights on Apple laptops and the Areaware wooden clock. Jawbone created custom-shaped icons for its counterbores and used color to indicate the states of specific functionality.

In addition, color and animation play a strong role in creating visual systems or codes. “On” or “off” is a binary, usually mapped to the “yes” or “no” of some state. Colors can mean a few different things, if people learn them; this idea has been around at least since the traffic light, with its red, yellow, and green. Increasingly, some of the techniques that apply to LEDs can be applied to other light sources using “smart” technologies. One example is the Philips Hue, which can be used to emulate sunrise as a kind of wake-up alarm clock. Home lighting has always been an indicator of sorts. When a fuse blows, or there is a blackout, it’s always reassuring to see the lights come back on, which lets you know that the power is running again. It also lets others know that someone is home.
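
To give a sense of how such “smart” lights are driven, here is a rough sketch of a sunrise-style wake-up using the Hue bridge’s local REST API. The Hue app offers wake-up routines directly; this only illustrates the kind of call involved, and the bridge address, API username, and light ID below are placeholders.

    # Sketch of a sunrise-style fade using the Philips Hue bridge's local REST API.
    # Bridge IP, username, and light ID are hypothetical placeholders.
    import time
    import requests

    BRIDGE = "192.168.1.2"           # hypothetical bridge address
    USERNAME = "your-api-username"   # created by pressing the bridge's link button
    LIGHT_ID = 1

    def set_light(on=True, brightness=1, transition_ds=0):
        """Set on/off and brightness (1-254); transition time is in deciseconds."""
        url = f"http://{BRIDGE}/api/{USERNAME}/lights/{LIGHT_ID}/state"
        body = {"on": on, "bri": brightness, "transitiontime": transition_ds}
        requests.put(url, json=body, timeout=5)

    # Fade from very dim to full brightness over ~30 minutes, like a sunrise.
    set_light(on=True, brightness=1)
    for step in range(1, 31):
        set_light(brightness=int(step * 254 / 30), transition_ds=600)  # 60-second fades
        time.sleep(60)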

Augmented Reality Versus Augmented Products: Visual Arrays of Control and Choice

Control panels and dashboards have long been a very important part of our interactions. We use dashboards in cars, planes, and other forms of transport, but they are not just for heavy equipment. Artists use a palette to review the colors they have available for drawing an image—both with physical paint and in software such as Photoshop. We have signage to help us determine where in the grocery store to find a can of peas, or where to find a book in a library. In these cases, we are given an array of information that we can use to inform our focal action, whether it’s driving, drawing, or shopping. These kinds of visual arrays are most commonly used for activities that require a high level of focus or decision making. But the most important experience attribute of dashboards is really not how much information we can cram into them. It’s the fact that we can look away or ignore them, and then easily remember where to focus our eyes for a specific bit of information or control. The information is there on a need-to-know basis, and with repetition we develop effortless control over it. On a car dashboard, all the radio controls are in one place. All the heating and cooling controls are in another. After a time, we can use them without looking, because the spatial position and haptic information are sufficient.

With augmented reality technology, particularly for vision, that ability to look away is not as effortless, and there isn’t always a learning period. It can be difficult to apply the technology to complex activities correctly.

It may be better to use sensory substitution or to add extra steps to the focal experience, to prevent distraction. Splitting cognitive focus has limits, and may not be quickly repeatable or easily sustainable over long periods of time. This is why hands-free technologies are receiving increasing scrutiny. When we divide our attention across two activities, it’s not split 50/50. Our performance in both activities degrades significantly. The more activities we add or the more focus and cognition one activity requires, the more rapidly our performance on any of them deteriorates. People seem to know this to some extent. You wouldn’t try to explain a complex mathematical equation to someone who was driving, and expect them to understand or remember later—especially in difficult driving conditions.

Over time, however, the gap between 2D and 3D screen-based media will be bridged, allowing a more contiguous experience between traditional screen-based UIs, AR, VR, and the physical world.

Automated Visual Capabilities

One of the more interesting aspects of recent technology is the re-creation of vision through machine learning. People can do a great deal of visual analysis, drawing much important information from sight alone. We can recognize who a person is, how they are feeling, and whether they are paying attention to us or are distracted. We can estimate how fast a car is moving toward us, how many people are in that car, and most importantly, whether we need to get out of the way. And we can do all of this at the same time, to a certain extent.

This ability to analyze what we see and to parse it into meaningful information is a bit more challenging for technology. There isn’t one specific technology that can encompass all of human visual ability. Instead, it’s broken across several technologies and applications, and different companies have different capabilities within them. Driverless cars place special emphasis on avoiding obstacles: determining the speed and position of other cars and objects that will enter the projected path of the car. Facial recognition, however, focuses on the geometry of the face, cross-referencing it against existing catalogs of faces to make a positive ID. Image tagging focuses on recognizing objects, colors, and other visual attributes within a specific image. The Google Vision API even has algorithms that will try to identify the breed of a dog, if it recognizes one in an image.
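
For a sense of how image tagging is exposed to developers, the minimal sketch below asks the Google Cloud Vision service for labels on a single image. It assumes the google-cloud-vision Python client library is installed and credentials are configured; the file name is a placeholder.

    # Minimal image-tagging sketch using the Google Cloud Vision client library.
    # Assumes `pip install google-cloud-vision` and configured credentials.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()

    with open("dog_photo.jpg", "rb") as f:      # placeholder file name
        image = vision.Image(content=f.read())

    # Label detection returns ranked tags ("dog", "grass", and so on)
    # with confidence scores, much like the tagging described above.
    response = client.label_detection(image=image)
    for label in response.label_annotations:
        print(f"{label.description}: {label.score:.2f}")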

The Google driverless car uses a blend of sonar, stereo cameras, lasers, and radar for obstacle detection. “The lidar system bolted to the top of Google’s self-driving car is crucially important for several reasons. First, it’s highly accurate up to a range of 100 meters. There are a few detection technologies on the car that work at greater distances, but not with the kind of accuracy you get from a laser. It simply bounces a beam off surfaces and measures the reflection to determine distance. The device used by Google—a Velodyne 64-beam laser—can also rotate 360-degrees and take up to 1.3 million readings per second, making it the most versatile sensor on the car. Mounting it on top of the car ensures its view isn’t obstructed.”2

Facial recognition requires cameras facing the person, usually at eye level, and in some cases infrared, to determine the contours of the face and where to focus, as in autofocus cameras. Image tagging works on image files that are already data, and so does not require particularly specialized hardware at all.

Creating Focal Experiences with Audio and Speech

Creating auditory and speech experiences entails a stronger emphasis on social experiences. Talking is something we do with other people, and so the social aspects of voice interactions will always play a strong role in their design. While several people in a room can choose to look at a screen or not, sound vibrations carry through the air in all directions—at sufficient volume, everyone can hear them. They also travel through liquids and solids, which makes possible warning sirens, sonar, and, sadly, annoying neighbors. Hearing through solids, however, generally requires direct physical contact. Solids conduct lower tones better than air, which is why our voices sound lower to us than they really are: we hear them through our jawbones and skulls. Our sense of hearing depends largely on the frequency, or pitch, and the volume of a sound. Both are relative to the distance between the listener and the sound source, because sound waves have physical properties like decay (they lose energy as they travel), echo (they bounce off solid objects), and the Doppler effect (they are compressed or stretched by movement), which all affect sound quality. Those qualities also help us roughly position the source of a sound in space. Like all of our other senses, hearing lets us detect patterns, especially in time and frequency, resulting in our experience of rhythm, harmony, and melody.
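
The Doppler effect, for example, reduces to a simple relation: a source moving toward a listener compresses its waves and is heard at a higher pitch. Here is a small worked sketch using the standard textbook formula for a stationary listener and an assumed speed of sound of 343 m/s.

    # Worked Doppler-effect example: an approaching siren sounds higher pitched.
    SPEED_OF_SOUND = 343.0  # meters per second in air at roughly 20 degrees C

    def doppler(frequency_hz, source_speed_ms, approaching=True):
        """Perceived frequency for a stationary listener and a moving source."""
        shift = -source_speed_ms if approaching else source_speed_ms
        return frequency_hz * SPEED_OF_SOUND / (SPEED_OF_SOUND + shift)

    siren = 700.0  # Hz
    print(doppler(siren, 25.0, approaching=True))   # ~755 Hz while the siren approaches
    print(doppler(siren, 25.0, approaching=False))  # ~652 Hz once it has passed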

Personal Sound Experiences

Typically, the larger the speaker, the lower the pitch and the louder the sound it can create. Lower-pitched sound waves are physically larger and therefore require more physical volume. That is why headphones can sound tinny (it’s difficult for them to create low frequencies at sufficient volume), and why subwoofers are usually so big. This can be a physical challenge for sound-based devices. Headphones and earbuds are designed to keep sound personal. They may cover our ears, trapping sound against the sides of our heads. Earbuds are funnel shaped to fit securely into our ears and take advantage of the little bump in the ear, called the tragus, to hold them snugly in place, trapping both the buds themselves and their sound inside our ear canals. There may also be externally facing microphones to detect environmental sound, used for noise cancellation, in which the peaks and troughs of the detected sound wave are inverted in a second sound wave, neutralizing the first. Increasingly, the directionality of sound is being explored.
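
The phase-inversion idea behind noise cancellation can be sketched in a few lines. This is an idealized illustration: real systems must generate the inverted wave continuously, within microseconds, and account for the acoustics between the microphone and the ear.

    # Idealized noise-cancellation sketch: an inverted copy of a wave cancels the original.
    import numpy as np

    sample_rate = 48_000
    t = np.arange(sample_rate) / sample_rate    # one second of sample times

    noise = 0.5 * np.sin(2 * np.pi * 200 * t)   # a steady 200 Hz hum
    anti_noise = -noise                         # same wave, peaks and troughs inverted

    residual = noise + anti_noise
    print(np.max(np.abs(residual)))             # 0.0 -- the two waves cancel completely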

The downside to all of this innovation, however, can appear when we are actually having a phone call or other conversation. We can look like we are talking to ourselves, or seem like we are addressing someone when we aren’t, creating awkward situations. For now, the social stigma of talking to yourself is still pretty strong. As the technology for auditory experiences advances, many related services are being integrated into the earbud (see Figure 10-5) or headphone form factor.

Figure 10-5. Google Pixel Buds integrate services like translation and assistance

Parametric speakers, on the other hand, take advantage of ultrasound: mechanical waves above our threshold of hearing. These waves can force audible sound waves to travel along a specific path and distance, in effect targeting where an audible sound can be heard. This is still a somewhat experimental technology.

Social Experiences: Broadcast

Ancient Greek and Roman auditoriums were designed for both hearing and speaking; their name literally means “a place for hearing.” They were shaped the way sound travels from our mouths: from a small starting point, radiating outward in one direction and growing larger. Seats were arranged to fall within that radiating shape, so that those sounds could be heard. They were smaller than the venues we have now because the human voice alone can only travel so far.

Modern technology allows us to amplify sound to be heard over longer distances, but some of the fundamentals still apply to social auditory experiences. In some cases, many small speakers can be used to replace one large speaker. They sound the same within a specific spatial range but can be made more directional, preventing sounds from carrying too far. This can be helpful in public spaces like stadiums, or in museums with multiple auditory displays in close proximity to each other. The Amazon Echo is cylinder-shaped to allow for a 360° microphone array and speaker configuration, so that it can “hear” and “speak” to anywhere in a room from any place in the room.

Conversation Experiences: Speech

On the flip side of social auditory experiences is conversation. The often-quoted figure that 75% of communication is nonverbal is correct in spirit, though perhaps not mathematically precise. Conversation is a multimodal experience, and we use many visual, auditory, gestural, and spatial cues when we are speaking to one another. We are very good at predicting when it is our turn to speak or listen, and what a person is about to say, and we use many different behaviors to communicate emotion, attention, and other aspects of dialog and empathy. We are so good at conversation, in fact, that our turn-taking response time is unbelievably fast. The pause between turns can be as little as 200 milliseconds, literally faster than the blink of an eye. This can make creating speech interfaces a bit challenging, as anything longer than that breaks the expectations of dialog. We can also process large quantities of verbal communication; we can listen to a person speak for hours at a clip. Alexa, on the other hand, can handle up to ninety seconds of sound file, and its response time varies with a number of factors, including the speed of your household internet. Despite these technical limitations, its interaction design took cues from how people really converse. The ring of LEDs on the device responds as quickly as a human does in conversation, and it points in the direction of the person speaking, as if it’s “looking” at you. This makes the delay feel like thoughtful listening and consideration rather than latency.

The multimodal cues around conversational speech are translated into experiences that may not have the same physical qualities as chatting with someone, but that echo the timing, conversational cues, and other social qualities of conversation. Other perceptual attributes of conversation may also help increase the acceptability of the experience for users.

Creating Haptic Experiences

Our sense of touch is the most multidimensional, detecting mechanical, electromagnetic, and chemical stimuli. It is our most direct sensory ability, requiring physical contact with objects and environments. It’s also tied to both our least aware and our most engaged activities. We feel the smooth, cool cotton of a shirt when we put it on in the morning, but then stop noticing it touching our body almost immediately. We can feel the shape and cushioning of a chair, but quickly forget it when we are watching a really great movie. When we do notice, we get tremendous pleasure from touch-based experiences: the fluffiness of down pillows on our faces, the satisfying crunch of biting into a crisp crust of bread, the funny jiggle of poking Jell-O, a warm bath after a long day. We laugh when we are tickled, and we recoil in horror if something fast with a lot of legs touches our feet unexpectedly. Touch can be our most experiential sense—the one that most quickly and effectively immerses our bodies in the physical world.

While the cosmetic design of a device may have tactile material characteristics when being handled, it’s quite another thing to simulate them using hardware componentry. All of our other senses have physical focal points: the eyes, ears, nose, and tongue. We have touch receptors all over our bodies, and deep in our muscles, bones, and organs—even in the previously mentioned eyes, ears, nose, and tongue. There are sensors that can detect moisture, but there is no technology that can re-create the feeling of wet without the immediate presence of liquid. Just think about what it feels like to be in a jacuzzi: warm jets of bubbles and water all over our skin, steam and splash hitting our faces. Because of the range and degree of sensitivity in our haptic experiences and how deeply tied they are to our bodies, they are difficult to re-create with digital technology. For now, a full haptic body suit seems a bit farther off into the future.

There is, however, a wide array of sensors that can detect the same kinds of information we can through touch. Sensors can detect many of the physical stimuli we can, like wind speed, temperature, and the presence of moisture. This helps us avoid direct exposure to unpleasant weather conditions, like tsunamis, tornadoes, and raindrops falling on our heads. On sunny days with a cool breeze, birds singing, and flowers blooming, the weather report can be a tease if you are cooped up inside. It is about as experiential as reading the list of ingredients on a chocolate wrapper versus actually tasting the chocolate.

Common haptic technologies fall within four dimensions: our abilities to detect force, contact, movement, and surface texture, with texture often playing a supporting role. They focus more on sensing what we are doing and are limited in creating a meaningful range of haptic stimuli for us to sense. Helmets can detect the force of an impact during football and determine whether it was significant enough to cause a head injury. Fitness trackers can differentiate between activities like biking, running, walking, and swimming. Between sensors and cameras, game consoles can create 3D avatars of our bodies and approximate many different kinds of interactions, from a golf swing to shaking up a bottle of champagne. Touchscreens can tell a great deal about the way we touch our screens, and used with peripheral devices like styluses, they can simulate many different types of hand tools and instruments, from pencils and paintbrushes to pianos and steering wheels. Because hand-eye coordination and tool use are so fundamental to human interaction, there is probably a great deal more we can do with haptic interfaces.
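
As a toy illustration of how motion data becomes activity labels, the sketch below buckets short windows of accelerometer readings by how vigorously they deviate from gravity alone. It is a simplified heuristic with made-up thresholds, not how any real tracker works; production systems use richer features and trained models.

    # Toy activity classifier: bucket accelerometer windows by movement intensity.
    # Thresholds and categories are illustrative assumptions only.
    import math

    def classify_window(samples):
        """samples: a list of (x, y, z) accelerometer readings, in g, over a few seconds."""
        magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
        # Deviation from 1 g (gravity alone) is a rough proxy for how much the wearer moves.
        activity_level = sum(abs(m - 1.0) for m in magnitudes) / len(magnitudes)
        if activity_level < 0.05:
            return "resting"
        elif activity_level < 0.3:
            return "walking"
        return "running"

    print(classify_window([(0.0, 0.0, 1.0)] * 50))                    # -> "resting"
    print(classify_window([(0.2, 0.1, 1.3), (-0.2, 0.0, 0.7)] * 25))  # -> "walking"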

Summary

The form of a device and its configurations are interdependent, and designing for them can quickly become a complex puzzle balancing heat, weight, physical properties, product capabilities, and stylistic considerations. Multimodal products really are greater than the sum of their properties, and the weakest element can quickly drag down the whole. In the same way that many design innovations of the past looked to biological properties for inspiration and elegant solutions, multimodal design can look to the way the senses work and how the mind processes them. This grounds designs in the user's reality.

1 Nicholas de Monchaux, Spacesuit: Fashioning Apollo, MIT Press 2012.

2 Ryan Whitwam, “How Google’s self-driving cars detect and avoid obstacles,” Extreme Tech, September 2014, http://www.extremetech.com/extreme/189486-how-googles-self-driving-cars-detect-and-avoid-obstacles.
