15

Analytical evaluation

15.1 Introduction

15.2 Inspections: heuristic evaluation

15.3 Inspections: walkthroughs

15.4 Predictive models

15.1 Introduction

All the evaluation methods you have encountered so far in this book have involved interaction with, or direct observation of, users. In this chapter we introduce an approach, known as analytical evaluation, where users are not directly involved. This approach includes various inspection methods and predictive models. Inspection methods typically involve an expert role-playing the users for whom the product is designed, analyzing aspects of an interface, and identifying any potential usability problems by using a set of guidelines. The most well known are heuristic evaluation and walkthroughs. Predictive models involve analyzing the various physical and mental operations that are needed to perform particular tasks at the interface and operationalizing them in terms of quantitative measures. They predict the times it will take a user to carry out the same task using different interfaces, enabling different designs to be compared. For example, the optimal layout of the physical and soft keys for a cell phone can be predicted in this way. We cover two of the most commonly used in HCI: GOMS and Fitts' Law.

Inspections are often used to evaluate a fully working system such as a website, whereas predictive modeling techniques are used more for testing specific aspects of an interface, such as the layout of keys or menu options. One of the advantages of analytical methods is that they are relatively quick to perform and do not require users to take part in a usability test or field study. However, they only ever produce ‘guesses’ about the time it will take hypothetical users to carry out a given task, or about the potential usability problems they might come across when interacting with a product. They also require the usability expert to put themselves in the shoes of users who may be quite different from themselves. When reading this chapter, imagine yourself as the expert trying to be the hypothetical user and consider how easy or difficult it is.

The main aims of this chapter are to:

  • Describe the important concepts associated with inspection methods.
  • Show how heuristic evaluation can be adapted to evaluate different types of interactive products.
  • Explain what is involved in doing heuristic evaluation and various kinds of walkthrough.
  • Describe how to perform two types of predictive technique, GOMS and Fitts' Law, and when to use them.
  • Discuss the advantages and disadvantages of using analytical evaluation.

15.2 Inspections: Heuristic Evaluation

Sometimes users are not easily accessible, or involving them is too expensive or takes too long. In such circumstances, experts or combinations of experts and users can provide feedback. By an expert we mean someone who is practiced in usability methods and has a background in HCI. Various inspection techniques began to be developed as alternatives to usability testing in the early 1990s, drawing on software engineering practice, where code and other types of inspections are commonly used. These inspection techniques include expert evaluations known as heuristic evaluations, and walkthroughs. In both, experts examine the interface of an interactive product, often role-playing typical users, and identify problems users are likely to have when interacting with it. One of their attractions is that they can be used at any stage of a design project, including early design before well-developed prototypes are available. They can also be used to complement user testing.

15.2.1 Heuristic Evaluation

Heuristic evaluation is a usability inspection technique first developed by Jakob Nielsen and his colleagues (Nielsen and Molich, 1990; Nielsen, 1994a), in which experts, guided by a set of usability principles known as heuristics, evaluate whether user-interface elements, such as dialog boxes, menus, navigation structure, online help, and so on, conform to the principles. These heuristics closely resemble the high-level design principles and guidelines discussed in Chapters 1 and 3, e.g. making designs consistent, reducing memory load, and using terms that users understand; when used in evaluation they are called heuristics. The original set identified by Jakob Nielsen and his colleagues was derived empirically from an analysis of 249 usability problems (Nielsen, 1994b); a revised version of these heuristics is listed below (useit.com, accessed February 2006):

  • Visibility of system status

    The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.

  • Match between system and the real world

    The system should speak the users' language, with words, phrases, and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order.

  • User control and freedom

    Users often choose system functions by mistake and will need a clearly marked ‘emergency exit’ to leave the unwanted state without having to go through an extended dialog. Support undo and redo.

  • Consistency and standards

    Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions.

  • Error prevention

    Even better than good error messages is a careful design which prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before they commit to the action.

  • Recognition rather than recall

    Minimize the user's memory load by making objects, actions, and options visible. The user should not have to remember information from one part of the dialog to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate.

  • Flexibility and efficiency of use

    Accelerators—unseen by the novice user—may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions.

  • Aesthetic and minimalist design

    Dialogues should not contain information that is irrelevant or rarely needed. Every extra unit of information in a dialog competes with the relevant units of information and diminishes their relative visibility.

  • Help users recognize, diagnose, and recover from errors

    Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution.

  • Help and documentation

    Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user's task, list concrete steps to be carried out, and not be too large.

Experts use these heuristics by judging aspects of the interface against them. For example, if a new email system is being evaluated, the expert would apply the last heuristic by examining the kind of help and documentation it provides: he or she might consider how a user would find out how to import an address book, how quickly this information can be found in the help facility, and whether the instructions for doing so are easy to follow. The evaluator is meant to go through the interface several times, inspecting the various interface elements and comparing them with the list of usability principles (i.e. the heuristics), picking up problems missed on earlier passes and revising earlier judgments, until satisfied that the majority of the usability problems have been identified.

Some of the core heuristics are too general for evaluating products that have come onto the market since they were first developed, such as mobile devices, digital toys, online communities, and new web services. Nielsen suggests developing category-specific heuristics that apply to a specific class of products as a supplement to the general heuristics. Evaluators and researchers have typically developed their own heuristics by combining a tailored version of Nielsen's heuristics with other design guidelines, market research, and requirements documents for the specific product. Exactly which heuristics are appropriate, and how many are needed for different products, is debatable and depends on the goals of the evaluation. Most sets of heuristics have between five and ten items, which provides a range of usability criteria by which to judge the various aspects of an interface. More than ten becomes difficult for evaluators to remember; fewer than five tends not to be sufficiently discriminating.

A key question that is frequently asked is: how many evaluators are needed to carry out a thorough heuristic evaluation? While one evaluator can identify a large number of usability problems, she may not catch all of them. She may also tend to concentrate on one usability aspect at the expense of missing others. For example, in a study of heuristic evaluation in which 19 evaluators were asked to find 16 usability problems in a voice response system that allowed customers access to their bank accounts, Nielsen (1992) found a substantial amount of non-overlap between the sets of usability problems found by the different evaluators. He also notes that while some usability problems are very easy for all evaluators to find, others are found by very few experts. He therefore argues that it is important to involve multiple evaluators in any heuristic evaluation and recommends between three and five. His findings suggest that they can typically identify around 75% of the total usability problems, as shown in Figure 15.1 (Nielsen, 1994a).

However, employing multiple experts can be too costly. Skillful experts can capture many of the usability problems by themselves, and some consultancies now use this technique as the basis for critiquing interactive devices, a process that has become known as an ‘expert crit’ in some countries. But using only one or two experts to conduct a heuristic evaluation can be problematic, since research has challenged Nielsen's findings and questioned whether even three to five evaluators are adequate (e.g. Cockton and Woolrych, 2001; Woolrych and Cockton, 2001). These authors point out that the number of experts needed to find this percentage of problems depends on the nature of the problems, and their analysis of problem frequency and severity suggests that relying on too few evaluators can produce highly misleading findings. The take-away message is that ‘more is more,’ but more is also expensive. Because users and special facilities are not needed, however, heuristic evaluation is comparatively inexpensive and quick, which is why it is popular with developers and is known as discount evaluation. For a quick evaluation of an early design, one or two experts can probably identify most potential usability problems, but if a thorough evaluation of a fully working prototype is needed then having a team of experts conduct the evaluation and compare their findings is advisable.

images

Figure 15.1 Curve showing the proportion of usability problems in an interface found by heuristic evaluation using various numbers of evaluators. The curve represents the average of six case studies of heuristic evaluation (Nielsen 1994a)
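The shape of this curve can be approximated by a simple formula that Nielsen and Landauer (1993) proposed elsewhere: the proportion of problems found by i evaluators is 1 - (1 - λ)^i, where λ is the proportion found by a single evaluator. The short sketch below illustrates this; the value λ = 0.31 is our own assumption, chosen purely for illustration and broadly in line with the averages Nielsen reports.

# Illustrative sketch: the aggregation model proposed by Nielsen and Landauer (1993).
# The single-evaluator detection rate lam = 0.31 is an assumed value for illustration.

def proportion_found(i, lam=0.31):
    """Expected proportion of usability problems found by i independent evaluators."""
    return 1 - (1 - lam) ** i

for i in range(1, 11):
    print(f"{i} evaluator(s): {proportion_found(i):.0%}")

With λ = 0.31, three evaluators find about 67% of the problems and five about 84%, bracketing the 75% figure quoted above.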

15.2.2 Heuristic Evaluation for Websites

In this section we examine heuristics for evaluating websites. We begin by discussing MedlinePlus, the medical information website created by the National Library of Medicine (NLM) that you read about in the previous chapter (Cogdill, 1999). The homepage and two other screens are shown in Figures 14.6–14.8. Prior to doing a usability evaluation, Keith Cogdill developed a set of heuristics customized for identifying usability problems of a large healthcare website, intended for the general public and medical professionals. The heuristics were based partly on Nielsen's core set and partly on Cogdill's own knowledge of the users' tasks, problems that had already been reported by users, and advice from documented sources (e.g. Shneiderman, 1998a; Nielsen, 1993; Dumas and Redish, 1993). The seven heuristics listed below were identified, some of which resemble Nielsen's original set:

  • Internal consistency

    The user should not have to speculate about whether different phrases or actions carry the same meaning.

  • Simple dialog

    The dialog with the user should not include information that is irrelevant, unnecessary, or rarely needed. The dialog should be presented in terms familiar to the user and not be system-oriented.

  • Shortcuts

    The interface should accommodate both novice and experienced users.

  • Minimizing the user's memory load

    The interface should not require the user to remember information from one part of the dialog to another.

  • Preventing errors

    The interface should prevent errors from occurring.

  • Feedback

    The system should keep the user informed about what is taking place.

  • Internal locus of control

    Users who choose system functions by mistake should have an ‘emergency exit’ that lets them leave the unwanted state without having to engage in an extended dialog with the system.

These heuristics were given to three expert evaluators who independently evaluated MedlinePlus. Their comments were then compiled and a meeting was called to discuss their findings and suggest strategies for addressing problems. The following points were among their findings:

  • Layout

    All pages within MedlinePlus have a relatively uncomplicated vertical design. The homepage is particularly compact, and all pages are well suited for printing. The use of graphics is conservative, minimizing the time needed to download pages.

  • Internal consistency

The formatting of pages and presentation of the logo are consistent across the website. Justification of text, fonts, font sizes, font colors, use of terms, and link labels are also consistent.

The experts also suggested improvements, including:

  • Arrangement of health topics

    Topics should be arranged alphabetically as well as in categories. For example, health topics related to cardiovascular conditions could appear together.

  • Depth of navigation menu

    Having a higher ‘fan-out’ in the navigation menu in the left margin would enhance usability. By this they mean that more topics should be listed at the top level, giving many short menus rather than a few deep ones. Remember the experiment on breadth versus depth discussed in Chapter 14, which provides evidence to justify this.

Activity 15.1

The heuristic evaluation discussed above was done during the creation of the first version of MedlinePlus.

  1. Using your Internet browser go to http://www.MedlinePlus.org and try out Cogdill's seven heuristics with the revised version. In your opinion, do these heuristics help you to identify important usability issues?
  2. Does being aware of the heuristics influence how you interact with MedlinePlus in any way?
  3. Are there other heuristics that you think should be included, or heuristics in the list that you think are not needed?

Comment

  1. In our opinion the heuristics enabled us to focus on key usability criteria such as whether the interface dialog is meaningful and the consistency of the design.
  2. Being aware of the heuristics caused us to focus more strongly on the design and to think about the interaction. We were more aware of what we were trying to do with the system and how the system was responding.
  3. Internal locus of control was difficult to judge. The concept is useful but judging internal locus of control when role-playing users is difficult. All the heuristics focus on task-oriented design aspects, but the user will react to the aesthetic design as well. For example, I happen to like purple so the purple on the homepage was pleasing to me. I also reacted to the picture of a father and young boy holding a teddy bear. The heuristics do not address these user experience features, which are important to some users. For example, a colleague tells a story about how her teenage daughter selected a cell phone to purchase. The choice was determined by which were available with blue casing!

Turning Design Guidelines into Heuristics for the Web

The example above shows how a set of guidelines was turned into heuristics for evaluating a website. This approach happens quite often because there are many books on web design and several of them offer design heuristics (e.g. Horton, 2005; Koyani et al., 2004; Lazar, 2006); there are also numerous websites, e.g. useit.com, from which to select. So how do experts choose the most appropriate set for the product, service, or device they have to evaluate—given the ever-increasing body of knowledge from which to choose?

The approach that Cogdill adopted when evaluating MedlinePlus is quite common. He started by assessing Nielsen's original set of heuristics and then selected from other guidelines, other experts, and his own experience. He made his selection by considering the key interaction issues for the product in question. For example, one of the biggest problems for users of large websites is navigating around the site. The following six guidelines (adapted from Nielsen (1998) and others) are intended to encourage good navigation design. Typically, to develop heuristics, the guidelines are turned into short statements or questions; these are shown following each guideline.

  • Guideline (G): Avoid orphan pages, i.e. pages that are not connected to the homepage, because they lead users into dead ends.

    Heuristic (H): Are there any orphan pages? Where do they go to?

  • G: Avoid long pages with excessive white space that force scrolling.

H: Are there any long pages? Do they have lots of white space or are they full of text or lists?

  • G: Provide navigation support, such as a strong site map that is always present (Shneiderman and Plaisant, 2005).

    H: Is there any guidance, e.g. maps, navigation bar, menus, to help users find their way around the site?

  • G: Avoid narrow, deep, hierarchical menus that force users to burrow deep into the menu structure.

    H: Are menus shallow or deep? Empirical evidence indicates that broad, shallow menus have better usability than a few deep menus (Larson and Czerwinski, 1998; Shneiderman and Plaisant, 2005).

  • G: Avoid non-standard link colors.

    H: What color is used for links? Is it blue or another color? If it is another color, then is it obvious to the user that it is a hyperlink?

  • G: Provide consistent look and feel for navigation and information design.

    H: Are menus used, named, and positioned consistently? Are links used consistently?

Activity 15.2

Consider the following design guidelines for information design and for each one suggest a question that could be used in heuristic evaluation:

  • Outdated or incomplete information is to be avoided (Nielsen, 1998). It creates a poor impression with users.
  • Good graphical design is important. Reading long sentences, paragraphs, and documents is difficult on screen, so break material into discrete, meaningful chunks to give the website structure (Lynch and Horton, 1999; Horton, 2005).
  • Avoid excessive use of color. Color is useful for indicating different kinds of information, i.e. cueing (Koyani et al., 2005).
  • Avoid gratuitous use of graphics and animation. In addition to increasing download time, graphics and animation soon become boring and annoying.
  • Be consistent. Consistency both within pages (e.g. use of fonts, numbering, terminology, etc.) and within the site (e.g. navigation, menu names, etc.) is important for usability and for aesthetically pleasing designs.

Comment

We suggest the following questions; you may have identified others:

  • Outdated or incomplete information

    Do the pages have dates on them? How many pages are old and provide outdated information?

  • Good graphical design is important

    Is the page layout structured meaningfully? Is there too much text on each page?

  • Avoid excessive use of color. How is color used? Is it used as a form of coding? Is it used to make the site bright and cheerful? Is it excessive and garish?
  • Avoid gratuitous use of graphics and animation. Are there any flashing banners? Are there complex introduction sequences? Can they be short-circuited? Do the graphics add to the site?
  • Be consistent. Are the same buttons, fonts, numbers, menu styles, etc., used across the site? Are they used in the same way?

Activity 15.3

Look at the heuristics for navigation above and consider how you would use them to evaluate a website for purchasing clothes, e.g. http://www.REI.com (which has a homepage similar to that in Figure 15.2). While you are doing this activity think about whether the heuristics are useful.

  1. Do the heuristics help you focus on what is being evaluated?
  2. Might fewer heuristics be better? Which might be combined, and what are the trade-offs?

Comment

  1. Most people find that the heuristics encourage them to focus more on the design.
  2. Some heuristics can be combined and given a more general description. For example, providing navigation support and avoiding narrow, deep, hierarchical menus could be replaced with “help users to develop a good mental model,” but this is a more abstract statement and some evaluators might not know what is packed into it. Producing questions suitable for heuristic evaluation often results in more of them, so there is a trade-off.

    An argument for keeping the detail is that it reminds evaluators of the issues to consider.

    images

    Figure 15.2 Homepage of REI.com

Another important issue when designing and evaluating web pages is their accessibility to a broad range of users (see case studies 6.1 and 10.1). As much as possible, web pages need to be universally accessible. By this we mean that older users, users with disabilities, non-English speakers, and users with slow Internet connections should be able to access the basic content of the pages. In the USA, Section 508 of the Rehabilitation Act of 1973 was updated and refined in 1998 to make it more specific, setting forth how the Act should be applied to technology so that it is universally accessible (see Lazar, 2006, 2007 for examples of the application of Section 508). In a comparative evaluation study of methods for assessing web page accessibility for the blind, Jen Mankoff and her colleagues found that multiple developers using a screen reader were most consistently successful at finding most types of problems (Mankoff et al., 2005). They identified approximately 50% of known problems, which, surprisingly, was more successful than testing directly with blind users.

As the web diversifies, heuristics have been identified for evaluating a range of applications and services, such as web-based online communities.

Heuristics for Web-based Online Communities

A key concern for applications designed for web-based online communities, such as those developed for social networks and support groups, is how to evaluate both usability and social interaction (i.e. sociability) aspects. The following nine sets of example questions are suggested as a starting point for developing heuristics to evaluate aspects of sociability and usability in web-based online communities (Preece, 2000).

  • Sociability: Why should I join this community? (What are the benefits for me? Does the description of the group, its name, its location in the website, the graphics, etc., tell me about the purpose of the community and entice me to join it?)
  • Usability: How do I join (or leave) the community? (What do I do? Do I have to register or can I just post, and is this a good thing?)
  • Sociability: What are the rules? (Is there anything I shouldn't do? Are the expectations for communal behavior made clear? Is there someone who checks that people are behaving reasonably?)
  • Usability: How do I get, read, and send messages? (Is there support for newcomers? Is it clear what I should do? Are templates provided? Can I send private messages?)
  • Usability: Can I do what I want to do easily? (Can I navigate the site? Do I feel comfortable interacting with the software? Can I find the information and people I want?)
  • Sociability: Is the community safe? (Are my comments treated with respect? Is my personal information secure? Do people make aggressive or unacceptable remarks to each other?)
  • Sociability: Can I express myself as I wish? (Is there a way of expressing emotions, such as using emoticons? Can I show people what I look like or reveal aspects of my character? Can I see others? Can I determine who else is present—perhaps people are looking on but not sending messages?)
  • Sociability: Do people reciprocate? (If I contribute will others contribute comments, support, and answer my questions?)
  • Sociability: Why should I come back? (What makes the experience worthwhile? What's in it for me? Do I feel part of a thriving community? Are there interesting people with whom to communicate? Are there interesting events?)

Activity 15.4

Go to a discussion board that interests you or type ‘Yahoo Groups’ in your browser and find one. Social interaction was discussed in Chapter 4, and this activity involves picking up some of the concepts discussed there and developing heuristics to evaluate web-based online communities. Before starting you will find it useful to familiarize yourself with the group:

  • Read some of the messages.
  • Send a message.
  • Reply to a message.
  • Search for information.
  • Count how many messages have been sent and how recently.
  • Find out whether you can post to people privately using email.
  • Can you see the physical relationship between messages easily?
  • Do you get a sense of what the other people are like and the emotional content of their messages?
  • Is a sense of community and of individuals present?

Then use the nine questions above as heuristics to evaluate the community site:

  1. Did these questions support your evaluation of the web-based online community for both usability and sociability issues?
  2. Could these questions be used to evaluate other online communities such as HutchWorld discussed in Chapter 12 or www.myspace.com, or another application?

Comment

  1. You probably found that these questions helped focus your attention on specific issues. However, sociability and usability are closely related and it is sometimes difficult, or not useful, to distinguish between them. Unlike the website evaluation, here it is important to pay attention to social interaction, but you may have found that your web-based community had very few visitors. A community without people is not a community, no matter how good the software that supports it.
  2. HutchWorld is designed to support social interaction among cancer patients and their carers and the questions could therefore be used to evaluate these aspects. However, HutchWorld offers many additional features not found in most online communities. For example, it encourages a sense of social presence by allowing participants to show pictures of themselves, tell stories, etc., and HutchWorld has avatars, which are graphical representations of people. The nine questions above are useful but may need adapting to account for these extra features.

15.2.3 Heuristic Evaluation for other Interactive Products

You have seen how heuristics can be tailored for evaluating websites and web-based applications. However, many of the new products that need to be evaluated are quite different from the software applications of the early 1990s that gave rise to Nielsen's original heuristics. For example, computerized toys are being developed that motivate, entice, and challenge in innovative ways. Researchers are starting to identify design and evaluation heuristics for these and other products, and Activity 15.5 is intended to help you start thinking about them.

Activity 15.5

Allison Druin works with children to develop computerized toys, among other interactive products (Druin, 1999). From doing this research Allison and her team know that children like to:

  • be in control and not to be controlled
  • create things
  • express themselves
  • be social
  • collaborate with other children
  1. What kind of tasks should be considered in evaluating a fluffy robot toy dog that can be programmed to move and to tell personalized stories about itself and children? The target age group for the toy is 7–9 years.
  2. Suggest heuristics to evaluate the toy.

Comment

  1. Tasks that you could consider: making the toy tell a story about the owner and two friends; making the toy move across the room, turn, and speak. You probably thought of others.
  2. The heuristics could be written to cover: being in control, being flexible, supporting expression, being motivating, supporting collaboration, and being engaging. These are based on the issues raised by Druin; further heuristics could address the toy's aesthetic and tactile qualities. Several of the heuristics needed are more concerned with user experience, e.g. motivating, engaging, etc., than with usability. As developers pay more attention to user experience, particularly when developing applications for children and entertainment systems, we can expect to see heuristics that address these issues (e.g. Sutcliffe, 2002).

Heuristic evaluation is suitable for devices that people use ‘on-the-move’ as it avoids some of the difficulties associated with field studies and usability evaluation in such circumstances (e.g. see Section 12.4.2 and Brewster and Dunlop, 2004 for a collection of papers on this topic). An interesting example is evaluating a mobile fax application, known as MoFax (Wright et al., 2005). MoFax users can send and receive faxes to conventional fax machines or to other MoFax users. This application was created to support groups working with construction industry representatives. This user group often sends faxes from place to place to show plans. Using MoFax enables team members to browse and send faxes on their cell phones while out in the field (see Figure 15.3). At the time of the usability evaluation, the developers knew there were some significant problems with the interface, so they carried out a heuristic evaluation using Nielsen's heuristics to learn more. Three expert evaluators performed the evaluation and together they identified 56 problems. Based on these results the developers redesigned MoFax.

images

Figure 15.3 A screen showing MoFax on a cell phone

Heuristic evaluation has also been used to evaluate ambient devices, which are abstract aesthetic peripheral displays that portray non-critical information at the periphery of the user's attention (Mankoff et al., 2003). Since these devices are not designed for task performance, the researchers had to develop a set of heuristics that took this into account. They did this by first building two ambient displays: one indicated how close a bus was to the bus stop by moving its number up a screen; the other indicated how light or dark it was outside by lightening or darkening a light display (see Figure 15.4). They then selected a subset of Nielsen's heuristics that were applicable to this type of application and asked groups of experts to evaluate the devices using these heuristics.

images

Figure 15.4 Two ambient devices: (a) bus indicator, (b) lightness and darkness indicator

The heuristics that were developed are listed below in order of the number of issues identified using each. Those marked (Nielsen) are Nielsen's heuristics; the others were derived by the researchers.

  • Sufficient information design. The display should be designed to convey ‘just enough’ information. Too much information cramps the display, and too little makes the display less useful.
  • Consistent and intuitive mapping. Ambient displays should add minimal cognitive load. Cognitive load may be higher when users must remember what states or changes in the display mean. The display should be intuitive.
  • Match between system and real world (Nielsen).
  • Visibility of state. The states of the display, and changes from one state to another, should be easily perceptible.
  • Aesthetic and pleasing design. The display should be pleasing when it is placed in the intended setting.
  • Useful and relevant information. The information should be useful and relevant to the users in the intended setting.
  • Visibility of system status (Nielsen).
  • User control and freedom (Nielsen).
  • Easy transition to more in-depth information. If the display offers multi-leveled information, it should be easy and quick for users to find more detailed information.
  • Peripherality of display. The display should be unobtrusive and remain so unless it requires the user's attention. Users should be able to monitor the display easily.
  • Error prevention (Nielsen).
  • Flexibility and efficiency of use (Nielsen).

Using these heuristics, three to five evaluators were able to identify 40–60% of known usability issues. This study raises two important points. First, different types of applications need to be evaluated using different heuristics. Second, the method by which the heuristics are derived needs to be reliable. The authors tested their set of heuristics by running a study in which they gave a survey to students and expert evaluators to use with two different systems; the results were then validated for consistency across the different evaluators.

In a follow-up study different researchers used the same heuristics with different ambient applications (Consolvo and Towle, 2005); 75% of known usability problems were found with eight evaluators and 35–55% with three to five evaluators. This suggests that the more evaluators there are, the more of the known problems are likely to be found.

15.2.4 Doing Heuristic Evaluation

Heuristic evaluation has three stages:

  1. The briefing session, in which the experts are told what to do. A prepared script is useful as a guide and to ensure each person receives the same briefing.
  2. The evaluation period, in which each expert typically spends 1–2 hours independently inspecting the product, using the heuristics for guidance. The experts need to take at least two passes through the interface. The first pass gives a feel for the flow of the interaction and the product's scope. The second pass allows the evaluator to focus on specific interface elements in the context of the whole product, and to identify potential usability problems.

    If the evaluation is for a functioning product, the evaluators need to have some specific user tasks in mind so that exploration is focused. Suggesting tasks may be helpful, but many experts prefer to devise their own. This approach is less easy if the evaluation is done early in design, when there are only screen mockups or a specification; the approach needs to be adapted to the evaluation circumstances. While working through the interface, specification, or mockups, a second person may record the problems identified, or the evaluator may think aloud. Alternatively, she may take notes herself. Experts should be encouraged to be as specific as possible and to record each problem clearly; one possible record format is sketched after this list.

  3. The debriefing session, in which the experts come together to discuss their findings and to prioritize the problems they found and suggest solutions.
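How the problems are recorded is not prescribed, but it helps to capture the same information for each one. The sketch below shows one possible record format; the structure, field names, and the 1–4 severity scale are our own illustration, not part of the method.

# Hypothetical sketch of a record for one problem found during the evaluation period.
# The fields and the severity scale are illustrative assumptions, not a standard.
from dataclasses import dataclass

@dataclass
class HeuristicFinding:
    location: str       # where in the interface the problem occurs
    heuristic: str      # which heuristic is violated
    description: str    # what the problem is, stated as specifically as possible
    severity: int       # e.g. 1 (cosmetic) to 4 (serious), agreed at the debriefing

finding = HeuristicFinding(
    location="search results page",
    heuristic="Visibility of system status",
    description="No feedback is given while results are loading.",
    severity=3,
)

A collection of such records gives the team something concrete to prioritize, and to attach suggested solutions to, during the debriefing session.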

The heuristics focus the experts' attention on particular issues, so selecting appropriate heuristics is critically important. Even so, there is sometimes less agreement among experts than is desirable, as discussed in the dilemma below.

There are fewer practical and ethical issues in heuristic evaluation than for other techniques because users are not involved. A week is often cited as the time needed to train experts to be evaluators (Nielsen and Mack, 1994), but this depends on the person's initial expertise. Typical users can be taught to do heuristic evaluation, although there have been claims that this approach is not very successful (Nielsen, 1994a). Some closely related methods take a team approach that involves users (Bias, 1994).

Dilemma: Problems or False Alarms?

You might have the impression that heuristic evaluation is a panacea for designers, and that it can reveal all that is wrong with a design. However, it has problems. Shortly after heuristic evaluation was developed, several independent studies compared heuristic evaluation with other techniques, particularly user testing, indicating that the different approaches often identify different problems and that sometimes heuristic evaluation misses severe problems (Karat, 1994). This argues for using complementary techniques. Furthermore, heuristic evaluation should not be thought of as a replacement for user testing.

Another problem that Bill Bailey (2001) warns about concerns experts reporting problems that don't exist. In other words, some of the experts' predictions are wrong. Bailey cites analyses from three published sources showing that only around 33% of the problems reported were real usability problems, some of which were serious, others trivial. However, the heuristic evaluators missed about 21% of users' problems. Furthermore, about 43% of the problems identified by the experts were not problems at all; they were false alarms! Bailey points out that if we do the arithmetic and round up the numbers, what this comes down to is that only about half the problems identified are true problems. "More specifically, for every true usability problem identified, there will be a little over one false alarm (1.2) and about one half of one missed problem (0.6). If this analysis is true, heuristic evaluators tend to identify more false alarms and miss more problems than they have true hits."

How can the number of false alarms or missed serious problems be reduced? Checking that experts really have the expertise that they claim would help, but how can you do this? One way to overcome these problems is to have several evaluators. This helps to reduce the impact of one person's experience or poor performance. Using heuristic evaluation along with user testing and other techniques is also a good idea.

15.3 Inspections: Walkthroughs

Walkthroughs are an alternative approach to heuristic evaluation for predicting users' problems without doing user testing. As the name suggests, they involve walking through a task with the system and noting problematic usability features. Most walkthrough techniques do not involve users. Others, such as pluralistic walkthroughs, involve a team that includes users, developers, and usability specialists.

In this section we consider cognitive and pluralistic walkthroughs. Both were originally developed for desktop systems but, similar to heuristic evaluation, can be adapted to web-based systems, handheld devices, and products such as VCRs.

15.3.1 Cognitive walkthroughs

“Cognitive walkthroughs involve simulating a user's problem-solving process at each step in the human–computer dialog, checking to see if the user's goals and memory for actions can be assumed to lead to the next correct action” (Nielsen and Mack, 1994, p. 6). The defining feature is that they focus on evaluating designs for ease of learning—a focus that is motivated by observations that users learn by exploration (Wharton et al., 1994). The steps involved in cognitive walkthroughs are:

  1. The characteristics of typical users are identified and documented and sample tasks are developed that focus on the aspects of the design to be evaluated. A description or prototype of the interface to be developed is also produced, along with a clear sequence of the actions needed for the users to complete the task.
  2. A designer and one or more expert evaluators then come together to do the analysis.
  3. The evaluators walk through the action sequences for each task, placing it within the context of a typical scenario, and as they do this they try to answer the following questions:
    • Will the correct action be sufficiently evident to the user? (Will the user know what to do to achieve the task?)
    • Will the user notice that the correct action is available? (Can users see the button or menu item that they should use for the next action? Is it apparent when it is needed?)
    • Will the user associate and interpret the response from the action correctly? (Will users know from the feedback that they have made a correct or incorrect choice of action?)

    In other words: will users know what to do, see how to do it, and understand from feedback whether the action was correct or not?

  4. As the walkthrough is being done, a record of critical information is compiled in which:
    • The assumptions about what would cause problems and why are recorded. This involves explaining why users would face difficulties.
    • Notes about side issues and design changes are made.
    • A summary of the results is compiled.
  5. The design is then revised to fix the problems identified.

It is important to document the cognitive walkthrough, keeping account of what works and what doesn't. A standardized feedback form can be used in which answers are recorded to the three bulleted questions in step 3 above. The form can also record the details outlined in points 1–4 as well as the date of the evaluation. Negative answers to any of the questions are carefully documented on a separate form, along with details of the system, its version number, the date of the evaluation, and the evaluators' names. It is also useful to document the severity of the problems, for example, how likely a problem is to occur and how serious it will be for users.
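To make this concrete, the sketch below shows one way the standardized form described above might be captured. The structure and field names are our own illustration rather than a prescribed format.

# Hypothetical sketch of a cognitive walkthrough record; fields are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StepRecord:
    action: str                  # the action at this point in the task sequence
    knows_what_to_do: bool       # will users know what to do?
    sees_how_to_do_it: bool      # will users notice that the correct action is available?
    understands_feedback: bool   # will users understand from feedback whether the action was correct?
    notes: str = ""              # assumptions about causes of problems, side issues, design changes

@dataclass
class WalkthroughReport:
    system: str
    version: str
    date: str
    evaluators: List[str]
    steps: List[StepRecord] = field(default_factory=list)
    # Negative answers would be documented separately, together with a severity
    # judgment: how likely the problem is to occur and how serious it is for users.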

Compared with heuristic evaluation, this technique focuses more on identifying specific users' problems at a high level of detail. Hence, it has a narrow focus that is useful for certain types of systems but not others. In particular, it can be useful for applications involving complex operations to perform tasks. However, it is very time-consuming and laborious to do and needs a good understanding of the cognitive processes involved.

Example: find a book at www.Amazon.com

This example shows a cognitive walkthrough of buying this book at www.Amazon.com.

Task: to buy a copy of this book from www.Amazon.com

Typical users: students who use the web regularly

The steps to complete the task are given below. Note that the interface for www.Amazon.com may have changed since we did our evaluation.

Step 1. Selecting the correct category of goods on the homepage

Q: Will users know what to do?

Answer: Yes—they know that they must find ‘books’.

Q: Will users see how to do it?

Answer: Yes—they have seen menus before and will know to select the appropriate item and click go.

Q: Will users understand from feedback whether the action was correct or not?

Answer: Yes—their action takes them to a form that they need to complete to search for the book.

Step 2. Completing the form

Q: Will users know what to do?

Answer: Yes—the online form is like a paper form so they know they have to complete it.

Answer: No—they may not realize that the form has defaults to prevent inappropriate answers because this is different from a paper form.

Q: Will users see how to do it?

Answer: Yes—it is clear where the information goes and there is a button to tell the system to search for the book.

Q: Will users understand from feedback whether the action was correct or not?

Answer: Yes—they are taken to a picture of the book, a description, and purchase details.

Activity 15.6

Activity 15.3 asked you to do a heuristic evaluation of REI.com or a similar e-commerce retail site. Now go back to that site and do a cognitive walkthrough to buy something, say a pair of skis. When you have completed the evaluation, compare your findings from the cognitive walkthrough technique with those from heuristic evaluation.

Comment

You probably found that the cognitive walkthrough took longer than the heuristic evaluation for evaluating the same part of the site because it examines each step of a task. Consequently, you probably did not see as much of the website, but it is likely that you got much more detailed findings. The cognitive walkthrough is a useful technique for examining a small part of a system in detail, whereas heuristic evaluation is useful for examining whole systems or parts of systems. As the name indicates, the cognitive walkthrough focuses on the cognitive aspects of interacting with the system; the technique was developed before there was much emphasis on aesthetic design and other user experience goals, and it does not focus on the effort and motor skills involved in physical interaction either.

Variation of the cognitive walkthrough

A useful variation on this theme is provided by Rick Spencer of Microsoft, who adapted the cognitive walkthrough technique to make it more effective with a team that was developing an interactive development environment (IDE) (Spencer, 2000). When the technique was used in its original form, there were two major problems. First, answering the three questions in step 3 and discussing the answers took too long. Second, designers tended to be defensive, often invoking long explanations of cognitive theory to justify their designs. This second problem was particularly difficult because it undermined both the efficacy of the technique and the social relationships of team members. To cope with these problems, Rick Spencer adapted the technique by reducing the number of questions and curtailing discussion. This meant that the analysis was more coarse-grained but could be completed in much less time (about 2.5 hours). He also identified a leader, the usability specialist, and set strong ground rules for the session, including a ban on defending a design, debating cognitive theory, or doing designs on the fly.

These adaptations made the technique more usable, despite losing some of the detail from the analysis. Perhaps most important of all, Spencer directed the social interactions of the design team so that they achieved their goals.

15.3.2 Pluralistic Walkthroughs

“Pluralistic walkthroughs are another type of walkthrough in which users, developers and usability experts work together to step through a [task] scenario, discussing usability issues associated with dialog elements involved in the scenario steps” (Nielsen and Mack, 1994, p. 5). Each group of experts is asked to assume the role of typical users. The walkthroughs are then done by following a sequence of steps (Bias, 1994):

  1. Scenarios are developed in the form of a series of hardcopy screens representing a single path through the interface. Often only a few screens are developed.
  2. The scenarios are presented to the panel of evaluators and the panelists are asked to write down the sequence of actions they would take to move from one screen to another. They do this individually without conferring with one another.
  3. When everyone has written down their actions, the panelists discuss the actions that they suggested for that round of the review. Usually, the representative users go first so that they are not influenced by the other panel members and are not deterred from speaking. Next the usability experts present their findings, and finally the developers offer their comments.
  4. Then the panel moves on to the next round of screens. This process continues until all the scenarios have been evaluated.

The benefits of pluralistic walkthroughs include a strong focus on users' tasks at a detailed level, i.e. looking at the steps taken. This level of analysis can be invaluable for certain kinds of systems, such as safety-critical ones, where a usability problem identified for a single step could be critical to safety or efficiency. The approach lends itself well to participatory design practices by involving a multidisciplinary team in which users play a key role. Furthermore, the group brings a variety of expertise and opinions to bear on interpreting each stage of an interaction. Limitations include having to get all the panelists together at once and then proceeding at the rate of the slowest. Furthermore, only a limited number of scenarios, and hence paths through the interface, can usually be explored because of time constraints.

15.4 Predictive Models

Similar to inspection methods, predictive models evaluate a system without testing users. However, rather than expert evaluators role-playing users, experts apply formulas to derive various measures of user performance. Predictive modeling techniques provide estimates of the efficiency of different systems for various kinds of tasks. For example, a cell phone designer might choose a predictive method because it enables her to determine which layout of physical and soft keys is optimal for performing common operations.

A well-known predictive modeling technique is GOMS. This is a generic term used to refer to a family of models that vary in their granularity concerning the aspects of a user's performance they model and make predictions about. These include the time it takes to perform tasks and the most effective strategies to use when performing tasks. The models have been used mainly to predict user performance when comparing different applications and devices. Below we describe two of the most well-known members of the GOMS family: the GOMS model and its ‘daughter,’ the keystroke level model.

15.4.1 The GOMS Model

The GOMS model was developed in the early 1980s by Stu Card, Tom Moran, and Alan Newell (Card et al., 1983). It was an attempt to model the knowledge and cognitive processes involved when users interact with systems. The term ‘GOMS’ is an acronym which stands for goals, operators, methods, and selection rules:

  • Goals refer to a particular state the user wants to achieve, e.g. find a website on interaction design.
  • Operators refer to the cognitive processes and physical actions that need to be performed in order to attain those goals, e.g. decide on which search engine to use, think up and then enter keywords in search engine. The difference between a goal and an operator is that a goal is obtained and an operator is executed.
  • Methods are learned procedures for accomplishing the goals. They consist of the exact sequence of steps required, e.g. drag mouse over entry field, type in keywords, press the ‘go’ button.
  • Selection rules are used to determine which method to select when there is more than one available for a given stage of a task. For example, once keywords have been entered into a search engine entry field, many search engines allow users to press the return key on the keyboard or click the ‘go’ button using the mouse to progress the search. A selection rule would determine which of these two methods to use in the particular instance. Below is a detailed example of a GOMS model for deleting a word in a sentence using Microsoft Word.

Goal: delete a word in a sentence

Method for accomplishing goal of deleting a word using menu option:

  Step 1. Recall that the word to be deleted has to be highlighted.
  Step 2. Recall that the command is ‘cut’.
  Step 3. Recall that the ‘cut’ command is in the edit menu.
  Step 4. Accomplish the goal of selecting and executing the ‘cut’ command.
  Step 5. Return with goal accomplished.

Method for accomplishing goal of deleting a word using delete key:

  Step 1. Recall where to position the cursor in relation to the word to be deleted.
  Step 2. Recall which key is the delete key.
  Step 3. Press the ‘delete’ key to delete each letter.
  Step 4. Return with goal accomplished.

Operators to use in the above methods:

  1. Click mouse
  2. Drag cursor over text
  3. Select menu
  4. Move cursor to command
  5. Press keyboard key

Selection rules to decide which method to use:

  1. Delete text using the mouse and selecting from the menu if a large amount of text is to be deleted.
  2. Delete text using the delete key if a small number of letters is to be deleted.
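Although GOMS models are usually written out as structured text, as above, it can help to see the same structure expressed more formally. The sketch below is our own illustration of the delete-word example; the ten-character threshold in the selection rule is an arbitrary assumption, since GOMS itself does not fix a number.

# Illustrative sketch of the goal, methods, operators, and selection rules above.

goal = "delete a word in a sentence"

methods = {
    "menu": [                    # method using the 'cut' menu option
        "highlight the word to be deleted",
        "recall that the command is 'cut'",
        "recall that 'cut' is in the edit menu",
        "select and execute the 'cut' command",
    ],
    "delete_key": [              # method using the delete key
        "position the cursor in relation to the word to be deleted",
        "recall which key is the delete key",
        "press 'delete' for each letter",
    ],
}

operators = ["click mouse", "drag cursor over text", "select menu",
             "move cursor to command", "press keyboard key"]

def select_method(amount_of_text: int) -> str:
    """Selection rule: use the menu for large amounts of text, the delete key otherwise."""
    return "menu" if amount_of_text > 10 else "delete_key"   # threshold is an assumption

print(select_method(3))   # -> 'delete_key' for a short word such as 'not'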


15.4.2 The Keystroke Level Model

The keystroke level model differs from the GOMS model in that it provides actual numerical predictions of user performance. Tasks can be compared in terms of the time it takes to perform them when using different strategies. The main benefit of making these kinds of quantitative predictions is that different features of systems and applications can be easily compared to see which might be the most effective for performing specific kinds of tasks.

When developing the keystroke level model, Card et al. (1983) analyzed the findings of many empirical studies of actual user performance in order to derive a standard set of approximate times for the main kinds of operators used during a task. In so doing, they were able to come up with the average time it takes to carry out common physical actions (e.g. press a key, click on a mouse button) together with other aspects of user–computer interaction (e.g. the time it takes to decide what to do, the system response rate). Below are the core times they proposed for these (note how much variability there is in the time it takes to press a key for users with different typing skills).

[Table: approximate times for the core keystroke level operators, e.g. K (pressing a key), P (pointing with the mouse), P1 (clicking the mouse button), H (homing hands between keyboard and mouse), M (mental preparation), and R (system response).]

The predicted time it takes to execute a given task is then calculated by describing the sequence of actions involved and then summing together the approximate times that each one will take:

[The predicted execution time is the sum of the times for the operators in the sequence, e.g. Texecute = TK + TP + TH + TM + TR.]

For example, consider how long it would take to insert the word not into the following sentence, using a word processor like Microsoft Word:

  1. Running through the streets naked is normal.

So that it becomes:

  1. Running through the streets naked is not normal.

First we need to decide what the user will do. We assume that she has read the sentences beforehand, and so we start our calculation at the point where she is about to carry out the requested task. To begin, she will need to think which method to select, so we first note a mental event (M operator). Next she will need to move the cursor to the appropriate point in the sentence, so we note an H operator (i.e. reach for the mouse). The remaining sequence of operators is then: position the mouse before the word normal (P), click the mouse button (P1), move the hand from the mouse to the keyboard ready to type (H), think about which letters to type (M), type the letters n, o, and t (3K), and finally press the spacebar (K).

The times for each of these operators can then be worked out:

Mentally prepare (M) 1.35
Reach for the mouse (H) 0.40
Position mouse before the word ‘normal’ (P) 1.10
Click mouse (P1) 0.20
Move hands to home position on keys (H) 0.40
Mentally prepare (M) 1.35
Type ‘n’ (good typist) (K) 0.22
Type ‘o’ (K) 0.22
Type ‘t’ (K) 0.22
Type ‘space’ (K) 0.22
Total predicted time: 5.68 seconds

When there are many components to add up, it is often easier to group together operators of the same kind. For example, the above can be rewritten as: 2(M) + 2(H) + 1(P) + 1(P1) + 4(K) = 2.70 + 0.80 + 1.10 + 0.20 + 0.88 = 5.68 seconds.
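The same calculation can be written as a few lines of code. The sketch below simply reuses the operator times from the worked example above (M = 1.35, H = 0.40, P = 1.10, P1 = 0.20, K = 0.22 for a good typist).

# Sketch of the keystroke level calculation above, using the times from the worked example.

TIMES = {"M": 1.35, "H": 0.40, "P": 1.10, "P1": 0.20, "K": 0.22}

def klm_time(sequence):
    """Sum the approximate times for a sequence of keystroke level operators."""
    return sum(TIMES[op] for op in sequence)

# Insert the word 'not': mentally prepare, reach for the mouse, point, click,
# home on the keyboard, mentally prepare, then type 'n', 'o', 't' and a space.
insert_not = ["M", "H", "P", "P1", "H", "M", "K", "K", "K", "K"]
print(f"{klm_time(insert_not):.2f} seconds")   # 5.68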

Over 5 seconds seems a long time to insert a word into a sentence, especially for a good typist. Having made our calculation, it is useful to look back at the various decisions we made. For example, we may want to think about why we included a mental operator before typing the letters n, o, and t but not before any of the other physical actions. Was this necessary? Perhaps we don't need to include it. Deciding when to include a time for mentally preparing for a physical action is one of the main difficulties of using the keystroke level model. Sometimes it is obvious when to include one (especially if the task requires making a decision), but at other times it can seem quite arbitrary. Another problem is that, just as typing skills vary between individuals, so do the times people spend mentally preparing and thinking about what to do. Mental preparation can vary from under half a second to well over a minute. Practice at modeling similar kinds of tasks, and comparing the predictions with actual times taken, can help overcome these problems. Ensuring that decisions are applied consistently also helps: for example, if two prototypes are being compared, apply the same decisions to each.

Activity 15.7

As described in the GOMS model above, there are two main ways a word can be deleted from a sentence when using a word processor like Word. These are:

  1. Deleting each letter of the word individually by using the delete key.
  2. Highlighting the word using the mouse and then deleting the highlighted section in one go.

Which of the two methods is quickest for deleting the word ‘not’ from the following sentence?

I do not like using the keystroke level model

Comment

  1. Our analysis for method 1 is:

    [Keystroke level analysis for method 1]

  2. Our analysis for method 2 is:

    [Keystroke level analysis for method 2]

The result seems counter-intuitive. Why do you think this is? The reason is that selecting the letters to be deleted in the second method takes longer than pressing the delete key three times in the first method. If the word had been any longer, for example ‘keystroke,’ then the keystroke analysis would have predicted the opposite. There are also other ways of deleting words, such as double-clicking on the word (to select it) and then pressing either the delete key or the ctrl+X key combination. What do you think the keystroke level model would predict for either of these two methods?
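
If you want to check your own answer, the sketch below shows one plausible way of modeling the two methods in Python, reusing the operator times from the table above. The operator sequences, and in particular where the M operators are placed, are our own assumptions rather than a definitive analysis, so your totals may differ; what matters is the comparison between the two methods.

    # One plausible modeling of the two deletion methods (assumed operator
    # sequences; times in seconds from the keystroke level model table).
    KLM_TIMES = {"K": 0.22, "P": 1.10, "P1": 0.20, "H": 0.40, "M": 1.35}

    # Method 1: point just after 'not', click, home hands on the keyboard,
    # then press the delete key three times (once per letter).
    method_1 = ["M", "H", "P", "P1", "H", "M"] + ["K"] * 3

    # Method 2: point at the start of 'not', drag across it to highlight it
    # (modeled here as a second point-and-click), home hands, press delete.
    method_2 = ["M", "H", "P", "P1", "P", "P1", "H", "M", "K"]

    for name, ops in [("method 1", method_1), ("method 2", method_2)]:
        print(name, round(sum(KLM_TIMES[op] for op in ops), 2))

With these assumptions, method 1 comes out at about 5.5 seconds and method 2 at about 6.3 seconds, in line with the comment above. Lengthening the word adds 0.22 seconds per letter to method 1 but leaves method 2 unchanged, which is why the prediction flips for a longer word such as ‘keystroke.’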

CASE STUDY 15.1: Using GOMS in the Redesign of a Phone-based Response System

Usability consultant Bill Killam and his colleagues worked with the US Internal Revenue Service (IRS) to evaluate and redesign the telephone response information system (TRIS). The goal of TRIS is to provide the general public with advice about filling out a tax return—and those of you who have to do this know only too well how complex it is. Although this case study is situated in the USA, such phone-based information systems are widespread across the world.

Typically, telephone answering systems can be frustrating to use. Have you been annoyed by the long menus of options such systems provide when you are trying to buy a train ticket or when making an appointment for a technician to fix your phone line? What happens is that you work your way through several different menu systems, selecting an option from the first list of, say, seven choices, only to find that now you must choose from another list of five alternatives. Then, having spent several minutes doing this, you discover that you made the wrong choice back in the first menu, so you have to start again. Does this sound familiar? Other problems are that often there are too many options to remember, and none of them seems to be the right one for you.

The usability specialists used the GOMS keystroke level model to predict how well a redesigned user interface compared with the original TRIS interface in supporting users' tasks. In addition, they carried out usability testing.

15.4.3 Benefits and Limitations of GOMS

One of the main attractions of the GOMS approach is that it allows comparative analyses to be performed for different interfaces, prototypes, or specifications relatively easily. Since its inception, a number of researchers have used the method, reporting on its success for comparing the efficacy of different computer-based systems. One of the most well known is Project Ernestine (Gray et al., 1993). This study was carried out to determine whether a proposed new, ergonomically designed workstation would improve telephone call operators' performance. Empirical data collected for a range of operator tasks using the existing system were compared with hypothetical data deduced from a GOMS analysis of the same set of tasks on the proposed new system.

Similar to the activity above, the outcome of the study was counter-intuitive. When comparing the GOMS predictions for the proposed system with the empirical data collected for the existing system, the researchers discovered that several tasks would take longer to accomplish. Moreover, their analysis was able to show why this might be the case: certain keystrokes would need to be performed at critical times during a task rather than during slack periods (as was the case with the existing system). Thus, rather than carrying out these keystrokes in parallel when talking with a customer (as they did with the existing system) they would need to do them sequentially—hence the predicted increase in time spent on the overall task. This suggested to the researchers that, overall, the proposed system would actually slow down the operators rather than improve their performance. On the basis of this study, they were able to advise the phone company against purchasing the new workstations, saving them from investing in a potentially inefficient technology.

While this study has shown that GOMS can be useful in helping make decisions about the effectiveness of new products, it is not often used for evaluation purposes. Part of the problem is its highly limited scope: it can only really model a small set of highly routine, data-entry style computer-based tasks. Furthermore, it is intended to be used only to predict expert performance, and does not allow for errors to be modeled. This makes it much more difficult (and sometimes impossible) to predict how an average user will carry out their tasks when using a range of systems, especially those that have been designed to be very flexible in the way they can be used. In most situations, it isn't possible to predict how users will perform. Many unpredictable factors come into play, including individual differences among users, fatigue, mental workload, learning effects, and social and organizational factors. For example, most people do not carry out their tasks sequentially but will be constantly multitasking, dealing with interruptions, and talking to others.

A dilemma with predictive models, therefore, is that they can only really make predictions about predictable behavior. Given that most people are unpredictable in the way they behave, it makes it difficult to use them as a way of evaluating how systems will be used in real-world contexts. They can, however, provide useful estimates for comparing the efficiency of different methods of completing tasks, particularly if the tasks are short and clearly defined.

15.4.4 Fitts' Law

Fitts' Law (1954) predicts the time it takes to reach a target using a pointing device. It was originally used in human factors research to model the relationship between speed and accuracy when moving towards a target on a display. In interaction design it has been used to describe the time it takes to point at a target, based on the size of the object and the distance to the object. Specifically, it is used to model the time it takes to use a mouse and other input devices to click on objects on a screen. One of its main benefits is that it can help designers decide where to locate buttons, what size they should be, and how close together they should be on a screen display. The law states that:

T = k log2(D/S + 1)

where

T = time to move the pointer to a target

D = distance between the pointer and the target

S = size of the target

k = a constant of approximately 200 msec/bit

In a nutshell, the bigger the target the easier and quicker it is to reach it. This is why interfaces that have big buttons are easier to use than interfaces that present lots of tiny buttons crammed together. Fitts' Law also predicts that the most quickly accessed targets on any computer display are the four corners of the screen. This is because of their ‘pinning’ action, i.e. the sides of the display constrain the user from over-stepping the target. However, as pointed out by Tog on his AskTog website, corners seem strangely to be avoided at all costs by designers.

Fitts' Law, therefore, can be useful for evaluating systems where the time to physically locate an object is critical to the task at hand. In particular, it can help designers think about where to locate objects on the screen in relation to each other. This is especially useful for mobile devices, where there is limited space for placing icons and buttons on the screen. For example, in a study carried out by Nokia, Fitts' Law was used to predict expert text entry rates for several input methods on a 12-key cell phone keypad (Silfverberg et al., 2000). The study helped the designers make decisions about the size of keys, their positioning, and the sequences of presses to perform common tasks for the mobile device. Trade-offs between the size of a device, and accuracy of using it, were made with the help of calculations from this model. Comparisons of speed and accuracy of text entry on cell phones have also been informed by the application of Fitts' Law (MacKenzie and Soukoreff, 2002).
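
To get a feel for how such comparisons are made, the following Python sketch applies one common (Shannon-style) formulation of Fitts' Law, T = k log2(D/S + 1), with the constant of approximately 200 msec/bit given above. The key sizes and distances are hypothetical, chosen only to illustrate the effect of target size; they are not taken from the Silfverberg et al. study.

    import math

    K = 0.2  # constant of approximately 200 msec/bit, expressed here in seconds

    def fitts_time(distance_mm, size_mm, k=K):
        # Predicted time in seconds to reach a target of width size_mm at a
        # distance of distance_mm, using T = k * log2(D/S + 1).
        return k * math.log2(distance_mm / size_mm + 1)

    # Hypothetical comparison: a 5 mm key versus a 10 mm key, both 40 mm away.
    for size in (5, 10):
        print(f"{size} mm key at 40 mm: {fitts_time(40, size):.2f} s")

With these made-up numbers, doubling the key size cuts the predicted time from roughly 0.63 to 0.46 seconds, which is exactly the kind of size-versus-speed trade-off described above.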

Activity 15.8

Microsoft toolbars provide the user with the option of displaying a label below each tool. Give a reason why labeled tools may be accessed faster. (Assume that the user knows the tool and does not need the label to identify it.)

Comment

The label becomes part of the target and hence the target gets bigger. As we mentioned earlier, bigger targets can be accessed more quickly.

Furthermore, tool icons that don't have labels are likely to be placed closer together so they are more crowded. Spreading the icons further apart creates buffer zones of space around the icons so that if users accidentally go past the target they will be less likely to select the wrong icon. When the icons are crowded together the user is at greater risk of accidentally overshooting and selecting the wrong icon. The same is true of menus, where the items are closely bunched together.
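
As a rough illustration of this point, the effect of a label can be approximated with the same hypothetical fitts_time helper by treating the icon plus its label as one larger target; again, the sizes are made up purely for illustration.

    import math

    def fitts_time(distance_mm, size_mm, k=0.2):
        # T = k * log2(D/S + 1), with k of roughly 200 msec/bit as above.
        return k * math.log2(distance_mm / size_mm + 1)

    # A hypothetical 6 mm icon 30 mm away, versus the same icon with a label
    # below it, treated as a combined 12 mm target.
    print(f"unlabeled: {fitts_time(30, 6):.2f} s")   # about 0.52 s
    print(f"labeled:   {fitts_time(30, 12):.2f} s")  # about 0.36 s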

Assignment

This assignment continues the work you did on the web-based ticketing system at the end of Chapters 10, 11, and 14. The aim of this assignment is to evaluate the prototypes produced in the assignment of Chapter 11 using heuristic evaluation.

  (a) Decide on an appropriate set of heuristics and perform a heuristic evaluation of one of the prototypes you designed in Chapter 11.
  (b) Based on this evaluation, redesign the prototype to overcome the problems you encountered.
  (c) Compare the findings from this evaluation with those from the usability testing in the previous chapter. What differences do you observe? Which evaluation approach do you prefer and why?
  (d) Now you have applied methods from each evaluation approach: usability testing, field studies, and analytical evaluation. Draw up a table that summarizes the findings, benefits, costs, and limitations of each.

Summary

This chapter presented analytical evaluation, focusing on heuristic evaluation and walkthroughs, which are done by experts who role-play users' interactions with designs, prototypes, and specifications and then offer their opinions. Heuristic evaluation and walkthroughs offer the evaluator a structure to guide the evaluation process.

The GOMS and keystroke level models, and Fitts' Law can be used to predict user performance. These techniques can be useful for determining whether a proposed interface, system, or keypad layout will be optimal. Typically they are used to compare different designs for a small sequence of tasks. These methods are labor-intensive so do not scale well for large systems.

Key Points

  • Inspections can be used for evaluating requirements, mockups, functional prototypes, or systems.
  • User testing and heuristic evaluation often reveal different usability problems.
  • Other types of inspections used in interaction design include pluralistic and cognitive walkthroughs.
  • Walkthroughs are very focused and so are suitable for evaluating small parts of systems.
  • The GOMS and keystroke level models and Fitts' Law can be used to predict expert, error-free performance for certain kinds of tasks.
  • Predictive models require neither users nor experts, but the evaluators must be skilled in applying the models.
  • Predictive models are used to evaluate systems with limited, clearly defined functionality such as data entry applications, and key-press sequences for cell phones and other hand-held devices.

Further Reading

CARD, S.K., MORAN, T.P. and NEWELL, A. (1983) The Psychology of Human–Computer Interaction. Lawrence Erlbaum Associates. This seminal book describes GOMS and the keystroke level model.

COCKTON, G. and WOOLRYCH, A. (2001) Understanding inspection methods: lessons from an assessment of heuristic evaluation. In: A. Blandford and J. Vanderdonckt (eds), People & Computers XV. Springer-Verlag, pp. 171–191. This paper evaluates the efficacy of heuristic evaluation, and questions whether it lives up to the claims made about it.

HORTON, S. (2005) Access by Design: A Guide to Universal Usability for Web Designers. New Riders Press. This book challenges the belief that designing for universal access is a burden for designers. It demonstrates again and again that everyone benefits from universal usability and provides important guidelines for designers.

KOYANI, S.J., BAILEY, R.W. and NALL, J.R. (2004) Research-Based Web Design and Usability Heuristics. National Cancer Institute. This book contains a thorough review of usability guidelines derived from empirical research. The collection is impressive but each guideline needs to be evaluated and used thoughtfully.

MACKENZIE, I.S. (1992) Fitts' law as a research and design tool in human–computer interaction. Human–Computer Interaction 7, 91–139. This early paper by Scott MacKenzie, an expert in the use of Fitts' Law, provides a detailed discussion of how it can be used in HCI.

MACKENZIE, I.S., and SOUKOREFF, R.W. (2002) Text entry for mobile computing: models and methods, theory and practice. Human–Computer Interaction 17, 147–198. This later paper provides a useful survey of mobile text-entry techniques and discusses how Fitts' Law can inform their design.

MANKOFF, J., DEY, A.K., HSIEH, G., KIENTZ, J., LEDERER, S. and AMES, M. (2003) Heuristic evaluation of ambient displays. Proceedings of CHI 2003, ACM 5(1), 169–176. This paper will be useful for those wishing to derive rigorous heuristics for new kinds of applications. It illustrates how different heuristics are needed for different applications.

NIELSEN, J. and MACK, R.L. (eds) (1994) Usability Inspection Methods. John Wiley & Sons. This book contains an edited collection of chapters on a variety of usability inspection methods. There is a detailed description of heuristic evaluation and walkthroughs and comparisons of these techniques with other evaluation techniques, particularly user testing. Jakob Nielsen's web-site useit.com provides additional information and advice on website design. See particularly http://www.useit.com/papers/heuristic (accessed February 2006) for more recent work.

PREECE, J. (2000) Online Communities: Designing Usability, Supporting Sociability. John Wiley & Sons. This book is about the usability and sociability design of online communities. It suggests guidelines that can be used as a basis for heuristics.

INTERVIEW with Jakob Nielsen

Jakob Nielsen is a pioneer of heuristic evaluation. He is currently principal of the Nielsen Norman Consultancy Group and the author of numerous articles and books, including his recent book, Designing Web Usability (New Riders Publishing). He is well known for his regular sound bites on usability which for many years have appeared at useit.com. In this interview Jakob talks about heuristic evaluation, why he developed the technique, and how it can be applied to the web.

JP: Jakob, why did you create heuristic evaluation?

JN: It is part of a larger mission I was on in the mid-'80s, which was to simplify usability engineering, to get more people using what I call ‘discount usability engineering.’ The idea was to come up with several simplified methods that would be very easy and fast to use. Heuristic evaluation can be used for any design project or any stage in the design process, without budgetary constraints. To succeed it had to be fast, cheap, and useful.

JP: How can it be adapted for the web?

JN: I think it applies just as much to the web, actually if anything more, because a typical website will have tens of thousands of pages. A big one may have hundreds of thousands of pages, much too much to be assessed using traditional usability evaluation methods such as user testing. User testing is good for testing the homepage or the main navigation system. But if you look at the individual pages, there is no way that you can really test them. Even with the discount approach, which would involve five users, it would still be hard to test all the pages. So all you are left with is the notion of doing a heuristic evaluation, where you just have a few people look at the majority of pages and judge them according to the heuristics. Now the heuristics are somewhat different, because people behave differently on the web. They are more ruthless about getting a very quick glance at what is on a page and if they don't understand it then leaving it. Typically application users work a little harder at learning an application. The basic heuristics that I developed a long time ago are universal, so they apply to the web as well. But as well as these global heuristics that are always true, for example ‘consistency,’ there can be specialized heuristics that apply to particular systems. But most evaluators use the general heuristics because the web is still evolving and we are still in the process of determining what the web-specific heuristics should be.

JP: So how do you advise designers to go about evaluating a really large website?

JN: Well, you cannot actually test every page. Also, there is another problem: developing a large website is incredibly collaborative and involves a lot of different people. There may be a central team in charge of things like the homepage, the overall appearance, and the overall navigation system. But when it comes to making a product page, it is the product-marketing manager of, say, Kentucky who is in charge of that. The division in Kentucky knows about the product line and the people back at headquarters have no clue about the details. That's why they have to do their own evaluations in that department. The big thing right now is that this is not being done, developers are not evaluating enough. That's one of the reasons I want to push the heuristic evaluation method even further to get it out to all the website contributors. The uptake of usability methods has dramatically improved from five years ago, when many companies didn't have a clue, but the need today is still great because of the phenomenal development of the web.

JP: When should you start doing heuristic evaluation?

JN: You should start quite early, maybe not quite as early as testing a very rough mockup, but as soon as there is a slightly more substantial prototype. For example, if you are building a website that might eventually have ten thousand pages, it would be appropriate to do a heuristic evaluation of, say, the first ten to twenty pages. By doing this you would catch quite a lot of usability problems.

JP: How do you combine user testing and heuristic evaluation?

JN: I suggest a sandwich model where you layer them on top of each other. Do some early user testing of two or three drawings. Develop the ideas somewhat, then do a heuristic evaluation. Then evolve the design further, do some user tests, evolve it and do heuristic evaluation, and so on. When the design is nearing completion, heuristic evaluation is very useful particularly for a very large design.

JP: So, do you have a story to tell us about your consulting experiences, something that opened your eyes or amused you?

JN: Well, my most interesting project started when I received an email from a co-founder of a large company who wanted my opinion on a new idea. We met and he explained his idea and because I know a lot about usability, including research studies, I could warn him that it wouldn't work—it was doomed. This was very satisfying and seems like the true role for a usability consultant. I think usability consultants should have this level of insight. It is not enough to just clean up after somebody makes the mistake of starting the wrong project or produces a poor design. We really should help define which projects should be done in the first place. Our role is to help identify options for really improving people's lives, for developing products that are considerably more efficient, easier or faster to learn, or whatever the criteria are. That is the ultimate goal of our entire field.

JP: Have there been any changes in the way heuristic evaluation and discount usability methods are used or perceived?

JN: I have changed my preferred approach to heuristic evaluation from emphasizing a small set of general heuristics to emphasizing a large set of highly specific usability guidelines. I did this because there are now millions of people who perform user interface design without knowing anything about general HCI principles, and it's difficult for these people to apply general heuristics correctly. One of my earliest research results for heuristic evaluation is that the method works best with experienced evaluators who have a deep understanding of usability. All very well, but when a team doesn't have experienced usability professionals on staff, what should it do? That's where the specific guidelines come into play. When you tell people, for example, that search should be represented on a website by a type-in box on every page and that the box should be at least 27 characters wide, then you are giving them evaluation criteria that anybody can apply without knowing the theory. For example, my group developed a set of 75 usability guidelines for the design of the public relations area of a corporate website. There are guidelines for everything from the way to present press releases online to how to show the PR department's contact information. These guidelines are based on our own user testing of a broad range of journalists working in newspapers, magazines, and broadcast media in several different countries, so we know that the guidelines represent the needs and wants of the target audience. In real life, PR information is placed on a website by the PR department, and they don't have time to conduct their own user testing with journalists. Neither does the typical PR professional have any educational background in HCI. Thus, I don't think that the broad list of general heuristics would do them much good, but we know from several examples that a company's PR pages get much better when the PR group has evaluated it with the 75 detailed guidelines.

JP: What about changes regarding usability in general?

JN: The general trend has been toward hugely increased investment in usability around the world. I don't think anybody has the real numbers, but I would not be surprised if the amount of resources allocated to usability increased by a thousand percent or more from 1995 to 2005. Of course, this is not nearly enough, because during the same ten years, the number of websites increased by 87,372%. In other words, we are falling behind by a factor of about 87. This is why discount usability engineering is more important than ever.

JP: And how do you think the web will develop? What will we see next, what do you expect the future to bring?

JN: I hope we will abandon the page metaphor and reach back to the earlier days of hypertext. There are other ideas that would help people navigate the web better. The web is really an ‘article-reading’ interface. My website useit.com, for example, is mainly articles, but for many other things people need a different interface, the current interface just does not work. I hope we will evolve a more interesting, useful interface that I'll call the ‘Internet desktop,’ which would have a control panel for your own environment, or another metaphor would be ‘your personal secretary.’ Instead of the old goal where the computer spits out more information, the goal would be for the computer to protect you from too much information. You shouldn't have to actually go and read all those webpages. You should have something that would help you prioritize your time so you would get the most out of the web. But, pragmatically speaking, these are not going to come any time soon. My prediction has been that Explorer Version 8 will be the first good web browser and that is still my prediction. The more short-term prediction is really that designers will take much more responsibility for content and usability of the web. We need to write webpages so that people can read them. For instance, we need headlines that make sense. Even something as simple as a headline is a user interface, because it's now being used interactively, not as in a magazine where you just look at it. So writing the headline, writing the content, designing the navigation are jobs for the individual website designers. In combination, such decisions are really defining the user experience of the network economy. That's why we really have an obligation, every one of us, because we are building the new world and if the new world turns out to be miserable, we have only ourselves to blame, not Bill Gates. We've got to design the web for the way users behave.

JP: Finally, can heuristic evaluation be used to evaluate mobile systems and games?

JN: I only have direct experience from mobile devices, and heuristic evaluation certainly works very well in this domain. You can identify a lot of issues with a phone or other mobile user experience by using exactly the same heuristics as you would for any other platform. However, you have to interpret the heuristics in the context of the smaller screen, which changes their relative importance. For example, say that a user selects a headline from a list of news stories on a mobile device, the next screen will usually be the full text of the story. You might expect that the standard heuristic “visibility of system status” should imply that the system should provide feedback by repeating the selected headline on top of the story. And that's indeed what I would recommend for a Web page. But on a small screen, it's better to devote more of the space to new information and assume that the user can remember the headline from the previous screen. This doesn't violate the heuristic “recognition rather than recall” because you don't need to use the exact wording of the headline for anything while reading the typical news story. If the headline were in fact used on the next screen, then it should be repeated, in order to minimize the user's memory load.

Games are a different matter. I haven't done such a project, so I don't know for a fact, but I suspect that traditional heuristic evaluation might help on the limited question of evaluating the controls of a game. Games are no fun if you can't figure out how to play them. For example, the heuristic for “consistency and standards” would indicate that if there's a certain way to pick up guns, it should be the same for all forms of guns. Similarly, if there's a certain button on the controller that's used to shoot the gun in all other games, then our game should use the same button. However, I don't think that the standard heuristics would be very helpful in evaluating the gameplay quality of a game. It's hard to say, for example, whether a game like Civilization should have fewer raging barbarians, or how much more food should be grown on a tile if you irrigate it. It's possible that one could discover a different set of heuristics to help make such decisions, maybe by studying how successful games designers make their trade-offs.
