Submitted to MIT Press as a chapter for the upcoming book, "Automated
Spoken Dialog Systems," edited by Susann Luperfoy. Copyright 1997, Sun
Microsystems, Inc.
Using Natural Dialogs as the Basis for Speech Interface Design

Nicole Yankelovich
Sun Microsystems Laboratories

Many of us have experienced interactive voice response systems that allow users to say numbers rather than pressing the telephone keypad:
  To transfer to another extension, say 2..."

The design in the example above is clearly easy to implement, but a human answering the phone in the same situation would never ask a caller to say 1 for this or 2 for that. An advantage that speech application designers have is that it is possible to discover what a human would, in fact, say in situations like the one above. In most speech applications, particularly ones with speech-only interfaces, humans can easily fill the role that will eventually be filled by a computer.

Motivation

Our group embraced the use of pre-design studies after designing a large electronic mail application without conducting one. We had decided that the application had to support reading, skipping, deleting, and replying to new messages. We also knew that our users would be familiar with the graphical interface for performing these tasks. We thought we knew enough to proceed.

At this stage, we could have brought in target users and conducted a pre-design study, but for various reasons we decided to begin the software design after recording only a few natural conversations between members of the development team. Since we had a graphical interface to start from, we mostly translated that directly into the speech interface.

Observing the first participant in the user study of the prototype software made it obvious that there were major design flaws. Callers over the telephone were overwhelmed by the same volume and organization of mail headers that worked so effectively in the graphical interface. We went through several iterations of the design before we solved this problem by using an alternate message numbering scheme and grouping messages based on automatic and user-defined rules [Yankelovich95]. In retrospect, it is impossible to know what the outcome would have been had we performed a natural dialog pre-design study; however, our subsequent experience has demonstrated that these studies help to eliminate major design problems.
During the pre-design studies we have performed, we have been able to identify common patterns of behavior and have often uncovered unexpected information that significantly improved the design. In a few cases, participants in the study behaved in ways that caused us to rethink basic design assumptions.

Pre-Design Studies

These studies differ significantly from "wizard-of-oz" studies that have been used extensively by others in the design of speech-only and multi-modal systems. Fraser and Gilbert describe the wizard-of-oz technique as "the simulation of a computer system which takes spoken natural language input, processes it in some principled way, and generates spoken natural language responses." In addition, they state that one of the pre-conditions for conducting a wizard-of-oz study "is that before the experiments are begun it should be possible to formulate a detailed specification of how the future system is expected to behave" [Fraser91, p.82].

The pre-design studies that are described in this chapter take place prior to any system design or functional specification. Their purpose is to launch the design process. In fact, a number of researchers who use the wizard-of-oz technique begin the process with a pre-experimental phase that involves studying natural human dialogs [Delomier89, Guyomard87]. Like these other researchers, our pre-design studies use what Fraser and Gilbert refer to as a scenario-based approach:
Specifically, pre-design studies are useful for:
The rationale that other researchers have for simulating the speech system after conducting a natural dialog study is that talking to a computer is a fundamentally different experience than talking to a human. Our work has also shown this to be true; however, the prototyping tools we have developed [Martin96] make it relatively easy to implement dialog designs, test them, and iterate on the design. We have found that effective pre-design studies can be informal, inexpensive, and fast, while significantly more overhead is associated with wizard-of-oz studies.

Of course, there is a trade-off. Pre-design studies alone are not nearly as effective in helping to define a complete dialog model as when combined with wizard-of-oz studies. In our development efforts, therefore, more of the onus is placed on the software designer to invent these models; consequently, we must devote more resources to user testing and software redesign. We believe this is a reasonable trade-off since having users interact with a real system produces more accurate results than testing a simulation with a wizard who may or may not be effective in creating realistic error conditions.

This chapter presents four natural dialog pre-design case studies. Each one is unique in some respect. The differences reflect our attempt to refine our techniques and also to adapt to the demands of the specific application. The first two case studies involve speech-only systems -- one telephone-based and the other situated in an office. The other two case studies describe multi-modal systems. Each case study includes excerpts from transcripts of natural dialogs as well as dialog samples from the resulting computer-based system. The study descriptions conclude with a summary of lessons learned about general speech user interface concepts. Finally, the chapter ends by examining the six uses of pre-design studies listed above in light of each of the application design experiences.
Pre-Design Case Studies

Study #1: SpeechActs Calendar (speech-only, telephone-based)

Purpose of Application
As part of an umbrella speech project at Sun called SpeechActs, the
Calendar application was intended to provide users with telephone access to
Sun's multi-user Calendar Manager (Figure 1 - The Calendar Manager graphical
interface). This graphical application
is heavily used at Sun for maintaining work and personal schedules. Its
multi-user capabilities allow users on the same intranet to browse other
people's public calendar appointments. When users are at home or away from
the office, even if they have a computer, it is quite difficult to access
the graphical software remotely.
The purpose of the SpeechActs Calendar application was to allow users to directly access their own calendar and other calendars when they were away from their computer. We decided, as a first step, to tackle calendar browsing, and leave calendar data entry for a future project.

Study Design

Before designing the software, we wanted to discover how people talked about their calendars. We were lucky enough to find two Sun sales managers whose administrative assistant maintained their calendars for them [Yankelovich94]. The managers would call the assistant several times each day when they were on the road to find out what she had scheduled for them. She always had the graphical Calendar Manager application open on her screen while she spoke with the sales managers on the phone. We received their permission to audio tape several weeks' worth of these completely natural telephone conversations. The assistant would simply push a button to turn on the recorder every time one of the sales managers called in.

After transcribing the tapes, the interface team combed through the transcripts looking for pieces of the conversation that were directly related to looking up information on the calendar. Here's an excerpt from one conversation (all names have been changed in this and in all subsequent transcript segments):
Ben: I would like to figure out what Tom's calendar looks like for next week... Specifically, the late afternoon of the 23rd and the 24th.
Asst: OK. 23rd. He has a meeting at 9 o-clock.
Ben: No, How about later that afternoon?
Asst: Ok, it's open.
Ben: How about the 24th
Asst: 24th. What time?

In addition to collecting common ways of phrasing date-related questions (e.g., "I would like..." or "How about..."), we learned a number of important lessons from this pre-design study. First, people don't speak about dates and scheduling using the same vocabulary that exists in the graphical interface, even if the application is open on the screen while they're conversing. For example, even though the administrative assistant had the Calendar Manager program open, this dialog occurred:
Ben: Next, next Monday. Can you get into John's calendar?
Asst: T

The sales manager, a proficient Calendar Manager user himself, never said anything about "browsing," and the assistant never used any words like "unable to access." We learned that in designing a dialog, it was important to listen to the way people talk about a domain, and not derive the speech interface directly from the graphical application.

Another related lesson involved the participants' use of relative date expressions. To understand the meaning of these, it is usually necessary to have some context. This can either be conversational context or context in terms of knowing information about dates in general. Here are some examples of the different types of date expressions that cropped up:

"Any appointments tomorrow?"
"And the day after that?"
"Is anything happening on Veteran's Day?"
"So, either the 11th or the 12th."
"I'll get back to you at the beginning of the week."

We observed that relative dates were not concepts that were embodied in the graphical Calendar Manager interface. If you have a calendar in front of you, absolute, fully-specified dates work well. If you don't, relative dates are essential, and partially specified dates make the conversation flow much more naturally.

Not only did we want to find out how people requested calendar information, but we also wanted to know how the resulting information was presented. In looking for instances of calendar appointments being read out loud, we found that the transcripts emphasized the importance of feedback. This was not really news for us, but it was nonetheless interesting to see how the participants provided each other with frequent feedback, as in this excerpt (asterisks indicate overlapping speech):
Tom: So, S.H., my favorite regional administrator, what do I have today on my calendar?
Asst: Your 9-o-clock is now over with. You have a 2-o-clock conference call with NCR

Herb Clark, a noted conversational theorist, talks about "discourse as a collaborative process" [Clark92]. The frequent OK's in this dialog provide the speaker with "evidence of understanding." While it is difficult to incorporate this level of feedback in a computer-based speech application, it is nonetheless essential to provide the user with clear evidence that the system has correctly interpreted the user's input. In the example above, notice how the feedback "Mark?!" allows the participants to correct their mutual understanding when the assistant accidentally adds "Mark" to the conference call list.

Software Design

It was not possible, given error rates and vocabulary constraints inherent in current speech technologies, for us to reproduce the conversational richness illustrated in the excerpts above. Our goal was to take what we could from the natural dialogs and design a constrained system with a natural feel. We ended up implementing the following subset of features:
For the sake of comparison, if Ben, in the first example above, were to have the same conversation with the SpeechActs Calendar (S.A.), here's what a complete interaction with the software might sound like:
User: What's on Tom's calendar for next week?
S.A.: On Monday, July 22nd, Tom has --
User:

Notice that absolute dates are provided as feedback to be sure the user's relative date expressions were properly interpreted. In the Calendar pre-design study, as well as the others described below, the participants seem to assume that their conversational partner has understood them, unless they have evidence to the contrary. In designing the Calendar application, we had to make the assumption that the user might not have been understood properly. The speech recognizer could have made an error, returning a partially or completely incorrect sentence. In addition, the natural language processor that is part of the SpeechActs framework [Martin96] could have made an error in converting the relative date into an absolute one. In these cases, the SpeechActs Calendar allows users to correct portions of the date. For example:
User: I'd like Bob's calendar for the 9th.
S.A.: On Friday, August 9th, Bob has no appointments
User: No, I meant September.
S.A.: Sorry, on Monday, September 9th, Bob has ...

In the Calendar application, we used the natural dialogs as a starting point for the user interface design. The final system, while not nearly as fluent as the human conversations, is nonetheless reminiscent of the natural dialogs.

Lessons Learned

The Calendar design experience, coupled with our prior experience designing the Electronic Mail application, highlights a number of important, general speech user interface design concepts:
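Looking back at the date handling described in the Software Design discussion above (relative or partial dates echoed back as absolute dates, with the user able to correct just one portion), the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SpeechActs natural language processor: the function names (`start_of_next_week`, `next_day_of_month`, `correct_month`, `speak`) are hypothetical, and the rule that "the 9th" means the next upcoming 9th is an assumption made for the sketch.

```python
from datetime import date, timedelta

# English names are spelled out so the spoken feedback does not depend on locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def ordinal(n: int) -> str:
    """Return 1st, 2nd, 3rd, 4th, ... 11th, 12th, 13th, ... 22nd, etc."""
    if 10 <= n % 100 <= 20:
        return f"{n}th"
    return f"{n}" + {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def speak(d: date) -> str:
    """Render a fully specified date as spoken feedback."""
    return f"{WEEKDAYS[d.weekday()]}, {MONTHS[d.month - 1]} {ordinal(d.day)}"


def start_of_next_week(today: date) -> date:
    """Resolve 'next week' to the following Monday (assumed convention)."""
    return today + timedelta(days=7 - today.weekday())


def next_day_of_month(day: int, today: date) -> date:
    """Resolve a partial date such as 'the 9th' to the next such day (assumed rule)."""
    year, month = today.year, today.month
    for _ in range(24):  # look ahead at most two years
        try:
            candidate = date(year, month, day)
        except ValueError:  # e.g. the 31st in a 30-day month
            candidate = None
        if candidate is not None and candidate >= today:
            return candidate
        month += 1
        if month > 12:
            month, year = 1, year + 1
    raise ValueError(f"no month has a day {day}")


def correct_month(d: date, month_name: str) -> date:
    """Apply a partial correction such as 'No, I meant September'.

    Raises ValueError if the month name is unknown or the day does not
    exist in the corrected month.
    """
    return d.replace(month=MONTHS.index(month_name) + 1)


# Hypothetical "today" chosen so the output matches the sample dialog.
first_guess = next_day_of_month(9, date(1996, 8, 5))
print("On " + speak(first_guess))                              # On Friday, August 9th
print("Sorry, on " + speak(correct_month(first_guess, "September")))
```

Echoing the weekday and month in the feedback is what gives the user something concrete to correct; a bare "the 9th" in the reply would hide a wrong month until much later in the conversation.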
Study #2: Office Monitor (speech-only, microphone-based)

Purpose of Application

The Office Monitor is a walk-up speech interface situated in a person's office [Yankelovich96]. When the office owner is out, the Office Monitor provides visitors with information about the office owner and offers to take a message. These audio messages are then e-mailed to the office owner. When the office owner returns, the Office Monitor reports on the number of visitors, and offers to continue taking messages (Figure 2 - Diagram of setup for Office Monitor software).
Study Design

In this pre-design study, as in the other two described below, we were not able to find people performing the exact task we wished to model in the application. Instead, we had to manufacture a scenario with opportunities for interaction that were as close as possible to the setting we wanted the software to operate in. A member of the design team was selected as a "tester." This person, acting as a human office monitor, was the conversational partner in each of the interactions. We then selected three relatively busy offices (10+ visitors a day) and asked the office owner to alert the tester about times when the office would be empty. During these times, the tester sat and worked in the office, waiting for visitors to come by. In total, the tester recorded 14 short conversations over a three-day period.

We had decided that the tester would try out different greetings to see how visitors' responses would vary. Here are two complete dialogs with different openings (all names have been changed in these examples):
Tester: Hi, are you looking for Rena?
Vis1: Yes.
Tester: She's at a...doing some videotaping, video editing, *today.*
Vis1: *OK*
Tester: Do you want me to take a message?
Vis1: [Shakes head no]
Tester: OK. Thanks.

Tester: Hi. Michelle's in Abe Woll's office. Would you like me to take a message?
Vis2: Sure. I'm looking for a video card and a video camera for a system.
Tester: OK, I'll let her know you stopped by.
Vis2: OK. Thank-you.

These conversations were typical in their length. Of all the conversations we recorded in which the visitor was looking for the office owner and not for the tester (which happened quite a few times), the dialogs ranged from 10 to 20 seconds. Since the tester was sitting in a real office of a busy person, we obviously had to vary the greeting to match the actual circumstances surrounding the office owner's absence. For example, here is another greeting:
Tester: Michelle's out today. She'll be back next week. May I take a message?

The tester also experimented with no greeting at all. In this case, visitors were forced to start the conversation or walk away. We captured openings initiated by visitors such as: "Where's Kathy?," "Is Kathy here?," "Who are you?," and "What are you doing here? Are you doing her job?" In general, we found that the conversations which were started by the visitor tended to be rather open-ended. If the tester started the conversation, she was usually better able to direct it so that the visitor would leave a message and not add too much extra chit chat. Being able to keep the conversation focused is essential for a software implementation since the computer can only "understand" conversations in a very limited domain.

In terms of the software design, we were also interested in how visitors left messages. This exchange was typical:
Vis3: Is Sam here?
Tester: He's around but I don't know where he is.
Do you want to leave a message?
Vis3: Yes. Tell him Anthony stopped by.
As in the example with "Vis2" above, this visitor answered the question and immediately proceeded to leave the message. The same interaction pattern occurred even if the visitor decided not to leave a message:
Tester: She's ummm, she's umm, doing some
video editing with Michelle so
she'll be out for the rest of
the day. *Do y..*
Vis4: *OK*
Tester: Do you want me to leave a message
for her?
Vis4: No, I'll just drop by when she's
here.
Tester: OK, thanks.
Notice in the last two examples that the visitor answers the question about leaving a message with a "yes" or "no," but proceeds to advance the conversation without a pause.

Software Design

The two most challenging aspects of the software design were to try to make the conversations as brief as possible, and to handle the situation where the user merges the answer to a question with the next conversational move. Here's a sample dialog with the Office Monitor (O.M.) software:
O.M.: Hi! Do you want to talk to Nicole?
User: Yes.
O.M.: I don't know where Nicole is. She left this message at 2:34. "Hi! I had to step out, but I will be back by 4:00 at the latest." Do you want to leave a message?
User: Yes.
O.M.: OK, begin recording a message after the tone. Pause for several seconds when you've finished.

We did not manage to make the dialogs nearly as short as the natural ones. They ranged from 20 seconds to 2 minutes. Response time of the speech recognizer, recognition errors, and our desire to always collect certain information (such as the visitor's name) contributed to the length of the interactions. Our focus on brevity stemmed from our concern that visitors would not stop by an office again if they knew they would have to engage in a lengthy conversation with the Office Monitor. Although this concern was not confirmed in our Office Monitor user study (since we did not examine use over time), it emerged as a major issue for users participating in earlier SpeechActs user studies.

We were, however, able to come up with a strategy for handling the case of the user leaving a message without waiting for the "record" prompt. We captured the message, and then asked the user what should be done with it:
O.M.: I don't know when Nicole will be back. She left the following message at 2:34. "Hi! I had to step out, but I will be back by 4:00 at the latest." Do you want to leave a message?
User: Yeh. Nicole, I have the demo ready, so stop by when you get a chance.
O.M.: I'm clueless as to what you said, but I recorded it all. What do you want done with the recording? Say save it, forget it, or record.
User: Save it.
O.M.: OK. I'll tell Nicole you stopped by.

At certain key places in the dialog, we turn on audio recording as well as the speech recognizer. If the recognizer returns a valid result, we throw away the recording, but if the recognizer returns an error, we store the recording. In the example above, we opt to ask the user what to do with the recording. This is to avoid saving recordings containing unhelpful responses like "Yup, I would very much like to please," which might have triggered the recognition error. Also, the user might want to record a more complete message.

In the prior example, we also turn on audio recording. This happens when the system tries to recognize the user's name. If it cannot be recognized, the recording is kept and e-mailed to the office owner along with the body of the message. No error dialog is ever initiated at this point in the conversation.

The strategy of both recognizing and recording helps in two ways. First, it allows the interaction to occur in a way that is most natural -- merging the answer to a question with the body of the message. Second, it helps to keep the dialogs short. There is no need to engage the user in an error dialog if their name isn't recognized. The effort for the office owner to read the name in the e-mail message is about the same as that required to listen to the recording of the name. Also, by capturing the message the first time the user speaks it, there is no need for the user to repeat the message, which can be both frustrating and time-consuming.
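The recognize-and-record strategy described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the Office Monitor implementation: the names `Turn`, `listen`, `handle_leave_message_answer`, and `handle_name` are hypothetical, the recognizer is modeled as any function that returns text or `None` on a recognition error, and the action labels are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Turn:
    text: Optional[str]         # recognized text, or None on a recognition error
    recording: Optional[bytes]  # audio kept only when recognition failed


def listen(audio: bytes, recognize: Callable[[bytes], Optional[str]]) -> Turn:
    """Record and recognize at the same time; keep the audio only on error."""
    text = recognize(audio)
    if text is not None:
        return Turn(text=text, recording=None)   # valid result: discard audio
    return Turn(text=None, recording=audio)      # error: store the recording


def handle_leave_message_answer(
    turn: Turn, ask: Callable[[str], str]
) -> Tuple[str, Optional[bytes]]:
    """After 'Do you want to leave a message?', decide the next move.

    Returns (next_action, saved_message). If recognition failed, the visitor
    may have merged the answer with the message, so ask what to do with it.
    """
    if turn.text is not None:
        action = "prompt_for_message" if turn.text == "yes" else "say_goodbye"
        return action, None
    choice = ask("What do you want done with the recording? "
                 "Say save it, forget it, or record.")
    if choice == "save it":
        return "say_goodbye", turn.recording
    if choice == "record":
        return "prompt_for_message", None
    return "say_goodbye", None                   # "forget it"


def handle_name(turn: Turn) -> dict:
    """At the name prompt, never start an error dialog: attach the audio instead."""
    if turn.text is not None:
        return {"visitor": turn.text, "attachment": None}
    return {"visitor": None, "attachment": turn.recording}
```

With a stand-in recognizer that only knows "yes" and "no", a merged utterance such as "Yeh. Nicole, I have the demo ready..." falls through to the `ask` branch, which mirrors the "save it, forget it, or record" exchange in the sample dialog.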
Despite these benefits of recording, our first choice is still to recognize rather than record. Recording does not allow the application to "understand" the dialog, making it hard to control the flow of the conversation. Also, audio data takes up more storage space than text and is harder to retrieve using a search engine.

Lessons Learned

The Office Monitor project illustrates some specific, but significant, speech user interface concepts:
Study #3: Automated Customer Service Representative (speech input, speech/graphical output, telephone-based)

Purpose of Application

This application allows a user to interact with an on-line Lands' End catalog on an interactive television display. We envision that a TV viewer, perhaps in a hotel room, can tune in to a special shopping channel and see product advertisement videos. At any time, the viewer can call a phone number shown on the bottom of the television screen to browse the interactive catalog. An Automated Customer Service Representative (ACSR or Automated Rep) answers the call and helps the user browse and purchase items shown interactively on the television screen.

Study Design

We conducted a more formal pre-design study for this application than for the others. The graphical portion of the interface, designed by Lands' End, was already built when we began the study. This two-day study involved 15 Lands' End customers. We placed them alone in a simulated hotel room with a television and telephone. They were asked to watch the product advertisements for as long as they wanted and then find at least two items from the electronic Lands' End catalog to purchase.

We used a wizard -- a human operator unseen by the study participants -- to control the television display as the customer spoke on the telephone to an experienced Lands' End customer service representative (the Rep). The wizard and the Rep were not able to see the customer, but the telephone was amplified so that both could hear the customer's side of the conversation. The wizard updated the customer's screen during the conversation. We left it to the discretion of the Rep to cue the wizard on when to change the display. In this study we not only wanted to capture the dialog, but also the timing of the graphical presentation in relationship to the conversation.

One important piece of information we gathered from the study was the different navigation strategies that customers used.
Some, like Mary, followed the hierarchical organization of the material (Figure 3 - First level directory of Lands' End interactive television catalog and screen showing a women's drifter sweater. Illustrations courtesy of Lands' End Direct Merchants.):
     
LE Rep: (displays first-level directory) And Mary which of the product line would you like to see today?
Mary: In the womens.
LE Rep: (displays women's clothing directory) These are our categories. Any particular categories that you're looking for there?
Mary: Sweaters.
LE Rep: (displays sweaters directory) Sweaters. O.K., in the sweater line we carry the four different categories.
Mary: How about the drifter.

Other customers just asked for a specific type of item:
LE Rep: (displays first-level directory) David. And what products are we looking at today?
David: I'd like a soft-sided attache.
LE Rep: OK. We'll go to our luggage line. (displays luggage directory) We carry two soft types of luggage. We carry what is the Original line attache with a a Lighthouse. One is canvas and one is nylon. You'd be interested in looking at what line?
David: The canvas line.
LE Rep: OK. So the Original attache.
David: Yes.

We collected many variations on the way that customers asked to see different portions of the catalog. Here are a few samples:

"Can I see the squall jacket?"
"Now could we switch to children's clothing."
"Let's look at some casual dresses."
"I'd like to see the sweaters please."
"I'm looking for things from bed and bath."
"Can I go back to the main screen."

In some cases, users specified compound navigational requests. Here are a few examples, along with the responses from the Lands' End Rep.
Anne: I'm looking for a blazer and slacks and skirts to go with it.
LE Rep: OK, so we'll go to the tailored.

Carl: Women's clothing and maybe bed and bath.
LE Rep: All right. In the ladies line, as you can see, we have available casual shirts, pants and shirts, tailored. What were you interested in looking at today?

Carl: I need a flat sheet and a fitted sheet in queen.
LE Rep: And the color choice?

In a software implementation, it would be difficult to match the intelligent manner in which the Lands' End Rep handles these compound requests. In the first example, she figures out the appropriate clothing category. In the second example, she decides to handle one request at a time. In the third example (from the same conversation as the second example), she decides she can handle the compound as a single request, so she just asks for a color selection.

In addition to navigation, we were particularly interested in the ordering dialog. In one respect, customers behaved similarly to participants in the Office Monitor study. They frequently answered yes/no questions in a manner that advanced the dialog. For example, this exchange occurred in a conversation about monogramming:
LE Rep: Are we going to do a name?
Janet: No, initials.

From the Rep's perspective, we observed that, consistent with Lands' End's philosophy, she avoided asking if a customer wanted to purchase an item. Most of the time the customer would explicitly indicate their desire to place an order or would imply it by making color and size selections. Here's an example of a straightforward ordering dialog. Carol has been looking at the kids rugby playsuit:
Carol: Oh, I see. All right, maybe I'll take one of those.
LE Rep: In the red-navy?
Carol: How about the sports stripe.
LE Rep: In the sports stripe? And that's in what size for a child?
Carol: 3T.
LE Rep: In a 3 toddler?
Carol: Yes.
LE Rep: OK. And the price on that was $25. OK. Anything else?

Like customers' tendency to advance the dialog with compound answers to yes/no questions, this dialog demonstrates an interaction pattern that the Rep used consistently. She would ask a confirming question (e.g., "In the sports stripe?"), not wait for the answer, and go on to collect the next piece of information. She was so good at her job that we never recorded an instance of her giving incorrect feedback. Even though she rarely made a mistake (one in the three days that we ran this study), she never failed to give customers feedback on their selections. In fact, she sometimes gave feedback on inferred information. Bev was looking at women's clothing and ended up browsing to a unisex turtleneck. The Rep inferred that she wanted a women's shirt, and confirmed it using the same style as in the previous example:
Bev: I think I'd like to order a red one.
LE Rep: Is that in a ladies? In what size please?

In general, this points out the need to make the ordering dialog fluidly connected to browsing. The Rep would remember the questions the customer asked, the items the customer looked at, and the general category of clothing currently being browsed. With this contextual information, she was able to infer quite a bit of information about what the customer wanted by the time they explicitly indicated a desire to order an item. She could quickly confirm these assumptions and then move on to collect unknown information.

In addition to her style of providing feedback, we observed that the Rep varied the way she spoke the same information, often tapering a phrase to a shorter version of one she had said previously. Since each session was quite long (10 to 30 minutes), she certainly was in a position to repeat herself. Here's a simple example of shortening a phrase over the course of a conversation:
LE Rep: Anything else you'd like to look at today?

And later...
LE Rep: Anything else?

A more complex example of tapering prompts involves playing video clips. Here the Rep explains that a video is available and asks if Bev would like to see it. When Bev indicates that she would, the Rep gives her instructions on how to stop the video:
LE Rep: OK. And we do have a video on the turtlenecks. Would you care to see that?
Bev: Sure.
LE Rep: OK. Any time that you'd like me to stop it, just let me know.

Later in the conversation, the interaction is much shorter:
Bev: OK. Probably, I'd like to the drifter.
LE Rep: OK. Would you care to see the video on that?
Bev: Yes, please.
LE Rep:

Not only does the Rep leave out the part about the video being available, but she also drops out the instructions for stopping the playback.

One of the most astounding aspects of the dialogs was how the Rep handled the myriad of questions about colors. It turns out that callers browsing a paper catalog also have trouble determining exactly what the colors look like. This problem was exacerbated on the television display. The Rep had a large repertoire of replies to difficult color questions. Here are some examples:
Bret: It's tough to see the difference between dark mahogany and the espresso.
LE Rep: Yes.
Bret: Can you describe those for me?
LE Rep: Dark mahogany has almost a burgundy cast, where the espresso is definitely a brown.

Janet: The wild rose looks orange.
LE Rep: It does in the picture, although it is definitely a dusty rose.
Janet: The balsam is green?
LE Rep: The balsam is green. It's not really a hunter green. It would actually be like a dusty green.

Size information was another area that customers asked about in almost every dialog. The conversations on this topic were perhaps the most complex. Some interactions were fairly straightforward, as in this example:
Bret: What size would a medium fit?
LE Rep: A medium fits a size in boys 10/12.
Bret: And a small would be a?
LE Rep: And a small would be a 7/8.
Bret: A 7/8, OK.

Most, unfortunately, tended to be rather involved. The following example was fairly typical:
Barb: Now I wear very bulky sweaters, so I usually take a size 12. Would I take a 12 or a 14?
LE Rep: On this particular item, it runs a size small through extra large, although it runs very full through the chest area, because we size for wearing over.
Barb: OK. So what would you recommend?
LE Rep: I would probably say that still a medium would probably work out for you if you don't need it in the length or in the sleeve length.
Barb: No. A medium sounds good.

In fact, size was so problematic for so many customers, that the Rep was often proactive about helping people who didn't explicitly ask for advice.
Janet: I guess the large would be the size.
LE Rep: OK. So the large actually fits like a 14/16.
Janet: Oh, it's that big! No, medium. It's for an 11 year old child.
LE Rep: OK, medium. 10/12.

She seemed to have an uncanny sense about when the customer was selecting a size that was inappropriate.

Software Design

There were many aspects of the dialogs in the pre-design study that seemed far too challenging to tackle in a software implementation of an Automated Customer Service Representative. We approached the problem by deciding to tackle only two of the many sub-tasks that we observed: navigation and simple ordering.

We concentrated on trying to allow the user to browse the catalog in as flexible a way as possible. In support of navigation, we attempted to provide a broad grammar that would allow users to browse either hierarchically or jump directly to an item. A major challenge we faced was translating Lands' End vocabulary into ordinary English terminology. For example, we had to write a set of axioms to state facts such as "a drifter is a sweater," and "claret is a red." That way, when a user asks to see a "red sweater," we can offer them the "claret drifter" as an option. Here's a sample dialog from our prototype Automated Customer Service Rep (ACSR):
ACSR: (plays Lands' End advertisements)
User: (dials phone number shown on screen)
ACSR: (displays first-level directory) Thank you for calling the Lands' End video catalog. Which of these items are you interested in?
User: I'm interested in women's clothing.
ACSR: (displays women's clothing directory) Here are the categories for women. Which one would you like?
User: Um, show me sweaters.
ACSR: (displays women's sweaters directory) Here you go.
User: I'd like to see the cashmere please.
ACSR: (displays cashmere sweater screen) Would you like to see a video about this item?
User: Sure.
ACSR: Say "stop" at any time to halt the video. (plays video)

Here are some other navigational phrases that a user can say:

"Show me luggage."
"I'm looking for a casual shirt."
"Do you carry ladies cotton leggings?"

Other aspects that we incorporated from the pre-design study include tapering prompts, and allowing the user to answer a question and move on to the next request in a single utterance:
User: Could I see skirts please.
ACSR: (displays gabardine skirt page because it is the only skirt available) Want the video?
User: No. What colors does it come in?
ACSR: (displays skirt colors) Here are the colors.

The ordering dialog, although more structured, was even more difficult than navigation. We wanted to support mixed-initiative dialogs: ones where either the user or the Automated Rep can direct the conversation. To streamline ordering, it was important for the Automated Rep to remember conversational context as well as information that the user already provided before asking to place an order. Also, the Automated Rep tends to make many more errors than the human Rep. Sometimes the user's entire sentence is misrecognized and sometimes only a word or two is incorrect. The frequency of these errors is due to the large vocabulary and complex grammar the recognizer must handle in this application. Nonetheless, we adopted the strategy of providing feedback and moving on to collect the next piece of information, as in this example:
User: Could I place an order.
ACSR: What color would you like?
User: I'd like classic navy.
ACSR: (switches to screen with size info) Classic navy. What size?
User: Size 10.
ACSR: That's classic navy in size 10. Correct?
User: Yes.
ACSR: (displays page with order info) Your total cost for the women's regular gabardine skirt is $54.00. Do you want to place this order?
User: Yes, please.
ACSR: Thank you. Anything else today?

With this strategy, the user must notice if an error has been made, and initiate a correction. For example:
ACSR: Claret. What size?
User: No, I wanted CLASSIC NAVY.

In the next iteration of the design, we may add a more interactive order form that would enter data into fields. This would provide the user with visual as well as verbal feedback, making it easier to detect errors. We also hope to add support for users to ask limited questions about color and sizes. In addition, the pre-design study pointed out the need to support questions about items that are not carried in the catalog, questions about which items match other items, and questions about differences between items (e.g., "What's the difference between the drifter and the fine gauge sweaters?"). Since this application stretches current speech technologies to the limit, it will probably take several iterations of the design before we are able to create a satisfying customer service dialog.

Lessons Learned

The Automated Rep study reaffirmed the importance of some of the concepts uncovered in early studies. In addition, several new lessons emerged.
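As an illustration of the vocabulary axioms described in the Software Design discussion for the Automated Rep ("a drifter is a sweater," "claret is a red"), the following is a minimal sketch of how such is-a facts could let a request for a "red sweater" match the "claret drifter." It is not the prototype's actual implementation: the table layout, the tiny catalog, and the names `IS_A`, `CATALOG`, `generalizations`, and `find_items` are all assumptions made for this example.

```python
# Hypothetical is-a axioms. Only the drifter/sweater and claret/red facts
# come from the text; the data structure itself is an assumption.
IS_A = {
    "drifter": "sweater",
    "claret": "red",
}

# Hypothetical catalog entries as (color, item) pairs.
CATALOG = [
    ("claret", "drifter"),
    ("classic navy", "gabardine skirt"),
]


def generalizations(term):
    """Return the term followed by every broader term reachable via IS_A."""
    chain = [term]
    # Follow is-a links upward, stopping if a link would revisit a term.
    while chain[-1] in IS_A and IS_A[chain[-1]] not in chain:
        chain.append(IS_A[chain[-1]])
    return chain


def find_items(color, item):
    """Return catalog entries matching the request directly or via an axiom."""
    return [
        (c, i)
        for c, i in CATALOG
        if color in generalizations(c) and item in generalizations(i)
    ]


print(find_items("red", "sweater"))  # [('claret', 'drifter')]
```

In this sketch the catalog keeps the Lands' End terms, and the everyday terms are reached by walking the axioms upward, so both "claret drifter" and "red sweater" retrieve the same entry.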
Study #4: Multi-modal Drawing (speech/mouse/keyboard input, speech/graphical output, microphone-based)

Purpose of Application
This application, still under development, uses speech to enrich the
standard graphical interface that is found in common structured graphics
editors such as MacDraw, Corel Draw, etc. The entire application can be
operated using the mouse and keyboard alone; however, speech input is
intended to streamline some operations. These fall into three categories.
First, speech helps with operations in which the user's hands are busy with
the mouse or keyboard. For example, speech can be used for selecting tools
from the palette when the user is working in the drawing area. It can also
be used for changing text attributes while the user's hands are busy with
the keyboard.
The second use for speech involves streamlining operations that take two or more mouse or keyboard operations to perform. Usually this involves changing multiple attributes of an object, such as when the user wants a "wide, red border" or "14 point bold italic text in green."

Third, speech can be coordinated with mouse gestures to specify desired attributes or behavior. For example, the user might say "Align this and this and this" while motioning to three different objects on the screen.

Study Design

Our goal in this pre-design study was to set up a situation that would mirror the final system as closely as possible, but that would not require us to write any software prior to the study.

We conducted this study in an afternoon with 10 participants. A tester sat down in front of a workstation with each participant. A commercially available graphics editor was open and running on the screen. The tester had the keyboard near him and the mouse was positioned directly between the tester and the participant so it could easily be shared. The tester instructed the participant to draw a "colorful picture of a house and a tree." To do this, the participant was only allowed to use the mouse, and only in the white drawing area provided by the application. To access any of the tools, style palettes, or menu items, he or she had to ask the tester. The mouse was passed back and forth between the two, as needed.

Here's an excerpt from the middle of one of the sessions. By this time, the tester and the participant had gotten the hang of working together and sharing the mouse (Figure 4a - Illustration created by "Sam" during Multi-modal Drawing pre-design study using a commercial drawing editor).
Sam: Give me a rectangle tool, no a circle tool.
Tester: (selects oval tool)
Sam: (draws an oval for a doorknob) Uh let me have a line tool.
Tester: (selects line tool)
Sam: (draws a line) Another line tool.
Tester: (selects line tool again)
Sam: Umm I guess I'll take a rectangle
Tester: (selects rectangle tool)
Sam: (draws a long, thin rectangle for a tree trunk) Make that brown.
Tester: (selects medium brown from the color palette)
Sam: Umm how about a polygon? No, no, no, whatever that funny thing is. (points to parallelogram tool)
Tester: (selects parallelogram tool) Parallelogram.
Sam: Yeah the parallelogram tool. (draws parallelogram for a tree branch) Make it green.
Tester: (selects green)

Notice in this example that the participant makes two self-corrections in the course of the dialog. Self-corrections and other sorts of error corrections were among the conversational elements we were looking for. Here's an example of a different sort of self-correction:
Sam: Can I make that green
Tester: (selects green) Uh huh.
Sam: No black black. Sorry black. Black.

Below, a different participant is correcting the default behavior of the system:
Tim: Oh now I don't want that brown. I want it white.

Since most of the participants in the study had never used the particular graphics program we chose, the tester provided help about how to use the different tools. This type of verbal help is one of the features we were considering adding to our design, but we did not plan ahead of time what the tester would say. Instead, he gave the sort of help that seemed appropriate to the situation. Here's an example of the tester offering brief help to a user who seemed to understand graphics editing well, but did not know some of the idiosyncrasies of this particular program:
Pat: Ok, umm, give me rectangle.
Tester: (selects rectangle tool) Click the left button.
Pat: (draws a rectangle) Umm, color it red
Tester: (selects red from color palette)
Pat: Cool. Umm now triangle.
Tester: There's a polygon tool, no triangle.
Pat: Umm give me a polygon.
Tester: (selects polygon tool) Click once for each vertex. Double click to end.

Sometimes the tester would have to coach the participant about who was responsible for what operations:
Pat: Umm give me a hexagon.
Tester: (selects hexagon tool)
Pat: Umm move it and center it there (points to a spot)
Tester: You have to do that.
Pat: Oh ok. Um let's drag it.

Some participants asked a lot of questions and needed fuller explanations. Since the study was completely unscripted, the tester gave as little help as was necessary, but as much as was needed. Here are two examples:
Bob: Does it matter where I start?
Tester: Anywhere you want. You click it where you want the next one to be.
Bob: Ohh I got it.

Cloe: How come it's black?
Tester: That was the last color.
Cloe: Oh so it's going to choose the same color.

Sometimes the tester needed to prompt for additional information:
Cloe: Ok I need to pick a color.
Tester: What color?
Cloe: Black.

In analyzing the transcripts, we were particularly interested in finding examples of how users asked to change multiple parameters. Here is one of the more complex requests:
Sue: Then duplicate those and make a mirror image of it and flip it.
Tester: (duplicates the five selected objects, creates a mirror image, then flips the selection) Is that good enough?
Sue: That's great. Thanks.

More commonly, participants wanted to select a tool and a color in one move:
Jim: Ahh with the polygon tool can you give me yellow?

Tim: Now I need an entry way. Let's go with a black color and a rectangle.

This was not a combination of actions we had thought of prior to the study. Unfortunately, none of the participants made the sort of multi-parameter requests we had anticipated. For example, we expected participants to change an object's color, line style, or line color in a single move (e.g., "Make this blue with a thick red border"). This could be due to any number of factors. Most significantly, these types of requests may not be natural. Perhaps people think about modifying these parameters in a step-by-step manner and avoid the complexity of changing too many parameters at once. While this is certainly possible, it is also possible that all the participants were familiar enough with current graphics editing software that it did not occur to them that this sort of compound request was possible. They may have also been influenced by the structure of the graphical interface, which spread the functionality for these operations across menus and palettes.

Also notice in the last example that Tim refers to an object he is about to draw as an "entry way." Likewise, another participant declares:
Matt: All right. There's my beautiful tree.

Naming objects or groups of objects is a natural behavior for most people. Although there were not any specific examples in the pre-design study of participants referring back to named objects, the frequent use of names caused us to consider ways to incorporate names into the software design.

Software Design
The software for this application is only partially written. The current
implementation includes (Figure 4b - Screen shot of first iteration of
Multi-modal Drawing Editor. The user can ask for colors by the names that
appear in the Color Chooser window.):
In the next iteration of the design, we intend to add a range of new features. We plan to experiment with both generic and user-defined names, so that users can give commands such as:

"Make the entry way black."
"Bring the window to the front."

We also plan to make the interface more conversational by maintaining historical context based on speech, mouse, and keyboard input. This would allow the user, for example, to say:
We intend to investigate the utility of multi-parameter selections in more detail to determine if speech will provide significant enough added value over the purely graphical interface. Also, we plan to experiment with verbal feedback to complement the graphical feedback, and verbal help to provide first-time users with contextual hints about how to use each tool.

Lessons Learned

The Multi-modal Drawing application was fundamentally different from the other three applications in that the whole application could be controlled without speech. In this case, speech was intended as an add-on rather than a primary means of user input. We learned the following:
Discussion
Refining Application Requirements and Functionality

The natural dialog studies helped us to uncover unanticipated behaviors. For example, without the Calendar pre-design study, we probably would not have incorporated relative dates into the application design. This functionality was completely absent from the Calendar Manager GUI and never came up in our design discussions. The study pointed out how central relative dates were in the calendar-related conversations, and we ended up modifying our functional requirements and basing our design around their use.

Collecting Appropriate Vocabulary

Being able to select appropriate vocabulary is essential since out-of-vocabulary utterances are a common cause of recognition errors. As a designer, having to invent the vocabulary at the outset of the design process is quite difficult and error prone. Especially if a graphical application already exists, the temptation is to use the same vocabulary as in the GUI. We found in the SpeechActs Calendar study that the sales managers and their assistant almost never used any of the words from the Calendar Manager GUI, even though the assistant had the application open on her screen while she was talking on the phone. Speech user interfaces seem much more effective when they select the same vocabulary that humans use when talking about the domain. Both our experience with the Calendar pre-design study and the Electronic Mail user study indicate how problematic it can be to translate a graphical interface into speech.

Determining Commonly Used Grammatical Constructs

Grammatical constructs, both for prompts and for user input, are perhaps even more important than vocabulary. The Multi-modal Drawing study pointed out a command grouping that we had not anticipated (combined tool and color selection) and suggested a new feature idea (allowing users to refer to named objects). In addition, the Drawing study brought our attention to a portion of the design that might not be effective.
Users never voluntarily specified multiple color or line style parameters in a single utterance, leading us to focus on learning more about why this was the case. Both the Office Monitor and the Automated Rep studies provided many examples of users answering a question and proceeding with the next conversational move without a pause. These studies made it clear that, although technically challenging to implement, this sort of functionality would be important to the success of the final systems.

Discovering Effective Interaction Patterns

The pre-design studies were also extremely helpful in pointing to effective interaction techniques. In the Automated Rep study, we observed a number of different navigational strategies. Not everyone browsed the catalog hierarchically. By examining the natural dialogs, we were able to design a system that supported the different browsing strategies. In addition, we were able to clearly see the importance of maintaining context to streamline the ordering dialog. Users often indicated information such as gender, size, and color before explicitly asking to place an order. In the Drawing study, we gathered insights into the sort of verbal help that might be appropriate to integrate into a graphics editor. Not only did we observe times when help dialogs took place, but we also noted that they rarely occurred once a user was comfortable with the software. It is just as important to know when NOT to talk as it is to know when to talk.

Helping with Prompt and Feedback Design

Our prompt designs were also influenced significantly by our study participants. In both the Calendar and Automated Rep studies, we had the benefit of studying interactions with an expert: someone accustomed to performing the exact task we wanted the computer to perform. The Assistant in the Calendar study routinely managed others' calendars, and the Lands' End Rep was highly regarded by her employers for her ability to help customers over the telephone.
We were able to hear how the Rep handled difficult questions about color, size, and fit. In some cases, such as when describing colors, we decided we could codify her knowledge and offer color descriptions to users of the automated system. In other cases, however, such as in the discussions about size and fit, we came to the conclusion that the caller would have to be transferred to a live operator. These conversations involved too much real-world knowledge, grammatical variance, and unpredictable vocabulary for us to attempt a speech interface that supports them.

Getting a Feeling for the Tone of the Conversations

Although conversational "tone" is not something we can quantify, it is an important aspect of human-human communication. In the Automated Rep project, for example, the Lands' End management said it was a requirement for the computer to treat customers in a friendly, unhurried, low-pressure manner. Tone also turned out to be key in the Office Monitor study. In their natural setting, office visitors exhibited "abbreviated" politeness. The interactions were very brief (10 to 20 seconds). Although visitors used few words, their tone (and facial expressions) indicated that this brevity was not a sign of rudeness or displeasure; rather, it was simply appropriate given the social context.

Conclusion

These case studies are intended to demonstrate that extremely useful data can be collected without an elaborate setup or software development effort. Both the Office Monitor and the Drawing studies involved minimal preparation, no software development, and less than an hour of recorded data. With some creative thinking about the study design, it is usually easy to construct a setting for interaction that approximates the interaction that will occur in the software system.
While it is useful to think through the ways a tester will handle various situations (when a completely natural setting isn't being studied), it is also important not to script the dialogs. Let the tester's natural social inclinations drive the interaction, or set up a situation where the conversational partners are pairs of study participants. Avoiding scripts not only results in more natural dialogs, it also leaves room for unexpected interactions. In the final analysis, it is these unexpected events that often give the software its sparkle. In our SpeechActs user study, for example, several participants commented on our varied and tapered error dialogs, which were derived from data collected in the pre-design studies. One participant responded to this cooperative behavior by saying, "When you've made your request three times, it's actually nice that you don't have the exact same response. It gave me the perception that it's trying to understand what I'm saying."

As mentioned at the outset, these pre-design studies are not substitutes for usability testing. Since the natural dialogs and the eventual computer-human dialogs are quite different, it is still essential to test and refine the resulting system. Having user input from the start of the project, however, is likely to lead to an initial design that more closely matches natural human interaction patterns.

Acknowledgements

Each of the projects described in this chapter was a team effort. Thanks to members of all the design teams: Paul Martin, Eric Baatz, Stuart Adams, Gina-Anne Levow, Matt Marx, Cynthia McLain, Rick Crabbe, and Chris Long.

References