
Hello, What Do You Do?
Natural Language Interaction with Intelligent Environments
Gregg Vanderheiden
Gottfried Zimmermann
Karen Blaedow
Shari Trewin
Vanderheiden, G., Zimmermann, G., Blaedow, K., Trewin, S. (2005). Hello, What Do You Do? Natural Language Interaction with Intelligent Environments. Proceedings of the 2005 HCII Las Vegas Conference.
Abstract
This paper reviews the state of the art in natural language interaction with complex products such as home appliances. Natural language interaction is now approaching the point where we can begin to look at its practical application in products. It has moved from a state where the user had to speak in the language of the device to one where the device is beginning to adapt to the speech patterns of the user, including adapting to unusual command patterns on the fly. As this happens, it can greatly change our ability to create simple interfaces to complex devices for everyone, including individuals who would have difficulty with even moderately complex interfaces. With the introduction of remote control capabilities, the potential exists to retain the original interface built into products and to add natural language interface capabilities that individuals would carry with them and bring to products.
1. Introduction
With the advance in technologies, and the drop in the cost for processing and memory, products are increasingly becoming multifunctional and feature laden. This trend is prevalent for everything from copiers (which are now copier, printer and fax data centers) to cell phones (which are now phone, PDA, reminder, and smartcards). Even the Apple iPod is now a music player, alarm clock, address book, photo repository and calendar. Although customers continually complain of the complexity of devices, they also show a marked preference for purchasing products with longer feature lists, thus ensuring the trend toward more complex feature laden devices.
With traditional interface techniques this approach leads to the development of complex hierarchical user interfaces. Even individuals who use only a few features on a product have to deal with interfaces displaying the full functionality. Layering techniques can be used to bring common features to the surface and put other features underneath menus, tabs or other interface layering strategies. However, the first time a user needs any feature other than these first-level ones, they end up having to explore the layers of complexity underneath. Often, the mental model of the designer is not that of the user, leaving users mystified as to the organization of the controls and unable to re-find controls they found in previous excursions.
In addition, the organization of the user interface is different for different manufacturers (or even different models), even if they basically fulfill the same function. Therefore, it doesn't help if a user has already used a similar product. On the contrary, this may even lead to more confusion when dealing with a new product.
Natural language interfaces provide a powerful new tool for addressing these problems of complexity and inconsistency. They allow a user to access any feature in a product without having to see features they are not interested in, and to reach even obscure controls directly without ever having to look for them. When natural language interfaces include intelligent agents, users can simply ask for the result they want and skip the work of searching through controls to figure out what each does and which ones would give them the desired result.
As long as users know what they desire, they can simply ask for it. If the device cannot provide the function, it could inform the user or ask them to describe the result in other words (which it might recognise as matching its capabilities). Once it determines what the person is asking for, it can also adapt itself to accept other descriptors for actions, making it easier for this and other users to control the product.
To learn new functions, an individual can also ask the product what it is able to do. Most often, however, the user knows what they would like to have happen and is not particularly interested in what capabilities a product may have.
Note: This paper focuses on the use of natural language for control of devices in a user's environment. The natural language interface could be built into the product, or it could be an external controller linked to the product using a remote control interface such as the ANSI/INCITS (V2) URC standard [Vanderheiden & Zimmermann 2005].
2. Overview of Speech-Based Interaction Systems
Most of the early research and development efforts on speech-based interfaces focused on how humans can interact with complex computer-based applications such as databases and expert systems [McTear 2002]. This work started in the 1950s with Artificial Intelligence research on conversational interfaces. Only recently, with the advent of ubiquitous computing, has natural spoken language been used as a means to access and control devices and services in a person's environment.
[McTear 2002] makes a technically-oriented differentiation of three kinds of spoken dialog systems for information retrieval: Finite state-based systems take the user through predetermined steps or states in a dialog, with the system being in control of the dialog. Typically the user answers the system's questions with a single word or concept. Frame-based systems ask the user questions with the goal of filling "information slots" in a template in order to provide certain information, for example train timetable information. Frame-based systems allow the user to provide more than one piece of information at a time. The most advanced spoken dialog systems are agent-based systems, allowing for complex communication between user and system, with both being able to take the initiative ("mixed initiative").
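The frame-based approach can be illustrated with a minimal sketch. The slot names, keyword-spotting patterns and prompts below are hypothetical, chosen only to mirror the train timetable example; they are not taken from [McTear 2002] or from any deployed system, and a real system would use a speech recognizer and a far richer grammar.

```python
import re

# Hypothetical frame for a train timetable query: each slot has a pattern
# that spots its value in an utterance and a prompt used to ask for it.
SLOTS = {
    "origin": (re.compile(r"\bfrom (\w+)"), "Where are you leaving from?"),
    "destination": (re.compile(r"\bto (\w+)"), "Where are you going?"),
    "day": (re.compile(r"\bon (\w+day)\b"), "What day do you want to travel?"),
}

def fill_slots(frame, utterance):
    """Fill every empty slot whose pattern matches; one utterance may fill several."""
    text = utterance.lower()
    for name, (pattern, _prompt) in SLOTS.items():
        if frame.get(name) is None:
            match = pattern.search(text)
            if match:
                frame[name] = match.group(1)
    return frame

def next_prompt(frame):
    """Return the question for the first unfilled slot, or None when the frame is complete."""
    for name, (_pattern, prompt) in SLOTS.items():
        if frame.get(name) is None:
            return prompt
    return None

frame = fill_slots({}, "A ticket from Boston to London")
print(frame)               # two slots filled by a single utterance
print(next_prompt(frame))  # the system only needs to ask for the day
```

Because the frame records which slots are still empty, the system asks only for what is missing, which is the property that distinguishes this style from a strictly finite state-based dialog.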
While natural language understanding technology can be considered immature even after decades of research, it is already being employed in commercial services [Dahl et al. 2000]. In this paper, we focus on the use of natural language technology for controlling devices and services in a person's environment ("intelligent environment").
Rather than following a taxonomy of speech-based interfaces that is motivated by architectural aspects, in the remainder of this section we will categorize existing systems in terms of how much they constrain a user in their communication with the system. For each category, we will roughly assess a system's applicability for environment control, i.e. how suitable a system is for providing access to and control over devices and services in a user's environment, especially with regard to home appliances.
2.1 Command-and-Control Languages
Command-and-control languages are well-known from operating systems such as Unix or DOS where they are typed into a text console. In general, this interaction pattern can also be used for speech interfaces. Commands may be spoken at any time – the user is in full control of the dialog. The language of syntactically permitted utterances (commands) is designed to allow for precise and terse requests with no ambiguity.
For example, imagine a command-and-control language that was designed for controlling audio-visual entertainment devices at home. In this language commands would be made up of a fixed syntax pattern: "device function argument" (with device, function and argument being placeholders). To turn the TV on, one would say: "TV power on". Similarly, "VCR channel 5" would switch the VCR to channel 5, and "VCR record on" would start recording. Any user utterance that does not match the "device function argument" pattern would be rejected as error, for example: "Power on TV" or "Switch to the weather channel".
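The rigidity of such a language is easy to see in a minimal sketch of a parser for the "device function argument" pattern. The vocabulary table and the function name `parse_command` are assumptions made for illustration only.

```python
# Hypothetical vocabulary for the "device function argument" pattern.
CHANNELS = {str(n) for n in range(1, 100)}
GRAMMAR = {
    "tv": {"power": {"on", "off"}, "channel": CHANNELS},
    "vcr": {"power": {"on", "off"}, "channel": CHANNELS, "record": {"on", "off"}},
}

def parse_command(utterance):
    """Return (device, function, argument), or None if the utterance is rejected."""
    words = utterance.lower().split()
    if len(words) != 3:
        return None  # wrong number of words: cannot match the pattern
    device, function, argument = words
    if argument in GRAMMAR.get(device, {}).get(function, set()):
        return device, function, argument
    return None  # unknown device, function or argument, or wrong word order

for text in ["TV power on", "VCR channel 5", "Power on TV",
             "Switch to the weather channel"]:
    print(f"{text!r}: {parse_command(text)}")
```

The first two utterances are accepted, while the last two are rejected even though a human listener would understand them without effort.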
Command languages are typically domain-specific and designed for experts in that domain, for whom they are highly efficient. But because they severely constrain a user's utterances, they demand significant learning effort and technical affinity from the user. Command-and-control languages are therefore poorly suited for wide-spread use in controlling home appliances.
2.2 Directed Dialog Systems
With directed dialog systems, the system is in control of the dialog and prompts the user in order to obtain the required information for a task. Directed dialog systems are specific to a particular domain and its tasks. Interactive Voice Response (IVR) systems for telephone portals are prominent representatives of this type of speech-based interaction system.
The following is an example of a basic dialog with a phone portal for a flight reservation service (from [McTear 2002]):
System: What is your destination?
User: London.
System: Was that London?
User: Yes.
System: What day do you want to travel?
User: Friday.
System: Was that Sunday?
User: No.
System: What day do you want to travel?
...
Some of the various directed dialog systems go beyond a strict system-controlled dialog toward a mixed initiative dialog. Recently, the World Wide Web Consortium standardized VoiceXML 2.0 [VoiceXML 2004], an XML-based description language for audio dialogs (such as used by IVR systems). The core of such a dialog description is a state model that lets the user make choices and provide chunks of information to a VoiceXML interpreter which is connected to a backend application. Various VoiceXML based tools and runtime platforms are commercially available and in use for telephone portals to services such as flight booking and electronic banking services.
One of the main advantages of directed dialog systems is their robustness against recognition errors. Since the user provides information piece by piece, the system can easily verify the user's input. However, dialogs can get quite meandering, in particular when recognition errors occur, as can be seen in the example dialog above.
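The ask-and-confirm loop behind the flight reservation dialog can be sketched as follows. This is an illustrative reconstruction, not VoiceXML and not the system described in [McTear 2002]; the function name and the scripted list of recognizer outputs, which stands in for real speech input, are assumptions.

```python
def directed_dialog(questions, recognized):
    """Run a system-led dialog: ask, confirm, and ask again if the user rejects.

    `recognized` is an iterator of recognizer outputs, a hypothetical
    stand-in for speech input so that the sketch runs without audio.
    Returns the confirmed answers and the list of system prompts issued.
    """
    answers, prompts = {}, []
    for slot, question in questions:
        while slot not in answers:
            prompts.append(question)
            heard = next(recognized)
            prompts.append(f"Was that {heard}?")
            if next(recognized).lower() == "yes":
                answers[slot] = heard  # only confirmed input is kept
    return answers, prompts

questions = [("destination", "What is your destination?"),
             ("day", "What day do you want to travel?")]
# Scripted recognizer outputs: "Friday" is first misrecognized as "Sunday".
script = iter(["London", "Yes", "Sunday", "No", "Friday", "Yes"])
answers, prompts = directed_dialog(questions, script)
print(answers)
print(len(prompts), "system prompts were needed")
```

The confirmation step is what makes the dialog robust, and it is also what makes it long: a single misrecognition doubles the number of prompts needed for that piece of information.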
Altogether, directed dialog systems are not a good choice for controlling an intelligent environment, since these systems are specifically designed for a particular domain and conduct a mostly system-initiated dialog. For controlling a user's environment, we want a system that is applicable across domains, and we want the user to be in full control of the dialog.
2.3 Free Natural Language Interaction Systems
Free natural language interaction systems allow users to converse with them as they would with other humans, allowing for linguistic discourses that include references to parts of previous utterances. The user is completely in control of the dialog.
While natural language understanding technology has been thoroughly explored for accessing databases [Androutsopoulos 1995], it has been applied to controlling services and devices in a user's environment only recently. One obvious approach is to draw on experience in using natural language to query databases. The EXACT system [Yates et al. 2003] provides a conversational interface to appliances by mapping a user's request to SQL queries, thus reducing the appliance control problem to a database query problem.
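To make the idea of reducing appliance control to database queries concrete, here is a toy sketch. It is not the EXACT algorithm: the `state` table, the two request patterns and the function names are all assumptions for illustration, and a real system would need genuine language understanding rather than two regular expressions.

```python
import re
import sqlite3

# Hypothetical appliance state held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (device TEXT, attribute TEXT, value TEXT, "
             "PRIMARY KEY (device, attribute))")
conn.executemany("INSERT INTO state VALUES (?, ?, ?)",
                 [("thermostat", "temperature", "68"), ("vcr", "channel", "3")])

SET_RE = re.compile(r"set the (\w+) (\w+) to (\w+)")
GET_RE = re.compile(r"what is the (\w+) (\w+)")

def to_sql(request):
    """Translate a request into (sql, parameters), or None if not understood."""
    text = request.lower()
    match = SET_RE.search(text)
    if match:
        device, attribute, value = match.groups()
        return ("UPDATE state SET value = ? WHERE device = ? AND attribute = ?",
                (value, device, attribute))
    match = GET_RE.search(text)
    if match:
        return ("SELECT value FROM state WHERE device = ? AND attribute = ?",
                match.groups())
    return None

def handle(request):
    """Run the translated query; return the selected value, if any."""
    query = to_sql(request)
    if query is None:
        return None
    rows = conn.execute(*query).fetchall()
    return rows[0][0] if rows else None

handle("Set the thermostat temperature to 70")
print(handle("What is the thermostat temperature?"))
```

Once a request has been translated, changing or reading an appliance setting becomes an ordinary UPDATE or SELECT, so the hard part of the problem is confined to the translation step.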
Other systems using natural language for controlling household appliances include the D'Homme project [Quesada et al. 2001] and a commercial system offered by Linguamatics [Linguamatics 2005]. However, these systems tend to constrain the permitted user utterances to a subset of natural language which improves recognition accuracy but places the burden of learning a new language on the user.
Below is an example dialog from the BUCKS system which provides a conversational interface to personal information management [Paraiso 2004]:
User: Could you list all conferences with deadline in February?
System: Agents et Langue: deadline 7, feb, 2004; International Conference on AI Application and Innovations: deadline 16, feb, 2004; European Conference on Artificial Intelligence – 2004: deadline 18, feb, 2004;
User: I need the email of Jean-Paul Barthès.
System: Jean-Paul Barthès, Professor at UTC-France, branch: 1234, email: barthes@utc.fr
User: I would like to send an email to Jean-Paul Barthès.
System: What is the subject?, …
In general, free natural language interaction is well suited for home control, with the user initiating the dialog. However, the relatively poor accuracy of speech recognition remains an obstacle to the wide-spread use of natural language interaction in intelligent environments, especially in noisy environments and with speaker independent recognition. Another problem is the resolution of language ambiguities in a user's utterance. In order to improve a system's reliability in determining a user's goal, some systems, so-called "multimodal speech recognition systems", include input from other modalities such as pen gesture, eye gaze, images of lip movements or data from sensors in a room.
MIT's Intelligent Room [Coen 1999] uses a vision system that keeps track of a user's movements and pointing actions. The CASIS system [Leong 2005] uses device states, brightness, speaker location, chair occupancy, speech direction and action history as context information for controlling devices in a living room. In XWand [Wilson & Shafer 2003], a user can point a wand at devices in their environment and control them by natural language utterances. In the Ubiquitous Talker system [Nagao & Rekimoto 1995], the user can address particular devices in a similar way: a PDA's camera reads the bar code of the device that the user points the PDA at, and this information is used in interpreting the user's speech.
2.4 Speech Graffiti
The Universal Speech Interface (USI) project [Tomko & Rosenfeld, 2004] recognizes that speech-based interfaces using free natural language utterances are not mature enough to be adopted by the masses today. Noisy environments and linguistic variability often lead to recognition and interpretation errors, a frustrating experience for the user.
"Speech Graffiti", a structured interaction protocol, is designed to overcome these problems with an approach that takes a middle path between unconstrained natural language interfaces and directed dialog systems. Speech Graffiti defines cross-application input structures, output patterns and keywords that govern the interaction between a user and a service. The idea is that, once a user has learned the basic constructs, they can easily interact with any Speech Graffiti enabled application.
Here is a sample Speech Graffiti dialog (from [Tomko & Rosenfeld, 2004]):
User: Theater is the Manor. Title is Tuck Everlasting.
System: THE CINEMAGIC MANOR THEATRE, TUCK EVERLASTING.
User: Options.
System: SHOW TIME, DAY, GENRE, {…}.
User: What are the show times?
System: 2 MATCHES: 1:25, 5:30.
User: Start over.
System: STARTING OVER.
User: Area is Monroeville.
System: MONROEVILLE.
User: What are the theaters?
System: SHOWCASE CINEMAS PITTSBURGH EAST.
User: Theater is Showcase Cinemas Pittsburgh East.
System: SHOWCASE CINEMAS PITTSBURGH EAST.
User: What are the titles?
System: 11 MATCHES, BLUE CRUSH, ABANDON, BROWN SUGAR, {…}.
User: Genre is comedy.
System: COMEDY.
User: Where am I?
System: AREA IS MONROEVILLE, THEATER IS SHOWCASE CINEMAS PITTSBURGH EAST, WHAT ARE THE TITLES?, GENRE IS COMEDY.
User: What are the titles?
System: 5 MATCHES, MY BIG FAT GREEK WEDDING, BROWN SUGAR, JONAH - A VEGGIETALES MOVIE, {…}.
Interestingly, in a study reported by [Tomko & Rosenfeld, 2004], 74% of users preferred the Speech Graffiti system to a natural language interface in the same domain. However, task completion rates were roughly equal.
In environments where speech recognition accuracy is poor, Speech Graffiti is an interesting alternative to unconstrained use of speech input. However, dialogs such as the example above are sometimes awkward and unnatural, and add cognitive burden to the user where devices at home and their functions are already complex enough.
2.5 Intelligent Referencing
The systems above have some of the characteristics of a natural language system suited for device control but are missing others. Speech Graffiti and the multimodal systems provide the user with convenient ways to identify the objects within the environment to be controlled. In the BUCKS example, on the other hand, the user must repeat the name of Jean-Paul Barthès in order to email him, even though they have just asked for his email address.
Ideally a system for device control should maintain a model of the specific devices and other objects in the user's domain, which would be instances of device types the system knows about. This model should be extensible by the user. This would enable scenarios such as the one below:
The kid's TV is in the family room.
Turn on Sesame Street on the kid's TV.
And of course it should be able to use pronouns and other shortcuts to refer to previously mentioned objects:
Turn it down.
Intelligent referencing within a conversational context is still considered an area of research. Babble [Blaedow 2005] is a system that constructs a complete semantic model of the interaction with the user. The user is able to establish a referent, the kid's TV in the example above, and subsequently refer to it with a pronoun or other shortened reference.
A harder future goal is to handle a sequence such as:
Turn on As The World Turns.
Turn down the lights.
Louder.
To handle this example the system must identify the most recently mentioned referent that can be made louder. Note the device that will ultimately receive the volume command is never explicitly mentioned. Babble can store this type of information in a semantic structure that can be used across domains.
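One simple way to approximate this behaviour is to keep a history of mentioned objects and search it backwards for the first one that supports the requested action. The sketch below illustrates that idea only; it does not describe how Babble represents its semantic model, and the capability table and class name are hypothetical.

```python
# Hypothetical capability table: which adjustments each kind of object supports.
CAPABILITIES = {
    "tv": {"power", "volume", "channel"},
    "lights": {"power", "brightness"},
}

class DiscourseContext:
    """Remembers mentioned objects so that later commands can omit them."""

    def __init__(self):
        self.mentions = []  # oldest first, most recent last

    def mention(self, referent):
        self.mentions.append(referent)

    def resolve(self, capability):
        """Return the most recently mentioned referent that supports the capability."""
        for referent in reversed(self.mentions):
            if capability in CAPABILITIES[referent]:
                return referent
        return None

context = DiscourseContext()
context.mention("tv")      # "Turn on As The World Turns." implies a TV
context.mention("lights")  # "Turn down the lights."
print(context.resolve("volume"))  # "Louder." resolves to: tv
```

The lights are the most recent referent, but they cannot be made louder, so the search continues back to the TV. A capability that both objects share, such as power, would resolve to the lights under this rule, which may or may not be what the user intended.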
2.6 Ask What Not How
Another characteristic a device control system should have is the ability for the user to tell the system what to do without having to provide (or know) the steps to achieve it. For example it was not necessary to tell the BUCKS system to open a particular mail program in order to have it email Jean-Paul Barthès.
In order to have Babble record a TV show the user provides just the name of the program. A less friendly system might require the user to specify the exact parameters required by the recording device, such as start time, length, channel, etc.
A sample of Babble interaction after being fed a Television Program Guide is as follows:
User: When is Frasier on?
Babble: Thursday, January 6 2005 at 11:36 pm
User: When is Chinatown on?
Babble: There are 2: Saturday, January 8 2005 at 7:00 pm and Sunday, January 9 2005 at 1:00 am
User: Record Chinatown at 1:00
Babble: Recording Chinatown Sunday, January 9 2005 at 1:00 am on BRAVO
User: What time is King Kong?
Babble: 7:00 pm
User: Record it.
Babble: Recording King Kong Thursday, January 6 2005 at 7:00 pm on Turner Classic Movies
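The "what, not how" principle in this exchange can be sketched as a lookup that turns a program title into the parameters a recorder needs. The guide entries and function names below are hypothetical (the channel of the Saturday showing in particular is an assumption), and the sketch is not a description of Babble's implementation.

```python
from datetime import datetime

# Hypothetical program guide entries: (title, start time, channel).
GUIDE = [
    ("King Kong", datetime(2005, 1, 6, 19, 0), "Turner Classic Movies"),
    ("Chinatown", datetime(2005, 1, 8, 19, 0), "BRAVO"),
    ("Chinatown", datetime(2005, 1, 9, 1, 0), "BRAVO"),
]

def find_showings(title, hour=None):
    """Return (start, channel) for each showing of a title, optionally at a given hour."""
    return [(start, channel) for name, start, channel in GUIDE
            if name.lower() == title.lower()
            and (hour is None or start.hour == hour)]

def record(title, hour=None):
    """Answer a 'record <title>' request without asking for time or channel."""
    showings = find_showings(title, hour)
    if not showings:
        return f"No showing of {title} found"
    if len(showings) > 1:
        return f"There are {len(showings)} showings of {title}; which one?"
    start, channel = showings[0]
    when = start.strftime("%A, %B %d %Y at %I:%M %p")
    return f"Recording {title} {when} on {channel}"

print(record("King Kong"))
print(record("Chinatown"))          # ambiguous: the system must ask
print(record("Chinatown", hour=1))  # the hour disambiguates
```

The user supplies only the title, and the system derives the start time and channel itself, asking a question only when the guide leaves more than one possibility.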
In Babble, the same basic semantic model used to represent the user's interaction can also model the functions of a device or the result of an action, allowing Babble to intelligently resolve how to accomplish a task. The next section will present an even more compelling example of an interface that specifies the desired result.
While we have illustrated the desirability of interfaces that do not require users to explicitly specify all the referents involved in their requests, and that give them latitude in specifying what to do rather than how, it is also true that this creates ambiguity that the system must resolve. Consider the ambiguity if the last command in the "Louder" example above were "Off" instead of "Louder". "Turn on As The World Turns" might also be ambiguous if there are multiple TVs in the environment and several sources (recording, live, pay per view, etc.) of a program.
The next section proposes some strategies for dealing with such ambiguities. This is an area we will be exploring using Babble.
3. An Illustration of a Direct Control versus a potential Natural Language Interface
Figure 1 shows an interface for an IBM multifunction copier/printer/fax. This type of multifunction document center is increasingly common in work places. The user interfaces on the various document centers vary, with some interfaces making one task easy but other tasks more difficult. Each has certain copy tasks which are presented on its "top" layer with other features layered below. They also vary in the size of their top layer with some having a rather simple upper layer with most features underneath and others, such as this copier, having a large number of directly accessible functions.

Try the following four tasks on the direct manipulation interface above:
- Creating Single Sided Copy from a Double Sided Original
- Copy Single Sided Pages into a Double sided saddle stitched Booklet
- Original Pages from a Book, Copy them to Single Sided Pages
- Copy Selected Pages from a Book into a Booklet
The steps are many and complex and require a familiarity with the copier and perhaps use of the manual.
Compare this against a natural language interface for the same tasks.
- A Double Sided Original, Single Sided Copy
Natural language: "Copy this single sided" (for a copier that autosenses)
Or, if the copier does not sense whether the original is single or double sided:
"Copy this double sided page to single sided copies" or
"Make single sided copies from these double sided pages"
- Copy Single Sided Pages into a Double sided saddle stitched Booklet
Natural language: "Copy these as a booklet, stapled down the middle"
- Original Pages from a Book, Copy to Single Sided Pages
Natural language: "Copy these pages from a book onto single sided pages"
- Copy Selected Pages from a Book into a Booklet
Natural language: "Copy these pages from the book into a booklet"
Even if speech recognition were questionable, it would be much easier to type the command in than to figure out how to do it on the control panel. Fortunately, with directional microphones and a domain-constrained application such as this, the accuracy of speaker independent speech recognition can be quite high. With Babble-like interfaces it is also possible to have exchanges in which individuals use their own terminology and the system interacts with the user to disambiguate. If the user asks for many copies, the copier can also make a best guess and produce a single confirmation copy first.
Although beyond the present capabilities of Babble or any existing natural language engine, the following is a target for how this could work in the not too distant future.
For example:
User: Please make 50 booklets out of the following pages…
Copier (after scanning in the pages): "You would like me to make 50 ten page booklets out of the following 20 single sided pieces of paper. Is this correct?"
User: Yes, that sounds right.
Copier: "Would you like me to print one booklet as a test to be sure I understand you correctly?"
User: Yes, that would be nice.
Copier (after printing one copy of booklet): "Is this correct?"
User: You printed the first page on the back of the cover. The back of the cover should be blank.
Copier: "I do not understand. I have printed another copy with numbers in the middle of each page. Which page did you expect to be blank?"
User: Page 2 should be blank and page 2 should be on page 3.
Copier: "Ok. Let me print another sample."
Copier (after printing new booklet): "Is it right this time?"
User: Yes, this is perfect.
Copier: "Shall I print 50 copies like the last sample I printed?"
User: Yes.
In this case, an inexperienced user was able to get a double sided booklet printed without understanding that the copier would normally expect them to insert a blank page in order to get the back of the cover to be blank. It also allowed them to get the proper output without understanding that they needed "Saddle Stitched" binding, or that they did not want book copying, which is in fact the setting for copying from a book, not to a book.
3.1 Value of Verbal Confirmation
Using verbal interaction (via speech or typing) can also provide a valuable means for avoiding complex errors. During usability tests on a different but similar copy center, multiple users ended up with large numbers of copies of a page they were attempting to fax. The difficulty came in thinking they were entering a phone number in the fax routine, when in fact they were entering the number of copies. In both cases, they thought the printing that was happening was actually someone else's print job that was occurring while they were doing a fax. Had the interface followed a conversational interaction model, these users would quickly have realized their mistake when the device confirmed a very large number of copies rather than confirming that it was about to send a fax.
3.2 Alternative, not Substitute
It is important to note that such natural language interfaces should not be thought of as substitutes for the current direct interface. They should instead be an alternative. There are some individuals who are adept and familiar with the copiers who would be much faster with, and would prefer, a direct manipulation interface. There are others who would prefer the verbal interaction style. The majority, however, are likely to use a direct manipulation interface for simple jobs and rely on a verbal exchange model when they need features or functions that are more complex or unfamiliar, or when they are interacting with a different machine than they are used to.
3.3 Importance of Recovery Mechanisms
The success of any natural language interface, however, will depend heavily on its ability to understand the user's commands and to detect and repair misunderstandings or poorly formed commands. With the advance of newer, much more flexible and self-learning natural language interface technologies, we may soon have flexible, adaptable natural language systems that can significantly simplify interactions with the increasingly complex devices we find at work, in the community and at home. This will be a delightful development for the large portion of the population who find they are increasingly unable to understand and use the products they have in their homes. It will also be increasingly important to a society interested in keeping its aging population living independently longer.
3.4 Acknowledgment
This work was partially funded by the National Institute on Disability and Rehabilitation Research, US Dept of Education under Grant H133E030012 as part of the Universal Interface and Information Technology Access Rehabilitation Engineering Research Center of the University of Wisconsin - Trace Center. The opinions herein are those of the authors and not necessarily those of the funding agency.
References
[Androutsopoulos 1995] Androutsopoulos, I.; Ritchie, G.D.; and Thanish, P. Natural Language Interfaces to Databases – An Introduction. Natural Language Engineering, vol 1, part 1, 29-81, 1995.
[Blaedow 2005] Babble, the Meaning of Meaning Custom Technology Ltd http://www.customtechnologyltd.com/docs/babblemom.doc
[Coen 1999] M. Coen, L. Weisman, K. Thomas, and M. Groh, A Context Sensitive Natural Language Modality for an Intelligent Room, In 1st International Workshop on Managing Interactions in Smart Environments (MANSE'99), pp.68-79. Dublin, Ireland, December 1999.
[Dahl et al. 2000] Dahl, D.A., Norton, L.M., & Scholz, K.W. Commercialization of natural language processing technology. Communications of the ACM, Vol. 43, Issue 11es (Nov 2000). ACM Press, New York, NY, USA.
[Leong 2005] Leong, L.H.; Kobayashi, S.; Koshizuka, N.; Sakamura, K. CASIS: a context-aware speech interface system. Proceedings of the 10th international conference on Intelligent user interfaces, San Diego, California, USA, 2005, pp. 231-238. ACM Press, New York, NY, USA.
[Linguamatics 2003] http://www.linguamatics.com/technology/dialogue/home.html
[McTear 2002] McTear, M.F. Spoken dialogue technology: enabling the conversational user interface. ACM Computing Surveys, Vol. 34, Issue 1 (Mar 2002), pp. 90-169. ACM Press, New York, NY, USA.
[Nagao & Rekimoto 1995] Nagao, K.; Rekimoto, J. Ubiquitous Talker: Spoken Language Interaction with Real World Objects, In Proceedings of the International Joint Conference on Artificial Intelligence, 1995.
[Paraiso 2004] Paraiso, E.; Barthès, J.-P. Architecture d'une interface conversationnelle pour les agents assistants personnels. AGENTAL «Agents et Langue», Journée ATALA, Paris, 13 mars 2004. http://www.hds.utc.fr/~eparaiso/ArtigosPublicados/AgentsLangue2004.PDF.
[Quesada et al. 2001] Quesada, J.F.; Garcia, F.; Sena, E.; Bernal, J.A.; Amores, G. Dialogue Managements in a Home Machine Environment: Linguistic Components over an Agent Architecture. SEPLN, 89-98. 2001.
[Tomko & Rosenfeld, 2004] Tomko, S. & Rosenfeld, R. Speech Graffiti vs. Natural Language: Assessing the User Experience. Proc. HLT/NAACL, Boston, MA, 2004. http://www-2.cs.cmu.edu/~usi/papers/HLT04.pdf.
[Vanderheiden & Zimmermann 2005] Vanderheiden, G.; Zimmermann, G. Use of User Interface Sockets to Create Naturally Evolving Intelligent Environments. 3rd International Conference on Universal Access in Human-Computer Interaction (UAHCI 2005), Las Vegas, Nevada, USA, 22-27 July 2005.
[VoiceXML 2004] Voice Extensible Markup Language (VoiceXML) Version 2.0, W3C Recommendation 16 March 2004. http://www.w3.org/TR/2004/REC-voicexml20-20040316/.
[Wilson & Shafer 2003] Wilson, A.; Shafer, S. XWand: UI for Intelligent Spaces, In Proceedings of SIGCHI 2003, pp 545-552, 2003.
[Yates et al. 2003] Yates, A., Etzioni, O., & Weld, D. A reliable natural language interface to household appliances. Proceedings of the 8th international conference on Intelligent user interfaces, Miami, Florida, USA, 2003, pp. 189-196. ACM Press, New York, NY, USA.