Published in the special section on spoken interfaces in Communications of the ACM (CACM), September 2000.

Mark Lucente

Conversational interfaces for e-commerce applications

These e-commerce shopping agents are knowledgeable, patient, and affable during their conversational interaction with shoppers - while helping generate sales for site owners.

Consumers spend billions online, though most are unhappy with their shopping experiences [2]. My colleagues and I at Soliloquy, Inc., are creating conversational natural language (NL) interfaces for e-commerce - software that appears on a web site as a sales "Expert" - conversing with the shopper, answering questions, and helping to find the perfect product to purchase. The metaphor is natural conversation - just like talking with a (human) salesperson. The shopper and the Expert co-produce knowledge and understanding through conversation - a shopping experience that is more effective, fun, and human than point-click-download, etc.

By constraining the knowledge domain to a particular product line (e.g., PCs, or restaurants, or mutual funds), software Experts can be more robust and helpful. Domain constraint facilitates several features:

  • (1) rich contextual understanding;
  • (2) robust speech recognition;
  • (3) high-quality text-to-speech synthesis;
  • (4) effective website navigation; and
  • (5) a viable business model based on the application of spoken interfaces.

The first two features benefit especially from the disambiguation made possible by the limited size of a particular domain: its concepts and concomitant words are far fewer than all the concepts and all the words of a human language (e.g., English). The last feature - profitability - must be emphasized: if spoken interfaces are to flourish and grow to meet popular expectation, they must have a "killer app" - an application that is both compelling and profitable.
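
To illustrate how a small domain can help disambiguation, the following minimal Python sketch picks, from several candidate transcriptions, the one whose words best fit a domain vocabulary. It is not a description of our recognizer: the vocabulary, the scoring rule, and the function names are all assumptions made for illustration.

```python
# Hypothetical domain vocabulary for a notebook-PC Expert (illustrative only).
DOMAIN_VOCAB = {
    "laptop", "notebook", "ram", "memory", "disk",
    "screen", "battery", "price", "light",
}

def domain_coverage(hypothesis):
    """Return the fraction of words in the hypothesis found in the domain vocabulary."""
    words = hypothesis.lower().split()
    if not words:
        return 0.0
    return sum(word in DOMAIN_VOCAB for word in words) / len(words)

def pick_hypothesis(n_best):
    """Choose the candidate with the highest domain coverage.

    Ties go to the earlier candidate, because max() keeps the first maximum.
    """
    return max(n_best, key=domain_coverage)

# Two acoustically similar candidates (invented for this example).
candidates = [
    "more ram and a lighter batter he",
    "more ram and a light battery",
]
best = pick_hypothesis(candidates)
```

Here the second candidate is chosen because three of its six words are domain words, against one of seven for the first. A real recognizer would combine such evidence with acoustic and language-model scores.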

Our online conversational Experts combine several technologies that are ripe for the picking: speech recognition, speech synthesis, natural language understanding, and the World Wide Web. Early use of our software Experts has yielded many insights from this confluence of technologies and has raised several important questions about the evolving online shopping experience. Particularly important are the multimodal nature of the interaction and the human's natural perception of the Expert's personality.

Shopping Expert

Our shopping Experts appear to the user (i.e., the shopper) as a small conversation window on an e-commerce webpage. (See Figure 1.) The human-like Expert (which sometimes includes a cartoon face) appears eager to help the shopper to understand what is for sale and to find the perfect items to purchase. The Expert begins the conversation by greeting the shopper and prompting for questions. The shopper speaks to the Expert or types into the window. The Expert responds in several ways:

  • a spoken response (if audio output is enabled);
  • a text response;
  • a hyperlinked multimedia response including images (such as some products for sale).

The Expert's main goal in a conversation is to help the shopper to find the best item to buy. The Expert's response is generated to satisfy several sub-goals:

  • answer the shopper's questions;
  • proffer items that the shopper may wish to purchase;
  • ask for clarification;
  • prompt the shopper for information that allows the Expert to refine the search.
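
One way to picture how the four sub-goals above might be chosen for a given turn is the hedged Python sketch below. The `Turn` and `Context` classes, the rule ordering, and the `max_to_show` threshold are hypothetical; the article does not specify how the Expert actually selects sub-goals.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """What was understood from the shopper's latest input (hypothetical fields)."""
    is_question: bool = False
    ambiguous: bool = False

@dataclass
class Context:
    """Running conversational context (hypothetical fields)."""
    matches: list = field(default_factory=list)  # products matching all constraints so far

def choose_subgoals(turn, context, max_to_show=3):
    """Map the current turn and context to an ordered list of sub-goals."""
    if turn.ambiguous:
        # Nothing useful can be said until the input is understood.
        return ["ask_for_clarification"]
    goals = []
    if turn.is_question:
        goals.append("answer_question")
    if 0 < len(context.matches) <= max_to_show:
        # Few enough matches to show them to the shopper.
        goals.append("proffer_items")
    else:
        # No matches, or too many: ask for information that refines the search.
        goals.append("prompt_for_information")
    return goals
```

For example, a clear question asked while two products match would yield `["answer_question", "proffer_items"]` under these assumed rules.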

The Expert uses a proprietary natural language understanding system built on top of a structured knowledge base. NL subsystems convert the incoming text (from the keyboard or from a speech recognizer) into conceptual representations, add these concepts to the conversational context, determine sub-goals, and finally motivate actions, e.g., constructing an answer, searching the product database, navigating the webpage, or prompting the shopper for more information. A conversation manager orchestrates the input and output while updating the context representation. The structured knowledge base contains several components:

  • functional representations of general concepts (relevant to e-commerce shopping) and their relationships to the words and phrases of a particular language (English for now);
  • functional representations of domain knowledge (e.g., the language of PCs);
  • a semantic representation of the essential structure and function of domain products and what they are used for (e.g., a laptop computer has RAM and at least one hard disk, or traveling necessitates a light laptop PC);
  • a database that lists what products are for sale and their features (e.g., price, speed, size, etc.) on a given e-commerce website.

Knowledge engineering facilitates a rich representation of context, allowing the conversation to be natural, nonlinear, and efficient [1]. Our object-oriented approach to conceptual representation makes it possible to quickly scale up the knowledge base of an Expert and create Experts in a multitude of domains.
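
The flow from words to concepts to context to a product search can be sketched in a few lines of Python. This is a toy under stated assumptions, not our proprietary system: the lexicon entries, the "travel implies light" rule encoding, the 2.0 kg threshold, and the product rows are all invented for illustration.

```python
import re

# Words and phrases -> concepts (hypothetical toy lexicon).
LEXICON = {"traveling": "use:travel", "travel": "use:travel", "light": "feature:light"}

# Semantic knowledge: what a use necessitates (e.g., traveling necessitates a light laptop).
IMPLIES = {"use:travel": "feature:light"}

# Feature concepts -> searchable constraints (the 2.0 kg threshold is an assumption).
CONSTRAINTS = {"feature:light": ("max_weight_kg", 2.0)}

# Products for sale and their features (invented sample rows).
PRODUCTS = [
    {"name": "Notebook A", "price": 1800, "weight_kg": 1.8},
    {"name": "Notebook B", "price": 2400, "weight_kg": 1.5},
    {"name": "Notebook C", "price": 1400, "weight_kg": 3.1},
]

def to_constraints(text):
    """Convert one input text into constraint concepts."""
    lowered = text.lower()
    found = {}
    for word in re.findall(r"[a-z]+", lowered):
        concept = LEXICON.get(word)
        concept = IMPLIES.get(concept, concept)  # apply semantic knowledge
        if concept in CONSTRAINTS:
            key, value = CONSTRAINTS[concept]
            found[key] = value
    price = re.search(r"under \$?(\d+)", lowered)
    if price:
        found["max_price"] = int(price.group(1))
    return found

def search(context):
    """Return names of products that satisfy every constraint in the context."""
    no_limit = float("inf")
    return [
        p["name"] for p in PRODUCTS
        if p["price"] <= context.get("max_price", no_limit)
        and p["weight_kg"] <= context.get("max_weight_kg", no_limit)
    ]

def converse(turns):
    """Accumulate constraints across turns, then search the product database."""
    context = {}
    for text in turns:
        context.update(to_constraints(text))
    return context, search(context)

context, names = converse(["I need something for traveling", "under $2000 please"])
```

Because the context persists across turns, the second input refines the search started by the first rather than replacing it, which is one simple reading of what a nonlinear, efficient conversation requires.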

Figure 1. Expert-based conversation window on an e-commerce Web page.

Inherently Multimodal

The interaction between a human user and our shopping Expert differs from human-human conversation in important ways that we have begun to understand and exploit. Because the Expert appears on a webpage, the shopper-Expert interaction is inherently multimodal. The Expert can sense the shopper's mouse and keyboard activity. The user sees images and hypertext generated on-the-fly. Furthermore, the Expert can offer graphical interface elements (e.g., buttons or hypertext words) to augment the conversation - analogous to a human-human conversation involving physical objects that are gestured toward or used iconically.

The multimodal interaction allows for rich conversation and contextual development that often exceeds that of (human-human) speech-only conversation. Although the Expert's NL understanding capabilities are not as advanced as those of a typical human, the presentation of multimedia responses (e.g., pictures of items for sale) facilitates complex conversation and delights users. For example, if a user asks to spend under $2000, this constraint is illustrated in a small summary table (see Figure), visible to the absent-minded user's quick glance, which reduces redundant or divergent exchanges. (This simple illustration is, of course, not possible in a speech-only conversation, e.g., over the telephone.) Sensing user selections (i.e., mouse position) allows for a multimodal interaction that is often much faster than speech-only interactions. For example, when the Expert responds with three pictures of products that match the user's requirements, the user can point to one picture while saying "How much does this one cost?"
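
The two examples in this section, the summary table and pointing while speaking, can be sketched as follows. This is a minimal Python illustration under assumptions: the function names, the way the hovered item is reported, and the table layout are hypothetical and do not describe the Expert's actual implementation.

```python
def resolve_deixis(utterance, hovered_item, shown_items):
    """Resolve "this"/"that" in an utterance to the item under the mouse pointer.

    Returns the hovered item if the utterance contains a demonstrative and the
    pointer is over one of the items currently shown; otherwise returns None.
    """
    words = utterance.lower().replace("?", "").split()
    has_demonstrative = "this" in words or "that" in words
    if has_demonstrative and hovered_item in shown_items:
        return hovered_item
    return None

def summary_table(constraints):
    """Render the shopper's current constraints as a small fixed-width text table."""
    return "\n".join(f"{name:<12}{value}" for name, value in constraints.items())

shown = ["Notebook A", "Notebook B", "Notebook C"]
referent = resolve_deixis("How much does this one cost?", "Notebook B", shown)
table = summary_table({"max price": "$2000"})
```

With the pointer over the second picture, the spoken question is resolved to "Notebook B" without the shopper having to name the product.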

Affable Personality

User testing shows that the shopper perceives the Expert as having a personality [3]. Verbal output contributes strongly to this personality, similar to human-human interaction (see Boyce's "Natural Spoken Dialogue Systems for Telephony Applications" in this section). (The prosody of the speech output contributes too, but this topic will be addressed in future publications.) We give the Experts personalities that are suited to their shopping function. We teach the Expert to use an adaptive blend of vernacular, humor, and down-to-business tone, creating a personality that is pleasant to most users but one that projects authority - rather like an affable college professor. The Expert's personality adjusts dynamically, and can be adjusted by the site owner.

Consider the Notebook Expert, an Expert that sits on a notebook PC website and conversationally helps the shopper to find the best one to buy. Notebook computers are expensive and personal, and shopping for one is fraught with anxiety, indecision, soul-searching, doubt, and confusion over specialized jargon and features. (No wonder so few shoppers become buyers on traditional click-and-download sites!) The perceived personality of the Expert helps the shopper through this process, increasing the rate at which shoppers become buyers. The Expert exhibits patience by remaining attentive - no matter how long the conversation lasts. The Expert is perceived to be honest and forthcoming, as it shows all the information requested and answers questions quickly and precisely. And the Expert adds a personal touch - remembering the shopper's name and particular requirements and engaging the shopper on topics that are central to the shopper's needs. All of this personality (and perceived intelligence) adds up to an improved shopping experience and increased sales. On one notebook PC shopping website, 30% of shoppers who converse with the Expert go on to purchase - compared with under 2% for shoppers who do not use the Expert.

A final and fascinating observation: our constant goal is to educate the Experts and extend their personality and intelligence. We believe that this is a wonderful trend - that of educating as well as engineering - as it allows teachers, psychologists, linguists, consultants and (human) experts from many fields to contribute to the creation of useful software that is a pleasure to engage.

References

[1]    Cullingford, R.E. Natural Language Processing: A Knowledge-Engineering Approach. Rowman & Littlefield, Totowa, NJ, 1986.

[2]    Manning, H. The right way to test ease-of-use. In The Forrester Brief. Forrester Research, Inc., Cambridge, Mass., 1999.

[3]    Nass, C., and Lee, K.M. Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Proceedings of the CHI 2000 Conference on Human Factors in Computing Systems (The Hague, The Netherlands, April 1-6, 2000), 329-336.

Mark Lucente
Soliloquy, Inc.