Virtual assistants are increasingly integrated into our daily routines. They can help with everything from setting alarms to giving map directions, and can even help people with disabilities manage their homes more easily. As we use these assistants, we are also becoming more accustomed to using natural language to accomplish tasks we once did by hand.
One of the biggest challenges in building a robust virtual assistant is identifying what a user wants and what information is needed to perform the task at hand. In the natural language processing (NLP) literature, this is mainly framed as task-oriented dialogue parsing, where a system must parse a dialogue to understand the user's intent and carry out the operation that fulfills it. While the academic community has made progress on task-oriented dialogue thanks to purpose-built datasets such as MultiWOZ, TOP, and SMCalFlow, that progress is limited because these datasets lack the speech phenomena that models must handle in real use. The resulting models often underperform, leading to dissatisfaction with assistant interactions. Relevant speech patterns include revisions, disfluencies, code-mixing, and the use of structured context from the user's environment, such as the user's notes, smart home devices, and contact lists.
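To make the target of this parsing task concrete, here is a minimal, hypothetical sketch of the kind of structured output a dialogue parser produces. The intent and argument names are illustrative and not tied to any particular dataset's schema.

```python
# A hypothetical task-oriented parse: the parser maps a natural-language
# utterance to a structured intent with typed arguments. The intent and
# argument names below are illustrative only.

utterance = "Set an alarm for 7 am tomorrow"

parse = {
    "intent": "CreateAlarm",
    "arguments": {
        "time": "7:00 AM",
        "date": "tomorrow",
    },
}

# The assistant executes the parse rather than the raw text, so any misparse
# (wrong intent or wrong argument) surfaces directly as a wrong action.
```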
Consider the following dialogue, which illustrates a common case in which a user needs to revise their utterance:
A dialogue conversation with a virtual assistant that includes a user revision.
The virtual assistant misunderstands the request and attempts to call the incorrect contact, so the user has to revise their utterance to fix the assistant's mistake. To parse the last utterance correctly, the assistant would also need to interpret the user's structured context; in this case, it would need to know that the user has a contact list saved on their phone that it should reference.
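Below is a hedged sketch of how such a revision might be resolved against structured context. The schema, contact names, and helper logic are hypothetical and are not PRESTO's actual format.

```python
# Hypothetical example: resolving a revised utterance against structured
# context (a contact list). Names and schema are illustrative only.

context = {"contact_list": ["Ana Maria", "Anna Marie", "Sam Lee"]}

dialogue = [
    ("user", "Call Ana Maria"),
    ("assistant", "Calling Anna Marie ..."),            # misunderstanding
    ("user", "No, I said Ana Maria, not Anna Marie."),  # user revision
]

def resolve_revision(revision, contacts):
    # Pick the contact the user explicitly affirmed in the revision.
    for contact in contacts:
        if f"said {contact}" in revision:
            return contact
    return None

target = resolve_revision(dialogue[-1][1], context["contact_list"])
parse = {"intent": "CallContact", "arguments": {"contact": target}}
# parse -> {"intent": "CallContact", "arguments": {"contact": "Ana Maria"}}
```

Without the contact list, the parser has no way to tell which of two similar names is a real, callable contact; the structured context is what makes the revision resolvable.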
Another common category of utterance that is challenging for virtual assistants is code-mixing, which occurs when the user switches from one language to another while addressing the assistant. Consider the utterance below:
A dialogue denoting code-mixing between English and German.
In this example, the user switches mid-utterance from English to German ("vier Uhr" means "four o'clock").
To advance research on parsing such realistic and complex utterances, we are launching PRESTO, a multilingual dataset for parsing realistic task-oriented dialogues that includes roughly half a million conversations between people and virtual assistants. The dataset spans six languages and covers several conversational phenomena that users encounter when using an assistant, including user revisions, disfluencies, and code-mixing. It also includes the surrounding structured context, such as the user's contacts and lists, associated with each example. The explicit tagging of these phenomena in PRESTO allows us to create separate test sets to analyze model performance on each one. We find that some of these phenomena are easier to model with few-shot examples, while others require much more training data.
Examples of Hindi-English, Spanish-English, and German-English code-switched utterances from PRESTO.
Examples of utterances in English, Japanese, and French with filler words or repetitions.
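As a rough sketch of how per-example tags could be used to build such targeted test sets, the snippet below loads examples from a JSON Lines file and filters them by phenomenon. The file name and field names ("tags", etc.) are hypothetical placeholders, not the dataset's actual schema.

```python
import json

# Hedged sketch: load PRESTO-style examples and filter by phenomenon tag.
# The file name and field names ("tags", etc.) are hypothetical placeholders,
# not the dataset's actual schema.

def load_examples(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

examples = load_examples("presto_train.jsonl")  # hypothetical file name

# Explicit per-example tags make a targeted test set a one-line filter:
code_mixed = [ex for ex in examples if "code-mixing" in ex.get("tags", [])]
disfluent  = [ex for ex in examples if "disfluency" in ex.get("tags", [])]
revisions  = [ex for ex in examples if "revision" in ex.get("tags", [])]
```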
We performed targeted experiments to focus on each of the phenomena described above. We trained mT5-based models on the PRESTO dataset and evaluated them using exact match between the predicted parse and the human-annotated parse. Below we show the relative performance improvements as we scale the training data on each of the targeted phenomena: user revisions, disfluencies, and code-mixing.
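For reference, the snippet below sketches this evaluation setup: generating a parse string with an off-the-shelf mT5 checkpoint and scoring it with exact match. The checkpoint and decoding settings are illustrative; they are not the exact configuration used in the paper.

```python
# Hedged sketch of the exact-match evaluation: generate a parse string with an
# mT5 model and compare it verbatim against the human-annotated parse.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

def predict_parse(utterance: str, max_length: int = 128) -> str:
    inputs = tokenizer(utterance, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def exact_match(predictions, references):
    # Fraction of predictions identical to the annotated parse.
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```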
K-shot results on various linguistic phenomena and the full test set across increasing training data size.
The k-shot results yield the following takeaways:
- Zero-shot performance is poor when a targeted phenomenon is absent from the training set, so targeted examples are needed to handle it.
- User revisions and disfluencies become substantially easier to model as the amount of training data grows.
- Code-mixed utterances remain difficult to model, even with a high number of training examples.
We also investigate the difference between training monolingual and multilingual models on the train set and find that with less data, multilingual models have an advantage over monolingual ones, but the gap shrinks as the data size increases.
Additional details on data quality, data collection methodology, and modeling experiments can be found in our paper.
We created PRESTO, a multilingual dataset for parsing task-oriented dialogues whose realistic conversations capture a variety of pain points that users face in daily interactions with virtual assistants and that are missing from existing datasets in the NLP community. PRESTO includes roughly half a million utterances contributed by native speakers of six languages: English, French, German, Hindi, Japanese, and Spanish. We created dedicated test sets for each targeted phenomenon: user revisions, disfluencies, code-mixing, and structured context. Our results indicate that zero-shot performance is poor when the targeted phenomenon is not included in the training set, indicating the need for such utterances to improve performance. We also find that user revisions and disfluencies become easier to model with more data, whereas code-mixed utterances remain hard to model even with a high number of examples. With the release of this dataset, we open more questions than we answer, and we hope the research community makes progress on utterances that better reflect what users face every day.
It was a privilege to collaborate on this work with Waleed Ammar, Siddharth Vashishtha, Motoki Sano, Faiz Surani, Max Chang, HyunJeong Choe, David Greene, Kyle He, Rattima Nitisaroj, Anna Trukhina, Shachi Paul, Pararth Shah, Rushin Shah, and Zhou Yu. We’d also like to thank Tom Small for the animations in this blog post. Finally, a huge thanks to all the expert linguists and data annotators for making this a reality.