What are the main datasets that can be used to train chatbots?
Here we include a list of dialogue datasets that can be used to train chatbots:
| Category / Name | Context | Structure | License |
|---|---|---|---|
| Schema Guided Conversation | | | |
| | Conversations with a system spanning 20 domains, ranging from banks and events to media, calendar, travel, and weather | Wide range of annotations including system and user actions and user states | CC BY SA 4.0 |
| | A collection of data from various sources with labelled intent | Includes Banking77, HWU64, CLINC150, Restaurant8K, SGD, TOP, MultiWOZ2.1 | Various licenses including CC-BY, CC-BY-SA, MIT |
| | Collection of single and multi-domain conversations | Annotated with goal and belief state | Apache License 2.0 |
| | Set of conversations between an agent and a simulated user for booking a restaurant table and buying movie tickets | Annotated with dialogue state and actions | Not mentioned. |
| | Conversations about vacation packages including flight and hotel. | | Not mentioned. |
| | Dialogues with state tracking on restaurant search domain | Annotated with system states | Not mentioned. |
| | Conversations in 2 domains of restaurant information and tourist information. | User dialog-act semantics and dialog states are annotated. | Not mentioned. |
| Intent Datasets | | | |
| | Fine grained set of intents in banking domain | Annotated with intents | CC-BY-4.0 |
| | Popular personal assistant queries | Annotated with intents | CC-BY-SA 3.0 |
| | Popular personal assistant queries | Annotated with intents | CC-BY-SA 3.0 |
| Others | | | |
| | Daily life conversations including ordinary life, school life, tourism, attitude & emotion, relationship, health, work, politics, culture & education and finance. | 13K conversations averaging 8 turns, annotated for new ideas and emotions | Research Only. No Commercial Use. |
| | Split into shorter conversations, with alternative responses added for the last utterance | Includes adversarial irrelevant responses | Not mentioned. |
| | Chit chat based on profiles | Includes profiles and conversations | CC BY 4.0 |
| | Rephrasing of statements in Persona-Chat to avoid obvious overlap | Includes profiles and conversations | Not mentioned. |
| | Transcripts of the comedy series Friends | Annotated for 36 possible relation types | Research Only. No Commercial Use. |
| | Selected conversations from the Ubuntu Internet Relay Chat (IRC) channel and an Advising dataset from the University of Michigan | | Not mentioned. |
| | Added complexity to DSTC7 Task1 | Includes conversations with more than 2 participants and multiple simultaneous conversations, and predicting whether the dialogue has solved the problem yet. | Not mentioned. |
| | Conversational data from Reddit | Includes the conversation and the relevant facts on the topic of the conversation | Downloaded via script. |
| | Conversations in one of the following 6 domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks and making restaurant reservations | | CC BY SA 4.0 |
| | Conversations in one of the following 7 domains: restaurants, food ordering, movies, hotels, flights, music, and sports | | CC BY SA 4.0 |
| | Dialogs about geographic topics like geopolitical entities and locations | Includes fine-grained knowledge groundings and dialog-act annotations. | CC BY NC 4.0 |
| | Conversations on movie preferences | Annotated with entity mentions and preferences expressed about entities. | CC BY SA 4.0 |
| | Conversations between users and customer care agents on Twitter across 25 organizations | Includes the URL to a document including additional information on user request | Apache 2.0 |
| | Human conversations constrained by policies and sequence of actions | Annotated with user and order details | MIT License |
| | Knowledge grounded conversations spanning 3 distinct tasks: calendar scheduling, weather information retrieval, and point-of-interest navigation | | Not mentioned. |
| | Human conversations | Labelled with one of 8 emotions: anger, disgust, fear, happiness, sadness, surprise, neutral, and non-neutral | Not mentioned. |
| | Conversations crawled from movie scripts on IMSDb | Annotated with one of 13 predefined interpersonal relationships. | Not mentioned. |