What are the main datasets that can be used to train chatbots?


Here we include a list of dialogue datasets that can be used to train chatbots:

Table of datasets for chatbot development

| Category | Name | Context | Structure | License |
|---|---|---|---|---|
| Schema Guided Conversation | Schema-Guided Dialogue (SGD) | Conversations with a system spanning 20 domains, ranging from banks and events to media, calendar, travel, and weather | Wide range of annotations, including system and user actions and user states | CC BY-SA 4.0 |
| Schema Guided Conversation | DialoGLUE | A collection of data from various sources with labelled intent | Includes Banking77, HWU64, CLINC150, Restaurant8K, SGD, TOP, MultiWOZ2.1 | Various licenses, including CC BY, CC BY-SA, and MIT |
| Schema Guided Conversation | MultiWOZ | Collection of single- and multi-domain conversations | Annotated with goal and belief state | Apache License 2.0 |
| Schema Guided Conversation | M2M | Set of conversations between an agent and a simulated user for booking a restaurant table and buying movie tickets | Annotated with dialogue state and actions | Not mentioned |
| Schema Guided Conversation | Frames Dataset | Conversations about vacation packages, including flights and hotels | | Not mentioned |
| Schema Guided Conversation | WoZ | Dialogues with state tracking in the restaurant search domain | Annotated with system states | Not mentioned |
| Schema Guided Conversation | Dialogue State Tracking Challenge (DSTC2&3) | Conversations in 2 domains: restaurant information and tourist information | Annotated with user dialog-act semantics and dialog states | Not mentioned |
| Intent Datasets | BANKING77 | Fine-grained set of intents in the banking domain | Annotated with intents | CC BY 4.0 |
| Intent Datasets | CLINC150 | Popular personal assistant queries | Annotated with intents | CC BY-SA 3.0 |
| Intent Datasets | HWU64 | Popular personal assistant queries | Annotated with intents | CC BY-SA 3.0 |
| Others | DailyDialog | Daily-life conversations covering ordinary life, school life, tourism, attitude & emotion, relationship, health, work, politics, culture & education, and finance | 13K conversations averaging 8 turns, annotated with communication intentions (dialogue acts) and emotions | Research only; no commercial use |
| Others | └ DailyDialog++ | Divided into shorter conversations, with alternative responses added for the last utterance | Includes adversarial irrelevant responses | Not mentioned |
| Others | Persona-Chat | Chit-chat based on profiles | Includes profiles and conversations | CC BY 4.0 |
| Others | └ ConvAI2 | Rephrasing of statements in Persona-Chat to avoid obvious overlap | Includes profiles and conversations | Not mentioned |
| Others | DialogRE | Transcripts of the comedy series Friends | Annotated for 36 possible relation types | Research only; no commercial use |
| Others | Dialog System Technology Challenges (DSTC)7 Task1 | Selected conversations from the Ubuntu Internet Relay Chat (IRC) channel and an Advising dataset from the University of Michigan | | Not mentioned |
| Others | └ Dialog System Technology Challenges (DSTC)8 Task2 | Adds complexity to DSTC7 Task1 | Includes conversations with more than 2 participants and multiple simultaneous conversations, plus prediction of whether the dialogue has solved the problem yet | Not mentioned |
| Others | Dialog System Technology Challenges (DSTC)7 Task2 | Conversational data from Reddit | Includes the conversation and the relevant facts on the topic of the conversation | Downloaded via script |
| Others | Taskmaster-1 | Conversations in one of the following 6 domains: ordering pizza, creating auto repair appointments, setting up ride service, ordering movie tickets, ordering coffee drinks, and making restaurant reservations | | CC BY-SA 4.0 |
| Others | └ Taskmaster-2 | Conversations in one of the following 7 domains: restaurants, food ordering, movies, hotels, flights, music, and sports | | CC BY-SA 4.0 |
| Others | Curiosity | Dialogs about geographic topics such as geopolitical entities and locations | Includes fine-grained knowledge groundings and dialog-act annotations | CC BY-NC 4.0 |
| Others | Coached Conversational Preference Elicitation | Conversations on movie preferences | Annotated with entity mentions and preferences expressed about entities | CC BY-SA 4.0 |
| Others | Twitter Conversations Dataset | Conversations between users and customer care agents on Twitter across 25 organizations | Includes the URL to a document with additional information on the user request | Apache 2.0 |
| Others | Action-Based Conversations Dataset (ABCD) | Human conversations constrained by policies and sequences of actions | Annotated with user and order details | MIT License |
| Others | Key-Value Retrieval Networks (KVRET) | Knowledge-grounded conversations spanning 3 distinct tasks: calendar scheduling, weather information retrieval, and point-of-interest navigation | | Not mentioned |
| Others | EmotionLines | Human conversations | Labelled with one of 8 emotion labels: anger, disgust, fear, happiness, sadness, surprise, neutral, and non-neutral | Not mentioned |
| Others | Dyadic Dialogues Relationships (DDRel) | Conversations crawled from movie scripts on IMSDb | Annotated with one of 13 predefined interpersonal relationships | Not mentioned |
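Because licensing varies so much across these datasets, a useful first pass when choosing training data is to shortlist by license. The snippet below is a minimal sketch of that idea, not a definitive rule: the `DATASETS` list holds a handful of rows taken from the table, while the helper names (`normalize`, `allows_commercial_use`, `commercial_candidates`) and the keyword list `RESTRICTED_MARKERS` are assumptions made up for illustration.

```python
# A few rows taken from the table: (name, category, license).
DATASETS = [
    ("Schema-Guided Dialogue (SGD)", "Schema Guided Conversation", "CC BY-SA 4.0"),
    ("MultiWOZ", "Schema Guided Conversation", "Apache License 2.0"),
    ("BANKING77", "Intent Datasets", "CC BY 4.0"),
    ("DailyDialog", "Others", "Research only; no commercial use"),
    ("Curiosity", "Others", "CC BY-NC 4.0"),
    ("ConvAI2", "Others", "Not mentioned"),
    ("Action-Based Conversations Dataset (ABCD)", "Others", "MIT License"),
]

# Hypothetical keywords marking licenses that restrict commercial use
# or leave it unclear. This list is an assumption, not an official taxonomy.
RESTRICTED_MARKERS = ("research only", "no commercial", " nc ", "not mentioned")


def normalize(license_text):
    """Lowercase, turn hyphens into spaces, and pad with spaces for whole-word matching."""
    return " " + license_text.lower().replace("-", " ").strip(". ") + " "


def allows_commercial_use(license_text):
    """Return True when none of the restriction keywords appear in the license text."""
    text = normalize(license_text)
    return not any(marker in text for marker in RESTRICTED_MARKERS)


def commercial_candidates(datasets):
    """Return the names of datasets whose license passes the keyword check."""
    return [name for name, _category, lic in datasets if allows_commercial_use(lic)]


if __name__ == "__main__":
    for name in commercial_candidates(DATASETS):
        print(name)
```

This keyword check only separates clearly restricted or unspecified licenses from the rest. It does not model share-alike or attribution obligations, so the actual license text of each shortlisted dataset still needs to be read before use.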