Topical_chat_preprocessor

This project was created to easily and conveniently preprocess the Topical-Chat dataset from the Amazon Alexa team.

The Topical-Chat dataset is already well organized in JSON files, but its structure is complex, and the mixture of nested dictionaries and lists makes preprocessing inconvenient.

This preprocessor therefore returns the data in several simpler forms for use in machine learning and data science work.

Currently, as a prototype, it supports only the conversation data and returns it as either a list or a dictionary.
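To illustrate the kind of flattening involved, here is a minimal sketch that is independent of this project's code. It assumes the usual layout of a Topical-Chat conversations file (a conversation id mapping to metadata plus a list of per-turn dictionaries). The sample values and the helper name flatten_conversations are hypothetical, and this is not necessarily how TopicalChatPreprocessor works internally.

```python
from collections import defaultdict

# Hand-written placeholder sample in the assumed shape of a conversations file:
# conversation id -> metadata plus a "content" list of per-turn dictionaries.
sample = {
    "t_example_1": {
        "article_url": "<url>",
        "config": "A",
        "content": [
            {"agent": "agent_1", "message": "Hi!", "sentiment": "Happy",
             "knowledge_source": ["FS1"], "turn_rating": "Good"},
            {"agent": "agent_2", "message": "Hello.", "sentiment": "Neutral",
             "knowledge_source": ["FS1", "FS2"], "turn_rating": "Excellent"},
        ],
        "conversation_rating": {"agent_1": "Good", "agent_2": "Excellent"},
    }
}

TURN_FIELDS = ("agent", "message", "sentiment", "knowledge_source", "turn_rating")


def flatten_conversations(conversations):
    """Collect each field across conversations into parallel lists (hypothetical helper)."""
    out = defaultdict(list)
    for conv in conversations.values():
        # Conversation-level fields: one entry per conversation.
        out["article_url"].append(conv["article_url"])
        out["config"].append(conv["config"])
        out["conversation_rating"].append(conv["conversation_rating"])
        # Turn-level fields: one list per conversation, one item per turn.
        for field in TURN_FIELDS:
            out["content_" + field].append([turn[field] for turn in conv["content"]])
    return dict(out)


flat = flatten_conversations(sample)
print(flat["content_message"])
```

Each turn-level field ends up as a list of per-conversation lists, so the i-th message, agent, and sentiment of a conversation stay aligned by index.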

Example code

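from topical_chat_preprocessor import TopicalChatPreprocessor

# Point the preprocessor at the folder that contains the Topical-Chat dataset
Tppre = TopicalChatPreprocessor('your Topical-Chat folder path')
# Preprocess the selected conversation files
listed_data = Tppre(['train','valid_freq','test_freq'])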

Expected return

for argument 'list'

"return": [
  "files": [
    "article_url": [<url>, <url>, ...]
    "config": []
    "content_agent": [
      "agent": [<agent1>, <agent2>, ...]
    ]
    "content_message": [
      "message": [<text>, <text>, ...]
    ]
    "content_sentiment": [
      "sentiment": [<>, <>, ...]
    ]
    "content_knowledge_source": [
      "knowledge_source": [[<>], [<>, <>], ...]
    ]
    "content_turn_rating": [
      "turn_rating": [<>, <>, ...]
    ]
    "conversation_rating": {
      "agent_1": <>,
      "agent_2": <>
    }
  ]
]

for argument 'dict'

"return": {
  "file name" <train>: {
    "article_url": [<url>, <url>, ...]
    "config": []
    "content_agent": [
      "agent": [<agent1>, <agent2>, ...]
    ]
    "content_message": [
      "message": [<text>, <text>, ...]
    ]
    "content_sentiment": [
      "sentiment": [<>, <>, ...]
    ]
    "content_knowledge_source": [
      "knowledge_source": [[<>], [<>, <>], ...]
    ]
    "content_turn_rating": [
      "turn_rating": [<>, <>, ...]
    ]
    "conversation_rating": {
      "agent_1": <>,
      "agent_2": <>
    }
  }
}

The difference is that in the dict form, each file's data is stored under its file name as the key (file_name: data), and each field within that data is stored under a named key (article_url, config, content_agent, content_message, content_sentiment, content_knowledge_source, content_turn_rating, conversation_rating). Otherwise, the structure is the same as in the list form.
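The relationship between the two forms can be sketched as follows. This assumes that the list form holds one entry per file, with the eight fields stored positionally in the order listed above; that reading of the schematic, the helper name list_to_dict, and the placeholder data are all assumptions for illustration, not part of this project's API.

```python
# Field order assumed to match the schematic of the list form.
FIELD_NAMES = [
    "article_url", "config", "content_agent", "content_message",
    "content_sentiment", "content_knowledge_source",
    "content_turn_rating", "conversation_rating",
]


def list_to_dict(listed_data, file_names):
    """Key each file's positional fields by file name and field name (hypothetical helper)."""
    return {
        name: dict(zip(FIELD_NAMES, fields))
        for name, fields in zip(file_names, listed_data)
    }


# Placeholder data: one file containing one single-turn conversation.
listed = [[
    ["<url>"],                                # article_url
    ["A"],                                    # config
    [["agent_1"]],                            # content_agent
    [["Hi!"]],                                # content_message
    [["Happy"]],                              # content_sentiment
    [[["FS1"]]],                              # content_knowledge_source
    [["Good"]],                               # content_turn_rating
    [{"agent_1": "Good", "agent_2": "Good"}], # conversation_rating
]]

as_dict = list_to_dict(listed, ["train"])
print(as_dict["train"]["content_message"])
```

Under this assumption, the dict form replaces positional lookups such as listed[0][3] with named lookups such as as_dict["train"]["content_message"].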

Reference

Gopalakrishnan, Karthik, et al. "Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations." Proc. INTERSPEECH 2019.
