Our documentation is still under construction. Please reach out to Mina Lee at [email protected]
if you have any question or suggestion for improvement.
Raw data contains the raw interaction traces collected for each task, including all logged interaction traces (logs.zip
) as well as survey questions and responses.
To facilitate easier viewing of the raw interaction data, we provide both static
and dynamic replay
visualizations, which are directly viewable in any browser after downlading this repository. Both static and replay visualzations are available for the Crossword puzzles task, and static visualizations are available for the Question answering task.
To make it easier to browse through these interactive visualizations, we have created the following web applications:
For any questions about the data visualizations for the Question answering or Crossword puzzle tasks, feel free to reach out to Megha Srivastava at [email protected]
!
To faciliate the analysis of events between humans and LMs across hundreds of interaction sessions, we standardize the data for each task to based on the raw data we collected.
Each task has event_blocks.csv
and survey_responses.csv
as well as a few extra files depending on the task.
Event blocks. The event_blocks.csv
file contains the event blocks (i.e., a sequence of events of our interest) that occurred during interaction. The definition of an event block varies by task. Here is how we defined event blocks for each task:
- Social dialogue: each turn is an event block (user writes one message)
- Question answering: each question and answer pair is an event block (user answers a question by querying a system zero or more times)
- Crossword puzzles: each query to the system is an event block (user queries a system once to brainstorm about a clue)
- Text summarization: each original document, machine-generated summary, and edited summary pair is an event block (user edits a machine-generated summary of a document)
- Metaphor generation: each metaphorical sentence is an event block (user writes a metaphorical sentence by querying a system zero or more times)
Our standardized data also includes some relevant statistics, such as the number of queries, the elapsed time, etc. for each event block.
For implementation details, please refer to the *_block.py
files in ./src/blocks
for each task.
Survey responses. the survey_responses.csv
file contains the survey responses collected after the task.
Each turn in a dialogue corresponds to a row.
session_id | worker_id | turn_id | model | prompt | elapsed_time | user_input | model_completion | interesting | boring | preference | humanness | sensibility | specificity | commonsense | empathy | quality_overall |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24d... | 650... | 1 | InstructBabbage | Robin and Tracy went to school together. Robin helped Tracy study. | 1.2 | I'm so grateful Robin helped me study! I'm not very good at math and this test was extremely important. | That's amazing! What did you learn from him? | 0 | 0 | 1 | 1 | 1 | 1 | 1 | -1 | 4 |
24d... | 650... | 2 | InstructBabbage | Robin and Tracy went to school together. Robin helped Tracy study. | 1.65 | I learned some mnemonic devices that helped me remember the proper order of doing things. I also learned how to double check my work. | That sounds like such an important skill to have. | 0 | 0 | 1 | 1 | 1 | 0 | 0 | -1 | 4 |
Each question in a quiz corresponds to a row.
session_id | worker_id | order_id | model | prompt | sequence_id | question_id | question_type | question_category | question_text | choice_a | choice_b | choice_c | choice_d | answer | answer_text | lm_used | user_queries | lm_responses | user_answer | user_correct | elapsed_time | num_queries | num_events |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
000... | A2E... | 0 | InstructDavinci | na | 19 | 5 | ctrl | misc | In the phrase 'Y2K' what does 'K' stand for? | millennium | computer code | catastrophe | thousand | d | thousand | 0 | d | 1 | 0.15 | 0 | 3 | ||
000... | A2E... | 1 | InstructDavinci | na | 19 | 14 | ctrl | us_fp | "Within the United Nations, real power is located in" | the Security Council. | the Chamber of Deputies. | the Council of Ministers. | the Secretariat. | a | the Security Council. | 0 | a | 1 | 0.03 | 0 | 2 |
Fields in event_blocks.csv
Each user query and system response in a crossword puzzle session corresponds to a row.
session_id | worker_id | order_id | model | prompt | prompt_dataset | clue_num | clue_direction | clue_type | user_query | query_type | completion |
---|---|---|---|---|---|---|---|---|---|---|---|
61_6... | 657... | 0 | InstructDavinci | 61 | NYT-1 | 1 | across | knowledge | modern persia | ['exact', 'keyword'] | The modern Persian language... |
61_6... | 657... | 1 | InstructDavinci | 61 | NYT-1 | 1 | down | N/A | not give an | ['keyword'] | inch" This phrase... |
Fields in survey_responses.csv
fluency
: user's survey response to the question "How clear (or fluent) were the responses from the AI Teammate?" on 5-point Likert scalehelpfulness
: user's survey response to the question "Independent of its fluency, how helpful was your AI Teammate for solving the crossword puzzle?" on 5-point Likert scaleease
: user's survey response to the question "How easy was it to interact with the AI Teammate?" on 5-point Likert scalejoy
: user's survey response to the question "How enjoyable was it to interact with the AI Teammate?" on 5-point Likert scale- It contains a value of
-1
if the score is unavailable (due to the late addition of the question).
- It contains a value of
session_id | worker_id | model | prompt | prompt_dataset | fluency | helpfulness | ease | joy |
---|---|---|---|---|---|---|---|---|
151_5... | 57c... | Jumbo | 151 | ELECT | 3 | 2 | 3 | 1 |
151_0... | 01f... | InstructBabbage | 151 | ELECT | 3 | 1 | 5 | 2 |
Fields in event_blocks.csv
Each document's summary in a writing session corresponds to a row.
elapsed_time
: time for editing a system-generated summarydocument
: document for which a user is asked to evaluate and edit a summaryoriginal_summary
: a system-generated summaryoriginal_{consistency, relevance, coherency}
: user's evaluation on the system-generated summaryedited_summary
: a system-generated summary edited by the useredited_{consistency, relevance, coherency}
: user's evaluation on the edited summarydistance
: edit distance betweenoriginal_summary
andedited_summary
original_{consistency, relevance, coherency}_third_party
}: third-party evaluation on the system-generated summary (available for a subset of summaries)edited_{consistency, relevance, coherency}_third_party
: third-party evaluation on the system-generated summary (available for a subset of summaries)
session_id | worker_id | order_id | model | prompt | elapsed_time | document | original_summary | original_consistency | original_relevance | original_coherency | edited_summary | edited_consistency | edited_relevance | edited_coherency | distance | original_consistency_third_party | original_relevance_third_party | edited_consistency_third_party | edited_relevance_third_party | edited_coherency_third_party |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6c4... | A2F... | 0 | Jumbo | 11 | 0.01 | The annual event... | A wakeboarding festival in Gwynedd has been cancelled. | TRUE | TRUE | TRUE | A wakeboarding festival in Gwynedd has been cancelled. | TRUE | TRUE | TRUE | 0 | |||||
6c4... | A2F... | 1 | Jumbo | 11 | 0.78 | About 120 cannabis... | Two men have been arrested after police discovered cannabis plants in Coleraine. | TRUE | TRUE | FALSE | Police discovered 120 cannabis plants in Coleraine worth an estimated £60,000 after two men have been arrested. | TRUE | TRUE | TRUE | 17 |
Fields in survey_responses.csv
improvement
: user's survey response to the question "The AI assistant improved as I edited more summaries. (1: strongly disagree / 5: strongly agree)" on 5-point Likert scaleedit
: user's survey response to the question "How much did you have to edit the generated summaries? (1: nothing at all / 5: almost every word)" on 5-point Likert scalehelpfulness
: user's survey response to the question "How helpful was having access to the AI assistant as an automated summarization tool? (1: not helpful at all / 5: extremely helpful)" on 5-point Likert scale
session_id | worker_id | model | prompt | prompt_dataset | fluency | helpfulness | ease | joy |
---|---|---|---|---|---|---|---|---|
151_5... | 57c... | Jumbo | 151 | ELECT | 3 | 2 | 3 | 1 |
151_0... | 01f... | InstructBabbage | 151 | ELECT | 3 | 1 | 5 | 2 |
Each metaphorical sentence in a writing session corresponds to a row.
session_id | worker_id | order_id | norm_order_id | model | prompt | elapsed_time | num_queries | num_events | user_query | model_sent | final_sent | edit_model_final | apt | specific | imageable |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3f3... | Nelson Liu | 0 | 0.0 | Davinci | first | 0.93 | 1 | 54 | I'm improving step by step. | 27 | 0.5 | 0.0 | 0.5 | ||
3f3... | Nelson Liu | 1 | 0.06 | Davinci | first | 1.05 | 1 | 52 | That was a step back. | That was a step back. | 0 | 0.0 | 0.0 | 0.0 |