Shekasteh is an evaluation dataset for Persian colloquial text. The text is drawn from several genres, including blog posts, movie subtitles, and forum chats.
Format: TSV (tab-separated) file with five columns: 1) original text, 2) text with broken words fixed to their standard forms, 3) text with style fixed to the standard style, 4) genre, 5) source.
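
As a rough illustration of the format above, the file could be read with a short Python sketch like the one below. The field names (`original`, `fixed_words`, `fixed_style`, `genre`, `source`) and the function name `read_shekasteh` are hypothetical labels chosen here for the five columns and are not part of the dataset. The sketch also assumes the file is UTF-8 encoded, uses no quoting, and that any line without exactly five fields (a header line with a different field count, for instance) can be skipped.

```python
import csv
from typing import Iterator, NamedTuple


class ShekastehRow(NamedTuple):
    """One line of the TSV file; field names are illustrative only."""
    original: str      # column 1: original colloquial text
    fixed_words: str   # column 2: broken words fixed to standard forms
    fixed_style: str   # column 3: style fixed to standard style
    genre: str         # column 4: genre
    source: str        # column 5: source


def read_shekasteh(path: str) -> Iterator[ShekastehRow]:
    """Yield rows from a tab-separated file with the five columns above."""
    # Assumption: UTF-8 text; QUOTE_NONE keeps quote characters literal.
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            # Assumption: lines without exactly five fields are skipped.
            if len(fields) != 5:
                continue
            yield ShekastehRow(*fields)
```

Calling `list(read_shekasteh(path))` with the path to the downloaded TSV file gives a list of rows whose `original` and `fixed_style` fields can be compared directly, for example when scoring a normalization system.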
Contributors:
- Mohammad Sadegh Rasooli (Preprocessing and evaluations)
- Farzaneh Bakhtyari (Annotation -- development data)
- Fatemeh Shafiei (Annotation -- test data)
- Mahsa Ravanbakhsh (Annotation -- test data)
- Chris Callison-Burch (Supervisor)