This is the FarSick corpus, a Persian Semantic Textual Similarity and Natural Language Inference dataset.


FarSick: A Persian Semantic Textual Similarity And Natural Language Inference Dataset

If you use this corpus, please cite this paper:

Z. Ghasemi and M. A. Keyvanrad, "FarSick: A Persian Semantic Textual Similarity And Natural Language Inference Dataset," 2021 11th International Conference on Computer Engineering and Knowledge (ICCKE), 2021, pp. 194-199, doi: 10.1109/ICCKE54056.2021.9721521.

@INPROCEEDINGS{9721521,
	author={Ghasemi, Zahra and Keyvanrad, Mohammad Ali},
	booktitle={2021 11th International Conference on Computer Engineering and Knowledge (ICCKE)},
	title={FarSick: A Persian Semantic Textual Similarity And Natural Language Inference Dataset},
	year={2021},
	volume={},
	number={},
	pages={194-199},
	doi={10.1109/ICCKE54056.2021.9721521}}

Contact: [email protected]
Project page: https://github.com/ZahraGhasemi-AI/FarSick

FarSick is the first relatively large-scale STS dataset for the Persian language. It consists of about 10,000 Persian sentence pairs. Each sentence pair is annotated for semantic relatedness and for the entailment relation between its two sentences. The dataset was built by translating and editing the sentences of the SICK dataset.

  • File Structure: tab-separated text file

Fields:

  • pair_ID: sentence pair ID

  • sentence_A: sentence A

  • sentence_B: sentence B

  • entailment_label: textual entailment gold label (NEUTRAL, ENTAILMENT, or CONTRADICTION)

  • relatedness_score: semantic relatedness gold score (on a 1-5 continuous scale)

  • entailment_AB: entailment for the A-B order (A_neutral_B, A_entails_B, or A_contradicts_B)

  • entailment_BA: entailment for the B-A order (B_neutral_A, B_entails_A, or B_contradicts_A)

  • sentence_A_dataset: dataset from which the original sentence A was extracted (FLICKR vs. SEMEVAL)

  • sentence_B_dataset: dataset from which the original sentence B was extracted (FLICKR vs. SEMEVAL)

  • SemEval_set: set including the sentence pair in SemEval 2014 Task 1 (TRIAL, TRAIN, or TEST)

  • binary_label: semantic gold score (0 vs. 1)
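As a minimal sketch of how the fields above might be read, the following Python snippet parses a tab-separated file with the standard `csv` module. It assumes the file starts with a header row containing the field names listed above; the helper name `read_pairs` and the sample row are invented for illustration and are not taken from the corpus.

```python
import csv
import io

# Field names as listed in this README (assumed to match the header row).
FIELDS = [
    "pair_ID", "sentence_A", "sentence_B", "entailment_label",
    "relatedness_score", "entailment_AB", "entailment_BA",
    "sentence_A_dataset", "sentence_B_dataset", "SemEval_set", "binary_label",
]


def read_pairs(handle):
    """Yield one dict per sentence pair from a tab-separated file handle."""
    # QUOTE_NONE keeps any quote characters inside sentences untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["relatedness_score"] = float(row["relatedness_score"])
        row["binary_label"] = int(row["binary_label"])
        yield row


# Invented placeholder row, NOT real corpus data.
SAMPLE_ROW = [
    "1", "<Persian sentence A>", "<Persian sentence B>", "NEUTRAL",
    "4.5", "A_neutral_B", "B_neutral_A", "FLICKR", "FLICKR", "TRAIN", "1",
]
SAMPLE = "\t".join(FIELDS) + "\n" + "\t".join(SAMPLE_ROW) + "\n"

pairs = list(read_pairs(io.StringIO(SAMPLE)))
print(pairs[0]["entailment_label"], pairs[0]["relatedness_score"])
```

For a file on disk, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is whatever name the downloaded file has.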

Statistics:

  • Training pairs: 4439

  • Trial pairs: 495

  • Test pairs: 4906

  • Total pairs: 9840

  • Train labels: {'entailment': 1274, 'neutral': 2524, 'contradiction': 641}, {'0': 1065, '1': 3374}

  • Trial labels: {'entailment': 143, 'neutral': 281, 'contradiction': 71}, {'0': 105, '1': 390}

  • Test labels: {'entailment': 1404, 'neutral': 2790, 'contradiction': 712}, {'0': 1126, '1': 3780}

  • Training number of tokens: 3384

  • Trial number of tokens: 1264

  • Test number of tokens: 3360
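The counts above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from this README, while the dictionary layout and variable names are illustrative choices.

```python
# Split sizes and label counts as reported in this README.
splits = {
    "train": {
        "pairs": 4439,
        "nli": {"entailment": 1274, "neutral": 2524, "contradiction": 641},
        "binary": {"0": 1065, "1": 3374},
    },
    "trial": {
        "pairs": 495,
        "nli": {"entailment": 143, "neutral": 281, "contradiction": 71},
        "binary": {"0": 105, "1": 390},
    },
    "test": {
        "pairs": 4906,
        "nli": {"entailment": 1404, "neutral": 2790, "contradiction": 712},
        "binary": {"0": 1126, "1": 3780},
    },
}
TOTAL_PAIRS = 9840

# Each label distribution should add up to the size of its split.
for name, split in splits.items():
    assert sum(split["nli"].values()) == split["pairs"], name
    assert sum(split["binary"].values()) == split["pairs"], name

# The three splits together should account for every pair.
total = sum(split["pairs"] for split in splits.values())
assert total == TOTAL_PAIRS

# Share of NEUTRAL pairs per split, useful as a majority-class baseline.
neutral_share = {
    name: split["nli"]["neutral"] / split["pairs"]
    for name, split in splits.items()
}
print(neutral_share)
```

All checks pass with the reported figures, and the neutral class makes up a little over half of each split.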
