Artificial intelligence is beginning to play a role in healthcare, helping to raise the level of medical technology. To further advance research on AI for medicine, we release the Chinese Biomedical Language Understanding Evaluation benchmark (CBLUE), which comes with datasets, baseline models, and a leaderboard.
We evaluated 11 Chinese pre-trained language models on 8 Chinese biomedical language understanding tasks and report the baseline results below.
Model | CMedEE | CMedIE | CDN | CTC | STS | QIC | QTR | QQR | Avg. |
---|---|---|---|---|---|---|---|---|---|
BERT-base | 62.1 | 54.0 | 55.4 | 69.2 | 83.0 | 84.3 | 60.0 | 84.7 | 69.0 |
BERT-wwm-ext-base | 61.7 | 54.0 | 55.4 | 70.1 | 83.9 | 84.5 | 60.9 | 84.4 | 69.4 |
ALBERT-tiny | 50.5 | 35.9 | 50.2 | 61.0 | 79.7 | 75.8 | 55.5 | 79.8 | 61.1 |
ALBERT-xxlarge | 61.8 | 47.6 | 37.5 | 66.9 | 84.8 | 84.8 | 62.2 | 83.1 | 66.1 |
RoBERTa-large | 62.1 | 54.4 | 56.5 | 70.9 | 84.7 | 84.2 | 60.9 | 82.9 | 69.6 |
RoBERTa-wwm-ext-base | 62.4 | 53.7 | 56.4 | 69.4 | 83.7 | 85.5 | 60.3 | 82.7 | 69.3 |
RoBERTa-wwm-ext-large | 61.8 | 55.9 | 55.7 | 69.0 | 85.2 | 85.3 | 62.8 | 84.4 | 70.0 |
PCL-MedBERT | 60.6 | 49.1 | 55.8 | 67.8 | 83.8 | 84.3 | 59.3 | 82.5 | 67.9 |
ZEN | 61.0 | 50.1 | 57.8 | 68.6 | 83.5 | 83.2 | 60.3 | 83.0 | 68.4 |
MacBERT-base | 60.7 | 53.2 | 57.7 | 67.7 | 84.4 | 84.9 | 59.7 | 84.0 | 69.0 |
MacBERT-large | 62.4 | 51.6 | 59.3 | 68.6 | 85.6 | 82.7 | 62.9 | 83.5 | 69.6 |
Human | 67.0 | 66.0 | 65.0 | 78.0 | 93.0 | 88.0 | 71.0 | 89.0 | 77.1 |
We also release the baseline models and related code for all 8 tasks.
Requirements: python3 / pytorch 1.7 / transformers 4.5.1 / jieba / gensim / sklearn
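For reference, a minimal environment setup sketch matching the requirements above; the exact torch patch version and the CUDA build are assumptions and may need adjusting for your machine:

```bash
# Environment sketch: the pinned torch patch version (1.7.1) is an assumption;
# transformers 4.5.1 matches the stated requirement.
pip install torch==1.7.1 transformers==4.5.1 jieba gensim scikit-learn
```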
The downloaded archive contains the datasets of all 8 tasks; each dataset directory is organized as follows:
├── {Task}    # name of each task
| └── {Task}_train.json
| └── {Task}_test.json
| └── {Task}_dev.json
| └── example_gold.json
| └── example_pred.json
| └── README.md
Note: some tasks come with additional files; for example, the CHIP-CTC task has an extra 'category.xlsx' file.
Download the Chinese pre-trained models as needed (download links are provided above). Reference directory layout:
├── CBLUE
|   └── baselines
|   |   └── run_classifier.py
|   |   └── ...
|   └── examples
|   |   └── run_qqr.sh
|   |   └── ...
|   └── cblue
|   └── CBLUEDatasets
|   |   └── KUAKE-QQR
|   |   └── ...
|   └── data
|   |   └── output
|   |   └── model_data
|   |   |   └── bert-base
|   |   |   └── ...
|   |   └── result_output
|   |   |   └── KUAKE-QQR_test.json
|   |   |   └── ...
Under the examples/ directory we provide, for each task, an sh script for training and prediction (generating the submission file); these can be run directly.
You can also write your own sh scripts around the runners in the baselines/ directory:
- baselines/run_classifier.py: supports the {sts, qqr, qtr, qic, ctc, ee} tasks;
- baselines/run_cdn.py: supports only the {cdn} task;
- baselines/run_ie.py: supports only the {ie} task.
Training models
Run the sh script directly: bash examples/run_{task}.sh. You can also adjust the script contents as needed for training. A script looks like this:
DATA_DIR="CBLUEDatasets" # 数据集总目录
TASK_NAME="qqr" # 具体任务
MODEL_TYPE="bert" # 预训练模型类型
MODEL_DIR="data/model_data" # 预训练模型保存路径
MODEL_NAME="chinese-bert-wwm" # 预训练模型名称
OUTPUT_DIR="data/output" # 模型保存目录
RESULT_OUTPUT_DIR="data/result_output" # 提交结果保存目录
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_dir=${MODEL_DIR} \
--model_name=${MODEL_NAME} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_train \
--max_length=${MAX_LENGTH} \
--train_batch_size=16 \
--eval_batch_size=16 \
--learning_rate=3e-5 \
--epochs=3 \
--warmup_proportion=0.1 \
--earlystop_patience=3 \
--logging_steps=250 \
--save_steps=250 \
--seed=2021
Note: the best model is saved to 'OUTPUT_DIR/MODEL_NAME/' by default.
- MODEL_TYPE: one of the supported model types {bert, roberta, albert, zen};
- MODEL_NAME: one of the supported Chinese pre-trained models {bert-base, bert-wwm-ext, albert-tiny, albert-xxlarge, zen, pcl-medbert, roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large, macbert-base, macbert-large}.
The correspondence between MODEL_TYPE and MODEL_NAME is as follows:
MODEL_TYPE | MODEL_NAME |
---|---|
bert | bert-base, bert-wwm-ext, pcl-medbert, macbert-base, macbert-large |
roberta | roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large |
albert | albert-tiny, albert-xxlarge |
zen | zen |
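For reference, the same mapping expressed as a small Python dict, e.g. to validate arguments before launching a run; this is an illustrative helper, not part of the released cblue package:

```python
# Illustrative helper mirroring the MODEL_TYPE -> MODEL_NAME table above;
# not part of the CBLUE codebase.
MODEL_TYPES = {
    "bert": {"bert-base", "bert-wwm-ext", "pcl-medbert", "macbert-base", "macbert-large"},
    "roberta": {"roberta-large", "roberta-wwm-ext-base", "roberta-wwm-ext-large"},
    "albert": {"albert-tiny", "albert-xxlarge"},
    "zen": {"zen"},
}

def check_model_args(model_type, model_name):
    """Raise early if MODEL_NAME does not belong to MODEL_TYPE."""
    if model_name not in MODEL_TYPES.get(model_type, set()):
        raise ValueError(f"{model_name!r} is not a {model_type!r} model")
```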
Inference and generation of results
Run the sh script directly: bash examples/run_{task}.sh predict. You can also adjust the script contents as needed for prediction. A script looks like this:
DATA_DIR="CBLUEDatasets" # 数据集总目录
TASK_NAME="qqr" # 具体任务
MODEL_TYPE="bert" # 预训练模型类型
MODEL_DIR="data/model_data" # 预训练模型保存路径
MODEL_NAME="chinese-bert-wwm" # 预训练模型名称
OUTPUT_DIR="data/output" # 模型保存目录
RESULT_OUTPUT_DIR="data/result_output" # 提交结果保存目录
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_name=${MODEL_NAME} \
--model_dir=${MODEL_DIR} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_predict \
--max_length=${MAX_LENGTH} \
--eval_batch_size=16 \
--seed=2021
The prediction file {task}_test.json will be written to the RESULT_OUTPUT_DIR directory.
Before submitting your result files, you can use the format_checker to verify the file format and avoid an invalid evaluation caused by a malformed file.
- Step 1: copy the raw test file (without answers) {taskname}_test.[json|jsonl|tsv] to the format_checker directory and rename it to {taskname}_test_raw.[json|jsonl|tsv]:
# take the CMeEE task for example:
cp ${path_to_CMeEE}/CMeEE_test.json ${current_dir}/CMeEE_test_raw.json
- Step 2: run the format-check script below on the raw test file and your prediction file:
python3 format_checker_${taskname}.py {taskname}_test_raw.[json|jsonl|tsv] {taskname}_test.[json|jsonl|tsv]
# take the CMeEE task for example:
python3 format_checker_CMeEE.py CMeEE_test_raw.json CMeEE_test.json
In addition, the following tasks differ from those above: the IMCS-NER and IMCS-V2-NER tasks.
- Step 1: copy the raw test files IMCS-NER_test.json (IMCS-V2-NER_test.json) and IMCS_test.json (IMCS-V2_test.json) to the format_checker directory, renaming the former to IMCS-NER_test_raw.json (IMCS-V2-NER_test_raw.json):
# for IMCS-NER task:
cp ${path_to_IMCS-NER}/IMCS-NER_test.json ${current_dir}/IMCS-NER_test_raw.json
cp ${path_to_IMCS-NER}/IMCS_test.json ${current_dir}
# for IMCS-V2-NER task:
cp ${path_to_IMCS-V2-NER}/IMCS-V2-NER_test.json ${current_dir}/IMCS-V2-NER_test_raw.json
cp ${path_to_IMCS-V2-NER}/IMCS-V2_test.json ${current_dir}
- Step 2: run the format-check scripts as follows:
# for IMCS-NER task:
python3 format_checker_IMCS_V1_NER.py IMCS-NER_test_raw.json IMCS-NER_test.json IMCS_test.json
# for IMCS-V2-NER task:
python3 format_checker_IMCS_V2_NER.py IMCS-V2-NER_test_raw.json IMCS-V2-NER_test.json IMCS-V2_test.json
Package the result files under the RESULT_OUTPUT_DIR directory into a .zip file and submit it.
To advance the development and real-world deployment of language-model technology in the medical domain, we collected and curated real medical data and release 8 medical language understanding tasks, spanning 5 categories: medical information extraction (entity recognition and relation extraction), medical term normalization, medical text classification, medical sentence-pair relation classification, and medical QA.
Dataset | Task | Train | Dev | Test | Evaluation Metrics |
---|---|---|---|---|---|
CMeEE | NER | 15,000 | 5,000 | 3,000 | Micro F1 |
CMeIE | Relation Extraction | 14,339 | 3,585 | 4,482 | Micro F1 |
CHIP-CDN | Diagnosis Normalization | 6,000 | 2,000 | 10,192 | Micro F1 |
CHIP-STS | Sentence Similarity | 16,000 | 4,000 | 10,000 | Macro F1 |
CHIP-CTC | Sentence Classification | 22,962 | 7,682 | 10,000 | Macro F1 |
KUAKE-QIC | Sentence Classification | 6,931 | 1,955 | 1,944 | Accuracy |
KUAKE-QTR | NLI | 24,174 | 2,913 | 5,465 | Accuracy |
KUAKE-QQR | NLI | 15,000 | 1,600 | 1,596 | Accuracy |
This task is named entity recognition over Chinese medical text: given a schema and a sentence, the goal is to identify and extract the clinically relevant entities in a collection of plain medical documents and classify them into pre-defined categories. Medical named entities are divided into nine categories: disease (dis), clinical symptom (sym), drug (dru), medical equipment (equ), medical procedure (pro), body (bod), medical test item (ite), microorganism (mic), and department (dep). The texts were automatically word-segmented before annotation, and all medical entities are correctly segmented.
Data example
{ "text": "呼吸肌麻痹和呼吸中枢受累患者因呼吸不畅可并发肺炎、肺不张等。", "entities": [ { "start_idx": 0, "end_idx": 2, "type": "bod", "entity: "呼吸肌" }, { "start_idx": 0, "end_idx": 4, "type": "sym", "entity: "呼吸肌麻痹" }, { "start_idx": 6, "end_idx": 9, "type": "bod", "entity: "呼吸中枢" }, { "start_idx": 6, "end_idx": 11, "type": "sym", "entity: "呼吸中枢受累" }, { "start_idx": 15, "end_idx": 18, "type": "sym", "entity: "呼吸不畅" }, { "start_idx": 22, "end_idx": 23, "type": "dis", "entity: "肺炎" }, { "start_idx": 25, "end_idx": 27, "type": "dis", "entity: "肺不张" } ] }本评测任务为面向中文医学文本的实体关系抽取,即给定schema约束集合及句子sentence,其中schema定义了关系Predicate以及其对应的主体Subject和客体Object的类别,例如: (“subject_type”:“疾病”,“predicate”: “药物治疗”,“object_type”:“药物”) (“subject_type”:“疾病”,“predicate”: “实验室检查”,“object_type”:“检查”)。 任务要求参评系统自动地对句子进行分析,输出句子中所有满足schema约束的SPO三元组知识Triples=[(S1, P1, O1), (S2, P2, O2)…]。
This task is entity-relation extraction over Chinese medical text: given a set of schema constraints and a sentence, where a schema defines a relation Predicate and the categories of its Subject and Object, for example ("subject_type": "疾病", "predicate": "药物治疗", "object_type": "药物") or ("subject_type": "疾病", "predicate": "实验室检查", "object_type": "检查"), the system must analyze the sentence and output all SPO triples that satisfy the schema constraints: Triples = [(S1, P1, O1), (S2, P2, O2), ...].
Data example
{ "text": "慢性胰腺炎@ ###低剂量放射 自1964年起,有几项病例系列报道称外照射 (5-50Gy) 可以有效改善慢性胰腺炎患者的疼痛症状。慢性胰腺炎@从概念上讲,外照射可以起到抗炎和止痛作用,并且已经开始被用于非肿瘤性疼痛的治疗。", "spo_list": [ { "Combined": true, "predicate": "放射治疗", "subject": "慢性胰腺炎", "subject_type": "疾病", "object": { "@value": "外照射" }, "object_type": { "@value": "其他治疗" } }, { "Combined": true, "predicate": "放射治疗", "subject": "非肿瘤性疼痛", "subject_type": "疾病", "object": { "@value": "外照射" }, "object_type": { "@value": "其他治疗" } } } ] }本评测任务主要目标是针对中文电子病历中挖掘出的真实诊断实体进行语义标准化。 给定一诊断原词,要求给出其对应的诊断标准词。所有诊断原词均来自于真实医疗数据,并以《国际疾病分类 ICD-10 北京临床版v601》词表为标准进行了标注(可能存在多个标准词,用##分隔)。
The goal of this task is semantic normalization of real diagnosis mentions extracted from Chinese electronic medical records. Given an original diagnosis term, the system must return the corresponding standard diagnosis term. All original terms come from real medical data and were annotated against the vocabulary of the International Classification of Diseases (ICD-10, Beijing clinical edition v601); a mention may map to multiple standard terms, separated by ##.
Data example
[ { "text": "左膝退变伴游离体", "normalized_result": "膝骨关节病##膝关节游离体" }, { "text": "糖尿病反复低血糖;骨质疏松;高血压冠心病不稳定心绞痛", "normalized_result": "糖尿病性低血糖症##骨质疏松##高血压##冠状动脉粥样硬化性心脏病##不稳定性心绞痛" }, { "text": "右乳腺癌IV期", "normalized_result": "乳腺恶性肿瘤##癌" } ]在本测评任务中,我们给定事先定义好的44种筛选标准语义类别 (详见附件的category.xlsx )和一系列中文临床试验筛选标准的描述句子,任务目标是返回每一条筛选标准的具体类别。
In this task, we provide 44 pre-defined semantic categories of eligibility criteria (see the attached category.xlsx) together with a series of sentences describing eligibility criteria from Chinese clinical trials; the goal is to return the category of each criterion.
Data example
[ { "id": "s1", "label": "Multiple", "text": " 7.凝血功能异常(INR>1.5 或凝血酶原时间(PT)>ULN+4 秒或 APTT >1.5 ULN),具有出血倾向或正在接受溶栓或抗凝治疗;" }, { "id": "s2", "label": "Addictive Behavior", "text": " (2)重度吸烟(大于10支/天)及酗酒患者" }, { "id": "s3", "label": "Therapy or Surgery", "text": " 13. 有器官移植病史或正等待器官移植的患者;" } ]在本测评任务中,给定来自5个不同病种的问句对,要求判定两个句子语义是否相同或者相近。category表示问句对的病种名称,分别对应:diabetes-糖尿病,hypertension-高血压,hepatitis-乙肝,aids-艾滋病,breast_cancer-乳腺癌。label表示问句之间的语义是否相同。若相同,标为1,若不相同,标为0。
In this task, given question pairs drawn from 5 different diseases, decide whether the two sentences are semantically identical or similar. category is the disease of the pair: diabetes, hypertension, hepatitis (hepatitis B), aids, or breast_cancer. label indicates whether the two questions have the same meaning: 1 if they do, 0 if they do not.
Data example
[ { "id": "1", "text1": "糖尿病能吃减肥药吗?能治愈吗?", "text2": "糖尿病为什么不能吃减肥药", "label": "1", "category": "diabetes" }, { "id": "2", "text1": "有糖尿病和前列腺怎么保健怎样治疗", "text2": "患有糖尿病和前列腺怎么办?", "label": "1", "category": "diabetes" }, { "id": "3", "text1": "我也是乙肝携带患者,可以办健康证吗在", "text2": "乙肝五项化验单怎么看呢", "label": "0", "category": "hepatitis" } ]在本评测任务中,给定医学搜索问题,要求对医学问题进行意图分类。医学问题分为病情诊断(diagnosis)、病因分析(cause)、治疗方案(method)、就医建议(advice)、指标解读(metric_explain)、疾病描述(disease_express)、后果表述(result)、注意事项(attention)、功效作用(effect)、医疗费用(price)、其他(other) 共11种类型。
In this task, given a medical search query, classify the intent of the question. Queries fall into 11 intent types: condition diagnosis (diagnosis), cause analysis (cause), treatment plan (method), medical advice (advice), metric interpretation (metric_explain), disease description (disease_express), consequence description (result), precautions (attention), effects (effect), medical cost (price), and other (other).
Data example
[ { "id": "s1", "query": "心肌缺血如何治疗与调养呢?", "label": "治疗方案" }, { "id": "s2", "query": "19号来的月经,25号服用了紧急避孕药本月5号,怎么办?", "label": "治疗方案" }, { "id": "s3", "query": "什么叫痔核脱出?什么叫外痔?", "label": "疾病表述" } ]在本测评任务中,给定评估搜索词(Query)表述主题和落地页标题(Title)表述主题,要求判断Query主题和Title主题是否一致及达到多大程度上的一致。
In this task, given the topic expressed by a search query (Query) and the topic expressed by a landing-page title (Title), judge whether, and to what degree, the two topics agree.
Data example
[ { "id": "s1", "query": "咳嗽到腹肌疼", "title": "感冒咳嗽引起的腹肌疼痛,是怎么回事?", "label": "2" }, { "id": "s2", "query": "烂牙神经的药对怀孕胚胎", "title": "怀孕两个月治疗牙齿烂牙神经用了含砷失活剂 怀孕两个月治疗...", "label": "1" }, { "id": "s3", "query": "怀孕可以空腹吃葡萄吗", "title": "怀孕四个月,今早空腹吃了葡萄,然后肚子就一直胀胀的...", "label": "1" } ]在本测评任务中,给定两个query,要求评估两个Query所表述主题的匹配程度。
In this task, given two queries, rate how closely the topics expressed by the two queries match.
Data example
[ { "id": "s1", "query": "小孩子打呼噜什么原因", "title": "孩子打呼噜是什么原因", "label": "2" }, { "id": "s2", "query": "小孩子打呼噜什么原因", "title": "宝宝打呼噜是什么原因", "label": "0" }, { "id": "s3", "query": "小孩子打呼噜什么原因", "title": "小儿打呼噜是什么原因引起的", "label": "2" } ]我们同时提供了8个任务的数据处理、模型训练、验证的模块代码,可供快速构建自己的代码。每个任务相应的 Dataset
, Data Processor
, Trainer
模块如下:
Task | Data Processor (cblue.data) | Dataset (cblue.data) | Trainer (cblue.trainer) |
---|---|---|---|
CMeEE | EEDataProcessor | EEDataset | EETrainer |
CMeIE | ERDataProcessor / REDataProcessor | ERDataset / REDataset | ERTrainer / RETrainer |
CHIP-CDN | CDNDataProcessor | CDNDataset | CDNForCLSTrainer / CDNForNUMTrainer |
CHIP-CTC | CTCDataProcessor | CTCDataset | CTCTrainer |
CHIP-STS | STSDataProcessor | STSDataset | STSTrainer |
KUAKE-QIC | QICDataProcessor | QICDataset | QICTrainer |
KUAKE-QQR | QQRDataProcessor | QQRDataset | QQRTrainer |
KUAKE-QTR | QTRDataProcessor | QTRDataset | QTRTrainer |
from cblue.data import EEDataProcessor, EEDataset
from cblue.trainer import EETrainer
from cblue.metrics import ee_metric, ee_commit_prediction
data_processor = EEDataProcessor(root=...)
# get the training samples (already preprocessed)
train_samples = data_processor.get_train_sample()
# get the dev samples
eval_samples = data_processor.get_dev_sample()
# get the test samples
test_samples = data_processor.get_test_sample()
# 'torch Dataset'
train_dataset = EEDataset(train_samples, tokenizer=..., mode='train', max_length=...)
# trainer
trainer = EETrainer(...)
trainer.train(...)
# prediction
test_dataset = EEDataset(test_samples, tokenizer=..., mode='test', max_length=...)
trainer.predict(test_dataset)
Note: our CMeIE baseline uses two-stage modeling, so the module code for the two stages is split into an 'ER...' part and an 'RE...' part.
from cblue.data import ERDataProcessor, REDataProcessor, REDataset, ERDataset
from cblue.trainer import ERTrainer, RETrainer
from cblue.metrics import er_metric, er_commit_prediction, re_metric, re_commit_prediction
# usage is the same as above
Param | Value |
---|---|
warmup_proportion | 0.1 |
weight_decay | 0.01 |
adam_epsilon | 1e-8 |
max_grad_norm | 1.0 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 5 | 32 | 128 | 4e-5 |
bert-wwm-ext | 5 | 32 | 128 | 4e-5 |
roberta-wwm-ext | 5 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 5 | 12 | 65 | 2e-5 |
roberta-large | 5 | 12 | 65 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 5e-5 |
albert-xxlarge | 5 | 12 | 65 | 1e-5 |
zen | 5 | 20 | 128 | 4e-5 |
macbert-base | 5 | 32 | 128 | 4e-5 |
macbert-large | 5 | 12 | 80 | 2e-5 |
PCL-MedBERT | 5 | 32 | 128 | 4e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 7 | 32 | 128 | 5e-5 |
bert-wwm-ext | 7 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 7 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 7 | 16 | 80 | 4e-5 |
roberta-large | 7 | 16 | 80 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 4e-5 |
albert-xxlarge | 7 | 16 | 80 | 1e-5 |
zen | 7 | 20 | 128 | 4e-5 |
macbert-base | 7 | 32 | 128 | 4e-5 |
macbert-large | 7 | 20 | 80 | 2e-5 |
PCL-MedBERT | 7 | 32 | 128 | 4e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 8 | 32 | 128 | 5e-5 |
bert-wwm-ext | 8 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 8 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 8 | 16 | 80 | 4e-5 |
roberta-large | 8 | 16 | 80 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 4e-5 |
albert-xxlarge | 8 | 16 | 80 | 1e-5 |
zen | 8 | 20 | 128 | 4e-5 |
macbert-base | 8 | 32 | 128 | 4e-5 |
macbert-large | 8 | 20 | 80 | 2e-5 |
PCL-MedBERT | 8 | 32 | 128 | 4e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 5 | 32 | 128 | 5e-5 |
bert-wwm-ext | 5 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 5 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 5 | 32 | 50 | 3e-5 |
roberta-large | 5 | 24 | 50 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 4e-5 |
albert-xxlarge | 5 | 20 | 50 | 1e-5 |
zen | 5 | 20 | 128 | 4e-5 |
macbert-base | 5 | 32 | 128 | 4e-5 |
macbert-large | 5 | 20 | 50 | 2e-5 |
PCL-MedBERT | 5 | 32 | 128 | 4e-5 |
Param | Value |
---|---|
recall_k | 200 |
num_negative_sample | 5+5(random) |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 32 | 128 | 4e-5 |
bert-wwm-ext | 3 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 3 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 3 | 32 | 40 | 4e-5 |
roberta-large | 3 | 32 | 40 | 4e-5 |
albert-tiny | 3 | 32 | 128 | 4e-5 |
albert-xxlarge | 3 | 32 | 40 | 1e-5 |
zen | 3 | 20 | 128 | 4e-5 |
macbert-base | 3 | 32 | 128 | 4e-5 |
macbert-large | 3 | 32 | 40 | 2e-5 |
PCL-MedBERT | 3 | 32 | 128 | 4e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 20 | 32 | 128 | 4e-5 |
bert-wwm-ext | 20 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 20 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 20 | 12 | 40 | 4e-5 |
roberta-large | 20 | 12 | 40 | 4e-5 |
albert-tiny | 20 | 32 | 128 | 4e-5 |
albert-xxlarge | 20 | 12 | 40 | 1e-5 |
zen | 20 | 20 | 128 | 4e-5 |
macbert-base | 20 | 32 | 128 | 4e-5 |
macbert-large | 20 | 12 | 40 | 2e-5 |
PCL-MedBERT | 20 | 32 | 128 | 4e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 40 | 3e-5 |
bert-wwm-ext | 3 | 16 | 40 | 3e-5 |
roberta-wwm-ext | 3 | 16 | 40 | 4e-5 |
roberta-wwm-ext-large | 3 | 16 | 40 | 4e-5 |
roberta-large | 3 | 16 | 40 | 2e-5 |
albert-tiny | 3 | 16 | 40 | 5e-5 |
albert-xxlarge | 3 | 16 | 40 | 1e-5 |
zen | 3 | 16 | 40 | 2e-5 |
macbert-base | 3 | 16 | 40 | 3e-5 |
macbert-large | 3 | 16 | 40 | 3e-5 |
PCL-MedBERT | 3 | 16 | 40 | 2e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 50 | 2e-5 |
bert-wwm-ext | 3 | 16 | 50 | 2e-5 |
roberta-wwm-ext | 3 | 16 | 50 | 2e-5 |
roberta-wwm-ext-large | 3 | 16 | 50 | 2e-5 |
roberta-large | 3 | 16 | 50 | 3e-5 |
albert-tiny | 3 | 16 | 50 | 5e-5 |
albert-xxlarge | 3 | 16 | 50 | 1e-5 |
zen | 3 | 16 | 50 | 2e-5 |
macbert-base | 3 | 16 | 50 | 3e-5 |
macbert-large | 3 | 16 | 50 | 2e-5 |
PCL-MedBERT | 3 | 16 | 50 | 2e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 40 | 4e-5 |
bert-wwm-ext | 3 | 16 | 40 | 2e-5 |
roberta-wwm-ext | 3 | 16 | 40 | 3e-5 |
roberta-wwm-ext-large | 3 | 16 | 40 | 2e-5 |
roberta-large | 3 | 16 | 40 | 2e-5 |
albert-tiny | 3 | 16 | 40 | 5e-5 |
albert-xxlarge | 3 | 16 | 40 | 1e-5 |
zen | 3 | 16 | 40 | 3e-5 |
macbert-base | 3 | 16 | 40 | 2e-5 |
macbert-large | 3 | 16 | 40 | 2e-5 |
PCL-MedBERT | 3 | 16 | 40 | 3e-5 |
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 30 | 3e-5 |
bert-wwm-ext | 3 | 16 | 30 | 3e-5 |
roberta-wwm-ext | 3 | 16 | 30 | 3e-5 |
roberta-wwm-ext-large | 3 | 16 | 30 | 3e-5 |
roberta-large | 3 | 16 | 30 | 2e-5 |
albert-tiny | 3 | 16 | 30 | 5e-5 |
albert-xxlarge | 3 | 16 | 30 | 3e-5 |
zen | 3 | 16 | 30 | 2e-5 |
macbert-base | 3 | 16 | 30 | 2e-5 |
macbert-large | 3 | 16 | 30 | 2e-5 |
PCL-MedBERT | 3 | 16 | 30 | 2e-5 |
@inproceedings{zhang-etal-2022-cblue,
title = "{CBLUE}: A {C}hinese Biomedical Language Understanding Evaluation Benchmark",
author = "Zhang, Ningyu and
Chen, Mosha and
Bi, Zhen and
Liang, Xiaozhuan and
Li, Lei and
Shang, Xin and
Yin, Kangping and
Tan, Chuanqi and
Xu, Jian and
Huang, Fei and
Si, Luo and
Ni, Yuan and
Xie, Guotong and
Sui, Zhifang and
Chang, Baobao and
Zong, Hui and
Yuan, Zheng and
Li, Linfeng and
Yan, Jun and
Zan, Hongying and
Zhang, Kunli and
Tang, Buzhou and
Chen, Qingcai",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.544",
pages = "7888--7915",
abstract = "Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually offering great promise for medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification, and an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.",
}