
Feature: Custom NER Inference Pipeline #34

Merged
merged 41 commits into dev
Apr 4, 2021

Conversation

lalital
Contributor

@lalital lalital commented Mar 26, 2021

Issue: #32

Proposed solution:

  1. Pretokenize with PyThaiNLP's newmm tokenizer
  2. Retokenize with the subword tokenizer (SentencePiece)
  3. Map the prediction results of the subword tokens back to the tokens produced by newmm (see the sketch after this list)
  4. Return the prediction results at word level and chunk level
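
To make step 3 concrete, here is a minimal sketch of the subword-to-word mapping, assuming one predicted tag per subword and that the subwords concatenate back to the newmm tokens. `merge_subword_predictions` is an illustrative name, not the pipeline's actual API (the pipeline does this in a private `_merged_pred` method, per the commit log; its real signature may differ):

```python
from typing import List, Tuple

def merge_subword_predictions(
    words: List[str],         # tokens from PyThaiNLP's newmm tokenizer
    subwords: List[str],      # SentencePiece pieces, in the same order
    subword_tags: List[str],  # one predicted NER tag per subword
) -> List[Tuple[str, str]]:
    """Assign each newmm word the tag predicted for its first subword."""
    results = []
    i = 0  # cursor into the subword sequence
    for word in words:
        first_tag = subword_tags[i]
        consumed = ""
        # Consume subwords until their concatenation covers this word.
        while i < len(subwords) and len(consumed) < len(word):
            consumed += subwords[i].lstrip("▁")  # drop SentencePiece marker
            i += 1
        results.append((word, first_tag))
    return results
```

For example, `merge_subword_predictions(["กรุงเทพ"], ["▁กรุง", "เทพ"], ["B-LOC", "I-LOC"])` returns `[("กรุงเทพ", "B-LOC")]`.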

NER Pipeline Demo (via Colab): https://colab.research.google.com/drive/1-54NeM_wsjitaiSXfMBpcnqzbPMR0a9R#scrollTo=VzSGZbwWaiOI

Added files:

@lalital lalital changed the base branch from master to dev March 26, 2021 09:08
@lalital lalital closed this Mar 26, 2021
@lalital lalital reopened this Mar 26, 2021
@codecov

codecov bot commented Mar 26, 2021

Codecov Report

❗ No coverage uploaded for pull request base (dev@7611079).
The diff coverage is n/a.


@@          Coverage Diff           @@
##             dev      #34   +/-   ##
======================================
  Coverage       ?   89.65%           
======================================
  Files          ?        4           
  Lines          ?      580           
  Branches       ?        0           
======================================
  Hits           ?      520           
  Misses         ?       60           
  Partials       ?        0           


@cstorm125
Contributor

LGTM

@lalital lalital merged commit 22a8337 into dev Apr 4, 2021
lalital added a commit that referenced this pull request Jul 13, 2021
* Patch/issue 28 tokenizers package conflict (#30)

* Change the version of required packages
in order to avoid a package conflict issue
(as reported in issue #28)

Reference (transformers required packages): https://github.com/huggingface/transformers/blob/v3.5.0/setup.py#L130

* Change the dev release version
from 0.1.0dev2 to 0.1.1dev0

* Unpin the version of pandas

* Bump the version from 0.1.1dev0 to 0.1.1dev1

* GitHub workflow (#35)

* Add GitHub workflow unittest

* Format YAML

* Create blank.yml (#36)

* Gh workflow (#37)

* Add GitHub workflow unittest

* Format YAML

* Rename file

* GitHub workflow (#38)

* Add GitHub workflow unittest

* Format YAML

* Rename file

* Rename file from testing.yml to unittest

* Remove duplicated library name `datasets`

* Change the version of sentencepiece from 0.1.94 to 0.1.91

* Change the version of tokenizers from 0.9.4 to 0.9.3

transformers 3.5.0 depends on tokenizers==0.9.3

* Delete testing.yml

* Delete blank.yml

* Update unittest.yml

* add workable qa notebook

* add squad_newmm metric

* add evaluation functions; minor fixes to normalize_answers

* refactor prepare_qa_xxxfeatures

* add notebook

* minor fix to notebook

* add qa training script

* run notebook for good output

* change model_max_length to optional

* model_max_length 512 to 416

* Rename filename and module name from `unittest` to `test`

* Update test.yml

* Add argument local_rank

* Update condition

* Add return statement

* Specify seed to torch and numpy

* Set torch.backends.cudnn to be deterministic

* Import module

* add combine_iapp_thaiqa.py

* Feature: Custom NER Inference Pipeline (#34)

* Ignore tmp directory

* Implement TokenClassificationPipeline (NER) and add test cases

* Edit incorrect assertion

* Edit incorrect assertion

* Remove overspecified condition

* Remove overlapped condition (strict=False)

* Initialize TokenClassificationPipeline instance
in setUp() method

* Refer to self.base_pipeline intialized
in setUp() method

* Add test case for `_merged_pred` private method

* Add feature for multiple-sentence inference,
and add test cases

Currently, NER model inference is performed sequentially.

* Fix rule for strict group_entities, replace all -I to O

* Fix rule for strict group_entities, replace all -I to O
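
For context, the strict rule above can be read as: an I- tag that does not continue a preceding tag of the same entity type is demoted to O. A minimal sketch under that reading (the function name and exact rule are assumptions, not the pipeline's code):

```python
def strictify(tags):
    """Demote I- tags that do not continue an entity of the same type."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag is only valid right after B-/I- of the same type.
            if prev not in (f"B-{entity_type}", f"I-{entity_type}"):
                tag = "O"
        fixed.append(tag)
        prev = tag
    return fixed

# strictify(["I-PER", "B-ORG", "I-ORG", "I-LOC"])
# -> ["O", "B-ORG", "I-ORG", "O"]
```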

* Handle when 'O' is at the beginning of a sentence

* Remove unused code

* Edit assertion

* Remove debugging messages

* Replace special symbol `space_token` with space " "

* Add test case

* Fix error: wrong operator specified (should be =, not +=)

* Edit assertion

* Support two types of IOB prefix

B-, B_, I-, and I_

* Refer to variable instead of hardcoded IOB prefix

* Add test case for another type of IOB prefix

I and B followed by an underscore (e.g. I_MEA, B_MEA)
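
A hedged sketch of how both prefix styles could be parsed uniformly; the regex and helper name are assumptions, not the actual implementation (the E prefix is included because a later commit adds BIOE support):

```python
import re

# Accept both "-" and "_" as the delimiter between the IOB(E) prefix
# and the entity type, e.g. "B-PER", "I-PER", "B_MEA", "I_MEA".
TAG_RE = re.compile(r"^([BIEO])[-_](.+)$")

def split_tag(tag: str):
    if tag == "O":  # the non-entity tag has no delimiter
        return "O", None
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"Unrecognized tag format: {tag}")
    return match.group(1), match.group(2)

# split_tag("I_MEA") -> ("I", "MEA"); split_tag("B-PER") -> ("B", "PER")
```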

* Add required library

* Specify version of seqeval

* Add test case for model trained on LST20

* Add additional class argument to
handle custom tag_delimeter and BIOE tag scheme

Use seqeval to group entities
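
As a minimal example of what seqeval's grouping provides (shown here via the stable `get_entities` helper; the pipeline itself references `seqeval.scheme.Entity` objects, which carry the same information):

```python
from seqeval.metrics.sequence_labeling import get_entities

tags = ["B-PER", "I-PER", "O", "B-LOC"]
# Returns (entity_type, start_index, end_index) tuples with
# inclusive indices into the tag sequence.
print(get_entities(tags))  # [('PER', 0, 1), ('LOC', 3, 3)]
```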

* Fix incorrect reference to seqeval.scheme.Entity
object

* Convert attribute in Entity object to a tuple

* Edit test sentence

* Add @unittest.skip

* Remove @unittest.skip

* Edit test case

* Set pipeline.strict to True

* Output non-entity tag, 'O'

* Remove debugging message

* Remove @unittest.skip('not implement')

* Add new test cases for BIOE tag (LST20)

* Add condition to handle non strict entity grouping

* Add script for language model finetuning on XNLI dataset (Thai sentence pairs) (#42)

* Ignore tmp directory

* Add language model finetuning script on XNLI dataset
(only Thai sentence pairs)

* add allow_no_answer flag

* edit qa training script to include allow_no_answer flag

* convert train_question_answering_lm_finetuning.ipynb to colab version

* feature: token classification pipeline, POS tagging (#46)

* refactor: rename class attributes

- base_pipeline -> thainer_ner_pipeline
- lst20_base_pipeline -> lst20_ner_pipeline

* Perform entity grouping only if `scheme` is specified

* Change condition

from `and or self.scheme != None:`
to `or self.scheme == None:`

* Add test case for POS tagging with
finetuned `wangchanberta-base-att-spm-uncased` on
LST20 corpus (POS)

* Add option for text file

* Use swifter

* Update debug message

* Set default value to False

* Add tqdm

* Change version

* Add adam beta args

* Fix error

* Change argument name to evaluation_strategy

* Add deepspeed argument

* Add deepspeed config

* Move prediction_loss_only arg to TrainingArguments

* Add run name

* Add train_micro_batch_size_per_gpu

* Set total_num_steps to 50k

* Set amp

* Remove amp

* Set default value to None

* Remove deepspeed

* Add deepspeed

* Change gradient_accumulation_steps, total_num_step, warmup_num_steps

* Change zero_optimization to stage 2

* Divide parameters by 8

* Revert

* Add new config file (ds) that
compensates for the global step
(divide total_num_steps by 4, 24000 / 4 = 6000)

* Rename config file,

- Adjust Adam epsilon to 1e-6

* Load pretrained model via from_pretrained

* Load model from checkpoint with from_pretrained

* Add resume_from_checkpoint

* Add zero-3 config

* Change train_micro_batch_size_per_gpu from 128 to 64

* Add zero optimization stage 2 configuration

* Update config zero 3

* Change bz to 32

* Change bz to 32

* Change bz to 32

* Rename file

* Change optimizer type to Adam

* Change warmup_num_steps to 2400 from 1250

* Change max LR

* Change Max LR

* Change max LR

* Add config for 1cycle LR

* Set cycle_max_mom to 0.999

* Set decay_step_size to 0

* Change cycle_first_step_size and cycle_second_step_size

* Set cycle_max_mom to 0.99

* Set cycle_max_mom to 0.9

* Add new DS config (max step = 50k, warmup = 5k)

* Change beta2 to 0.98

* Set max steps to 31250

* Rename file

* Change cycle_second_step_size

* Pass train_max_length and eval_max_length to MLMDataset instance

* Remove redundant argument passing

* Add zero-3 config

* Rename file

* Update config, change batch size to 64

* Change max LR

* Change peak LR

* Do not offload param

* Add new config
- train_batch_size = 8064
- train_micro_batch_size_per_gpu = 48
- gradient_accumulation_steps = 21
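
For reference, DeepSpeed requires these three values to satisfy train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size; the numbers above imply an 8-GPU run (an inference from the arithmetic, not stated in the commit):

```python
# DeepSpeed batch-size invariant; world_size = 8 is inferred, not stated.
micro_bz, grad_acc, world_size = 48, 21, 8
assert micro_bz * grad_acc * world_size == 8064
```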

* Add config: bz=40, grad_acc=25, train_batch_size=8000

* Add new config

* Change bz to 44 and peak LR to 3e-4

* Change beta2 to 0.99

* Change beta2 to 0.999

* Change config

* Add new config

* Fix bz

* Rename file

* Add config for 8 GPUs

* Fix incorrect value of gradient_accumulation_steps

* Fix decay_step_size and decay_lr_rate

* Change LR

* Change

* Update config

* Add config for thwiki+news pretraining

* Set decay_lr_rate to 0

* Change bz

* Add ds_legal-bert-v3 config

- Set cycle_min_lr to 3e-8 instead of 0.0
- Set cycle_min_mom to 0.9 instead of 0.85

* Warm up for the first 5,000 steps, then linearly
decay to 3e-8 over 45,000 steps
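
The schedule described above is piecewise linear; here is an illustrative sketch of the resulting learning rate per step (the peak LR of 3e-4 is taken from an earlier commit in this list, and the function merely stands in for DeepSpeed's scheduler config):

```python
def lr_at(step, peak_lr=3e-4, min_lr=3e-8, warmup=5_000, decay=45_000):
    """Linear warmup to peak_lr, then linear decay to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    fraction = min((step - warmup) / decay, 1.0)
    return peak_lr + (min_lr - peak_lr) * fraction

# lr_at(0) == 0.0, lr_at(5_000) == 3e-4, lr_at(50_000) == 3e-8
```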

* Add MLM pretraining script that supports pretraining any model architecture.

This script will substitute `train_mlm_camembert_thai.ddp.py` and
`train_mlm_camemberta_thai.py`, as those are only applicable to RoBERTa pretraining
and their ArgumentParser arguments have to be added manually
to match the new version of transformers

* Add arguments for DataTrainingArguments as follows.
- train_max_length
- eval_max_length
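
A minimal sketch of the dataclass-plus-HfArgumentParser pattern such a script typically uses; the two field names come from the commit message, while the defaults and overall shape are assumptions:

```python
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class DataTrainingArguments:
    # Field names from the commit message; defaults are illustrative.
    train_max_length: int = field(default=512)
    eval_max_length: int = field(default=512)

# TrainingArguments contributes the standard flags (e.g. --output_dir,
# which is required on the command line).
parser = HfArgumentParser((DataTrainingArguments, TrainingArguments))
data_args, training_args = parser.parse_args_into_dataclasses()
```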

* Set default value of `do_lower_case` to False

* Call main function

* Test if trainer is process zero with is_world_process_zero

* Initialize CamembertTokenizer

* Initialize model from Config with
`from_config` class method

* Fix typo, change from `binarized_path_val`
 to `binarized_path_eval`

* Add new config for deberta-base pretraining on thwiki+news

* Set beta2 to 0.999

* Change batch size

* Rename file

* Add config with bz=40

* Rename file

* Change bz to 32, effective bz to 4096

* Add configuration with effective bz = 4080,
and per-device bz = 34

* Add new config

- "train_batch_size": 4032,
-  "train_micro_batch_size_per_gpu": 28,
-  "gradient_accumulation_steps": 18,

* Add config  for effective bz = 4032
   "train_batch_size": 4032,
    "train_micro_batch_size_per_gpu": 24,
    "gradient_accumulation_steps": 21,

* Rename file

* set logging level to debug

* Add debugging message

* Print debugging message only on main process

* Add new config with effective bz of 4032 for 4x GPUs

* Set a number of zero_optimization parameters to True
- cpu_offload
- contiguous_gradients
- overlap_comm

* Rename file

* Change bz

* Update bz

* Rename

* Change bz

* Add iapp_thaiqa dataset directory to .gitignore

* Implement DataCollatorForSpanLevelMask for
span-level masking

* Add option to choose masking strategy, either subword-level
or span-level
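
A rough sketch of span-level mask selection (contiguous spans instead of independently chosen subwords); the span-length sampling and all names are assumptions, not the collator's actual logic:

```python
import random

def choose_span_mask(num_tokens, mlm_probability=0.15, max_span_len=5):
    """Pick contiguous spans until ~mlm_probability of positions are covered."""
    budget = max(1, int(round(num_tokens * mlm_probability)))
    masked = set()
    while len(masked) < budget:
        span_len = random.randint(1, max_span_len)
        start = random.randrange(num_tokens)
        for i in range(start, min(start + span_len, num_tokens)):
            masked.add(i)
    return sorted(masked)

# e.g. choose_span_mask(100) might return [37, 38, 39, 71, 72, ...]
```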

* Fix type checking

* Fix incorrect data structure for metadata field

* Access value of enum variable

* Access the value of enum variable

* Edit

* Fix typo

* Add argument to specify the symbol representing the space token

* Get the vocab_size from what the tokenizer actually loaded,
which includes additional_special_tokens

* Pass vocab_size to AutoConfig.from_pretrained to
override the default vocab_size
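
The pattern described in the two commits above, sketched with the transformers API (the model path and extra token are placeholders):

```python
from transformers import AutoConfig, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained(
    "model-name-or-path",  # placeholder
    additional_special_tokens=["<extra_token>"],  # illustrative
)
# len(tokenizer) counts added special tokens; tokenizer.vocab_size does not.
config = AutoConfig.from_pretrained(
    "model-name-or-path", vocab_size=len(tokenizer)
)
```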

* Fix data structure accessing error and
remove debugging message

* Assign additional_special_tokens in  CamembertTokenizer.from_pretrained

* Wrap to torch.LongTensor

* Set the DataCollatorForSpanLevelMask to not
perform token masking
on "pad_token"

* Pass pad_to_multiple_of to the super class (DataCollatorForLanguageModeling)

* Fix logical error: the _mask_tokens function
did not exclude special tokens from token masking.

The cause is an incorrect statement at L77,
```indices = [i for i in range(len(tokens)) if tokens[i] not in self.special_token_ids]```
The left operand `tokens[i]` is a word token (str) while
the right operand is a list of token IDs (List[int]),
so the membership test never matches and no position is excluded
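
One way to make the comparison consistent is to compare IDs with IDs; this is an illustrative fix, not necessarily the patch that was applied, and `tokenizer` / `self.special_token_ids` come from the collator's context:

```python
# Convert tokens (str) to IDs (int) before testing membership in
# self.special_token_ids (List[int]).
token_ids = tokenizer.convert_tokens_to_ids(tokens)
indices = [
    i for i, token_id in enumerate(token_ids)
    if token_id not in self.special_token_ids
]
```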

* Change dataclass init function

* Limit the upper bound of num_to_predict by the
total number of input tokens
(excluding special_tokens)
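
In code form, the bound amounts to a clamp like the following (all names are illustrative):

```python
# Never plan to mask more positions than there are maskable
# (non-special) tokens in the input.
num_to_predict = min(
    max_num_predictions,
    max(1, int(round(len(candidate_indices) * mlm_probability))),
    len(candidate_indices),
)
```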

* Fix incorrect num_to_predict

* Add DeBERTa base config

* Move DS config to another directory

* Remove unused config file

* Rename directory

* Remove config file

* Edit script to rename the output config file as specified in the -o argument

* Add DeBERTa v1 config files

Co-authored-by: cstorm125 <[email protected]>
Co-authored-by: Charin <[email protected]>
Co-authored-by: Charin <[email protected]>