
Feature: Custom NER Inference Pipeline #34

Merged
merged 41 commits into dev
Apr 4, 2021

Conversation

lalital
Contributor

@lalital lalital commented Mar 26, 2021

Issue: #32

Proposed solution:

  1. Pretokenize with PyThaiNLP's newmm tokenizer
  2. Retokenize with the subword tokenizer (SentencePiece)
  3. Map the prediction results of the subword tokens back to the tokens produced by newmm (see the sketch after this list)
  4. Return the prediction results at word level and chunk level
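
To make step 3 concrete, here is a minimal sketch of the subword-to-word mapping, assuming one predicted tag per subword and that the subwords concatenate back to the newmm tokens. `merge_subword_predictions` is an illustrative name, not the pipeline's actual API (the pipeline does this in a private `_merged_pred` method, per the commit log; its real signature may differ):

```python
from typing import List, Tuple

def merge_subword_predictions(
    words: List[str],         # tokens from PyThaiNLP's newmm tokenizer
    subwords: List[str],      # SentencePiece pieces, in the same order
    subword_tags: List[str],  # one predicted NER tag per subword
) -> List[Tuple[str, str]]:
    """Assign each newmm word the tag predicted for its first subword."""
    results = []
    i = 0  # cursor into the subword sequence
    for word in words:
        first_tag = subword_tags[i]
        consumed = ""
        # Consume subwords until their concatenation covers this word.
        while i < len(subwords) and len(consumed) < len(word):
            consumed += subwords[i].lstrip("▁")  # drop SentencePiece marker
            i += 1
        results.append((word, first_tag))
    return results
```

For example, `merge_subword_predictions(["กรุงเทพ"], ["▁กรุง", "เทพ"], ["B-LOC", "I-LOC"])` returns `[("กรุงเทพ", "B-LOC")]`.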

NER Pipeline Demo (via Colab): https://colab.research.google.com/drive/1-54NeM_wsjitaiSXfMBpcnqzbPMR0a9R#scrollTo=VzSGZbwWaiOI

Added files:

@lalital lalital changed the base branch from master to dev March 26, 2021 09:08
@lalital lalital closed this Mar 26, 2021
@lalital lalital reopened this Mar 26, 2021
@codecov

codecov bot commented Mar 26, 2021

Codecov Report

❗ No coverage uploaded for pull request base (dev@7611079).
The diff coverage is n/a.


@@          Coverage Diff           @@
##             dev      #34   +/-   ##
======================================
  Coverage       ?   89.65%           
======================================
  Files          ?        4           
  Lines          ?      580           
  Branches       ?        0           
======================================
  Hits           ?      520           
  Misses         ?       60           
  Partials       ?        0           


@cstorm125
Contributor

LGTM

@lalital lalital merged commit 22a8337 into dev Apr 4, 2021
lalital added a commit that referenced this pull request Jul 13, 2021
* Patch/issue 28 tokenizers package conflict (#30)

* Change the version of required packages
in order to avoid a package conflict issue
(as reported in issue #28)

Reference (transformers required packages): https://github.com/huggingface/transformers/blob/v3.5.0/setup.py#L130

* Change the dev release version
from 0.1.0dev2 to 0.1.1dev0

* Unpin the version of pandas

* Bump the version from 0.1.1dev0 to 0.1.1dev1

* GitHub workflow (#35)

* Add GitHub workflow unittest

* Format YAML

* Create blank.yml (#36)

* Gh workflow (#37)

* Add GitHub workflow unittest

* Format YAML

* Rename file

* GitHub workflow (#38)

* Add GitHub workflow unittest

* Format YAML

* Rename file

* Rename file from testing.yml to unittest

* Remove duplicated library name `datasets`

* Change the version of sentencepiece from 0.1.94 to 0.1.91

* Change the version of tokenizers from 0.9.4 to 0.9.3

transformers 3.5.0 depends on tokenizers==0.9.3

* Delete testing.yml

* Delete blank.yml

* Update unittest.yml

* add workable qa notebook

* add squad_newmm metric

* add evaluation functions; minor fixes to normalize_answers

* refactor prepare_qa_xxxfeatures

* add notebook

* minor fix to notebook

* add qa training script

* run notebook for good output

* change model_max_length to optional

* model_max_length 512 to 416

* Rename filename and module name from `unittest` to `test`

* Update test.yml

* Add argument local_rank

* Update condition

* Add return statement

* Specify seed to torch and numpy

* Set torch.backends.cudnn to be deterministic

* Import module

* add combine_iapp_thaiqa.py

* Feature: Custom NER Inference Pipeline (#34)

* Ignore tmp directory

* Implement TokenClassificationPipeline (NER) and add test cases

* Edit incorrect assertion

* Edit incorrect assertion

* Remove overspecified condition

* Remove overlapped condition (strict=False)

* Initialize TokenClassificationPipeline instance
in setUp() method

* Refer to self.base_pipeline intialized
in setUp() method

* Add test case for `_merged_pred` private method

* Add feature for multiple-sentence inference,
and add test cases

Currently, NER model inference is performed sequentially.

* Fix rule for strict group_entities, replace all -I to O

* Fix rule for strict group_entities, replace all -I to O
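
For context, the strict rule above can be read as: an I- tag that does not continue a preceding tag of the same entity type is demoted to O. A minimal sketch under that reading (the function name and exact rule are assumptions, not the pipeline's code):

```python
def strictify(tags):
    """Demote I- tags that do not continue an entity of the same type."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag is only valid right after B-/I- of the same type.
            if prev not in (f"B-{entity_type}", f"I-{entity_type}"):
                tag = "O"
        fixed.append(tag)
        prev = tag
    return fixed

# strictify(["I-PER", "B-ORG", "I-ORG", "I-LOC"])
# -> ["O", "B-ORG", "I-ORG", "O"]
```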

* Handle when 'O' is at the beginning of a sentence

* Remove unused code

* Edit assertion

* Remove debugging messages

* Replace special symbol `space_token` with space " "

* Add test case

* Fix error: wrong operator specified (should be =, not +=)

* Edit assertion

* Support two types of IOB prefix

B-, B_, I-, and I_

* Refer to variable instead of hardcoded IOB prefix

* Add test case for another type of IOB prefix

I and B followed by an underscore (e.g. I_MEA, B_MEA)
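
A hedged sketch of how both prefix styles could be parsed uniformly; the regex and helper name are assumptions, not the actual implementation (the E prefix is included because a later commit adds BIOE support):

```python
import re

# Accept both "-" and "_" as the delimiter between the IOB(E) prefix
# and the entity type, e.g. "B-PER", "I-PER", "B_MEA", "I_MEA".
TAG_RE = re.compile(r"^([BIEO])[-_](.+)$")

def split_tag(tag: str):
    if tag == "O":  # the non-entity tag has no delimiter
        return "O", None
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"Unrecognized tag format: {tag}")
    return match.group(1), match.group(2)

# split_tag("I_MEA") -> ("I", "MEA"); split_tag("B-PER") -> ("B", "PER")
```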

* Add required library

* Specify version of seqeval

* Add test case for model trained on LST20

* Add additional class argument to
handle custom tag_delimeter and BIOE tag scheme

Use seqeval to group entities
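
As a minimal example of what seqeval's grouping provides (shown here via the stable `get_entities` helper; the pipeline itself references `seqeval.scheme.Entity` objects, which carry the same information):

```python
from seqeval.metrics.sequence_labeling import get_entities

tags = ["B-PER", "I-PER", "O", "B-LOC"]
# Returns (entity_type, start_index, end_index) tuples with
# inclusive indices into the tag sequence.
print(get_entities(tags))  # [('PER', 0, 1), ('LOC', 3, 3)]
```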

* Fix incorrect reference to seqeval.scheme.Entity
object

* Convert attribute in Entity object to a tuple

* Edit test sentence

* Add @unittest.skip

* Remove @unittest.skip

* Edit test case

* Set pipeline.strict to True

* Output non-entity tag, 'O'

* Remove debugging message

* Remove @unittest.skip('not implement')

* Add new test cases for BIOE tag (LST20)

* Add condition to handle non strict entity grouping

* Add script for language model finetuning on XNLI dataset (Thai sentence pairs) (#42)

* Ignore tmp directory

* Add language model finetuning script on XNLI dataset
(only Thai sentence pairs)

* add allow_no_answer flag

* edit qa training script to include allow_no_answer flag

* convert train_question_answering_lm_finetuning.ipynb to colab version

* feature: token classification pipeline, POS tagging (#46)

* refactor: rename class attributes

- base_pipeline -> thainer_ner_pipeline
- lst20_base_pipeline -> lst20_ner_pipeline

* Perform entity grouping only if `scheme` is specified

* Change condition

from `and or self.scheme != None:`
to `or self.scheme == None:`

* Add test case for POS tagging with
finetuned `wangchanberta-base-att-spm-uncased` on
LST20 corpus (POS)

* Add option for text file

* Use swifter

* Update debug message

* Set default value to False

* Add tqdm

* Change version

* Add adam beta args

* Fix error

* Change argument name to evaluation_strategy

* Add deepspeed argument

* Add deepspeed config

* Move prediction_loss_only arg to TrainingArguments

* Add run name

* Add train_micro_batch_size_per_gpu

* Set total_num_steps to 50k

* Set amp

* Remove amp

* Set default value to None

* Remove deepspeed

* Add deepspeed

* Change gradient_accumulation_steps, total_num_step, warmup_num_steps

* Change zero_optimization to stage 2

* Divide parameters by 8

* Revert

* Add new config file (ds) that
compensates for the global step
(divide total_num_steps by 4, 24000 / 4 = 6000)

* Rename config file,

- Adjust Adam epsilon to 1e-6

* Load pretrained model via from_pretrained

* Load model from checkpoint with from_pretrained

* Add resume_from_checkpoint

* Add zero-3 config

* Change train_micro_batch_size_per_gpu from 128 to 64

* Add zero optimization stage 2 configuration

* Update config zero 3

* Change bz to 32

* Change bz to 32

* Change bz to 32

* Rename file

* Change optimizer type to Adam

* Change warmup_num_steps to 2400 from 1250

* Change max LR

* Change Max LR

* Change max LR

* Add config for 1cycle LR

* Set cycle_max_mom to 0.999

* Set decay_step_size to 0

* Change cycle_first_step_size and cycle_second_step_size

* Set cycle_max_mom to 0.99

* Set cycle_max_mom to 0.9

* Add new DS config (max step = 50k, warmup = 5k)

* Change beta2 to 0.98

* Set max steps to 31250

* Rename file

* Change cycle_second_step_size

* Pass train_max_length and eval_max_length to MLMDataset instance

* Remove redundant argument passing

* Add zero-3 config

* Rename file

* Update config, change batch size to 64

* Change max LR

* Change peak LR

* Do not offload param

* Add new config
- train_batch_size = 8064
- train_micro_batch_size_per_gpu = 48
- gradient_accumulation_steps = 21
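
For reference, DeepSpeed requires these three values to satisfy train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size; the numbers above imply an 8-GPU run (an inference from the arithmetic, not stated in the commit):

```python
# DeepSpeed batch-size invariant; world_size = 8 is inferred, not stated.
micro_bz, grad_acc, world_size = 48, 21, 8
assert micro_bz * grad_acc * world_size == 8064
```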

* Add config: bz=40, grad_acc=25, train_batch_size=8000

* Add new config

* Change bz to 44 and peak LR to 3e-4

* Change beta2 to 0.99

* Change beta2 to 0.999

* Change config

* Add new config

* Fix bz

* Rename file

* Add config for 8 GPUs

* Fix incorrect value of gradient_accumulation_steps

* Fix decay_step_size and decay_lr_rate

* Change LR

* Change

* Update config

* Add config for thwiki+news pretraining

* Set decay_lr_rate to 0

* Change bz

* Add ds_legal-bert-v3 config

- Set cycle_min_lr to 3e-8 instead of 0.0
- Set cycle_min_mom to 0.9 instead of 0.85

* Warm up for the first 5,000 steps, then linearly
decay to 3e-8 over 45,000 steps
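
The schedule described above is piecewise linear; here is an illustrative sketch of the resulting learning rate per step (the peak LR of 3e-4 is taken from an earlier commit in this list, and the function merely stands in for DeepSpeed's scheduler config):

```python
def lr_at(step, peak_lr=3e-4, min_lr=3e-8, warmup=5_000, decay=45_000):
    """Linear warmup to peak_lr, then linear decay to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    fraction = min((step - warmup) / decay, 1.0)
    return peak_lr + (min_lr - peak_lr) * fraction

# lr_at(0) == 0.0, lr_at(5_000) == 3e-4, lr_at(50_000) == 3e-8
```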

* Add MLM pretraining script that supports pretraining any model architecture.

This script will substitute `train_mlm_camembert_thai.ddp.py` and
`train_mlm_camemberta_thai.py`, as those are only applicable to RoBERTa pretraining
and their ArgumentParser arguments have to be added manually
to match the new version of transformers

* Add arguments for DataTrainingArguments as follows.
- train_max_length
- eval_max_length
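
A minimal sketch of the dataclass-plus-HfArgumentParser pattern such a script typically uses; the two field names come from the commit message, while the defaults and overall shape are assumptions:

```python
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class DataTrainingArguments:
    # Field names from the commit message; defaults are illustrative.
    train_max_length: int = field(default=512)
    eval_max_length: int = field(default=512)

# TrainingArguments contributes the standard flags (e.g. --output_dir,
# which is required on the command line).
parser = HfArgumentParser((DataTrainingArguments, TrainingArguments))
data_args, training_args = parser.parse_args_into_dataclasses()
```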

* Set default value of `do_lower_case` to False

* Call main function

* Test if trainer is process zero with is_world_process_zero

* Initialize CamembertTokenizer

* Initialize model from Config with
`from_config` class method

* Fix typo, change from `binarized_path_val`
 to `binarized_path_eval`

* Add new config for deberta-base pretraining on thwiki+news

* Set beta2 to 0.999

* Change batch size

* Rename file

* Add config with bz=40

* Rename file

* Change bz to 32, effective bz to 4096

* Add configuration with effective bz = 4080,
and per-device bz = 34

* Add new config

- "train_batch_size": 4032,
-  "train_micro_batch_size_per_gpu": 28,
-  "gradient_accumulation_steps": 18,

* Add config  for effective bz = 4032
   "train_batch_size": 4032,
    "train_micro_batch_size_per_gpu": 24,
    "gradient_accumulation_steps": 21,

* Rename file

* set logging level to debug

* Add debugging message

* Print debugging message only on main process

* Add new config with effective bz of 4032 for 4x GPUs

* Set a number of zero_optimization parameters to True
- cpu_offload
- contiguous_gradients
- overlap_comm

* Rename file

* Change bz

* Update bz

* Rename

* Change bz

* Add iapp_thaiqa dataset directory to .gitignore

* Implement DataCollatorForSpanLevelMask for
span-level masking

* Add option to choose masking strategy, either subword-level
or span-level
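
A rough sketch of span-level mask selection (contiguous spans instead of independently chosen subwords); the span-length sampling and all names are assumptions, not the collator's actual logic:

```python
import random

def choose_span_mask(num_tokens, mlm_probability=0.15, max_span_len=5):
    """Pick contiguous spans until ~mlm_probability of positions are covered."""
    budget = max(1, int(round(num_tokens * mlm_probability)))
    masked = set()
    while len(masked) < budget:
        span_len = random.randint(1, max_span_len)
        start = random.randrange(num_tokens)
        for i in range(start, min(start + span_len, num_tokens)):
            masked.add(i)
    return sorted(masked)

# e.g. choose_span_mask(100) might return [37, 38, 39, 71, 72, ...]
```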

* Fix type checking

* Fix incorrect data structure for metadata field

* Access value of enum variable

* Access the value of enum variable

* Edit

* Fix typo

* Add argument to specify the symbol representing the space token

* Get the vocab_size from what the tokenizer actually loaded,
which includes additional_special_tokens

* Pass vocab_size to AutoConfig.from_pretrained to
override the default vocab_size
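
The pattern described in the two commits above, sketched with the transformers API (the model path and extra token are placeholders):

```python
from transformers import AutoConfig, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained(
    "model-name-or-path",  # placeholder
    additional_special_tokens=["<extra_token>"],  # illustrative
)
# len(tokenizer) counts added special tokens; tokenizer.vocab_size does not.
config = AutoConfig.from_pretrained(
    "model-name-or-path", vocab_size=len(tokenizer)
)
```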

* Fix data structure accessing error and
remove debugging message

* Assign additional_special_tokens in  CamembertTokenizer.from_pretrained

* Wrap to torch.LongTensor

* Set the DataCollatorForSpanLevelMask to not
perform token masking
on "pad_token"

* Pass pad_to_multiple_of to the super class (DataCollatorForLanguageModeling)

* Fix logical error: the _mask_tokens function
did not exclude special tokens from token masking.

The cause is an incorrect statement at L77,
```indices = [i for i in range(len(tokens)) if tokens[i] not in self.special_token_ids]```
The left operand `tokens[i]` is a word token (str) while
the right operand is a list of token IDs (List[int]),
so the membership test never matches and no position is excluded
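
One way to make the comparison consistent is to compare IDs with IDs; this is an illustrative fix, not necessarily the patch that was applied, and `tokenizer` / `self.special_token_ids` come from the collator's context:

```python
# Convert tokens (str) to IDs (int) before testing membership in
# self.special_token_ids (List[int]).
token_ids = tokenizer.convert_tokens_to_ids(tokens)
indices = [
    i for i, token_id in enumerate(token_ids)
    if token_id not in self.special_token_ids
]
```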

* Change dataclass init function

* Limit the upper bound of num_to_predict by the
total number of input tokens
(excluding special_tokens)
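
In code form, the bound amounts to a clamp like the following (all names are illustrative):

```python
# Never plan to mask more positions than there are maskable
# (non-special) tokens in the input.
num_to_predict = min(
    max_num_predictions,
    max(1, int(round(len(candidate_indices) * mlm_probability))),
    len(candidate_indices),
)
```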

* Fix incorrect num_to_predict

* Add DeBERTa base config

* Move DS config to another directory

* Remove unused config file

* Rename directory

* Remove config file

* Edit script to rename the output config file as specified in the -o argument

* Add DeBERTa v1 config files

Co-authored-by: cstorm125 <[email protected]>
Co-authored-by: Charin <[email protected]>
Co-authored-by: Charin <[email protected]>