Fine-tuning a base model for a token classification task (postal address parsing)
Steps
- Step 0: get millions of postal addresses from openaddresses.io (loading sketch after this list)
- Step 1: select the columns of interest and create a train / validation / test split (split sketch below)
- Step 2: "de-normalize" the postal addresses so that they better resemble the free-form addresses we get in real life (noising sketch below)
- Step 3: fine-tune a fill-mask base model for token classification; e.g. base_model = google-bert/bert-base-multilingual-uncased (training sketch below)
- Step 4: demo time on Hugging Face Spaces: Didier/Postal_address_canada_parsing (inference sketch below)
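A minimal loading sketch for Step 0. OpenAddresses distributes data as CSV files; the file path below is hypothetical, and the column names reflect the usual OpenAddresses CSV layout:

```python
import pandas as pd

# Hypothetical path to one OpenAddresses CSV export.
# OpenAddresses CSVs typically expose: LON, LAT, NUMBER, STREET, UNIT,
# CITY, DISTRICT, REGION, POSTCODE, ID, HASH.
df = pd.read_csv("data/ca/on/toronto.csv", dtype=str)

print(f"{len(df)} addresses loaded")
print(df.head())
```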
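A split sketch for Step 1 using the `datasets` library; the kept columns and the 80/10/10 proportions are illustrative choices:

```python
from datasets import Dataset, DatasetDict

# Keep only the address fields needed for parsing; drop incomplete rows.
columns = ["NUMBER", "STREET", "UNIT", "CITY", "REGION", "POSTCODE"]
df = df[columns].dropna(subset=["STREET", "CITY"])

ds = Dataset.from_pandas(df, preserve_index=False)

# 80% train, 10% validation, 10% test.
split = ds.train_test_split(test_size=0.2, seed=42)
valid_test = split["test"].train_test_split(test_size=0.5, seed=42)
dataset = DatasetDict({
    "train": split["train"],
    "validation": valid_test["train"],
    "test": valid_test["test"],
})
```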
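A noising sketch for Step 2. The exact perturbations are not spelled out above, so the ones below (street-type abbreviations, casing changes, dropped postal codes) are assumptions about what "de-normalize" means; the function also emits per-word tags so the result feeds directly into token classification:

```python
import random

# Illustrative street-type abbreviations; a real mapping would be much larger.
ABBREV = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "DRIVE": "DR"}

def denormalize(example, rng=random):
    """Turn a clean record into (tokens, tags) resembling a typed address."""
    tokens, tags = [], []
    for field in ["NUMBER", "STREET", "UNIT", "CITY", "REGION", "POSTCODE"]:
        value = example.get(field)
        if not value:
            continue
        # Postal codes are often missing from real input: drop some.
        if field == "POSTCODE" and rng.random() < 0.3:
            continue
        for word in value.split():
            # Randomly abbreviate street types ("STREET" -> "ST").
            if rng.random() < 0.5:
                word = ABBREV.get(word, word)
            # Vary the casing: OpenAddresses records are often all-caps.
            if rng.random() < 0.5:
                word = word.title()
            tokens.append(word)
            tags.append(field)
    return {"tokens": tokens, "tags": tags}

dataset = dataset.map(denormalize, remove_columns=columns)
```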
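A condensed training sketch for Step 3, consuming the (tokens, tags) format above; hyperparameters are placeholders. The one essential detail is aligning word-level tags to BERT sub-word tokens via `word_ids()`, with special tokens and sub-word continuations labeled -100 so the loss ignores them:

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          Trainer, TrainingArguments)

base_model = "google-bert/bert-base-multilingual-uncased"
# "O" is reserved for words outside any field.
labels = ["O", "NUMBER", "STREET", "UNIT", "CITY", "REGION", "POSTCODE"]
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForTokenClassification.from_pretrained(
    base_model, num_labels=len(labels), label2id=label2id, id2label=id2label)

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, tags in enumerate(batch["tags"]):
        prev, ids = None, []
        for wid in enc.word_ids(batch_index=i):
            if wid is None or wid == prev:
                ids.append(-100)  # special token or sub-word continuation
            else:
                ids.append(label2id[tags[wid]])
            prev = wid
        all_labels.append(ids)
    enc["labels"] = all_labels
    return enc

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=["tokens", "tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments("address-parser", learning_rate=2e-5,
                           per_device_train_batch_size=32, num_train_epochs=2),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```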
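For Step 4, the Space wraps the fine-tuned model in a demo UI; locally, the trained checkpoint can be exercised with a standard token-classification pipeline (the model path below is the output directory from the training sketch above, not the Space's actual model repo):

```python
from transformers import pipeline

# "address-parser" is the output directory from the training sketch;
# the public demo lives at the Space Didier/Postal_address_canada_parsing.
parser = pipeline("token-classification", model="address-parser",
                  aggregation_strategy="simple")

print(parser("123 main st toronto on m5v 2t6"))
```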