add new CentralEuropeanStreetNameClassifier #88

missinglink · 2020-04-17T17:46:03Z

Adds a new CentralEuropeanStreetNameClassifier which is able to handle the cases mentioned in #83.

It's still fairly basic, but relatively safe.

In the future we may consider expanding this to cover:

- more than one unclassified span before the housenumber
- the inverted order of 1 xxx instead of xxx 1 (although this might be dangerous?)

closes: #83
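
As a rough illustration of the heuristic described in this PR, the sketch below tags a lone alphabetic word as a street when it is directly followed by a housenumber in a two-word section. It is a minimal sketch rather than the actual classifier: the plain-object word model, the `findCentralEuropeanStreet` and `makeWord` helpers, and the `STREET_CONFIDENCE` constant are all hypothetical names, and the 0.5 value is simply taken from the `street 0.50` score visible in the CLI output later in this thread.

```javascript
// Hypothetical plain-object model of a section's words; this is NOT the
// parser's real Span/Classification API.
const STREET_CONFIDENCE = 0.5; // assumption: mirrors the `street 0.50` CLI score

// True when the word carries a classification with the given label.
function hasLabel(word, label) {
  return word.classifications.some((c) => c.label === label);
}

// Returns a street classification for the first word, or null when the
// "xxx 1" pattern does not match.
function findCentralEuropeanStreet(sectionWords) {
  // only consider sections of exactly two words, e.g. "Esplanade 17"
  if (sectionWords.length !== 2) return null;

  const [first, second] = sectionWords;
  if (!hasLabel(first, 'alpha')) return null;
  if (!hasLabel(second, 'housenumber')) return null;

  return { body: first.body, label: 'street', confidence: STREET_CONFIDENCE };
}

// Small helper to build test words (hypothetical).
const makeWord = (body, ...labels) => ({
  body,
  classifications: labels.map((label) => ({ label })),
});

console.log(
  findCentralEuropeanStreet([
    makeWord('Esplanade', 'alpha'),
    makeWord('17', 'numeric', 'housenumber'),
  ])
);
```

Restricting the match to two-word sections keeps the rule conservative, at the cost of missing multi-word street names.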

Joxit · 2020-04-20T09:46:37Z

You are using a section classifier and forcing the length to 2, which definitely reduces side effects 👍.

But we should be careful with words and phrases. In your PR the Alpha member should not have a public classification, which is good IMO. But the section is composed of words... and one word can also be a phrase (#47).
Here the word Paris is classified as Alpha, but the phrase is classified as Locality... Theoretically this would mean that CentralEuropeanStreetNameClassifier should not classify it 😕
It's OK for now because the confidence is low; this is a reminder for me 😅

$ node bin/cli.js Paris 75000, France

master:

================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT                           ➜  Paris 75000, France
SECTIONS                        ➜   Paris 75000   0:11    France  12:19 
S0 TOKENS                       ➜   Paris  0:5   7500  6:10 
S1 TOKENS                       ➜   France  13:19 
S0 PHRASES                      ➜   Paris 7500  0:10   Paris  0:5   7500  6:10 
S1 PHRASES                      ➜   France  13:19 

================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris                           ➜   alpha  1.00   start_token  1.00  
75000                           ➜   numeric  1.00   housenumber  0.90   postcode  1.00  
France                          ➜   alpha  1.00   end_token  1.00  

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris                           ➜   given_name  1.00   surname  1.00   area  1.00   locality  1.00  
France                          ➜   given_name  1.00   surname  1.00   area  1.00   country  0.90  

================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]

central_european_streets:

================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT                           ➜  Paris 75000, France
SECTIONS                        ➜   Paris 75000   0:11    France  12:19 
S0 TOKENS                       ➜   Paris  0:5   7500  6:10 
S1 TOKENS                       ➜   France  13:19 
S0 PHRASES                      ➜   Paris 75000  0:10   Paris  0:5   7500  6:10 
S1 PHRASES                      ➜   France  13:19 

================================================================
CLASSIFICATIONS (6ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris                           ➜   alpha  1.00   start_token  1.00   street  0.50  
75000                           ➜   numeric  1.00   housenumber  0.90   postcode  1.00  
France                          ➜   alpha  1.00   end_token  1.00  

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris                           ➜   given_name  1.00   surname  1.00   area  1.00   locality  1.00  
France                          ➜   given_name  1.00   surname  1.00   area  1.00   country  0.90  

================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]

(0.79) ➜ [ { street: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]

(0.77) ➜ [ { street: 'Paris' },
  { housenumber: '75000' },
  { country: 'France' } ]

missinglink · 2020-04-20T10:20:11Z

Yeah agreed, it should ensure that the tokens have no public classifications at all.
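
A check along these lines might look like the sketch below. It assumes each classification carries a boolean `public` flag; the plain objects and the `hasPublicClassification` and `isUnclassified` helpers are hypothetical names, not the parser's actual API. It also consults any phrase covering exactly the same characters as the word, which is the `Paris` case raised above. Note that a later comment in this thread reports that checking parent phrases broke a test, so treat this purely as an illustration.

```javascript
// Assumption: each classification is a plain object with a `label` and a
// boolean `public` flag (true for labels such as locality or postcode,
// false for internal ones such as alpha or start_token).
function hasPublicClassification(span) {
  return span.classifications.some((c) => c.public === true);
}

// A word is only a street candidate when neither the word itself nor any
// phrase spanning exactly the same characters has a public classification.
function isUnclassified(word, phrases) {
  if (hasPublicClassification(word)) return false;
  return !phrases.some(
    (p) =>
      p.start === word.start &&
      p.end === word.end &&
      hasPublicClassification(p)
  );
}

const parisWord = {
  start: 0,
  end: 5,
  classifications: [{ label: 'alpha', public: false }],
};
const parisPhrase = {
  start: 0,
  end: 5,
  classifications: [{ label: 'locality', public: true }],
};

console.log(isUnclassified(parisWord, []));            // true
console.log(isUnclassified(parisWord, [parisPhrase])); // false
```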

missinglink · 2020-04-20T10:27:19Z

It's a really tricky case to handle without a gazetteer and/or a geocoder.

There is a street I cycle past quite often called Esplanade, and I'm wondering how we will ever be able to correctly parse those addresses, e.g. Esplanade 17, 13187 Berlin, Germany

missinglink · 2020-04-20T10:29:45Z

Maybe we should also add a check that the housenumber span doesn't have a postcode classification as well.
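
One way to express that guard, as a minimal sketch: the span objects below are hypothetical plain objects whose labels and confidences mirror the CLI output above, and `isUnambiguousHousenumber` is an illustrative name rather than anything in the parser.

```javascript
// Assumption: a span is a plain object with a `classifications` array of
// `{ label, confidence }` entries, mirroring the CLI output in this thread.
function spanHasLabel(span, label) {
  return span.classifications.some((c) => c.label === label);
}

// Accept a span as the housenumber half of "street housenumber" only when
// it is not also classified as a postcode.
function isUnambiguousHousenumber(span) {
  return spanHasLabel(span, 'housenumber') && !spanHasLabel(span, 'postcode');
}

const seventeen = {
  body: '17',
  classifications: [
    { label: 'numeric', confidence: 1.0 },
    { label: 'housenumber', confidence: 1.0 },
  ],
};
const parisPostcode = {
  body: '75000',
  classifications: [
    { label: 'numeric', confidence: 1.0 },
    { label: 'housenumber', confidence: 0.9 },
    { label: 'postcode', confidence: 1.0 },
  ],
};

console.log(isUnambiguousHousenumber(seventeen));     // true
console.log(isUnambiguousHousenumber(parisPostcode)); // false
```

With a guard like this, `Paris 75000` would no longer yield a street candidate, while `Esplanade 17` would still match.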

missinglink · 2020-04-23T10:45:09Z

Joxit · 2020-04-23T15:18:24Z

Nice, your PR seems to work for Esplanade too! (Esplanade is a street prefix in French.)

$ node bin/cli.js Esplanade 17, 13187 Berlin, Germany

================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT                           ➜  Esplanade 17, 13187 Berlin, Germany
SECTIONS                        ➜   Esplanade 17  0:12    13187 Berlin  13:26    Germany  27:35 
S0 TOKENS                       ➜   Esplanade  0:9   17  10:12 
S1 TOKENS                       ➜   13187  14:19   Berlin  20:26 
S2 TOKENS                       ➜   Germany  28:35 
S0 PHRASES                      ➜   Esplanade 17  0:12   Esplanade  0:9   17  10:12 
S1 PHRASES                      ➜   13187 Berlin  14:26   13187  14:19   Berlin  20:26 
S2 PHRASES                      ➜   Germany  28:35 

================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Esplanade                       ➜   alpha  1.00   start_token  1.00   street_prefix  1.00   street  0.50  
17                              ➜   numeric  1.00   housenumber  1.00  
13187                           ➜   numeric  1.00   housenumber  0.20   postcode  1.00  
Berlin                          ➜   alpha  1.00  
Germany                         ➜   alpha  1.00   end_token  1.00  

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Berlin                          ➜   surname  1.00   area  1.00   locality  1.00   region  1.00  
Germany                         ➜   area  1.00   country  0.90  

================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { locality: 'Berlin' },
  { country: 'Germany' } ]

(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { region: 'Berlin' },
  { country: 'Germany' } ]

missinglink · 2020-04-24T14:06:11Z

I just added two more test cases.
I also added some code to check the parent phrases but it caused one test to fail, so I'm thinking we just leave it as-is for now?

missinglink mentioned this pull request Apr 17, 2020

Parsing Czech Republic addresses #83

Closed

Joxit approved these changes Apr 20, 2020

missinglink added 2 commits April 24, 2020 15:42

feat(central_european_streets): add new CentralEuropeanStreetNameClassifier

c6db8fa

test(central_european_streets): add additional tests

ae0aa7b

missinglink force-pushed the central_european_streets branch from 39e3d29 to ae0aa7b Compare April 24, 2020 14:04

missinglink merged commit 9581567 into master Apr 24, 2020

missinglink deleted the central_european_streets branch April 24, 2020 14:54

Joxit mentioned this pull request Apr 30, 2020

Add support for unit type numbered #87

Merged
