
Tokenizer not splitting by infix in some cases #13084

Closed
borjafdezgauna opened this issue Oct 25, 2023 · 2 comments
Labels: feat / tokenizer (Feature: Tokenizer)

Comments

@borjafdezgauna

How to reproduce the behaviour

I would like the tokenizer to split on nearly any punctuation symbol, but I am running into problems in some odd cases.

I initialize the tokenizer this way:

    import re
    from spacy.tokenizer import Tokenizer
    from spacy.util import compile_prefix_regex, compile_suffix_regex

    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes).search
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes).search
    infix_re = re.compile(r'[-—-−‐/:;,\.\(\)\?ºª“”]').finditer

    tokenizer = Tokenizer(nlp.vocab, prefix_search=prefix_re, suffix_search=suffix_re,
                          infix_finditer=infix_re, rules=None)

But although the dot is set as an infix, I get this:

    tokenizer.explain('Alonso A .2014. Membrane')
    [('TOKEN', 'Alonso'), ('TOKEN', 'A'), ('TOKEN', '.2014'), ('SUFFIX', '.'), ('TOKEN', 'Membrane')]

I can't understand why '.2014' comes out as a single token rather than being split into '.' and '2014'.

Is there something weird going on there, or am I missing something? Any help is appreciated.

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.11.4
  • spaCy Version Used: 3.6.1
@adrianeboyd added the feat / tokenizer label Oct 25, 2023
@adrianeboyd
Contributor

The infix matching skips matches that start at index 0 in the token string. Could you match this as a prefix instead (probably still in addition to the infix matching)?
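
Following that hint, here is a minimal sketch of the workaround (illustrative, not from the thread itself; it assumes the nlp and tokenizer objects from the snippet above, and punct_class is just a name for the same character class): compile the punctuation class into the prefixes as well.

    from spacy.util import compile_prefix_regex

    # Assumption: reuse the same punctuation class as a prefix pattern, so a
    # match at index 0 (e.g. the leading '.' in '.2014') is stripped as a
    # prefix instead of being skipped by the infix pass.
    punct_class = r'[-—-−‐/:;,\.\(\)\?ºª“”]'
    prefix_re = compile_prefix_regex(list(nlp.Defaults.prefixes) + [punct_class]).search
    tokenizer.prefix_search = prefix_re

With that change, tokenizer.explain('Alonso A .2014. Membrane') should report the leading '.' as a PREFIX and '2014' as a separate TOKEN, while punctuation in the middle of a token is still handled by the infix pattern.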

@adrianeboyd
Contributor

Let me convert this to a discussion...

@explosion locked and limited conversation to collaborators Oct 25, 2023
@adrianeboyd converted this issue into discussion #13085 Oct 25, 2023
