
Tokenizer not splitting by infix in some cases #13084

Closed
borjafdezgauna opened this issue Oct 25, 2023 · 2 comments
Labels: feat / tokenizer (Feature: Tokenizer)

Comments

@borjafdezgauna

How to reproduce the behaviour

I would like the tokenizer to split on nearly any punctuation symbol, but I am running into problems in some odd cases.

I initialize the tokenizer this way:

    import re
    from spacy.tokenizer import Tokenizer
    from spacy.util import compile_prefix_regex, compile_suffix_regex

    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes).search
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes).search
    infix_re = re.compile(r'[-—-−‐/:;,\.\(\)\?ºª“”]').finditer

    tokenizer = Tokenizer(nlp.vocab, prefix_search=prefix_re, suffix_search=suffix_re,
                          infix_finditer=infix_re, rules=None)

But although the dot is set as an infix, I get this:

    tokenizer.explain('Alonso A .2014. Membrane')
    [('TOKEN', 'Alonso'), ('TOKEN', 'A'), ('TOKEN', '.2014'), ('SUFFIX', '.'), ('TOKEN', 'Membrane')]

I can't understand why '.2014' comes out as a single token rather than being split into '.' and '2014'.

Is there something weird going on there, or am I missing something? Any help is appreciated.

Your Environment

  • Operating System: Windows 10
  • Python Version Used: 3.11.4
  • spaCy Version Used: 3.6.1
@adrianeboyd added the feat / tokenizer label Oct 25, 2023
@adrianeboyd
Contributor

The infix matching skips matches that start at index 0 in the token string. Could you match this as a prefix instead (probably still in addition to the infix matching)?
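
Following that hint, here is a minimal sketch of the workaround (illustrative, not from the thread itself; it assumes the nlp and tokenizer objects from the snippet above, and punct_class is just a name for the same character class): compile the punctuation class into the prefixes as well.

    from spacy.util import compile_prefix_regex

    # Assumption: reuse the same punctuation class as a prefix pattern, so a
    # match at index 0 (e.g. the leading '.' in '.2014') is stripped as a
    # prefix instead of being skipped by the infix pass.
    punct_class = r'[-—-−‐/:;,\.\(\)\?ºª“”]'
    prefix_re = compile_prefix_regex(list(nlp.Defaults.prefixes) + [punct_class]).search
    tokenizer.prefix_search = prefix_re

With that change, tokenizer.explain('Alonso A .2014. Membrane') should report the leading '.' as a PREFIX and '2014' as a separate TOKEN, while punctuation in the middle of a token is still handled by the infix pattern.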

@adrianeboyd
Contributor

Let me convert this to a discussion...

@explosion locked and limited conversation to collaborators Oct 25, 2023
@adrianeboyd converted this issue into discussion #13085 Oct 25, 2023
