[BUG] Usage of vocab.items() raises an attribute error when vocab is a list of lists. #19

Open
Udayk02 opened this issue Jan 5, 2025 · 0 comments


Udayk02 commented Jan 5, 2025

Bug Description:

Loading "Cohere/Cohere-embed-multilingual-v3.0" with AutoTikTokenizer raises an AttributeError, because .items() is called on vocab, which is a list for this tokenizer rather than a dict.

Minimal Example:

from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")

Current behaviour:

    116 """Convert vocab to binary mergeable_ranks.
    117 
    118 Args:
   (...)
    123     mergeable_ranks (dict): The mergeable ranks of tokens in binary format.
    124 """
    125 mergeable_ranks = {}
--> 126 sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])
    127 for rank, (token, _) in enumerate(sorted_vocab, start=0):
    128     # Converting wordpiece to equivalent std BPE form
    129     if tokenizer_type == 'wordpiece':

AttributeError: 'list' object has no attribute 'items'

Workaround:

The error occurs because vocab is assumed to be a dictionary, but in the Cohere tokenizer instance vocab is a list of lists. vocab should go through a type check (and be converted if needed) before .items() is called on it.
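
A minimal sketch of such a type check, not the library's actual code. The helper name vocab_to_dict is hypothetical, and it assumes that every entry of a list-shaped vocab is a two-element [token, value] pair. This issue does not say what the second element of each pair represents for the Cohere tokenizer, so whether sorting on it gives the intended ranks would need to be checked separately.

```python
from collections.abc import Mapping, Sequence


def vocab_to_dict(vocab):
    """Hypothetical helper: return vocab as a dict whether it arrives
    as a mapping or as a list of [token, value] pairs."""
    if isinstance(vocab, Mapping):
        # Already dict-like: copy it so the caller's object is not mutated.
        return dict(vocab)

    if isinstance(vocab, Sequence) and not isinstance(vocab, (str, bytes)):
        # Assumption: each entry is a two-element pair such as ["hello", 0].
        is_pair_list = all(
            isinstance(pair, Sequence)
            and not isinstance(pair, (str, bytes))
            and len(pair) == 2
            for pair in vocab
        )
        if is_pair_list:
            return {token: value for token, value in vocab}

    raise TypeError(
        f"Unsupported vocab type: {type(vocab).__name__}; "
        "expected a dict or a list of [token, value] pairs"
    )


# With the check in place, the failing line from the traceback no longer
# depends on the shape vocab arrived in.
example_vocab = [["hello", 0], ["world", 1]]
sorted_vocab = sorted(vocab_to_dict(example_vocab).items(), key=lambda x: x[1])
```

Raising a TypeError for any other shape keeps an unexpected vocab format from failing later with a less obvious error.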

Udayk02 added a commit to Udayk02/autotiktokenizer that referenced this issue Jan 6, 2025
- type-casting the `vocab` into dict