
Support updating template processors #1652

Open · wants to merge 15 commits into main
Conversation

@ArthurZucker (Collaborator) commented Oct 14, 2024

Goal:

from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing
tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
tokenizer.post_processor 

tokenizer.post_processor[1] = TemplateProcessing(
    single="[CLS] $0 [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 0)],
)
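A minimal pure-Python sketch of the indexing behaviour this goal implies (the `SequenceProcessor` class below is a hypothetical stand-in for illustration, not the actual tokenizers binding):

```python
# Conceptual stand-in for a sequence post-processor whose items can be
# read and replaced by index, mirroring the tokenizer.post_processor[1] = ...
# usage above. All names here are illustrative, not the tokenizers API.

class SequenceProcessor:
    def __init__(self, processors):
        self._processors = list(processors)

    def __len__(self):
        return len(self._processors)

    def __getitem__(self, index):
        return self._processors[index]

    def __setitem__(self, index, processor):
        # Replacing an item mutates the sequence in place, which is the
        # behaviour the PR adds on the Rust side via shared, locked items.
        self._processors[index] = processor

seq = SequenceProcessor(["ByteLevel", "TemplateProcessing(old)"])
seq[1] = "TemplateProcessing(new)"
assert seq[1] == "TemplateProcessing(new)"
assert len(seq) == 2
```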

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@McPatate force-pushed the sequential-post-processor branch from 11533c5 to 4bb595b on January 14, 2025
@McPatate marked this pull request as ready for review on January 16, 2025
@McPatate (Member) left a comment


Self-approving here as it is your PR @ArthurZucker, waiting for your review before merging

@ArthurZucker (Collaborator, Author) left a comment


LGTM! Would just add Python tests! 😉

let's check set_item and also that get_item_ is mutable

@@ -14,7 +14,7 @@ serde = { version = "1.0", features = ["rc", "derive"] }
serde_json = "1.0"
libc = "0.2"
env_logger = "0.11"
-pyo3 = { version = "0.23", features = ["abi3", "abi3-py39"] }
+pyo3 = { version = "0.23", features = ["abi3", "abi3-py39", "py-clone"] }
@ArthurZucker (Collaborator, Author):

for this we probably need to rebase!

@McPatate (Member):

Don't think so, @Narsil removed it in f4c9fd7

@ArthurZucker (Collaborator, Author):

let's revert these as well!

PyNormalizerTypeWrapper::Single(ref inner) => match &*inner
.as_ref()
.read()
.map_err(|_| PyException::new_err("rwlock is poisoned"))?
@ArthurZucker (Collaborator, Author):

Suggested change:
-.map_err(|_| PyException::new_err("rwlock is poisoned"))?
+.map_err(|_| PyException::new_err("rwlock was poisoned when trying to get subtype of PyNormalizer"))?

@@ -218,7 +223,7 @@ macro_rules! getter {
($self: ident, $variant: ident, $name: ident) => {{
let super_ = $self.as_ref();
if let PyNormalizerTypeWrapper::Single(ref norm) = super_.normalizer {
-let wrapper = norm.read().unwrap();
+let wrapper = norm.read().expect("rwlock is poisoned");
@ArthurZucker (Collaborator, Author):

Same here, let's be a little bit more helpful.

Comment on lines +418 to +422
fn __len__(self_: PyRef<'_, Self>) -> usize {
match &self_.as_ref().normalizer {
PyNormalizerTypeWrapper::Sequence(inner) => inner.len(),
PyNormalizerTypeWrapper::Single(_) => 1,
}
@ArthurZucker (Collaborator, Author):

nice, thanks!
I think the issue you mentioned about length comes from here

@McPatate (Member):

Because it's a Single? Not sure; it should deserialize to an empty Sequence, but the type is Sequence with length 1 containing an NFKC.
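The `__len__` contract shown above can be modelled in plain Python (an illustrative sketch, not the actual binding): a Single wrapper always reports length 1, while a Sequence reports the length of its inner list, so a Sequence holding one NFKC normalizer also has length 1.

```python
# Toy model of the PyNormalizerTypeWrapper __len__ behaviour (illustrative
# only): ("single", item) -> 1, ("sequence", [items]) -> len(items).

def wrapper_len(wrapper):
    kind, payload = wrapper
    if kind == "sequence":
        return len(payload)
    return 1  # a single normalizer always counts as one

assert wrapper_len(("single", "NFKC")) == 1
assert wrapper_len(("sequence", ["NFKC"])) == 1   # the case discussed above
assert wrapper_len(("sequence", ["NFD", "Lowercase"])) == 2
```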

Comment on lines +451 to +453
.read()
.map_err(|_| PyException::new_err("rwlock is poisoned"))?
.clone();
@ArthurZucker (Collaborator, Author):

Ah, so we get a clone of the Rust object with the write lock, nice.

pub(crate) enum PyNormalizerTypeWrapper {
Sequence(Vec<Arc<RwLock<PyNormalizerWrapper>>>),
Single(Arc<RwLock<PyNormalizerWrapper>>),
}

impl<'de> Deserialize<'de> for PyNormalizerTypeWrapper {
@ArthurZucker (Collaborator, Author):

can you add a comment as to why we need this?

@ArthurZucker (Collaborator, Author):

(We have to implement this to make sure it's a Sequence(Sequence) and not a Single(Sequence), right?)
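An illustrative pure-Python model of that concern (hypothetical names, not the real serde code): the custom deserializer has to turn a serialized Sequence into a top-level sequence wrapper over its items, rather than a single wrapper holding one opaque Sequence, otherwise the items would not be individually indexable.

```python
# Sketch of the deserialization distinction: {"type": "Sequence", ...} must
# become ("sequence", [items]) rather than ("single", whole_sequence_object).

def deserialize_wrapper(data):
    if data.get("type") == "Sequence":
        # Flatten into a sequence wrapper so each inner normalizer
        # can later be read or replaced by index.
        return ("sequence", data["normalizers"])
    return ("single", data)

wrapped = deserialize_wrapper({"type": "Sequence", "normalizers": [{"type": "NFKC"}]})
assert wrapped[0] == "sequence"
assert len(wrapped[1]) == 1  # items remain individually indexable

single = deserialize_wrapper({"type": "NFKC"})
assert single[0] == "single"
```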

Comment on lines +750 to +754
.map_err(|_| PyException::new_err("rwlock is poisoned"))?
.normalize(normalized),
PyNormalizerTypeWrapper::Sequence(inner) => inner.iter().try_for_each(|n| {
n.read()
.map_err(|_| PyException::new_err("rwlock is poisoned"))?
@ArthurZucker (Collaborator, Author):

Same, let's have a more helpful error. I think in tokenizers we create a type for the PyException; it can be "IndexingError" or something.


// Setter for `single`
pub fn set_single(&mut self, single: Template) {
println!("Setting single to: {:?}", single); // Debugging output
@ArthurZucker (Collaborator, Author):

to remove!

Comment on lines +78 to 80
normalizers[1] = Strip()
assert normalizers[1].__class__ == Strip

@ArthurZucker (Collaborator, Author):

let's check set_item and also that get_item is mutable

@McPatate (Member):

It isn't atm, need to impl getters and setters for each variant of the Normalizer/Pretok/Processor. Will do in a subsequent PR.
