It would be cool to implement a soft tokenizer so we can use it in some of our actual models. The soft tokenizer considers overlap between the query and the universe (vocab). Using this information, we can randomly sample (with replacement) using the overlap score as a probability distribution.
From the meeting, it was noted that smaller regions would show up more often than large regions: because they are smaller, their overlap percentage with a query will tend to be larger.
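To make that bias concrete, here is a small sketch. It assumes regions are half-open `(start, end)` intervals and that "overlap percentage" means the overlapping length divided by the region's own length; both are assumptions for illustration, and `overlap_fraction` is a hypothetical helper, not existing code.

```python
def overlap_fraction(query, region):
    """Fraction of `region` covered by `query` (assumed half-open intervals)."""
    q_start, q_end = query
    r_start, r_end = region
    # Length of the intersection, clamped at zero when the intervals are disjoint.
    overlap = max(0, min(q_end, r_end) - max(q_start, r_start))
    return overlap / (r_end - r_start)


query = (100, 200)
small_region = (150, 160)   # length 10, fully inside the query
large_region = (0, 1000)    # length 1000, the query covers 100 of it

small_score = overlap_fraction(query, small_region)  # 10 / 10
large_score = overlap_fraction(query, large_region)  # 100 / 1000
```

Under this definition the small region gets a much higher score even though the large region shares more absolute length with the query, so it would be sampled far more often.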
Here is a Rust crate that will let you sample from distributions: https://docs.rs/rand_distr/latest/rand_distr/
I would use it similarly in Python:
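A minimal sketch of the idea, using only the standard library's `random.choices` (which samples with replacement and accepts unnormalized weights). The interval representation, the overlap definition, and the names `overlap_fraction` and `soft_tokenize` are assumptions for illustration, not an existing API.

```python
import random


def overlap_fraction(query, region):
    """Fraction of `region` covered by `query` (assumed half-open intervals)."""
    q_start, q_end = query
    r_start, r_end = region
    overlap = max(0, min(q_end, r_end) - max(q_start, r_start))
    return overlap / (r_end - r_start)


def soft_tokenize(query, universe, n_samples, rng=None):
    """Sample `n_samples` universe regions, weighted by overlap with `query`."""
    rng = rng or random.Random()
    scores = [overlap_fraction(query, region) for region in universe]
    # Keep only regions that overlap at all; random.choices rejects all-zero weights.
    candidates = [(r, s) for r, s in zip(universe, scores) if s > 0]
    if not candidates:
        return []
    regions, weights = zip(*candidates)
    # Sampling with replacement, using the overlap scores as relative probabilities.
    return rng.choices(regions, weights=weights, k=n_samples)


universe = [(0, 50), (90, 110), (150, 160), (0, 1000)]
tokens = soft_tokenize((100, 200), universe, n_samples=5, rng=random.Random(0))
```

Because the weights are the raw overlap fractions, this sketch inherits the small-region bias noted from the meeting; reweighting (for example by absolute overlap length) would be one way to counter it, though that choice is still open.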