Word Embeddings #39
Could someone please elaborate on @kimiyoung's answer? I would like to perform a "BERT-like" word embeddings extraction from the pretrained model.
My objective is also the same, but I need the embeddings for a different language. For English, you could try to use an existing XLNet model and pass it to get_embedding_table to get the vectors. Not sure about this though...
I'm new to this field. For embeddings, if I want to use the 'Custom usage of XLNet', I have to tokenize my input file first using sentencepiece to get the input_ids, right?
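For reference, the tokenization step just turns each line of text into integer ids. Below is a minimal sketch with a toy whitespace vocabulary standing in for the real sentencepiece model; the actual pipeline would load the pretrained `spiece.model` with `spm.SentencePieceProcessor` and call `encode_as_ids()`:

```python
# Toy stand-in for sentencepiece: a tiny hand-made vocabulary. The real
# code would load 'spiece.model' and use sp.encode_as_ids(line) instead.
toy_vocab = {"<unk>": 0, "▁hello": 1, "▁world": 2, "▁xlnet": 3}

def encode_as_ids(line):
    """Map each whitespace token to an integer id, unknown words to <unk>."""
    return [toy_vocab.get("▁" + tok, toy_vocab["<unk>"])
            for tok in line.lower().split()]

# One list of ids per input line, the shape expected for input_ids.
input_ids = [encode_as_ids(line) for line in ["hello world", "hello xlnet"]]
print(input_ids)  # [[1, 2], [1, 3]]
```

The real sentencepiece model produces subword pieces rather than whole words, but the output shape (one id list per sentence) is the same.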
@kottas I think you'd like to acquire the "contextual word embedding" rather than the "vanilla word embedding", right?
Right.
@kottas Right now there is no explicit interface for that purpose. However, I think the authors or some other developers familiar with tensorflow will publish the usage of contextual embedding. :)
|
@kimiyoung can you please tell me what input_mask I have to provide to get the word embeddings? I passed None and got an error saying 'Expected binary or unicode string, got [20135, 17, 88, 10844, 4617]', where [20135, 17, 88, 10844, 4617] is the sentencepiece tokenization of the first line of my data.
If there's nothing to mask, you could set …
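For context: as described in the docstring of `modeling.py` in the xlnet repo, `input_mask` marks padding rather than real tokens, i.e. 0 for a real token and 1 for padding (the opposite of BERT's convention). A minimal sketch of building it for a left-padded sequence, assuming a fixed `max_seq_length`:

```python
def build_input_mask(input_ids, max_seq_length, pad_id=0):
    """Left-pad the ids to max_seq_length and build the matching mask:
    1.0 for padding positions, 0.0 for real tokens (xlnet's convention)."""
    n_pad = max_seq_length - len(input_ids)
    padded_ids = [pad_id] * n_pad + list(input_ids)
    input_mask = [1.0] * n_pad + [0.0] * len(input_ids)
    return padded_ids, input_mask

# The sentencepiece ids from the comment above, padded to length 8.
ids, mask = build_input_mask([20135, 17, 88, 10844, 4617], max_seq_length=8)
print(mask)  # [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

If nothing is padded, the mask is simply all zeros of the same length as the ids.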
I used this code for tokenization, with both unicode encoding and id encoding:

```python
import pickle
import sentencepiece as spm

FLAGS = flags.FLAGS

with open('input.txt') as foo:
    ...
```

Then I used this code for the word embeddings:

```python
import xlnet

SEG_ID_A = 0

def assign_to_gpu(gpu=0, ps_dev="/device:CPU:0"):
    ...

flags.DEFINE_bool("use_tpu", False, help="whether to use TPUs")
xlnet_config = xlnet.XLNetConfig(json_path='D://xlnet_cased_L-24_H-1024_A-16//xlnet_config.json')
```

For both unicode encoding and id encoding the code gave the same error.
You need to pass a placeholder into xlnet, and use a tf session to fetch the output from xlnet. In other words, you need to construct a computational graph first, and then do the actual computation on it. |
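The "construct the graph first, then compute" point can be illustrated with a toy deferred-execution sketch, in plain Python standing in for TF1's `tf.placeholder` / `tf.Session` (the names and mechanics here are simplified stand-ins, not the real TensorFlow API):

```python
class Placeholder:
    """Stands in for tf.placeholder: a named slot filled only at run time."""
    def __init__(self, name):
        self.name = name

class Graph:
    """Toy graph: operations are recorded now and executed later, like TF1."""
    def __init__(self):
        self.ops = []

    def add_op(self, fn):
        self.ops.append(fn)  # nothing is computed at definition time

    def run(self, feed_dict):
        out = None
        for fn in self.ops:  # computation happens only here
            out = fn(feed_dict)
        return out

ids = Placeholder("input_ids")
g = Graph()
# Define a symbolic op on the placeholder; no data flows yet.
g.add_op(lambda feed: [x * 2 for x in feed[ids]])
# Only now, feeding concrete token ids, does anything execute:
# the analogue of sess.run(output, feed_dict={ids_ph: token_ids}).
result = g.run({ids: [1, 2, 3]})
print(result)  # [2, 4, 6]
```

With the real model, the placeholder would hold the padded input_ids and the fetched output would be the hidden states from xlnet.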
Great job on this model, and thanks for publishing the code! Unfortunately, the code is not very nice to use for simple tasks. I've been trying to load the model and get the output for a single string, and gave up after 3 hours. There are just too many TF details I have to deal with before I can even use the model. It would be amazing if you could provide a simpler API and more modular helpers. I don't know why a lot of the helper functions take the …
Thanks for your suggestion. We will try to improve the interface. As for how to use it as is: if you look at the code here, the only thing that is created using FLAGS is just the …
Thanks! That example indeed looks simple, but this omitted part is my problem:
If you could give an example of how to do this for a single sample or a set of samples, that would be amazing. I tried with your data_utils and model_utils, but they are not well documented and mostly require a …
I agree that it would be great to have a simple notebook showing us how to turn a string (phrase, sentence, paragraph, etc.) into numeric features!
@kimiyoung I tried to use the 'custom usage of xlnet' for sentence embeddings, but I'm getting the vocabulary embeddings. My dataset contains around 27,000 lines, but the output I'm getting has dimensions 32000 x 1024. Any idea what I'm doing wrong? Any suggestion would be of great help to me.
@Arpan142 Exactly the same issue! embedding_table() returns the embeddings for the 32000 vocabulary tokens from the trained model itself.
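The shapes explain the symptom: the embedding table is the static lookup matrix with one row per vocabulary entry (32000 ids x 1024 hidden units for this model), so its size never depends on the dataset, while contextual embeddings come from the model's output and have one vector per input token. A toy sketch of the distinction, with made-up sizes standing in for 32000 and 1024:

```python
# Hypothetical tiny sizes in place of vocab_size=32000, hidden=1024.
vocab_size, hidden = 6, 4

# Static table, shape (vocab_size, hidden): what embedding_table() yields.
# Its first dimension is always the vocabulary size.
embedding_table = [[float(i)] * hidden for i in range(vocab_size)]

# Contextual output has one vector per token of THIS sentence, so its
# first dimension follows the input length, not the vocabulary. (A real
# model would further transform these vectors based on context; the plain
# lookup here is only to show the shapes.)
sentence_ids = [3, 1, 5]
contextual_output = [embedding_table[i] for i in sentence_ids]

print(len(embedding_table), len(embedding_table[0]))      # 6 4
print(len(contextual_output), len(contextual_output[0]))  # 3 4
```

So a 32000 x 1024 result means the code fetched the table instead of the per-token model output.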
@gayatrivenugopal I've just opened Pull Request #151 with a helper script that does exactly what you need. It takes a file containing a list of sentences and outputs a JSON file with one line per sentence, where each line contains the contextual word embedding for each token. I hope it will be useful.
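A small sketch of consuming that kind of output, assuming a JSON-lines layout where each line maps tokens to their vectors. The field names (`tokens`, `embeddings`) and values here are hypothetical; check the script in PR #151 for the actual schema:

```python
import io
import json

# Hypothetical two-sentence output file; real field names may differ.
dump = io.StringIO(
    '{"tokens": ["▁hello", "▁world"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]}\n'
    '{"tokens": ["▁hi"], "embeddings": [[0.5, 0.6]]}\n'
)

# One JSON object per line, one object per sentence.
sentences = [json.loads(line) for line in dump]

for sent in sentences:
    for tok, vec in zip(sent["tokens"], sent["embeddings"]):
        print(tok, len(vec))  # each token alongside its embedding dimension
```

Reading line by line keeps memory bounded even for large corpora, which is presumably why the script writes one sentence per line.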
@Hazoom Awesome, thank you! That seems to answer all my questions. It would be great if it could get merged!
That's GREAT!!! Will try it out and let you know. Thank you!
@gayatrivenugopal @cpury Please check again now. I found a bug in the alignment between the real tokens and the padding tokens. It is now fixed in my repository and in the PR itself.
Just tried running it with … and I am getting the JSON output of word embeddings.
@Hazoom -- how do I force the use of the GPU with the gpu_extract script? I have one GPU, but I'm not sure how to select it, since by default the script runs on the CPU. Thanks in advance.
Thanks a lot. This is extremely useful. Ran the script and got the JSON output successfully. Thanks again!
Can we retrieve word embeddings from the model?