
Word Embeddings #39

Closed
gayatrivenugopal opened this issue Jun 24, 2019 · 25 comments

Comments

@gayatrivenugopal

Can we retrieve word embeddings from the model?

@kimiyoung
Collaborator

Sure. See https://github.com/zihangdai/xlnet/blob/master/xlnet.py#L278

@kottas

kottas commented Jun 25, 2019

Could someone please elaborate on @kimiyoung's answer? I would like to perform a BERT-like word-embedding extraction from the pretrained model.

@gayatrivenugopal
Author

My objective is the same, but I need the embeddings for a different language. For English, you could try loading an existing XLNet model and calling get_embedding_table to get the vectors. Not sure about this, though...

@Arpan142

I'm new to this field. If I want to use the 'Custom usage of XLNet' for embeddings, I have to tokenize my input file with SentencePiece first to get the input_ids, right?
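
For reference, a minimal sketch of that tokenization step, assuming the spiece.model file shipped with the pretrained checkpoint and the encode_ids helper from prepro_utils (the example sentence is arbitrary):

import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids

sp = spm.SentencePieceProcessor()
sp.Load("spiece.model")  # SentencePiece model from the pretrained XLNet package

text = preprocess_text("An example sentence.", lower=False)
input_ids = encode_ids(sp, text)  # list of integer ids to feed as input_ids
print(input_ids)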

@SivilTaram

@kottas I think you'd like to acquire the "contextual word embedding" rather than the "vanilla word embedding", right?

@kottas

kottas commented Jun 25, 2019

@kottas I think you'd like to acquire the "contextual word embedding" rather than the "vanilla word embedding", right?

Right.

@SivilTaram

@kottas There is currently no explicit interface for that purpose. However, I think the authors or other developers familiar with TensorFlow will publish an example of extracting contextual embeddings. :)

@kimiyoung
Collaborator

get_sequence_output() returns contextual embeddings, while get_embedding_table() returns non-contextual embeddings. An example of tokenization has also been added.
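
A rough sketch of the difference, assuming an xlnet_model built as in the 'Custom usage of XLNet' example (shapes follow xlnet.py, which uses a [seq_len, batch_size] layout):

# Contextual: one vector per input token, depending on the surrounding context.
seq_out = xlnet_model.get_sequence_output()      # [seq_len, batch_size, hidden_size]

# Non-contextual: the fixed lookup table over the whole vocabulary.
embed_table = xlnet_model.get_embedding_table()  # [vocab_size, hidden_size], e.g. 32000 x 1024 for the large model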

@Arpan142

@kimiyoung can you please tell me what input_mask I have to provide to get the word embeddings? I provided None and it gives an error saying 'Expected binary or unicode string, got [20135, 17, 88, 10844, 4617]', where [20135, 17, 88, 10844, 4617] are the SentencePiece token ids of the first line of my data.

@kimiyoung
Collaborator

If there's nothing to mask, you could set input_mask to None. I think this error has other causes. It might be helpful if you post more details.

@Arpan142

import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids
from absl import flags
import sys

FLAGS = flags.FLAGS
spiece_model_file = 'D://xlnet_cased_L-24_H-1024_A-16//xlnet-master//spiece.model'
sp_model = spm.SentencePieceProcessor()
xp = []
sp_model.Load(spiece_model_file)

with open('input.txt') as foo:
    text = foo.readline()
    while text:
        text = preprocess_text(text, lower=False)
        print(text)
        # ids = encode_ids(sp_model, text)
        ids = sp_model.EncodeAsPieces(text.encode('utf-8'))
        xp.append(ids)
        text = foo.readline()

import pickle
with open('token1.pickle', 'wb') as get:
    pickle.dump(xp, get)

I used this code for tokenization, trying both the unicode (piece) encoding and the id encoding. Then I used the following code for the word embeddings.

import xlnet
from data_utils import SEP_ID, CLS_ID
from absl import flags
import pickle
import numpy as np
import sys

SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4
import os
import tensorflow as tf

def assign_to_gpu(gpu=0, ps_dev="/device:CPU:0"):
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op == "Variable":
            return ps_dev
        else:
            return "/gpu:%d" % gpu
    return _assign

flags.DEFINE_bool("use_tpu", False, help="whether to use TPUs")
flags.DEFINE_bool("use_bfloat16", False, help="whether to use bfloat16")
flags.DEFINE_float("dropout", default=0.1,
                   help="Dropout rate.")
flags.DEFINE_float("dropatt", default=0.1,
                   help="Attention dropout rate.")
flags.DEFINE_enum("init", default="normal",
                  enum_values=["normal", "uniform"],
                  help="Initialization method.")
flags.DEFINE_float("init_range", default=0.1,
                   help="Initialization std when init is uniform.")
flags.DEFINE_float("init_std", default=0.02,
                   help="Initialization std when init is normal.")
flags.DEFINE_integer("clamp_len", default=-1,
                     help="Clamp length")
flags.DEFINE_integer("mem_len", default=70,
                     help="Number of steps to cache")
flags.DEFINE_integer("reuse_len", 256,
                     help="Number of token that can be reused as memory. "
                     "Could be half of seq_len.")
flags.DEFINE_bool("bi_data", default=True,
                  help="Use bidirectional data streams, i.e., forward & backward.")
flags.DEFINE_bool("same_length", default=False,
                  help="Same length attention")

with open('token.pickle', 'rb') as new:
    tokens = pickle.load(new)
input_ids = np.asarray(tokens)
seg_ids = None
input_mask = None
FLAGS = flags.FLAGS
FLAGS.use_tpu = False
FLAGS.bi_data = False
FLAGS(sys.argv)

xlnet_config = xlnet.XLNetConfig(json_path='D://xlnet_cased_L-24_H-1024_A-16//xlnet_config.json')
run_config = xlnet.create_run_config(is_training=False, is_finetune=False, FLAGS=FLAGS)
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)
embed = xlnet_model.get_embedding_table()

For both the unicode encoding and the id encoding, the code gave the same error.

@kimiyoung
Collaborator

You need to pass a placeholder into xlnet, and use a tf session to fetch the output from xlnet. In other words, you need to construct a computational graph first, and then do the actual computation on it.
You may find the tutorials and guides useful.
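
A minimal sketch of that pattern, assuming a FLAGS object defined as in the snippet above and numpy arrays ids_array, seg_array, and mask_array already padded to shape [seq_len, batch_size] (those names are just placeholders here):

import tensorflow as tf
import xlnet

seq_len, batch_size = 128, 1

# 1. Build the graph on placeholders, not on numpy arrays.
input_ids = tf.placeholder(tf.int32, [seq_len, batch_size])
seg_ids = tf.placeholder(tf.int32, [seq_len, batch_size])
input_mask = tf.placeholder(tf.float32, [seq_len, batch_size])

xlnet_config = xlnet.XLNetConfig(json_path='xlnet_config.json')
# is_finetune=True skips the memory-related settings, which are not needed for simple feature extraction.
run_config = xlnet.create_run_config(is_training=False, is_finetune=True, FLAGS=FLAGS)
model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)
seq_out = model.get_sequence_output()  # [seq_len, batch_size, hidden_size]

# 2. Do the actual computation in a session, feeding the numpy arrays.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Restore the pretrained weights here, e.g. with tf.train.Saver().restore(sess, checkpoint_path).
    embeddings = sess.run(seq_out, feed_dict={input_ids: ids_array,
                                              seg_ids: seg_array,
                                              input_mask: mask_array})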

@cpury

cpury commented Jul 1, 2019

Great job on this model and thanks for publishing the code!

Unfortunately, the code is not very nice to use for simple tasks. I've been trying to load the model and get the output for a single string. I gave up after 3 hours. There are just too many TF details that I have to deal with before I can even use the model...

It would be amazing if you could provide a simpler API and more modular helpers. I don't know why a lot of the helper functions take the FLAGS argument. Don't you want people to use your library outside of scripts?

@kimiyoung
Collaborator

Thanks for your suggestion. We will try to improve the interface.

As for how to use it as is: if you look at the code here, the only thing created from FLAGS is the run_config. Alternatively, you can construct a RunConfig directly.
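
For example, something along these lines should work for plain inference, assuming the RunConfig constructor in xlnet.py (check the source for the full argument list):

import xlnet

# Construct the RunConfig by hand instead of going through create_run_config(FLAGS).
run_config = xlnet.RunConfig(
    is_training=False,
    use_tpu=False,
    use_bfloat16=False,
    dropout=0.0,
    dropatt=0.0)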

@cpury

cpury commented Jul 2, 2019

Thanks! That example indeed looks simple, but this omitted part is my problem:

initialize instances of tf.Tensor, including input_ids, seg_ids, and input_mask

If you could give an example of how to do this for a single sample or a set of samples, that would be amazing. I tried with your data_utils and model_utils, but they are not well documented and mostly require a FLAGS object. I also tried following the logic of the classifier examples but just got lost in a maze.
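
Not an official example, but roughly what the preprocessing in run_classifier.py boils down to for a single sentence, assuming a fixed seq_len and the spiece.model from the pretrained package (names like ids_array are mine):

import numpy as np
import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids
from data_utils import SEP_ID, CLS_ID

SEG_ID_A, SEG_ID_CLS, SEG_ID_PAD = 0, 2, 4
seq_len = 128

sp = spm.SentencePieceProcessor()
sp.Load("spiece.model")

ids = encode_ids(sp, preprocess_text("An example sentence.", lower=False))
ids = ids[:seq_len - 2] + [SEP_ID, CLS_ID]  # XLNet appends <sep> and <cls> at the end
pad = seq_len - len(ids)

# Left-padded, as run_classifier.py does; in input_mask, 1 marks padding and 0 marks real tokens.
ids_array = np.array([0] * pad + ids, dtype=np.int32)[:, None]
mask_array = np.array([1.0] * pad + [0.0] * len(ids), dtype=np.float32)[:, None]
seg_array = np.array([SEG_ID_PAD] * pad + [SEG_ID_A] * (len(ids) - 1) + [SEG_ID_CLS], dtype=np.int32)[:, None]

# These [seq_len, 1] arrays can then be fed through placeholders of the same shape, as in the sketch above.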

@matthias-samwald

I agree that it would be great to have a simple notebook showing us how to turn a string (phrase, sentence, paragraph, etc.) into numeric features!

@Arpan142

Arpan142 commented Jul 9, 2019

@kimiyoung I tried to use the 'Custom usage of XLNet' for sentence embeddings, but I'm getting the vocabulary embeddings. My dataset contains around 27000 lines, but the output I'm getting has dimension 32000 x 1024. Any idea what I'm doing wrong? Any suggestion would be of great help to me.

@Dhanasekar-S

@Arpan142 Exactly the same issue! get_embedding_table() returns the embeddings of all 32000 vocabulary tokens from the pretrained model itself.

@Hazoom

Hazoom commented Jul 11, 2019

@gayatrivenugopal I've just opened Pull Request #151 with a helper script that does exactly what you need. It takes a file containing a list of sentences and outputs a JSON file with one line per sentence, where each line contains the contextual word embedding for each token.

I hope it will be useful.

@cpury

cpury commented Jul 11, 2019

@Hazoom Awesome, thank you! It seems it answers all my questions. It would be great if it could get merged!

@gayatrivenugopal
Author

gayatrivenugopal commented Jul 11, 2019

@gayatrivenugopal I've just opened Pull Request #151 with a helper script that does exactly what you need. It takes a file containing a list of sentences and outputs a JSON file with one line per sentence, such that each line:

  1. Contains contextual word embedding for each token.
  2. Contains a pooled vector from all the tokens, using the pooling strategy input parameter.

I hope it will be useful.

That's GREAT!!! Will try it out and let you know. Thank you!

@Hazoom

Hazoom commented Jul 11, 2019

@gayatrivenugopal @cpury Please check again now; I found a bug in the alignment between the real tokens and the padding tokens. It is now fixed in my repository and in the PR itself.

@hiwaveSupport

Just tried running it and I am getting the JSON output of word embeddings.
I used the gpu_extract script to get the word embeddings.
Running on Python 2.7.

@hiwaveSupport

@Hazoom -- how do I force the use of the GPU with the gpu_extract script? I currently have one GPU but am not sure how to specify it, since by default the script runs on the CPU. Thanks in advance.
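
Not specific to that script, but a general TF1 sanity check that may help, assuming the GPU build of TensorFlow is installed:

import os
import tensorflow as tf

os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0 to TensorFlow
print(tf.test.is_gpu_available())          # False usually means the CPU-only tensorflow package is installed

# Log where each op is placed when the session is created.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)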

@gayatrivenugopal
Author

@gayatrivenugopal @cpury Please check again now; I found a bug in the alignment between the real tokens and the padding tokens. It is now fixed in my repository and in the PR itself.

Thanks a lot. This is extremely useful. I ran the script and got the JSON output successfully. Thanks again!
