
Word Embeddings #39

Closed
gayatrivenugopal opened this issue Jun 24, 2019 · 25 comments

Comments

@gayatrivenugopal

Can we retrieve word embeddings from the model?

@kimiyoung
Collaborator

Sure. See https://github.com/zihangdai/xlnet/blob/master/xlnet.py#L278

@kottas

kottas commented Jun 25, 2019

Could someone please elaborate on @kimiyoung's answer? I would like to perform a BERT-like word-embedding extraction from the pretrained model.

@gayatrivenugopal
Author

My objective is the same, but I need the embeddings for a different language. For English, you could try loading an existing XLNet model and calling get_embedding_table to get the vectors. Not sure about this, though...

@Arpan142

I'm new to this field. If I want to use the 'Custom usage of XLNet' for embeddings, I have to tokenize my input file with SentencePiece first to get the input_ids, right?
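
For reference, a minimal sketch of that tokenization step, assuming the spiece.model file shipped with the pretrained checkpoint and the encode_ids helper from prepro_utils (the example sentence is arbitrary):

import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids

sp = spm.SentencePieceProcessor()
sp.Load("spiece.model")  # SentencePiece model from the pretrained XLNet package

text = preprocess_text("An example sentence.", lower=False)
input_ids = encode_ids(sp, text)  # list of integer ids to feed as input_ids
print(input_ids)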

@SivilTaram

@kottas I think you'd like to acquire the "contextual word embedding" rather than the "vanilla word embedding", right?

@kottas

kottas commented Jun 25, 2019

@kottas I think you'd like to acquire the "contextual word embedding" rather than the "vanilla word embedding", right?

Right.

@SivilTaram

@kottas There is currently no explicit interface for that purpose. However, I think the authors or other developers familiar with TensorFlow will publish an example of extracting contextual embeddings. :)

@kimiyoung
Collaborator

get_sequence_output() returns contextual embeddings, while get_embedding_table() returns non-contextual embeddings. An example of tokenization has also been added.
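
A rough sketch of the difference, assuming an xlnet_model built as in the 'Custom usage of XLNet' example (shapes follow xlnet.py, which uses a [seq_len, batch_size] layout):

# Contextual: one vector per input token, depending on the surrounding context.
seq_out = xlnet_model.get_sequence_output()      # [seq_len, batch_size, hidden_size]

# Non-contextual: the fixed lookup table over the whole vocabulary.
embed_table = xlnet_model.get_embedding_table()  # [vocab_size, hidden_size], e.g. 32000 x 1024 for the large model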

@Arpan142

@kimiyoung can you please tell me what input_mask I have to provide to get the word embeddings? I provided None and it gives an error saying 'Expected binary or unicode string, got [20135, 17, 88, 10844, 4617]', where [20135, 17, 88, 10844, 4617] are the SentencePiece token ids of the first line of my data.

@kimiyoung
Collaborator

If there's nothing to mask, you could set input_mask to None. I think this error has other causes. It might be helpful if you post more details.

@Arpan142

import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids
from absl import flags
import sys

FLAGS = flags.FLAGS
spiece_model_file = 'D://xlnet_cased_L-24_H-1024_A-16//xlnet-master//spiece.model'
sp_model = spm.SentencePieceProcessor()
xp = []
sp_model.Load(spiece_model_file)

with open('input.txt') as foo:
    text = foo.readline()
    while text:
        text = preprocess_text(text, lower=False)
        print(text)
        # ids = encode_ids(sp_model, text)
        ids = sp_model.EncodeAsPieces(text.encode('utf-8'))
        xp.append(ids)
        text = foo.readline()

import pickle
with open('token1.pickle', 'wb') as get:
    pickle.dump(xp, get)

I used this code for tokenization, trying both the unicode (piece) encoding and the id encoding. Then I used the following code for the word embeddings.

import xlnet
from data_utils import SEP_ID, CLS_ID
from absl import flags
import pickle
import numpy as np
import sys

SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4
import os
import tensorflow as tf

def assign_to_gpu(gpu=0, ps_dev="/device:CPU:0"):
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op == "Variable":
            return ps_dev
        else:
            return "/gpu:%d" % gpu
    return _assign

flags.DEFINE_bool("use_tpu", False, help="whether to use TPUs")
flags.DEFINE_bool("use_bfloat16", False, help="whether to use bfloat16")
flags.DEFINE_float("dropout", default=0.1,
                   help="Dropout rate.")
flags.DEFINE_float("dropatt", default=0.1,
                   help="Attention dropout rate.")
flags.DEFINE_enum("init", default="normal",
                  enum_values=["normal", "uniform"],
                  help="Initialization method.")
flags.DEFINE_float("init_range", default=0.1,
                   help="Initialization std when init is uniform.")
flags.DEFINE_float("init_std", default=0.02,
                   help="Initialization std when init is normal.")
flags.DEFINE_integer("clamp_len", default=-1,
                     help="Clamp length")
flags.DEFINE_integer("mem_len", default=70,
                     help="Number of steps to cache")
flags.DEFINE_integer("reuse_len", 256,
                     help="Number of token that can be reused as memory. "
                     "Could be half of seq_len.")
flags.DEFINE_bool("bi_data", default=True,
                  help="Use bidirectional data streams, i.e., forward & backward.")
flags.DEFINE_bool("same_length", default=False,
                  help="Same length attention")

with open('token.pickle', 'rb') as new:
    tokens = pickle.load(new)
input_ids = np.asarray(tokens)
seg_ids = None
input_mask = None
FLAGS = flags.FLAGS
FLAGS.use_tpu = False
FLAGS.bi_data = False
FLAGS(sys.argv)

xlnet_config = xlnet.XLNetConfig(json_path='D://xlnet_cased_L-24_H-1024_A-16//xlnet_config.json')
run_config = xlnet.create_run_config(is_training=False, is_finetune=False, FLAGS=FLAGS)
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)
embed = xlnet_model.get_embedding_table()

For both the unicode encoding and the id encoding, the code gave the same error.

@kimiyoung
Collaborator

You need to pass a placeholder into xlnet, and use a tf session to fetch the output from xlnet. In other words, you need to construct a computational graph first, and then do the actual computation on it.
You may find the tutorials and guides useful.
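
A minimal sketch of that pattern, assuming a FLAGS object defined as in the snippet above and numpy arrays ids_array, seg_array, and mask_array already padded to shape [seq_len, batch_size] (those names are just placeholders here):

import tensorflow as tf
import xlnet

seq_len, batch_size = 128, 1

# 1. Build the graph on placeholders, not on numpy arrays.
input_ids = tf.placeholder(tf.int32, [seq_len, batch_size])
seg_ids = tf.placeholder(tf.int32, [seq_len, batch_size])
input_mask = tf.placeholder(tf.float32, [seq_len, batch_size])

xlnet_config = xlnet.XLNetConfig(json_path='xlnet_config.json')
# is_finetune=True skips the memory-related settings, which are not needed for simple feature extraction.
run_config = xlnet.create_run_config(is_training=False, is_finetune=True, FLAGS=FLAGS)
model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)
seq_out = model.get_sequence_output()  # [seq_len, batch_size, hidden_size]

# 2. Do the actual computation in a session, feeding the numpy arrays.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Restore the pretrained weights here, e.g. with tf.train.Saver().restore(sess, checkpoint_path).
    embeddings = sess.run(seq_out, feed_dict={input_ids: ids_array,
                                              seg_ids: seg_array,
                                              input_mask: mask_array})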

@cpury

cpury commented Jul 1, 2019

Great job on this model and thanks for publishing the code!

Unfortunately, the code is not very nice to use for simple tasks. I've been trying to load the model and get the output for a single string. I gave up after 3 hours. There are just too many TF details that I have to deal with before I can even use the model...

It would be amazing if you could provide a simpler API and more modular helpers. I don't know why a lot of the helper functions take the FLAGS argument. Don't you want people to use your library outside of scripts?

@kimiyoung
Collaborator

Thanks for your suggestion. We will try to improve the interface.

As for how to use it as is: if you look at the code here, the only thing created from FLAGS is the run_config. Alternatively, you can construct a RunConfig directly.
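
For example, something along these lines should work for plain inference, assuming the RunConfig constructor in xlnet.py (check the source for the full argument list):

import xlnet

# Construct the RunConfig by hand instead of going through create_run_config(FLAGS).
run_config = xlnet.RunConfig(
    is_training=False,
    use_tpu=False,
    use_bfloat16=False,
    dropout=0.0,
    dropatt=0.0)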

@cpury

cpury commented Jul 2, 2019

Thanks! That example indeed looks simple, but this omitted part is my problem:

initialize instances of tf.Tensor, including input_ids, seg_ids, and input_mask

If you could give an example of how to do this for a single sample or a set of samples, that would be amazing. I tried with your data_utils and model_utils, but they are not well documented and mostly require a FLAGS object. I also tried following the logic of the classifier examples but just got lost in a maze.
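
Not an official example, but roughly what the preprocessing in run_classifier.py boils down to for a single sentence, assuming a fixed seq_len and the spiece.model from the pretrained package (names like ids_array are mine):

import numpy as np
import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids
from data_utils import SEP_ID, CLS_ID

SEG_ID_A, SEG_ID_CLS, SEG_ID_PAD = 0, 2, 4
seq_len = 128

sp = spm.SentencePieceProcessor()
sp.Load("spiece.model")

ids = encode_ids(sp, preprocess_text("An example sentence.", lower=False))
ids = ids[:seq_len - 2] + [SEP_ID, CLS_ID]  # XLNet appends <sep> and <cls> at the end
pad = seq_len - len(ids)

# Left-padded, as run_classifier.py does; in input_mask, 1 marks padding and 0 marks real tokens.
ids_array = np.array([0] * pad + ids, dtype=np.int32)[:, None]
mask_array = np.array([1.0] * pad + [0.0] * len(ids), dtype=np.float32)[:, None]
seg_array = np.array([SEG_ID_PAD] * pad + [SEG_ID_A] * (len(ids) - 1) + [SEG_ID_CLS], dtype=np.int32)[:, None]

# These [seq_len, 1] arrays can then be fed through placeholders of the same shape, as in the sketch above.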

@matthias-samwald

I agree that it would be great to have a simple notebook showing us how to turn a string (phrase, sentence, paragraph, etc.) into numeric features!

@Arpan142

Arpan142 commented Jul 9, 2019

@kimiyoung I tried to use the 'Custom usage of XLNet' for sentence embeddings, but I'm getting the vocabulary embeddings. My dataset contains around 27000 lines, but the output I'm getting has dimension 32000 x 1024. Any idea what I'm doing wrong? Any suggestion would be of great help to me.

@Dhanasekar-S

@Arpan142 Exactly the same issue! get_embedding_table() returns the embeddings of all 32000 vocabulary tokens from the pretrained model itself.

@Hazoom

Hazoom commented Jul 11, 2019

@gayatrivenugopal I've just opened Pull Request #151 with a helper script that does exactly what you need. It takes a file containing a list of sentences and outputs a JSON file with one line per sentence, where each line contains the contextual word embedding for each token.

I hope it will be useful.

@cpury

cpury commented Jul 11, 2019

@Hazoom Awesome, thank you! It seems it answers all my questions. It would be great if it could get merged!

@gayatrivenugopal
Author

gayatrivenugopal commented Jul 11, 2019

@gayatrivenugopal I've just opened Pull Request #151 with a helper script that does exactly what you need. It takes a file containing a list of sentences and outputs a JSON file with one line per sentence, such that each line:

  1. Contains contextual word embedding for each token.
  2. Contains a pooled vector from all the tokens, using the pooling strategy input parameter.

I hope it will be useful.

That's GREAT!!! Will try it out and let you know. Thank you!

@Hazoom

Hazoom commented Jul 11, 2019

@gayatrivenugopal @cpury Please check again now; I found a bug in the alignment between the real tokens and the padding tokens. It is now fixed in my repository and in the PR itself.

@hiwaveSupport

Just tried running it and I am getting the JSON output of word embeddings.
I used the gpu_extract script to get the word embeddings.
Running on Python 2.7.

@hiwaveSupport

@Hazoom -- how do I force the use of the GPU with the gpu_extract script? I currently have one GPU but am not sure how to specify it, since by default the script runs on the CPU. Thanks in advance.
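
Not specific to that script, but a general TF1 sanity check that may help, assuming the GPU build of TensorFlow is installed:

import os
import tensorflow as tf

os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0 to TensorFlow
print(tf.test.is_gpu_available())          # False usually means the CPU-only tensorflow package is installed

# Log where each op is placed when the session is created.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)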

@gayatrivenugopal
Author

@gayatrivenugopal @cpury Please check again now; I found a bug in the alignment between the real tokens and the padding tokens. It is now fixed in my repository and in the PR itself.

Thanks a lot. This is extremely useful. I ran the script and got the JSON output successfully. Thanks again!
