Task Parsing Part for Pytorch Implementation #552
base: master
Conversation
Looks nice so far!
Also refined some functions in the embedding layer
Now, the model initialization part is working
Fixed bugs on UNK word embedding; set dropout prob to 0.1; added gradient clipping
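For illustration, a minimal sketch of how the dropout probability of 0.1 and the gradient clipping mentioned in this commit typically look in PyTorch; the module, layer sizes, and clipping norm below are assumptions, not the PR's actual code:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Hypothetical encoder sketch: embedding + LSTM with dropout p=0.1."""
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(p=0.1)  # dropout prob set to 0.1

    def forward(self, token_ids):
        emb = self.dropout(self.embedding(token_ids))
        output, _ = self.lstm(emb)
        return output

# Gradient clipping before the optimizer step (max_norm is an assumed value):
# loss.backward()
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
# optimizer.step()
```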
How in the world can I delete this comment?
w2i = {}
i = 0
I think that this might help. In the previous version with the head start of i = 1, it seems like the wrong vectors might have been used. If one looked up "," in w2i, it might have been mapped to 2 instead of 1.
This is because we treated the empty string "" and the unknown token "<UNK>" differently in the previous version: index 0 was taken by that special entry, and i started from 1.
In the current version, "" and "<UNK>" share the same embedding, so we do not need an extra id for ""/"<UNK>".
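A minimal sketch of the indexing described above, where indices start at 0 and ""/"<UNK>" share one entry so later tokens (e.g. ",") are not shifted off by one; the `embedding_vocab` iterable is an assumed name, not the PR's actual code:

```python
# Hypothetical sketch: build the word-to-index map with no reserved slot,
# mapping the empty string onto the shared "<UNK>" entry.
w2i = {}
i = 0
for word in embedding_vocab:  # words in the order their vectors appear
    key = "<UNK>" if word == "" else word
    if key not in w2i:
        w2i[key] = i
        i += 1
```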
@@ -21,15 +21,13 @@ def load(config):
else:
delimiter = " "
word, *rest = line.rstrip().split(delimiter)
word = "<UNK>" if word == "" else word
If Python is OK with using an empty string as a key, this should not be necessary.
It is easier to change the key here instead of changing all tokens throughout the code...
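For context, a self-contained sketch of the loading step under discussion: the empty-string token in the embedding file is stored under the "<UNK>" key, so the rest of the code can keep looking up "<UNK>" without special-casing "". The `emb_path` and `delimiter` names are assumptions:

```python
# Hypothetical sketch of reading pretrained vectors into emb_dict.
emb_dict = {}
with open(emb_path) as f:
    for line in f:
        word, *rest = line.rstrip().split(delimiter)
        word = "<UNK>" if word == "" else word  # remap the empty-string token
        emb_dict[word] = [float(x) for x in rest]
```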
emb_dict["<UNK>"] = vector | ||
else: | ||
emb_dict[word] = vector | ||
emb_dict[word] = vector |
Are two copies of the arrays being kept temporarily: one in emb_dict and another in weights? If memory is an issue, it seems like one could record this vector right away in weights.
You are right, I will refine this later. Thanks!
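One possible refinement along the lines suggested above, writing each vector into the weight matrix as it is read instead of keeping a second copy in emb_dict; this is only a sketch, and `w2i`, `emb_dim`, `emb_path`, and `delimiter` are assumed names:

```python
import numpy as np

# Hypothetical sketch: fill the embedding weight matrix in one pass so the
# vectors are not held twice (once in emb_dict, once in weights).
weights = np.zeros((len(w2i), emb_dim), dtype=np.float32)
with open(emb_path) as f:
    for line in f:
        word, *rest = line.rstrip().split(delimiter)
        word = "<UNK>" if word == "" else word
        if word in w2i:
            weights[w2i[word]] = np.asarray(rest, dtype=np.float32)
```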
@MihaiSurdeanu @bethard @kwalcock
Here is my current code for the MTL in torch. Thanks to Steve, I have already fixed a few bugs in the code.
This is just the task manager and file reader part; I will open another pull request after I get the NER task implemented.