spaCy Training Performance with Different Configurations and Setups #7
Itn  Dep Loss  NER Loss  UAS    NER P   NER R   NER F   Tag %   Token %  CPU WPS  GPU WPS
99   0.000     3.572     0.000  57.933  49.355  53.302  91.894  85.899   1438.3   0.0
We need to gather more Arabic NER training data.
After adding in the ANERCorp, here is the accuracy. Our performance did not improve because, out of 150k entity tokens, 88% of them are the useless 'O' tag. @ahalterman @cegme Tag accuracy goes down a little bit. However, LDC+ANERCorp with no merged classes, plus fastText pretrained embeddings, trained 10 times, got better performance:
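The 88% figure can be checked with a short script. This is a minimal pure-Python sketch, assuming the training data is held as (tokens, BILUO-tags) pairs as in spaCy v2's gold format; the function name and toy data are illustrative, not part of the project code.

```python
# Hypothetical helper: measure how much of the training data is the 'O'
# (outside) tag. Tags follow spaCy's BILUO scheme, e.g. 'U-PER', 'O'.

def o_tag_fraction(examples):
    """Return the fraction of tokens tagged 'O' across all examples."""
    total = 0
    outside = 0
    for _tokens, tags in examples:
        total += len(tags)
        outside += sum(1 for t in tags if t == "O")
    return outside / total if total else 0.0

# Toy example: 6 of 8 tokens are 'O'.
data = [
    (["Ahmed", "lives", "in", "Cairo"], ["U-PER", "O", "O", "U-GPE"]),
    (["He", "works", "there", "daily"], ["O", "O", "O", "O"]),
]
print(o_tag_fraction(data))  # 6/8 = 0.75
```

A heavily skewed 'O' distribution like this is why overall token accuracy can look high while entity accuracy stays flat.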
Well, not exactly "useless", since we need to be able to distinguish between entities and non-entities. What are the different numbers in the accuracy output? Does each one represent a tag type? We need to figure out how to handle ANER not having the full range of labels that OntoNotes has. One way would be to go from spaCy format to Prodigy format, where each task is a single entity label rather than all highlighted entities. Then when we use the more limited ANER data, we're not incorrectly telling it there's no entity there when there actually is. It would also be a cool experiment to know whether "Prodigy-style" training underperforms spaCy training (and by how much).
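The spaCy-to-Prodigy split described above can be sketched in pure Python. The dict keys ("text", "spans", "label") mimic Prodigy's task JSON, but this is an illustration of the idea, not Prodigy's actual API; the function name is made up.

```python
# Hedged sketch: split one spaCy-format example (character-offset entity
# tuples) into Prodigy-style tasks, one task per entity label, so each task
# only asserts spans for that single label.

def to_prodigy_tasks(text, entities):
    """entities: iterable of (start, end, label) character-offset tuples."""
    by_label = {}
    for start, end, label in entities:
        by_label.setdefault(label, []).append(
            {"start": start, "end": end, "label": label}
        )
    return [
        {"text": text, "label": label, "spans": spans}
        for label, spans in sorted(by_label.items())
    ]

tasks = to_prodigy_tasks(
    "Ahmed visited Cairo and Alexandria.",
    [(0, 5, "PER"), (14, 19, "GPE"), (24, 34, "GPE")],
)
# Two tasks: one for GPE (two spans) and one for PER (one span)
```

Training on per-label tasks means a corpus missing some OntoNotes labels never supplies a false "no entity here" signal for the labels it lacks.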
@ahalterman @cegme So the token accuracy went from 58.589 to 58.158.
http://users.dsic.upv.es/~ybenajiba/downloads.html
Some related stuff I found, similar to what you are talking about @ahalterman: https://support.prodi.gy/t/remarkable-difference-between-prodigy-and-custom-training-times/467/3
Pretrained embedding stuff? @ahalterman, I wonder if this is what you are talking about, or do you have better examples?
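For the pretrained-embedding route, fastText ships its vectors in a plain-text `.vec` format (a header line "num_words dim", then one word and its floats per line). A minimal sketch of the parsing step, assuming that format; in spaCy v2 the parsed vectors would then be attached with `nlp.vocab.set_vector(word, vec)`, which is omitted here so the example stays self-contained.

```python
# Hedged sketch: parse fastText's text-format .vec lines into (word, vector)
# pairs. Only the parsing is shown; loading into spaCy is a separate step.

def read_fasttext_vec(lines):
    """Yield (word, vector) pairs from fastText .vec text lines."""
    it = iter(lines)
    n_words, dim = map(int, next(it).split())  # header: word count, dimension
    for line in it:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim  # every row must match the declared dimension
        yield word, vec

sample = ["2 3", "cairo 0.1 0.2 0.3", "egypt 0.4 0.5 0.6"]
vectors = dict(read_fasttext_vec(sample))
# vectors["cairo"] == [0.1, 0.2, 0.3]
```

With a real file you would pass `open("cc.ar.300.vec", encoding="utf-8")` instead of the toy list (the filename here is illustrative).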
To avoid the catastrophic forgetting problem, our plan is to use spaCy with LDC+ANERCorp to train the model.
Performance with pretrained embeddings and merged tag classes:
token accuracy is 58.406 and entity accuracy is 54.254
Can you add the header to indicate what the 11 numbers mean?
Yeah, it is at the top of this issue, and also here:
training data tag distribution: https://github.com/izarov/cs224n/blob/master/assignment3/handouts/assignment3-soln.pdf |
spaCy training output
An exception that spaCy throws during training; I made the code eat (catch and skip) the exception:
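The exception-eating wrapper looks roughly like this. A hedged sketch: the real loop calls `nlp.update(...)`, but here a generic `update_fn` stands in for it so the example is self-contained, and skipped batches are recorded rather than silently dropped.

```python
# Hedged sketch of "eating" a per-batch training exception: one bad batch
# (e.g. misaligned gold spans) should not crash the whole training run.

def safe_update(update_fn, batch, skipped):
    """Run one update; on failure, record the batch instead of raising."""
    try:
        update_fn(batch)
        return True
    except Exception as err:
        skipped.append((batch, str(err)))
        return False

skipped = []
ok = safe_update(lambda b: 1 / 0, ["bad batch"], skipped)
# ok is False and the failing batch is recorded in `skipped`
```

Logging the skipped batches (rather than discarding them) makes it possible to inspect how much data the exception is actually costing.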