prepare-conll-coref does not convert AIDA-YAGO2-dataset #45
Comments
CoNLL coref is not the same as CoNLL-AIDA. I have clarified this in the docs. Please remind me what the AIDA dataset looks like?
Oops, sorry, I missed that in the docs. Here is the format of CoNLL-AIDA:
And this is the format of the system output which I believe is accepted by
Thank you so much,
def aida_to_neleval(f, iob_col=1, kbid_col=3):
    # Columns are 0-indexed: token, B/I tag, mention text, KB id, ...
    def emit():
        # Print the pending mention, if any, as: docid, start, end, kbid.
        if 'start' not in cur:
            return
        kbid = cur['kbid']
        if kbid == '--NME--':
            kbid = 'NIL0000000'
        print(docid, cur['start'], offset, kbid, sep='\t')
        del cur['start']
        del cur['kbid']

    cur = {}
    for l in f:
        l = l.rstrip()
        if not l:
            continue
        elif l.startswith('-DOCSTART-'):
            emit()
            docid = l[len('-DOCSTART-'):].strip().replace(' ', '_')
            offset = 0
        else:
            # Offsets count tokens within the current document.
            offset += 1
            cols = l.split('\t')
            if len(cols) == 1:
                # A plain token ends any open mention.
                emit()
                continue
            if cols[iob_col] == 'B' or cols[kbid_col] != cur.get('kbid'):
                emit()
                cur['start'] = offset
                cur['kbid'] = cols[kbid_col]
    emit()  # flush a mention that ends on the last token of the input


if __name__ == '__main__':
    import io
    f = io.StringIO('''
-DOCSTART- (1 EU)
EU\tB\tEU\t--NME--
rejects
German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\t/m/0345h
call
to
boycott
British\tB\tBritish\tUnited_Kingdom\thttp://en.wikipedia.org/wiki/United_Kingdom\t/m/07ssc
lamb
.
Peter\tB\tPeter Blackburn\t--NME--
Blackburn\tI\tPeter Blackburn\t--NME--
BRUSSELS\tB\tBRUSSELS\tBrussels\thttp://en.wikipedia.org/wiki/Brussels\t/m/0177z
1996-08-22
The
European\tB\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
Commission\tI\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
said
on
Thursday
it
disagreed
''')
aida_to_neleval(f) outputs
I'll look into making a new command out of it.
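For anyone who wants to test the conversion without capturing printed output, here is a minimal sketch of the same idea written as a generator. It is not part of neleval: the name `iter_aida_mentions` and the `nil_id` parameter are hypothetical, and it assumes the column layout shown in the sample above (token, B/I tag, mention text, KB id). Unlike the function above, it reports the end offset as the index of the mention's last token (inclusive), which is an assumption you may need to change so that gold and system files use the same convention.

```python
def iter_aida_mentions(lines, iob_col=1, kbid_col=3, nil_id='NIL0000000'):
    """Yield (docid, start, end, kbid) for each mention in CoNLL-AIDA lines.

    Offsets are 1-based token indices within each document, and `end` is
    inclusive (an assumption of this sketch, not a neleval requirement).
    """
    docid = None
    offset = 0
    mention = None  # [start, end, kbid] of the mention being built, if any

    for line in lines:
        line = line.rstrip('\r\n')
        if not line.strip():
            continue  # blank lines only separate sentences
        if line.startswith('-DOCSTART-'):
            if mention is not None:
                yield (docid, mention[0], mention[1], mention[2])
                mention = None
            docid = line[len('-DOCSTART-'):].strip().replace(' ', '_')
            offset = 0
            continue

        offset += 1
        cols = line.split('\t')
        annotated = len(cols) > max(iob_col, kbid_col)

        # An 'I' tag extends the mention that is currently open.
        if annotated and cols[iob_col] == 'I' and mention is not None:
            mention[1] = offset
            continue

        # Anything else closes the open mention first.
        if mention is not None:
            yield (docid, mention[0], mention[1], mention[2])
            mention = None

        if annotated:
            kbid = cols[kbid_col]
            if kbid == '--NME--':
                kbid = nil_id
            mention = [offset, offset, kbid]

    # Flush a mention that ends on the last token of the input.
    if mention is not None:
        yield (docid, mention[0], mention[1], mention[2])
```

Each yielded tuple can be written out as one tab-separated line, for example with `'\t'.join(str(x) for x in row)`.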
Thank you so much. Sorry for the late reply; I've been super busy with other tasks. I will test your code as soon as I get back to this task. Thanks again :)
Awesome. I just had to decode before splitting to resolve an error. I forgot to mention that in some rows the kbid actually has a mixed format, which looks something like:
How can I fix the format if I've decoded as UTF-8 when splitting? Also, could you take a look at this related issue, andychisholm/nel#21? It is in Conll.py: the offsets are calculated completely wrong. I tried many different ways, and also tried to apply your method there, but they still aren't calculated correctly. Many, many thanks :)
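On the decoding step: one option, sketched here under the assumption that the dataset file is UTF-8 encoded, is to decode once when opening the file instead of decoding each line before splitting. The helper name `read_aida_rows` is hypothetical and not part of neleval; it does not address the mixed kbid format, since that example is not shown above.

```python
import io


def read_aida_rows(path, encoding='utf-8'):
    """Yield each non-blank line of a TSV file as a list of text columns.

    io.open decodes the bytes while reading, so every column is already
    a text (unicode) string by the time it is split on tabs.
    """
    with io.open(path, encoding=encoding) as f:
        for line in f:
            line = line.rstrip('\r\n')
            if line:
                yield line.split('\t')
```

Opening the file this way gives the same behaviour on Python 2 and Python 3, so the splitting code never has to handle raw bytes.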
I tried running the prepare-conll-coref command, but no file is generated:
$ neleval prepare-conll-coref /path/to/AIDA-YAGO2-dataset.tsv
I would like to know how to convert the CoNLL-AIDA dataset format to the neleval format.
Thanks in advance,