prepare-conll-coref does not convert AIDA-YAGO2-dataset #45

Open
userofgithub1 opened this issue Jul 4, 2018 · 5 comments

@userofgithub1

I tried running prepare-conll-coref as follows:

$ neleval prepare-conll-coref /path/to/AIDA-YAGO2-dataset.tsv

No file is generated. How can I convert the CoNLL-AIDA dataset format to the neleval format?

Thanks in advance,

@jnothman
Member

jnothman commented Jul 4, 2018

CoNLL coref is not the same as CoNLL-AIDA. I have clarified this in the docs. Please remind me what the AIDA dataset looks like?

@userofgithub1
Author

Oops, sorry, I missed that in the docs. Here is the format of CoNLL-AIDA:

-DOCSTART- (1 EU)
EU	B	EU	--NME--
rejects
German	B	German	Germany	http://en.wikipedia.org/wiki/Germany	/m/0345h
call
to
boycott
British	B	British	United_Kingdom	http://en.wikipedia.org/wiki/United_Kingdom	/m/07ssc
lamb
.

Peter	B	Peter Blackburn	--NME--
Blackburn	I	Peter Blackburn	--NME--

BRUSSELS	B	BRUSSELS	Brussels	http://en.wikipedia.org/wiki/Brussels	/m/0177z
1996-08-22

The
European	B	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission	/m/02q9k
Commission	I	European Commission	European_Commission	http://en.wikipedia.org/wiki/European_Commission	/m/02q9k
said
on
Thursday
it
disagreed

And this is the format of the system output, which I believe is accepted by neleval evaluate:

1164testb RUGBY	1474	1491	en.wikipedia.org/wiki/Andrea_Castellani	1.0	PERSON
1164testb RUGBY	1452	1471	en.wikipedia.org/wiki/Alessandro_Moscardi	1.0	PERSON
1164testb RUGBY	1433	1449	en.wikipedia.org/wiki/Nicola_Mazzucato	1.0	ORG
1164testb RUGBY	1416	1430	en.wikipedia.org/wiki/Gianluca_Guidi	1.0	PERSON
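For illustration only, here is a minimal sketch of reading one such row. It assumes the six tab-separated columns are document ID, start offset, end offset, KB ID, score, and entity type, as the sample rows suggest; the field names and the parse_annotation helper are hypothetical and not part of neleval. Splitting on tabs rather than on whitespace matters because the document ID itself contains a space.

```python
from collections import namedtuple

# Field names are illustrative, inferred from the sample rows above.
Annotation = namedtuple('Annotation', 'docid start end kbid score type')


def parse_annotation(line):
    """Split one tab-separated row into typed fields."""
    docid, start, end, kbid, score, type_ = line.rstrip('\n').split('\t')
    return Annotation(docid, int(start), int(end), kbid, float(score), type_)


row = parse_annotation(
    '1164testb RUGBY\t1474\t1491\t'
    'en.wikipedia.org/wiki/Andrea_Castellani\t1.0\tPERSON')
print(row.docid)            # 1164testb RUGBY
print(row.end - row.start)  # 17
```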

Thank you so much,

@jnothman
Member

jnothman commented Jul 5, 2018

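def aida_to_neleval(f, iob_col=1, kbid_col=3):
    """Print one tab-separated line per mention: docid, start, end, kbid.

    Offsets are 1-based token indices within a document; end is exclusive.
    Columns per annotated row: token, B/I tag, mention text, KB ID, ...
    """
    def emit(end):
        # Flush the mention currently being built, if there is one.
        if 'start' not in cur:
            return
        kbid = cur['kbid']
        if kbid == '--NME--':
            kbid = 'NIL0000000'
        print(docid, cur['start'], end, kbid, sep='\t')
        del cur['start']
        del cur['kbid']

    cur = {}
    docid = None
    offset = 0
    for l in f:
        l = l.rstrip()
        if not l:
            continue
        elif l.startswith('-DOCSTART-'):
            # A mention still open here ended on the previous document's last token.
            emit(offset + 1)
            docid = l[len('-DOCSTART-'):].strip().replace(' ', '_')
            offset = 0
        else:
            offset += 1
            cols = l.split('\t')
            if len(cols) == 1:
                # Unannotated token: any open mention ended just before it.
                emit(offset)
                continue
            if cols[iob_col] == 'B' or cols[kbid_col] != cur.get('kbid'):
                emit(offset)
                cur['start'] = offset
                cur['kbid'] = cols[kbid_col]
    # Flush a mention that runs to the end of the input.
    emit(offset + 1)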


if __name__ == '__main__':
    import io

    f = io.StringIO('''
-DOCSTART- (1 EU)
EU\tB\tEU\t--NME--
rejects
German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\t/m/0345h
call
to
boycott
British\tB\tBritish\tUnited_Kingdom\thttp://en.wikipedia.org/wiki/United_Kingdom\t/m/07ssc
lamb
.

Peter\tB\tPeter Blackburn\t--NME--
Blackburn\tI\tPeter Blackburn\t--NME--

BRUSSELS\tB\tBRUSSELS\tBrussels\thttp://en.wikipedia.org/wiki/Brussels\t/m/0177z
1996-08-22

The
European\tB\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
Commission\tI\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
said
on
Thursday
it
disagreed
    ''')
    aida_to_neleval(f)

outputs

(1_EU)	1	2	NIL0000000
(1_EU)	3	4	Germany
(1_EU)	7	8	United_Kingdom
(1_EU)	10	12	NIL0000000
(1_EU)	12	13	Brussels
(1_EU)	15	17	European_Commission

I'll look into making a new command out of it.

@userofgithub1
Author

Thank you so much. Sorry for the late reply; I've been super busy with other tasks. I will test your code as soon as I get back to this task.

Thanks again :)

@userofgithub1
Author

Awesome. I just had to decode before splitting to resolve "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 6: ordinal not in range(128)", so I changed cols = l.split('\t') to cols = l.decode('utf-8').split('\t').
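One alternative to decoding each line by hand is to open the file as UTF-8 text, so that every line is already decoded when it is split. The following is only a sketch of that idea: split_rows is a hypothetical helper name, and it assumes the dataset file is UTF-8 encoded. io.open accepts the encoding argument on both Python 2 and Python 3.

```python
import io


def split_rows(path):
    """Yield each non-empty line of a UTF-8 file as a list of columns."""
    # io.open decodes bytes to text while reading, so no per-line
    # .decode('utf-8') call is needed before splitting.
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip()
            if line:
                yield line.split('\t')
```

A file object opened this way could also be passed straight to aida_to_neleval in place of the in-memory sample.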

I forgot to mention that in some rows the kbid contains literal Unicode escape sequences, for example People\u0027s_Republic_of_China where it should be People's_Republic_of_China. Complete rows look like this:

.

But
Le	B	Le Matin	Le_Matin_\u0028France\u0029	http://en.wikipedia.org/wiki/Le_Matin_(France)	/m/03nrccn
Matin	I	Le Matin	Le_Matin_\u0028France\u0029	http://en.wikipedia.org/wiki/Le_Matin_(France)	/m/03nrccn
newspaper
,
quoting
witnesses

How can I fix the format if I've decoded in UTF-8 when splitting?
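One possible approach, sketched here under the assumption that the escapes are always a backslash, the letter u, and exactly four hex digits, is to substitute each sequence with the character it names. The helper name unescape_kbid is hypothetical. A targeted regular expression is used instead of the unicode_escape codec because that codec can mangle text that already contains real non-ASCII characters. This sketch targets Python 3 and does not combine surrogate pairs.

```python
import re

# Matches a literal backslash, 'u', and four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_kbid(kbid):
    """Replace literal backslash-u escapes with the characters they encode."""
    # On Python 2, unichr would be needed in place of chr.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), kbid)


print(unescape_kbid(r'People\u0027s_Republic_of_China'))  # People's_Republic_of_China
print(unescape_kbid(r'Le_Matin_\u0028France\u0029'))      # Le_Matin_(France)
```

Applying it to cols[kbid_col] after splitting would leave the other columns untouched.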

Also, could you take a look at the related issue andychisholm/nel#21? It concerns Conll.py, where the offsets are calculated completely wrong. I tried many different ways, and also tried to apply your method there, but the offsets still don't come out correctly. Many thanks :)
