How to return the specified label for entities? #17
Comments
Hi @xyLinear,
With the current state of this library, I would suggest using the following snippet:
# setting up
import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight')
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)
# see which entities have been recognised
print(doc.ents)
# ---> (Please Please Me, studio album, English, rock)
# see the complete details for the first entity
doc.ents[0]._.dbpedia_raw_result
# --> {'@URI': 'http://dbpedia.org/resource/Please_Please_Me', '@support': '224', '@types': 'Wikidata:Q482994,Wikidata:Q386724,Wikidata:Q2188189,Schema:MusicAlbum,Schema:CreativeWork,DBpedia:Work,DBpedia:MusicalWork,DBpedia:Album', '@surfaceForm': 'Please Please Me', '@offset': '0', '@similarityScore': '0.999998495059141', '@percentageOfSecondRank': '1.5049431293805952E-6'}
# extract the types for each entity, which are separated by commas
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]
for ent, types in ents_with_types:
    print(ent, '\t', types)
# -->
# Please Please Me ['Wikidata:Q482994', 'Wikidata:Q386724', 'Wikidata:Q2188189', 'Schema:MusicAlbum', 'Schema:CreativeWork', 'DBpedia:Work', 'DBpedia:MusicalWork', 'DBpedia:Album']
# studio album ['']
# English ['Wikidata:Q315', 'Schema:Language', 'DBpedia:Language']
# rock ['Wikidata:Q188451', 'DUL:Concept', 'DBpedia:TopicalConcept', 'DBpedia:Genre', 'DBpedia:MusicGenre']
As you can see, this is not the most straightforward approach, but at the moment I cannot think of an easy solution that is compatible with spaCy. Do you think it would help to have the list of types already available, e.g.:
# this is not implemented
for ent in doc.ents:
    print(ent, '\t', ent._.dbpedia_types)
For your second question, let me just check that I understood correctly what you want to do:
If that's correct, this is what you can do:
Step 1: define which entity types you want to use from DBpedia Spotlight, and in which order of priority. The order matters when an entity has more than one of the listed types: the type that comes first in the list is the one that gets used.
Example:
# list of types of interest, order matters
priority_list = ['DBpedia:MusicGenre', 'DBpedia:Album', 'DBpedia:Song', 'DBpedia:MusicalArtist']
# example document as before
text = '''Please Please Me is the debut studio album by the English rock band the Beatles'''
doc = nlp(text)
# now get the types as before
ents_with_types = [(ent, ent._.dbpedia_raw_result['@types'].split(',')) for ent in doc.ents]
# now select a single type and only for the entities with type in priority_list
ents_selected_with_type = []
for ent, types in ents_with_types:
    ent_type = next((t for t in priority_list if t in types), None)
    if ent_type:
        ents_selected_with_type.append((ent, ent_type))
# now we only have 2 entities and their type, this is our gold standard for training
print(ents_selected_with_type)
# convert to the training format of spaCy
TRAIN_DATA = [(text, {'entities': [(ent.start_char, ent.end_char, ent_type) for ent, ent_type in ents_selected_with_type]})]
Step 2: use the approach from this article: https://towardsdatascience.com/train-ner-with-custom-training-data-using-spacy-525ce748fab7
Best,
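P.S. The steps above can be wrapped into a few small helpers. This is only a sketch under stated assumptions: `parse_types`, `pick_type` and `to_train_example` are hypothetical names and not part of this library, and the entities are given as plain `(start_char, end_char, raw_result)` tuples with shortened `@types` values, so the snippet runs without spaCy or the DBpedia Spotlight service. It also drops the empty string that `''.split(',')` produces for entities without types (like "studio album" above).

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical helper names; none of these are part of spacy-dbpedia-spotlight.


def parse_types(raw_result: Dict[str, str]) -> List[str]:
    """Split the comma-separated '@types' field, dropping empty entries."""
    return [t for t in raw_result.get('@types', '').split(',') if t]


def pick_type(types: List[str], priority_list: List[str]) -> Optional[str]:
    """Return the first type in priority_list that the entity has, else None."""
    return next((t for t in priority_list if t in types), None)


def to_train_example(
    text: str,
    spans: List[Tuple[int, int, Dict[str, str]]],
    priority_list: List[str],
) -> Tuple[str, Dict[str, List[Tuple[int, int, str]]]]:
    """Build one (text, annotations) pair, keeping only entities with a wanted type."""
    entities = []
    for start, end, raw in spans:
        label = pick_type(parse_types(raw), priority_list)
        if label:
            entities.append((start, end, label))
    return (text, {'entities': entities})


text = 'Please Please Me is the debut studio album by the English rock band the Beatles'
priority_list = ['DBpedia:MusicGenre', 'DBpedia:Album', 'DBpedia:Song', 'DBpedia:MusicalArtist']

# Stand-ins for (ent.start_char, ent.end_char, ent._.dbpedia_raw_result);
# the '@types' values are shortened versions of the ones printed above.
spans = [
    (0, 16, {'@types': 'Schema:MusicAlbum,DBpedia:MusicalWork,DBpedia:Album'}),
    (30, 42, {'@types': ''}),
    (50, 57, {'@types': 'Schema:Language,DBpedia:Language'}),
    (58, 62, {'@types': 'DBpedia:Genre,DBpedia:MusicGenre'}),
]

example = to_train_example(text, spans, priority_list)
print(example[1])
```

With real data you would build `spans` from `doc.ents` as in the snippet above, and collect one such pair per document into `TRAIN_DATA`.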
Thank you so much! That helped a lot.
Hi @xyLinear,
Therefore, I would consider two things:
Good luck!
I want to use this to annotate my training data. For example, I specified DBpedia:Album, DBpedia:Song, DBpedia:MusicalArtist.
If I print ent.label_, it only returns 'DBPEDIA_ENT' for all entities. Is there any way to actually retrieve Album, Song, etc.?
Additional question: what format should the text be in if I want to use this as training data input for spaCy?
Thank you in advance.