Build a Database with Known Entities #22
I very much like the idea of this, but I'm not convinced the code/data for so many domains should be rolled up into one super-pip install. Probably we want additional projects (like adapt-data-music), and possibly language-specific versions of each. These data sets may be very large, and we want to be respectful of resources on dev boxes as well as end-user devices. I'd be happy to create an adapt-data-music-en repo for you to start playing in, and I'll see if I can find some time to make an adapt-data-weather-en repo to act as an example.
True. But when fetching the entities from Wikidata, there could just be scripts that operate on the SPARQL endpoint and generate the dictionaries.
You can check out a working prototype here: https://github.com/wolfv/adapt/tree/feature-numbers-dates/adapt/tools. I've added the entity_fetcher script and a trie of almost all musicians and bands in Wikidata.
I had completely forgotten about NLTK's data management model! I definitely like that; we'd want to come up with a standardized way/location of storing the data so that it can be cached locally (as opposed to re-running queries unnecessarily).

As for marisa-trie: that looks like a pretty rockin' trie implementation, but it's missing one major feature from the adapt trie: gather. At least, that appears to be the case from my cursory reading of the marisa-trie Python wrapper. I'm not gonna lie, that is some brutally dense code, and having been out of C++ for 5 years (and never having written Cython bindings), I can't make any true claim of understanding it.

I can, however, explain my code! The purpose of gather is to allow us to make N passes on an utterance for entity tagging (one pass per token), as opposed to doing an n-gram expansion on the utterance (which would be N! complexity). Maybe there's a clever way to reimplement (or reverse) that logic so we can use a standard trie implementation but maintain the performance characteristics? I'm open to suggestions.
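A toy sketch of the gather idea described above, not adapt's actual implementation; the names `build_token_trie` and `gather` are made up for illustration. The point is one lookup pass per starting token, walking a token-keyed trie, instead of enumerating every n-gram of the utterance first.

```python
# Toy illustration only: a token-keyed trie and a gather-style scan.
def build_token_trie(entities):
    """Nested-dict trie keyed by tokens; None marks a complete entity name."""
    root = {}
    for name in entities:
        node = root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node[None] = name
    return root

def gather(tokens, trie, start):
    """Yield (start, end, entity) for every entity beginning at token index start."""
    node = trie
    for end in range(start, len(tokens)):
        node = node.get(tokens[end].lower())
        if node is None:
            return                      # no known entity continues from here
        if None in node:
            yield (start, end + 1, node[None])

trie = build_token_trie(["Dire Straits", "Blues Brothers"])
utterance = "play some dire straits for me".split()
for i in range(len(utterance)):         # N passes, one per token position
    for match in gather(utterance, trie, i):
        print(match)                    # -> (2, 4, 'Dire Straits')
```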
Good to hear! Yes, definitely, my idea would be to have a download option that fetches the data from somewhere other than Wikidata, because hitting their servers with these queries all the time would be quite expensive.

Hmm, if I understand gather correctly, one could split all names into tokens (e.g. "Blues Brothers" -> "Blues", "Brothers"). But on a related note, I think that 'in' queries, even with n-gram expansion, are so cheap with the marisa trie that it doesn't really matter. Another option might be to use trie.has_keys_with_prefix(u'fo') to iteratively build up the n-gram expansion. Let me know if this stuff made sense :) However, it will probably be a bit harder to implement the matching with edit distance, I guess...
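A minimal sketch of that has_keys_with_prefix idea, assuming entity names are stored lowercase with tokens joined by single spaces (that key format is an assumption of this sketch, not something marisa-trie requires):

```python
import marisa_trie

# Keys are lowercase entity names with single spaces between tokens.
trie = marisa_trie.Trie([u"dire straits", u"blues brothers", u"blues"])

def find_entities(tokens):
    """Yield (start, end, phrase); has_keys_with_prefix() prunes dead ends."""
    for start in range(len(tokens)):
        phrase = u""
        for end in range(start, len(tokens)):
            phrase = tokens[end] if end == start else phrase + u" " + tokens[end]
            if phrase in trie:                              # exact entity hit
                yield (start, end + 1, phrase)
            if not trie.has_keys_with_prefix(phrase + u" "):
                break                                       # no longer match possible

print(list(find_entities(u"play the blues brothers please".split())))
# -> [(2, 3, 'blues'), (2, 4, 'blues brothers')]
```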
FYI: I still think this is a really interesting idea! I don't believe there's been a ton of progress, but I may revive it in a post-1.0 world. Thanks!
It could be cool to have scripts that can build a database with known entities from Wikidata.
E.g. one could use a SPARQL query like the one sketched below (it can be executed at https://query.wikidata.org) to select all bands in Wikidata.
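The exact query isn't preserved in this thread; here is a sketch of what it might look like, wrapped in a small Python fetch. Q215380 is the band class mentioned below; the requests call, the User-Agent string, and the LIMIT are assumptions of this sketch.

```python
import requests

# Q215380 ("band") and its subclasses; LIMIT keeps the example small.
SPARQL = """
SELECT ?band ?bandLabel WHERE {
  ?band wdt:P31/wdt:P279* wd:Q215380 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "adapt-entity-fetcher-sketch/0.1"},
)
bands = {
    row["bandLabel"]["value"]: row["band"]["value"].rsplit("/", 1)[-1]
    for row in response.json()["results"]["bindings"]
}
# e.g. {"Dire Straits": "Q50040", ...}
```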
Then those entities could be stored in a trie (as done currently), and the trie nodes could hold the query entity (e.g. Q215380) as well as the subject identifier (for example, Dire Straits is wd:Q50040).
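One possible way to keep those identifiers with the names, assuming the name-to-Q-id mapping from the previous sketch, is marisa-trie's BytesTrie, which stores a byte payload alongside each key:

```python
import marisa_trie

# Stand-in for the {name: Q-id} mapping from the previous sketch;
# Q50040 (Dire Straits) and Q215380 (band) are taken from the text above.
bands = {"Dire Straits": "Q50040"}

entity_trie = marisa_trie.BytesTrie(
    (name.lower(), "band:Q215380|{}".format(qid).encode("utf-8"))
    for name, qid in bands.items()
)

print(entity_trie[u"dire straits"])   # [b'band:Q215380|Q50040']
```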
For the intent matching, they could be used as additional information for the probabilities (e.g. optional(Adapt.MusicEntity)).
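The optional(Adapt.MusicEntity) above is pseudocode; with adapt's builder API the rough equivalent would be optionally(...) on an IntentBuilder. A sketch along the lines of adapt's usual usage (the entity and intent names here are made up):

```python
from adapt.engine import IntentDeterminationEngine
from adapt.intent import IntentBuilder

engine = IntentDeterminationEngine()

# In practice the Wikidata-derived names would be registered in bulk;
# a single entity is registered here just to keep the sketch short.
engine.register_entity("play", "PlayVerb")
engine.register_entity("dire straits", "MusicEntity")

music_intent = IntentBuilder("MusicIntent") \
    .require("PlayVerb") \
    .optionally("MusicEntity") \
    .build()
engine.register_intent_parser(music_intent)

for intent in engine.determine_intent("play dire straits"):
    if intent.get("confidence", 0) > 0:
        print(intent)   # should include the tagged MusicEntity
```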