Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A little question about Cluprocessor #240

Open
dvzhang opened this issue Jan 16, 2021 · 9 comments
Open

A little question about Cluprocessor #240

dvzhang opened this issue Jan 16, 2021 · 9 comments
Assignees

Comments

@dvzhang
Copy link

dvzhang commented Jan 16, 2021

A little question.
If I use Cluprocessor to make my index, I can't use the rule like:

  • name: pets_adoption
    type: event
    label: Adoption
    priority: 2 # will run in the second round of extraction, can reference priority 1 rules
    pattern: |
    trigger = [lemma=adopt]
    adopter = >nsubj [] # note: we didn't specify the label, so any token will work
    pet: Pet = >dobj []

to extract information?
what's more, I try the rule of document.event_queries.priorities, and the next.
http://gh.lum.ai/odinson/event_queries.html#priorities

all of them can't work if I use Cluprocessor to make my index.

If I use fastnlp, it takes really long time to annotate the text.

so, maybe, we can do some work to make it fit the syntactic parse of Cluprocessor .

@BeckySharp
Copy link
Contributor

New changes to cluprocessor may give us what we we lacking.

  • profile the annotation/indexing on a given corpus
  • switch to coreclu as default for extra/AnnotateText
  • add a mkPartialAnnotation method as default (configurable?), particularly since we can't currently use the expensive cross-sentence annotations anyway
  • add cool new fields to the defaults (e.g., enhanced SRLs)
  • profile again, celebrate or cry

This is the plan, I'll try to work on it as I can. PLEASE all (@myedibleenso , @marcovzla , @dvzhang ) feel free to comment/complain/add/subtract

@BeckySharp BeckySharp self-assigned this Mar 4, 2021
@myedibleenso
Copy link
Member

enhanced SRLs

@BeckySharp , are you talking about indexing multiple graphs or just some new token attributes? @marcovzla and I discussed indexing multiple graphs per sentence before, but opted not to do it at that time. Eventually, though, I think we'll want to do this and use a field in the rule to tell us which graph to match against. One proposal was to index one merged graph. I think both approaches (one vs many) have advantages and disadvantages.

@BeckySharp
Copy link
Contributor

are you talking about indexing multiple graphs or just some new token attributes?

Great point, I wasn't fully thinking it through, but now that you mention it I remember this and several discussions about how to approach it. I think the lowest hanging fruit is to merge (e.g., deps and SRLs), configurable of course, as it would also give access to mixed rules without needing to make a change to the rule syntax.
That said -- looking longer term, IMO we should be able to write rules against multiple graphs, to be able to add more complex constraints. However, IDK when that will be needed vs simple anticipated as useful...

How do you feel about the merged plan for now?
And really thanks @myedibleenso for thinking it through for/with me!

@myedibleenso
Copy link
Member

An option under the conf for extra related to graph merging is definitely the path of least resistance. Seems like it would be useful. It's straightforward to just combine and distinct edges: List[Edge] and roots: Set[Int] from each processors doc's Sentence to construct a single DirectedGraph and then replace the GraphMap for each processor Sentence before creating the Odinson doc.

Looks like there's already an entry in the processors GraphMap that merges a couple of the graphs (it looks as if it excludes basic):

I sort of think that we might as well merge them all, though...

@BeckySharp
Copy link
Contributor

I sort of think that we might as well merge them all, though...

i'd be ok merging them all unless there are overlapping labels between the basic and enhanced that point to diff things, any easy way to check that you think?

@myedibleenso
Copy link
Member

My guess is that it's most likely to happen with adpositions/case relations. Here is where the conversion from basic -> enhanced takes place in processors:

I didn't have a hand in that code, but I know it is a deterministic process.

I guess one strategy that still keeps things fairly simple is to just merge in a way that prefers enhanced (I assume that's what you want?) where any edge in basic that has the same source and relation label should be filtered out of the merged.

@BeckySharp
Copy link
Contributor

that strategy would account for one issue (that TBH I didn't think about), but I was thinking more that idk, something like >dep in basic would go to word_3 and in enhanced it would go to word_4, and when we wrote the rules we were thinking about enhanced (which we're more familiar with), and so we would expect/want word_4 not word_3....
maybe i'm worried about nothing tho

@myedibleenso
Copy link
Member

myedibleenso commented Mar 4, 2021

If we want to prefer X over Y, I think we can achieve that using the filtering step I described. The downside is we lose some info.

I think we're really just trying to come up with reasonable default behavior.

If a user wants something more customized, that person should create custom Odinson doc JSON and simply index. This annotation runnable is really just to get started quickly with something that is (hopefully) useful to many.

At this point, anyone who wants to index non-English docs or use custom components (ex. third-party sequence taggers for NER) must roll their own Odinson doc creation pipeline.

One of these days I'm going to provide an example that shows how to use spaCy to create Odinson doc JSON.

@BeckySharp
Copy link
Contributor

on it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants