get the position index of concepts and negated concepts #12

Open
quzhouxiachuan opened this issue Feb 8, 2019 · 1 comment

Comments

@quzhouxiachuan

This is more a question than an issue. Is there a function that returns the index position of a keyword and of its negation, at both the sentence level and the note level?

Thanks!

@quzhouxiachuan quzhouxiachuan changed the title get the position index of the keyword and negations. get the position index of concepts and negated concepts Feb 8, 2019
@g-cole

g-cole commented May 28, 2019

I did something like this a while back (older version of pyConTextNLP - v0.6.2.0, so your mileage may vary).
The getSpan() function returns a tuple giving the position of the target or modifier within a sentence, but you also need a way to get the sentence's position in the document. The function in helpers.py that returns a list of sentences isn't well suited to this, so I modified it to output a list of named tuples (sentence.begin, sentence.end), interchangeable with PyRuSH. I ended up just using PyRuSH, but the modified helper is included in my PyConTextPipeline project anyway.
You would then need something like this:

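from PyRuSH.RuSH import RuSH
import pyConTextNLP.pyConTextGraph as pyConTextGraph # module layout of the v0.6.x releases
#import helpers_mod

# `document` is the full note text as a single string
sentence_splitter = RuSH('rush_rules.tsv') # or helpers_mod.sentenceSplitter()
sentences = sentence_splitter.segToSentenceSpans(document)
for sentence in sentences:
    # slice the sentence out of the document using its document-level offsets
    sentence_text = document[sentence.begin:sentence.end]
    markup = pyConTextGraph.ConTextMarkup()
    markup.setRawText(sentence_text)
    markup.cleanText()
    # ...rest of markup code...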
    marked_targets = markup.getMarkedTargets()
    for marked_target in marked_targets:
        print("Target phrase:", marked_target.getPhrase())
        print("Target index:", sentence.begin+marked_target.getSpan()[0])
        modifiers = markup.getModifiers(marked_target)
        for modifier in modifiers:
            print("Modifier phrase:", modifier.getPhrase()) # negation modifier
            print("Modifier index:", sentence.begin+modifier.getSpan()[0])
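
The offset arithmetic above can be tried without pyConTextNLP or PyRuSH installed. This is only a minimal sketch: `SentenceSpan`, `sentence_spans` and `to_document_span` are hypothetical names made up for illustration (they are not the helpers_mod or PyRuSH code), and the regex splitter is deliberately naive. It just shows that a span found inside a sentence becomes a document-level span once the sentence's `begin` offset is added.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the (begin, end) spans a sentence splitter returns.
SentenceSpan = namedtuple("SentenceSpan", ["begin", "end"])

def sentence_spans(document):
    """Naive splitter: a sentence runs up to '.', '!' or '?' (or end of text)."""
    return [SentenceSpan(m.start(), m.end())
            for m in re.finditer(r"\S[^.!?]*[.!?]?", document)]

def to_document_span(sentence, span):
    """Shift a (start, end) span measured inside a sentence to document offsets."""
    return (sentence.begin + span[0], sentence.begin + span[1])

document = "No fever.  Patient denies chest pain."
spans = sentence_spans(document)

# Pretend a target was found in the second sentence at sentence-relative offsets.
second = spans[1]
sentence_text = document[second.begin:second.end]
start = sentence_text.find("chest pain")
local_span = (start, start + len("chest pain"))

doc_span = to_document_span(second, local_span)
print(doc_span, repr(document[doc_span[0]:doc_span[1]]))
```

In the real pipeline, `local_span` would come from `getSpan()` on a target or modifier rather than from `str.find`.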

Note that when you run "markup.cleanText()", it applies the regex rule "REG_CLEAN2 = re.compile(r"""\s+""", re.UNICODE)" from ConTextMarkup.py (or r2 in pyConTextGraph.py in the previous version), which replaces any run of whitespace or newlines with a single space, so the indexes from the method above can be off by a few characters. I got around this by removing the '+', so that each whitespace character is replaced one-for-one and the correct index is preserved, though there will be odd spacing. I'm in the habit of writing my rules with \s+ (i.e. Regex: 'foo\s+bar'), so the odd spacing is usually fine. I prefer to have correct indexes.
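
To make the index drift concrete, here is a small standalone sketch using only the standard `re` module. It assumes the cleaning step amounts to substituting a single space for each match of the pattern; the rest of what `cleanText()` does is not reproduced here, and the names `COLLAPSE` and `PRESERVE` are made up for illustration.

```python
import re

COLLAPSE = re.compile(r"\s+", re.UNICODE)  # the pattern quoted above
PRESERVE = re.compile(r"\s", re.UNICODE)   # same pattern with the '+' removed

text = "no  evidence\n\nof   pneumonia"

collapsed = COLLAPSE.sub(" ", text)  # runs of whitespace become one space
preserved = PRESERVE.sub(" ", text)  # each whitespace char becomes one space

# Offsets of the same word in the original, collapsed and preserved text.
original_index = text.find("pneumonia")
collapsed_index = collapsed.find("pneumonia")
preserved_index = preserved.find("pneumonia")

print(original_index, collapsed_index, preserved_index)
print(len(text), len(collapsed), len(preserved))
```

The one-for-one replacement keeps the text length unchanged, so spans computed on the cleaned text line up with the original document. The cost is that rules written with a literal single space no longer match across double spaces, which is why writing rules with `\s+` matters.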

There may be a better way to do this, and I haven't explored the newest pyConText version yet, so it may no longer be an issue. My solution for getting the exact spans of targets and modifiers in text is included in my pyConTextNLP pipeline tool, PyConTextPipeline, if you're interested.
