get the position index of concepts and negated concepts #12

quzhouxiachuan · 2019-02-08T04:36:25Z

Not really an issue, more of a question: is there any function to get the index positions of keywords and negations, both at the sentence level and at the note level?

Thanks!

g-cole · 2019-05-28T22:53:02Z

I did something like this a while back (older version of pyConTextNLP, v0.6.2.0, so your mileage may vary).
The getSpan() function returns a tuple with the target's or modifier's position in a sentence, but you also need a way of getting the sentence's position in the document. The function in helpers.py that returns a list of sentences isn't the best for this purpose, so I modified it to output a list of named tuples (sentence.begin, sentence.end), interchangeable with PyRuSH. I ended up just using PyRuSH, but the modified version is included in my PyConTextPipeline project anyway.
You would then need something like this:

import pyConTextNLP.pyConTextGraph as pyConTextGraph
from PyRuSH.RuSH import RuSH
#import helpers_mod

# `document` is the full note text as a single string
sentence_splitter = RuSH('rush_rules.tsv') # or helpers_mod.sentenceSplitter()
sentences = sentence_splitter.segToSentenceSpans(document)
for sentence in sentences:
    # slice the sentence out of the document using its document-level span
    sentence_text = document[sentence.begin:sentence.end]
    markup = pyConTextGraph.ConTextMarkup()
    markup.setRawText(sentence_text)
    markup.cleanText()
    # ...rest of markup code...
    marked_targets = markup.getMarkedTargets()
    for marked_target in marked_targets:
        print("Target phrase:", marked_target.getPhrase())
        # getSpan() is relative to the sentence, so add the sentence offset
        print("Target index:", sentence.begin + marked_target.getSpan()[0])
        modifiers = markup.getModifiers(marked_target)
        for modifier in modifiers:
            print("Modifier phrase:", modifier.getPhrase()) # negation modifier
            print("Modifier index:", sentence.begin + modifier.getSpan()[0])

Note that when you run "markup.cleanText()", it applies the regex rule "REG_CLEAN2 = re.compile(r"""\s+""", re.UNICODE)" from ConTextMarkup.py (or r2 in pyConTextGraph.py from the previous version), which replaces any run of whitespace/newline characters with a single space, so the indexes from the above method can be off by a few characters. I got around this by removing the '+', so each whitespace character is replaced one-for-one and the correct index is preserved, though there will be odd spacing. I'm in the habit of writing my rules with \s+ (i.e. Regex: 'foo\s+bar'), so they still match and this is usually fine. I prefer to have correct indexes.
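
To make the off-by-a-few-characters effect concrete, here is a minimal standalone sketch. It does not call pyConTextNLP at all; it only models the whitespace substitution with plain `re`, and the names REG_COLLAPSE, REG_ONE_FOR_ONE, find_span and the sample text are made up for illustration.

```python
import re

# Assumption: these two patterns only model the whitespace step described above.
REG_COLLAPSE = re.compile(r"\s+", re.UNICODE)    # a run of whitespace -> one space
REG_ONE_FOR_ONE = re.compile(r"\s", re.UNICODE)  # each whitespace char -> one space


def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None."""
    match = re.search(pattern, text)
    return match.span() if match else None


# Hypothetical sample sentence with doubled spaces and newlines.
raw = "No  evidence of\n\npulmonary   embolism."
collapsed = REG_COLLAPSE.sub(" ", raw)
preserved = REG_ONE_FOR_ONE.sub(" ", raw)

# Rule written with \s+ so it matches in all three versions of the text.
target = r"pulmonary\s+embolism"
raw_span = find_span(target, raw)
collapsed_span = find_span(target, collapsed)
preserved_span = find_span(target, preserved)

print("raw:      ", raw_span)
print("collapsed:", collapsed_span)  # shifted, because the text got shorter
print("preserved:", preserved_span)  # same as raw, because length is unchanged
```

With the one-for-one substitution the cleaned text has the same length as the raw text, so a span found in it can be used directly to slice the original document.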

There may be a better way to do this, and I haven't explored the newest pyConText version yet, so it may no longer be an issue. My solution for getting the exact spans of targets and modifiers in text is included in my pyConTextNLP pipeline tool, PyConTextPipeline, if you're interested.

quzhouxiachuan changed the title ~~get the position index of the keyword and negations.~~ get the position index of concepts and negated concepts Feb 8, 2019
