Error parsing specific text within double quotes. #1328

Open
brochington opened this issue Jul 19, 2022 · 4 comments

@brochington

I am running into a parsing discrepancy between very similar sentences:

Parses: "Create a folder named "Right Here"."
Doesn't parse: "Create a folder named "Something Here"."

Note the only difference is replacing the word "Right" with "Something".

Thoughts?

@ampli
Member

ampli commented Jul 23, 2022

This happens because "Something" is not marked in en/4.0.dict with <marker-common-entity>.
@linas, would it be an improvement if words within quotes were treated as "common-entity"?

@linas
Member

linas commented Aug 4, 2022

Sorry for the late reply. The correct fix is not obvious.

One possibility would be to run optional de-capitalization on the first word after an open-quote (because quoted text is often a full-fledged sentence, including capitalization).

However, in this case, "something Here" (with lower-case s but upper-case H) is not a valid sentence -- it seems that the entire sequence was meant to be a named entity. Usually, named entities have names like "Great Southern and Northern Railroad Bank Corporation" -- all capitalized nouns and adjectives -- and "Something Here Banking Corporation LLC" doesn't quite fit that pattern.

My calendar is very busy for the next few weeks. I'm not sure I can think clearly about this just right now. Perhaps one fix is to add an UNKNOWN_CAP_WORD regex, which would use the common-entity disjunct class. Perhaps this is the best fix? That way, anything that consists of All Cap Words and Other Stuff automatically parses as a single entity.
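
To make the capitalized-word idea concrete, here is a minimal Python sketch of the kind of pattern such a regex might use. It is an illustration only: it uses Python `re` syntax rather than the syntax of the 4.0.regex file, and the names `CAP_WORD`, `is_cap_word` and `is_all_cap_words` are hypothetical.

```python
import re

# Hypothetical pattern for one capitalized word: an uppercase ASCII letter
# followed by letters, digits, apostrophes, or hyphens.
# This is Python `re` syntax, not Link Grammar's 4.0.regex syntax.
CAP_WORD = re.compile(r"[A-Z][A-Za-z0-9'-]*")


def is_cap_word(token: str) -> bool:
    """Return True if the whole token looks like a capitalized word."""
    return CAP_WORD.fullmatch(token) is not None


def is_all_cap_words(phrase: str) -> bool:
    """Return True if every whitespace-separated token is a capitalized word."""
    tokens = phrase.split()
    return bool(tokens) and all(is_cap_word(t) for t in tokens)
```

Under these assumptions "Something Here" and "Right Here" both qualify, while "something Here" does not. A name containing a lower-case connective such as "and" would also be rejected, so a real rule would likely need to be more permissive.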

(My main desktop computer died ... I can't do any work just right now; I can't even run LG right now.)

@brochington
Author

@linas Thanks for the response. For my use case, basically determining a touch "Something Here", the capitalization is not too relevant, and everything between the two double quotes should be captured as a whole. I'm wondering if it would be better to key off a preceding word such as called | named | titled.
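
As a rough illustration of keying off a preceding word, the following Python sketch extracts the quoted text that follows called, named or titled. It runs outside the parser, and the names `NAMED_QUOTE` and `quoted_names` are made up for this example.

```python
import re

# Assumed trigger words, taken from the list above; extend as needed.
# The capture group is everything between the double quotes that follow.
NAMED_QUOTE = re.compile(r'\b(?:called|named|titled)\s+"([^"]+)"')


def quoted_names(text: str) -> list[str]:
    """Return the quoted strings that directly follow a trigger word."""
    return NAMED_QUOTE.findall(text)
```

With this sketch, `quoted_names('Create a folder named "Something Here".')` yields the single entry `Something Here`, regardless of capitalization, while quoted text without a trigger word in front of it is ignored.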

Thinking a little ahead: is there a way to pass in something like a "dictionary extension" when parsing occurs? I have a list of proper nouns that I would like to be identified, and that list can change based on context external to the parser. However, I would like to avoid calling dictionary_create on every call. Thoughts?

@ampli
Member

ampli commented Aug 5, 2022

Without an extension to the library, maybe you can use the following solution (or maybe "solution"), which requires manipulating the text before parsing:

  1. Identify words within double quotes (e.g. by regex).
  2. Replace the blanks with a special character not in the character set used in your text (e.g. a letter from another language).
    You can do this conditionally according to the particular words.

If needed, you can add a regex to the 4.0.regex file that identifies strings containing the said special character, and add its name to <UNKNOWN-WORD.a> and <UNKNOWN-WORD.n> (or even add an additional entry for it with <marker-common-entity>).
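
The two preprocessing steps above could look roughly like this in Python. This is a minimal sketch under stated assumptions: the joiner character (Greek lambda here) is an arbitrary choice assumed not to occur in the input text, and the names `JOINER`, `join_quoted_words` and `restore_words` are hypothetical, not part of the library.

```python
import re

# Assumed joiner: a letter from another alphabet that never occurs in the
# input text. Greek small letter lambda is an arbitrary choice.
JOINER = "\u03bb"

# Step 1: find text within double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def join_quoted_words(text: str, joiner: str = JOINER) -> str:
    """Step 2: replace blanks inside each quoted span with the joiner."""
    def repl(match: re.Match) -> str:
        inner = match.group(1).strip()
        return '"' + re.sub(r"\s+", joiner, inner) + '"'
    return QUOTED.sub(repl, text)


def restore_words(token: str, joiner: str = JOINER) -> str:
    """Undo the substitution on a token taken from the parse result."""
    return token.replace(joiner, " ")
```

The text returned by `join_quoted_words` is what would be handed to the parser, and `restore_words` recovers the original name afterwards. Making the replacement conditional on the particular words, as suggested in step 2, would mean adding a check inside `repl`.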
