Pure Text extraction from HOCR is HTML entity encoded #81

Open
DiegoPino opened this issue Aug 2, 2023 · 2 comments
Assignees: DiegoPino
Labels: enhancement, help wanted, Post processor Plugins, Solr Indexing
Milestone: 0.6.0

Comments

@DiegoPino (Member)

What?

When we extract the pure OCR text from the HOCR/PDFALTO output, we keep the HTML entity encoding. This hurts Views display because, internally, Twig cannot decode the entities and will double-encode them.

I think (just a theory) this can be fixed here:

$page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';
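
A minimal sketch of what that fix could look like, assuming it is enough to decode entities right after the tags are stripped. The helper name `plain_text_from_fulltext()` is hypothetical and not part of the module; `html_entity_decode()` is the standard PHP function. Decoding happens after `strip_tags()` so that an encoded `&lt;` in the OCR text becomes a literal character instead of being treated as markup.

```php
<?php
// Hypothetical helper (not in the module): mirrors the existing line and
// adds an entity-decoding step at the end.
function plain_text_from_fulltext(?string $fulltext): string {
  if ($fulltext === null || $fulltext === '') {
    return '';
  }
  // Same as the current code: start each <l> on a new line, then drop tags.
  $text = strip_tags(str_replace("<l>", PHP_EOL . "<l> ", $fulltext));
  // Assumed fix: turn entities such as &amp; or &#39; into real characters.
  return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

In the line above, this would replace the ternary with something like `$page_text = plain_text_from_fulltext($output->searchapi['fulltext'] ?? null);`.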

Basically, we don't want this:

image

The question (if we fix this) is how we remediate existing OCRs. One way would be, on reindex, to detect whether the already cached plain text has HTML entities, decode it, and "update" the cache, somewhere here:

https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661 but it could also be a hook_update()?
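
The detection step could be sketched roughly as follows. This is only an illustration: `decode_cached_plain_text()` is a hypothetical name, and how the cache is actually read and written in the datasource is not shown. It returns the decoded text when decoding changes something and NULL otherwise, so a caller would only rewrite the cache when needed.

```php
<?php
// Hypothetical helper: returns decoded text, or NULL if nothing would change.
function decode_cached_plain_text(string $cached): ?string {
  // Cheap pre-check: without an ampersand there can be no entity.
  if (strpos($cached, '&') === false) {
    return null;
  }
  $decoded = html_entity_decode($cached, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // A bare "&" (as in "AT&T") decodes to itself, so it is reported as unchanged.
  return $decoded !== $cached ? $decoded : null;
}
```

One caveat with this approach: plain text that legitimately contains something that looks like an entity cannot be told apart from encoded text, so it would be decoded too.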

@aksm what do you think? @alliomeria what do you think? @karomabiles what do you think?

DiegoPino self-assigned this Aug 2, 2023
DiegoPino added this to the 0.6.0 milestone Aug 2, 2023
DiegoPino added the enhancement, help wanted, Solr Indexing, and Post processor Plugins labels Aug 2, 2023
aksm commented Aug 4, 2023

@DiegoPino I think I need more context/clarification to understand the issue. Can we discuss on the next team call?

DiegoPino (Member, Author) commented Aug 4, 2023 via email
