Pure Text extraction from HOCR is HTML entity encoded #81
Labels: enhancement, help wanted, Post processor Plugins, Solr Indexing
What?
When we produce the plain OCR text from the HOCR/PDFALTO extraction, we keep the HTML entity encoding. This hurts Views display because Twig cannot decode the entities internally and ends up double encoding them.
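A minimal sketch of the double-encoding effect, written in Python purely for illustration (the module itself is PHP, and the sample string is made up). It assumes the template layer auto-escapes its input, which is Twig's default behavior, and uses `html.escape` as a stand-in for that step.

```python
import html

# Hypothetical OCR plain text that still carries an HTML entity from the HOCR source.
ocr_text = "Smith &amp; Sons"

# An auto-escaping template layer escapes the ampersand again,
# so the browser ends up displaying the raw entity to the user.
double_encoded = html.escape(ocr_text)

# Decoding at extraction time hands the template layer clean text,
# which is then escaped exactly once.
decoded = html.unescape(ocr_text)
rendered = html.escape(decoded)

print(double_encoded)  # Smith &amp;amp; Sons
print(rendered)        # Smith &amp; Sons
```

In PHP, the equivalent of the decoding step would presumably be a call to `html_entity_decode()` on the extracted text before it is stored.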
My theory (untested) is that this can be fixed here:
strawberry_runners/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php
Lines 355 to 356 in 9d3bf9e
Basically, we don't want this:
The question, if we fix this, is how to remediate existing OCRs. One option is to detect, on reindex, whether the already cached plain text contains HTML entities, decode them, and update the cache, somewhere here:
https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661
It could also be done in a hook_update().
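The reindex-time remediation idea could look roughly like the sketch below. It is Python for illustration only, not the datasource's actual code: the `cache` dictionary, the key name, and both helper functions are hypothetical stand-ins for whatever cache backend and lookup the datasource really uses.

```python
import html
import re

# Matches named (&amp;), decimal (&#39;) and hexadecimal (&#x27;) entities.
ENTITY_PATTERN = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def has_html_entities(text: str) -> bool:
    """Return True if the text contains something shaped like an HTML entity."""
    return ENTITY_PATTERN.search(text) is not None


def remediate_cached_text(cache: dict, key: str) -> bool:
    """Decode entities in cache[key] in place. Return True if the entry changed."""
    text = cache.get(key)
    if text is None or not has_html_entities(text):
        return False
    decoded = html.unescape(text)
    if decoded == text:
        # Entity-shaped but unknown (e.g. "&foo;"): nothing to update.
        return False
    cache[key] = decoded
    return True


# Hypothetical cached plain text keyed by an illustrative flavor ID.
cache = {"flavor:1": "Tom &amp; Jerry&#39;s", "flavor:2": "plain & simple"}
print(remediate_cached_text(cache, "flavor:1"))  # True
print(cache["flavor:1"])                         # Tom & Jerry's
print(remediate_cached_text(cache, "flavor:2"))  # False
```

Because already decoded text no longer matches the pattern, running the check again on later reindexes leaves the entry alone. The caveat is that text which legitimately contains a literal entity string would also be decoded, so a one-off update hook may be the safer route if that case matters.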
@aksm, @alliomeria, @karomabiles: what do you think?