Pure Text extraction from HOCR is HTML entity encoded #81

DiegoPino · 2023-08-02T14:56:35Z

What?

When we extract the pure OCR text from the HOCR/PDFALTO, we keep the HTML entity encoding. This hurts Views display: internally, Twig cannot decode the entities and will double-encode them.
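To illustrate the double encoding, here is a minimal sketch in plain PHP. It assumes Twig's HTML autoescaping behaves like `htmlspecialchars()` (which is what its `escape` filter is built on for the `html` strategy); the sample string is made up and not taken from a real OCR.

```php
<?php
// Text as it is stored today: the entity from the HOCR is still encoded.
$stored = 'Tom &amp; Jerry';

// Roughly what autoescaping does on output: the "&" of "&amp;" is escaped again.
$rendered = htmlspecialchars($stored, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
// $rendered is 'Tom &amp;amp; Jerry', which a browser displays as "Tom &amp; Jerry".

// If the stored text were decoded first, escaping would happen only once.
$decoded = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$fixed = htmlspecialchars($decoded, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
// $fixed is 'Tom &amp; Jerry', which a browser displays as "Tom & Jerry".
```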

I think (just a theory) this can be fixed here:

strawberry_runners/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php

Lines 355 to 356 in 9d3bf9e

    
    $page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
      PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';
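A minimal sketch of what the fix could look like, assuming that decoding right after `strip_tags()` is enough. The function name `sbr_sketch_plain_text()` is hypothetical and not part of strawberry_runners; only the `strip_tags(str_replace(...))` part mirrors the referenced lines.

```php
<?php

/**
 * Hypothetical helper (not part of strawberry_runners).
 *
 * Builds plain text from the fulltext markup and decodes HTML entities.
 */
function sbr_sketch_plain_text(?string $fulltext): string {
  if ($fulltext === NULL) {
    return '';
  }
  // Same as the current code: start a new line before each <l>, then drop tags.
  $text = strip_tags(str_replace("<l>", PHP_EOL . "<l> ", $fulltext));
  // New step: decode entities so the output layer only has to encode once.
  // Decoding happens after strip_tags() so a decoded "<" is never treated as a tag.
  return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}
```

With this, `$page_text = sbr_sketch_plain_text($output->searchapi['fulltext'] ?? NULL);` would replace the ternary. The order matters: decoding before `strip_tags()` could turn an OCR'd `&lt;` into a real `<` and make the text after it look like a tag.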

Basically, we don't want this:

The question (if we fix this) is how we remediate existing OCRs. One way would be to detect, on reindex, whether the already cached plain text has HTML entities, then decode it and "update" the cache, somewhere here:

https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661 but it could also be a hook_update()?
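The detection step could be sketched like this. The function name is hypothetical, and how the cached plain text is actually read and written back in the datasource is deliberately left out, since that depends on code not shown in this issue.

```php
<?php

/**
 * Hypothetical helper: decodes cached plain text and reports if it changed.
 *
 * @return array
 *   'text' holds the decoded string, 'changed' is TRUE when entities were found.
 */
function sbr_sketch_remediate_cached_text(string $cached): array {
  $decoded = html_entity_decode($cached, ENT_QUOTES | ENT_HTML5, 'UTF-8');
  return [
    'text' => $decoded,
    // If decoding is a no-op there were no entities, so the cache can stay as is.
    'changed' => $decoded !== $cached,
  ];
}
```

The caller would only rewrite the cache when `changed` is TRUE. One caveat: this cannot tell an encoded ampersand apart from a page whose printed text literally reads `&amp;`, so it should run once per cached item rather than on every reindex.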

@aksm what do you think? @alliomeria what do you think? @karomabiles what do you think?

aksm · 2023-08-04T14:49:35Z

@DiegoPino I think I need more context/clarification to understand the issue. Can we discuss on the next team call?

DiegoPino · 2023-08-04T15:03:12Z

Please ingest an ADO with a PDF and look at the OCR directly in Solr and in a View to see what I am describing here. Thanks!


DiegoPino self-assigned this Aug 2, 2023

DiegoPino added this to the 0.6.0 milestone Aug 2, 2023

DiegoPino added the enhancement, help wanted, Solr Indexing, and Post processor Plugins labels Aug 2, 2023
