How to get the words in the right order from the json result file? #511

piegu · 2021-09-29T20:03:25Z

piegu
Sep 29, 2021

From this video, we understand that we can parse the json result file in order to get the predicted words.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
doc = DocumentFile.from_pdf(path_to_pdf).as_images()
result = model(doc)
json_result = result.export()

However, with a more complicated example (for example, a 2-column PDF document), the parsing of the json result file is not easy.

If we do it as shown in the video, the predicted words do not come out in the right order (they are printed line by line across the page rather than column by column). Clearly, we need a script based on the coordinates (xmin, ymin, xmax, ymax from the geometry key).

Could you provide this script? Thanks.

Note: I guess this logic is already coded in the method result.synthesize(), but I could not find it, nor any information about this issue in the documentation.
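A minimal sketch of what such a coordinate-based script could look like, working only on the dictionary returned by result.export(). It is not docTR's own ordering logic. It assumes exactly two columns separated at a fixed relative x position (`split_x`, a hypothetical parameter) and groups words into rows by bucketing their vertical centre (`line_tol`, also hypothetical). Words whose centres fall on either side of a bucket boundary may be misordered, so treat the values as starting points to tune.

```python
def iter_words(export, page_idx=0):
    """Yield every word dict of one page, in the order stored in the export."""
    page = export["pages"][page_idx]
    for block in page["blocks"]:
        for line in block["lines"]:
            yield from line["words"]


def words_in_column_order(export, page_idx=0, split_x=0.5, line_tol=0.01):
    """Return word values read down the left column, then down the right one.

    split_x and line_tol are illustrative assumptions, expressed in the
    relative (0 to 1) coordinates used by the 'geometry' key.
    """
    left, right = [], []
    for word in iter_words(export, page_idx):
        (xmin, _), (xmax, _) = word["geometry"]
        x_center = (xmin + xmax) / 2
        (left if x_center < split_x else right).append(word)

    def reading_key(word):
        (xmin, ymin), (_, ymax) = word["geometry"]
        y_center = (ymin + ymax) / 2
        # Bucket the vertical centre so words on the same row sort by xmin.
        return (round(y_center / line_tol), xmin)

    ordered = sorted(left, key=reading_key) + sorted(right, key=reading_key)
    return [word["value"] for word in ordered]


# Hand-made export with two columns, stored row by row as in the question.
sample_export = {"pages": [{"blocks": [{"lines": [{"words": [
    {"value": "Left", "geometry": [[0.05, 0.10], [0.15, 0.12]]},
    {"value": "column", "geometry": [[0.16, 0.10], [0.30, 0.12]]},
    {"value": "Right", "geometry": [[0.55, 0.10], [0.65, 0.12]]},
    {"value": "side", "geometry": [[0.66, 0.10], [0.75, 0.12]]},
    {"value": "text", "geometry": [[0.05, 0.20], [0.12, 0.22]]},
]}]}]}]}

ordered_words = words_in_column_order(sample_export)
print(" ".join(ordered_words))
```

A fixed split is the simplest possible choice. Pages with a variable number of columns would need the column boundaries to be detected first, for example from gaps in the distribution of word x-centres.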

Answered by fg-mindee

Nov 17, 2021

For anyone looking for a solution, as mentioned by @charlesmindee earlier, we integrated line aggregation in #537. This should make its way to a release this week, but for now, you will need to install the developer version to enjoy the benefits on the high-level API.

It is enabled by default, so the basic usage snippet will work:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
doc = DocumentFile.from_pdf("path/to/your.pdf").as_images()
result = model(doc)
json_result = result.export()

Feel free to ask if you have any questions :)


fg-mindee · 2021-09-30T22:46:23Z

fg-mindee
Sep 30, 2021

Hi @piegu 👋

Just to be clear, the video you mentioned was not created by any of the library authors, so we cannot guarantee that the behaviour will not change from what is shown in it 😅

For 2-column pages, this is expected behaviour for now!
I'm not sure what you're asking for. A script that will dynamically determine whether there are one or several columns?
What is implemented for page synthesis does not require any ordering of predictions since their location is enough :)

0 replies

piegu · 2021-10-01T13:13:46Z

piegu
Oct 1, 2021
Author

Hi @fg-mindee

Thanks for your answer. I'm just searching for a way to print the text found by DocTR.

For example, if I want to print in my notebook the text from an image with Tesseract and OpenCV, I run the following 3 lines:

img = cv2.imread(path_to_img)
text = pytesseract.image_to_string(img)
print(text)

With DocTR, it looks like the text is in the json result file:

{'pages': [{'blocks': [{'artefacts': [],
     'geometry': [[0.056640625, 0.0244140625], [0.94921875, 0.9677734375]],
     'lines': [{'geometry': [[0.056640625, 0.0244140625],
        [0.94921875, 0.9677734375]],
       'words': [{'confidence': 0.6669960618019104,
         'geometry': [[0.056640625, 0.0263671875],
          [0.1123046875, 0.0380859375]],
         'value': 'like'},
         ......
         ......
}

But how to print the words in my notebook?
I tried to read it through a loop, but the order of the printed words is not correct (it is worse when there are 2 columns, but even with text in one column, the order is not perfect).

What do you think?
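For reference, a minimal sketch of such a loop over the exported dictionary. It joins words with spaces, lines with newlines, and blocks with blank lines, which mirrors the string format of the .render() output quoted elsewhere in this thread. That convention, the page separator, and the function name `export_to_text` are assumptions for illustration, not docTR's implementation. The loop keeps the order stored in the export, so it does not by itself fix column ordering.

```python
def export_to_text(export, page_idx=None):
    """Flatten an exported result dict into plain text.

    page_idx=None renders every page; an integer renders only that page.
    """
    pages = export["pages"] if page_idx is None else [export["pages"][page_idx]]
    page_texts = []
    for page in pages:
        block_texts = []
        for block in page["blocks"]:
            # One output line per detected line, words separated by spaces.
            line_texts = [
                " ".join(word["value"] for word in line["words"])
                for line in block["lines"]
            ]
            block_texts.append("\n".join(line_texts))
        page_texts.append("\n\n".join(block_texts))
    # Assumption: pages are separated the same way as blocks.
    return "\n\n".join(page_texts)


# Hand-made export containing only the keys the function reads.
sample_export = {"pages": [{"blocks": [
    {"lines": [{"words": [{"value": "CASH"}, {"value": "PAYMENT"}, {"value": "RECEIPT"}]}]},
    {"lines": [
        {"words": [{"value": "Company"}, {"value": "Name:"}]},
        {"words": [{"value": "Street"}, {"value": "Address:"}]},
    ]},
]}]}

text = export_to_text(sample_export, page_idx=0)
print(text)
```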

5 replies

fg-mindee Oct 1, 2021

Thanks for the clarification!

So it's not well documented, but using the output of your predictor, instead of using the .export() method to produce the JSON, you can use the .render() method to render the string you are looking for 😄 (if I understood correctly)

Though I have to warn you that the order is not necessarily checked as thoroughly as it might be with Tesseract, but that's a foreseeable improvement we could make!

dhea1323 Mar 2, 2022

Is it possible to render the string with .render() for a selected page? I want to render only the string for the page I choose 😄

fg-mindee Mar 7, 2022

Hi @dhea1323 👋

Of course! Here is a short snippet:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Load doc & model
doc = DocumentFile.from_pdf('/path/to/doc.pdf')
model = ocr_predictor(pretrained=True)

# Inference
result = model(doc)

And now you can display any page:

result.pages[0]

Page(
  dimensions=(1584, 1224)
  (blocks): [
    Block(
      (lines): [Line(
        (words): [
          Word(value='CASH', confidence=1.0),
          Word(value='PAYMENT', confidence=0.99),
          Word(value='RECEIPT', confidence=0.99),
        ]
      )]
      (artefacts): []
    ),
    Block(
      (lines): [
        Line(
          (words): [
            Word(value='Company', confidence=1.0),
            Word(value='Name:', confidence=1.0),
          ]
        ),
        Line(
          (words): [
            Word(value='Street', confidence=1.0),
            Word(value='Address:', confidence=1.0),
          ]
        ),
        Line(
          (words): [
            Word(value='City,', confidence=0.99),
            Word(value='State,', confidence=1.0),
            Word(value='Zip:', confidence=1.0),
          ]
        ),
        Line(
          (words): [Word(value='Phone:', confidence=1.0)]
        ),
        Line(
          (words): [Word(value='Fax:', confidence=1.0)]
        ),
        Line(
          (words): [Word(value='Email:', confidence=1.0)]
        ),
        Line(
          (words): [Word(value='Website:', confidence=0.92)]
        ),
        Line(
          (words): [Word(value='Date:', confidence=1.0)]
        ),
      ]
      (artefacts): []
    ),
    Block(
      (lines): [Line(
        (words): [
          Word(value='Receipt', confidence=1.0),
          Word(value='#:', confidence=1.0),
        ]
      )]
      (artefacts): []
    ),
    Block(
      (lines): [Line(
        (words): [
          Word(value='Payment', confidence=0.94),
          Word(value='Information', confidence=0.99),
        ]
      )]
      (artefacts): []
    ),
    Block(
      (lines): [
        Line(
          (words): [
            Word(value='Paid', confidence=1.0),
            Word(value='By:', confidence=0.96),
          ]
        ),
        Line(
          (words): [
            Word(value='Amount', confidence=0.97),
            Word(value='Paid:', confidence=1.0),
          ]
        ),
        Line(
          (words): [
            Word(value='For', confidence=0.92),
            Word(value='Payment', confidence=1.0),
            Word(value='Of:', confidence=1.0),
          ]
        ),
      ]
      (artefacts): []
    ),
    Block(
      (lines): [Line(
        (words): [
          Word(value='Dollars', confidence=0.99),
          Word(value='($', confidence=0.95),
        ]
      )]
      (artefacts): []
    ),
    Block(
      (lines): [
        Line(
          (words): [
            Word(value='Subtotal:', confidence=0.34),
            Word(value='$', confidence=0.98),
          ]
        ),
        Line(
          (words): [
            Word(value='Tax', confidence=0.96),
            Word(value='Rate', confidence=1.0),
            Word(value='(%):', confidence=0.99),
          ]
        ),
        Line(
          (words): [
            Word(value='Total', confidence=0.8),
            Word(value='Tax:', confidence=1.0),
            Word(value='$', confidence=0.98),
          ]
        ),
      ]
      (artefacts): []
    ),
    Block(
      (lines): [
        Line(
          (words): [
            Word(value='Total', confidence=0.89),
            Word(value='Amount', confidence=1.0),
            Word(value='Due:', confidence=1.0),
            Word(value='$', confidence=0.98),
          ]
        ),
        Line(
          (words): [
            Word(value='Amount', confidence=0.99),
            Word(value='Paid:', confidence=1.0),
            Word(value='$', confidence=0.98),
          ]
        ),
        Line(
          (words): [
            Word(value='Remaining', confidence=1.0),
            Word(value='Balance:', confidence=0.72),
            Word(value='$', confidence=0.97),
          ]
        ),
      ]
      (artefacts): []
    ),
    Block(
      (lines): [
        Line(
          (words): [
            Word(value='Received', confidence=1.0),
            Word(value='By:', confidence=1.0),
          ]
        ),
        Line(
          (words): [
            Word(value='Authorized', confidence=1.0),
            Word(value='Signature', confidence=0.79),
          ]
        ),
      ]
      (artefacts): []
    ),
    Block(
      (lines): [Line(
        (words): [
          Word(value='Page', confidence=1.0),
          Word(value='1', confidence=1.0),
          Word(value='of', confidence=1.0),
          Word(value='1', confidence=1.0),
        ]
      )]
      (artefacts): []
    ),
  ]
)

or render it as a string:

result.pages[0].render()

'CASH PAYMENT RECEIPT\n\nCompany Name:\nStreet Address:\nCity, State, Zip:\nPhone:\nFax:\nEmail:\nWebsite:\nDate:\n\nReceipt #:\n\nPayment Information\n\nPaid By:\nAmount Paid:\nFor Payment Of:\n\nDollars ($\n\nSubtotal: $\nTax Rate (%):\nTotal Tax: $\n\nTotal Amount Due: $\nAmount Paid: $\nRemaining Balance: $\n\nReceived By:\nAuthorized Signature\n\nPage 1 of 1'

Let me know if you need extra help 👌

dhea1323 Mar 14, 2022

Thanks a lot, it helps 👍

fg-mindee Mar 14, 2022

You're welcome, glad it helped!

piegu · 2021-10-02T19:16:06Z

piegu
Oct 2, 2021
Author

Thank you @fg-mindee.

I tested the .render() method. It works but, as you said, with some limitations compared to Tesseract.

In order to test DocTR and Tesseract, I published a blog post and a notebook:

Question: is it possible with DocTR to use a recognition model for a language other than English (for example, Portuguese)?

1 reply

fg-mindee Oct 4, 2021

Hey @piegu,

Thanks a lot for the blogpost!
Actually #512 is also about this topic, so the .render method might become suitable rather soon 👍

About your question, DocTR is about visual text recognition for now, so the language doesn't matter as long as it uses a supported vocab. Currently our models are trained on the French vocab, which differs from Portuguese by only a few accented characters. So if you're not too picky about accents, this should do :)

And we recently got a new entry in the available vocabs for Portuguese (#464), so you can easily finetune our recognition models on this vocab if you have suitable data for training!
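To gauge how much the vocab difference matters for a given corpus, one could list the characters a text uses that a vocab does not cover. The sketch below does this with plain Python strings. The vocab string is hand-written for illustration and is not docTR's actual French vocab, and `missing_characters` is a hypothetical helper name.

```python
def missing_characters(text, vocab):
    """Return the sorted, de-duplicated non-space characters of text absent from vocab."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in vocab})


# Illustrative vocab only: basic Latin letters, digits, some punctuation,
# and a handful of accented characters common in French.
illustrative_vocab = (
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    ".,;:!?'\"()-"
    "àâéèêëîïôùûüçÀÂÉÈÊËÎÏÔÙÛÜÇ"
)

sample_text = "Informação não disponível"
missing = missing_characters(sample_text, illustrative_vocab)
print(missing)
```

Characters reported as missing cannot be predicted by a model trained on that vocab, so they would typically come out as their closest unaccented look-alike.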

Let me know if that doesn't help 👌

piegu · 2021-10-04T12:20:53Z

piegu
Oct 4, 2021
Author

Thanks @fg-mindee. I will check these topics.

0 replies

Rob192 · 2021-10-08T09:10:21Z

Rob192
Oct 8, 2021

Hello everyone! On my side, I am getting neither block nor line segmentation, meaning every word ends up inside the same block and the same line. Is it just me, or is the feature not yet implemented? I am wondering what your thoughts are on how you will implement this.
For the moment, and because I need to get the words inside the same line and in the correct order, I use JenksNaturalBreaks from jenkspy to 'cluster' the word boxes by lines. Once this is done, I reorder the boxes inside each line into the correct order.
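As a rough illustration of the idea of clustering word boxes into lines, here is a dependency-free sketch. It is not the Jenks natural breaks method mentioned above: it swaps in a simpler gap threshold on the vertical centres of the boxes. The `max_gap` value and the function name are assumptions, and the input is the list of word dicts found in the exported JSON.

```python
def group_words_into_lines(words, max_gap=0.01):
    """Cluster word dicts into lines, each line sorted left to right.

    A new line starts whenever the vertical centre of a word is more than
    max_gap (relative page units, an illustrative value) below the previous one.
    """
    def y_center(word):
        (_, ymin), (_, ymax) = word["geometry"]
        return (ymin + ymax) / 2

    lines = []
    for word in sorted(words, key=y_center):
        if lines and y_center(word) - y_center(lines[-1][-1]) <= max_gap:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, restore left-to-right order using xmin.
    return [sorted(line, key=lambda w: w["geometry"][0][0]) for line in lines]


# Hand-made words, deliberately shuffled and slightly misaligned vertically.
sample_words = [
    {"value": "world", "geometry": [[0.30, 0.101], [0.45, 0.121]]},
    {"value": "line", "geometry": [[0.20, 0.200], [0.30, 0.220]]},
    {"value": "Hello", "geometry": [[0.05, 0.100], [0.25, 0.120]]},
    {"value": "Second", "geometry": [[0.05, 0.199], [0.18, 0.219]]},
]

grouped = [[w["value"] for w in line] for line in group_words_into_lines(sample_words)]
print(grouped)
```

A fixed threshold works when line spacing is fairly uniform. A data-driven break detection such as Jenks adapts better to pages mixing font sizes.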

3 replies

fg-mindee Oct 8, 2021

Hey @Rob192,

So there are actually methods already implemented for this in the DocumentBuilder: https://github.com/mindee/doctr/blob/main/doctr/models/builder.py#L29

They're disabled by default, but you can do:

predictor.doc_builder.resolve_lines = True
predictor.doc_builder.resolve_blocks = True

But as mentioned in this discussion, this still is a work in progress :)

charlesmindee Oct 22, 2021
Maintainer

Hi @Rob192,
There was a huge bug in line resolution, fixed by #537. Once this PR is merged, it should work correctly (at least line resolution)!

orlandito24 Sep 12, 2023

Hi @fg-mindee,

I'm trying to locate where I should be adding these lines to change the default.

fg-mindee · 2021-10-29T18:20:19Z

fg-mindee
Oct 29, 2021

Hi @piegu 👋

Would you mind marking the relevant message as answer for this discussion please? It will help potential future visitors to quickly identify the ins & outs of the topic :)

4 replies

piegu Oct 29, 2021
Author

Hi @fg-mindee,

I would love to, but I'm not sure which message is the answer for this discussion, since you wrote "But as mentioned in this discussion, this still is a work in progress :)"

fg-mindee Oct 29, 2021

I see! Have you managed to check out the behaviour now that #537 was merged? :)
I believe this would match your needs!

fg-mindee Nov 4, 2021

Any update? 😁

fg-mindee Nov 6, 2021

Also, this other discussion might help 👉 #531

