Extract only retrieves name of PDF? #140

clemenslinders · 2020-02-21T07:36:47Z

LS,

I am using Visual Studio C# and when I use:
SearchableText = new TextExtractor().Extract(MyActivePdf).Text;

The variable SearchableText is usually filled with the text content of the PDF.

However, for some PDFs SearchableText contains only a long run of spaces, the name of the PDF, and then another long run of spaces. When I open the same PDF in a browser I can select any part of the text, so Extract should be able to extract this text for me.

My guess is that a PDF whose text is selectable in a browser (and therefore extractable) has multiple data streams: some contain text placed at specific locations in the PDF, and perhaps one contains the file name.

There are many programs that create PDFs, and possibly the program that created these PDFs mixes something up so that TikaOnDotNet hooks into the wrong stream.

Is there a way to connect to the correct stream of data so that I can also retrieve the text in this case?

PS: I do not get an error; I only get the name of the PDF surrounded by a lot of spaces, not the text.

Kind regards,

Clemens Linders
