Extract only retrieves name of PDF? #140

Open
clemenslinders opened this issue Feb 21, 2020 · 0 comments

LS,

I am using C# in Visual Studio, and I extract text with:
SearchableText = new TextExtractor().Extract(MyActivePdf).Text;

The variable SearchableText is usually filled with the text content of the PDF.

However, there are some PDFs for which SearchableText contains only a lot of spaces, the name of the PDF, and then a lot of spaces again. When I open the same PDF in a browser I can select any part of the text, so Extract should be able to extract this text for me.

My guess is that a PDF whose text is selectable in a browser (and therefore extractable) contains multiple data streams: some hold text placed at specific locations in the PDF, and perhaps one holds the file name.

There are many programs that create PDFs, and possibly the program that created these PDFs mixes something up so that TikaOnDotNet hooks into the wrong stream.

Is there a way to connect to the correct stream of data so that I can also retrieve the text in this case?

PS: I do not get an error; I only get the name of the PDF surrounded by a lot of spaces, not the text.
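
To make the symptom easy to spot in code, here is a minimal sketch of a check for the "only spaces and the file name" case. The helper `LooksLikeFileNameOnly`, the class name, and the sample path and strings are made up for illustration and are not part of TikaOnDotNet. In the real program the text would come from the `Extract(...).Text` call shown above.

```csharp
using System;
using System.IO;

static class ExtractionCheck
{
    // Hypothetical helper (not part of TikaOnDotNet): returns true when the
    // extracted text, once surrounding whitespace is trimmed, is empty or is
    // just the file name of the PDF, with or without its extension.
    public static bool LooksLikeFileNameOnly(string extractedText, string pdfPath)
    {
        string trimmed = (extractedText ?? string.Empty).Trim();
        if (trimmed.Length == 0) return true;

        string withExt = Path.GetFileName(pdfPath);
        string withoutExt = Path.GetFileNameWithoutExtension(pdfPath);
        return trimmed.Equals(withExt, StringComparison.OrdinalIgnoreCase)
            || trimmed.Equals(withoutExt, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        // In the real program the text would come from:
        //   new TextExtractor().Extract(MyActivePdf).Text
        string pdfPath = "invoice-2020.pdf";            // made-up example path
        string bad = "      invoice-2020.pdf      ";    // the symptom described above
        string good = "Invoice 2020\nTotal: 100 EUR";   // a normal extraction result

        Console.WriteLine(LooksLikeFileNameOnly(bad, pdfPath));   // True
        Console.WriteLine(LooksLikeFileNameOnly(good, pdfPath));  // False
    }
}
```

A check like this does not fix the extraction. It only flags the affected PDFs so they can be collected and attached as reproduction cases.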

Kind regards,

Clemens Linders
