I am using Visual Studio C# and when I use:
SearchableText = new TextExtractor().Extract(MyActivePdf).Text;
The variable SearchableText is usually filled with the text content of the PDF.
However, for some PDFs the variable SearchableText contains only a lot of spaces, the name of the PDF, and again a lot of spaces. When I open the same PDF in a browser I can select any part of the text, so Extract should be able to extract this text for me.
My guess is that a PDF whose text is selectable in a browser (and therefore extractable) has multiple data streams: some contain text placed at specific locations in the PDF, and perhaps one contains the file name.
There are many programs that create PDFs, and possibly the program that created these PDFs mixes something up so that TikaOnDotNet hooks into the wrong stream.
Is there a way to connect to the correct stream of data so that I can also retrieve the text in this case?
PS: I do not get an error; I only get the name of the PDF surrounded by a lot of spaces, not the text.
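To make the symptom easy to reproduce, here is a minimal sketch of a check that flags the affected files. It uses only System and System.IO and does not touch TikaOnDotNet itself. The class name ExtractionCheck and the method LooksLikeFileNameOnly are hypothetical names made up for this illustration, and the sketch assumes the path of the PDF is available as a string.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of TikaOnDotNet.
static class ExtractionCheck
{
    // Returns true when the extracted text, ignoring surrounding whitespace,
    // is empty or is nothing more than the file name of the PDF
    // (with or without its extension).
    public static bool LooksLikeFileNameOnly(string extractedText, string pdfPath)
    {
        string trimmed = (extractedText ?? string.Empty).Trim();
        if (trimmed.Length == 0)
        {
            return true;
        }

        string fileName = Path.GetFileName(pdfPath);
        string baseName = Path.GetFileNameWithoutExtension(pdfPath);

        return trimmed.Equals(fileName, StringComparison.OrdinalIgnoreCase)
            || trimmed.Equals(baseName, StringComparison.OrdinalIgnoreCase);
    }
}
```

Assuming MyActivePdf holds the file path, calling ExtractionCheck.LooksLikeFileNameOnly(SearchableText, MyActivePdf) after the Extract call should return true for the problem PDFs and false for the ones that extract normally.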
Kind regards,
Clemens Linders