Text redaction support #24
Comments
@ansel1 Yes, this should be possible with our current functionality, as we can search and manipulate the content stream. Can you clarify a few things?
Can you give an example with screenshots of what you would like to achieve, i.e. a before and after image? I realize that we also need to make sure that the original text is gone (e.g. not just hidden under a white rectangle).
We're looking for the second option: placing a black box over the region where the redacted text was, and also removing or destroying the text data under the box. As an example: https://pdfexpert.com/how-to-redact-pdf There are other products that do something similar, and I believe a similar function may be implemented in Adobe's PDF library. Having this ability in (or being able to build it on top of) an open source Go library would be ideal.
The internal building blocks for supporting redaction have been available since v3.1.0. It can be built on top of the extractor's PageText and TextMarks, in one of two ways: either pass the page's TextMarks through a coordinate filter and hand only the text outside the redacted area to the creator when regenerating the text, or find the TextMarks to delete and reference them back to the original content stream (which might be tricky).
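To make the first option concrete, here is a minimal sketch of a coordinate filter. `Rect` and `Mark` are hypothetical stand-ins I made up so the example runs on its own; they only mimic the idea of a text mark carrying a bounding box and are not the unipdf types. It drops any mark whose box overlaps a redaction region, which is the conservative choice for redaction.

```go
package main

import "fmt"

// Rect is an axis-aligned box in PDF user space (origin at bottom-left).
// Hypothetical type for illustration, not part of unipdf.
type Rect struct{ Llx, Lly, Urx, Ury float64 }

// Mark is a hypothetical stand-in for one extracted text mark.
type Mark struct {
	Text string
	BBox Rect
}

// Overlaps reports whether r and o share any area.
// Boxes that only touch along an edge do not count as overlapping.
func (r Rect) Overlaps(o Rect) bool {
	return r.Llx < o.Urx && o.Llx < r.Urx && r.Lly < o.Ury && o.Lly < r.Ury
}

// KeepOutside returns the marks that do not overlap any redaction region.
func KeepOutside(marks []Mark, regions []Rect) []Mark {
	kept := make([]Mark, 0, len(marks))
	for _, m := range marks {
		hit := false
		for _, reg := range regions {
			if m.BBox.Overlaps(reg) {
				hit = true
				break
			}
		}
		if !hit {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	marks := []Mark{
		{"A", Rect{10, 700, 20, 712}},
		{"B", Rect{20, 700, 30, 712}},
		{"C", Rect{30, 700, 40, 712}},
	}
	regions := []Rect{{21, 695, 29, 715}} // covers only the middle glyph
	for _, m := range KeepOutside(marks, regions) {
		fmt.Println(m.Text)
	}
}
```

The surviving marks would then be redrawn at their original positions; how faithfully fonts and spacing can be reproduced that way is a separate question this sketch does not address.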
This is a useful feature. I hope to find time to build a prototype on top of PageText. As noted above, this will require references from the TextMarks back to the offsets of the corresponding text operators in the content stream.
I am exploring how to achieve this and would love to get some second opinions on the approach that I think needs to be taken: There are two parts of this problem:
Part 1 is fairly straightforward: it is basically a 2D collision detection problem. There aren't many Go libraries that fit this use case perfectly, but it's not particularly difficult to write. Please let me know if there is any unidoc code that detects whether one bounding box fully contains another.

Part 2 is trickier, as it requires modifying the content stream. I have some questions that I'm working on answering myself by reading the PDF standard and the unidoc docs, but I thought I'd post them here in case someone more knowledgeable is willing to help:

Given a set of text marks, how can I find the exact text objects in the PDF file's content streams, so that I can precisely identify them for removal? This needs to happen at the level of individual letters, or at least words.

In a few PDFs I've examined, the extracted text marks were individual letters. Is this always the case for the text marks returned by the extractor, or does it depend on the PDF?

What is the "best" way to remove a text object (or part of a text object) from a content stream without disturbing the layout of the surrounding text? Clearly, contiguous text objects would need to be split in the content stream, and the part that follows the redacted text would need to become a new text object with an explicit starting position, so that its position does not change after redaction.

Am I thinking about this problem correctly? Am I overlooking something? Please let me know. Thank you!
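For what it's worth, the containment test from Part 1 is only a few comparisons. This is a rough sketch using a hypothetical `Box` type of my own (not unidoc code), with coordinates in the usual PDF order: lower-left x/y, upper-right x/y. It also includes an intersection test, since a glyph that is only partly under the box probably has to be removed as well.

```go
package main

import "fmt"

// Box is a hypothetical axis-aligned rectangle; not a unidoc type.
type Box struct{ Llx, Lly, Urx, Ury float64 }

// Contains reports whether o lies entirely inside b (shared edges allowed).
func (b Box) Contains(o Box) bool {
	return b.Llx <= o.Llx && b.Lly <= o.Lly && b.Urx >= o.Urx && b.Ury >= o.Ury
}

// Intersects reports whether b and o share any area.
func (b Box) Intersects(o Box) bool {
	return b.Llx < o.Urx && o.Llx < b.Urx && b.Lly < o.Ury && o.Lly < b.Ury
}

func main() {
	redaction := Box{100, 500, 200, 520}
	inside := Box{110, 502, 120, 514}  // glyph fully under the box
	partial := Box{195, 502, 205, 514} // glyph straddling the right edge

	fmt.Println(redaction.Contains(inside), redaction.Intersects(inside))
	fmt.Println(redaction.Contains(partial), redaction.Intersects(partial))
}
```

Whether to redact on `Contains` or on `Intersects` is a policy decision; using `Intersects` errs on the side of removing too much rather than leaking part of a glyph.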
Part 2 requires adding links from the extracted text back to the content stream during text extraction. It's the same principle as the links from extracted text back to bounding boxes on the PDF page. See how that's done in the current code and you will get the idea. I need redaction too and that's how I was thinking of doing it.
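As a rough illustration of what such a link could enable, here is a sketch with names I made up (`Span`, `RedactTj`); none of this is unipdf API. It assumes each mark records which text-showing operator drew it and which glyph range it covers, that the font uses one byte per glyph, that character and word spacing are zero, and that the string contains no parentheses or backslashes that would need escaping. Under those assumptions, the redacted glyphs in a `Tj` string can be replaced by a `TJ` array whose numeric element shifts the following text by the summed glyph widths (in thousandths of text space units), so the text after the gap should stay where it was.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a hypothetical link from an extracted mark back to the
// operator that drew it. Not part of unipdf.
type Span struct {
	OpIndex  int // index of the Tj operator in the parsed content stream
	From, To int // glyph range [From, To) within that operator's string
}

// RedactTj returns a TJ operation that shows text without the glyphs in
// [s.From, s.To). widths holds one advance width per glyph, in
// thousandths of a text space unit. A negative TJ number moves the next
// glyph to the right, so the gap equals the removed glyphs' total advance.
func RedactTj(text string, widths []float64, s Span) string {
	gap := 0.0
	for _, w := range widths[s.From:s.To] {
		gap += w
	}
	var b strings.Builder
	b.WriteString("[")
	if s.From > 0 {
		fmt.Fprintf(&b, "(%s) ", text[:s.From])
	}
	fmt.Fprintf(&b, "%g", -gap)
	if s.To < len(text) {
		fmt.Fprintf(&b, " (%s)", text[s.To:])
	}
	b.WriteString("] TJ")
	return b.String()
}

func main() {
	// Illustrative widths only; real values come from the font's width table.
	widths := []float64{722, 444, 278, 278, 500}
	fmt.Println(RedactTj("Hello", widths, Span{OpIndex: 0, From: 1, To: 4}))
}
```

With non-zero character or word spacing, or with multi-byte encodings, the gap calculation and the string slicing would both need more care than this sketch takes.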
Redact text example available here: https://github.com/unidoc/unipdf-examples/tree/master/redact
Does the library support the primitives required to implement some kind of redaction function? I'm figuring it would require: