Text redaction support #24

Closed
ansel1 opened this issue Sep 7, 2018 · 7 comments
Labels
feature New feature

Comments

@ansel1

ansel1 commented Sep 7, 2018

Does the library support the primitives required to implement some kind of redaction function? I'm figuring it would require:

  • producing a stream of text content which contains sufficient information to map text sequences back to their underlying tokens
  • a method to calculate a rectangle around a sequence of text (for the purpose of creating a rectangular annotation over the region on the page where the text is)
  • some way of altering the original text so that it can't be recovered, replacing it with something that doesn't disturb the layout of the rest of the page.
@gunnsth
Contributor

gunnsth commented Sep 7, 2018

@ansel1 Yes this should be possible with our current functionality, as we can search and manipulate the content stream.

Can you clarify a few things?

  1. Are you looking to replace certain text, i.e. do a search and replace? or
  2. Are you specifying coordinates, specifying a rectangle where text needs to be removed?

Can you give an example with screenshots of what you would like to achieve? i.e. a before and after image?

I realize that we also need to make sure that the original text is gone (e.g. not just hidden under a white rectangle).

@ansel1
Author

ansel1 commented Sep 7, 2018

We're looking for the second option: placing a black box over the region where the redacted text was, as well as somehow removing or destroying the text data under the box.

As an example: https://pdfexpert.com/how-to-redact-pdf

There are other products which do something similar. I believe a similar function might be implemented in Adobe's PDF library. Having the ability in (or building the ability on top of) an open source Go library would be ideal.

@gunnsth gunnsth transferred this issue from unidoc/unidoc May 24, 2019
@gunnsth
Contributor

gunnsth commented Jul 25, 2019

The internal building blocks for supporting redaction have been available since v3.1.0. It can be built on top of the extractor's PageText and TextMarks, in one of two ways: either filter the page's TextMarks through a coordinate filter and pass only the text outside the redacted area when re-creating the text with the creator, or find the TextMarks to delete and trace them back to the original content stream (which might be tricky).

@peterwilliams97
Contributor

This is a useful feature. I hope to find time to build a prototype on top of PageText. As noted above, this will require references from the TextMarks back to the offsets of the corresponding text operators in the content stream.

@AdamSLevy

I am exploring how to achieve this and would love to get some second opinions on the approach that I think needs to be taken:

There are two parts of this problem:

  1. Identifying all text marks that fall within a given set of bounding boxes. For example, all Redact Annotations.
  2. Cleanly removing these text marks, without disturbing the layout of the surrounding text, or ideally any other aspect of the PDF file. What we might call "lossless redaction". (Of course then there is also optionally painting a redaction bar over these areas, but that is straightforward.)

Part 1 is fairly straightforward: It is basically a 2D collision detection problem. There aren't many golang libraries that perfectly fit this use-case but it's not particularly difficult to write. Please let me know if there is any unidoc code that detects whether one bounding box fully contains another.
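For illustration, here is a minimal sketch of the box test described in Part 1. It is not unidoc code: `Rect`, `Mark`, `Contains`, `Overlaps` and `Split` are hypothetical names, and the sketch assumes axis-aligned boxes in page coordinates with the origin at the lower left.

```go
package main

import "fmt"

// Rect is a hypothetical axis-aligned box in PDF page coordinates
// (origin at the lower left, so Llx <= Urx and Lly <= Ury).
type Rect struct{ Llx, Lly, Urx, Ury float64 }

// Contains reports whether inner lies entirely inside r (edges may touch).
func (r Rect) Contains(inner Rect) bool {
	return inner.Llx >= r.Llx && inner.Urx <= r.Urx &&
		inner.Lly >= r.Lly && inner.Ury <= r.Ury
}

// Overlaps reports whether r and o share any interior area.
// Boxes that only touch along an edge do not overlap.
func (r Rect) Overlaps(o Rect) bool {
	return r.Llx < o.Urx && o.Llx < r.Urx &&
		r.Lly < o.Ury && o.Lly < r.Ury
}

// Mark is a hypothetical stand-in for an extracted text mark.
type Mark struct {
	Text string
	BBox Rect
}

// Split partitions marks into those touched by any redaction box (hit)
// and those left alone (keep), preserving the input order.
func Split(marks []Mark, boxes []Rect) (hit, keep []Mark) {
	for _, m := range marks {
		redacted := false
		for _, b := range boxes {
			if b.Overlaps(m.BBox) {
				redacted = true
				break
			}
		}
		if redacted {
			hit = append(hit, m)
		} else {
			keep = append(keep, m)
		}
	}
	return hit, keep
}

func main() {
	marks := []Mark{
		{"S", Rect{10, 100, 16, 112}},
		{"S", Rect{16, 100, 22, 112}},
		{"N", Rect{22, 100, 29, 112}},
		{":", Rect{40, 100, 43, 112}},
	}
	boxes := []Rect{{9, 98, 30, 114}}
	hit, keep := Split(marks, boxes)
	fmt.Println("redact:", len(hit), "keep:", len(keep))
}
```

`Split` uses `Overlaps` rather than `Contains` on the assumption that a glyph only partly under a redaction box should still be removed; swapping in `Contains` gives the stricter "fully inside" test asked about above.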

Part 2 is trickier, as it requires modifying the content stream.

I have some questions that I'm actively working on answering myself by reading the PDF standard and the unidoc docs, but I thought I'd throw them out here to see if any more knowledgeable person might be willing to help me.

Given a set of text marks, how can I find the exact text objects in the PDF file's content streams, so I can precisely identify them for removal?

This actually should happen at the individual letter, or at least word, level. I have seen extracted text marks that are individual letters in a few PDFs I've examined. Is this always the case for the text marks returned by the extractor, or does it depend on the PDF?

What is the "best" way to remove a text object (or part of a text object) in a content stream such that the layout of the surrounding text is not disturbed? Clearly contiguous text objects would need to be split in the content stream, and the precise starting point of the part that follows the redacted text needs to be a new text object with a precise starting position, such that its position does not change post redaction.
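One possible approach, sketched here purely as an assumption and not as anything the library does: inside a `TJ` array, replace the redacted glyphs with a single numeric adjustment of the same total advance, so the pen ends up where it would have been. The `Item` type and the `Advance` and `Redact` functions are hypothetical. The sketch works in thousandths of a text space unit and ignores character spacing (`Tc`), word spacing (`Tw`) and horizontal scaling, which a real implementation would have to fold into the adjustment.

```go
package main

import "fmt"

// Item is one element of a TJ-style array: either a glyph or a numeric
// position adjustment. All names here are hypothetical.
type Item struct {
	Glyph string  // empty for a pure adjustment
	Width float64 // glyph advance in thousandths of text space (glyphs only)
	Adj   float64 // TJ number (adjustments only); positive moves the pen left
}

// Advance returns the net horizontal pen movement of items in thousandths
// of text space, ignoring Tc, Tw and horizontal scaling.
func Advance(items []Item) float64 {
	var total float64
	for _, it := range items {
		if it.Glyph != "" {
			total += it.Width
		} else {
			total -= it.Adj // TJ numbers are subtracted from the position
		}
	}
	return total
}

// Redact returns a copy of items in which items[from:to] is replaced by one
// adjustment that moves the pen by the same amount, so everything after the
// gap keeps its position. It panics if the range is out of bounds.
func Redact(items []Item, from, to int) []Item {
	gap := Item{Adj: -Advance(items[from:to])}
	out := make([]Item, 0, len(items)-(to-from)+1)
	out = append(out, items[:from]...)
	out = append(out, gap)
	return append(out, items[to:]...)
}

func main() {
	items := []Item{
		{Glyph: "A", Width: 600},
		{Glyph: "B", Width: 650},
		{Adj: -20}, // existing kerning inside the redacted run
		{Glyph: "C", Width: 700},
		{Glyph: "D", Width: 550},
	}
	out := Redact(items, 1, 4) // drop B, the kerning, and C
	fmt.Println("before:", Advance(items), "after:", Advance(out))
	fmt.Printf("%+v\n", out)
}
```

The two advances printed by `main` should match, which is the property that keeps the following text in place. Text shown with plain `Tj` would first need rewriting as `TJ`, and Type 3 fonts use their own glyph space scale, so treat this as a starting point only.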

Am I thinking about this problem correctly? Am I overlooking something? Please let me know. Thank you!

@peterwilliams97
Contributor

peterwilliams97 commented Mar 20, 2020

Part 2 requires adding links from the extracted text back to the content stream during text extraction. It's the same principle as the links from extracted text back to bounding boxes on the PDF page. See how that's done in the current code and you will get the idea.
Each text mark would internally keep a link to the content stream operator that it was created from.
There is plumbing to add and a bunch of edge cases such as redacting extracted words that are part of a text mark, but I hope this gives you some idea.
To your specific questions:

  • There is already a mapping from contiguous extracted text to text marks:

        // RangeOffset returns the TextMarks in `ma` that have `start` <= TextMark.Offset < `end`.
        func (ma *TextMarkArray) RangeOffset(start, end int) (*TextMarkArray, error)

  • Redacting only requires removing text marking operators. The graphics state can be left as is.
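To make the idea concrete, here is a rough sketch of how an offset range could lead back to content stream operators. It does not use the library's types: `Mark`, its `OpIndex` field, and the free functions `RangeOffset` and `OpsFor` are hypothetical. Only the half-open rule `start <= Offset < end` is taken from the doc comment quoted above, and offsets are assumed to be byte offsets into the extracted text.

```go
package main

import (
	"fmt"
	"strings"
)

// Mark is a hypothetical text mark. Offset is the byte offset of Text in the
// extracted page text; OpIndex is the (assumed) index of the content stream
// operator that produced it.
type Mark struct {
	Text    string
	Offset  int
	OpIndex int
}

// RangeOffset returns the marks with start <= Offset < end.
func RangeOffset(marks []Mark, start, end int) []Mark {
	var out []Mark
	for _, m := range marks {
		if start <= m.Offset && m.Offset < end {
			out = append(out, m)
		}
	}
	return out
}

// OpsFor finds the first occurrence of term in text and returns the distinct
// operator indices of the marks covering it, in order of first appearance.
// It returns nil if term does not occur.
func OpsFor(text, term string, marks []Mark) []int {
	start := strings.Index(text, term)
	if start < 0 {
		return nil
	}
	seen := map[int]bool{}
	var ops []int
	for _, m := range RangeOffset(marks, start, start+len(term)) {
		if !seen[m.OpIndex] {
			seen[m.OpIndex] = true
			ops = append(ops, m.OpIndex)
		}
	}
	return ops
}

func main() {
	text := "ID 42 ok"
	marks := []Mark{
		{"I", 0, 3}, {"D", 1, 3}, {" ", 2, 3},
		{"4", 3, 5}, {"2", 4, 5},
		{" ", 5, 7}, {"o", 6, 7}, {"k", 7, 7},
	}
	fmt.Println(OpsFor(text, "42", marks))
}
```

Here the term maps cleanly onto one operator. The edge case mentioned above, a redacted word that is only part of what one operator shows, would surface as an operator index shared with marks outside the range.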

I need redaction too and that's how I was thinking of doing it.

@gunnsth gunnsth changed the title redaction support Text redaction support Apr 8, 2020
@gunnsth gunnsth added the feature New feature label Jun 2, 2020
@ipod4g ipod4g closed this as completed Aug 6, 2024
@sampila
Collaborator

sampila commented Aug 6, 2024

Redact text example available here: https://github.com/unidoc/unipdf-examples/tree/master/redact
