Text redaction support #24

ansel1 · 2018-09-07T15:02:41Z

Does the library support the primitives required to implement some kind of redaction function? I'm figuring it would require:

producing a stream of text content which contains sufficient information to map text sequences back to their underlying tokens
a method to calculate a rectangle around a sequence of text (for the purpose of creating a rectangular annotation over the region on the page where the text is)
some way of altering the original text so that it can't be recovered, replacing it with something that doesn't disturb the layout of the rest of the page

gunnsth · 2018-09-07T19:59:11Z

@ansel1 Yes, this should be possible with our current functionality, since we can search and manipulate the content stream.

Can you clarify a few things?

Are you looking to replace certain text, i.e. do a search and replace? or
Are you specifying coordinates, specifying a rectangle where text needs to be removed?

Can you give an example with screenshots of what you would like to achieve? i.e. a before and after image?

I realize that we also need to make sure that the original text is gone (e.g. not just hidden under a white rectangle).

ansel1 · 2018-09-07T21:21:41Z

We're looking for the second option: placing a black box over the region where the redacted text was, as well as somehow removing or destroying the text data under the box.

As an example: https://pdfexpert.com/how-to-redact-pdf

There are other products which do something similar. I believe a similar function might be implemented in Adobe's PDF library. Having the ability in (or building the ability on top of) an open source Go library would be ideal.

gunnsth · 2019-07-25T07:57:17Z

The internal building blocks for supporting redaction have been available since v3.1.0. It can be built on top of the extractor's PageText and TextMarks, either by passing the page's TextMarks through a coordinate filter and only keeping the text outside the redacted area when recreating the text with the creator, or by finding the TextMarks to delete and referencing back to the original stream (might be tricky).
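The coordinate-filter option can be sketched roughly as follows. This is only an illustration under stated assumptions: `Box` and `Mark` are hypothetical stand-ins defined here, not unipdf types, and a mark counts as redacted if its box overlaps any redaction region at all.

```go
package main

import "fmt"

// Box is a hypothetical axis-aligned rectangle in page coordinates
// (lower-left and upper-right corners). It is not a unipdf type.
type Box struct{ Llx, Lly, Urx, Ury float64 }

// Mark is a hypothetical stand-in for an extracted text mark:
// a piece of text plus the box it occupies on the page.
type Mark struct {
	Text string
	BBox Box
}

// overlaps reports whether a and b share any interior area.
func overlaps(a, b Box) bool {
	return a.Llx < b.Urx && b.Llx < a.Urx && a.Lly < b.Ury && b.Lly < a.Ury
}

// filterMarks splits marks into those outside every redaction region (kept)
// and those overlapping at least one region (redacted).
func filterMarks(marks []Mark, regions []Box) (kept, redacted []Mark) {
	for _, m := range marks {
		hit := false
		for _, r := range regions {
			if overlaps(m.BBox, r) {
				hit = true
				break
			}
		}
		if hit {
			redacted = append(redacted, m)
		} else {
			kept = append(kept, m)
		}
	}
	return kept, redacted
}

func main() {
	marks := []Mark{
		{"Name:", Box{10, 100, 40, 112}},
		{"Alice", Box{45, 100, 75, 112}},
		{"Age:", Box{10, 80, 35, 92}},
	}
	regions := []Box{{44, 98, 80, 114}}
	kept, redacted := filterMarks(marks, regions)
	fmt.Println("kept:", len(kept), "redacted:", len(redacted))
}
```

Only the kept marks would then be written back to the page; how that is done with the creator is outside this sketch.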

peterwilliams97 · 2020-02-28T20:31:00Z

This is a useful feature. I hope to find time to build a prototype on top of PageText. As noted above, this will require references from the TextMarks back to the offsets of the corresponding text operators in the content stream.

AdamSLevy · 2020-03-20T01:04:23Z

I am exploring how to achieve this and would love to get some second opinions on the approach that I think needs to be taken:

There are two parts of this problem:

Identifying all text marks that fall within a given set of bounding boxes. For example, all Redact Annotations.
Cleanly removing these text marks, without disturbing the layout of the surrounding text, or ideally any other aspect of the PDF file. What we might call "lossless redaction". (Of course then there is also optionally painting a redaction bar over these areas, but that is straightforward.)

Part 1 is fairly straightforward: It is basically a 2D collision detection problem. There aren't many golang libraries that perfectly fit this use-case but it's not particularly difficult to write. Please let me know if there is any unidoc code that detects whether one bounding box fully contains another.
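For reference, a containment test of this kind is only a few comparisons. The sketch below uses a hypothetical `Rect` type defined locally (it is not taken from unidoc) and assumes PDF-style coordinates with the origin at the lower left.

```go
package main

import "fmt"

// Rect is a hypothetical axis-aligned box in PDF page coordinates,
// with (X0, Y0) the lower-left corner and (X1, Y1) the upper-right.
type Rect struct{ X0, Y0, X1, Y1 float64 }

// Contains reports whether inner lies entirely within r (edges may touch).
func (r Rect) Contains(inner Rect) bool {
	return inner.X0 >= r.X0 && inner.Y0 >= r.Y0 &&
		inner.X1 <= r.X1 && inner.Y1 <= r.Y1
}

func main() {
	redaction := Rect{100, 500, 300, 520}
	glyph := Rect{120, 502, 128, 514}      // fully inside the redaction box
	straddling := Rect{295, 502, 305, 514} // crosses the right edge

	fmt.Println(redaction.Contains(glyph))      // true
	fmt.Println(redaction.Contains(straddling)) // false
}
```

A mark that straddles the edge of a redaction box is not "fully contained", so a real tool would have to decide whether partial overlap should also count as a hit.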

Part 2 is trickier, as it requires modifying the content stream.

I have some questions that I'm actively working on answering myself by reading the PDF standard and the unidoc docs, but I thought I'd throw them out here to see if any more knowledgeable person might be willing to help me.

Given a set of text marks, how can I find the exact text objects in the PDF file's content streams, so I can precisely identify them for removal?

This actually should happen at the individual letter or at least word level. I have seen extracted text marks that are individual letters in a few PDFs I've examined. Is this always the case for the text marks returned by the extractor, or does it depend on the PDF?

What is the "best" way to remove a text object (or part of a text object) in a content stream such that the layout of the surrounding text is not disturbed? Clearly contiguous text objects would need to be split in the content stream, and the precise starting point of the part that follows the redacted text needs to be a new text object with a precise starting position, such that its position does not change post redaction.
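One possible way to keep the following text in place, sketched here under several assumptions, is to replace the redacted glyphs inside a `TJ` array with a single numeric adjustment equal to their combined advance. In `TJ`, numbers are in thousandths of a text space unit and are subtracted from the current position, so a negative value moves the pen forward. The sketch assumes a simple single-byte font, horizontal writing, and zero character and word spacing; `Elem`, `redactSpan`, and the width function are hypothetical and not part of unipdf.

```go
package main

import "fmt"

// Elem is a hypothetical model of one TJ array element: either a shown
// string (IsText true) or a numeric position adjustment in thousandths
// of a text space unit.
type Elem struct {
	IsText bool
	Text   string
	Adj    float64
}

// redactSpan removes bytes [start, end) of a shown string and returns the
// TJ elements that replace it: the text before, an adjustment as wide as
// the removed glyphs, and the text after.
// It assumes 0 <= start <= end <= len(text), and that width returns each
// glyph's advance in glyph space (1/1000 of a text space unit).
func redactSpan(text string, start, end int, width func(byte) float64) []Elem {
	var out []Elem
	if start > 0 {
		out = append(out, Elem{IsText: true, Text: text[:start]})
	}
	var removed float64
	for i := start; i < end; i++ {
		removed += width(text[i])
	}
	if removed != 0 {
		// Negative, because TJ subtracts the number from the position.
		out = append(out, Elem{Adj: -removed})
	}
	if end < len(text) {
		out = append(out, Elem{IsText: true, Text: text[end:]})
	}
	return out
}

func main() {
	// Hypothetical monospaced font: every glyph is 500 units wide.
	width := func(byte) float64 { return 500 }

	fmt.Print("[ ")
	for _, e := range redactSpan("Name: Alice Smith", 6, 11, width) {
		if e.IsText {
			fmt.Printf("(%s) ", e.Text) // no string escaping in this sketch
		} else {
			fmt.Printf("%.0f ", e.Adj)
		}
	}
	fmt.Println("] TJ")
}
```

With non-zero character or word spacing the adjustment would also have to account for the spacing that the removed glyphs would have contributed, which this sketch ignores.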

Am I thinking about this problem correctly? Am I overlooking something? Please let me know. Thank you!

peterwilliams97 · 2020-03-20T06:42:54Z

Part 2 requires adding links from the extracted text back to the content stream during text extraction. It's the same principle as the links from extracted text back to bounding boxes on the PDF page. See how that's done in the current code and you will get the idea.
Each text mark would internally keep a link to the content stream operator that it was created from.
There is plumbing to add and a bunch of edge cases such as redacting extracted words that are part of a text mark, but I hope this gives you some idea.
To your specific questions:
There is already a mapping from contiguous extracted text to text marks:
// RangeOffset returns the TextMarks in `ma` that have `start` <= TextMark.Offset < `end`.
func (ma *TextMarkArray) RangeOffset(start, end int) (*TextMarkArray, error)
Redacting only requires removing text marking operators. The graphics state can be left as is.
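To make the half-open range in that doc comment concrete, here is a small self-contained sketch. It does not call unipdf: `mark`, `rangeOffset`, and `marksFor` are hypothetical local stand-ins, and offsets are assumed to be byte offsets into the extracted text.

```go
package main

import (
	"fmt"
	"strings"
)

// mark is a hypothetical, simplified stand-in for a text mark: the text it
// produced and the byte offset of that text within the extracted string.
type mark struct {
	Text   string
	Offset int
}

// rangeOffset returns the marks with start <= Offset < end, mirroring the
// half-open range described in the RangeOffset doc comment.
func rangeOffset(marks []mark, start, end int) []mark {
	var out []mark
	for _, m := range marks {
		if m.Offset >= start && m.Offset < end {
			out = append(out, m)
		}
	}
	return out
}

// marksFor finds the first occurrence of needle in text and returns the
// marks whose offsets fall inside it, or nil if needle is absent.
func marksFor(text string, marks []mark, needle string) []mark {
	i := strings.Index(text, needle)
	if i < 0 {
		return nil
	}
	return rangeOffset(marks, i, i+len(needle))
}

func main() {
	text := "ID 42"
	marks := []mark{{"I", 0}, {"D", 1}, {" ", 2}, {"4", 3}, {"2", 4}}
	for _, m := range marksFor(text, marks, "42") {
		fmt.Printf("%q at offset %d\n", m.Text, m.Offset)
	}
}
```

Note that a multi-character mark starting before `start` but extending into the range is not selected by this test, which is one of the edge cases mentioned above.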

I need redaction too and that's how I was thinking of doing it.

sampila · 2024-08-06T18:53:07Z

Redact text example available here: https://github.com/unidoc/unipdf-examples/tree/master/redact

gunnsth transferred this issue from unidoc/unidoc May 24, 2019

gunnsth changed the title ~~redaction support~~ Text redaction support Apr 8, 2020

gunnsth added the feature New feature label Jun 2, 2020

gunnsth mentioned this issue Jun 2, 2020

Support for search replace #9

Closed

ipod4g closed this as completed Aug 6, 2024
