Detect omitted gaps (_____) in recent volumes #294

Open
joewiz opened this issue Dec 8, 2021 · 0 comments

As described in #282, several recent volumes exhibit a problem where certain gaps are omitted from the TEI deliveries from our typesetter. A gap here is a horizontal line that stands for a word 'omitted' or 'to be filled in', as on a form. The lines are present in the PDF but not in the TEI.

An omission like this is fiendishly difficult to detect.

That PR found a pattern commonly associated with this omission: a space preceding a punctuation character. It added Schematron rules to flag such cases. But the rules also produce false positives (sometimes simply typos), and they aren't guaranteed to catch every omitted gap.

As an alternative to a page-by-page review, a post in the DH Slack alerted me to a utility, pdfplumber, described as follows:

Plumb a PDF for detailed information about each text character, rectangle, and line... Works best on machine-generated, rather than scanned, PDFs.

One of the object types that pdfplumber reports on is "lines". Running the utility on a volume known to have blanks, I was happy to find that pdfplumber identifies these lines, or rather, all lines in our volumes: lines beneath running heads, footnote separators, underlined text in table headings. The common feature of the gap lines we're looking for is that they all appear to have a length of "30". I ran the utility on all volumes with PDFs and wrote an XQuery report to reveal the instances:

Screen Shot 2021-12-08 at 1 19 02 PM

Selecting a volume, the report shows each page where a matching line was detected, alongside the corresponding TEI, to help us identify if the TEI needs to be fixed:

Screen Shot 2021-12-08 at 1 24 53 PM

Further testing will be needed to confirm whether we can count on the value of "30" for the length of the gap lines. But this appears to be a promising approach for identifying these gaps.
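The length filter described above could be sketched in Python along these lines. This is a rough sketch under assumptions, not the script actually used for the report: the tolerance value, the helper names, and the choice to compute length from the `x0`/`x1`/`top`/`bottom` coordinates of each pdfplumber line object are all hypothetical, and the target of 30 is the unconfirmed value observed above.

```python
import math

GAP_LENGTH = 30.0  # value observed in the issue; still unconfirmed
TOLERANCE = 0.5    # hypothetical slack for floating-point coordinates


def line_length(line):
    """Length of a line object, computed from its endpoint coordinates."""
    return math.hypot(line["x1"] - line["x0"], line["bottom"] - line["top"])


def is_gap_candidate(line, target=GAP_LENGTH, tol=TOLERANCE):
    """True if the line's length is within tol of the target length."""
    return abs(line_length(line) - target) <= tol


def find_gap_candidates(pdf_path):
    """Return (page_number, x0, top) for each candidate line in a PDF."""
    import pdfplumber  # third-party; imported here so the helpers work without it

    hits = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for line in page.lines:
                if is_gap_candidate(line):
                    hits.append((page.page_number, line["x0"], line["top"]))
    return hits
```

Calling `find_gap_candidates` with the path to a volume's PDF would return the pages and positions to compare against the TEI. Keeping the tolerance as a parameter makes it easy to widen the filter if further testing shows that the gap lines are not always exactly 30 long.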

As with the FRUS XPath Explorer, the tool can craft links that open oXygen to the exact location of the page shown, to facilitate editing of the source TEI document.
