Commit

refactor!: We are not a layout analyzer
dhdaines committed Oct 23, 2024
1 parent dd26002 commit db0c086
Showing 8 changed files with 80 additions and 625 deletions.
23 changes: 13 additions & 10 deletions README.md
@@ -1,14 +1,15 @@
-# PLAYA is a LAYout Analyzer 🏖️
+# PLAYA Ain't a LAYout Analyzer 🏖️

 ## About

 This is not an experimental fork of
 [pdfminer.six](https://github.com/pdfminer/pdfminer.six). Well, it's
 kind of an experimental fork of pdfminer.six. The idea is to extract
-just the part of pdfminer.six that gets used these days, namely the
-layout analysis and low-level PDF access, see if it can be
-reimplemented using other libraries such as pypdf or pikepdf, and make
-its API more fun to use.
+just the part of pdfminer.six that gets used by
+[pdfplumber](https://github.com/jsvine/pdfplumber), namely the
+low-level PDF access, optimize it for speed, see if it can be
+reimplemented using other libraries such as pypdf or pikepdf,
+benchmark it against those libraries, and improve its API.

 There are already too many PDF libraries, unfortunately none of which
 does everything that everybody wants it to do, and we probably don't
@@ -21,11 +22,13 @@ would be specifically one of these things and nothing else:
 metadata.
 2. Obtaining the absolute position and attributes of every character,
 line, path, and image in every page of a PDF document.
-
-Since most people *do not want to do these things*, ideally, this will
-get merged into some other library, perhaps
-[pypdf](https://github.com/py-pdf/pypdf). Did I mention this is
-experimental?
+
+Notably this does *not* include the largely undocumented heuristic
+"layout analysis" done by pdfminer.six, because it is quite difficult
+to understand due to a Java-damaged API based on deeply nested class
+hierarchies, and because layout analysis is best done
+probabilistically/visually. Also, pdfplumber does its own, much
+nicer, layout analysis.

 ## Acknowledgement

8 changes: 1 addition & 7 deletions playa/converter.py
@@ -8,7 +8,6 @@
 )

 from playa.layout import (
-    LAParams,
     LTChar,
     LTComponent,
     LTCurve,
@@ -49,11 +48,9 @@ def __init__(
         self,
         rsrcmgr: PDFResourceManager,
         pageno: int = 1,
-        laparams: Optional[LAParams] = None,
     ) -> None:
         PDFTextDevice.__init__(self, rsrcmgr)
         self.pageno = pageno
-        self.laparams = laparams
         self._stack: List[LTLayoutContainer] = []

     def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
@@ -66,8 +63,6 @@ def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
     def end_page(self, page: PDFPage) -> None:
         assert not self._stack, str(len(self._stack))
         assert isinstance(self.cur_item, LTPage), str(type(self.cur_item))
-        if self.laparams is not None:
-            self.cur_item.analyze(self.laparams)
         self.pageno += 1
         self.receive_layout(self.cur_item)

@@ -286,9 +281,8 @@ def __init__(
         self,
         rsrcmgr: PDFResourceManager,
         pageno: int = 1,
-        laparams: Optional[LAParams] = None,
     ) -> None:
-        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
+        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno)
         self.result: Optional[LTPage] = None

     def receive_layout(self, ltpage: LTPage) -> None:
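
Taken together, the `playa/converter.py` hunks appear to mean that `end_page` now hands each page to `receive_layout` without ever calling `analyze`, and that the constructors no longer accept `laparams`. The following is a minimal sketch of that behaviour using stand-in classes. `LTPageSketch` and `LayoutAnalyzerSketch` are hypothetical names invented for illustration; they are not the real playa classes, and the resource manager and item stack are omitted.

```python
from typing import List, Optional


class LTPageSketch:
    """Hypothetical, simplified stand-in for playa.layout.LTPage."""

    def __init__(self, pageno: int) -> None:
        self.pageno = pageno
        self.analyzed = False

    def analyze(self, laparams: object) -> None:
        # Before this commit, end_page() called this when laparams was set.
        self.analyzed = True


class LayoutAnalyzerSketch:
    """Hypothetical model of PDFLayoutAnalyzer as it looks after this commit."""

    def __init__(self, pageno: int = 1) -> None:
        # Note: no laparams parameter any more.
        self.pageno = pageno
        self.cur_item: Optional[LTPageSketch] = None
        self.received: List[LTPageSketch] = []

    def begin_page(self) -> None:
        self.cur_item = LTPageSketch(self.pageno)

    def end_page(self) -> None:
        assert isinstance(self.cur_item, LTPageSketch), str(type(self.cur_item))
        # The page is passed on as-is; analyze() is never invoked here.
        self.pageno += 1
        self.receive_layout(self.cur_item)

    def receive_layout(self, ltpage: LTPageSketch) -> None:
        self.received.append(ltpage)


device = LayoutAnalyzerSketch(pageno=1)
device.begin_page()
device.end_page()
page = device.received[0]
print(page.pageno, page.analyzed)
```

Under these assumptions, any caller that still passes `laparams=...` would fail with a `TypeError`, so call sites would need to drop the argument and do their own layout analysis downstream (as the README change says pdfplumber does).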