refactor!: We are not a layout analyzer

dhdaines · Oct 23, 2024 · db0c086
1 parent dd26002
commit db0c086

Showing 8 changed files with 80 additions and 625 deletions.
diff --git a/README.md b/README.md
@@ -1,14 +1,15 @@
-# PLAYA is a LAYout Analyzer 🏖️
+# PLAYA Ain't a LAYout Analyzer 🏖️
 
 ## About
 
 This is not an experimental fork of
 [pdfminer.six](https://github.com/pdfminer/pdfminer.six).  Well, it's
 kind of an experimental fork of pdfminer.six.  The idea is to extract
-just the part of pdfminer.six that gets used these days, namely the
-layout analysis and low-level PDF access, see if it can be
-reimplemented using other libraries such as pypdf or pikepdf, and make
-its API more fun to use.
+just the part of pdfminer.six that gets used by
+[pdfplumber](https://github.com/jsvine/pdfplumber), namely the
+low-level PDF access, optimize it for speed, see if it can be
+reimplemented using other libraries such as pypdf or pikepdf,
+benchmark it against those libraries, and improve its API.
 
 There are already too many PDF libraries, unfortunately none of which
 does everything that everybody wants it to do, and we probably don't
@@ -21,11 +22,13 @@ would be specifically one of these things and nothing else:
    metadata.
 2. Obtaining the absolute position and attributes of every character,
    line, path, and image in every page of a PDF document.
-
-Since most people *do not want to do these things*, ideally, this will
-get merged into some other library, perhaps
-[pypdf](https://github.com/py-pdf/pypdf).  Did I mention this is
-experimental?
+
+Notably this does *not* include the largely undocumented heuristic
+"layout analysis" done by pdfminer.six, because it is quite difficult
+to understand due to a Java-damaged API based on deeply nested class
+hierarchies, and because layout analysis is best done
+probabilistically/visually.  Also, pdfplumber does its own, much
+nicer, layout analysis.
 
 ## Acknowledgement
 

diff --git a/playa/converter.py b/playa/converter.py
@@ -8,7 +8,6 @@
 )
 
 from playa.layout import (
-    LAParams,
     LTChar,
     LTComponent,
     LTCurve,
@@ -49,11 +48,9 @@ def __init__(
         self,
         rsrcmgr: PDFResourceManager,
         pageno: int = 1,
-        laparams: Optional[LAParams] = None,
     ) -> None:
         PDFTextDevice.__init__(self, rsrcmgr)
         self.pageno = pageno
-        self.laparams = laparams
         self._stack: List[LTLayoutContainer] = []
 
     def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
@@ -66,8 +63,6 @@ def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
     def end_page(self, page: PDFPage) -> None:
         assert not self._stack, str(len(self._stack))
         assert isinstance(self.cur_item, LTPage), str(type(self.cur_item))
-        if self.laparams is not None:
-            self.cur_item.analyze(self.laparams)
         self.pageno += 1
         self.receive_layout(self.cur_item)
 
@@ -286,9 +281,8 @@ def __init__(
         self,
         rsrcmgr: PDFResourceManager,
         pageno: int = 1,
-        laparams: Optional[LAParams] = None,
     ) -> None:
-        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
+        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno)
         self.result: Optional[LTPage] = None
 
     def receive_layout(self, ltpage: LTPage) -> None: