LLamaparse is missing chunks of text when parsing PDF / How Do You Ensure Your Parser Fully Parses a Document Without Missing Content (Text/Tables/Information)? #566

Open
MuhammedTech opened this issue Dec 25, 2024 · 2 comments

Comments

@MuhammedTech

I’ve been testing LlamaParse for PDF parsing, and I was surprised to find that when I manually checked the output, some text seemed to be missing. I’m wondering how others ensure that the parser truly processes the entire document and doesn't leave out or miss any important pieces of information (text, tables, etc.).

How do you guys test your parsers to make sure they parse the whole document without any omissions? Do you use any specific validation techniques or post-processing checks to ensure completeness?

I’d love to hear your experiences and recommendations for improving document parsing accuracy.

@galvangoh

I am interested in understanding this as well. From my testing, I found that long and complicated parsing instructions tend to degrade the quality of the output (e.g. a table is parsed, but its contents are reduced, as though summarized to a minimum).

@rthomas67

I had the same experience with a document-only / default-params call to the parsing/upload REST API endpoint, as described here, but there are quite a few parameters (documented here) that might help optimize the completeness of the output. If I find out anything helpful, I'll add another comment here with details.
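
One possible shape for the completeness check asked about above, as a minimal sketch only: compare the word tokens in the parser's output against a reference text obtained independently (for example, from a second text extractor or a manual transcription of a sample page), then report the share of reference tokens that appear in the output and which ones are missing. The names `tokenize` and `coverage_report` are hypothetical and are not part of LlamaParse; the sketch uses only the Python standard library and assumes both texts are already available as strings.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def coverage_report(reference_text, parsed_text):
    """Return (coverage_ratio, missing_tokens).

    coverage_ratio is the fraction of reference tokens (counted with
    multiplicity) that also occur in the parsed output. missing_tokens
    is a Counter of reference tokens the parsed output lacks.
    """
    reference = Counter(tokenize(reference_text))
    parsed = Counter(tokenize(parsed_text))
    missing = reference - parsed  # multiset difference, drops non-positive counts
    total = sum(reference.values())
    if total == 0:
        return 1.0, missing  # nothing to cover, so nothing can be missing
    covered = total - sum(missing.values())
    return covered / total, missing


if __name__ == "__main__":
    # Illustrative strings only, not real parser output.
    reference = "Revenue grew 12 percent in 2023. Table 1: costs 40 50 60"
    parsed = "Revenue grew 12 percent in 2023."
    ratio, missing = coverage_report(reference, parsed)
    print(f"coverage: {ratio:.2f}")
    print("missing tokens:", sorted(missing))
```

A low ratio, or numeric tokens showing up in the missing set, would point at the kind of dropped or summarized table content described earlier in the thread. The check ignores ordering and layout, so it can flag omissions but cannot confirm that a table's structure was preserved.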
