[Bug]: Mishandling of section title when chunking a PDF with tables #4447

Open
1 task done
predoctech opened this issue Jan 12, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@predoctech

Is there an existing issue for the same bug?

  • I have checked the existing issues.

RAGFlow workspace code commit ID

1160b58

RAGFlow image version

demo.ragflow.io

Other environment information

No response

Actual behavior

This PDF contains several section titles, each followed by a table. When these structures are parsed, the titles appear to be joined to the text before and after them, which distorts the meaning (or spelling) of these important titles. As a result, if you ask in a chat whether the document contains a section with this title, the reply is "NO", which is outright misleading.
Look for the title "MANAGEMENT DISCUSSION AND ANALYSIS" in the following screenshot:
Screenshot 2025-01-12 at 1 10 47 PM

Expected behavior

Section and table titles should keep their semantic meaning and spelling intact during chunking.
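To make the expected behavior concrete, here is a minimal sketch of title-aware chunking. It is not RAGFlow's implementation: the `Block` structure, the `kind` labels, and the `max_chars` limit are all assumptions made for illustration. The idea is only that a title always starts a new chunk on its own line and is never split from the content that follows it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A layout block from a PDF parser (hypothetical structure)."""
    kind: str  # assumed labels: "title", "text", or "table"
    text: str


def chunk_blocks(blocks: List[Block], max_chars: int = 200) -> List[str]:
    """Group blocks into chunks without gluing a title to its neighbours."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    has_body = False  # True once the current chunk holds non-title content

    for block in blocks:
        starts_section = block.kind == "title"
        # Only split on size once the chunk has a body, so a title is
        # never emitted without the content that follows it.
        overflows = has_body and size + len(block.text) > max_chars
        if current and (starts_section or overflows):
            chunks.append("\n".join(current))
            current, size, has_body = [], 0, False
        current.append(block.text)
        size += len(block.text)
        has_body = has_body or block.kind != "title"

    if current:
        chunks.append("\n".join(current))
    return chunks
```

With this approach the title stays on its own line at the start of a chunk, so a search for the exact title string still matches after chunking.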

Steps to reproduce

Upload the PDF and parse it using the "General" template.
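Once the chunks have been produced, a small check like the one below could confirm whether a title survived intact. This is only a sketch: it assumes the chunk texts are already available as plain Python strings (how they are exported from RAGFlow is not covered here), and the helper names `normalize` and `title_is_intact` are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> str:
    """Collapse whitespace runs and ignore case."""
    return re.sub(r"\s+", " ", text).strip().lower()


def title_is_intact(title: str, chunks: List[str]) -> bool:
    """Return True if the title appears in some chunk as whole words,
    i.e. not fused to a word character on either side."""
    pattern = re.compile(r"(?<!\w)" + re.escape(normalize(title)) + r"(?!\w)")
    return any(pattern.search(normalize(chunk)) for chunk in chunks)
```

For example, `title_is_intact("MANAGEMENT DISCUSSION AND ANALYSIS", chunks)` would return `False` for a chunk in which the title has been run together with the preceding or following word.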

Additional information

I suppose the choice of embedding model is irrelevant to this issue, but FYI the embedding model used was Gemini.

@predoctech added the bug label on Jan 12, 2025