[Website] Data from uploaded PDF files being ignored? #129

Open
prdgymx opened this issue Jul 28, 2023 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@prdgymx
prdgymx commented Jul 28, 2023

Hello!

We are running the bot via the website for now, to test things out before deploying our own. We are encountering an issue where uploaded PDF files show as successful, but the specific information inside them is not being referenced by the bot.

For example, one section of a PDF has an "answering machine script" - but if I ask the bot what the answering machine script is, it tells me it cannot answer questions out of context.

I am wondering if the initial prompt is preventing the bot from parsing or using the data held within the PDFs. Our initial prompt is as follows:

You are a helpful AI agent support agent. Use the following pieces of context to answer the question at the end.
If you don\'t know the answer, just say you don\'t know. DO NOT try to make up an answer.
If the question is not related to the context, politely respond that you are tuned to only answer questions that are related to the context.

[our company context]

Is there something I am missing here? Thanks for the great product so far! Looking forward to growing and expanding along with its progress.
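
One way to tell whether the prompt or the retrieval step is at fault is to look at what actually lands in the context slot. The sketch below is not OpenChat's code: `build_prompt`, `context_is_empty`, and the chunk list are hypothetical stand-ins, and it assumes the common pattern where retrieved text chunks are pasted into the prompt ahead of the question. Under that assumption, if nothing from the PDF is retrieved, the context is empty and a prompt worded like the one above leaves the model no choice but to decline.

```python
from typing import List

# Instruction text condensed from the prompt quoted above.
INSTRUCTIONS = (
    "Use the following pieces of context to answer the question at the end.\n"
    "If you don't know the answer, just say you don't know.\n"
    "If the question is not related to the context, politely decline."
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Assemble a retrieval-style prompt (hypothetical helper, not OpenChat's API)."""
    context = "\n\n".join(chunks)
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"


def context_is_empty(chunks: List[str]) -> bool:
    """True when retrieval produced no usable text for the context slot."""
    return not any(chunk.strip() for chunk in chunks)


if __name__ == "__main__":
    question = "What is the answering machine script?"
    retrieved: List[str] = []  # what retrieval yields if the PDF was never indexed
    if context_is_empty(retrieved):
        print("No context retrieved: the model can only refuse.")
    print(build_prompt(question, retrieved))
```

If the assembled prompt does contain the relevant PDF text and the bot still refuses, the prompt wording becomes the more likely suspect.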

@codebanesr
Contributor

Is it happening for all prompts or just this one?

@ingodahn
ingodahn commented Aug 3, 2023

I seem to have the same problem with an indexed website. I can apparently retrieve information only from `<rootURL>/index.html`, not from subpages such as `<rootURL>/FolderName/index.html`, which is referenced in `/index.html` as `href="FolderName/index.html"`.

I also noticed that my Pinecone index reports 0 vectors, even though OpenChat says it has successfully indexed the website.

PS: I'd have a couple of other questions, in particular on how to adapt the bot to my requirements. Would you mind setting up a kind of community support forum?
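
A quick way to confirm the "0 vectors" observation outside the dashboard is to read the index statistics programmatically. The sketch below only interprets a statistics mapping; it does not call Pinecone itself. The field names `total_vector_count` and `namespaces[...]["vector_count"]` reflect my understanding of what the Pinecone Python client's `describe_index_stats()` returns, so treat them as an assumption to check against your client version. `total_vectors` and `ingestion_looks_empty` are hypothetical helper names.

```python
from typing import Any, Mapping


def total_vectors(stats: Mapping[str, Any]) -> int:
    """Return the vector count from an index-stats mapping.

    Falls back to summing per-namespace counts if the total is absent.
    Field names are an assumption; verify them for your client version.
    """
    if "total_vector_count" in stats:
        return int(stats["total_vector_count"])
    namespaces = stats.get("namespaces", {})
    return sum(int(ns.get("vector_count", 0)) for ns in namespaces.values())


def ingestion_looks_empty(stats: Mapping[str, Any]) -> bool:
    """True when the index holds no vectors at all."""
    return total_vectors(stats) == 0


if __name__ == "__main__":
    # Hand-written sample, not real output from any index.
    sample = {"dimension": 1536, "namespaces": {}, "total_vector_count": 0}
    print(ingestion_looks_empty(sample))
```

An empty index alongside a "successfully indexed" message would suggest the embeddings were never written, or were written to a different index than the one being inspected.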

@gharbat gharbat self-assigned this Aug 5, 2023
@gharbat gharbat added the bug Something isn't working label Aug 5, 2023
@codebanesr
Contributor

@ingodahn
I have seen this before when uploading a large PDF file; the cause was that the upload connection was lost. To avoid this, we should make sure the file finishes uploading before moving to the next page.

@ingodahn
ingodahn commented Sep 2, 2023

Is there a timeline for correcting this bug? Let me know if I can help with testing.

@codebanesr
Contributor

@ingodahn I plan to have this resolved by the end of day Monday. It would help me out if you could provide a link to the PDF file or upload it. That way I can review it and try to fix the issue. Please let me know if you can send the file my way.

@ingodahn
ingodahn commented Sep 2, 2023

Great. My use case requires crawling a website, not a PDF document. The website I am working with is https://netmath.vcrp.de/downloads/Skripte/Vorkurs/HSWildau/. You can download it zipped here.
Only linked pages whose URLs extend the root URL need to be indexed.
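
The scoping rule described here (follow a link only if its resolved URL extends the root URL) can be sketched with the Python standard library. This is an illustration of the requirement, not OpenChat's crawler: `LinkCollector` and `links_in_scope` are hypothetical names, and the example.org URLs in the usage note are made up.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect raw href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_in_scope(page_url: str, html: str, root_url: str) -> List[str]:
    """Resolve each href against page_url and keep those under root_url."""
    collector = LinkCollector()
    collector.feed(html)
    in_scope: List[str] = []
    for href in collector.hrefs:
        # Relative hrefs such as "FolderName/index.html" become absolute here;
        # the fragment is dropped so "#top" variants are not crawled twice.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(root_url) and absolute not in in_scope:
            in_scope.append(absolute)
    return in_scope
```

For instance, with a root of `https://example.org/course/` and a page at `https://example.org/course/index.html`, a link written as `href="FolderName/index.html"` resolves to a URL under the root and is kept, while `href="../other/x.html"` resolves outside the root and is skipped.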

@codebanesr
Contributor

Hi @ingodahn,
[Screenshot 2023-09-04 at 7:46:29 AM]

The first 10 pages of your website were indeed scanned, but I do see the problem here:
[Screenshot 2023-09-04 at 7:47:05 AM]
Perhaps it's related to the fact that the website is in German, but I'm not certain yet. I will look into this further.

@ingodahn
ingodahn commented Sep 4, 2023 via email


4 participants