Hitting input token limit on local language models #22
Comments
The example https://news.ycombinator.com actually runs into this. I get a
We could try using the accessibility feature in Playwright.
I'm also getting this with GPT-4-Turbo on some web pages. It only seems to hit the context length when
I use Error message: Bad control character in string literal in JSON at position 1624
at safeParseJSON (file:///D:/Code/test/node_modules/@ai-sdk/provider-utils/dist/index.mjs:252:63)
at generateObject (file:///D:/Code/test/node_modules/ai/dist/index.mjs:680:23)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async generateAISDKCompletions (file:///D:/Code/test/node_modules/llm-scraper/dist/models.js:22:20)
at async file:///D:/Code/test/main.js:48:18 {
cause: [SyntaxError: Bad control character in string literal in JSON at position 1624],
text: '{"top":[{"name":"Eric Alm","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/eric-alm","fields":["Biophysics","Computational Modeling","Energy","Macromolecular Biochemistry","Microbial Systems","Omics","Synthetic Biology","Systems Biology"]},{"name":"Mark Bathe","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/mark-bathe","fields":["Biological Imaging","Biomolecular Engineering","Biophysics","Computational Modeling","Drug Delivery","Energy","Nanoscale Engineering","Neurobiological"]},{"name":"Angela Belcher","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/angela-belcher","fields":["Biomaterials","Biomolecular Engineering","Energy","Nanoscale Engineering","Synthetic Biology"]},{"name":"Prerna Bhargava","rank":"Research/Teaching Staff","email":"NA","url":"https://be.mit.edu/directory/prerna-bhargava","fields":["NA"]},{"name":"Michael Birnbaum","rank":"Associate Professor","email":"NA","url":"https://be.mit.edu/directory/michael-birnbaum","fields":["Biomolecular Engineering","Biophysics","Infectious Disease","Macromolecular Biochemistry"]},{"name":"Paul Blainey","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/paul-blainey","fields":["Biological Imaging","Biophysics","Drug Delivery","Infectious Disease","Macromolecular Biochemistry","Microbial Pathogenesis","Microbial Systems","Nanoscale Engineering","Omics"]},{"name":"Ed Boyden","rank":"Professor","email":"NA","url":"https://be.mit.edu/directory/ed-boyden","fields":["Biological Imaging","Biomolecular Engineering","Computational Modeling","Drug Delivery","Nanoscale Enginee...\n' +
'"]}'
}
Hey @beiyanpiki, this is not an issue with llm-scraper but with the Vercel AI SDK. You can report the issue here:
When scraping fairly large websites, we hit the token limit and receive the GGML_ASSERT error. For smaller websites this isn't an issue.
We should think about splitting the page content into chunks once it exceeds a certain length threshold, summarising each chunk with the local language model, and then stitching those summaries together coherently with one more call to the model.
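To make the chunk-and-summarise idea concrete, here is a rough sketch, not a proposal for the final API. Everything in it is an assumption: `Complete` stands in for whatever function sends a prompt to the model, `chunkText` and `summariseLargePage` are hypothetical names, and character count is used as a crude proxy for token count.

```typescript
// Hypothetical: whatever function sends a prompt to the model and returns its reply.
type Complete = (prompt: string) => Promise<string>;

// Split text into pieces of at most maxChars characters,
// breaking at a space where one is available inside the window.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChars, text.length);
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }
    const piece = text.slice(start, end).trim();
    if (piece) chunks.push(piece);
    start = end;
  }
  return chunks;
}

// Summarise each chunk, then ask the model once more to merge the summaries.
// Pages that fit in a single chunk are returned unchanged.
async function summariseLargePage(
  text: string,
  complete: Complete,
  maxChars = 8000, // assumed threshold; tune to the model's context size
): Promise<string> {
  const chunks = chunkText(text, maxChars);
  if (chunks.length <= 1) return text;
  const summaries: string[] = [];
  for (const chunk of chunks) {
    summaries.push(await complete(`Summarise this part of a web page:\n\n${chunk}`));
  }
  return complete(
    `Combine these partial summaries into one coherent summary:\n\n${summaries.join("\n\n")}`,
  );
}
```

Summarising is lossy, so for structured extraction it may work better to run the extraction on each chunk and merge the results instead. That variant would reuse `chunkText` as is.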
Another thought I've had is to take screenshots instead using playwright, and run text recognition on them. Or perhaps even better, if there is a playwright method to extract only the text content, we could leave the HTML out entirely.
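On the text-only idea: Playwright's `page.innerText(selector)` returns the rendered text of an element, so something along these lines might be enough. This is a minimal sketch, assuming the visible text of `body` is what we want. `TextPage`, `collapseWhitespace`, and `extractVisibleText` are hypothetical names, and the interface only declares the one Playwright method the sketch relies on.

```typescript
// Structural subset of Playwright's Page used here; a real Page satisfies it.
interface TextPage {
  innerText(selector: string): Promise<string>;
}

// Collapse runs of whitespace within each line and drop empty lines,
// so layout padding does not get sent to the model.
function collapseWhitespace(raw: string): string {
  return raw
    .split("\n")
    .map((line) => line.replace(/\s+/g, " ").trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

// Return the visible text of the page body without any HTML markup.
async function extractVisibleText(page: TextPage): Promise<string> {
  return collapseWhitespace(await page.innerText("body"));
}

// Usage with Playwright (not executed here):
//   import { chromium } from "playwright";
//   const browser = await chromium.launch();
//   const page = await browser.newPage();
//   await page.goto("https://news.ycombinator.com");
//   const text = await extractVisibleText(page);
//   await browser.close();
```

This drops link targets and attributes along with the markup, so schemas that need URLs would still need some of the HTML.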