-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
evaluate extraction progress in separate call #110
Conversation
…ss separately from main extraction prompt
…and into complete-cond-extract
evals/index.eval.ts
Outdated
@@ -281,7 +281,9 @@ const extract_last_twenty_github_commits = async () => { | |||
|
|||
try { | |||
await stagehand.page.goto("https://github.com/facebook/react"); | |||
await stagehand.waitForSettledDom(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be wrong here, but i thought we're deprecating this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, wasn't aware—removed. Just included since I saw it in some other tests—I figure we need to remove it there as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, if you don't mind also cleaning up other places where we wait for settled dom that would be awesome
why
Changes for improvements to completion checking and progress updating.
The original issue was that in a single extraction step, we provide previously extracted content, progress so far, completed/not completed, DOM elements, etc. all at once and ask the model to give us newly extracted content from DOM elements, new progress, completed/not completed. The issue here is that the model needs to simultaneously decide what is important in the DOM and evaluate its progress based on the information it extracts.
what changed
This PR turns this into three steps: 1. extracting the content from the DOM (without previous progress, previous extraction content, etc.), 2. Refining content to ensure that we do not add redundant/duplicate data and updating the total response, and 3. deciding based on the total extracted content what the new progress is and whether the job is completed. The main idea here is that breaking this up rather than doing it all at once shows improvements and a reduction in hallucinations.
test plan
Running evals. Shows improvement with awareness of the number of outputs in
arxiv
eval required since progress tracking is improved. Is able to do both of the Github evals.This shows improvement on both the GitHub tests.