-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pagination for QueryTableChanges #354
Conversation
31c8e18
to
388be39
Compare
388be39
to
ecd6bf4
Compare
server/src/main/scala/io/delta/standalone/internal/DeltaSharedTableLoader.scala
Outdated
Show resolved
Hide resolved
server/src/main/scala/io/delta/standalone/internal/DeltaSharedTableLoader.scala
Show resolved
Hide resolved
// Return early if we already have enough files in the current page | ||
if (pageSizeOpt.contains(numSignedFiles)) { | ||
actions.append(tokenGenerator(v, idx)) | ||
return actions.toSeq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: It seems first we will generate all the changes, and then apply pagination? So pagination will not save any processing on the server side right? Is the same true for DS server in UC as well? @linzhou-db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're saving some progress: the start
will be equal to the startingVersion
in the page token, so we won't unpack delta logs for versions that are already processed in previous pages.
However, we still generate all changes until the endingVersion
each time. I guess this part can be optimized in UC. Two options I can think of:
- we could push down
actionListener
togetChanges()
, so we read logs and process actions together and it'll return in the middle - we don't change the interface of
getChanges()
, we call it to unpack one version at a time and process version by version.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both options should work; the first one is similar to what we do for normal query, so may be preferable for consistency reason.
In the future, we will likely process json files in parallel as well (similar to checkpoint parquet).
Let's open a ticket for now to address this in UC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's have a separate discussion about how to handle this in UC.
ecd6bf4
to
c7271be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps wait for Lin's review as well.
Support pagination for QueryTableChanges in reference server.
This is part 3 for issue #351.