Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pagination for QueryTableChanges #354

Merged
merged 1 commit into from
Jul 31, 2023

Conversation

charlenelyu-db
Copy link
Collaborator

@charlenelyu-db charlenelyu-db commented Jul 24, 2023

  • Part 1: Pagination for QueryTable at snapshot
    • Part 2: Pagination for QueryTable from starting_version
      • Part 3: Pagination for QueryTableChanges

Support pagination for QueryTableChanges in reference server.

This is part 3 for issue #351.

// Return early if we already have enough files in the current page
if (pageSizeOpt.contains(numSignedFiles)) {
actions.append(tokenGenerator(v, idx))
return actions.toSeq
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: It seems first we will generate all the changes, and then apply pagination? So pagination will not save any processing on the server side right? Is the same true for DS server in UC as well? @linzhou-db

Copy link
Collaborator Author

@charlenelyu-db charlenelyu-db Jul 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're saving some progress: the start will be equal to the startingVersion in the page token, so we won't unpack delta logs for versions that are already processed in previous pages.

However, we still generate all changes until the endingVersion each time. I guess this part can be optimized in UC. Two options I can think of:

  1. we could push down actionListener to getChanges(), so we read logs and process actions together and it'll return in the middle
  2. we don't change the interface of getChanges(), we call it to unpack one version at a time and process version by version.

WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both options should work; the first one is similar to what we do for normal query, so may be preferable for consistency reason.
In the future, we will likely process json files in parallel as well (similar to checkpoint parquet).

Let's open a ticket for now to address this in UC.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's have a separate discussion about how to handle this in UC.

Copy link
Collaborator

@chakankardb chakankardb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps wait for Lin's review as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants