Simple Select Queries with Unchecked Offset #365

Merged

Conversation

nicholas-mainardi (Contributor):

This PR introduces a circuit to expose the results of simple SELECT queries without aggregation functions, avoiding the need to build a results tree.

@nicholas-mainardi nicholas-mainardi marked this pull request as ready for review September 18, 2024 13:31
@nicholas-mainardi nicholas-mainardi changed the base branch from main to feat/tabular-queries September 19, 2024 13:48
parsil/src/executor.rs (conversation resolved)
@nikkolasg nikkolasg changed the title Simple Select Queries with Unproven Offset Simple Select Queries with Unchecked Offset Sep 23, 2024
@nikkolasg (Collaborator) left a comment:

Nice work!! Left a few comments regarding the API.

planner: &mut QueryPlanner<'a>,
results: Vec<PgSqlRow>,
) -> Result<()> {
let mut exec_query = generate_query_execution_with_keys(&mut parsed, &planner.settings)?;
Collaborator:

I think we need to make sure this query ALWAYS has LIMIT and OFFSET set by default if they are not set already, and that the limit is < MAX_LIMIT, the maximum we can support.
Not sure that's there already.

Contributor Author:

Good point. Do you think this should be done inside Parsil when parsing the query, or is it ok to let the integration test enforce it? Doing it in Parsil would be preferable, I guess, but we would need to somehow provide the MAX_LIMIT constant.

Collaborator:

Agreed.

Contributor Author:

Done in commit 994a115. I added checks for all the upper bounds and modified the expander component of Parsil to always add LIMIT MAX_LIMIT if the user didn't specify a LIMIT in the query. Also tagging @delehef to take a look at this commit.
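
To make the enforced invariant concrete, here is a minimal sketch of the idea; the Query struct, the MAX_LIMIT value, and enforce_limit_offset are hypothetical stand-ins for the real Parsil AST and constant, not the actual implementation:

/// Hypothetical, simplified view of a parsed query; the real Parsil AST differs.
struct Query {
    limit: Option<u64>,
    offset: Option<u64>,
}

/// Assumed circuit-supported maximum; the real constant lives elsewhere.
const MAX_LIMIT: u64 = 64;

/// Enforce the invariant discussed above: LIMIT and OFFSET are always set,
/// and LIMIT never exceeds MAX_LIMIT.
fn enforce_limit_offset(query: &mut Query) -> Result<(), String> {
    match query.limit {
        None => query.limit = Some(MAX_LIMIT),
        Some(l) if l > MAX_LIMIT => {
            return Err(format!("LIMIT {l} exceeds the supported maximum {MAX_LIMIT}"));
        }
        Some(_) => {}
    }
    // OFFSET defaults to 0 when the user didn't specify it.
    query.offset.get_or_insert(0);
    Ok(())
}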

Comment on lines +86 to +93
let row_tree_info = RowInfo {
satisfiying_rows: matching_rows
.iter()
.map(|(key, _, _)| key)
.cloned()
.collect(),
tree: &planner.table.row,
};
Collaborator:

I think it's ok to use the tree directly now, assuming we have this small maximum LIMIT enforced. Otherwise, we would need to use the cache, the wide_lineage stuff.

Contributor Author:

I initially tried to do it with wide_lineage, but it seemed kind of pointless: it only yields the keys of the nodes, and for each node I would still have to call get_node_info to build the path. So the added code complexity didn't seem really worth it to me.

Collaborator:

👍

Comment on lines 184 to 193
let sibling_hash = array::from_fn(|i| {
siblings
.get(i)
.and_then(|sibling| {
sibling
.clone()
.and_then(|node| Some(node.compute_node_hash(index_id.to_field())))
})
.unwrap_or(*empty_poseidon_hash())
});
Collaborator:

Why require the sibling nodes if you only compute the hash? We can access the hash of any node in the tree, so I can do let left_payload = tree.fetch(node.left); let left_hash = left_payload.hash;
Creating a NodeInfo requires many more SQL calls, since we need to load the children and even the grandchildren.
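
For illustration, a minimal sketch of the suggested direction, with hypothetical Tree and Payload types standing in for the real storage layer (only the fetch call and the stored hash field come from the comment above):

// A minimal sketch (hypothetical types) of the difference pointed out above.
struct Payload {
    // The node hash is already persisted with the node payload.
    hash: [u8; 32],
}

struct Tree; // backed by SQL in the real code

impl Tree {
    // Single lookup: returns the stored payload, including the node hash.
    fn fetch(&self, _key: u64) -> Payload {
        Payload { hash: [0u8; 32] }
    }
}

// Suggested approach: read the stored hash directly, instead of rebuilding a
// NodeInfo (which needs the children and grandchildren loaded from SQL) just
// to recompute the same hash with compute_node_hash.
fn sibling_hash(tree: &Tree, sibling_key: Option<u64>) -> [u8; 32] {
    sibling_key
        .map(|key| tree.fetch(key).hash)
        .unwrap_or([0u8; 32]) // stand-in for *empty_poseidon_hash()
}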

Contributor Author:

I realized this while dealing with the integration test, but I didn't change these APIs so I could get the PR ready as soon as possible. I can definitely do it now, thanks for pointing it out.

Contributor Author:

Done in commit 1d05c47

Collaborator:

Since it's a "template" for the dist-query folks to follow, we should at least show the optimized way when possible!

Comment on lines 133 to 138
let pis_hash = QueryCircuitInput::ids_for_placeholder_hash(
&planner.pis.predication_operations,
&planner.pis.result,
&planner.query.placeholders,
&planner.pis.bounds,
)?;
Collaborator:

Do we really need to call this? It seems unnecessary from the user's perspective; can't we do it internally?

Contributor Author:

The reason I initially added this method was to avoid having to pass all of these inputs to RevelationCircuitInput::new_revelation_no_results_tree, so that its interface wouldn't take too many parameters. So, although new_revelation_unproven_offset could already compute this internally (given that all these inputs need to be provided anyway), I chose to leave this computation outside so that there is no difference in this regard between the 2 types of queries. If you prefer to remove it, I can add these input values to new_revelation_no_results_tree and compute everything internally. Wdyt?

Collaborator:

What I'm thinking is that we should pass the pis directly into RevelationCircuitInput::new_revelation_unproven_offset(, so the user doesn't need to do anything on top, wdyt?
Either the PI, or a local struct in vdb that contains all the elements the pis need to provide.

Contributor Author:

Done in commit 5fd774f

Comment on lines 141 to 153
let input = RevelationCircuitInput::new_revelation_unproven_offset(
indexing_proof,
matching_rows_input,
&planner.pis.bounds,
&planner.query.placeholders,
pis_hash,
&column_ids,
&planner.pis.predication_operations,
&planner.pis.result,
planner.query.limit.unwrap(),
planner.query.offset.unwrap(),
false,
)?;
Collaborator:

Same here: the planner makes up more than 50% of the arguments. It seems like it would be better to skip the previous ids_for_placeholder_hash call and pass the planner in here instead, which would reduce a lot of the complexity of both of these calls, wdyt?
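
For illustration only, a minimal sketch of the general shape of this refactor; the struct, field names, and constructor below are hypothetical stand-ins, not the API that was eventually adopted:

// Bundle the planner-derived values into one context value and derive the
// placeholder-hash ids inside the constructor, instead of a separate
// ids_for_placeholder_hash step at every call site.
struct QueryContext {
    predication_operations: Vec<u64>,
    result: Vec<u64>,
    placeholders: Vec<u64>,
}

struct RevelationInput {
    placeholder_hash_ids: Vec<u64>,
}

impl RevelationInput {
    fn new(ctx: &QueryContext) -> Self {
        // Derived internally from the context the caller already provides.
        let placeholder_hash_ids = ctx
            .predication_operations
            .iter()
            .chain(&ctx.result)
            .chain(&ctx.placeholders)
            .copied()
            .collect();
        Self { placeholder_hash_ids }
    }
}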

Contributor Author:

Modified the revelation circuit APIs in commit 5fd774f so that ids_for_placeholder_hash is no longer computed outside of the APIs.

index_tree_path,
index_tree_siblings,
);
matching_rows_input.push(MatchingRow::new(row_proof, path, result));
Collaborator:

Can you explain why you require passing in the results for the revelation proof? We usually use recursion and public inputs to pass information to the upper layers, so why can't we do that here?

Collaborator:

Ok sorry, I forgot that our circuits actually compute a digest of the results when there are no aggregation functions, so we need the raw values to recompute the digest in the final revelation.

Comment on lines +416 to +420
b.select_hash(
is_row_node_leaf[i],
&row_proof.tree_hash_target(),
&row_node_hash,
)
Collaborator:

Why enforce that select if we compute it anyway and have the data? Can't we just use the computed one?

Contributor Author:

Because we need to use the hash fetched from the row proof, which is generated with the universal circuit. As an optimization for the other type of queries, that circuit already computes the hash of the node if the row is stored in a leaf node; otherwise, the hash exposed as a public input is the hash of the embedded cells tree. So, when processing this proof, we need to enforce that we correctly recompute the hash of the node from the hash exposed as a public input by the row proof.
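
Outside the circuit, the selection being enforced amounts roughly to the following plain-Rust restatement (hypothetical names, purely illustrative of the leaf vs. non-leaf cases described above):

type Hash = [u8; 32];

// The row proof exposes the full node hash when the row sits in a leaf node,
// and only the embedded cells-tree hash otherwise; in the latter case the node
// hash has to be recomputed before being matched against the tree.
fn expected_node_hash(is_leaf_row_node: bool, proof_hash: Hash, recomputed_node_hash: Hash) -> Hash {
    if is_leaf_row_node {
        proof_hash
    } else {
        recomputed_node_hash
    }
}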

Comment on lines +457 to +468
max_result = if let Some(res) = &max_result {
let current_result: [UInt256Target; S] =
get_result(i).to_vec().try_into().unwrap();
let is_smaller = b.is_less_than_or_equal_to_u256_arr(res, &current_result).0;
// flag specifying whether we must enforce DISTINCT for the current result or not
let must_be_enforced = b.and(is_matching_row, distinct);
let is_smaller = b.and(must_be_enforced, is_smaller);
b.connect(is_smaller.target, must_be_enforced.target);
Some(current_result)
} else {
Some(get_result(i).to_vec().try_into().unwrap())
};
Collaborator:

Have you considered using the digest of the results and just making sure they're all different? Checking the digest would only entail 1 digest selection across the S rows, instead of checking all the results for each row.
Might be worse or better depending on the constants used, I guess?

Contributor Author:

Yes, I considered it, but in the end I didn't do it because I would have needed to add a comparison gadget for digests, while one for vectors of u256 was already available. So, given that we had a strict timeline, I decided to cut scope and do it like this. It should actually be more efficient to implement this logic over digests rather than over all the results, so if you think it's ok to spend some time on it (it should take half a day at most), I can definitely do it. Wdyt?

Contributor Author:

Actually, I realized that to compare digests we would need to split each of the 11 field elements of the digest into 32-bit limbs, in order to use the same order relation we currently employ for slices of 32-bit limbs. This split would be done with this method, which requires about 2 rows per field element (due to range checks). So, splitting a digest into 32-bit limbs would need 22 Plonky2 rows, and comparing it with another digest (already split into 32-bit limbs) would require another 8 rows, so 30 rows in total.
On the other hand, with the current comparison of the results, we need about 3 rows per result item, so with the same number of rows we can compare up to 10 result items, which looks more than enough to me.
So, TL;DR: comparing digests instead of the actual results would only become convenient if we had more than 10 result items per row (i.e., more than 10 items in the SELECT statement of the query), so it doesn't seem really worth it to me, given that it would also require additional work to implement this comparison. Wdyt?
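
A quick back-of-the-envelope check of the arithmetic above, using the approximate per-row figures quoted in this comment (not benchmarks):

// Back-of-the-envelope row counts from the discussion above.
const DIGEST_FIELD_ELEMENTS: usize = 11;
const SPLIT_ROWS_PER_ELEMENT: usize = 2; // 32-bit limb split, dominated by range checks
const DIGEST_COMPARE_ROWS: usize = 8;    // comparing two already-split digests
const ROWS_PER_RESULT_ITEM: usize = 3;   // current UInt256 comparison cost per item

fn main() {
    let digest_rows = DIGEST_FIELD_ELEMENTS * SPLIT_ROWS_PER_ELEMENT + DIGEST_COMPARE_ROWS;
    println!("digest comparison: ~{digest_rows} rows"); // 11 * 2 + 8 = 30
    let break_even_items = digest_rows / ROWS_PER_RESULT_ITEM;
    println!("break-even: ~{break_even_items} result items"); // 30 / 3 = 10
}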

Collaborator:

Then it's problem solved :) Thanks!

// First, we compute the digest of the results corresponding to this row, as computed in the universal
// query circuit, to check that the results correspond to the one computed by that circuit
let cells_tree_hash =
build_cells_tree(b, &get_result(i)[2..], &ids[2..], &is_item_included[2..]);
Collaborator:

But the results could be something like balance * 10, for example; that wouldn't match the cells tree as stored in the tree, would it? I'm a bit lost on why we are re-creating that cells tree hash here, since we pass in the results and not the raw values stored in our db.

Contributor Author:

Because the universal circuit, which is employed to generate the proof for a single row, still assumes that a results tree will be built for this type of query. So it accumulates all the results related to this row in a cells tree, and the digest is computed from that cells tree. Here we recompute the cells tree in order to correctly recompute the digest. This lets us avoid introducing another variant of the universal circuit just for this type of query, and instead re-use the variant we will employ once the results tree is integrated.

Collaborator:

Ahh, makes sense, thanks for the explanation. Could you quickly put that in a comment above? Thanks.

Contributor Author:

Done in commit c89fc0f

@nikkolasg (Collaborator) left a comment:

LGTM, thanks! Just asked for one additional small comment.

Let's maybe hold off on merging until we get #362 into main and integrated?

@nicholas-mainardi nicholas-mainardi merged commit 5e2df85 into feat/tabular-queries Oct 29, 2024
1 of 4 checks passed
@nicholas-mainardi nicholas-mainardi deleted the feat/unproven-limit-offset-queries branch October 29, 2024 21:06