Simple Select Queries with Unchecked Offset #365

Merged

Conversation

nicholas-mainardi (Contributor):

This PR introduces a circuit to expose the results of simple SELECT queries without aggregation functions, avoiding the need to build a results tree.

@nicholas-mainardi nicholas-mainardi marked this pull request as ready for review September 18, 2024 13:31
@nicholas-mainardi nicholas-mainardi changed the base branch from main to feat/tabular-queries September 19, 2024 13:48
parsil/src/executor.rs (conversation resolved)
@nikkolasg nikkolasg changed the title Simple Select Queries with Unproven Offset Simple Select Queries with Unchecked Offset Sep 23, 2024
@nikkolasg (Collaborator) left a comment:

Nice work!! Left a few comments regarding the API.

planner: &mut QueryPlanner<'a>,
results: Vec<PgSqlRow>,
) -> Result<()> {
let mut exec_query = generate_query_execution_with_keys(&mut parsed, &planner.settings)?;
Collaborator:

I think we need to make sure this query ALWAYS has LIMIT and OFFSET set by default if they are not set already, and that the limit is < MAX_LIMIT, the maximum we can support.
Not sure that's there already.

Contributor Author:

Good point. Do you think this should be done inside Parsil when parsing the query, or is it ok to let the integration test enforce it? Doing it in Parsil would be preferable, I guess, but we would need to somehow provide the MAX_LIMIT constant.

Collaborator:

Agreed.

Contributor Author:

Done in commit 994a115. I added checks for all the upper bounds and modified the expander component of Parsil to always add LIMIT MAX_LIMIT if the user didn't specify a LIMIT in the query. Also tagging @delehef to take a look at this commit.
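
To make the enforced invariant concrete, here is a minimal sketch of the idea; the Query struct, the MAX_LIMIT value, and enforce_limit_offset are hypothetical stand-ins for the real Parsil AST and constant, not the actual implementation:

/// Hypothetical, simplified view of a parsed query; the real Parsil AST differs.
struct Query {
    limit: Option<u64>,
    offset: Option<u64>,
}

/// Assumed circuit-supported maximum; the real constant lives elsewhere.
const MAX_LIMIT: u64 = 64;

/// Enforce the invariant discussed above: LIMIT and OFFSET are always set,
/// and LIMIT never exceeds MAX_LIMIT.
fn enforce_limit_offset(query: &mut Query) -> Result<(), String> {
    match query.limit {
        None => query.limit = Some(MAX_LIMIT),
        Some(l) if l > MAX_LIMIT => {
            return Err(format!("LIMIT {l} exceeds the supported maximum {MAX_LIMIT}"));
        }
        Some(_) => {}
    }
    // OFFSET defaults to 0 when the user didn't specify it.
    query.offset.get_or_insert(0);
    Ok(())
}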

Comment on lines +86 to +93
let row_tree_info = RowInfo {
satisfiying_rows: matching_rows
.iter()
.map(|(key, _, _)| key)
.cloned()
.collect(),
tree: &planner.table.row,
};
Collaborator:

I think it's ok to use the tree directly now, assuming we have this small maximum LIMIT enforced. Otherwise, we would need to use the cache, the wide_lineage stuff.

Contributor Author:

I initially tried to do it with wide_lineage, but it seemed kind of pointless: it only yields the keys of the nodes, and for each node I would still have to call get_node_info to build the path. So the added code complexity didn't seem really worth it to me.

Collaborator:

👍

Comment on lines 184 to 193
let sibling_hash = array::from_fn(|i| {
siblings
.get(i)
.and_then(|sibling| {
sibling
.clone()
.and_then(|node| Some(node.compute_node_hash(index_id.to_field())))
})
.unwrap_or(*empty_poseidon_hash())
});
Collaborator:

Why require the sibling nodes if you only compute the hash? We can access the hash of any node in the tree, so I can do let left_payload = tree.fetch(node.left); let left_hash = left_payload.hash;
Creating a NodeInfo requires many more SQL calls, since we need to load the children and even the grandchildren.
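
For illustration, a minimal sketch of the suggested direction, with hypothetical Tree and Payload types standing in for the real storage layer (only the fetch call and the stored hash field come from the comment above):

// A minimal sketch (hypothetical types) of the difference pointed out above.
struct Payload {
    // The node hash is already persisted with the node payload.
    hash: [u8; 32],
}

struct Tree; // backed by SQL in the real code

impl Tree {
    // Single lookup: returns the stored payload, including the node hash.
    fn fetch(&self, _key: u64) -> Payload {
        Payload { hash: [0u8; 32] }
    }
}

// Suggested approach: read the stored hash directly, instead of rebuilding a
// NodeInfo (which needs the children and grandchildren loaded from SQL) just
// to recompute the same hash with compute_node_hash.
fn sibling_hash(tree: &Tree, sibling_key: Option<u64>) -> [u8; 32] {
    sibling_key
        .map(|key| tree.fetch(key).hash)
        .unwrap_or([0u8; 32]) // stand-in for *empty_poseidon_hash()
}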

Contributor Author:

I realized this while dealing with the integration test, but I didn't change these APIs so I could get the PR ready as soon as possible. I can definitely do it now, thanks for pointing it out.

Contributor Author:

Done in commit 1d05c47

Collaborator:

Since it's a "template" for the dist-query folks to follow, we should at least show the optimized way when possible!

Comment on lines 133 to 138
let pis_hash = QueryCircuitInput::ids_for_placeholder_hash(
&planner.pis.predication_operations,
&planner.pis.result,
&planner.query.placeholders,
&planner.pis.bounds,
)?;
Collaborator:

Do we really need to call this? It seems unnecessary from the user's perspective; can't we do it internally?

Contributor Author:

The reason I initially added this method was to avoid having to pass all of these inputs to RevelationCircuitInput::new_revelation_no_results_tree, so that its interface wouldn't take too many parameters. So, although new_revelation_unproven_offset could already compute this internally (given that all these inputs need to be provided anyway), I chose to leave this computation outside so that there is no difference in this regard between the 2 types of queries. If you prefer to remove it, I can add these input values to new_revelation_no_results_tree and compute everything internally. Wdyt?

Collaborator:

What I'm thinking is that we should pass the pis directly into RevelationCircuitInput::new_revelation_unproven_offset(, so the user doesn't need to do anything on top, wdyt?
Either the PI, or a local struct in vdb that contains all the elements the pis need to provide.

Contributor Author:

Done in commit 5fd774f

Comment on lines 141 to 153
let input = RevelationCircuitInput::new_revelation_unproven_offset(
indexing_proof,
matching_rows_input,
&planner.pis.bounds,
&planner.query.placeholders,
pis_hash,
&column_ids,
&planner.pis.predication_operations,
&planner.pis.result,
planner.query.limit.unwrap(),
planner.query.offset.unwrap(),
false,
)?;
Collaborator:

Same here: the planner makes up more than 50% of the arguments. It seems like it would be better to skip the previous ids_for_placeholder_hash call and pass the planner in here instead, which would reduce a lot of the complexity of both of these calls, wdyt?
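
For illustration only, a minimal sketch of the general shape of this refactor; the struct, field names, and constructor below are hypothetical stand-ins, not the API that was eventually adopted:

// Bundle the planner-derived values into one context value and derive the
// placeholder-hash ids inside the constructor, instead of a separate
// ids_for_placeholder_hash step at every call site.
struct QueryContext {
    predication_operations: Vec<u64>,
    result: Vec<u64>,
    placeholders: Vec<u64>,
}

struct RevelationInput {
    placeholder_hash_ids: Vec<u64>,
}

impl RevelationInput {
    fn new(ctx: &QueryContext) -> Self {
        // Derived internally from the context the caller already provides.
        let placeholder_hash_ids = ctx
            .predication_operations
            .iter()
            .chain(&ctx.result)
            .chain(&ctx.placeholders)
            .copied()
            .collect();
        Self { placeholder_hash_ids }
    }
}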

Contributor Author:

Modified the revelation circuit APIs in commit 5fd774f so that ids_for_placeholder_hash is no longer computed outside of the APIs.

index_tree_path,
index_tree_siblings,
);
matching_rows_input.push(MatchingRow::new(row_proof, path, result));
Collaborator:

Can you explain why you require passing in the results for the revelation proof? We usually use recursion and public inputs to pass information to the upper layers, so why can't we do that here?

Collaborator:

Ok sorry, I forgot that our circuits actually compute a digest of the results when there are no aggregation functions, so we need the raw values to recompute the digest in the final revelation.

Comment on lines +416 to +420
b.select_hash(
is_row_node_leaf[i],
&row_proof.tree_hash_target(),
&row_node_hash,
)
Collaborator:

Why enforce that select if we compute it anyway and have the data? Can't we just use the computed one?

Contributor Author:

Because we need to use the hash fetched from the row proof, which is generated with the universal circuit. As an optimization for the other type of queries, that circuit already computes the hash of the node if the row is stored in a leaf node; otherwise, the hash exposed as a public input is the hash of the embedded cells tree. So, when processing this proof, we need to enforce that we correctly recompute the hash of the node from the hash exposed as a public input by the row proof.
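
Outside the circuit, the selection being enforced amounts roughly to the following plain-Rust restatement (hypothetical names, purely illustrative of the leaf vs. non-leaf cases described above):

type Hash = [u8; 32];

// The row proof exposes the full node hash when the row sits in a leaf node,
// and only the embedded cells-tree hash otherwise; in the latter case the node
// hash has to be recomputed before being matched against the tree.
fn expected_node_hash(is_leaf_row_node: bool, proof_hash: Hash, recomputed_node_hash: Hash) -> Hash {
    if is_leaf_row_node {
        proof_hash
    } else {
        recomputed_node_hash
    }
}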

Comment on lines +457 to +468
max_result = if let Some(res) = &max_result {
let current_result: [UInt256Target; S] =
get_result(i).to_vec().try_into().unwrap();
let is_smaller = b.is_less_than_or_equal_to_u256_arr(res, &current_result).0;
// flag specifying whether we must enforce DISTINCT for the current result or not
let must_be_enforced = b.and(is_matching_row, distinct);
let is_smaller = b.and(must_be_enforced, is_smaller);
b.connect(is_smaller.target, must_be_enforced.target);
Some(current_result)
} else {
Some(get_result(i).to_vec().try_into().unwrap())
};
Collaborator:

Have you considered using the digest of the results and just making sure they're all different? Checking the digest would only entail 1 digest selection across the S rows, instead of checking all the results for each row.
Might be worse or better depending on the constants used, I guess?

Contributor Author:

Yes, I considered it, but in the end I didn't do it because I would have needed to add a comparison gadget for digests, while one for vectors of u256 was already available. So, given that we had a strict timeline, I decided to cut scope and do it like this. It should actually be more efficient to implement this logic over digests rather than over all the results, so if you think it's ok to spend some time on it (it should take half a day at most), I can definitely do it. Wdyt?

Contributor Author:

Actually, I realized that to compare digests we would need to split each of the 11 field elements of the digest into 32-bit limbs, in order to use the same order relation we currently employ for slices of 32-bit limbs. This split would be done with this method, which requires about 2 rows per field element (due to range checks). So, splitting a digest into 32-bit limbs would need 22 Plonky2 rows, and comparing it with another digest (already split into 32-bit limbs) would require another 8 rows, so 30 rows in total.
On the other hand, with the current comparison of the results, we need about 3 rows per result item, so with the same number of rows we can compare up to 10 result items, which looks more than enough to me.
So, TL;DR: comparing digests instead of the actual results would only become convenient if we had more than 10 result items per row (i.e., more than 10 items in the SELECT statement of the query), so it doesn't seem really worth it to me, given that it would also require additional work to implement this comparison. Wdyt?
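
A quick back-of-the-envelope check of the arithmetic above, using the approximate per-row figures quoted in this comment (not benchmarks):

// Back-of-the-envelope row counts from the discussion above.
const DIGEST_FIELD_ELEMENTS: usize = 11;
const SPLIT_ROWS_PER_ELEMENT: usize = 2; // 32-bit limb split, dominated by range checks
const DIGEST_COMPARE_ROWS: usize = 8;    // comparing two already-split digests
const ROWS_PER_RESULT_ITEM: usize = 3;   // current UInt256 comparison cost per item

fn main() {
    let digest_rows = DIGEST_FIELD_ELEMENTS * SPLIT_ROWS_PER_ELEMENT + DIGEST_COMPARE_ROWS;
    println!("digest comparison: ~{digest_rows} rows"); // 11 * 2 + 8 = 30
    let break_even_items = digest_rows / ROWS_PER_RESULT_ITEM;
    println!("break-even: ~{break_even_items} result items"); // 30 / 3 = 10
}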

Collaborator:

Then it's problem solved :) Thanks!

// First, we compute the digest of the results corresponding to this row, as computed in the universal
// query circuit, to check that the results correspond to the one computed by that circuit
let cells_tree_hash =
build_cells_tree(b, &get_result(i)[2..], &ids[2..], &is_item_included[2..]);
Collaborator:

But the results could be something like balance * 10, for example; that wouldn't match the cells tree as stored in the tree, would it? I'm a bit lost on why we are re-creating that cells tree hash here, since we pass in the results and not the raw values stored in our db.

Contributor Author:

Because the universal circuit, which is employed to generate the proof for a single row, still assumes that a results tree will be built for this type of query. So it accumulates all the results related to this row in a cells tree, and the digest is computed from that cells tree. Here we recompute the cells tree in order to correctly recompute the digest. This lets us avoid introducing another variant of the universal circuit just for this type of query, and instead re-use the variant we will employ once the results tree is integrated.

Collaborator:

Ahh, makes sense, thanks for the explanation. Could you quickly put that in a comment above? Thanks.

Contributor Author:

Done in commit c89fc0f

@nikkolasg (Collaborator) left a comment:

LGTM, thanks! Just asked for one additional small comment.

Let's maybe hold off on merging until we get #362 into main and integrated?

@nicholas-mainardi nicholas-mainardi merged commit 5e2df85 into feat/tabular-queries Oct 29, 2024
1 of 4 checks passed
@nicholas-mainardi nicholas-mainardi deleted the feat/unproven-limit-offset-queries branch October 29, 2024 21:06