Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested field filtering #47

Merged
merged 48 commits into from
Sep 23, 2024
Merged

Conversation

codingkarthik
Copy link
Collaborator

@codingkarthik codingkarthik commented Sep 12, 2024

Introduction

P.S: This PR is actually not that big, the changes in package-lock.json take about ~400 lines of the diff.

This PR is the first incremental PR towards adding support for filtering rows by nested fields for Azure Cosmos DB NoSQL, allowing to filter rows based on nested fields within documents. This feature enhances query flexibility and precision, particularly for complex document structures.

Consider the following example,

{
    "year": 2023,
    "category": "chemistry",
    "laureates": [
        {
            "id": "1030",
            "firstname": "Louis",
            "surname": "Brus",
            "motivation": "\"for the discovery and synthesis of quantum dots\"",
            "share": "3",
            "birth": {
                "year": 1973,
                "country": {
                    "city": "Sydney",
                    "state": "Tokyo"
                }
            },
            "awards": [
                {
                    "award_name": "Turing Award",
                    "year": 2020,
                    "details": {
                        "category": "physics",
                        "organization": "Wolf Foundation"
                    }
                }
            ]
        },
        {
            "id": "1031",
            "firstname": "Aleksey",
            "surname": "Yekimov",
            "motivation": "\"for the discovery and synthesis of quantum dots\"",
            "share": "3",
            "birth": {
                "year": 1967,
                "country": {
                    "city": "Sydney",
                    "state": "Ile-de-France"
                }
            },
            "awards": [
                {
                    "award_name": "Lasker Award",
                    "year": 2007,
                    "details": {
                        "category": "physics",
                        "organization": "Fields Medal Committee"
                    }
                },
                {
                    "award_name": "Fields Medal",
                    "year": 2015,
                    "details": {
                        "category": "physics",
                        "organization": "Wolf Foundation"
                    }
                },
                {
                    "award_name": "Lasker Award",
                    "year": 2016,
                    "details": {
                        "category": "chemistry",
                        "organization": "Turing Award Committee"
                    }
                }
            ]
        }
    ],
    "prize_id": 1,
    "id": "acd26fd5-894c-4536-8dfe-39be74abdb3a",
    "_rid": "i4BNAJxQVspZAgAAAAAAAA==",
    "_self": "dbs/i4BNAA==/colls/i4BNAJxQVso=/docs/i4BNAJxQVspZAgAAAAAAAA==/",
    "_etag": "\"38006bec-0000-0300-0000-66e2a8de0000\"",
    "_attachments": "attachments/",
    "_ts": 1726130398
}

With the help of nested fields filtering, the user will be able to filter using the $.laureates.awards.award_name which is pretty useful because Azure Cosmos for NoSQL DB doesn't have the concept of relationships, so the related fields are stored as nested fields.

How it works?

Nested filtering works in the following manner:

An NDC request's predicate looks like the following:

{
  "predicate": {
      "type": "binary_comparison_operator",
      "column": {
        "type": "column",
        "name": "laureates",
        "field_path": [
          "awards",
          "name"
        ],
        "path": []
      },
      "operator": "eq",
      "value": {
        "type": "scalar",
        "value": "Lasker Award"
      }
    
}

The field_path is the path to the nested field by which the rows need to be filtered.

The field_path will be parsed and a new AST will be generated that will annotate the type (array / object / scalar ) .

The AST generated will look like the following below:

{
  "name": "laureates",
  "prefix": "root_TestNobelLaureates",
  "type": {
    "type": "array",
    "elementType": {
      "type": "named",
      "name": "Laureate",
      "kind": "object"
    }
  },
  "nestedField": {
    "kind": "array",
    "nestedField": {
      "kind": "array",
      "field": "awards",
      "nestedField": {
        "kind": "scalar",
        "field": "year",
        "type": "Number"
      }
    },
    "field": "laureates"
  }
}

Once the above AST is generated, the buildNestedFieldPredicate function takes the above AST and will convert it into an SQL expression and viola, the query is executed!

add script to run arbitrary SQL

remove unused env vars

use the linux binary directly

remove checking out ndc-spec

install libssl3

generate the correct nested fields filtering predicate

generate the WHERE predicate for nested fields

get the recursion correct [WIP]

get the binary comparison operator for a Column

pass around the rootContainerAlias

rename visitNestedField1 to visitNestedField

refactor visitNestedField

change the file structure

fix the syntax of the query

remove redundant function

minor no-op refactors

fix tests

update README according to the latest template (#37)

Fix NDC tests (#40)

* fix NDC tests

* checkout v0.1.5

* update typescript ndc version

* npm audit fix

add sql generation tests

Add script to run ndc tests (#41)

* add a npm script to run ndc-tests

* generate configuration before starting the server

* test out github workflow

* try with ubuntu-latest

* add debug statement

* debug

* remove the loging into GHCR

Seed data into test container (#42)

* fix test

* setup data only when --setup-data option is provided

* set up the azure cosmos emulator

* fix syntax errors

* use the emulator values

* setup emulator data

* fix issues

* use await

* disable tls

* fix typo

* fix syntax

* add indexing bit while creating the container

* fail the process when encountering any error

figure out where the bug is

one more attempt at debugging

add debug statement

revert back the cli changes

fix the imports

fix import 1

add nested array object field filtering test case

modify the failing tests according to the new data

add nested filtering tests
src/connector/sql/sqlGeneration.ts Outdated Show resolved Hide resolved
src/connector/sql/sqlGeneration.ts Show resolved Hide resolved
}
}

function getNestedScalarType(nestedField: NestedField): string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this function well, could you please explain it. What I'm reading here is:
For nestedField of type object/array, we always assume the underlying type to be a scalar, and call the getNestedScalarType function. I'm not clear on why we are assuming that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nestedField of type object/array, we always assume the underlying type to be a scalar, and call the getNestedScalarType function.

Yes, we first validate that the last element of the nested field should be a scalar because you can apply a binary/unary comparison operator value only on a scalar value. So, if we don't find a scalar type at the end, the request is deemed as an invalid request.

src/connector/sql/sqlGeneration.ts Show resolved Hide resolved
src/connector/sql/sqlGeneration.ts Outdated Show resolved Hide resolved
src/connector/sql/sqlGeneration.ts Show resolved Hide resolved
@@ -656,18 +796,47 @@ function visitExpression(
} else {
return `${expression.dbOperator.name}(${containerAlias}.${expression.column}, ${comparisonValue}) `;
}

default:
throw new sdk.InternalServerError("Unknown expression type");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add ${expression.type} to the error message for help with debugging if we ever run into this error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it turns out that we will never run into that case because all the cases have already been covered. I will remove the default case then.

@@ -382,30 +382,30 @@ export const scalarComparisonOperatorMappings: ScalarOperatorMappings = {
export function getScalarType(column: Column): string {
if (column.nestedField) {
return getNestedScalarType(column.nestedField);
}

if (column.type === undefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have if (!column.type) here instead? Takes care of column.type being both null or undefined. (more here)

The issue with this is, it's strict checking which means this case is checking only for undefined and not null. If you want to keep the check explicit, maybe use if (column.type == null) (notice only 2 = signs instead of three). More about this here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks, I didn't know about this. I will make the change.


let requestedFields: sql.SelectColumns = {};

if (collectionDefinition === undefined)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (
queryRequest.query.order_by != null &&
queryRequest.query.order_by != undefined
) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@m-Bilal m-Bilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@codingkarthik codingkarthik changed the title Nested filtering (Incremental PR-I) Nested filtering Sep 23, 2024
@codingkarthik codingkarthik changed the title Nested filtering Nested field filtering Sep 23, 2024
@codingkarthik codingkarthik merged commit 9697e04 into main Sep 23, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants