Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

materialize-elasticsearch: handle fields with multiple types, and array fields #2272

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

williamhbaker
Copy link
Member

@williamhbaker williamhbaker commented Jan 14, 2025

Description:

Adds handling for fields that have multiple types & fields that are arrays so that they can be materialized to Elasticsearch.

Arrays where the array items are known to have a single type (not including null) can use an index mapping of that type. For example, a mapping of type text will work with values that are "some value", ["some value", "another value"], ["yet another value", null], etc. The only exception is an array field where the items are singly-typed as an array: Although you can actually send a value like [["first", "second"], ["third"]] to a text mapping like this and have it work, and the connector protocol does not provide "recursive" inference like this. All that can be known about such a field is that it is an array containing arrays, and so there is no "leaf" type to use for the mapping type.

Support for a few other things - fields with multiple types, arrays with unknown or non single-typed items, and the aforementioned singly-typed arrays where the type is itself array - is added here by wrapping the values in an object. An object type itself is already able to be materialized to a flattened mapping type, which basically stores arbitrary JSON data (as long as it is an object) in the mapping. We will construct a synthetic object to wrap these values in, in the form of:

{
  "json": "whatever the JSON representation of the value is"
}

...where the value of the "json" field is obviously the JSON representation of the value, which may itself be an object, and array, or anything else.

Closes #2271

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

N/A since we don't mention anything about fields that may or may not be allowed to be materialized in the docs. This may be an area for future documentation improvement.

Notes for reviewers:


This change is Reviewable

With array inference information available in the connector protocol,
materializations like Elasticsearch can create appropriately typed "array" index
mappings for schematized arrays where there is known to be a single type of
item.
… non singly-typed arrays

Expands handling for fields that have multiple types, and arrays that do not
have a single type of item.

This is done by materializing the values for these fields as a synthetic object
and putting that in a "flattened" index mapping, the same way as other objects
are materialized. The generated object has a key of "json" and its value is the
JSON representation of the original value.
@@ -148,13 +147,16 @@ func (constrainter) NewConstraints(p *pf.Projection, deltaUpdates bool) *pm.Resp
switch {
case p.IsPrimaryKey:
constraint.Type = pm.Response_Validated_Constraint_LOCATION_REQUIRED
constraint.Reason = "Primary key locations are required"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This NewConstraints function is now exactly the same as the one used in materialize-sql:

func (constrainter) NewConstraints(p *pf.Projection, deltaUpdates bool) *pm.Response_Validated_Constraint {
_, isNumeric := boilerplate.AsFormattedNumeric(p)
var constraint = pm.Response_Validated_Constraint{}
switch {
case p.IsPrimaryKey:
constraint.Type = pm.Response_Validated_Constraint_LOCATION_REQUIRED
constraint.Reason = "All Locations that are part of the collections key are required"
case p.IsRootDocumentProjection() && deltaUpdates:
constraint.Type = pm.Response_Validated_Constraint_LOCATION_RECOMMENDED
constraint.Reason = "The root document should usually be materialized"
case p.IsRootDocumentProjection():
constraint.Type = pm.Response_Validated_Constraint_LOCATION_REQUIRED
constraint.Reason = "The root document must be materialized"
case len(p.Inference.Types) == 0:
constraint.Type = pm.Response_Validated_Constraint_FIELD_FORBIDDEN
constraint.Reason = "Cannot materialize a field with no types"
case p.Field == "_meta/op":
constraint.Type = pm.Response_Validated_Constraint_LOCATION_RECOMMENDED
constraint.Reason = "The operation type should usually be materialized"
case strings.HasPrefix(p.Field, "_meta/"):
constraint.Type = pm.Response_Validated_Constraint_FIELD_OPTIONAL
constraint.Reason = "Metadata fields are able to be materialized"
case p.Inference.IsSingleScalarType() || isNumeric:
constraint.Type = pm.Response_Validated_Constraint_LOCATION_RECOMMENDED
constraint.Reason = "The projection has a single scalar type"
case slices.Equal(p.Inference.Types, []string{"null"}):
constraint.Type = pm.Response_Validated_Constraint_FIELD_FORBIDDEN
constraint.Reason = "Cannot materialize a field where the only possible type is 'null'"
case p.Inference.IsSingleType() && slices.Contains(p.Inference.Types, "object"):
constraint.Type = pm.Response_Validated_Constraint_FIELD_OPTIONAL
constraint.Reason = "Object fields may be materialized"
default:
// Any other case is one where the field is an array or has multiple types.
constraint.Type = pm.Response_Validated_Constraint_LOCATION_RECOMMENDED
constraint.Reason = "This field is able to be materialized"
}
return &constraint
}

Copy link
Member

@Alex-Bair Alex-Bair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

materialize-elasticsearch: support arrays with singly-typed items
2 participants