Status: Accepted Type: Feature Created: 2022-08-12 Authors: Victor Petrovykh <[email protected]>
This RFC outlines a generic approach to full-text search (also referred to as FTS in this document) functionality, keeping the core features the same across various specific implementations.
Advanced text search functionality is important when dealing with text fields
containing natural human language. When filtering such fields simpler tools
such as LIKE
or regular expressions are not flexible enough. Full-text
search tools can be built-in (based on Postgres FTS capability) or external
(e.g. Elasticsearch). External tools would likely be implemented in the form
of extensions. However, we want the EdgeQL queries that use FTS to be the
same, regardless of how the search is implemented. This means that this
functionality should use the same built-in functions and even the query
sub-language should be the same. Additionally, we want to make it easier to
look for entire objects matching a particular search query based on all/some
of the str
fields of that object type.
It is useful to consider how Postgres handles full-text search (generally
other FTS implementations also support similar features, so this is useful
baseline). Postgres introduces two types for this purpose: tsvector
and
tsquery
.
The tsvector
type stores a processed version of the text. It basically
contains information about the tokens making up the text as well as their
positions. This is the version of the text that is actually targeted by the
FTS. Postgres also provides a few different ways to convert plain text into a
tsvector
. There's also a way to create an index that converts a text
column into a tsvector
and enables fast searches. Functions that convert
text into tsvector
can also take an argument that determines which parser
will be used for the conversion.
The tsquery
type encapsulates the search conditions, such as which lexemes
should or must appear and their relative order. Postgres also has several ways
of converting a string into a tsquery
.
Postgres provides a way to assign relative weights to different lexemes in the
tsvector
which then can be used to rank FTS matches.
There's a lot of functionality involved in supporting FTS, but it's a fairly
niche thing, so it makes sense to create a module std::fts
for all of the
FTS tools.
Indexes can be used to declare which properties should be subject to FTS functionality. They can also be used to annotate properties with search groups, language information, etc. Search groups can then be used to fine tune results by assigning different weights to each group during searches. Language information would determine what data is targeted based on the language of the search query. The proposed index definitions would thus have a semantic effect determining how exactly FTS should be carried out for a given object type.
In EdgeDB we don't need an explicit "document" scalar that users interact
with, because we don't allow arbitrary strings to be processed by FTS
functions, only indexed properties. However, we still need a type similar to
tsvector
to record things like weight category and language that
correspond to a particular part of the indexed string. This type should be
opaque and not valid as a result of a user query. Instead it can be used in
index expressions and returned by functions that can only be used in index
expressions. The type being opaque makes it possible to add more FTS functions
that affect how the string is interpreted without having to change all the
signatures.
This processed document type and a function that use it are:
type fts::document { # This is hidden implementation, so it's more of a concept # than an actual object type. Under the hood it might be a # record or not even have instances at all (see Implementation). } fts::with_options( text: str, named only language: anyenum, named only weight_category: optional fts::Weight = fts::Weight.A, ) -> fts::document
Since document
is opaque and is not valid in regular queries, we should
really make it some sort of "internal" or "system" type. It doesn't need to be
part of schema introspection and likely will only appear as part of signature
of the special index function. Incidentally, this special function also
doesn't need to be exposed to introspection.
Language is an important part of FTS and we generally assume it will
be part of the index
declaration as well as of the search functions.
Postgres passes the language as part of the index configuration: CREATE
INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
.
Elasticsearch has language-specific analyzers which can be used in the mapping corresponding to a given index.
Algolia has an indexLanguages
setting for its indexes.
Typesense does not appear to have any language-specific index settings.
We define some built-in enums to denote available languages. The most general
is fts::Language
and it is listing languages that are supported across
backends. There are also backend-specific enums such as fts::PGLanguage
and fts::ElasticLanguage
. We can add more as needed. The fts::Language
contains ISO 639-3 language codes or special codes prefixed with edb_
for
languages that don't map neatly onto ISO 639-3, such as edb_Brazilian
and
edb_ChineseJapaneseKorean
.
The fts::Weight
enum is used to specify weight categories:
scalar type fts::Weight extending enum<A, B, C, D>;
In order to simplify FTS query language, we will use a str
to represent
queries rather than introducing a new scalar type. This string is expected to
contain FTS search queries in a format similar to how they might be typed into
a search field.
We assume that each FTS implementation will supply its own index
with the
particular parameters that are configurable for that implementation (which
could be some backend-specific JSON config, etc.). The general index
definition, however has no parameters and will be of the following form:
abstract index fts::index()
Additionally there's sometimes a need to index prefixes to facilitate
prefix-based searches (such as for autocomplete). This can be accomplished by
adding a parameter to the abstract index
definition:
abstract index fts::index( named only index_prefixes: bool = false )
Setting up a prefix-based index, wouldn't really do much for Postgres.
In Elasticsearch that would translate into index_prefixes
with default
minimum and maximum prefix values.
In Algolia prefix searches are by default enabled for all, but can be disabled
via disablePrefixOnAttributes
, which would correspond to not turning on
index_prefixes
.
In Typesense prefix searching is entirely controlled by the query and no special index settings are necessary one way or another, much like in Postgres.
Therefore, if we introduce a generic prefix search FTS functionality we can
add index_prefixes
parameter to the generic abstract index definition.
As mentioned earlier concrete indexes on types will specify which properties
the FTS functions should be targeting. When targeting multiple properties they
can be specified as separate index on
parameters or just concatenated into
a single expression. Keeping the parameters separate makes it possible to
assign different weights to them.
For example:
type Document { title: str; body: str; internal: str; index fts::index on (( fts::with_options(.title, language := fts::Language.eng), fts::with_options(.body, language := fts::Language.eng), )); }
The above schema declares that only title
and body
properties are
subject to FTS functions, while the internal
property will be ignored by
searches. These properties must be wrapped in a with_options
special
function that specifies the language of the corresponding parts (which affects
the indexing).
In order to add weights, we pass weight_category
to
``fts::with_options``special function:
type Document { title: str; body: str; internal: str; index fts::index on ( fts::with_options( .title, language := fts::Language.eng, weight_category := fts::Weight.A ), fts::with_options( .body, language := fts::Language.eng, weight_category := fts::Weight.B ), ); }
In index specification we associate a weight group with a particular indexed
expression (typically just a property). Using fts::Weight
enum to denote
weight groups makes it less likely that the users confuse the group name
with the actual associated weight value.
The most basic naming scheme for weight groups could be similar to Postgres: just using the alphabet starting with 'A'. We can accommodate different number of groups in the future by adding backend-specific enums.
Since fts::index
index is semantic and has effect on how FTS
functions work, we cannot have multiple versions of this index defined on a
single object type. So at most one FTS index can be defined on any type
(subject to ??
once the extension enabling/disabling is implemented). In
particular this means that when an FTS index is inherited by a type and a new
index definition is provided in the extending type, the new definition
completely overrides the inherited one.
For example:
type Document { body: str; index fts::index on ( fts::with_options(.body, language := fts::Language.eng) ); } type InternalDoc extending Document { internal: str; } type TitledDoc extending Document { title: str; index fts::index on ( fts::with_options(.title, language := fts::Language.eng) ); }
The InternalDoc
type has no FTS index of its own, so it inherits the index
from Document
. Thus only the .body
is indexed. Conversely,
TitledDoc
defines a new fts::index
index on .title
, but the
body
property would no longer be indexed for TitledDoc
objects.
The reason for this design is that we're limited to having no more than one FTS index per type as it covers all search functionality. This means that we must have a way of overriding an existing index with type-specific one. The least surprising way of doing that is for the explicitly specified index to override the inherited one.
In case of multiple inheritance indexes may conflict with one another, that is considered an error and the user is required to explicitly provide an FTS index to resolve this (to minimize ambiguity).
All of these extra rules apply only to FTS indexes and not indexes in general.
An important feature of the search function is that it actually performs the
job typically done by the FILTER
clause. The function is supposed to
filter a bunch of objects based on the FTS query and indexed fields. In
addition to filtering, the results may be annotated with relevance (ranked)
and potentially with highlighted matches:
function std::fts::search( object: anyobject, query: str, named only language: str = <str>fts::Language.eng, named only weights: optional array<float64> = {}, ) -> optional tuple< object: anyobject, score: float64 >
The object
input provides the object that should be searched. The
details of which properties should be searched and other FTS parameters will
be provided by the FTS index on the specified type. Searching an unindexed
type should simply produce no matches (resulting in an {}
).
The language
parameter indicated which language the query is using and
therefore allows to target only the relevant documents if there are multiple
languages. In order to simplify the search function and make it play nice with
Postgres indexes, we will use a str
to represent languages instead of the
enum. This string is expected to be representing the fts::Language
enum
values, though.
The optional weights
array provides relevance multipliers corresponding to
each of the weight groups indicated by the FTS index. The weights have to be
in the range from 0 to 1. If the weights are outside of the valid range an
error is produced. There should be a default weights array in case no weights
are provided. By default weights should start with 1
and be halved for
every next group, indicating diminishing relevance. The weights
array of
values provided explicitly overrides the corresponding defaults. This means
that first 4 groups would by default have the following weights: 1
,
0.5
, 0.25
, 0.125
. On the other hand if the match
was called
with weights := [1, 0.7]
as an argument, then the first 4 groups would
have these weights: 1
, 0.7
, 0
, 0
. So only the first 2 groups
would be searched.
It is worth considering implementing a search function that returns a free object instead of the named tuple (with the same structure). A free object would interact more naturally and smoothly with shapes used to get the right data for the output.
Another type of search functionality involves "autocomplete"-style search for things matching a certain prefix. This is similar to regular matching, but typically the last word is treated as a prefix, while the start of the query matches exactly. Although it's possible to use one of the arguments to switch to this mode, it's probably better to have a dedicated function instead, since the actual FTS implementation may need a different call to be compiled to access the backend:
function std::fts::autocomplete( object: set of anyobject, query: str, named only language: str, named only weights: optional array<float64>, ) -> optional tuple< object: anyobject, rank: float64 >
One of the implementation concerns is how inheritance interacts with FTS indexes and search functionality. The natural interpretation is that all the different fields should be targeted polymorphically as per individual index specification. This means that searching the root type of some documents would be a good way to target all different document varieties without having to explicitly specify them one by one.
A couple of different approaches can be used to implement this:
- A hidden
_document
column can be added for purposes of Postgres-backed FTS functionality. This makes it possible to resolve all the nuances in the index specification and provide a uniform way of accessing the searchable parts of the documents for FTS functions. The downside is that this is duplicating the data that's already present as actual properties and indexes. - Expand a single search call into several calls, each of them targeting only one table. This avoid data duplication, but is much harder to implement.
The document
type is needed to declare the signatures of special function
like with_options
, but is never meant to be used directly in queries. In
fact, in all likelihood, it may never have any instances of actual values,
because these values are used exclusively by the compiler and generally
decomposed into their individual components and then the components are
compiled (often as literals) into the specific underlying indexes.
There's no way to encompass all the features of many different FTS
implementations in a single clear search language spec. Instead we want to
provide a simple search language to cover many common use-cases and to cover
searches that are driven by user input. One of the limitations is that the
search query must be expressible as a str
, ideally similar to actual
user-input.
We will follow the format similar to Google search queries:
- Search terms can appear as plain words. These are acceptable terms.
- Terms quoted by
"..."
must be treated as a phrase (preserving word order). These are highly desirable terms. - Terms prefixed by
-
are excluded terms and they must not appear in the matching document at all. OR
may appear between any terms. It doesn't affect any acceptable terms, but it downgrades any adjacent highly desirable terms to be now acceptable. When appearing next to an excluded term, it makes that exclusion optional (which usually negates its usefulness).AND
may appear between any terms. It doesn't affect any highly desirable or excluded terms, but it upgrades an acceptable term to be highly desirable.- Ideally all highly desirable terms must appear in the matching document.
- At least some of the acceptable terms must appear in the matching document. The more the better, but there's no strict preference for which ones get matched.
For example:
- The search string
quick brown fox jumps
indicates that as long as the document contains any of the three search terms. Sobrown sugar
orrunning foxes
are valid matches, butthe quick brown dog jumps over the fox
is a much better match (more matched terms), which should be reflected in the rankings. - The search string
"quick brown" fox jumps
indicates that the document must contain the phrase "quick brown" and possibly some of the words "fox" and "jump" (or their variants). Sothe quick brown dog jumps over the fox
is a valid match, but notthe fox is quick and brown
(phrase not matched). - The search string
quick AND brown fox jumps
indicates that the document must contain the words "quick" and "brown" and and possibly some of the words "fox" and "jump". Thusthe fox is quick and brown
is a valid match and so isjump to the quick recipe for brown sugar
.
To map this kind of search query to Postgres backend websearch_to_tsquery
can be used. The main caveat might be that each term is linked with &
in
Postgres, so we may need to inject explicit or
to avoid this.
Elasticsearch has operator
values and
and or
available to specify
whether all terms in a query should be matched or just any of them. There's a
must_not
, and match_phrase
operation in addition to match
. The
query can be nested and have complex structure. Generally the str
query
will need to be parsed and split into this nested JSON structure.
Algolia with advancedSyntax
turned on uses a very similar syntax for
search queries (quoted phrases and -
for negative matches). By default it
looks for all the query terms, but we can use optionalWords
instead to
make searches that don't have to match everything.
Typesense, much like Algolia has a very similar query syntax to the one proposed above. By default all terms are optional, but the more of them are matched the better the ranking of the result.
As a general rule FTS implementation will adhere to the above rules on best-effort basis. The specifics of each backend may affect how strict are the matches and phrases.
There are no backwards compatibility issues because we now have nested modules.
There are no security implications.
We change the return of match
function from FTSResult to a named tuple.
This simplifies the spec and implementation without loss of functionality.
We rejected the idea of using annotations for specifying the FTS language settings. This is due to the fact that it's difficult for functions to properly interface with that and it may result in unnecessary compiler-level magic.
In general using annotations for fine-tuning FTS functionality impacts the complexity of the implementation potentially requiring recompiling any user functions that actually use FTS functionality inside of them. So, at least for the moment, this is undesirable.
Rename result "boost" into "weight" and make it part of the function signature rather than an annotation that is magically picked up by the compiler. Additionally, make the valid range of "weight" to be [0, 1] rather than being arbitrary and relying on automatic scaling of these values relative to each other. The benefit of the fixed valid range is that it makes it more clear whether a particular "weight" value is large or small without needing a larger context.
The search functions only take objects, not arbitrary str
expressions.
This means that we sacrifice flexibility of arbitrary searches for not having
to worry whether the search expression actually matches the index expression
and therefore whether the FTS index is going to be actually used.
We rejected the test
and match
function duality because they are
low-level and hard to use with filters and pagination, while still using the
index efficiently. In addition boolean test
seems to be a very limited in
terms of usefulness as compared to match
which allows ranking results. A
single search
function can definitely perform the duties of both of them.
Given that we no longer allow searching arbitrary strings and completely rely
on indexes to mark the parts of documents that are subject to FTS, the niche
usefulness of test
is further reduced.
We rejected element-wise search
in favor of a function operating on a set
of objects. This approach is in line with the general operation flow in
EdgeDB, where a query is a pipeline that processes a set gradually
transforming it into the result. Having this as a set function makes it less
awkward for the user to associate ranking and the actual objects and then
filter based on that, by removing the necessity for the user to manually
inject the ranking into the object shape, in order to filter and paginate
based on that. Instead the user now applies the search function to the desired
set of objects and has an annotated result set ready for filtering and
pagination.
The language
parameter should not be omitted from search functions because
there might be multiple indexed languages and the backend needs to know which
one the query targets (and therefore which stemmer to use, etc.).
We rejected set of
documents parameter for the search function, largely
because it's harder to implement and we don't rely of operating on a set of
objects. We considered making the search
function return an ordered set
or results, but in the end opted out of it and thus we no longer need this
function to operate on sets.
Replace fts::weight
and fts::language
special functions with a single
fts::with_options
special function. The weight_category
and
language
are just named only arguments for this function. This is to avoid
unnecessary nesting when supplying both weight and language.
For the moment we don't include highlighted text in the search results. We don't yet have a clear general way of formatting that and using HTML-like strings seems very hacky.
Use an fts::Language
and fts::Weight
enums instead of str
to limit
what languages and weights are supported. We can customized and extended these
enums if needed in the future.