-
Notifications
You must be signed in to change notification settings - Fork 242
DATAWAVE Query
The main query logic in DATAWAVE makes use of 4 main tables in Accumulo: shard, shardIndex, shardReverseIndex, and DatawaveMetadata. The format of these tables are as follows:
Table | Purpose | Row | Column Family | Column Qualifier | Value |
---|---|---|---|---|---|
Shard | Fields | ShardId (yyyymmdd_#) | Datatype \0 UID | Fieldname \0 Fieldvalue | n/a |
Field Index | ShardId | fi \0 Fieldname | Fieldvalue’ \0 Datatype \0 UID | n/a | |
Term Frequencies | ShardId | tf | Datatype \0 UID \0 Fieldname | Term Frequencies (protobuf) | |
Document | ShardId | d | Datatype \0 UID \0 Viewname | Document content | |
ShardIndex | Global Index | Fieldvalue’ | Fieldname | Shardid \0 Datatype | UID List (up to 20) |
Shard Reverse Index | Global Index | Reverse-Fieldvalue’ | Fieldname | Shardid \0 Datatype | UID List (up to 20) |
DatawaveMetadata | Metadata | Fieldname | MetadataType | Metadata | Metadata |
The basic components that uniquely identify a document is the ShardId + Datatype + UID. So the goal of a query is to derive those components and match all of the fields in the document. A query is composed of a JEXL query or LUCENE like query, a date range, the query logic name, and potentially other query parameters. The general process for the main query logic (EventQuery) is as follows:
- Fieldvalue’ + Fieldname → Shardid + Datatype + UID*
- Note that we may not have a UID if the UID list is empty because there are more than 20 documents with the same fieldname, fieldvalue’, shardid, and datatype.
- If we do not have a UID, then scan the field index
- Shardid + Datatype + Fieldvalue’ + Fieldname → UID
- Shardid + Datatype + UID → Fieldnames + Fieldvalues
- Fieldnames + Fieldvalues + Query → match (or not)
The first thing that happens is query planning which means turning the query into one that can be run against the index is composed of these main steps:
- apply query model (found in DatawaveMetadata table)
- FIELD=value → (FIELD1==value || FIELD2==value)
- expand functions
- function() → (function() && (index query))
- unfielded expansion
- value → (FIELD1==value || FIELD2==value)
- regex → (FIELD1==value1 || FIELD2==value2)
- regex → ((ExceededValueThreshold=true) && (FIELD1=~regex)) (NOTE: This is the query form that results in an Ivarator. This happens when the number of values matching the regex exceeds a configured threshold.)
- normalize
- FIELD=value → FIELD=value’
- pushdown
- FIELD=value → ((ASTDelayed=true) && (FIELD==value))
- FIELD=~regex → ((ASTDelayed=true) && (FIELD=~regex))
- expand regex/ranges
- FIELD=~regex → (FIELD==value1 || FIELD==value2)
- (FIELD>value1 && FIELD<value2) → (FIELD==value1 || FIELD==value2)
- FIELD=~regex → ((ExceededValueThreshold=true) && (FIELD=~regex))
- (FIELD>value1 && FIELD<value2) → ((ExceededValueThreshold=true) && (FIELD>value1 && FIELD<value2))
- pullup
- ((ASTDelayed=true) && (FIELD==value)) → FIELD==value
- ((ASTDelayed=true) && (FIELD=~regex)) → FIELD=~regex
- expand regex/ranges (same as above)
For all of the indexed fieldname=fieldvalue’ terms in the query which are not in a ASTDelayed or ExceededValueThreshold expression, scan the global index. Given the structure of the global index, this will get us shardid+datatypes (shard range) and optionally shardid+datatype+UID (document range). The boolean structure of those terms are used to determine the final ranges to scan in the shard table.
Given the following query:
(FIELD1==value1 && FIELD2==value2) || FIELD3==value3
This will result in three scans:
- FIELD1==value1 → shard1+dt1, shard1+dt2, shard2+dt1+uid1, shard2+dt1+uid1
- FIELD2==value2 → shard1+dt1, shard2+dt1
- FIELD3==value3 → shard1+dt1, shard3+dt1+uid3
If we intersect the first two terms (because of the && boolean) we get:
- shard1+dt1, shard2+dt1+uid1, shard2+dt1+uid1
Then if we union with the third term (because of the || boolean) we get:
- shard1+dt1, shard2+dt1+uid1, shard2+dt1+uid1, shard3+dt1+uid3
So this gives us one shard range and three document ranges. The shard range will require going to the field index to get the UIDs, but the document ranges can go directly to pulling out the documents for evaluation.
Scanning the field index is basically identical to scanning the global index except that in this case we can handle the ExceededValueThreshold expressions (Ivarator). Given the following query:
(FIELD1==value1 && FIELD2==value2) || ((ExceededValueThreshold=true) && (FIELD3=~value3))
This will result in three scans which result in UID lists in sorted order:
- FIELD1==value1 → uid1, uid2, uid3
- FIELD2==value2 → uid1, uid3
- FIELD3==value3 → uid2, uid4, uid5 (see Ivarator section)
If we intersect the first two terms (because of the && boolean) we get:
- uid1, uid3
Then if we union with the third term (because of the || boolean) we get:
- uid1, uid2, uid3, uid4, uid5
Since we are already scanning the shard, we can go directly to those documents within the same iterator/scan for the next step
The reason an Ivarator is required to scan the field index is that when the expression can match multiple values, then the UIDs being pulled out of the field index will not be in sorted order given the structure of the field index. In order to do the intersections and unions as required by the boolean logic, each of the streams of uids being pulled out must be in sorted order. The field index is constructed as follows:
shardId | fi \0 Fieldname | Fieldvalue’ \0 Datatype \0 UID |
---|
So the scan of the field index may result in the following stream of UIDs: uid2, uid5, uid3, uid10, … These uids are pulled into a buffer and are sorted. As the buffer is filled up with uids, they are stored in separate files either on disk (i.e. local) or in hdfs. Note that this all has to be done before the first union or intersection is performed. Once the scan is complete, then the files are read back out doing a merge sort in the process:
- file1: uid1, uid5, uid3, uid10, uid2
- file2: uid3, uid8, uid9, uid13
- final merge sort: uid1, uid2, uid3, uid5, uid8, uid9, uid10, uid13
Given the shardid, datatype, and uid we can directly scan for all of the fieldname, fieldvalue pairs that correspond to the document. Note that if we scanned the field index first then the document will already be filled with the “index only” fieldname, fieldvalue pairs. If we did not scan the field index first and we have “index only “ fields in the query, then we will pull those values directly out of the field index as well. Everything exists in the same shard, and hence is located in the same table so we can bounce between the field index and the document in the same scan. In then end we will have a set of fieldname, fieldvalue pairs that can be used for the final evaluation
Evaluating the document is running the full set of boolean logic, functions, etc against the set of fieldname, fieldvalue pairs. In the end we will either have a match or we will not. If we have a match, then the document as a set of fieldname, fieldvalue pairs are returned back to the client.