Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pattern acceleration #87

Merged
merged 20 commits into from
Sep 30, 2024
Merged

Pattern acceleration #87

merged 20 commits into from
Sep 30, 2024

Conversation

elliVM
Copy link
Contributor

@elliVM elliVM commented Sep 12, 2024

Pattern acceleration feature to activate bloom filtering on set regex pattern. Goal is to limit bloom filtering to certain patterns like UUID.

Flow:

  • Walker walks XML and for each search term regex checks tokenized term against tables in bloomdb excluding the filtertype table
  • Collect list of tables with a pattern match record from SQL metadata
  • For each table a temp table is created
  • For each temp table generate condition that compares temp table filter against parent table filter using bloommatch UDF
  • Combine conditions from each table, if all filters were null then also true
  • Combine conditions from each search term
  • Left join tables to NestedTopNQuery and use the final combined condition in query

Notes: Tokenizer max token count is set to 0 to get only major tokens since that is what dpf_03 currently uses to tokenize the bloom filter tables

@kortemik kortemik requested a review from eemhu September 12, 2024 08:05
@eemhu eemhu requested a review from kortemik September 12, 2024 12:52
}

public ConditionConfig(DSLContext ctx, boolean streamQuery, boolean bloomEnabled, boolean withoutFilters) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this removing the feature that allows to search files without a filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option was not used in github repo. Added option with implementation and tests

* Inserts a filter for each filtertype inside parent table records. Filter is filled with search term token set,
* filter size is selected using parent table filtertype expected and fpp values.
*/
private void insertFilters() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please elaborate more what happens here, from the description it sounds like select all from specific filtertype into temp table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method was poorly named. After refactoring TableFilters is responsible of category tables filters and inserting all filter types. class TableFilterTypesFromMetadata is used to fetch the tables filter types from metadata

}
}

private void insertFilterFromRecord(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please elaborate more what happens here, it sounds like select filter from record into xyz

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored into method filterBytesFromRecord in class TableFilters that creates a bloom filter with correct size and tokens.

}

/**
* Generates a condition that returns true if this temp tables search term tokens might be contained in the parent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is a parent table in this context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parent table was the bloom filter table that the temp table was created to match, renamed as originTable in new class CategoryTableImpl


import static com.teragrep.pth_06.jooq.generated.bloomdb.Bloomdb.BLOOMDB;

public class PatternMatch {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the responsibility of objects in this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responsibility was to get bloom filter tables that had record with a filtertype pattern that matched with any of the search term tokens. Refactored into two classes PatternMatchTables thet gets the tables from metadata and PatternMatchCondition that generates the condition used to do the pattern matching

@elliVM
Copy link
Contributor Author

elliVM commented Sep 18, 2024

Refactoring

New classes:

Class Responsibility
PatternMatchTables Finds bloomdb Tables that match a pattern condition
CategoryTableImpl Temp table from a bloom filter table that can return a CategoryTableCondition
Created CategoryTable that is created to database
WithFilterTypes CategoryTable with its filters inserted
TableFilters Inserts filters of a CategoryTable
TableFilterTypesFromMetadata Fetches different filter types of a table from metadata
CategoryTableCondition Condition that compares category tables filter bytes against boom filter table filter bytes with bloommatch UDF, selects the same size and bloom term id
PatternMatchCondition Condition that check if any of given tokens match with bloomdb.filtertype.pattern

Other changes:

  • Many method naming changes
  • Added interfaces CategoryTable, TableRecords, BloomQueryCondition
  • Added missing withoutFilters option and implementation to IndexStatementCondition
  • Equality methods for all new classes
  • Tests for all new classes

@kortemik kortemik requested review from eemhu and kortemik September 18, 2024 17:11

/**
* Decorator that creates category table
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this object is not super clear to me, why not just call create() on the origin categoryTable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed decorators in refactoring and just use CategoryTable interface methods create() and insertFilters().

@@ -97,9 +97,9 @@ public Condition condition() {

for (final Table<?> table : tableSet) {
// create category temp table for this pattern match table
final CategoryTable categoryTable = new Created(
new WithFilterTypes(new CategoryTableImpl(config, table, tokenizedValue))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, perhaps the older way was better to reduce temporal coupling?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kortemik what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, just with better naming.

@@ -97,9 +97,9 @@ public Condition condition() {

for (final Table<?> table : tableSet) {
// create category temp table for this pattern match table
final CategoryTable categoryTable = new Created(
new WithFilterTypes(new CategoryTableImpl(config, table, tokenizedValue))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, just with better naming.

@elliVM elliVM requested a review from eemhu September 24, 2024 12:25
}
newCondition = config.withoutFilters() ? combinedNullFilterCondition : combinedTableCondition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as much as I like the ternary operator i don't think it is as clear as the if-statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to if-statement

…inserted to category table, some code cleanup
@elliVM
Copy link
Contributor Author

elliVM commented Sep 30, 2024

Enabled token entanglement in tokenizer

@elliVM elliVM requested a review from kortemik September 30, 2024 05:12
@kortemik kortemik merged commit c1fbd98 into teragrep:main Sep 30, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants