Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the performance of IN operator for indexed properties #8698

Open
nalexn opened this issue Oct 16, 2024 · 13 comments
Open

Optimize the performance of IN operator for indexed properties #8698

nalexn opened this issue Oct 16, 2024 · 13 comments

Comments

@nalexn
Copy link

nalexn commented Oct 16, 2024

Problem

The issue was first reported over 6 years ago on stack overflow, but is still actual for latest realm versions.

When you use the predicate key IN %@ and supply a collection with a substantially large number of values (over a thousand), the filtering happens SO slowly that doing filtering outside the query element-by-element outperforms it by 50x, despite being inefficient on its own.

let values = Set(["value_1", "value_2", ..., "value_40000"])

// 1. Complete query with IN operator: over 25secs

let results = realm.objects(MyObject.self)
       .filter("key IN %@", values)
       // same for new syntax:
       // .where { $0.key.in(values) }

// 2. Filter outside realm: under 0.45secs

let results = realm.objects(MyObject.self)
      .filter { values.contains($0.key) }

The use case when this is needed:
We run a complex query on a large data set, which is done on a background thread so the UI isn't blocked. It passes a set of resulting objects' ids to the main thread, where we can quickly query objects by ids without blocking the UI.

Solution

It looks like the operator <key> IN <collection> is translated into <key> == value_1 OR <key> == value_2 OR ..., which works well on small sets of values, but quickly degrades when the collection is large enough.

I hope there is a way to optimize this part of the algorithm, at least when the supplied collection supports O(1) element presence check, like for Set<String>.

Alternatives

There is really no alternative to this API other than hacks like supplying a subset of the collection for values, or filtering manually after the query

How important is this improvement for you?

Would be a major improvement

Feature would mainly be used with

Local Database only

Copy link

sync-by-unito bot commented Oct 16, 2024

➤ PM Bot commented:

Jira ticket: RCOCOA-2443

@Jaycyn
Copy link

Jaycyn commented Oct 17, 2024

I am having difficulty reproducing this issue.

We created a PersonClass with a number of properties including a key property of type String

class PersonClass: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted var key = ""
     ....

We then have a function that creates 100,000 PersonClass with the key property populated with "person" + an index from 0 to 99,999. e.g. the Realm file will contain PersonClass objects with key values of person0, person1, person2....person99999.

We then use the code posted in your report, to query the key property for a random selection of 50,000 PersonClass objects. So values will be 50,000 array objects: [person12, person87453, person4533...]

func handleButton1() {
    guard let realm = gGetRealm() else { return }

    let num = 49999
    var values = [String]()
    for _ in 0...num {
        let randomNum = (0...num).randomElement() ?? num
        let v = "person\(randomNum)"
        values.append(v)
    }

    print("performing query")
    let startTime = Date()
    let results = realm.objects(PersonClass.self).where { $0.key.in(values) }
    print("Load completed of \(results.count)")
    let elapsed = Date().timeIntervalSince(startTime)
    print("elapsed time of \(elapsed)")
}

and the output to console is:

Load completed of 31621
elapsed time of 1.6968810558319092

In the above test, an in query using 50,000 values was performed on 100,000 objects in 0.096 seconds.

@nalexn
Copy link
Author

nalexn commented Oct 17, 2024

Ok, I'll see if I can hand you a better code snippet that repros this issue

@nalexn
Copy link
Author

nalexn commented Oct 22, 2024

@Jaycyn Ok, I found what was missing - the key property has to be indexed:

class PersonClass: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key = ""
}

this alone increases the query time from 0.24 seconds without index to 12.3 seconds when indexed.

But, if you add another indexed string property and query with && like so:

class PersonClass: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}

realm.objects(PersonClass.self).where {
      $0.key1.in(values) && $0.key2 == "<some_value>"
}

... the query takes several minutes to complete!

@Jaycyn
Copy link

Jaycyn commented Oct 22, 2024

The code I provided above has no indexing at all and as you can see completes in approximately 1.5s.

While describing the issue is important, being able to duplicate it is equally important - that will help eliminate other issues; environmental, additional code running etc.

Can you provide a minimal, duplicatable example?

@nalexn
Copy link
Author

nalexn commented Oct 22, 2024

@Jaycyn that's the duplicatable example - add indexing to your sample and the performance would be much worse

@nalexn nalexn changed the title Optimize the performance of IN operator Optimize the performance of IN operator for indexed properties Oct 22, 2024
@Jaycyn
Copy link

Jaycyn commented Oct 22, 2024

Hmm. I changed the key to being indexed

@Persisted(indexed: true) var key = ""

and the result is still an avg of 1.5s

@nalexn
Copy link
Author

nalexn commented Oct 22, 2024

Ok, I'll just copy-paste the entire sample so we don't miss anything:

// Object definition
class PersonClass: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}

let num = 49999

// Object generation
let objects = (0...num).map { index in
    let obj = PersonClass()
    obj._id = .generate()
    obj.key1 = "person\(index)"
    obj.key2 = "salt\(index % 3)"
    return obj
}
try! realm.write {
    realm.add(objects, update: .all)
}

var values = [String]()
for _ in 0...num {
    let randomNum = (0...num).randomElement() ?? num
    let v = "person\(randomNum)"
    values.append(v)
}

// Testing the query performance
if #available(iOS 16.0, *) {
    let elapsedTime = ContinuousClock().measure {
        print("performing query")
        let results = realm.objects(PersonClass.self).where {
            $0.key1.in(values) && $0.key2 == "salt0"
        }
        let total = realm.objects(PersonClass.self).count
        print("Load completed of \(results.count) objects (\(total) total)")
    }
    print("\(elapsedTime)")
}

@Jaycyn
Copy link

Jaycyn commented Oct 22, 2024

Well, that's a very odd thing; I was able to duplicate the issue using that code.

A few observations and more info:

The issue only occurs when the object has two indexed properties and a query is performed using .where on one property using .in, and the query also includes && on the other property querying for a specific string.

In other words, the query works perfect if it's one OR the other

let results = realm.objects(PersonClass.self).where { $0.key1.in(values) } //works fine: 1.5s average

let results = realm.objects(PersonClass.self).where { $0.key2 == "salt0"  } //works fine: .03s average

let results = realm.objects(PersonClass.self).where { $0.key1.in(values) && $0.key2 == "salt0" } //does NOT work fine

Even splitting it up and performing two separate queries does not work

let results1 = realm.objects(PersonClass.self).where { $0.key1.in(values)  } //does not work fine
let results2 = results1.where { $0.key2 = "salt0" }

Note after the fact: this is not resolved to "separate queries" Realm combines them into one, so it's just one NSPredicate

@Jaycyn
Copy link

Jaycyn commented Oct 22, 2024

Update that due to realm being lazy-loading, the query is not actually executed until the print statement, and the issue is duplicated leaving that in so this original post was inaccurate and edited.

--- below is incorrect ---
Take your code and comment out the print statement and run it

        let elapsedTime = ContinuousClock().measure {
            print("performing query")
            let results = realm.objects(PersonClass.self).where { $0.key.in(values) && $0.key2 == "salt0" }
            let total = realm.objects(PersonClass.self).count
                   //print("Load completed of \(results.count) objects (\(total) total)")  //commented
        }
        print("\(elapsedTime)")

This is the issue

print(results.count)

This code now works correctly

print("performing query")
let startTime = Date()
let results = realm.objects(PersonClass.self).where { $0.key.in(values) && $0.key2 == "salt0" }
let elapsed = Date().timeIntervalSince(startTime)
print("elapsed time of \(elapsed)")
performing query
elapsed time of 0.11012506484985352

@tgoyne
Copy link
Member

tgoyne commented Oct 22, 2024

We have an optimized lookup for IN and chained OR EQUALS that constructs a set and does a table scan, and an optimized lookup to find rows which have a specific value in the index. Which one is faster depends on the size of the table, the number of values in the IN, and what portion of the table matches the query, so we have some heuristics there. I'm guessing that adding the second property is switching from the table scan to the index lookup, which in this case is slower.

Without the print(results.count) the query is never actually run and the elapsed time is just the time it takes to construct a NSPredicate and turn that into our internal Query type. Doing .where { condition 1 }.where { condition2 } and .where { condition 1 && condition2 } do the same thing; chaining Results constructs a single compound query that is run rather than running the query on the results of the previous query.

@nalexn
Copy link
Author

nalexn commented Oct 22, 2024

I've run the query on these similar classes to find out how indexing is affecting the performance:

class OneKeyNoIndex: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: false) var key1 = ""
}

class OneKeyIndexed: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
}

class TwoKeysNoIndex: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: false) var key1 = ""
    @Persisted(indexed: false) var key2 = ""
}

class TwoKeysIndexed: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}

class TwoKeysMixIndex1: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: false) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}

class TwoKeysMixIndex2: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: false) var key2 = ""
}

And here are the times:

OneKeyNoIndex 0.238098 seconds
OneKeyIndexed 12.283518041999999 seconds
TwoKeysNoIndex 0.227876334 seconds
TwoKeysIndexed 122.770053084 seconds
TwoKeysMixIndex1 0.279000875 seconds
TwoKeysMixIndex2 66.472070625 seconds

Observations:

  1. Classes that have @Persisted(indexed: false) var key1 are fast, regardless of the conditions
  2. Classes that have @Persisted(indexed: true) var key1 are from 50 to 500 times slower
  3. Comparing TwoKeysIndexed (both keys indexed) and TwoKeysMixIndex2 (key 2 not indexed) shows that indexing the second property slows down the query by 2x, even though it's not participating in the IN operation.

There is clearly an issue with indexing, it dramatically slows things down!

@Jaycyn
Copy link

Jaycyn commented Oct 23, 2024

Without the print(results.count) the query is never actually run

Ah, ah of course @tgoyne. We've not being using Realm as we migrate to another platform and forgot about how lazy it is! A great feature that really nobody else offers.

However, that doesn't appear to account for both queries being very fast on their own, but not together... or does it?

So on this

an optimized lookup to find rows which have a specific value in the index....the second property is switching from the table scan to the index lookup, which in this case is slower.

Our testing shows these two queries are roughly equivalent in performance. Running each query over 10 iterations has just minor difference: 1.72s average vs .03s average

let results = realm.objects(PersonClass.self).where { $0.key1.in(values) } //works fine

let results = realm.objects(PersonClass.self).where { $0.key2 == "salt0"  } //works fine

The first being a .in query with a table scan and the second being an index lookup. Combined together with an && turns in 42 seconds, per the original post.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants