
pyqtree.Index.intersect returns duplicate ids after serialization #18

Open
Erotemic opened this issue Jun 14, 2019 · 2 comments
@Erotemic
I've encountered what seems to be a serialization bug. If I create a dummy Index with some random boxes and perform a query, everything works fine. However, if I serialize and deserialize the Index with pickle, it starts to return duplicate node ids. This isn't a huge problem, because I can just remove the duplicates, but it may be indicative of a serialization issue that should be fixed.

I've constructed a minimal working example and tested a few variations. First here is the MWE that demonstrates the problem:

import numpy as np
import pyqtree

# Populate a qtree with a set of random boxes
aid_to_tlbr = {779: np.array([412, 404, 488, 455]),
               781: np.array([127, 429, 194, 517]),
               782: np.array([459, 282, 517, 364]),
               784: np.array([404, 160, 496, 219]),
               785: np.array([336, 178, 367, 209]),
               786: np.array([366, 459, 451, 527]),
               788: np.array([491, 434, 532, 504]),
               789: np.array([251, 185, 322, 248]),
               790: np.array([266, 104, 387, 162]),
               791: np.array([ 65, 296, 138, 330]),
               792: np.array([331, 241, 368, 347])}
orig_qtree = pyqtree.Index((0, 0, 600, 600))
for aid, tlbr in aid_to_tlbr.items():
    orig_qtree.insert(aid, tlbr)

# Issue a query and inspect results
query = np.array([0, 0, 300, 300])
original_result = orig_qtree.intersect(query)

# We see that everything looks fine
print('original_result = {!r}'.format(sorted(original_result)))

# Serialize and unserialize the Index, and inspect results
import pickle
serial = pickle.dumps(orig_qtree)
new_qtree = pickle.loads(serial)

# Issue the same query on the reloaded Index, the result now
# contains duplicate items!!
new_result = new_qtree.intersect(query)
print('new_result = {!r}'.format(sorted(new_result)))

This results in the following output:

original_result = [789, 790, 791]
new_result = [789, 789, 790, 790, 791, 791]

As you can see, the new result contains duplicate node ids.


This bug has some other interesting properties. First, serializing a second time doesn't make anything worse, so that's good.

        third_qtree = pickle.loads(pickle.dumps(new_qtree))
        third_result = third_qtree.intersect(query)
        print('third_result = {!r}'.format(sorted(third_result)))

The output is the same as new_result.

third_result = [789, 789, 790, 790, 791, 791]

Something really weird is that the specific node ids seem to affect whether this bug happens. If I reindex the nodes to use 0-10 instead of the 700-ish numbers, the problem goes away!

        aid_to_tlbr = {0: np.array([412, 404, 488, 455]),
                       1: np.array([127, 429, 194, 517]),
                       2: np.array([459, 282, 517, 364]),
                       3: np.array([404, 160, 496, 219]),
                       4: np.array([336, 178, 367, 209]),
                       5: np.array([366, 459, 451, 527]),
                       6: np.array([491, 434, 532, 504]),
                       7: np.array([251, 185, 322, 248]),
                       8: np.array([266, 104, 387, 162]),
                       9: np.array([ 65, 296, 138, 330]),
                       10: np.array([331, 241, 368, 347])}
        qtree3 = pyqtree.Index((0, 0, 600, 600))
        for aid, tlbr in aid_to_tlbr.items():
            qtree3.insert(aid, tlbr)
        query = np.array([0, 0, 300, 300])
        result3 = qtree3.intersect(query)
        print('result3 = {!r}'.format(sorted(result3)))
        qtree4 = pickle.loads(pickle.dumps(qtree3))
        result4 = qtree4.intersect(query)
        print('result4 = {!r}'.format(sorted(result4)))

Results in:

result3 = [7, 8, 9]
result4 = [7, 8, 9]

Results were obtained with pyqtree.version = '1.0.0' and Python 3.6 on Ubuntu 18.04.

@karimbahgat
Owner

Thanks for raising this issue. I hadn't realized it was possible to serialize the index, which is indeed useful.

After some digging, the issue arises because one item can have a bbox that spans multiple subtrees. To avoid returning the same item multiple times when the query region also spans those subtrees, the code checks for duplicates, and does so with the id() of each item. In the original tree, when an item is placed in multiple subtrees, all the references point to the same object and thus share the same id(). When the tree is recreated, however, there is no way of knowing that the items were originally the same instance.
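The failure mode can be reproduced without pyqtree at all. Below is a minimal standalone sketch; dedup_by_id is a hypothetical stand-in for the internal identity-based duplicate check, not pyqtree's actual code:

```python
import pickle

def dedup_by_id(trees):
    """Collapse duplicates across subtrees by object identity (id())."""
    seen, out = set(), []
    for tree in trees:
        for obj in tree:
            if id(obj) not in seen:
                seen.add(id(obj))
                out.append(obj)
    return out

# One item whose bbox "spans" two subtrees: both lists hold the same object.
item = 100123210
subtrees = [[item], [item]]
print(dedup_by_id(subtrees))  # [100123210] -- shared identity collapses them

# pickle does not memoize ints, so on CPython each occurrence is restored
# as a new object and the id()-based check no longer detects the duplicate.
restored = pickle.loads(pickle.dumps(subtrees))
print(dedup_by_id(restored))  # [100123210, 100123210]
```

The second print shows exactly the duplicate-ids symptom from the original report, with no quadtree involved.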

The solution here boils down to a choice:
A) Should items be considered unique by value, i.e. the number 191, or any object that compares equal to another, is considered the same item? This would be fine if the inserted item is an id of sorts, but may behave unexpectedly if the inserted item is some other object that compares equal to another, for instance a geometry with the same coordinates as another.
B) The current situation is that items of identical value are allowed as separate entries; only if they truly are the same object via id() are they considered the same. This necessarily leads to the unexpected pickling behavior you have noticed.

Rtree seems to take a stance here, but I'm not sure what they mean:

Entries that are inserted into the index are not unique in either the sense of the id or of the bounding box that is inserted with index entries. If you need to maintain uniqueness, you need to manage that before inserting entries into the Rtree.

Rtree also seems to distinguish between inserting an "id", and an "object", whereas pyqtree only inserts an "item" which I guess could be either. Maybe pyqtree needs to make that distinction too or just choose one. Feedback is welcome.

@Erotemic
Author

does so with the id() of each item

That makes a lot of sense and gives me a much better understanding of this issue. Thanks for looking into this.

The unfortunately named id function is only meant to encode identity during the lifetime of an object (e.g. via its memory address in CPython), so it can lead to some really funky behavior:

>>> for i in range(3):
...     print(id(pickle.loads(pickle.dumps(100123210))))
140302921669040
140302921670416
140302921669040

The id of the integer changes because a new instance of it is created in CPython memory each time. Note that this behavior won't show up for small integers like 1, 2, 3, which are cached by the CPython implementation, which explains why this issue is so easy to miss. Regardless of the direction we choose to take, I think it's a mistake to keep the current behavior for this reason.
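The small-integer caching effect is easy to check directly. This is a quick illustration of the CPython-specific behavior (an implementation detail, not guaranteed by the language):

```python
import pickle

small = 10          # within CPython's small-int cache (roughly -5..256)
big = 100123210     # outside the cache

# The unpickled small int is the very same cached object...
print(pickle.loads(pickle.dumps(small)) is small)  # True on CPython

# ...while the large int comes back as a brand-new object.
print(pickle.loads(pickle.dumps(big)) is big)      # False on CPython
```

This is precisely why reindexing the nodes to 0-10 made the bug disappear: those ids survive the pickle round trip as the same cached objects.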


There does seem to be a design decision that needs to be made. Is the root Index object list-like or set-like? Currently it is neither, but it seems as if the API is more set-like than list-like. The following example demonstrates this.

import numpy as np
import pyqtree
orig_qtree = pyqtree.Index((0, 0, 600, 600))
tlbr = (10, 10, 20, 20)
aid = 1
orig_qtree.insert(aid, tlbr)
orig_qtree.insert(aid, tlbr)
orig_qtree.intersect(tlbr)

This example inserts the item 1 twice, but intersect only returns [1], not [1, 1]. Also, from an intuitive perspective, I would expect a spatial index API to require a uniquely valued item for each entry in the data structure.

We can push the previous example a bit further by inserting the "same" item twice with two disjoint bounding boxes:

import numpy as np
import pyqtree
orig_qtree = pyqtree.Index((0, 0, 600, 600))
tlbr1 = (10, 10, 20, 20)
tlbr2 = (40, 40, 80, 80)
aid = 1
orig_qtree.insert(aid, tlbr1)
orig_qtree.insert(aid, tlbr2)
orig_qtree.intersect((0, 0, 600, 600))

This example still returns [1], which would be expected from a set-like implementation, but not from a list-like one, in which case I would expect [1, 1] to be returned.

Thus, based on the existing behavior and my initial biases, I'm more in favor of (A): items should be considered unique by value. This could be implemented simply by removing the call to id() and letting the uniq set handle uniqueness via each item's __hash__ and __eq__ methods, rather than via object identity.
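A standalone sketch of what value-based deduplication would look like (dedup_by_value is hypothetical, not pyqtree code). Unlike the id()-based check, it gives the same answer before and after a pickle round trip:

```python
import pickle

def dedup_by_value(trees):
    """Collapse duplicates across subtrees by hash/equality, not identity."""
    seen, out = set(), []
    for tree in trees:
        for obj in tree:
            if obj not in seen:
                seen.add(obj)
                out.append(obj)
    return out

item = 100123210
subtrees = [[item], [item]]
restored = pickle.loads(pickle.dumps(subtrees))
# Object identity was lost in the round trip, but equal values still collapse.
print(dedup_by_value(restored))  # [100123210]
```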

But this does break some things. Unhashable items like dicts and sets could no longer be inserted into the tree, although I doubt there are many existing use cases where this really matters. Also, users who rely on the old functionality could work around this change by switching from insert(item, bbox) to insert(id(item), bbox) to get the old behavior. They would just have to be careful to use id(x) everywhere they previously used x.

The alternative is to abandon the existing mostly set-like behavior and try to achieve list-like behavior, but IMHO that would be less useful, and I'm not sure the changes would be as straightforward.

If you agree that the API should be modified as per (A), there is one more change you may consider making. If items are unique, the bbox argument in the current remove(self, item, bbox) API is not strictly necessary (although it is useful if you only want to remove a box if it falls within a region). What you could do instead is maintain a top-level registry (i.e. a dictionary) of all items that have been inserted into the tree. This would also make it easy for users to inspect the contents of the data structure without recursing into it, though it does add a small memory overhead.

I outlined the changes here:

class Index(_QuadTree):
    def __init__(self, bbox=None, x=None, y=None, width=None, height=None, max_items=MAX_ITEMS, max_depth=MAX_DEPTH):
        # Maintain a top-level registry of all existing items.
        self._items = {}
        if bbox is not None:
            x1, y1, x2, y2 = bbox
            width, height = abs(x2-x1), abs(y2-y1)
            midx, midy = x1+width/2.0, y1+height/2.0
            super(Index, self).__init__(midx, midy, width, height, max_items, max_depth)

        elif None not in (x, y, width, height):
            super(Index, self).__init__(x, y, width, height, max_items, max_depth)

        else:
            raise Exception("Either the bbox argument must be set, or the x, y, width, and height arguments must be set")

    def insert(self, item, bbox):
        self._items[item] = bbox
        self._insert(item, bbox)

    def remove(self, item, bbox=None):
        if bbox is None:
            bbox = self._items[item]
        self._items.pop(item)
        self._remove(item, bbox)

    def intersect(self, bbox):
        return self._intersect(bbox)

This might not be the correct final implementation, depending on what the expected behavior is for inserting an item that already exists or removing a non-existent item.
