pyqtree.Index.intersect returns duplicate ids after serialization #18
Thanks for raising this issue. I hadn't realized it was possible to serialize the index, which is indeed useful.
After some digging, the issue arises because one item can have a bbox that spans multiple subtrees. To avoid returning the same item multiple times when the query region also spans those same subtrees, the code checks for duplicates, and does so with the id() of each item. In the original tree, when an item is placed in multiple subtrees, every entry points to the same object and thus has the same id(). When the tree is recreated, however, it has no way of knowing that the items were originally the same instance.
The solution here boils down to a choice: Rtree seems to take a stance here, but not sure what they mean:
Rtree also seems to distinguish between inserting an "id" and an "object", whereas pyqtree only inserts an "item", which I guess could be either. Maybe pyqtree needs to make that distinction too, or just choose one. Feedback is welcome.
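To make the mechanism above concrete, here is a minimal sketch that does not use pyqtree at all. The two-leaf list and the `intersect_by_identity` helper are hypothetical stand-ins for the subtrees and the id()-based duplicate check, so treat this as an illustration of the idea rather than the library's actual code.

```python
import pickle


def intersect_by_identity(leaves):
    """Collect items from several leaves, skipping objects already seen by id()."""
    seen = set()
    results = []
    for leaf in leaves:
        for item in leaf:
            if id(item) not in seen:
                seen.add(id(item))
                results.append(item)
    return results


item = 100123210
# One item whose bbox spans two subtrees: both leaves hold the same object.
leaves = [[item], [item]]
# pickle does not memoize ints, so each leaf may get its own int object back.
restored = pickle.loads(pickle.dumps(leaves))

print(intersect_by_identity(leaves))    # [100123210]
print(intersect_by_identity(restored))  # may list the item twice on CPython
```

Whether the second call reports a duplicate is an implementation detail of the interpreter, which fits the observation that the bug is easy to miss.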
That makes a lot of sense and gives me a much better understanding of this issue. Thanks for looking into this. The unfortunately named
>>> import pickle
>>> for i in range(3):
...     print(id(pickle.loads(pickle.dumps(100123210))))
...
140302921669040
140302921670416
140302921669040
The id of the integer changes because a new instance of it is created in CPython memory. Note that this behavior won't show up for small integers like 1, 2, 3, which have special handling in the CPython implementation; this explains why the issue is so easy to miss. Regardless of the direction we choose to take, I think it's a mistake to keep the current behavior for this reason.
There does seem to be a design decision that needs to be made. Is the root
import numpy as np
import pyqtree
orig_qtree = pyqtree.Index((0, 0, 600, 600))
tlbr = (10, 10, 20, 20)
aid = 1
orig_qtree.insert(aid, tlbr)
orig_qtree.insert(aid, tlbr)
orig_qtree.intersect(tlbr)
This example adds the item
We can push the previous example a bit further by inserting the "same" item twice with different disjoint bounding boxes:
import numpy as np
import pyqtree
orig_qtree = pyqtree.Index((0, 0, 600, 600))
tlbr1 = (10, 10, 20, 20)
tlbr2 = (40, 40, 80, 80)
aid = 1
orig_qtree.insert(aid, tlbr1)
orig_qtree.insert(aid, tlbr2)
orig_qtree.intersect((0, 0, 600, 600))
This example still returns [1], which would be expected from a
Thus, based on existing behavior and my initial biases, I'm more in favor of (A). I think that items should be considered unique. This could be easily implemented by just removing the
But this does cause some things to break. Unhashable items like
The alternative is to abandon the existing mostly
If you agree that the API should be modified as per (A), there is one more change you may consider making. If items are unique, the
I outlined the changes here:
class Index(_QuadTree):
    def __init__(self, bbox=None, x=None, y=None, width=None, height=None, max_items=MAX_ITEMS, max_depth=MAX_DEPTH):
        # Maintain a top-level registry of all existing items.
        self._items = {}
        if bbox is not None:
            x1, y1, x2, y2 = bbox
            width, height = abs(x2-x1), abs(y2-y1)
            midx, midy = x1+width/2.0, y1+height/2.0
            super(Index, self).__init__(midx, midy, width, height, max_items, max_depth)
        elif None not in (x, y, width, height):
            super(Index, self).__init__(x, y, width, height, max_items, max_depth)
        else:
            raise Exception("Either the bbox argument must be set, or the x, y, width, and height arguments must be set")

    def insert(self, item, bbox):
        # Record the bbox so the item can later be removed without it.
        self._items[item] = bbox
        self._insert(item, bbox)

    def remove(self, item, bbox=None):
        # Fall back to the registered bbox when none is given.
        if bbox is None:
            bbox = self._items[item]
        self._items.pop(item)
        self._remove(item, bbox)

    def intersect(self, bbox):
        return self._intersect(bbox)
This might not be the correct final implementation; that depends on the expected behavior when adding an item that already exists or removing one that does not.
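To make that open question concrete, here is a small self-contained sketch of one possible policy. `ItemRegistry` is a hypothetical stand-in for the `_items` dict in the outline above, not part of pyqtree, and the choice to raise on duplicate inserts and on missing removals is only an assumption for illustration.

```python
class ItemRegistry:
    """Hypothetical stand-in for a top-level item -> bbox registry."""

    def __init__(self):
        self._items = {}

    def insert(self, item, bbox):
        # Assumed policy: inserting an item that already exists is an error.
        # Unhashable items raise TypeError here, before anything is stored.
        if item in self._items:
            raise KeyError("item %r is already in the index" % (item,))
        self._items[item] = bbox

    def remove(self, item):
        # Assumed policy: removing an unknown item raises KeyError.
        return self._items.pop(item)


registry = ItemRegistry()
registry.insert(1, (10, 10, 20, 20))

try:
    registry.insert(1, (40, 40, 80, 80))
except KeyError as exc:
    print("rejected duplicate:", exc)

try:
    registry.insert([1, 2], (0, 0, 5, 5))
except TypeError as exc:
    print("rejected unhashable item:", exc)
```

The other obvious policy would be to let a second insert silently overwrite the stored bbox, which is what the plain dict assignment in the outline does; which of the two is preferable is exactly the design decision being discussed.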
I've encountered what seems to be a serialization bug. If I create a dummy Index with some random boxes and perform a query, everything works fine. However, if I use pickle to serialize and de-serialize the Index, it starts to return duplicate node ids. This isn't a huge problem because I can just remove duplicates, but it may be indicative of a serialization issue that should be fixed.
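For reference, the workaround of removing duplicates can be sketched as below. It assumes the returned ids are hashable, and the ids in the example are made up for illustration rather than real pyqtree output.

```python
def unique_in_order(results):
    """Drop repeated ids while keeping the first occurrence of each."""
    return list(dict.fromkeys(results))


new_result = [712, 705, 712, 731, 705]  # made-up ids, not real output
print(unique_in_order(new_result))  # [712, 705, 731]
```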
I've constructed a minimal working example and tested a few variations. First here is the MWE that demonstrates the problem:
This results in the following output:
As you can see the new result has duplicate node ids.
This bug has some other interesting properties. First, serializing a second time doesn't make anything worse, so that's good.
The output is the same as new_result.
Something really weird is that the specific node ids seem to affect whether this bug happens. If I reindex the nodes to use 0-10 instead of the 700ish numbers, the problem goes away!
Results in:
Results were obtained using pyqtree.version = '1.0.0' and Python 3.6 on Ubuntu 18.04.
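The dependence on the id values is consistent with how CPython treats small integers, as discussed in the comments. The sketch below checks whether two independently unpickled copies of a value end up as the same object. The helper name is hypothetical, and the result is an interpreter implementation detail, so other Python implementations may behave differently.

```python
import pickle


def shares_identity_after_roundtrip(value):
    """Return True if two separately unpickled copies are the same object."""
    first = pickle.loads(pickle.dumps(value))
    second = pickle.loads(pickle.dumps(value))
    return first is second


for value in (3, 100123210):
    print(value, shares_identity_after_roundtrip(value))
```

On CPython I would expect the small value to keep its identity and the large one not to, which matches the bug disappearing when the node ids are reindexed to 0-10.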