Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queries getting slower during progress #45

Open
juliandehne opened this issue Nov 1, 2021 · 13 comments
Open

Queries getting slower during progress #45

juliandehne opened this issue Nov 1, 2021 · 13 comments
Assignees
Labels
enhancement New feature or request

Comments

@juliandehne
Copy link

juliandehne commented Nov 1, 2021

Thank you for providing this nice library. I am using it in a scientific project for twitter analysis. However, the db inserts seem to slow down a lot after a couple of days running it in production mode:

[treenode] update delab.models.Tweet tree: executed 0 queries in 71.44206510693766s.
[treenode] update delab.models.Tweet tree: executed 0 queries in 71.87405565101653s.
[treenode] update delab.models.Tweet tree: executed 0 queries in 66.6588648010511s.
[treenode] update delab.models.Tweet tree: executed 0 queries in 71.47152532404289s.
[treenode] update delab.models.Tweet tree: executed 0 queries in 79.63660701399203s.

I opened the issue also within my project, if you are interested in the way the library is used:
juliandehne/delab#15

Probably, I will write a unit test to verify it is an issue with the treenode library.

Any ideas?

Fund with Polar
@fabiocaccamo
Copy link
Owner

@juliandehne how many nodes are there in the tree?

@juliandehne
Copy link
Author

Up to 2000 at times. With maybe 50 trees.

@MichaelAkvo
Copy link

MichaelAkvo commented Nov 24, 2021

You might be running into the same problem I'm running to: nodes without parents are considered as siblings --> each time you add a new node without a parent, all siblings will have to be updated --> O(n) runtime and memory+DB usage.

This is also creating a problem on existing DBs that start using django-treenode with many objects (we have 9k rows and it takes 115s to set_parent())

@juliandehne could you run Tweet.objects.filter(tn_parent__isnull=True).count() and len(Tweet.objects.filter(tn_parent__isnull=True).first().get_siblings_count) ?

@fabiocaccamo
Copy link
Owner

fabiocaccamo commented Nov 24, 2021

@MichaelAkvo thank you for pointing to the possible cause.

  • Do you think that the root nodes should not be considered as siblings?
  • Do you think that avoiding to consider the root nodes as siblings would fix the problem?
  • Could you do a quick test by changing these lines with the following ones (there is just a new condition on top)?
# update siblings
if obj_data['tn_parent_pk']:
    siblings_parent_key = str(obj_data['tn_parent_pk'])
    obj_data['tn_siblings_pks'] = list(
        objs_pks_by_parent.get(siblings_parent_key, []))
    obj_data['tn_siblings_pks'].remove(obj_data['pk'])
    obj_data['tn_siblings_count'] = len(obj_data['tn_siblings_pks'])

The tree is automatically updated at the end of the set_parent method, if you need set multiple parents at once you should avoid to use set_parent, instead you should set tn_parent field manually and call YourModel.update_tree() when you have set all parents you need.
If this helps you, I could consider to add an option to the set_parent method to skip the (expensive) update_tree call.

@MichaelAkvo
Copy link

MichaelAkvo commented Nov 25, 2021

Hi @fabiocaccamo,

Do you think that the root nodes should not be considered as siblings?

I'm a little split because it depends on the point of view. Personally, I do not consider them siblings.

Do you think that avoiding to consider the root nodes as siblings would fix the problem?

Could you do a quick test by changing these lines with the following ones (there is just a new condition on top)?

I tested it and it fixed part of the performance issue I was facing, however I still had to turn off signals and call .update_tree() at the end. Inserting new root nodes however, still faces an O(n) problem. With some test data and a glance at the queries, inserting a new node still generates a slew of 'UPDATE "tests_category" SET "tn_order" = 1, "tn_index" = 1 WHERE "tests_category"."id" = 101' queries

image

That was after inserting the 101rst node.

Maybe a bulk_update is needed somewhere to update many rows at once? It might even allow you to keep the root node siblings (even though imo they shouldn't be siblings) nvm, bulk_update(...) does the same loop as `.update_tree(). Unfortunate...

@fabiocaccamo
Copy link
Owner

@MichaelAkvo thank you for the great feedback.

From a logical point of view neither me I consider them siblings, but let's imagine the scenario where root nodes represents pages or main menu entries, aren't them siblings?


For turning off signals there is already a context manager, but I've not documented it in the README...

from treenode.signals import no_signals

with no_signals():
    # do stuff
    pass

YourModel.update_tree()

Concerning the O(n) problem when inserting a root I don't know how we could avoid it.


I think that these could be some improvements:

  • Change setParent method (add possibility to do it without calling update_tree).
  • Adding setRoot method (with possibility to do it without calling update_tree).

@MichaelAkvo
Copy link

MichaelAkvo commented Nov 25, 2021

Concerning menus, if there were a Menu model, then personally I would create a "main menu" tree. That would allow me to also create other menus e.g "side bar menu", "footer menu", etc . They wouldn't be siblings, imo.

However, I noticed that even removing siblings from root nodes (which should still be done, imo) the problem would persist for wide trees / tree levels with many siblings.
Maybe a bulk_update is needed here. Django will then make use of the WHEN clause to send of batches of updates.


It looks like tn_order and tn_index are frequently recalculated triggering updates to the DB. Therefore, a few questions/suggestions:

  • Is tn_index supposed to be the "tree index"? If so, could it be transformed into tn_tree_id aka a unique tree identifier? Unfortunately BigAutoField won't work, but maybe a UUID or other something time-based could work.

  • I'm not entirely sure why tn_order is necessary. Could ordering instead be done with ['tn_index', 'tn_level', 'pk'] ?

With these changes, hopefully update_tree will become much less expensive. Additionally having a param to disable update_tree does seem like a good option right now.

@fabiocaccamo
Copy link
Owner

I agree with you regarding the menu scenario, however this would be surely a breaking change.


  • tn_index is the child index relative to its parent, it is not a unique value / identifier.
  • tn_order is a value that is computed for doing query ordering on a single field and without joins.

If you are interested, there is already a similar discussion on tn_index / tn_order on #34

@fabiocaccamo fabiocaccamo added the enhancement New feature or request label Dec 7, 2021
@fabiocaccamo
Copy link
Owner

@juliandehne @MichaelAkvo have you any news on this?

@MichaelAkvo
Copy link

Hi @fabiocaccamo,

Unfortunately there wasn't enough time to continue looking into this, so I went with a postgresql ltree solution. My branch still exists, so if you want to, you can use what you deem useful.

Basically it uses bulk_update call in update_tree and gets rid of tn_index and tn_order as they are the most frequently updated fields. They are constantly recalculated for some reason and keep generating DB requests (maybe a bug?).

Lots of tests are broken though since the ordering is now different.

Modifying a tree should also update all nodes in the tree because the ancestors and descendants are kept track of in all nodes of that tree - however that doesn't seem to be happening, so either it's a bug or I'm misunderstanding the code.

In any case, good luck with the performance improvements.

@fabiocaccamo
Copy link
Owner

@MichaelAkvo thank you very much for the feedback!

@fabiocaccamo fabiocaccamo self-assigned this Mar 15, 2023
@fabiocaccamo fabiocaccamo moved this to Todo in Open Source Mar 15, 2023
@Nathan-Cohen
Copy link
Contributor

Nathan-Cohen commented Nov 3, 2023

Hi @fabiocaccamo,
I think I have a similar problem, the following function calls all nodes to update the cache when I do a node.get_ancestors() or as soon as there is a signal for example :
def update_cache(cls):
ls, d = _get_cached_collections()
ls[cls] = list(cls.objects.all()) # here
d[cls] = {str(obj.pk): obj for obj in ls[cls]}
_set_cached_collections(ls, d)

it should only call nodes linked to the same tree, to avoid loading taking too long.

The other thing you'd need to do is disable the cache so you don't have to update it anymore, to avoid overloading the database when you do a simple node.get_ancestors() or whatever.

Do you have any workarounds for this problem?

@fabiocaccamo
Copy link
Owner

@Nathan-Cohen I understand the problem and I like the idea of implementing an update_branch instance method that doesn't update the whole tree when a node changes, this is a change that requires a good amount of time/work, if you want to work on a PR (that passes current tests) I will be very glad to review and merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Todo
Development

No branches or pull requests

4 participants