[idea] Thread-local incr/decr #28

Open
pepijndevos opened this issue Jun 6, 2016 · 3 comments

Comments


pepijndevos commented Jun 6, 2016

I just watched your PyCon talk about this project, and I had an idea to reduce cache misses due to atomic incr/decr.

What if you introduced a thread-local refcount and a thread-global thread-refcount?

The idea would be that the number of threads that access an object rarely changes. So if a thread wants to change the refcount it can do so locally and quickly, and only when its local refcount drops to zero does it need to do an atomic decr on the thread-refcount, freeing the object if that reaches zero. If it's non-zero, it's the job of the remaining threads to clean up once their local refcounts drop to zero.

As you mentioned, most objects are only ever used in one thread. So in addition to the above concept, which would still require 2 atomic operations at creation and destruction, the thread id of the object creator could be stored so that touching the thread-global counter can be deferred until another thread incr's the object, in which case it'd need to be set to 2.

[edit] Actually, thread-local storage in C works nothing like threading.local, which is implemented using a dict. That makes it a lot slower, I guess. There is probably a lot more I overlooked.
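
Roughly, the deferred owner-thread variant could look something like this (just a sketch with made-up names, using C11 atomics and threads; for brevity all non-owner threads are lumped into a single shared atomic slot instead of true per-thread counters, and this is not meant as gilectomy's actual implementation):

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <threads.h>

typedef struct obj {
    thrd_t      owner;         /* thread that created the object              */
    long        local_count;   /* refs held by the owner thread -- no atomics */
    atomic_long shared_count;  /* refs held by all other threads (simplified) */
    atomic_long thread_count;  /* how many "sides" still hold references      */
    void (*dealloc)(struct obj *);
} obj_t;

obj_t *obj_new(void (*dealloc)(obj_t *)) {
    obj_t *o = malloc(sizeof *o);
    if (!o) return NULL;
    o->owner = thrd_current();
    o->local_count = 1;                 /* the creator's reference             */
    atomic_init(&o->shared_count, 0);
    atomic_init(&o->thread_count, 1);   /* deferred: only the owner so far     */
    o->dealloc = dealloc;
    return o;
}

void obj_incref(obj_t *o) {
    if (thrd_equal(o->owner, thrd_current())) {
        o->local_count++;                            /* fast path, no atomics  */
    } else if (atomic_fetch_add(&o->shared_count, 1) == 0) {
        atomic_fetch_add(&o->thread_count, 1);       /* "set it to 2"          */
    }
}

void obj_decref(obj_t *o) {
    int last;
    if (thrd_equal(o->owner, thrd_current()))
        last = (--o->local_count == 0);
    else
        last = (atomic_fetch_sub(&o->shared_count, 1) == 1);
    /* Only when a thread drops its last reference does it touch the shared
     * thread_count; whichever side takes that to zero runs the destructor.
     * (Races between a final decref and a concurrent first incref from
     * another thread are glossed over here.) */
    if (last && atomic_fetch_sub(&o->thread_count, 1) == 1)
        o->dealloc(o);
}
```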


JustAMan commented Jun 9, 2016

I think someone at the sprints was already working on this, and @larryhastings was working on buffered refcounts (again to get rid of atomics, but in another way).

Just to raise the priority of this issue: I have done some profiling (running the x.py benchmark), and it looks like roughly 66% of the time spent in PyEval_EvalFrameEx itself (excluding its callees) goes to atomic increfs and decrefs! If that could be brought down to zero it would speed things up greatly.

The same profile shows that atomic refcounting takes at least as large a share of call_function, fast_function, PyFrame_New and frame_dealloc (in some of those, atomic operations take more than 90% of the time!).

@JustAMan

@larryhastings - what's the status of the refcount rework? This seems to be the biggest slowdown so far...

@mysteryjeans

I came here to post this same idea, so it seems valid as a concept.

Just to be clear: when incrementing the object's refcount on the local thread, first check whether it is currently zero; if so, do an atomic increment of the object's thread-refcount first. Similarly, after decrementing the local refcount, check whether it has become zero; if so, do an atomic decrement of the thread-refcount, and if that also reaches zero, call the destructor on that same thread.
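
Something like this, I think (again just a sketch; the tl_count helper and its toy fixed-size table are stand-ins for whatever per-thread lookup would actually be used, which is exactly the cost the [edit] in the first comment worries about):

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct {                    /* trimmed-down version of the struct above */
    atomic_long thread_count;       /* threads currently holding any references */
    void (*dealloc)(void *);
} obj_t;

enum { TL_SLOTS = 64 };             /* toy fixed size; a real map would grow    */
typedef struct { const void *obj; long count; } tl_slot;
static _Thread_local tl_slot tl_table[TL_SLOTS];

/* Find (or create) this thread's counter for obj: linear probing, no removal.
 * The cost of this lookup is the open question raised in the first comment. */
static long *tl_count(const void *obj) {
    size_t h = ((size_t)obj / sizeof(void *)) % TL_SLOTS;
    for (size_t i = 0; i < TL_SLOTS; i++) {
        tl_slot *s = &tl_table[(h + i) % TL_SLOTS];
        if (s->obj == obj) return &s->count;
        if (s->obj == NULL) { s->obj = obj; s->count = 0; return &s->count; }
    }
    abort();                        /* table full -- acceptable for a sketch    */
}

void incref(obj_t *o) {
    long *local = tl_count(o);
    if ((*local)++ == 0)                              /* thread's first reference */
        atomic_fetch_add(&o->thread_count, 1);        /* register the thread      */
}

void decref(obj_t *o) {
    long *local = tl_count(o);
    if (--*local == 0 &&                              /* thread's last reference  */
        atomic_fetch_sub(&o->thread_count, 1) == 1)   /* ...and the last thread   */
        o->dealloc(o);                                /* destructor on this thread */
}
```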
