
Python core dump or memcached/redis I/O error can cause other-process apply_async to loop for thundering_herd_timeout #66

Open
kylegibson opened this issue Sep 22, 2016 · 4 comments


@kylegibson
Member

See here: https://github.com/PolicyStat/jobtastic/blob/master/jobtastic/cache/base.py#L91

The timeout doesn't mean that add will block and wait. It just means that if the key can be set, it will be set with that timeout. So if the key already exists, add returns immediately, and the surrounding retry loop becomes a busy loop.
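A minimal sketch of why this spins, using a toy in-memory stand-in for a Django-style cache backend (the class and names here are illustrative, not jobtastic's actual code): add() never waits for the key to become available; the timeout only sets the new key's expiry.

```python
import time

class InMemoryCache:
    """Toy stand-in for a Django-style cache backend (illustration only)."""

    def __init__(self):
        self._store = {}

    def add(self, key, value, timeout):
        # add() is non-blocking: it returns False immediately if the key
        # already exists; timeout only controls when a *new* key expires.
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return False
        self._store[key] = (value, now + timeout)
        return True

cache = InMemoryCache()
acquired_first = cache.add("herd-lock", 1, timeout=60)   # key absent: True
acquired_second = cache.add("herd-lock", 1, timeout=60)  # key exists: False, instantly
```

So a retry written as `while not cache.add(key, value, timeout): pass` returns from each call instantly and spins at full speed until the key expires or is deleted.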

@winhamwr
Contributor

I think this is actually #9. I'm going to write up a potential solution to this problem on that issue, based on our discussion.

@winhamwr
Contributor

This is actually different from #9; I was wrong. Because we call lock.acquire in both the memcached and redis backends with no timeout possibility, if for some reason the lock.release doesn't happen in another thread/process, the lock acquisition will wait forever.

I think that's probably low impact, since it would require the Python process to have core dumped, or the call to lock.release() or self.cache.delete to have failed because of network or memcached/redis problems. In that case, though, the other processes waiting on the lock would loop forever.
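The fix direction is a bounded acquire: give up after a deadline instead of waiting forever. A minimal sketch using the stdlib threading.Lock (whose acquire() accepts a timeout) to simulate a holder that died before releasing; the helper name is hypothetical, and a real backend would use the equivalent bounded-wait option on its memcached/redis lock:

```python
import threading

def acquire_with_deadline(lock, deadline_seconds):
    """Try to take the lock, but give up after deadline_seconds
    instead of waiting forever (guards against a lost release)."""
    return lock.acquire(timeout=deadline_seconds)

lock = threading.Lock()
lock.acquire()  # simulate a holder that crashed before calling release()

# Instead of blocking forever, this returns False after ~0.1s.
got_it = acquire_with_deadline(lock, 0.1)
```

With an unbounded `lock.acquire()`, the second call above would hang indefinitely; the deadline converts a stuck lock into a fast, handleable failure.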

@winhamwr winhamwr changed the title Busy waiting in cache.base Python core dump or memcached/redis I/O error can cause other-process apply_async to loop for thundering_herd_timeout Sep 22, 2016
@winhamwr
Contributor

Probably the behavior we want is for those other processes to fail quickly when they're stuck waiting on a lock. thundering_herd_timeout governs how long we should wait for a queued task to complete, which depends on how busy your workers are and how long the task takes to run.

This timeout is different: it's about how long you would expect a process to take to run through the apply_async method, which does no work other than checking the cache and possibly queueing a task. My 20-minute reporting/analytics/backup task should fail within seconds if it's stuck waiting for the task to actually be queued; it shouldn't block a (probably web-serving) process.
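One way this separation of timeouts might look, sketched with hypothetical names (HerdLockTimeout, safe_apply_async, try_acquire_lock are all assumptions, not jobtastic API): a short bound covers only the cache-check/queueing step, while thundering_herd_timeout remains the bound on the queued task itself.

```python
class HerdLockTimeout(Exception):
    """Hypothetical error: queueing took longer than the short bound."""

def safe_apply_async(try_acquire_lock, queue_task, queue_timeout=5.0):
    # queue_timeout bounds only the lock/queueing step; it is separate
    # from thundering_herd_timeout, which bounds the task's own runtime.
    if not try_acquire_lock(queue_timeout):
        raise HerdLockTimeout(
            "could not acquire herd lock within %.1fs" % queue_timeout)
    return queue_task()

# Happy path: lock is available, the task gets queued right away.
result = safe_apply_async(lambda t: True, lambda: "queued")

# Stuck lock: fail fast instead of spinning for thundering_herd_timeout.
try:
    safe_apply_async(lambda t: False, lambda: "queued")
    failed_fast = False
except HerdLockTimeout:
    failed_fast = True
```

The caller (probably a web process) then gets an exception in seconds and can retry or degrade gracefully, rather than blocking a request for the full herd timeout.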

@winhamwr
Contributor

I created #68 to document how I think a feature might look to do simultaneous execution prevention.
