Description: When there are many workers, ES gets overloaded and cannot assign new tasks fast enough. This reduces efficiency so much that, beyond a certain point, adding more workers actually hurts performance. It happens because ES is busy saving task results and is therefore unavailable to assign new tasks.
Steps to reproduce:
Add ~80 workers
Make many submissions (~25x number of workers)
Rejudge all submissions to fill the queue (best to recompile so that all tasks take roughly the same amount of time)
Expected: A worker should be assigned a new task immediately after it finishes one.
Actual: Workers end up waiting a noticeable amount of time for the next task. This can be seen from the admin page: one second all workers are busy, and the next they are all idle while the queue is still full.
Describe the solution you'd like
In IOI 2017, we used a separate service to maintain the queue. The queue service remains responsible for assigning new tasks to the workers and collecting the results from them. However, when it receives task results, it forwards them separately to ES to be saved to the database. This allows the worker to be reused immediately and removes the bottleneck on the queue. It also means ES can be scaled, which is currently not possible since it is also handling the queue. I suggest adapting the code from 2017.
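To make the idea concrete, here is a minimal sketch of the proposed split (hypothetical names, not the actual 2017 code): a queue service hands tasks to workers and drops finished results onto an internal queue that a background saver drains, so a worker never waits on the slow database write before getting its next task.

```python
import queue
import threading
import time

class QueueService:
    def __init__(self, save_result):
        self.tasks = queue.Queue()       # pending judging tasks
        self.results = queue.Queue()     # finished results awaiting DB save
        self._save_result = save_result  # slow call into ES / the database
        # Background saver: persists results without blocking assignment.
        threading.Thread(target=self._saver, daemon=True).start()

    def enqueue(self, task):
        self.tasks.put(task)

    def worker_loop(self, worker_id, judge, rounds):
        # A worker repeatedly takes a task, judges it, drops the result
        # off, and immediately asks for the next task.
        for _ in range(rounds):
            task = self.tasks.get()
            self.results.put(judge(worker_id, task))

    def _saver(self):
        while True:
            self._save_result(self.results.get())

saved = []
def slow_save(result):
    time.sleep(0.01)  # stands in for a slow database write
    saved.append(result)

qs = QueueService(slow_save)
for i in range(8):
    qs.enqueue(i)
workers = [
    threading.Thread(target=qs.worker_loop, args=(w, lambda w, t: (w, t), 4))
    for w in range(2)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
while len(saved) < 8:  # wait for the saver to drain
    time.sleep(0.01)
print(len(saved))      # all 8 results persisted
```

The key property is that `worker_loop` never touches the database: the only thing between finishing one task and starting the next is a `Queue.put`, which is effectively instantaneous.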
Describe alternatives you've considered
One alternative is to have workers connected to different ESes. Each ES would act as the bridge between its workers and the queue: it would obtain the results, save them to the database, and then ask the queue for a new job on the worker's behalf. However, in this case, a worker would still be idle while its ES is saving results. Even though each ES handles fewer workers, the database can still be stressed, making the save operation slower and the idle time longer. We would also need to handle what happens if one ES goes down (its workers would need to connect to a different ES, etc.). This approach also differs from the current one in that it relies on ES pulling new jobs from the queue, whereas right now the queue assigns jobs to idle workers.
Workers could be responsible for writing the results directly to the database, but this is probably undesirable since it would require workers to have direct database access.
ES could prioritize assigning tasks over saving results. But this would still not allow it to scale, which I think is the more important problem. Also, once it is in the middle of saving a result it cannot abandon it (though that is probably less important).
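For illustration, this prioritization alternative could be sketched as a single ES event loop draining one priority queue, where assignment events always outrank save events (the event names here are made up):

```python
import queue

ASSIGN, SAVE = 0, 1  # lower number = higher priority
events = queue.PriorityQueue()

# Events arrive interleaved, as they would under load.
events.put((SAVE, "save result of sub 1"))
events.put((ASSIGN, "assign task to worker 3"))
events.put((SAVE, "save result of sub 2"))
events.put((ASSIGN, "assign task to worker 7"))

order = []
while not events.empty():
    _, ev = events.get()  # all ASSIGN events drain before any SAVE
    order.append(ev)
print(order)
```

Note that this only reorders work within a single ES process; it does nothing for the scaling concern, and a save already in progress still blocks until it finishes.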
We could also have several queues and place each task in one of them at enqueue time (for example in round-robin fashion). However, this means that if the workers of one ES die for some reason, its tasks will be stuck (or completed more slowly). It would also make recovering from the failure of one ES harder: right now we can simply re-add all remaining tasks to the queue, but then we would have to check what is pending in the other ESes, etc.
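The round-robin alternative amounts to something like the following sketch (hypothetical per-ES queue objects). It makes the failure mode easy to see: whatever sits in a dead ES's queue is stranded there.

```python
import itertools
import queue

es_queues = [queue.Queue() for _ in range(3)]  # one queue per ES
rr = itertools.cycle(es_queues)

# Spread incoming tasks across the per-ES queues at enqueue time.
for task in range(9):
    next(rr).put(task)

print([q.qsize() for q in es_queues])  # [3, 3, 3]
# If the workers of es_queues[0] die, its 3 tasks are stuck until
# someone inspects that queue and redistributes them.
```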
Additional context
I am writing these details based on what I remember from IOI 2017. I believe the problem is still there, since it was recently added to the IOI tech guide (https://ioi.github.io/checklist/).
I have opened this issue to see if there is a consensus on doing the separation of ES and the queue.
Was this code included in the CMS version used at IOI2017?
In general, I also agree that we could separate ES and the queue, taking advantage of one of the many asynchronous queue libraries available. It could also be possible (in theory) to rewrite ES in another language that is faster, or has a faster runtime, than Python.
This is a high level topic that would need some exploration to see how feasible it is.
Batching can help a bit, but the problem remains: at full load, workers still reach out to ES at almost the same time and ES becomes a bottleneck again. I also don't think ES needs to be reimplemented in a different language, since simply separating the queue from ES resolved the issue completely in 2017. I am not sure, but I suspect the actual bottleneck is the time needed to store results in the database, which would not depend on the language.
If we want to go forward with this (which I suggest we do), I might be able to allocate some time to port the code from 2017 to the current version and send a PR.
That's very interesting Amir 😄 it would definitely be great if you could port the Queue service to the current CMS version, I will also try to take a look at that branch in the next days.