Make gush handle workflows with 1000s of jobs #55
Comments
You are right that it's slow. This is a known issue because (tl;dr) Gush was never intended for such big workflows. A possible solution would be to allow multiple backends instead of just Redis. Ideally a graph database, since it's a graph. Then knowing whether a node can be executed would be a single query.
@pokonski yeah, I agree that some kind of graph solution would be ideal. At this moment I am seeing a 10x performance boost with my current solution by changing the way the data is stored. Do you have a chance to review my changes? I can submit a PR if it makes sense to you.
10x boost is great! Can you create a PR with those changes? The downside is that it will break compatibility, so it will require a major version release.
@Saicheg I am very interested in your 10x boost PR as well 👍
Scratch that, I didn't see that the PR was already merged! So instead let me say thank you for the PR :)
I tested the merged version and the job times stayed the same for ~1500 jobs, instead of runtimes increasing with each job. I am creating Highcharts images for ~300,000 charts, ranging from 1-8 seconds per chart (depending on DB calls and remote chart server request times). Before, with v1.1.2, the times went up to 90 seconds per job, making Gush pretty useless.

Btw, I know that Gush was not made to handle 1000s of jobs. But what would be the point of using such a lib only for small things, which could be handled in a single dedicated Sidekiq job class? Obviously there are reasons, but you really, really need batch/workflow handling for big jobs. I don't see myself paying for Sidekiq for the rest of my app's lifetime (10+ years) just because of one or two batch jobs. So yes, I can live with some drawbacks and code fiddling.
I do have some idea about making it way faster for huge workflows but this would require running a separate process, I'll definitely experiment with it first. Ideally having a graph database instead of Redis (or something like https://oss.redislabs.com/redisgraph/ ) would make it trivial to query for jobs which have fulfilled dependencies (this is the slowest part), but that is a big requirement for most users who don't need 1000s of jobs.
I developed it for quite big, but still static workflows where parts could fail often (fetching from various APIs) but not block other jobs that don't depend on that particular one.
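One common technique for the readiness problem discussed above (this is a sketch, not Gush's actual implementation; all names are illustrative) is to keep an indegree counter per job: track how many unfinished dependencies each job has, decrement when a dependency completes, and enqueue as soon as the counter hits zero. Deciding readiness then costs O(out-degree) per completed job instead of re-scanning every dependency:

```ruby
# Indegree-counter dependency tracking (illustrative sketch, not Gush's API):
# a job becomes ready exactly when its pending-dependency count reaches zero.
class DependencyTracker
  def initialize
    @pending = Hash.new(0)                      # job => unfinished dependency count
    @dependents = Hash.new { |h, k| h[k] = [] } # job => jobs waiting on it
  end

  def add_dependency(job, depends_on)
    @pending[job] += 1
    @dependents[depends_on] << job
  end

  def ready?(job)
    @pending[job].zero?
  end

  # Returns the jobs that became ready when `job` finished;
  # each completion only touches that job's direct dependents.
  def finish(job)
    @dependents[job].select do |dependent|
      @pending[dependent] -= 1
      @pending[dependent].zero?
    end
  end
end
```

For the Map/Reduce shape from this issue, a `Reduce` with N `Map` dependencies is checked N times total (one decrement each), rather than N full scans of all N dependencies.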
@Saicheg @xtagon @schorsch I have a WIP branch with an experimental feature using RedisGraph for dependency resolution. Its downside is that it requires a custom Redis module, but if you have time, please do check it out:
I am super curious whether it will help with your cases of huge workflows; I'll prepare benchmarks on my side, too.
Did a benchmark based on @Saicheg's example workflow and the results are in:
GREAT! Two things to mention:

1. When reloading and getting the status of a running workflow I get an error: `flow.reload` & `flow.status`
2. At the end of the run, the optimize / cache clean jobs seem to be executed multiple times. But I don't know for sure, since I did not log / puts anything; I only see the job times in the Sidekiq console with their job IDs. See those 30-second jobs.
Is there any (easy) method of finding out which job a given `Gush::Worker` JID-xy executed? I looked at the source but did not find anything directly.

Update: I added the good old `puts` into the jobs, and in fact the jobs scheduled after the image creation are executed multiple times.
Thanks for testing this @schorsch! I'll have a look at why they are executed multiple times. Can you provide the workflow you have used? Just the workflow definition will do :)
```ruby
class CreateImagesWorkflow < Gush::Workflow
  def configure(klass_name)
    klass = "RI::Chart::Cmd::#{klass_name}".constantize
    # find all codes and schedule single jobs, ~3000-30,000 per kind
    img_jobs = klass.codes.map do |code|
      run RI::Chart::Job::CreateImage, params: { code: code, klass_name: klass_name }
    end
    # schedule the optimizer, which figures out the files from the given class
    run RI::Chart::Job::OptimizeImages, after: img_jobs, params: { klass_name: klass_name }
    run RI::Chart::Job::CleanUploadCache, after: RI::Chart::Job::OptimizeImages
  end
end
```

The called jobs all descend from `Gush::Job`, expect their params as strings, and are pretty small in terms of LOC. Btw, I am visualizing German health-care data; here is an example of a public view where you can see some of the generated chart images: https://app.reimbursement.info/icds/F10
Please see the discussion I started about a potential way to improve the bottlenecks mentioned here: #95
Hello!
First of all, thank you for this great library. We've been using it for a couple of our projects and it's been really great.

The issue I am facing right now is that my workflow with 1000s of jobs is dramatically slow because of Gush. Let me give an example:
Playing around with Gush, I found out that after each job completes, Gush has to visit all dependent jobs (`Reduce` in our case) and try to enqueue them. But in order to decide whether such a job can be enqueued, Gush needs to check that all of its dependencies (the `Map` jobs in our case) are finished: https://github.com/chaps-io/gush/blob/master/lib/gush/job.rb#L90

The problem with this code is that for each `Map` job, after every `Map` job finishes, it will call `Gush::Client#find_job`. This produces a massive number of `SCAN` operations and dramatically decreases performance because of this line of code: https://github.com/chaps-io/gush/blob/master/lib/gush/client.rb#L119
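To make the cost concrete, here is a minimal Ruby sketch (illustrative only; the method names are not Gush's) counting lookups for a workflow of N `Map` jobs feeding one `Reduce` job. With per-dependency lookups, every `Map` completion re-checks all N dependencies, so the total work grows quadratically; fetching the whole dependency set in one bulk read per check keeps it linear:

```ruby
# Each finished Map job triggers a readiness check on Reduce, and the
# naive check does one find_job-style lookup per dependency.
def naive_lookup_count(map_job_count)
  lookups = 0
  map_job_count.times do
    lookups += map_job_count # one lookup per dependency, every time
  end
  lookups
end

# If all job statuses for a workflow can be fetched in a single bulk
# read (e.g. one hash read), each readiness check costs one round trip.
def batched_lookup_count(map_job_count)
  map_job_count # one bulk fetch per completed Map job
end
```

For N = 1000 that is 1,000,000 lookups versus 1,000 bulk reads, which matches the "dramatically slow with 1000s of jobs" behavior described above.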
I am not sure what the best solution is in this case. I've tried to solve this problem by changing the way Gush stores serialized jobs. The idea is, instead of storing jobs individually, to store them in a hash per workflow/job type. I already have my own implementation, but I still have to run benchmarks on it:
rubyroidlabs@4ee1b15
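A sketch of that storage idea (illustrative only, not the code in the linked commit; a plain in-memory `Hash` stands in for Redis here, where the real change would use `HSET`/`HGET`/`HGETALL` on one hash key per workflow): grouping all of a workflow's jobs under one hash makes finding a job a direct field lookup instead of a `SCAN` over the keyspace.

```ruby
require 'json'

# Before: one Redis key per job, located via SCAN over "gush.jobs.<wid>.*".
# After: one hash per workflow, so any job is a direct field access and
# the whole dependency set comes back in a single bulk read.
class WorkflowStore
  def initialize
    @store = Hash.new { |h, k| h[k] = {} } # stand-in for Redis hashes
  end

  def persist_job(workflow_id, job_name, attrs)
    # analogous to: HSET gush.jobs.<workflow_id> <job_name> <json>
    @store["gush.jobs.#{workflow_id}"][job_name] = attrs.to_json
  end

  def find_job(workflow_id, job_name)
    # analogous to HGET: O(1) field access instead of a keyspace SCAN
    json = @store["gush.jobs.#{workflow_id}"][job_name]
    json && JSON.parse(json)
  end

  def all_jobs(workflow_id)
    # analogous to HGETALL: one round trip for every job in the workflow
    @store["gush.jobs.#{workflow_id}"].transform_values { |j| JSON.parse(j) }
  end
end
```

This is also why the change is backwards-incompatible, as noted earlier in the thread: existing per-key job data would have to be migrated into the per-workflow hashes.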
@pokonski what do you think here?