Move App Systemd Tasks to Active Jobs #552
base: master
Conversation
This introduces substantial changes to email pulling.
* Email pulling is no longer managed with systemd and rake.
* The email pulling job no longer manages its own lifecycle. Since we are no longer using systemd, we lose its ability to recover from unexpected failures by restarting the task. On the other hand, we no longer need to manage an IMAP connection lifecycle or work around rate-limiting errors.
* IMAP failures are now logged as errors, since we do not expect rate-limiting or dropped-connection errors on a short-lived connection.
raise "No acceptable authentication mechanisms" | ||
end | ||
rescue Net::IMAP::NoResponseError, SocketError, Faraday::ConnectionFailed => error | ||
logger.error("Could not authenticate for #{config[:email]}, error: #{error.message}") |
Since we no longer anticipate rate-limits or connection drops, I changed this (and one below) to log an error. I figure that while it is an error, it is not likely to be one resulting from our app logic, but should it raise an error instead?
I think yes, it should raise an error so that the job is properly marked as failed. We can then set retry_on or discard_on depending on how it errored.
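For illustration, a minimal sketch of what that could look like, assuming the rescue is removed so the error propagates out of perform (the error list and retry settings here are assumptions, not final values):

class PullEmailJob < ApplicationJob
  queue_as :default

  # Transient network problems: let Active Job retry a few times with a delay.
  retry_on Net::IMAP::NoResponseError, SocketError, Faraday::ConnectionFailed,
           wait: 10.minutes, attempts: 3

  # A failure we would never want to retry could be discarded instead
  # (the error class here is purely illustrative):
  # discard_on SomeUnrecoverableConfigurationError

  def perform
    # ... authenticate and pull mail; if authentication fails, the raised error
    # marks this run as failed and the callbacks above decide what happens next.
  end
end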
This is great! The main thing it is missing is something to initially start these jobs and then a before_perform to keep them going.
Alternatively, we could use something like activejob-scheduler or sidekiq-scheduler. The Sidekiq one seems more popular, and the schedules aren't really needed for development (which would use the async backend instead of Sidekiq).
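A rough sketch of the self-perpetuating variant, assuming a 20-minute interval (a scheduler gem would replace the callback entirely):

class PullEmailJob < ApplicationJob
  queue_as :default

  # Enqueue the next run before doing any work, so a failure inside perform
  # does not break the chain. Note that a retry would trigger this callback
  # again, which is one of the self-scheduling pitfalls discussed further down.
  before_perform do |job|
    job.class.set(wait: 20.minutes).perform_later
  end

  def perform
    # ... pull email once ...
  end
end

# Something still has to enqueue the very first run, e.g. a one-off rake task,
# a console call, or a deploy hook:
# PullEmailJob.perform_later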
raise "No acceptable authentication mechanisms" | ||
end | ||
rescue Net::IMAP::NoResponseError, SocketError, Faraday::ConnectionFailed => error | ||
logger.error("Could not authenticate for #{config[:email]}, error: #{error.message}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think yes it should raise an error so that the job is properly marked as failed. We can then set retry_on
or discard_on
depending on how it errored.
waiting = Thread.start do
  sleep(20.minutes)

  imap.idle_done
end

imap.idle do |response|
  if response.respond_to?(:name) && response.name == 'EXISTS'
    waiting.kill
    imap.idle_done
  end
end
This is interesting--it seems that we force it to wait for 20 minutes, and then the imap.idle command will return when there are new inbox items. I'm wondering if we should have a separate task to spawn jobs with this method. We could do imap.idle(20*60) { |res| PullEmailJob.perform_later }, which would cause it to run every 20 minutes, or earlier if there is a notification. That job would have an after_perform to restart it.
https://ruby-doc.org/stdlib-2.5.3/libdoc/net/imap/rdoc/Net/IMAP.html#method-i-idle
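Very roughly, that separate watcher could look like the sketch below. The WatchInboxJob name, host, and credentials are placeholders; Net::IMAP#idle takes an optional timeout in seconds, so the block returns after 20 minutes or as soon as the server pushes a response:

class WatchInboxJob < ApplicationJob
  queue_as :default

  # Keep exactly one watcher idling by restarting it when it finishes.
  after_perform do |job|
    job.class.perform_later
  end

  def perform
    imap = Net::IMAP.new('imap.example.com', port: 993, ssl: true) # placeholder host
    imap.login('user@example.com', 'secret')                       # placeholder credentials
    imap.select('INBOX')

    # Block for up to 20 minutes, or until the server reports new mail.
    imap.idle(20 * 60) do |response|
      imap.idle_done if response.respond_to?(:name) && response.name == 'EXISTS'
    end

    # Hand the actual pulling off to a short-lived job.
    PullEmailJob.perform_later
  ensure
    imap&.disconnect
  end
end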
On some level I'd rather have more frequent short-lived jobs than long-running ones. Sidekiq at least can run multiple threads, but it still hogs one of them. However, with Docker scaling plus threading, I don't think it will matter to us, so I'm happy to re-roll the idling.
Another concern of mine is that it sounds like if we immediately reschedule, it would also be possible to end up in a fail-loop and get rate-limited. I think if we do idle, we should just resume at the next scheduled time.
I think we could experiment with a shorter idle delay--maybe 60 seconds to 5 minutes. That way a job never runs longer than 5 minutes. If it fails, it seems there is a wait parameter on the retry_on method, so we can have it wait 10 minutes before retrying. That should cover the looping-rate-limit issue. It could then re-queue itself or rely on another scheduler (although we would need the scheduler to not run if a retry is currently waiting).
https://edgeapi.rubyonrails.org/classes/ActiveJob/Exceptions/ClassMethods.html#method-i-retry_on
So: 1) the initial job starts, 2) the job ends and goes into idle (either launching an idle job or just continuing the current job into idle), 3) the idle job ends and either re-queues itself or a scheduler starts another.
I'm not really worried about thread hogging. Even though the task is running, this is blocking I/O so it won't consume anything else on the system. We can just treat our scaling as n + 1, where 1 is a worker dedicated to email. But also Sidekiq might recognize the blocking I/O and not lock up that worker (I have not tested this).
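Sketching that flow, with a 5-minute idle and a 10-minute back-off assumed, and with pull_new_mail / idle_for_new_mail as hypothetical stand-ins for the existing IMAP code:

class PullEmailJob < ApplicationJob
  queue_as :default

  # On transient failure, wait 10 minutes before retrying to avoid a tight
  # fail-loop against the mail server.
  retry_on Net::IMAP::NoResponseError, SocketError, wait: 10.minutes, attempts: 5

  # Step 3: when a run finishes cleanly, queue the next one (or drop this
  # callback and let an external scheduler enqueue runs instead).
  after_perform do |job|
    job.class.perform_later
  end

  def perform
    pull_new_mail             # step 1: the actual work (hypothetical helper)
    idle_for_new_mail(5 * 60) # step 2: wait up to 5 minutes for a push (hypothetical helper)
  end
end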
class PullEmailJob < ApplicationJob
  queue_as :default

  def perform(*args)
I think you can have no arguments if you aren't using any.
STDOUT.sync = true

logger = Logger.new(STDOUT)
logger.level = Logger::INFO
Not sure if we still want these
reconnectSleep = 1
Suggested change: remove this line.
begin
  imap.select(config[:name])

  while true
We don't want this while true anymore, since we will control the frequency by running the job many times (not a single long-lasting job).
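For example, the body of perform could collapse to a single pass (imap and config here come from the surrounding job code in this diff; process_messages is a hypothetical stand-in for the old loop body):

def perform
  imap.select(config[:name])
  process_messages # hypothetical helper: handle whatever is in the mailbox right now
ensure
  imap.disconnect
end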
class SendEventSlackNotificationsJob < ApplicationJob
  queue_as :default

  def perform(*args)
Same here, I'm not sure we have to specify args if they aren't used.
STDOUT.sync = true

logger = Logger.new(STDOUT)
Not sure if we still want these for STDOUT.
class SendEventSlackNotificationsJob < ApplicationJob
  queue_as :default
This job probably needs a before_perform to schedule the next run on the next hour interval.
A later PR could break this into two jobs: one to send notifications for each event and one to schedule a notification job for each event (so that it isn't restricted to running on the hour). This would also need to keep track of events that have already been notified (and clear that record if the time changes to a later time).
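A rough sketch of that hourly scheduling, assuming "next hour interval" means the top of the next hour:

class SendEventSlackNotificationsJob < ApplicationJob
  queue_as :default

  # Queue the next run at the top of the next hour before doing any work.
  before_perform do |job|
    job.class.set(wait_until: Time.current.at_beginning_of_hour + 1.hour).perform_later
  end

  def perform
    # ... send Slack notifications for events in the coming hour ...
  end
end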
require 'logger'

class PullEmailJob < ApplicationJob
  queue_as :default
This job probably needs a before_perform to schedule the next run on the next 20-minute interval. Or, per the other comment, we could have a separate job that schedules this one and incorporates the IMAP idle notifications.
We can run them from the same container (but separately) and just change the command. This is how I run Sidekiq (and multiple scaled instances of it) in the same Compose instance as Rails:

sidekiq:
  image: ghcr.io/example/example:main
  restart: always
  environment:
    DATABASE_URL: "mariadb://example:example@postgresql/example_production"
    REDIS_URL: "redis://redis:6379/1"
    RAILS_MAILER_DEFAULT_URL_HOST: "example"
    RAILS_MAILER_DEFAULT_URL_PORT: "443"
    RAILS_MAILER_DEFAULT_URL_PROTOCOL: "https"
    RAILS_MAILER_SMTP_HOST: "example"
    RAILS_MAILER_SMTP_PORT: "587"
  command: bundle exec sidekiq
  networks:
    - app-mariadb
    - app-redis
  depends_on:
    - redis
    - mariadb
  deploy:
    replicas: 3
redis:
  image: redis
  restart: always
  networks:
    - app-redis
  volumes:
    - redis:/data
  command: redis-server --save 60 1 --loglevel warning
I don't like the idea of writing our own scheduling code, whether it is self-requeueing or scheduling other jobs. There are a lot of pitfalls, and it might result in more maintenance time than it is worth. It seems like sidekiq-scheduler got most of its race-condition problems sorted out last month (after they were raised in 2016). I like how it looks in general.
For recurring jobs I agree, self-requeueing has a lot of race conditions to work through. If it ever stops, there is a chance it won't start again. However, spawning non-requeueing jobs from jobs is fine. For example, a recurring job every 5 minutes (running on a scheduler) could spawn individual one-shot jobs to send event notifications at exactly the right time (though that feature should go in a separate PR).
Partial progress towards #544.
Removed:
Moved:
Note: email pulling is no longer long-running.
Same:
Note: for moving to Docker, it may make sense to put the TS server into its own container so that Docker can handle its lifecycle.