-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add delay on entity re-ingestion on failed attempt #298
Conversation
When an entity failed to be deposited, PMPY would only wait the minimum age to retry to ingest the entity. This creates the problem where we don't let the problem to be fixed and we flood the logs with information that is no longer useful. This change doubles the time of each attempt to give space for the problems to be fixed. There are 3 configuration parameters added: - ingestion_prefix : The prefix used for redis keys storing the failed attempt count - Default value of - Default value of `prod:pmpy_ingest_attempt:` - ingestion_attempts : The max number of re-ingestion attempts - Default value of `15` - first_failed_wait : The time to add at the beginning which will be double every retry - Default value of `10` The default values will keep retrying for around a week and will drop the entity for ingestion after that. This is related to issue [#297](#297).
spec/pushmi_pullyu/cli_spec.rb
Outdated
time_now = Time.now.to_i | ||
_readded_entity, readded_entity_score = redis.zrange(PushmiPullyu.options[:queue_name], | ||
0, 0, with_scores: true).first | ||
deposit_attempt = redis.get(attempt_key).to_i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be nice to assert that redis.get(attempt_key).to_i equals deposit_attempt + 1
here as it cannot be assumed necessarily from the test side that this will always increment
def self.extra_wait_time(deposit_attempt) | ||
extra = 0 | ||
deposit_attempt.times do |n| | ||
extra += (2**n) * PushmiPullyu.options[:first_failed_wait] | ||
end | ||
extra | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't appear to match the stated specifications of: "doubles the time of each attempt"
With a PushmiPullyu.options[:first_failed_wait] of 10 it appears that this is the added wait time of each attempt:
0
10
30
70
150
310
630
1270
2550
5110
If doubling added wait on each attempt starting with first_failed_wait is what we're trying to accomplish it looks like this would give that:
def self.extra_wait_time(deposit_attempt)
return 0 if deposit_attempt == 0
(2**(deposit_attempt - 1)) * PushmiPullyu.options[:first_failed_wait]
end
The numbers vary slightly so this is a small nit in reality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right!
This code should work:
(2**deposit_attempt) * PushmiPullyu.options[:first_failed_wait]
Will make the changes
o.on('-i', '--ingestion_prefix PREFIX', | ||
'Prefix for keys used in counting the number of failed ingestion attempts') do |prefix| | ||
opts[:ingestion_prefix] = prefix | ||
end | ||
|
||
o.on('-x', '--ingestion_attempts NUMBER', Integer, | ||
'Max number of attempts to try ingesting an entity') do |ingestion_attempts| | ||
opts[:ingestion_attempts] = ingestion_attempts | ||
end | ||
|
||
o.on('-f', '--first_failed_wait NUMBER', Integer, | ||
'Time in seconds to wait after first failed deposit. Time will double every failed attempt') do |failed_wait| | ||
opts[:first_failed_wait] = failed_wait | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth adding these new options to the README
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are there, do let me know if I can clarify it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I was curious if items that fail should be kept track of? Not sure if there's already a way to verify which items didn't succeed or if we even care to keep track of this.
begin | ||
queue.add_entity_in_timeframe(entity) | ||
rescue MaxDepositAttemptsReached => e | ||
log_exception(e) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit of code will log when an entity reaches the max number of attempts for ingestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ConnorSheremeta do you think this would be enough to keep track of the failed items? It would mean going through the logs though, not sure if this is the best approach. Always open to ideas!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filtering through the logs could suffice although it may not be ideal, I suppose these failures could be logged to another file but that may not be worth doing as this information can be filtered as needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# jupiter is submitting the entries to reddis in a hash format using fat arrows. We need to change them to colons | ||
# in order to parse them correctly from json | ||
entity = JSON.parse(entity_json.gsub('=>', ':'), { symbolize_names: true }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this no longer an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding, this was an old error where the items were being added to the queue in an incorrect manner that added the fat arrows. The new versions of jupiter and pmpy will be adding correct json.
@redis.with do |connection| | ||
connection.zadd @queue_name, Time.now.to_f, entity_json | ||
# separate information for priority information and queue | ||
deposit_attempt = connection.incr entity_attempts_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if we deploy this and there are items already in the queue which wouldn't have an entity_attempts_key
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from https://devdocs.io/redis/incr
If the key does not exist, it is set to 0 before performing the operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct, the normal behaviour would continue
redis.zadd 'test:pmpy_queue', 1, 'noid1' | ||
redis.zadd 'test:pmpy_queue', 3, 'noid3' | ||
redis.zadd 'test:pmpy_queue', 4, 'noid2' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are noid1
, noid2
, and noid3
still used or defined somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values where just used for this test.
end | ||
end | ||
|
||
def self.extra_wait_time(deposit_attempt) | ||
(2**deposit_attempt) * PushmiPullyu.options[:first_failed_wait] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
When an entity fails to be deposited, PMPY would only wait for the minimum age to retry to ingest the entity. This creates the problem where we don't let the problem be fixed and we flood the logs with information that is no longer useful.
This change doubles the time of each attempt to give space for the problems to be fixed. There are 3 configuration parameters added:
prod:pmpy_ingest_attempt:
15
10
The default values will keep retrying for around a week and drop the entity for ingestion.
This is related to issue #297.