
Add delay on entity re-ingestion on failed attempt #298

Merged
merged 2 commits into master from add/entity_ingestion_attempts_priority on Jan 31, 2023

Conversation

@lagoan (Contributor) commented Jan 30, 2023

When an entity fails to be deposited, PMPY would only wait for the minimum age before retrying to ingest the entity. This creates the problem where we never give the underlying issue time to be fixed, and we flood the logs with information that is no longer useful.

This change doubles the wait time on each attempt to give space for the problems to be fixed. There are 3 configuration parameters added:

- `ingestion_prefix`: The prefix used for Redis keys storing the failed attempt count. Default value of `prod:pmpy_ingest_attempt:`
- `ingestion_attempts`: The max number of re-ingestion attempts. Default value of `15`
- `first_failed_wait`: The time to add at the beginning, which will be doubled on every retry. Default value of `10`

With the default values, PMPY will keep retrying for around a week and then drop the entity from ingestion.

This is related to issue #297.

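The retry schedule these defaults imply can be sketched as follows. This is an illustration only: `first_failed_wait` and `ingestion_attempts` mirror the documented defaults, and the doubling rule is the one described above.

```ruby
# Illustrative sketch of the exponential backoff described above.
first_failed_wait  = 10   # seconds; documented default
ingestion_attempts = 15   # documented default

# Extra wait added before each retry: 10, 20, 40, 80, ...
waits = (0...ingestion_attempts).map { |n| (2**n) * first_failed_wait }

puts waits.first(4).inspect  # [10, 20, 40, 80]
puts waits.sum               # 327670 seconds of accumulated backoff
```

The accumulated backoff alone comes to roughly 3.8 days; together with the minimum-age waits between attempts, that is consistent with retrying for about a week before the entity is dropped.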
time_now = Time.now.to_i
_readded_entity, readded_entity_score = redis.zrange(PushmiPullyu.options[:queue_name],
                                                     0, 0, with_scores: true).first
deposit_attempt = redis.get(attempt_key).to_i
Contributor

Might be nice to assert that redis.get(attempt_key).to_i equals deposit_attempt + 1 here, as it cannot necessarily be assumed from the test side that this will always increment

Comment on lines 90 to 96
def self.extra_wait_time(deposit_attempt)
  extra = 0
  deposit_attempt.times do |n|
    extra += (2**n) * PushmiPullyu.options[:first_failed_wait]
  end
  extra
end
Contributor

This doesn't appear to match the stated specifications of: "doubles the time of each attempt"

With a PushmiPullyu.options[:first_failed_wait] of 10 it appears that this is the added wait time of each attempt:

0
10
30
70
150
310
630
1270
2550
5110

If doubling added wait on each attempt starting with first_failed_wait is what we're trying to accomplish it looks like this would give that:

def self.extra_wait_time(deposit_attempt)
  return 0 if deposit_attempt == 0

  (2**(deposit_attempt - 1)) * PushmiPullyu.options[:first_failed_wait]
end

The numbers vary slightly so this is a small nit in reality

Contributor Author

You are right!

This code should work:

(2**deposit_attempt) * PushmiPullyu.options[:first_failed_wait]

Will make the changes
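To make the difference concrete, here is a stand-alone sketch of both versions discussed in this thread (the looped sum from the diff and the corrected one-liner), with `PushmiPullyu.options[:first_failed_wait]` replaced by its default of 10:

```ruby
FIRST_FAILED_WAIT = 10  # documented default, stands in for PushmiPullyu.options

# Original looped version: sums all previous doublings (0, 10, 30, 70, ...).
def looped_wait(deposit_attempt)
  extra = 0
  deposit_attempt.times { |n| extra += (2**n) * FIRST_FAILED_WAIT }
  extra
end

# Corrected version: pure doubling of the initial wait (10, 20, 40, 80, ...).
def doubled_wait(deposit_attempt)
  (2**deposit_attempt) * FIRST_FAILED_WAIT
end

looped = (0..4).map { |a| looped_wait(a) }
p looped   # [0, 10, 30, 70, 150]
doubled = (0..4).map { |a| doubled_wait(a) }
p doubled  # [10, 20, 40, 80, 160]
```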

Comment on lines +151 to +165
o.on('-i', '--ingestion_prefix PREFIX',
'Prefix for keys used in counting the number of failed ingestion attempts') do |prefix|
opts[:ingestion_prefix] = prefix
end

o.on('-x', '--ingestion_attempts NUMBER', Integer,
'Max number of attempts to try ingesting an entity') do |ingestion_attempts|
opts[:ingestion_attempts] = ingestion_attempts
end

o.on('-f', '--first_failed_wait NUMBER', Integer,
'Time in seconds to wait after first failed deposit. Time will double every failed attempt') do |failed_wait|
opts[:first_failed_wait] = failed_wait
end
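A stand-alone sketch of how these flags parse; the option names match the block above, while the sample values are simply the documented defaults:

```ruby
require 'optparse'

# Minimal reconstruction of the three new options, for illustration only.
opts = {}
parser = OptionParser.new do |o|
  o.on('-i', '--ingestion_prefix PREFIX') { |prefix| opts[:ingestion_prefix] = prefix }
  o.on('-x', '--ingestion_attempts NUMBER', Integer) { |n| opts[:ingestion_attempts] = n }
  o.on('-f', '--first_failed_wait NUMBER', Integer) { |n| opts[:first_failed_wait] = n }
end

parser.parse(['-i', 'prod:pmpy_ingest_attempt:', '-x', '15', '-f', '10'])
p opts  # prefix string plus the two Integer-coerced values
```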

Contributor

Probably worth adding these new options to the README


@ConnorSheremeta (Contributor) left a comment

LGTM. I was curious if items that fail should be kept track of? Not sure if there's already a way to verify which items didn't succeed or if we even care to keep track of this.

Comment on lines +221 to +225
begin
  queue.add_entity_in_timeframe(entity)
rescue MaxDepositAttemptsReached => e
  log_exception(e)
end
Contributor Author

This bit of code will log when an entity reaches the max number of attempts for ingestion.
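A minimal sketch of that pattern, assuming `MaxDepositAttemptsReached` is a `StandardError` subclass raised once the attempt cap is hit; the helper bodies here are hypothetical stand-ins for the real queue and logger:

```ruby
class MaxDepositAttemptsReached < StandardError; end

# Hypothetical stand-in for the real logger.
def log_exception(e)
  puts "ERROR: #{e.message}"
end

# Hypothetical stand-in for queue.add_entity_in_timeframe.
def add_entity_in_timeframe(attempts, max_attempts)
  raise MaxDepositAttemptsReached, 'max deposit attempts reached' if attempts >= max_attempts

  :queued
end

begin
  add_entity_in_timeframe(15, 15)
rescue MaxDepositAttemptsReached => e
  log_exception(e)  # the entity is logged and dropped rather than re-queued
end
```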

Contributor Author

@ConnorSheremeta do you think this would be enough to keep track of the failed items? It would mean going through the logs though, not sure if this is the best approach. Always open to ideas!

Contributor

Filtering through the logs could suffice, although it may not be ideal. I suppose these failures could be logged to a separate file, but that may not be worth doing since this information can be filtered as needed.

@lagoan lagoan merged commit a1f46d4 into master Jan 31, 2023
@lagoan lagoan deleted the add/entity_ingestion_attempts_priority branch January 31, 2023 21:19
@pgwillia (Member) left a comment

Comment on lines -189 to -191
# jupiter is submitting the entries to reddis in a hash format using fat arrows. We need to change them to colons
# in order to parse them correctly from json
entity = JSON.parse(entity_json.gsub('=>', ':'), { symbolize_names: true })
Member

Is this no longer an issue?

Contributor Author

To my understanding, this was an old error where the items were being added to the queue in an incorrect manner that introduced the fat arrows. The new versions of jupiter and pmpy will be adding correct JSON.

@redis.with do |connection|
  connection.zadd @queue_name, Time.now.to_f, entity_json
  # separate information for priority information and queue
  deposit_attempt = connection.incr entity_attempts_key
Member

What would happen if we deploy this and there are items already in the queue which wouldn't have an entity_attempts_key?

Member

from https://devdocs.io/redis/incr

> If the key does not exist, it is set to 0 before performing the operation.
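That behaviour can be illustrated with a tiny in-memory stand-in for Redis (`FakeRedis` here is purely hypothetical): an absent key counts as 0, so entities already queued before the deploy simply start their attempt count at 1 on their next failure.

```ruby
# Tiny illustrative stand-in mimicking Redis INCR semantics.
class FakeRedis
  def initialize
    @store = {}
  end

  # Like Redis INCR: a missing key is treated as 0 before incrementing.
  def incr(key)
    @store[key] = @store.fetch(key, 0) + 1
  end
end

redis = FakeRedis.new
puts redis.incr('prod:pmpy_ingest_attempt:noid1')  # 1 -- no pre-existing key needed
puts redis.incr('prod:pmpy_ingest_attempt:noid1')  # 2
```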

Contributor Author

That is correct; the normal behaviour would continue.

Comment on lines -11 to -13
redis.zadd 'test:pmpy_queue', 1, 'noid1'
redis.zadd 'test:pmpy_queue', 3, 'noid3'
redis.zadd 'test:pmpy_queue', 4, 'noid2'
Member

Are noid1, noid2, and noid3 still used or defined somewhere?

Contributor Author

These values were just used for this test.

end
end

def self.extra_wait_time(deposit_attempt)
  (2**deposit_attempt) * PushmiPullyu.options[:first_failed_wait]
Member

👍
