fix: Deadlock in PostgreSQL. #4445
base: master
LGTM 🚀
keiko-sql/src/main/kotlin/com/netflix/spinnaker/q/sql/SqlQueue.kt
I'm assuming there's no way to get a test to demonstrate this?
@@ -281,7 +281,8 @@ class SqlQueue(
      return
    }

    candidates.shuffle()
So, I'm mostly concerned because, if I'm reading the comments above correctly:
To minimize lock contention, this is a non-locking read. The id's returned may be
locked or removed by another instance before we can acquire them. We read more id's
than [maxMessages] and shuffle them to decrease the likelihood that multiple instances
polling concurrently are all competing for the oldest ready messages when many more
than [maxMessages] are read.
The shuffle shouldn't have mattered. Except it turns out it DOES matter because of lock behavior. Could we update the comments or explanation here, or refactor this code so it's less confusing how it operates in combination with the code below, which does seem to use a lock mechanism?
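To make the concern concrete, here is a minimal sketch of one way to keep the shuffle's benefit while still locking rows in a predictable order. This is not the change made in this PR. `pickCandidates` is a hypothetical helper, and the sketch assumes the caller then locks each id with its own statement, in list order; a single UPDATE with an IN list would not necessarily follow that order.

```kotlin
import kotlin.random.Random

// Hypothetical helper, not part of SqlQueue: decide which ids to try to lock.
fun pickCandidates(
    ids: List<Long>,
    maxMessages: Int,
    random: Random = Random.Default
): List<Long> =
    ids.shuffled(random)   // spread polling instances across the ready set
        .take(maxMessages) // keep only as many as we intend to process
        .sorted()          // ascending id order, identical on every instance

fun main() {
    val ready = (1L..20L).toList()
    // Two instances polling the same ready set with different random state.
    println(pickCandidates(ready, 5, Random(1)))
    println(pickCandidates(ready, 5, Random(2)))
}
```

Each instance picks a different random subset, but any ids the subsets share come out in the same relative order on both instances, so neither ends up waiting on a row the other locked "out of order".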
Benchmark (cpu: "100m"):
MySQL, 10 records: 23 ms / 13 ms / 11 ms / 12 ms
PostgreSQL, 10 records: 5 ms / 7 ms / 6 ms / 5 ms
In PostgreSQL the rows will be locked as they are updated - in fact, the way this actually works is that each tuple (version of a row) has a system field called xmin to indicate which transaction made that tuple current (by insert or update) and a system field called xmax to indicate which transaction expired that tuple (by update or delete). When we access data, it checks each tuple to determine whether it is visible to our transaction, by checking our active "snapshot" against these values.

If we are executing an UPDATE and a tuple which matches our search conditions has an xmin which would make it visible to our snapshot and an xmax of an active transaction, it blocks, waiting for that transaction to complete. If the transaction which first updated the tuple rolls back, our transaction wakes up and processes the row; if the first transaction commits, our transaction wakes up and takes action depending on the current transaction isolation level.

Obviously, a deadlock is the result of this happening to rows in different order. There is no row-level lock in RAM which can be obtained for all rows at the same time, but if rows are updated in the same order we can't have the circular locking. Unfortunately, the suggested IN(1, 2) syntax doesn't guarantee that. Different sessions may have different costing factors active, a background "analyze" task may change statistics for the table between the generation of one plan and the other, or it may be using a seqscan and be affected by the PostgreSQL optimization which causes a new seqscan to join one already in progress and "loop around" to reduce disk I/O.
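
The ordering argument above can be illustrated with a small in-memory model. This is only a sketch: `canDeadlock` is a hypothetical function, each list stands for the order in which one session locks row ids, and ids are assumed to be distinct within a list. It does not talk to PostgreSQL.

```kotlin
// Hypothetical model: each list is the order in which one session locks row ids.
fun canDeadlock(orderA: List<Int>, orderB: List<Int>): Boolean {
    // Position at which session B locks each row.
    val positionInB = orderB.withIndex().associate { (index, id) -> id to index }
    // Rows both sessions touch, in the order session A locks them.
    val sharedInAOrder = orderA.filter { it in positionInB }
    // A circular wait needs some pair of shared rows locked in opposite orders.
    return sharedInAOrder.zipWithNext().any { (first, second) ->
        positionInB.getValue(first) > positionInB.getValue(second)
    }
}

fun main() {
    println(canDeadlock(listOf(1, 2), listOf(2, 1)))    // true: opposite order
    println(canDeadlock(listOf(1, 2, 3), listOf(1, 3))) // false: same relative order
}
```

In the first call each session can hold one row while waiting for the other, which is the circular wait. In the second, whichever session locks row 1 first simply makes the other wait until it finishes.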