Hang in batch Command
#104
Comments
Just happened with debug logs, attached. One thing I notice is that there's an exact 10s gap between the last DEBUG log and the error.
Ok, I actually have a local reproducer for this now. This happens because a synchronous part of our task is running too long. I'm unsure how to best deal with this, though. It needs to run as part of the task as a whole, but it's not clear to me how to offload it to unblock the ELF and resume processing on the ELF once it's done. I guess I could increase the timeout, but to some degree that's just kicking the can down the road until the next bigger repo / network glitch comes along. I'd like to avoid the hangs, as they require manual restarting of the job.
@finestructure have you tried submitting the long-running job to Vapor's thread pool to offload it from the event loop? https://api.vapor.codes/vapor/documentation/vapor/application/threadpool That should at least free up the underlying event loop and stop it hanging. If that doesn't work, how are you shelling out to `git clone`?
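For reference, a minimal sketch of that kind of offload, assuming a hypothetical blocking `cloneRepository()` helper standing in for the slow work (this is not the project's actual code):

```swift
import Vapor

// Hypothetical blocking helper standing in for the slow synchronous work
// (e.g. shelling out to `git clone`).
func cloneRepository() throws -> String {
    return "checkout-path"
}

func runClone(app: Application, on eventLoop: EventLoop) -> EventLoopFuture<String> {
    // Run the blocking work on Vapor's NIOThreadPool instead of the event loop;
    // the returned future completes back on `eventLoop` once the work is done.
    app.threadPool.runIfActive(eventLoop: eventLoop) {
        try cloneRepository()
    }
}
```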
Ah, that's exactly what I was going to try and figure out tomorrow - how to offload this properly. Thanks a lot for the pointer!
Note to future self re-reading this comment: this has been explained by Gwynne on Discord. The use of the transaction is indeed what triggers the error, but the other code path only avoids it because it doesn't actually request a connection, due to the way Fluent deals with the connection pool. So using …

If my workload runs inside a db transaction, the error occurs; if it isn't run inside one, it doesn't. Note that there's no actual db activity involved at all - the example runs without a db. I may have oversimplified the example and be triggering something that only looks like our issue, but it certainly looks like what we're seeing in the full app (which is also using a transaction). The reproducer (a small Vapor app) is here: https://github.com/finestructure/debug-2227
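For context, a minimal sketch of the two code paths as I understand them (illustrative only, not the actual debug-2227 code; `blockForAWhile()` stands in for the slow synchronous work):

```swift
import Fluent
import Foundation
import NIOCore

// Stand-in for the slow, blocking part of the workload.
func blockForAWhile() { Thread.sleep(forTimeInterval: 30) }

// Path A: `transaction` requests (and holds) a pooled connection for the
// duration of the closure, so the blocking work runs while that connection
// is checked out.
func withTransaction(on db: Database) -> EventLoopFuture<Void> {
    db.transaction { tx in
        blockForAWhile()
        return tx.eventLoop.makeSucceededVoidFuture()
    }
}

// Path B: a plain `db` handle doesn't request a connection until a query
// actually runs, so the same blocking work never touches the pool here.
func withoutTransaction(on db: Database) -> EventLoopFuture<Void> {
    blockForAWhile()
    return db.eventLoop.makeSucceededVoidFuture()
}
```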
I'm pretty sure now this is our core issue. Luckily it was easy to rearrange the code (in this instance) to pull the slow sync workload out of the transaction and do it up front before the db access. Once I do that, the deadlock errors also disappear in my local reproducer with our actual analysis code.
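For the record, a minimal sketch of that rearrangement (names are illustrative stand-ins, not the actual analysis code):

```swift
import Fluent
import NIOCore

// Hypothetical stand-ins for the real slow work and the real db writes.
func slowSynchronousWork() throws -> String { "result" }
func saveResults(_ result: String, on db: Database) -> EventLoopFuture<Void> {
    db.eventLoop.makeSucceededVoidFuture()
}

// Do the slow synchronous work up front; only open the transaction (which
// holds a pooled connection) once the actual db access starts.
func analyze(on db: Database) throws -> EventLoopFuture<Void> {
    let result = try slowSynchronousWork()   // no connection held while this runs
    return db.transaction { tx in
        saveResults(result, on: tx)
    }
}
```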
Looks like this issue is actually two separate issues in a trench coat. After working around the …
Update: the … More details here: SwiftPackageIndex/SwiftPackageIndex-Server#2227 (comment)
This is the effect of the …
Yes, and I managed to work around this particular part of the problem (the …). The real problem for us is that our batch job (the Vapor command) hangs and needs manual restarting. I have my doubts this is actually related to the …

Also note the change in behaviour depending on whether the slow job is within a db transaction or not (#104). Feels like there's something amiss there 🤔
The weird thing is that this timeout seems to depend on whether I'm using a db transaction or not (see the comment and the test project above: #104). I'll bump the parameter up for now to see if that helps with the frequency of the hangs. Thank you!
Not sure why this is in console-kit; it belongs on async-kit, transferring there. |
@finestructure Is this still an issue? Have you tried with the latest async-kit? That had a load of improvements to connection pooling to stop this kind of issue.
Unfortunately, it is. We're on async-kit 1.18.0 and our analysis job on dev is hanging right now. I've started tracking just our prod occurrences of it here: SwiftPackageIndex/SwiftPackageIndex-Server#2227 (comment). I'm unsure what else I could do to track this down - and it's infrequent enough that it's just easier to respond and restart the job when it happens than to dig into this further. Let me know if there's anything I can do to gather more info!
My current best guess is that this is happening because the …

@finestructure To test this theory, try adding an extension along these lines:
    import Fluent
    import Vapor

    extension Application {
        public func db(_ id: DatabaseID?, on eventLoop: any EventLoop) -> any Database {
            self.databases
                .database(
                    id,
                    logger: self.logger,
                    on: eventLoop,
                    history: self.fluent.history.historyEnabled ? self.fluent.history.history : nil,
                    pageSizeLimit: self.fluent.pagination.pageSizeLimit
                )!
        }
    }
If that fixes it, I will look into adding appropriate support to the packages in question.
Thanks for taking a look, @gwynne! We're not using the …; the analyze job is run like this:

    analyze:
      image: registry.gitlab.com/finestructure/swiftpackageindex:${VERSION}
      <<: *shared
      depends_on:
        - migrate
      entrypoint: ["/bin/bash"]
      command: ["-c", "--",
        "trap : TERM INT; while true; do ./Run analyze --env ${ENV} --limit ${ANALYZE_LIMIT:-25}; sleep ${ANALYZE_SLEEP:-20}; done"
      ]
      deploy:
        resources:
          limits:
            memory: 4GB
        restart_policy:
          max_attempts: 5
      networks:
        - backend

Would your theory still apply?
@finestructure Even more so, actually, since by default a …

    let eventLoop = context.application.eventLoopGroup.any()
    let db = context.application.db(.psql, on: eventLoop)

Then make sure to use …
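To illustrate where those lines would sit, here's a sketch of a command's run method (the `AnalyzeCommand` type and its body are hypothetical, and it assumes the `Application.db(_:on:)` extension from the earlier comment):

```swift
import Fluent
import FluentPostgresDriver
import Vapor

struct AnalyzeCommand: Command {
    struct Signature: CommandSignature {}
    var help: String { "Analyze a batch of packages" }

    func run(using context: CommandContext, signature: Signature) throws {
        // Pin the database to one explicit event loop for the whole batch run.
        let eventLoop = context.application.eventLoopGroup.any()
        let db = context.application.db(.psql, on: eventLoop)

        // ... run the batch workload against `db` ...
        _ = db
    }
}
```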
This has been resolved by other work. |
This is a follow-up to a Discord discussion, FYI @0xTim!

We're seeing the following error in one of our Vapor batch jobs every once in a while (a few times per month, maybe once a week - hard to quantify precisely):

…
When this message occurs, the job HANGS - i.e. it locks up and does not terminate, preventing all further processing.
In normal operation the job spins up and typically processes for 3-6 seconds, running a number of queries against the Github APIs (we're processing batches of 25 Swift packages, updating Github metadata in the Swift Package Index).
I've done a little research and there seem to be a couple of cases where this can happen:

…

I suspect it's 2). Last I checked we're not exhausting our db connections, and our db utilization across our two envs is <15% and <10%.
I believe this issue first started happening after moving to async/await. This may be due to how we're launching async processes from a `Command` via our own `AsyncCommand`. (We previously used a semaphore-based implementation which also encountered hangs.)

Our logs show no unusual activity around the error, although we've not yet had it happen with `DEBUG` logging enabled.

We're also tracking this issue here: SwiftPackageIndex/SwiftPackageIndex-Server#2227, and I'll be adding more observations as they happen there.
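For context, one plausible shape of such a `Command`-to-async bridge (a hypothetical sketch; the project's actual `AsyncCommand` wrapper may well differ):

```swift
import Vapor

// A synchronous ConsoleKit `Command` whose work is implemented with async/await.
protocol AppAsyncCommand: Command {
    func runAsync(using context: CommandContext, signature: Signature) async throws
}

extension AppAsyncCommand {
    func run(using context: CommandContext, signature: Signature) throws {
        let promise = context.application.eventLoopGroup.any().makePromise(of: Void.self)
        // Kick off the async work in a Task and complete the promise with its result.
        promise.completeWithTask {
            try await self.runAsync(using: context, signature: signature)
        }
        // Block the calling (non-event-loop) thread until the async work finishes.
        try promise.futureResult.wait()
    }
}
```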