"Received signal 11" in app_analyze #2227
I finally caught a glimpse as to why analysis sometimes hangs:
The backtrace itself isn't in the logs, but hopefully the crash is reproducible with one of the packages in question.
None of these packages individually are causing a signal 11 when running locally (macOS/arm64). It's going to be difficult to run this on Linux/x86 but maybe Linux/arm64 can reproduce it.
Doesn't crash on Linux/arm64 either. I have no way to test this on Linux/x86. The next best thing we can try is to make sure the backtrace shows up in the logs.
It's not clear to me why the backtrace isn't captured by our logging. We're logging both stderr and stdout and yet there's no trace (😅) of the output beyond
I'll close this for now, will reopen and add more details if it happens again.
Another lock-up, different error and we got a partially symbolicated stack trace:
Common theme:
Not sure if that's a symptom or the cause, but worth investigating. Worth noting that these hangs started appearing when I converted more ELFs to async/await (a/a) recently.
It's been a while since we've seen this, so I'm going to close this for now.
This is still happening and it's now also visible on
Recent hangs didn't show a signal 11 or a backtrace, but I'm not 100% sure whether that's just due to how things are logged. What we typically get is the following in the container logs:
and something like this in the Grafana logs:
I'll post something in the Swift Slack to see if there's some kind of logging we can add to track this down. I'm pretty sure this is related to our move to a/a. That's when it started happening.
This seems to be exactly the issue we're running into: vapor/apns#28. It also explains why we're only seeing this in
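For illustration, here is a minimal, hypothetical sketch of the deadlock pattern described in vapor/apns#28. This is a toy pool, not the Vapor/async-kit implementation, and all names are made up: work that already holds a pooled connection awaits another acquisition from the same exhausted pool, so neither side can ever make progress.

```swift
// Toy connection pool for illustration only; names and structure are hypothetical.
actor ToyConnectionPool {
    private let capacity: Int
    private var inUse = 0
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(capacity: Int) { self.capacity = capacity }

    func acquire() async {
        if inUse < capacity {
            inUse += 1
            return
        }
        // Pool exhausted: park until a connection is handed back.
        await withCheckedContinuation { waiters.append($0) }
    }

    func release() {
        if waiters.isEmpty {
            inUse -= 1
        } else {
            // Hand the slot directly to the next waiter.
            waiters.removeFirst().resume()
        }
    }
}

// With capacity 1 this deadlocks: the task holds the only connection and then
// waits for a second one that can never be released - the situation the
// "Connection request timed out" message below warns about.
func demonstrateDeadlock() async {
    let pool = ToyConnectionPool(capacity: 1)
    await pool.acquire()   // holds the only connection
    await pool.acquire()   // suspends forever (or until a request timeout fires)
    await pool.release()
    await pool.release()
}
```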
After increasing log level to
The messages come from this code in the connection pool:

```swift
// all connections are busy, check if we have room for more
if self.activeConnections < self.maxConnections {
    logger.debug("No available connections on this event loop, creating a new one")
    self.activeConnections += 1
    return makeActiveConnection()
} else {
    // connections are exhausted, we must wait for one to be returned
    logger.debug("Connection pool exhausted on this event loop, adding request to waitlist")
    let promise = eventLoop.makePromise(of: Source.Connection.self)
    self.waiters.append((logger, promise))
    let task = eventLoop.scheduleTask(in: self.requestTimeout) { [weak self] in
        guard let self = self else { return }
        logger.error("Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout.")
        if let idx = self.waiters.firstIndex(where: { _, p in return p.futureResult === promise.futureResult }) {
            self.waiters.remove(at: idx)
        }
        promise.fail(ConnectionPoolTimeoutError.connectionRequestTimeout)
    }
    return promise.futureResult.always { _ in task.cancel() }
}
```
After reducing the batch size to 5 (from 25), the messages are still there but there are fewer of them:
This could possibly prevent us from running into the deadlock, but I'll also check if we can get rid of them entirely by increasing the pool size.
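For reference, a minimal sketch of the two knobs in play here (pool size and request timeout), assuming a Fluent Postgres setup. This is not the project's actual configure.swift; the environment variable names and values are placeholders.

```swift
import Fluent
import FluentPostgresDriver
import Vapor

public func configure(_ app: Application) throws {
    app.databases.use(
        .postgres(
            configuration: SQLPostgresConfiguration(
                hostname: Environment.get("DATABASE_HOST") ?? "localhost",
                username: Environment.get("DATABASE_USERNAME") ?? "spi",
                password: Environment.get("DATABASE_PASSWORD") ?? "",
                database: Environment.get("DATABASE_NAME") ?? "spi",
                tls: .disable
            ),
            maxConnectionsPerEventLoop: 4,        // pool size per event loop; the default is 1
            connectionPoolTimeout: .seconds(30)   // how long a request waits before the timeout error above fires
        ),
        as: .psql
    )
}
```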
One notable observation: the last couple of times this happened I didn't see any backtrace. I'm not sure if this means the process isn't crashing. Note also that in January '23, when I opened this issue, we were running on Swift 5.7 and now we're on 5.8, so that might be an explanation.
Happened again on dev just now.
Happened again on dev this morning.
Happened in prod this morning.
And now on dev.
The analysis job is also running. EventLoop latency logging is a recommended diagnostic tool; more details here: apple/swift-nio#2410
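As a rough sketch of what such latency logging could look like (this is not the tooling discussed in apple/swift-nio#2410, just a hand-rolled probe; the function name, interval, and threshold are made up): schedule a task on the event loop for a known deadline and log whenever it fires much later than requested, which indicates the loop was blocked.

```swift
import NIOCore

// Hand-rolled event-loop latency probe; names and values are illustrative only.
func startEventLoopLatencyProbe(on eventLoop: EventLoop,
                                interval: TimeAmount = .seconds(1),
                                threshold: TimeAmount = .milliseconds(100)) {
    let deadline = NIODeadline.now() + interval
    _ = eventLoop.scheduleTask(deadline: deadline) {
        // How late did this task actually run? A large lag means the loop was blocked.
        let lag = NIODeadline.now() - deadline
        if lag > threshold {
            print("event loop lagged by \(lag.nanoseconds / 1_000_000) ms")
        }
        // Re-arm the probe for the next interval.
        startEventLoopLatencyProbe(on: eventLoop, interval: interval, threshold: threshold)
    }
}
```

Starting one such probe per event loop at boot (for example by iterating `app.eventLoopGroup.makeIterator()` in a Vapor app) would surface stalls like the long-running checkouts mentioned in the next comment.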
I mean, we do run checkouts of new packages for instance, and those can take > 10s if they're large. |
prod, ~5:40 CET, July 30 2023 |
prod, ~21:00 CET, July 31 2023 |
prod, ~1:30 CET, Aug 12 2023 |
prod, 9:19 CET, Sep 3 2023 |
We've had a hang again, with all the latest ShellOut fixes, at 16:18 CET today, Sep 6 2023. It's very clear now that the cause is an actual crash:
Note that the last entry, at 14:02 UTC (16:00 CET), is a single entry and then it hangs. Our alerts trigger when we've not seen logs for 20 mins, so that aligns with the alert at 16:18 CET. I see signal 11 crashes every few hours but it seems they don't always lead to a hang. I haven't looked at the logs in more detail yet; they're attached below. analyze-2023-09-06-stderr.logs.zip FYI @gwynne 😢
@finestructure So, from that crash log, looks to me like odds are decent you're hitting the ARC miscompile crash in 5.8.1. My very unorthodox suggestion is to try building with a 5.9 snapshot, for three reasons:
Mmm, we're actually compiling with a 5.9 toolchain, from Sep 1 or so. https://gitlab.com/finestructure/spi-base/-/blob/main/Dockerfile |
Ooof... then my suggestion is to update your dependencies and bump your 5.9 to the latest snapshot (which at the moment appears to be sha256:76911e1da04bce21683872c23033f6e2c07f296df7465fce0bf1e1bd01835001). This will get you the new native backtrace support (if you don't already have it), plus the updated Vapor that doesn't overwrite it with the old version.
That image is from yesterday, dependencies updated on Monday - we should be pretty much up to date, I'd say. I'm not sure why the backtrace is only partially working. It's been like this for a long time. It's weird 🤔
(I'll def check in the morning that we've got the Vapor update you're referring to.)
For reference, the update in question is Vapor 4.81.0, released ~12 hours ago at the time of this writing.
I just realised we'd updated to that latest Swift 5.9 and run the package updates, but it's not deployed yet. It's the revision just after, so what was running had just the ShellOut changes!
Another prod hang, 15:57 CET, Sep 9 2023. stderr is just
stdout logs below - I don't see any backtrace or other bits that indicate issues, except the infamous
Hold on, that segfault is from 10:01 this morning, so certainly not the issue. Again, it seems like the segfault/signal 11 is a red herring and the hangs are due to something else, but who knows what's going on.
This is on prod with Vapor 4.81.0, compiled with Swift 5.9 from Sep 1.
@finestructure If the corefile (indicated by |
prod, 16:33 CET, Sep 14 2023. I missed the first alert and only checked on this now (21:37 CET, UTC+2). I've pulled the logs and the closest seg fault is at 13:28 UTC
with processing continuing until 14:18 UTC, while the connection timeout messages leading up to the hang appear:
There are 15 seg faults in the log file I pulled (ranging from Sep 11 to Sep 14). I think it's safe to say that the seg faults aren't the cause of the hangs. I looked for a core file in the running container but there was none in the executable's directory nor in a few other places I checked (/var/cache/abrt, /var/spool/abrt, /var/crash). Core dump size is unlimited:
It's late, so I restarted the container for now. I can look into where the core file ends up some other time - these seem to happen frequently enough (15 times in 3 days). BTW, the logs do not contain any stack trace info, despite the latest Vapor and Swift 5.9 🤔:
@finestructure For next time, I'd suggest just searching the entire filesystem, e.g.
Closing this as fixed now - we haven't had a hang since the removal. Finally 🙂🎉 Huge thanks again to Gwynne for all the help!