Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Received signal 11" in app_analyze #2227

Closed
finestructure opened this issue Jan 22, 2023 · 56 comments · Fixed by #2379 or #2592
Closed

"Received signal 11" in app_analyze #2227

finestructure opened this issue Jan 22, 2023 · 56 comments · Fixed by #2379 or #2592
Assignees
Labels
bug Something isn't working

Comments

@finestructure
Copy link
Member

finestructure commented Jan 22, 2023

I finally caught a glimpse as to why analysis sometimes hangs:

2023-01-21T18:57:54.937913401Z app_analyze.1.t1dix7fv4zph@p3    | [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2023-01-21T18:57:54.983740764Z app_analyze.1.t1dix7fv4zph@p3    | [ INFO ] Checkout directory: /checkouts [component: analyze]
2023-01-21T18:57:54.987633060Z app_analyze.1.t1dix7fv4zph@p3    | [ INFO ] pulling https://github.com/yonaskolb/Stringly.git in /checkouts/github.com-yonaskolb-stringly [component: analyze]
2023-01-21T18:57:55.021965033Z app_analyze.1.t1dix7fv4zph@p3    | [ INFO ] pulling https://github.com/dagronf/DSFActionBar.git in /checkouts/github.com-dagronf-dsfactionbar [component: analyze]
2023-01-21T18:57:55.029915226Z app_analyze.1.t1dix7fv4zph@p3    | [ INFO ] pulling https://github.com/ordo-one/package-frostflake.git in /checkouts/github.com-ordo-one-package-frostflake [component: analyze]
2023-01-21T18:57:55.271462231Z app_analyze.1.t1dix7fv4zph@p3    | [ INFO ] pulling https://github.com/chojnac/Dumpling.git in /checkouts/github.com-chojnac-dumpling [component: analyze]
2023-01-21T18:57:55.273393230Z app_analyze.1.t1dix7fv4zph@p3    | [ INFO ] pulling https://github.com/raymccrae/swift-jsonpatch.git in /checkouts/github.com-raymccrae-swift-jsonpatch [component: analyze]
2023-01-21T18:57:55.277729726Z app_analyze.1.t1dix7fv4zph@p3    | Received signal 11. Backtrace:
2023-01-21T18:58:04.984287787Z app_analyze.1.t1dix7fv4zph@p3    | [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]
2023-01-21T18:58:04.984746687Z app_analyze.1.t1dix7fv4zph@p3    | [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]

The backtrace itself isn't in the logs but hopefully the crash is reproducible with one of the packages in question.

@finestructure finestructure added the bug Something isn't working label Jan 22, 2023
@finestructure finestructure self-assigned this Jan 22, 2023
@finestructure
Copy link
Member Author

None of these packages individually are causing a signal 11 when running locally (macOS/arm64). It's going to be difficult to run this on Linux/x86 but maybe Linux/arm64 can reproduce it.

@finestructure
Copy link
Member Author

finestructure commented Jan 23, 2023

Doesn't crash on Linux/arm64 either.

I have no way to test this on Linux/x86. The next best thing we can try is to make sure the backtrace shows up in the logs.

@finestructure
Copy link
Member Author

It's not clear to me why the backtrace isn't captured by our logging. We're logging both stderr and stdout and yet there's no trace (😅) of the output beyond Received signal 11. Backtrace:.

@finestructure
Copy link
Member Author

I'll close this for now, will reopen and add more details if it happens again.

@finestructure
Copy link
Member Author

Another lock-up, different error and we got a partially symbolicated stack trace:

2023-01-27T15:37:57.557322522Z app_analyze.1.9jb9jx9vkshk@p2    | [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your con[130/1048]
meout. [component: server, database-id: psql]
2023-01-27T15:37:57.557341622Z app_analyze.1.9jb9jx9vkshk@p2    | [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection ti
meout. [component: server, database-id: psql]
2023-01-27T15:37:57.557377022Z app_analyze.1.9jb9jx9vkshk@p2    | [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection ti
meout. [component: server, database-id: psql]
2023-01-27T15:37:57.557428923Z app_analyze.1.9jb9jx9vkshk@p2    | [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection ti
meout. [component: server, database-id: psql]
2023-01-27T15:37:57.557445323Z app_analyze.1.9jb9jx9vkshk@p2    | [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection ti
meout. [component: server, database-id: psql]
2023-01-26T13:50:08.693743753Z app_analyze.1.luku8u31ozjl@p2    | 0x561f78b36187, closure vapor/console-kit#1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /build/.build/checkouts/swift-backtrace/Sources/Backt
race/Backtrace.swift:80
2023-01-26T13:50:08.693750253Z app_analyze.1.luku8u31ozjl@p2    | 0x561f78b36187, @objc closure vapor/console-kit#1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /build/<compiler-generated>:79
2023-01-26T13:50:08.693754653Z app_analyze.1.luku8u31ozjl@p2    | 0x7f4d69f8841f
2023-01-26T13:50:08.693758753Z app_analyze.1.luku8u31ozjl@p2    | 0x561f799f0571
2023-01-26T13:50:08.694152755Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79533bd0, closure vapor/console-kit#4 () throws -> Swift.String in (extension in ShellOut):Foundation.Process.(launchBash in _839723A297212BDF262C1834C3E29C1F)(with: Swift.String, outputHandle:
 Swift.Optional<Foundation.FileHandle>, errorHandle: Swift.Optional<Foundation.FileHandle>) throws -> Swift.String at /build/.build/checkouts/ShellOut/Sources/ShellOut.swift:442
2023-01-26T13:50:08.694178555Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79533bd0, partial apply forwarder for closure vapor/console-kit#4 () throws -> Swift.String in (extension in ShellOut):Foundation.Process.(launchBash in _839723A297212BDF262C1834C3E29C1F)(with:
 Swift.String, outputHandle: Swift.Optional<Foundation.FileHandle>, errorHandle: Swift.Optional<Foundation.FileHandle>) throws -> Swift.String at /build/<compiler-generated>:0
2023-01-26T13:50:08.694185355Z app_analyze.1.luku8u31ozjl@p2    | 0x561f7982a3eb
2023-01-26T13:50:08.694189055Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79829285
2023-01-26T13:50:08.694192655Z app_analyze.1.luku8u31ozjl@p2    | 0x561f7982d0fb
2023-01-26T13:50:08.694196255Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79824068
2023-01-26T13:50:08.694202855Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79840cd9
2023-01-26T13:50:08.694206355Z app_analyze.1.luku8u31ozjl@p2    | 0x561f7982d071
2023-01-26T13:50:08.694210455Z app_analyze.1.luku8u31ozjl@p2    | 0x561f7982914b
2023-01-26T13:50:08.694214055Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79829594
2023-01-26T13:50:08.694217755Z app_analyze.1.luku8u31ozjl@p2    | 0x561f795318e0, (extension in ShellOut):Foundation.Process.(launchBash in _839723A297212BDF262C1834C3E29C1F)(with: Swift.String, outputHandle: Swift.Optional<Foundation.FileHandle>,
errorHandle: Swift.Optional<Foundation.FileHandle>) throws -> Swift.String at /build/.build/checkouts/ShellOut/Sources/ShellOut.swift:441
2023-01-26T13:50:08.694237255Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79531d27, ShellOut.shellOut(to: Swift.String, arguments: Swift.Array<Swift.String>, at: Swift.String, process: Foundation.Process, outputHandle: Swift.Optional<Foundation.FileH
andle>, errorHandle: Swift.Optional<Foundation.FileHandle>) throws -> Swift.String at /build/.build/checkouts/ShellOut/Sources/ShellOut.swift:40
2023-01-26T13:50:08.694242255Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79531d27, ShellOut.shellOut(to: ShellOut.ShellOutCommand, at: Swift.String, process: Foundation.Process, outputHandle: Swift.Optional<Foundation.FileHandle>, errorHandle: Swift
.Optional<Foundation.FileHandle>) throws -> Swift.String at /build/.build/checkouts/ShellOut/Sources/ShellOut.swift:106
2023-01-26T13:50:08.694566856Z app_analyze.1.luku8u31ozjl@p2    | 0x561f7880f4d8, closure vapor/console-kit#1 (ShellOut.ShellOutCommand, Swift.String) throws -> Swift.String in variable initialization expression of static App.Shell.live : App.Shell at /build/Source
s/App/Core/AppEnvironment.swift:291
2023-01-26T13:50:08.694694157Z app_analyze.1.luku8u31ozjl@p2    | 0x561f78830c1d, App.Shell.run(command: ShellOut.ShellOutCommand, at: Swift.String) throws -> Swift.String at Sources/App/Core/AppEnvironment.swift:283
2023-01-26T13:50:08.694709557Z app_analyze.1.luku8u31ozjl@p2    | 0x561f78830c1d, function signature specialization <Arg[2] = Dead> of static App.Git.revisionInfo(_: App.Reference, at: Swift.String) throws -> App.Git.RevisionInfo at /build/Sources/
App/Core/Git.swift:70
2023-01-26T13:50:08.694851657Z app_analyze.1.luku8u31ozjl@p2    | 0x561f78812395, static App.Git.revisionInfo(_: App.Reference, at: Swift.String) throws -> App.Git.RevisionInfo at /build/<compiler-generated>:0
2023-01-26T13:50:08.694873258Z app_analyze.1.luku8u31ozjl@p2    | 0x561f78812395, implicit closure vapor/console-kit#12 (App.Reference, Swift.String) throws -> App.Git.RevisionInfo in implicit closure vapor/console-kit#11 (App.Git.Type) -> (App.Reference, Swift.String) throws -> Ap
p.Git.RevisionInfo in variable initialization expression of static App.Git.live : App.Git at /build/<compiler-generated>:271
2023-01-26T13:50:08.694901558Z app_analyze.1.luku8u31ozjl@p2    | 0x561f78812395, partial apply forwarder for implicit closure vapor/console-kit#12 (App.Reference, Swift.String) throws -> App.Git.RevisionInfo in implicit closure vapor/console-kit#11 (App.Git.Type) -> (App.Reference
, Swift.String) throws -> App.Git.RevisionInfo in variable initialization expression of static App.Git.live : App.Git at /build/<compiler-generated>:0
2023-01-26T13:50:08.696367663Z app_analyze.1.luku8u31ozjl@p2    | 0x561f7878eefe, closure vapor/console-kit#3 (App.Reference) throws -> App.Version in static App.Analyze.getIncomingVersions(client: Vapor.Client, logger: Logging.Logger, package: App.Joined<App.Packa
ge, App.Repository>) async throws -> Swift.Array<App.Version> at /build/Sources/App/Commands/Analyze.swift:438
2023-01-26T13:50:08.696417564Z app_analyze.1.luku8u31ozjl@p2    | 0x561f7878eefe, function signature specialization <Arg[1] = Owned To Guaranteed, Arg[2] = Owned To Guaranteed> of function signature specialization <Arg[0] = [Closure Propagated : cl
osure vapor/console-kit#3 (App.Reference) throws -> App.Version in static App.Analyze.getIncomingVersions(client: Vapor.Client, logger: Logging.Logger, package: App.Joined<App.Package, App.Repository>) async throws -> Swift.Array<App.Version>, Argument Types : [Swi
ft.StringApp.Joined<App.Package, App.Repository>]> of generic specialization <Swift.Array<App.Reference>, App.Version> of (extension in Swift):Swift.Collection.map<A>((A.Element) throws -> A1) throws -> Swift.Array<A1> at /build/<compiler-generated
>:0
2023-01-26T13:50:08.696437964Z app_analyze.1.luku8u31ozjl@p2    | 0x561f787920c7, function signature specialization <Arg[0] = [Closure Propagated : closure vapor/console-kit#3 (App.Reference) throws -> App.Version in static App.Analyze.getIncomingVersions(client: V
apor.Client, logger: Logging.Logger, package: App.Joined<App.Package, App.Repository>) async throws -> Swift.Array<App.Version>, Argument Types : [Swift.StringApp.Joined<App.Package, App.Repository>]> of generic specialization <Swift.Array<App.Refe
rence>, App.Version> of (extension in Swift):Swift.Collection.map<A>((A.Element) throws -> A1) throws -> Swift.Array<A1> at /build/<compiler-generated>:0
2023-01-26T13:50:08.696444064Z app_analyze.1.luku8u31ozjl@p2    | 0x561f787920c7, (1) suspend resume partial function for function signature specialization <Arg[3] = Dead> of static App.Analyze.getIncomingVersions(client: Vapor.Client, logger: Logg
ing.Logger, package: App.Joined<App.Package, App.Repository>) async throws -> Swift.Array<App.Version> at /build/Sources/App/Commands/Analyze.swift:437
2023-01-26T13:50:08.696486964Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79e6520d
2023-01-26T13:50:08.696503764Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79e658fb
2023-01-26T13:50:08.696508064Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79837424
2023-01-26T13:50:08.696511764Z app_analyze.1.luku8u31ozjl@p2    | 0x561f798371d2
2023-01-26T13:50:08.696515264Z app_analyze.1.luku8u31ozjl@p2    | 0x561f79842ac1
2023-01-26T13:50:08.696518764Z app_analyze.1.luku8u31ozjl@p2    | 0x7f4d69f7c608
2023-01-26T13:50:08.696538164Z app_analyze.1.luku8u31ozjl@p2    | 0x7f4d69ac0132
2023-01-26T13:50:08.696542264Z app_analyze.1.luku8u31ozjl@p2    | 0xffffffffffffffff
2023-01-26T13:50:08.846038854Z app_analyze.1.luku8u31ozjl@p2    | /bin/bash: line 1: 20617 Illegal instruction     (core dumped) ./Run analyze --env prod --limit 25
2023-01-26T13:50:28.883543917Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2023-01-26T13:50:28.953630593Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] Checkout directory: /checkouts [component: analyze]
2023-01-26T13:50:28.958182711Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] pulling https://github.com/genius/future.git in /checkouts/github.com-genius-future [component: analyze]
2023-01-26T13:50:28.987394827Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] pulling https://github.com/xiiagency/SwiftConcurrency.git in /checkouts/github.com-xiiagency-swiftconcurrency [component: analyze]
2023-01-26T13:50:29.001886784Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] pulling https://github.com/JUSTINMKAUFMAN/SwiftUIPhone.git in /checkouts/github.com-justinmkaufman-swiftuiphone [component: analyze]
2023-01-26T13:50:29.226366270Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] pulling https://github.com/patchthecode/JTAppleCalendar.git in /checkouts/github.com-patchthecode-jtapplecalendar [component: analyze]
2023-01-26T13:50:29.231274989Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] pulling https://github.com/RougeWare/Swift-Optional-Tools.git in /checkouts/github.com-rougeware-swift-optional-tools [component: analyze]
2023-01-26T13:50:29.254494181Z app_analyze.1.luku8u31ozjl@p2    | [ INFO ] pulling https://github.com/apple/swift-corelibs-xctest.git in /checkouts/github.com-apple-swift-corelibs-xctest [component: analyze]

@finestructure finestructure reopened this Jan 27, 2023
@finestructure
Copy link
Member Author

finestructure commented Jan 27, 2023

Common theme:

Connection request timed out. This might indicate a connection deadlock in your application.

Not sure if that's a symptom or the cause, but worth investigating.

Worth noting that these hangs started appearing when I converted more ELFs to a/a recently.

@finestructure
Copy link
Member Author

It's been a while since we've seen this, so I'm going to close this for now.

@finestructure
Copy link
Member Author

This is still happening and it's now also visible on dev. I believe it was always present there but we didn't see it because we didn't have an alert active there.

Recent hangs didn't show a signal 11 or a backtrace but I'm not 100% sure if that's just due to how things are logged.

What we typically get is the following in the container logs:

│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │
│   [ ERROR ] Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout. [component: server, database-id: psql]                │

and something like this in the Grafana logs:

2023-04-17 19:38:18	
[ INFO ] pulling https://github.com/gonzalezreal/AdaptiveCardUI.git in /checkouts/github.com-gonzalezreal-adaptivecardui [component: analyze]
2023-04-17 19:38:18	
[ INFO ] pulling https://github.com/EFPrefix/EFStorage.git in /checkouts/github.com-efprefix-efstorage [component: analyze]
2023-04-17 19:38:17	
[ INFO ] pulling https://github.com/hendriku/colorpicker.git in /checkouts/github.com-hendriku-colorpicker [component: analyze]
2023-04-17 19:38:17	
[ INFO ] pulling https://github.com/LiYanan2004/SwiftUI-Haptics.git in /checkouts/github.com-liyanan2004-swiftui-haptics [component: analyze]
2023-04-17 19:38:17	
[ INFO ] pulling https://github.com/0xLet/WTV.git in /checkouts/github.com-0xlet-wtv [component: analyze]
2023-04-17 19:38:17	
[ INFO ] Checkout directory: /checkouts [component: analyze]
2023-04-17 19:38:17	
[ INFO ] Analyzing (limit: 25) ... [component: analyze]

I'll post something in the Swift slack to see if there's some kind of logging we can add to track this down.

I'm pretty sure this is related to our move to a/a. That's when it started happening.

@finestructure
Copy link
Member Author

This here seems to be exactly the issue we're running into: vapor/apns#28

It also explains why we're only seeing this in analyze: of the four batch processes it's the one with the heaviest use of network requests (querying Github's API).

@finestructure
Copy link
Member Author

finestructure commented Apr 18, 2023

After increasing log level to DEBUG we're seeming interesting details:

│   [ DEBUG ] Connection pool exhausted on this event loop, adding request to waitlist [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:206)                                                                             │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Connection pool exhausted on this event loop, adding request to waitlist [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:206)                                                                             │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] updateStatus ops: 25 [component: analyze] (App/Common.swift:48)                                                                                                                                                                          │
│   [ DEBUG ] REFRESH MATERIALIZED VIEW recent_packages [] [component: server, database-id: psql] (FluentPostgresDriver/FluentPostgresDatabase.swift:164)                                                                                              │

The Connection pool exhausted on this event code path is one step from our deadlock error message:

https://github.com/vapor/async-kit/blob/a61da00d404ec91d12766f1b9aac7d90777b484d/Sources/AsyncKit/ConnectionPool/EventLoopConnectionPool.swift#L206

        // all connections are busy, check if we have room for more
        if self.activeConnections < self.maxConnections {
            logger.debug("No available connections on this event loop, creating a new one")
            self.activeConnections += 1
            return makeActiveConnection()
        } else {
            // connections are exhausted, we must wait for one to be returned
            logger.debug("Connection pool exhausted on this event loop, adding request to waitlist")
            let promise = eventLoop.makePromise(of: Source.Connection.self)
            self.waiters.append((logger, promise))
            
            let task = eventLoop.scheduleTask(in: self.requestTimeout) { [weak self] in
                guard let self = self else { return }
                logger.error("Connection request timed out. This might indicate a connection deadlock in your application. If you're running long running requests, consider increasing your connection timeout.")
                if let idx = self.waiters.firstIndex(where: { _, p in return p.futureResult === promise.futureResult }) {
                    self.waiters.remove(at: idx)
                }
                promise.fail(ConnectionPoolTimeoutError.connectionRequestTimeout)
            }
            
            return promise.futureResult.always { _ in task.cancel() }
        }
    }

@finestructure
Copy link
Member Author

After reducing the batch size to 5 (from 25) the messages are still there but there are fewer of them:

│   id: psql] (FluentPostgresDriver/FluentPostgresDatabase.swift:33)                                                                                                                                                                                   │
│   [ DEBUG ] UPDATE "packages" SET "status" = 'ok', "score" = $1, "processing_stage" = 'analysis', "updated_at" = $2 WHERE "packages"."id" = $3 [60, 2023-04-18 09:52:42 +0000, E11AC48B-A2C8-48FB-B496-D05FE9368F93] [component: server, database-   │
│   id: psql] (FluentPostgresDriver/FluentPostgresDatabase.swift:33)                                                                                                                                                                                   │
│   [ DEBUG ] Connection pool exhausted on this event loop, adding request to waitlist [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:206)                                                                             │
│   [ DEBUG ] Connection pool exhausted on this event loop, adding request to waitlist [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:206)                                                                             │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] Fulfilling connection waitlist request [component: server, database-id: psql] (AsyncKit/EventLoopConnectionPool.swift:275)                                                                                                               │
│   [ DEBUG ] updateStatus ops: 5 [component: analyze] (App/Common.swift:48)                                                                                                                                                                           │
│   [ DEBUG ] REFRESH MATERIALIZED VIEW recent_packages [] [component: server, database-id: psql] (FluentPostgresDriver/FluentPostgresDatabase.swift:164)                                                                                              │

This could possible prevent us from running into the deadlock but I'll also check if we can get rid of them entirely by increasing the pool size.

@finestructure
Copy link
Member Author

One notable observation: the last couple of times this happened I didn't see any backtrace. I'm not sure if this means the process isn't crashing.

Note also that in January '23 when I opened this issue we were running on Swift 5.7 and now we're on 5.8, so that might be an explanation.

@finestructure
Copy link
Member Author

Happened again yesterday (alert from April 26 2023, 13:42 CET) but no additional detail in logs:

CleanShot 2023-04-27 at 11 30 17@2x

CleanShot 2023-04-27 at 11 28 52@2x

I didn't get a chance to check the state of the container before a re-deploy "fixed" the issue, so I can't be 100% sure it was the same error.

@finestructure
Copy link
Member Author

@finestructure
Copy link
Member Author

Happened again on dev just now.

@finestructure
Copy link
Member Author

Happened again on dev this morning.

@finestructure
Copy link
Member Author

Happened in prod this morning.

@finestructure
Copy link
Member Author

And now on dev.

@finestructure
Copy link
Member Author

The analysis job is also running git commands. Johannes Weiß mentioned in passing on the Swift Open Source slack that they saw "crazy stuff" due to FileManager.defaul.removeItems(path) taking very long due to a large directory being remove. I wonder if our case is a symptom of that same issue.

EventLoop latency logging is a recommended diagnostic tool, more details here: apple/swift-nio#2410

@finestructure
Copy link
Member Author

I mean, we do run checkouts of new packages for instance, and those can take > 10s if they're large.

@finestructure
Copy link
Member Author

prod, ~5:40 CET, July 30 2023

@finestructure
Copy link
Member Author

prod, ~21:00 CET, July 31 2023

@finestructure
Copy link
Member Author

prod, ~1:30 CET, Aug 12 2023

@finestructure
Copy link
Member Author

prod, 9:19 CET, Sep 3 2023

@finestructure
Copy link
Member Author

We've had a hang again, with all the latest ShellOut fixes, at 16:18 CET today, Sep 6 2023.

It's very clear now that the cause is an actual crash:

CleanShot 2023-09-06 at 21 17 56@2x

2023-09-06T13:36:58.114865247Z Received signal 11. Backtrace:
2023-09-06T13:36:58.615927201Z 0x55ba0c45b4f1, Backtrace.(printBacktrace in _B82A8C0ED7C904841114FDF244F9E58E)(signal: Swift.Int32) -> () at /build/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:66
2023-09-06T13:36:58.615971303Z 0x55ba0c45b7a7, closure #1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /build/.build/checkouts/swift-backtrace/Sources/Backtrace/Backtrace.swift:80
2023-09-06T13:36:58.615994404Z 0x55ba0c45b7a7, @objc closure #1 (Swift.Int32) -> () in static Backtrace.Backtrace.install(signals: Swift.Array<Swift.Int32>) -> () at /build/<compiler-generated>:79
2023-09-06T13:36:58.615999404Z 0x7fb0bf00941f
2023-09-06T13:36:58.616003205Z 0x55ba0df75896
2023-09-06T13:36:58.616006905Z 0x55ba0df7ceb9
2023-09-06T13:36:58.616011105Z 0x55ba0df7ccb5
2023-09-06T13:36:58.616014905Z 0x55ba0df6c60b
2023-09-06T13:36:58.616018605Z 0x55ba0df7575b
2023-09-06T13:36:58.616022306Z 0x55ba0df6941d
2023-09-06T13:36:58.616470727Z 0x55ba0c0fa23a, implicit closure #4 (Swift.String) throws -> Swift.Array<Swift.String> in implicit closure #3 (Foundation.FileManager) -> (Swift.String) throws -> Swift.Array<Swift.String> in variable initialization expression of static App.FileManager.live : App.FileManager at /build/Sources/App/Core/AppEnvironment.swift:270
2023-09-06T13:36:58.616499028Z 0x55ba0c0fa23a, partial apply forwarder for implicit closure #4 (Swift.String) throws -> Swift.Array<Swift.String> in implicit closure #3 (Foundation.FileManager) -> (Swift.String) throws -> Swift.Array<Swift.String> in variable initialization expression of static App.FileManager.live : App.FileManager at /build/<compiler-generated>:0
2023-09-06T13:36:58.620597122Z 0x55ba0c0404fe, App.FileManager.contentsOfDirectory(atPath: Swift.String) throws -> Swift.Array<Swift.String> at Sources/App/Core/AppEnvironment.swift:258
2023-09-06T13:36:58.620625524Z 0x55ba0c0404fe, function signature specialization <Arg[0] = Dead> of static App.Analyze.trimCheckouts() throws -> () at /build/Sources/App/Commands/Analyze.swift:97
2023-09-06T13:36:58.620662926Z 0x55ba0c037a2c, static App.Analyze.trimCheckouts() throws -> () at /build/<compiler-generated>:0
2023-09-06T13:36:58.620772931Z 0x55ba0c037a2c, (3) suspend resume partial function for App.Analyze.Command.run(using: ConsoleKit.CommandContext, signature: App.Analyze.Command.Signature) async throws -> () at /build/Sources/App/Commands/Analyze.swift:64
2023-09-06T13:36:58.620926738Z 0x55ba0de0123e
2023-09-06T13:36:58.620943239Z 0x55ba0de0191f
2023-09-06T13:36:58.620947939Z 0x55ba0de162f8
2023-09-06T13:36:58.620951539Z 0x55ba0de1610e
2023-09-06T13:36:58.620955639Z 0x55ba0de21135
2023-09-06T13:36:58.620959240Z 0x7fb0beffd608
2023-09-06T13:36:58.621011242Z 0x7fb0becfc132
2023-09-06T13:36:58.621028243Z 0xffffffffffffffff
2023-09-06T13:36:58.763773210Z /bin/bash: line 1: 28473 Segmentation fault      (core dumped) ./Run analyze --env prod --limit 25
2023-09-06T14:02:12.447331859Z Received signal 11. Backtrace:

Note the timestamp of the last entry at 14:02 UTC (16:00 CET) is a single entry and then it hangs. Our alerts trigger when we've not seen logs for 20 mins, so that aligns with the alert at 16:18 CET.

I see signal 11 crashes every few hours but it seems they don't always lead to a hang. Haven't looked at the logs in more detail yet, attached below.

analyze-2023-09-06-stderr.logs.zip

analyze-2023-09-06.logs.zip

FYI @gwynne 😢

@gwynne
Copy link
Collaborator

gwynne commented Sep 6, 2023

@finestructure So, from that crash log, looks to me like odds are decent you're hitting the ARC miscompile crash in 5.8.1. My very unorthodox suggestion is to try building with a 5.9 snapshot, for three reasons:

  1. 5.9 is now (presumably) very close to release, and so can be reasonably considered at least hopefully stable. I've seen several people using it in production (at least for smaller stuff) already, and we're doing so in Vapor for generating DocC docs.
  2. 5.9 is known to have had the fix for the miscompile in question for at least 3 months now, whereas it's pretty much a foregone conclusion that it won't be backported to 5.8 at this point.
  3. 5.9 now has built-in Backtrace support, and Vapor has been updated accordingly, meaning if it does crash again, it's likely the log will be more helpful.

@finestructure
Copy link
Member Author

Mmm, we're actually compiling with a 5.9 toolchain, from Sep 1 or so.

https://gitlab.com/finestructure/spi-base/-/blob/main/Dockerfile

@gwynne
Copy link
Collaborator

gwynne commented Sep 6, 2023

Ooof... then my suggestion is to update your dependencies (and bump your 5.9 to the latest snapshot (which at the moment appears to be sha256:76911e1da04bce21683872c23033f6e2c07f296df7465fce0bf1e1bd01835001). This will get you the new native backtracking support (if you don't already have it), plus the updated Vapor that doesn't overwrite it with the old version.

@finestructure
Copy link
Member Author

That image is from yesterday, dependencies updated on Monday - we should be pretty much up to date, I'd say.

I'm not sure why the backtrace is only partially working. It's been like this for a long time. It's weird 🤔

@finestructure
Copy link
Member Author

(I'll def check in the morning that we've got the Vapor update you're referring to.)

@gwynne
Copy link
Collaborator

gwynne commented Sep 6, 2023

For reference, the update in question is Vapor 4.81.0, released ~12 hours ago at the time of this writing.

@finestructure
Copy link
Member Author

I just realised we'd updated to that latest Swift 5.9 and run the package updates but it's not deployed yet. It's the revision just after so what was running had just the ShellOut changes!

@finestructure
Copy link
Member Author

Another prod hang, 15:57 CET, Sep 9 2023. stderr is just

2023-09-09T10:01:16.343649916Z /bin/bash: line 1: 13830 Segmentation fault      (core dumped) ./Run analyze --env prod --limit 25

stdout logs below - I don't see any backtrace or other bits that indicate issues, except the infamous

2023-09-09T14:09:20.520006338Z [ ERROR ] Connection request (ID 25 timed out. This might indicate a connection deadlock in your application. If you have long-running requests, consider increasing your connection timeout. [component: server, database-id: psql]

analyze-2023-09-09-stderr.logs.zip

@finestructure
Copy link
Member Author

finestructure commented Sep 9, 2023

Hold on, that segfault is from 10:01 this morning, so certainly not the issue.

Again, it seems like the segfault/signal 11 is a red herring and the hangs are due to something else but who knows what's going on.

@finestructure
Copy link
Member Author

This is on prod with Vapor 4.81.0, compiled with Swift 5.9 from Sep 1.

 ################################################################
 #                                                              #
 # Swift Nightly Docker Image                                   #
 # Tag: swift-5.9-DEVELOPMENT-SNAPSHOT-2023-09-01-a                     #
 #                                                              #
 ################################################################

@gwynne
Copy link
Collaborator

gwynne commented Sep 9, 2023

@finestructure If the corefile (indicated by (core dumped)) is still present in the container that runs the command, it can be used to get a backtrace of the crash.

@finestructure
Copy link
Member Author

finestructure commented Sep 14, 2023

prod, 16:33 CET, Sep 14 2023

I missed the first alert and only checked on this now (21:37 CET (UTC+2)). I've pulled the logs and the closest Seg fault is at 13:28 UTC

2023-09-14T13:28:13.410953838Z /bin/bash: line 1: 29712 Segmentation fault      (core dumped) ./Run analyze --env prod --limit 25

with processing continuing until 14:18 UTC while the connection timeout messages leading to the hang are appearing:

2023-09-14T14:18:09.519182656Z [ ERROR ] Connection request (ID 6 timed out. This might indicate a connection deadlock in your application. If you have long-running requests, consider increasing your connection timeout. [component: server, database-id: psql]
2023-09-14T14:18:09.529500376Z [ ERROR ] Connection request (ID 7 timed out. This might indicate a connection deadlock in your application. If you have long-running requests, consider increasing your connection timeout. [component: server, database-id: psql]

The are 15 seg faults in the log file I pulled (ranging from Sep 11 to Sep 14). I think it's safe to say that the seg faults aren't the cause of the hangs.

I looked for a core file in the running container but there was none in the executable's directory nor in a few other places I checked (/var/cache/abrt, /var/spool/abrt, /var/crash). Core dump size it unlimited:

root@fce97f658c3d:/app# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited

It's late and so I restarted the container for now. Can look into where it ends up some other time - these seem to happen frequently enough (15 times in 3 days).

BTW, the logs do not contain any stack trace info, despite the latest Vapor and Swift 5.9 🤔:

2023-09-14T13:28:08.682435359Z [ INFO ] pulling https://github.com/pointfreeco/swift-dependencies.git in /checkouts/github.com-pointfreeco-swift-dependencies [component: analyze]
2023-09-14T13:28:08.684063885Z [ INFO ] pulling https://github.com/Alamofire/AlamofireImage.git in /checkouts/github.com-alamofire-alamofireimage [component: analyze]
2023-09-14T13:28:09.417655788Z [ WARNING ] stderr: From https://github.com/team-telnyx/telnyx-webrtc-ios
2023-09-14T13:28:09.417694189Z  * [new tag]         0.1.10     -> 0.1.10 [component: analyze]
2023-09-14T13:28:10.842271007Z [ INFO ] throttled 1 incoming revisions [component: analyze]
2023-09-14T13:28:10.937562626Z [ INFO ] throttled 1 incoming revisions [component: analyze]
2023-09-14T13:28:11.961565948Z [ INFO ] Updating 25 packages for stage 'analysis' (errors: 0) [component: analyze]
2023-09-14T13:28:13.410953838Z /bin/bash: line 1: 29712 Segmentation fault      (core dumped) ./Run analyze --env prod --limit 25
2023-09-14T13:28:33.536090365Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2023-09-14T13:28:33.592712560Z [ INFO ] Checkout directory: /checkouts [component: analyze]
2023-09-14T13:28:33.593526773Z [ INFO ] Updating 0 packages for stage 'analysis' (errors: 0) [component: analyze]
2023-09-14T13:28:55.544915191Z [ INFO ] Analyzing (limit: 25) ... [component: analyze]
2023-09-14T13:28:55.619826266Z [ INFO ] Checkout directory: /checkouts [component: analyze]

@gwynne
Copy link
Collaborator

gwynne commented Sep 14, 2023

@finestructure For next time, I'd suggest just searching the entire filesystem, e.g. find / -name core (if that finds nothing, I'd give it one more shot with a more permissive search, like find / -iname '*core*').

@finestructure
Copy link
Member Author

Closing this as fixed now - we haven't had a hang since we removed the TaskGroup in analysis in #2656 (released as 2.91.9).

Finally 🙂🎉

Huge thanks again to Gwynne for all the help!

finestructure added a commit that referenced this issue Nov 27, 2023
This reverts commit b00334a.
finestructure added a commit that referenced this issue Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants