Fix long shutdown times for Teraslice jobs #2106
The current working hypothesis on this issue is that a rebalance comes in while line 62 (below) is trying to disconnect, resulting in an error that is lost and prevents the disconnect. So how should we retry this? There's a
We probably should just log the error and not propagate it.
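To illustrate the "log, don't propagate" approach, here is a minimal sketch assuming a node-rdkafka `KafkaConsumer`; `disconnectSafely` and the logger shape are hypothetical helpers for illustration, not the actual kafka-assets code.

```typescript
import { KafkaConsumer } from 'node-rdkafka';

// Hypothetical helper (not the real kafka-assets code): disconnect the
// consumer, but log any error instead of propagating it, so an error raised
// by a rebalance arriving mid-disconnect cannot wedge the worker shutdown.
function disconnectSafely(
    consumer: KafkaConsumer,
    logger: { warn: (msg: string, err?: unknown) => void }
): Promise<void> {
    return new Promise((resolve) => {
        consumer.disconnect((err) => {
            if (err) {
                // Swallow the error after logging it; rethrowing here is what
                // appears to leave the job hanging until the shutdown timeout.
                logger.warn('error disconnecting kafka consumer', err);
            }
            resolve();
        });
    });
}
```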
It's possible this node-rdkafka issue is causing our issue here. We're going to build a new release of Teraslice that uses https://github.com/edenhill/librdkafka/releases/tag/v1.4.0, as well as updates the Node 12 version (for a memory leak issue not yet reported in GitHub).
Ref: #2107
@peterdemartini, the upgraded node-rdkafka version still has the 5m shutdown problem AND has stuck workers right off the bat. Here's the individual partition lag on a high-volume topic. Can you build a
I am still seeing this issue, but in cases where Kafka isn't involved. This is happening in very simple jobs that are
We were able to reproduce this again today:
The above message was in a worker log, on a job running on Teraslice. Since this appears to be messaging related, it's hard to say how the Kubernetes backend version would impact it. Here's the raw JSON log:

```json
{
"name": "teraslice",
"assignment": "worker",
"hostname": "ts-wkr-peter-kafka-to-noop-26dc9429-4e32-7fd4b7745f-kjf5w",
"pid": 7,
"module": "worker",
"worker_id": "GH12Do3K",
"ex_id": "9be4928f-ab40-4ab5-94b3-d13fd4c905a8",
"job_id": "26dc9429-4e32-4829-9f96-b1cbfd4d52d8",
"level": 50,
"err": {
"message": "Timed out after 2m, waiting for message \"worker:slice:complete\"",
"name": "Error",
"stack": "Error: Timed out after 2m, waiting for message \"worker:slice:complete\"\n at Client.handleSendResponse (file:///app/source/packages/teraslice-messaging/dist/src/messenger/core.js:40:19)\n at runNextTicks (node:internal/process/task_queues:65:5)\n at process.processTimers (node:internal/timers:526:9)\n at async file:///app/source/packages/teraslice/dist/src/lib/workers/worker/index.js:286:17\n at async Timeout.<anonymous> (file:///app/source/packages/utils/dist/src/promises.js:172:32)"
},
"msg": "slice e22f89fd-be83-48a8-8627-7beb19b9432c run error",
"time": "2024-11-18T17:59:31.255Z",
"v": 0
}
```
Our interpretation here is that the worker is waiting for the execution controller to acknowledge the `worker:slice:complete` message, but the execution controller has exited and will never respond.
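To make that failure mode concrete, here is a minimal sketch using Node's built-in `events.once()` and `AbortSignal.timeout()` (Node 17.3+) as a stand-in for the messenger's `Core.onceWithTimeout()`; `waitForAck` and the emitter setup are illustrative, not the teraslice-messaging API.

```typescript
import { EventEmitter, once } from 'node:events';

// Illustrative stand-in for Core.onceWithTimeout(): wait for the response to
// a sent message, rejecting only after the full timeout has elapsed.
async function waitForAck(
    emitter: EventEmitter,
    eventName: string,
    timeoutMs: number
): Promise<unknown[]> {
    // events.once() rejects with an AbortError when the signal fires.
    return once(emitter, eventName, { signal: AbortSignal.timeout(timeoutMs) });
}

// If the execution controller exits right after the worker sends
// "worker:slice:complete", nothing ever emits the response event, so this
// waits the entire two minutes before failing, matching the log above.
const messenger = new EventEmitter();
waitForAck(messenger, 'worker:slice:complete', 2 * 60 * 1000)
    .catch((err) => console.error('Timed out waiting for "worker:slice:complete"', err));
```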
Error related to the long shutdown:
It looks like the worker finishes a slice at the perfect moment: the client checks if the server is up and it is, so it sends a `worker:slice:complete` message and waits for a response that will never come. I am adding a special listener in #3838 that will abort this process immediately if the `server:shutdown` event is received.
…dSliceComplete() (#3838)

This PR makes the following changes:

- The execution-controller `Client` now creates an `abortController`. Every call to `Client.sendSliceComplete()` now includes a reference to `abortController.signal`, allowing the `pEvent` in the messenger `Core.onceWithTimeout()` to be aborted.
- A new listener is added for the `server:shutdown` event that calls `abortController.abort()`. This prevents a worker from still waiting for a response to `sendSliceComplete()` after the server has shut down.
- Bump teraslice-messaging from 1.7.0 to 1.7.1.

ref: #2106
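As a rough illustration of the pattern those notes describe, here is a sketch using Node's built-in `events.once()` and `AbortSignal.any()` (Node 20.3+) in place of `pEvent`; the class shape, method body, and event names are illustrative, not the actual teraslice-messaging implementation.

```typescript
import { EventEmitter, once } from 'node:events';

// Sketch of the abort wiring described above; not the real Client class.
class Client extends EventEmitter {
    private abortController = new AbortController();

    constructor() {
        super();
        // When the server announces shutdown, abort any in-flight waits so the
        // worker stops waiting for a response that will never arrive.
        this.on('server:shutdown', () => this.abortController.abort());
    }

    async sendSliceComplete(sliceId: string): Promise<void> {
        // ...send the "worker:slice:complete" message here...
        // Then wait for the response, but give up immediately if either the
        // timeout fires or the shutdown abort signal is triggered.
        const signal = AbortSignal.any([
            this.abortController.signal,
            AbortSignal.timeout(2 * 60 * 1000),
        ]);
        await once(this, `slice:ack:${sliceId}`, { signal });
    }
}
```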
I think this specific scenario was addressed, but since #3856 was filed, there may be a new one. We'll continue in that issue.
This issue (kind of) replaces:
#942
We have jobs (mostly, if not all, Kafka reader jobs) which take 5 minutes to shut down. I attempted to fix that with a controlled job shutdown here:
#2074
But @macgyver603 has confirmed that the problem still exists. We're going to have to look at the kafka asset now. I don't have any ideas beyond that.