
Fix long shutdown times for Teraslice jobs #2106

Closed
godber opened this issue Aug 4, 2020 · 10 comments

godber (Member) commented Aug 4, 2020

This issue (kind of) replaces:

#942

We have jobs (mostly, if not all, kafka reader jobs) which take 5 minutes to shut down. I attempted to fix that with a controlled job shutdown here:

#2074

But @macgyver603 has confirmed that the problem still exists. We're going to have to look at the kafka asset now. I don't have any ideas beyond that.

godber (Member, Author) commented Aug 6, 2020

The current working hypothesis on this issue is that a rebalance comes in while line 62 (below) is trying to disconnect, resulting in an error that gets lost and prevents the disconnect:

https://github.com/terascope/kafka-assets/blob/3ea53de6d6c08c799b279904f203ce40da9ac6a4/asset/src/_kafka_clients/base-client.ts#L59-L66

So how should we retry this? There's a _try in that client. I don't see pRetry in use here.
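
For illustration only, here's a minimal sketch of what retrying the disconnect could look like with p-retry (the promise-based disconnect() and logger here are hypothetical stand-ins, not the actual kafka-assets API):

import pRetry from 'p-retry';

// Hypothetical promise-based disconnect; the real base-client wraps a callback API.
async function disconnectWithRetry(
    client: { disconnect(): Promise<void> },
    logger: { warn(msg: string): void }
): Promise<void> {
    await pRetry(() => client.disconnect(), {
        retries: 3,
        onFailedAttempt: (err) => {
            // e.g. a rebalance error arriving mid-disconnect
            logger.warn(`disconnect attempt ${err.attemptNumber} failed: ${err.message}`);
        },
    });
}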

peterdemartini (Contributor) commented Aug 6, 2020

We probably should just log the error, and not propagate it.
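
In code terms that would look roughly like the following (a sketch only, assuming a promise-based disconnect() and a bunyan-style logger on the client):

try {
    await this.client.disconnect();
} catch (err) {
    // Swallow the error so shutdown can proceed; a rebalance arriving
    // mid-disconnect shouldn't be able to block the worker from exiting.
    this.logger.warn(err, 'error during kafka client disconnect, ignoring');
}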

godber (Member, Author) commented Aug 6, 2020

It's possible this librdkafka issue (surfacing through node-rdkafka) is causing our problem here:

confluentinc/librdkafka#2615

We're going to build a new release of Teraslice that uses

https://github.com/edenhill/librdkafka/releases/tag/v1.4.0

as well as updates the Node 12 version (for a memory leak issue not yet reported in GitHub).

peterdemartini (Contributor) commented:

Ref: #2107

godber (Member, Author) commented Aug 7, 2020

@peterdemartini, the upgraded node-rdkafka version still has the 5m shutdown problem AND has stuck workers right off the bat. Here's the individual partition lag on a high-volume topic:

[Screenshot: Screen Shot 2020-08-06 at 7.03.30 PM — individual partition lag on a high-volume topic]

Can you build a v0.69.3 that reverts the node-rdkafka change?

godber (Member, Author) commented Apr 1, 2024

I am still seeing this issue, but in cases where kafka isn't involved. This is happening in very simple jobs that are data_generator to noop. If I have 200 workers, somewhere from 2 to 5 of the pods will sit in Terminating for several minutes after attempting to shut down safely, well after slice generation has completed. Reviewing the logs, we can see the termination messages and the slice completion, then the pod just sits there doing nothing.

godber (Member, Author) commented Nov 18, 2024

We were able to reproduce this again today:

Error: Timed out after 2m, waiting for message "worker:slice:complete"
 at Client.handleSendResponse (file:///app/source/packages/teraslice-messaging/dist/src/messenger/core.js:40:19)
 at runNextTicks (node:internal/process/task_queues:65:5)
 at process.processTimers (node:internal/timers:526:9)
 at async file:///app/source/packages/teraslice/dist/src/lib/workers/worker/index.js:286:17
 at async Timeout.<anonymous> (file:///app/source/packages/utils/dist/src/promises.js:172:32)

The above message was in a worker log, on a job running on Teraslice v2.7.0 using the kubernetes backend. The job was a kafka_consumer to noop job with 200 workers running in an internal dev cluster on real hardware. It happened on the first try on the kubernetes backend, but we were unable to reproduce it in 5 or 6 attempts on kubernetesV2. Just the one worker got stuck.

Since this appears to be messaging related, it's hard to say how the kubernetes backend version would impact it.

Here's the raw JSON log:

{
  "name": "teraslice",
  "assignment": "worker",
  "hostname": "ts-wkr-peter-kafka-to-noop-26dc9429-4e32-7fd4b7745f-kjf5w",
  "pid": 7,
  "module": "worker",
  "worker_id": "GH12Do3K",
  "ex_id": "9be4928f-ab40-4ab5-94b3-d13fd4c905a8",
  "job_id": "26dc9429-4e32-4829-9f96-b1cbfd4d52d8",
  "level": 50,
  "err": {
    "message": "Timed out after 2m, waiting for message \"worker:slice:complete\"",
    "name": "Error",
    "stack": "Error: Timed out after 2m, waiting for message \"worker:slice:complete\"\n at Client.handleSendResponse (file:///app/source/packages/teraslice-messaging/dist/src/messenger/core.js:40:19)\n at runNextTicks (node:internal/process/task_queues:65:5)\n at process.processTimers (node:internal/timers:526:9)\n at async file:///app/source/packages/teraslice/dist/src/lib/workers/worker/index.js:286:17\n at async Timeout.<anonymous> (file:///app/source/packages/utils/dist/src/promises.js:172:32)"
  },
  "msg": "slice e22f89fd-be83-48a8-8627-7beb19b9432c run error",
  "time": "2024-11-18T17:59:31.255Z",
  "v": 0
}

godber (Member, Author) commented Nov 18, 2024

Our interpretation here is that the worker is waiting for the execution controller to acknowledge the slice complete message, but the execution controller has exited and will never respond.

busma13 (Contributor) commented Nov 20, 2024

Error related to the long shutdown:

Error: Timed out after 2m, waiting for message "worker:slice:complete"
 at Client.handleSendResponse (file:///app/source/packages/teraslice-messaging/dist/src/messenger/core.js:40:19)
 at runNextTicks (node:internal/process/task_queues:65:5)
 at process.processTimers (node:internal/timers:526:9)
 at async file:///app/source/packages/teraslice/dist/src/lib/workers/worker/index.js:286:17
 at async Timeout.<anonymous> (file:///app/source/packages/utils/dist/src/promises.js:172:32)

It looks like the worker finishes a slice at exactly the wrong moment: the client checks whether the server is up, sees that it is, and sends a worker:slice:complete message, but the server shuts down before it receives the message. The client then sits and waits until the timeout, not knowing that the server is gone.

I am adding a special listener in #3838 that will abort this wait immediately if the server:shutdown event is emitted.
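
Conceptually the change looks something like this (a simplified sketch, not the actual teraslice-messaging code; Node's events.once() with an AbortSignal stands in for the messenger's onceWithTimeout(), and the event names are illustrative):

import { EventEmitter, once } from 'node:events';

const abortController = new AbortController();
const socket = new EventEmitter(); // stands in for the messaging client

// If the server announces shutdown, stop waiting for a response.
socket.on('server:shutdown', () => abortController.abort());

async function waitForSliceCompleteResponse(): Promise<void> {
    try {
        // Wait for the execution controller to acknowledge the slice.
        await once(socket, 'worker:slice:complete:response', { signal: abortController.signal });
    } catch (err) {
        if ((err as Error)?.name === 'AbortError') {
            // The server went away; bail out instead of blocking until the 2m timeout.
            return;
        }
        throw err;
    }
}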

godber added this to the Teraslice 2.8.0 milestone Nov 22, 2024
godber pushed a commit that referenced this issue Nov 22, 2024
…dSliceComplete() (#3838)

This PR makes the following changes:

- The execution-controller `Client` now creates an `abortController`.
Every call to `Client.sendSliceComplete()` now includes a reference to
`abortController.signal` allowing for the `pEvent` in the messenger
`Core.onceWithTimeout()` to be aborted.
- A new listener is added for the `server:shutdown` event that calls
`abortController.abort()`. This prevents a worker from still waiting for
a response to `sendSliceComplete()` after the server has shut down.
- Bump teraslice-messaging from 1.7.0 to 1.7.1

ref: #2106
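
As a rough illustration of the shape of that change (hypothetical and condensed from the description above, not the actual Client code):

class ExecutionControllerClient {
    private abortController = new AbortController();

    constructor(
        private messenger: {
            on(event: string, fn: () => void): void;
            send(event: string, payload: unknown, opts: { signal: AbortSignal }): Promise<unknown>;
        }
    ) {
        // When the execution controller announces shutdown, abort any
        // in-flight waits so a worker doesn't hang in sendSliceComplete().
        this.messenger.on('server:shutdown', () => this.abortController.abort());
    }

    async sendSliceComplete(payload: unknown): Promise<unknown> {
        // The signal lets the underlying onceWithTimeout() wait be cancelled early.
        return this.messenger.send('worker:slice:complete', payload, {
            signal: this.abortController.signal,
        });
    }
}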
godber (Member, Author) commented Dec 9, 2024

I think this specific scenario was addressed, but since #3856 was filed, there may be a new one. We'll use that issue, though.

godber closed this as completed Dec 9, 2024