
Fix long shutdown times for Teraslice jobs #2106

Closed
godber opened this issue Aug 4, 2020 · 10 comments

godber (Member) commented Aug 4, 2020

This issue (kind of) replaces:

#942

We have jobs (mostly, if not all, kafka reader jobs) which take 5 minutes to shut down. I attempted to fix that with a controlled job shutdown here:

#2074

But @macgyver603 has confirmed that the problem still exists. We're going to have to look at the kafka asset now. I don't have any ideas beyond that.

godber (Member, Author) commented Aug 6, 2020

The current working hypothesis on this issue is that a rebalance comes in while line 62 (below) is trying to disconnect, resulting in an error that gets lost and prevents the disconnect:

https://github.com/terascope/kafka-assets/blob/3ea53de6d6c08c799b279904f203ce40da9ac6a4/asset/src/_kafka_clients/base-client.ts#L59-L66

So how should we retry this? There's a _try in that client. I don't see pRetry in use here.
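
For illustration only, here's a minimal sketch of what retrying the disconnect could look like with p-retry (the promise-based disconnect() and logger here are hypothetical stand-ins, not the actual kafka-assets API):

import pRetry from 'p-retry';

// Hypothetical promise-based disconnect; the real base-client wraps a callback API.
async function disconnectWithRetry(
    client: { disconnect(): Promise<void> },
    logger: { warn(msg: string): void }
): Promise<void> {
    await pRetry(() => client.disconnect(), {
        retries: 3,
        onFailedAttempt: (err) => {
            // e.g. a rebalance error arriving mid-disconnect
            logger.warn(`disconnect attempt ${err.attemptNumber} failed: ${err.message}`);
        },
    });
}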

peterdemartini (Contributor) commented Aug 6, 2020

We probably should just log the error, and not propagate it.
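
In code terms that would look roughly like the following (a sketch only, assuming a promise-based disconnect() and a bunyan-style logger on the client):

try {
    await this.client.disconnect();
} catch (err) {
    // Swallow the error so shutdown can proceed; a rebalance arriving
    // mid-disconnect shouldn't be able to block the worker from exiting.
    this.logger.warn(err, 'error during kafka client disconnect, ignoring');
}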

godber (Member, Author) commented Aug 6, 2020

It's possible this librdkafka issue (surfacing through node-rdkafka) is causing our problem here:

confluentinc/librdkafka#2615

We're going to build a new release of Teraslice that uses

https://github.com/edenhill/librdkafka/releases/tag/v1.4.0

as well as updates the Node 12 version (for a memory leak issue not yet reported in GitHub).

peterdemartini (Contributor) commented:

Ref: #2107

godber (Member, Author) commented Aug 7, 2020

@peterdemartini, the upgraded node-rdkafka version still has the 5m shutdown problem AND has stuck workers right off the bat. Here's the individual partition lag on a high-volume topic:

[Screenshot: Screen Shot 2020-08-06 at 7.03.30 PM — individual partition lag on a high-volume topic]

Can you build a v0.69.3 that reverts the node-rdkafka change?

godber (Member, Author) commented Apr 1, 2024

I am still seeing this issue, but in cases where kafka isn't involved. This is happening in very simple jobs that are data_generator to noop. If I have 200 workers, somewhere from 2 to 5 of the pods will sit in Terminating for several minutes after attempting to shut down safely, well after slice generation has completed. Reviewing the logs, we can see the termination messages and the slice completion, then the pod just sits there doing nothing.

godber (Member, Author) commented Nov 18, 2024

We were able to reproduce this again today:

Error: Timed out after 2m, waiting for message "worker:slice:complete"
 at Client.handleSendResponse (file:///app/source/packages/teraslice-messaging/dist/src/messenger/core.js:40:19)
 at runNextTicks (node:internal/process/task_queues:65:5)
 at process.processTimers (node:internal/timers:526:9)
 at async file:///app/source/packages/teraslice/dist/src/lib/workers/worker/index.js:286:17
 at async Timeout.<anonymous> (file:///app/source/packages/utils/dist/src/promises.js:172:32)

The above message was in a worker log, on a job running on Teraslice v2.7.0 using the kubernetes backend. The job was a kafka_consumer to noop job with 200 workers running in an internal dev cluster on real hardware. It happened on the first try on the kubernetes backend, but we were unable to reproduce it in 5 or 6 attempts on kubernetesV2. Just the one worker got stuck.

Since this appears to be messaging related, it's hard to say how the kubernetes backend version would impact it.

Here's the raw JSON log:

{
  "name": "teraslice",
  "assignment": "worker",
  "hostname": "ts-wkr-peter-kafka-to-noop-26dc9429-4e32-7fd4b7745f-kjf5w",
  "pid": 7,
  "module": "worker",
  "worker_id": "GH12Do3K",
  "ex_id": "9be4928f-ab40-4ab5-94b3-d13fd4c905a8",
  "job_id": "26dc9429-4e32-4829-9f96-b1cbfd4d52d8",
  "level": 50,
  "err": {
    "message": "Timed out after 2m, waiting for message \"worker:slice:complete\"",
    "name": "Error",
    "stack": "Error: Timed out after 2m, waiting for message \"worker:slice:complete\"\n at Client.handleSendResponse (file:///app/source/packages/teraslice-messaging/dist/src/messenger/core.js:40:19)\n at runNextTicks (node:internal/process/task_queues:65:5)\n at process.processTimers (node:internal/timers:526:9)\n at async file:///app/source/packages/teraslice/dist/src/lib/workers/worker/index.js:286:17\n at async Timeout.<anonymous> (file:///app/source/packages/utils/dist/src/promises.js:172:32)"
  },
  "msg": "slice e22f89fd-be83-48a8-8627-7beb19b9432c run error",
  "time": "2024-11-18T17:59:31.255Z",
  "v": 0
}

godber (Member, Author) commented Nov 18, 2024

Our interpretation here is that the worker is waiting for the execution controller to acknowledge the slice complete message, but the execution controller has exited and will never respond.

busma13 (Contributor) commented Nov 20, 2024

Error related to the long shutdown:

Error: Timed out after 2m, waiting for message "worker:slice:complete"
 at Client.handleSendResponse (file:///app/source/packages/teraslice-messaging/dist/src/messenger/core.js:40:19)
 at runNextTicks (node:internal/process/task_queues:65:5)
 at process.processTimers (node:internal/timers:526:9)
 at async file:///app/source/packages/teraslice/dist/src/lib/workers/worker/index.js:286:17
 at async Timeout.<anonymous> (file:///app/source/packages/utils/dist/src/promises.js:172:32)

It looks like the worker finishes a slice at exactly the wrong moment: the client checks whether the server is up, sees that it is, and sends a worker:slice:complete message, but the server shuts down before it receives the message. The client then sits and waits until the timeout, not knowing that the server is gone.

I am adding a special listener in #3838 that will abort this wait immediately if the server:shutdown event is emitted.
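
Conceptually the change looks something like this (a simplified sketch, not the actual teraslice-messaging code; Node's events.once() with an AbortSignal stands in for the messenger's onceWithTimeout(), and the event names are illustrative):

import { EventEmitter, once } from 'node:events';

const abortController = new AbortController();
const socket = new EventEmitter(); // stands in for the messaging client

// If the server announces shutdown, stop waiting for a response.
socket.on('server:shutdown', () => abortController.abort());

async function waitForSliceCompleteResponse(): Promise<void> {
    try {
        // Wait for the execution controller to acknowledge the slice.
        await once(socket, 'worker:slice:complete:response', { signal: abortController.signal });
    } catch (err) {
        if ((err as Error)?.name === 'AbortError') {
            // The server went away; bail out instead of blocking until the 2m timeout.
            return;
        }
        throw err;
    }
}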

godber added this to the Teraslice 2.8.0 milestone Nov 22, 2024
godber pushed a commit that referenced this issue Nov 22, 2024
…dSliceComplete() (#3838)

This PR makes the following changes:

- The execution-controller `Client` now creates an `abortController`.
Every call to `Client.sendSliceComplete()` now includes a reference to
`abortController.signal` allowing for the `pEvent` in the messenger
`Core.onceWithTimeout()` to be aborted.
- A new listener is added for the `server:shutdown` event that calls
`abortController.abort()`. This prevents a worker from still waiting for
a response to `sendSliceComplete()` after the server has shut down.
- Bump teraslice-messaging from 1.7.0 to 1.7.1

ref: #2106
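
As a rough illustration of the shape of that change (hypothetical and condensed from the description above, not the actual Client code):

class ExecutionControllerClient {
    private abortController = new AbortController();

    constructor(
        private messenger: {
            on(event: string, fn: () => void): void;
            send(event: string, payload: unknown, opts: { signal: AbortSignal }): Promise<unknown>;
        }
    ) {
        // When the execution controller announces shutdown, abort any
        // in-flight waits so a worker doesn't hang in sendSliceComplete().
        this.messenger.on('server:shutdown', () => this.abortController.abort());
    }

    async sendSliceComplete(payload: unknown): Promise<unknown> {
        // The signal lets the underlying onceWithTimeout() wait be cancelled early.
        return this.messenger.send('worker:slice:complete', payload, {
            signal: this.abortController.signal,
        });
    }
}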
godber (Member, Author) commented Dec 9, 2024

I think this specific scenario was addressed, but since #3856 was filed, there may be a new one. We'll use that issue, though.

godber closed this as completed Dec 9, 2024