Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[teraslice] - Add execution controller restart logic in kubernetesV2 #3740

Merged
merged 13 commits into from
Sep 11, 2024

Conversation

sotojn
Copy link
Contributor

@sotojn sotojn commented Sep 5, 2024

This PR makes the following changes:

New features

  • Teraslice running in KubernetesV2 will have the ability to restart in a new pod automatically under certain conditions
    • These conditions include receiving a SIGTERM signal not associated with a job shutdown, and the slicer being restartable.
  • Added new function isRestartable() to the base slicer
    • isRestartable() will return a boolean to tell wether the slicer is compatible with the restartable feature or not. This allows for an initial implementation for kafka slicers without having to worry about the complexity of other slicers.

refs: #1007

@sotojn sotojn added enhancement feedback needed k8s Applies to Teraslice in kubernetes cluster mode only. pkg/teraslice feature labels Sep 5, 2024
@sotojn sotojn self-assigned this Sep 5, 2024
@sotojn sotojn changed the title Add execution controller restart logic in kubernetesV2 [teraslice] - Add execution controller restart logic in kubernetesV2 Sep 5, 2024
@godber godber linked an issue Sep 6, 2024 that may be closed by this pull request
@sotojn
Copy link
Contributor Author

sotojn commented Sep 6, 2024

I did a test where I created a new version of kafka-assets that has the isRestartable() which returns true. I tested it with kafka-assets v5.0.1 which does not support restarting or has a isRestartable() function. I made a kafka topic with 500k records and made two jobs that read from kafka and write to elasticsearch.

List of assets I'm using: (command: curl localhost:5678/txt/assets)

name           version  id                                        _created                  description                     node_version  platform  arch
-------------  -------  ----------------------------------------  ------------------------  ------------------------------  ------------  --------  ----
standard       1.0.1    5ab91d2dd9b17e742b652e7d8c4798c54a5b65b0  2024-09-06T15:05:19.697Z  Teraslice standard processor a  18                          
kafka          5.1.0    1071af3d0c057c06cc98d95802ea8b7581faeaa1  2024-09-06T15:05:19.591Z  Kafka reader and writer suppor  18                          
kafka          5.0.1    dac9ba2450f65e1114bfb65bbc2f255c5102721b  2024-09-06T15:05:19.565Z  Kafka reader and writer suppor  18                          
elasticsearch  4.0.3    a84bc365d409d7da45572f1b2355ed4375ae757c  2024-09-06T15:05:19.460Z                                  18       

Jobfile testing new kafka that supports restarting:

{
    "name": "kafka-to-es-ex-restartable",
    "lifecycle": "persistent",
    "workers": 1,
    "assets": [
        "kafka:5.1.0",  <---- NOTE: This is the kafka-assets I made locally. I just labeled it as v5.1.0
        "elasticsearch"
    ],
    "operations": [
        {
            "_op": "kafka_reader",
            "topic": "test-topic-1",
            "group": "test-group-ex-restartable",
            "size": 25000
        },
        {
            "_op": "elasticsearch_bulk",
            "size": 25000,
            "index": "es-test-ex-restartable"
        }
    ]
}

Logs of execution pod that came up after deleting the execution pod with k8s api:

[2024-09-06T15:52:03.935Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-restartable-6f2163a1-b172-58p7j: execution storage initialized (assignment=execution_controller, module=ex_storage, worker_id=ofMt5b5X, ex_id=76a0f68a-22a9-4ac2-b6d4-973852bac167, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T15:52:03.935Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-restartable-6f2163a1-b172-58p7j: state storage initialized (assignment=execution_controller, module=state_storage, worker_id=ofMt5b5X, ex_id=76a0f68a-22a9-4ac2-b6d4-973852bac167, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T15:52:03.937Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-restartable-6f2163a1-b172-58p7j: Execution 76a0f68a-22a9-4ac2-b6d4-973852bac167 detected to have been restarted.. (assignment=execution_controller, module=execution_controller, worker_id=ofMt5b5X, ex_id=76a0f68a-22a9-4ac2-b6d4-973852bac167, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T15:52:03.937Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-restartable-6f2163a1-b172-58p7j: Execution 76a0f68a-22a9-4ac2-b6d4-973852bac167 is restarable and will continue reinitializing... (assignment=execution_controller, module=execution_controller, worker_id=ofMt5b5X, ex_id=76a0f68a-22a9-4ac2-b6d4-973852bac167, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T15:52:03.939Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-restartable-6f2163a1-b172-58p7j: [START] "elasticsearch_sender_api" api instance initialize (assignment=execution_controller, module=slicer_context, worker_id=ofMt5b5X, ex_id=76a0f68a-22a9-4ac2-b6d4-973852bac167, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)

Jobfile testing kafka v5.0.1 that doesn't support restarting/ doesn't have isRestartable() function:

{
    "name": "kafka-to-es-ex-not-restartable",
    "lifecycle": "persistent",
    "workers": 1,
    "assets": [
        "kafka:5.0.1",
        "elasticsearch"
    ],
    "operations": [
        {
            "_op": "kafka_reader",
            "topic": "test-topic-1",
            "group": "test-group-ex-not-restartable",
            "size": 25000
        },
        {
            "_op": "elasticsearch_bulk",
            "size": 25000,
            "index": "es-test-ex-not-restartable"
        }
    ]
}

Logs of execution pod that came up after deleting the execution pod with k8s api (kafka v5.0.1):

[2024-09-06T16:12:47.318Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: creating connection for elasticsearch-next (assignment=execution_controller)
[2024-09-06T16:12:47.381Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: Creating an opensearch client for elasticsearch v7 for backwards compatibility (assignment=execution_controller)
[2024-09-06T16:12:47.384Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: Creating an opensearch client for elasticsearch v7 for backwards compatibility (assignment=execution_controller)
[2024-09-06T16:12:47.395Z] DEBUG: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: client ad33afe1-bc5d-47e3-8e2a-2e0302dbb916 connect (assignment=execution_controller, module=messaging:client, worker_id=mhfM5qIu, ex_id=ad33afe1-bc5d-47e3-8e2a-2e0302dbb916, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T16:12:47.396Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: state storage initialized (assignment=execution_controller, module=state_storage, worker_id=mhfM5qIu, ex_id=ad33afe1-bc5d-47e3-8e2a-2e0302dbb916, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T16:12:47.396Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: execution storage initialized (assignment=execution_controller, module=ex_storage, worker_id=mhfM5qIu, ex_id=ad33afe1-bc5d-47e3-8e2a-2e0302dbb916, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T16:12:47.398Z]  WARN: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: Changing execution status from running to failed (assignment=execution_controller, module=execution_controller, worker_id=mhfM5qIu, ex_id=ad33afe1-bc5d-47e3-8e2a-2e0302dbb916, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
[2024-09-06T16:12:47.499Z] ERROR: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: Unable to verify execution on initialization (assignment=execution_controller, module=execution_controller, worker_id=mhfM5qIu, ex_id=ad33afe1-bc5d-47e3-8e2a-2e0302dbb916, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)
    Error: Execution ad33afe1-bc5d-47e3-8e2a-2e0302dbb916 was starting in running status, sending execution:finished event to cluster master
        at ExecutionController._verifyExecution (file:///app/source/packages/teraslice/dist/src/lib/workers/execution-controller/index.js:794:21)
        at async ExecutionController.initialize (file:///app/source/packages/teraslice/dist/src/lib/workers/execution-controller/index.js:150:24)
        at async Service.initialize (file:///app/source/packages/teraslice/worker-service.js:40:9)
        at async main (file:///app/source/packages/teraslice/worker-service.js:80:9)
[2024-09-06T16:12:47.499Z]  INFO: teraslice/10 on ts-exc-kafka-to-es-ex-not-restartable-6f2163a1-b172-vxg8w: shutting down. (assignment=execution_controller, module=ex_storage, worker_id=mhfM5qIu, ex_id=ad33afe1-bc5d-47e3-8e2a-2e0302dbb916, job_id=6f2163a1-b172-4cd8-8612-162797f76d8a)

End results of elasticsearch indices:

health status index                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   ts-dev1__ex                 qKPAy2a3RyWsgBxBGSs60Q   5   1          6            8    223.2kb        223.2kb
yellow open   es-test-ex-restartable      wFNtGZ3WRwirCcB9ePKgbA   1   1     500000            0    336.8mb        336.8mb
yellow open   ts-dev1__jobs               Sy_WS-VCSvSzXMuk1V5CfA   5   1          2            0     14.5kb         14.5kb
yellow open   ts-dev1__assets             8lg--vebTB-dKVkDUOrZoQ   5   1          4            0      4.4mb          4.4mb
yellow open   ts-dev1__state-2024.09      ckqVQPVfSe6eBrz6uZvflQ   5   1        118           18    369.3kb        369.3kb
yellow open   es-test-ex-not-restartable  vlqRkms0QyS3ZJbiR9LQvg   1   1     175000            0    337.5mb        337.5mb
yellow open   ts-dev1__analytics-2024.09  CNaz4BexTb-yzsDb-I1uNg   5   1        216            0      403kb          403kb

@sotojn
Copy link
Contributor Author

sotojn commented Sep 9, 2024

I've added new restart logic to the execution controller exit process that address some issues with the original implementation.

Now when the execution process is exiting it will check for three things to be true in order to restart:

  • If its running in kubernetesV2
    • This replaces sending an environment variable called ALLOW_EX_RESTART because it's clear it should only work with a specific backend.
  • if the reason for the exit is due to a SIGTERM event
  • If the execution _status is in a valid running state that is not stopping
    • This addresses an issue that was brought up where just checking for a running state doesn't account for all the other running statuses.
    • Running statuses according to the execution code are [recovering, running, failing, paused, stopping]
    • Link to statuses:
      const RUNNING_STATUS = ['recovering', 'running', 'failing', 'paused', 'stopping'];

If all of the above conditions are met, the execution will set it's status to paused before restarting. Then on startup if the execution is in a paused state, it will proceed to reinitialize if the slicer supports it.

@sotojn sotojn force-pushed the ex-restart-logic-k8s-V2 branch from e6e41fb to 35e4394 Compare September 9, 2024 18:20
Copy link
Member

@godber godber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss this.

@godber
Copy link
Member

godber commented Sep 9, 2024

I think we'll try and get away without using paused or implementing a new interrupted state. It may ultimately be the right thing to do, but I am concerned it will add more complexity to handle scenarios that are really quite rare. There is a kubernetes mechanism that will let us know if the the container is continuously crashing or "crash-looping" in k8s lingo. We will rely on that to let us know if an execution controller experienced repeated failures.

That being said, I definitely want to avoid re-using the paused state, so we would use interrupted instead if we did want to go this route.

Copy link
Member

@jsnoble jsnoble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, please update version though

@sotojn sotojn marked this pull request as ready for review September 11, 2024 16:48
Copy link
Member

@godber godber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please bump the teraslice minor version.

@sotojn sotojn force-pushed the ex-restart-logic-k8s-V2 branch from 7f2f717 to c55e25b Compare September 11, 2024 18:58
Copy link
Member

@godber godber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we have an unreleased 2.3.0, that's sufficient.

@godber godber merged commit 732dc3c into master Sep 11, 2024
66 checks passed
@godber godber deleted the ex-restart-logic-k8s-V2 branch September 11, 2024 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement feature feedback needed k8s Applies to Teraslice in kubernetes cluster mode only. pkg/teraslice
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make the execution controller restartable on kafka in k8s
3 participants