
Support embedded NATS as alternate cluster option to etcd #7451

Open
bruth opened this issue May 8, 2023 · 22 comments

@bruth
Contributor

bruth commented May 8, 2023

Is your feature request related to a problem? Please describe.

Currently, embedded HA is supported only by etcd. With embedded NATS support added to Kine (as of v0.10.0/v0.10.1), NATS could be another option, since it supports native clustering as well.

Describe the solution you'd like

Add native support for NATS as an alternative cluster option when doing --cluster-init.

Describe alternatives you've considered

There are no other native options. However, using an external NATS configuration (via --datastore-endpoint), the nodes can be clustered without the k3s layer being aware of it. This provides HA/FT for the KV data, but k3s is not technically running in clustered mode.
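For illustration, the external-NATS alternative might look roughly like the following on each server node (the endpoint hostname, token, and any query parameters are placeholders, not a tested configuration; the Kine NATS driver docs define the options it actually accepts):

```sh
# Each k3s server points at an externally managed NATS cluster via Kine.
# Hostname, port, and token are placeholders.
k3s server \
  --token=<shared-secret> \
  --datastore-endpoint="nats://nats-1.example.internal:4222"
```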

Additional context

I plan on contributing this, but any guidance or things to be aware of is welcome!

@gedw99

gedw99 commented May 15, 2023

Looking forward to seeing this

@brandond brandond added this to the v1.27.3+k3s1 milestone May 15, 2023
@brandond brandond moved this from New to Backlog in K3s Development May 15, 2023
@brandond
Member

cc @rancher-max @cwayne18

@rancher-max
Contributor

This is cool and a great feature suggestion! Thank you!

I have some clarifying questions to determine how deep down the proverbial rabbit hole we should go:

  1. Is k3s expected to supply backup/restore functionality?
    a. Would this extend cluster-reset/cluster-reset-restore-path functionality?
    b. Would it be a new command?
    c. Does it follow nats' approach or is it done differently?
  2. Should an operator be able to run NATS in their cluster while also using it as the embedded datastore?
  3. Should NATS certs be rotated during manual certificate rotation?
    a. What is the expectation when an operator provides their own certs? Ref: https://docs.k3s.io/cli/certificate#using-custom-ca-certificates and specifically the note: etcd files are required even if embedded etcd is not in use.

@brandond
Member

brandond commented May 15, 2023

Those are all good questions!

At the moment I see the embedded NATS as a replacement for sqlite only; while it is possible to host a multi-node cluster using the embedded NATS server, @bruth or someone on his team will need to provide instructions on how to set this up, as I believe it requires a user-managed config file.

If it is desired that K3s support multi-server clusters by managing the configuration and cluster membership, allowing backup/restore with the embedded NATS datastore, and providing everything else needed for complete parity with the embedded etcd datastore, I think that would also need to be driven by someone on the Synadia side.

@gedw99

gedw99 commented May 16, 2023

I agree that some ops aspects need to be added or documented.

@bruth
Contributor Author

bruth commented May 16, 2023

need to provide instructions on how to set this up as I believe it requires a user-managed config file to accomplish.

This can be accomplished programmatically, without config files, for this particular setup. The Kine integration relies on the NATS server package, which exposes all of the configuration options programmatically.

Since this would be a k3s feature, we would likely need to add support for additional query params on the Kine endpoint to indicate "cluster mode", for example. But that design can be worked out so users don't need to manually define config files. A config file should be opt-in for those who want more control, not required.
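As a rough sketch of what "programmatically" means here (not the actual Kine wiring; the option values, node names, and store path below are placeholders):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats-server/v2/server"
)

func main() {
	// Illustrative cluster options only; Kine would derive its real configuration
	// from the datastore endpoint rather than hard-coding values like these.
	opts := &server.Options{
		ServerName: "k3s-node-1",
		Port:       4222,
		JetStream:  true,
		StoreDir:   "/var/lib/rancher/k3s/server/nats", // placeholder path
		Cluster: server.ClusterOpts{
			Name: "k3s",
			Host: "0.0.0.0",
			Port: 6222,
		},
		Routes: server.RoutesFromStr("nats://node-2:6222,nats://node-3:6222"),
	}
	ns, err := server.NewServer(opts)
	if err != nil {
		log.Fatal(err)
	}
	go ns.Start()
	if !ns.ReadyForConnections(10 * time.Second) {
		log.Fatal("embedded NATS server did not become ready")
	}
	// ...hand the server's client URL to Kine/k3s from here...
}
```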

the other stuff that would provide complete parity with the embedded etcd datastore, I think that would also need to be driven by someone on the Synadia side.

That is the intent for sure and why I am looking for guidance to understand the scope of complete parity! I don't want to boil the ocean in one pass if there is too much, but this is a good first list.

  1. Is k3s expected to supply backup/restore functionality?

If this functionality sits behind an interface, then we can hook in NATS' standard method of backing up and restoring stream/consumer state. I will need to read up on what k3s does today to compare.

  2. Should an operator be able to run NATS in their cluster while also using it as the embedded datastore?

They certainly should be able to run an additional server/cluster in k3s itself, independent of the embedded one, if they choose to. They shouldn't need to, however; I could understand the argument that operators don't want to mix k3s and application concerns, or risk applications impacting the embedded server/cluster, and would prefer a clear boundary.

One could say the same about etcd, but one distinction with NATS is that, with its multi-tenancy support, the k3s/kine state and messaging would be completely isolated from any applications.
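To make that isolation concrete, a minimal sketch of NATS accounts in server configuration (account and user names are made up; this is not configuration that Kine generates today):

```
# Streams, KV buckets, and subjects owned by KINE are invisible to APPS, and vice versa.
accounts {
  KINE: {
    jetstream: enabled
    users: [ { user: kine, password: "<secret>" } ]
  }
  APPS: {
    jetstream: enabled
    users: [ { user: app, password: "<secret>" } ]
  }
}
```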

In terms of recommended approaches, having a set of use cases and/or considerations for whether to reuse the embedded cluster vs. running another container should be sufficient for people to make that decision.

  3. Should NATS certs be rotated during manual certificate rotation?

Based on the link, it looks like k3s is temporarily shut down to do the cert rotation? That would work for NATS as well. Custom CAs can also be set in the NATS config.
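For reference, the manual rotation flow in the linked docs is roughly the following (paraphrased; whether NATS certs should be covered by the same command is the open question here):

```sh
# Manual certificate rotation as documented for k3s: stop the service,
# rotate, then start it again.
systemctl stop k3s
k3s certificate rotate
systemctl start k3s
```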

@VestigeJ VestigeJ self-assigned this Jun 1, 2023
@bruth
Contributor Author

bruth commented Jun 5, 2023

Hey @VestigeJ, I saw you assigned this to yourself! Are you actively working on this or interested in collaborating?

@VestigeJ

VestigeJ commented Jun 5, 2023

Hey @bruth, I DM'd you back on your home Slack. If you want to work together, I'd be more than happy to. :)

@VestigeJ

VestigeJ commented Sep 7, 2023

@bruth Did this get put onto a back burner on the Synadia side?

@brandond
Member

brandond commented Sep 8, 2023

@udf2457

udf2457 commented Oct 9, 2023

@VestigeJ if it has been put on a back burner, then it would be very unfortunate that @bruth chose to highlight it on a recent podcast.

@brandond
Member

brandond commented Oct 9, 2023

@udf2457 that comment is probably best directed at @bruth himself, not anyone on the K3s team. NATS support is maintained by the Synadia folks.

@bruth
Contributor Author

bruth commented Oct 9, 2023

@udf2457 This was a temporary back burner; focus has been on the NATS 2.10 release for the past couple of months. The Kine PR works, but there are a couple of remaining subtle recovery issues to address (likely tweaking a couple of timeouts). Now that 2.10 is out, focus is shifting back, and I will have an update next week.

@bruth
Contributor Author

bruth commented Oct 26, 2023

Hey folks, just giving a quick update so it doesn't get lost in the void again. I made some more progress today on the Kine PR (k3s-io/kine#194), including porting the client code to the new JetStream API. I am debugging a few remaining things, but planning to have it ready for review and merge early next week.
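For context on the API change mentioned above, the port is roughly from the legacy JetStreamContext to the newer jetstream package in nats.go; a simplified sketch (not the actual Kine code):

```go
package example

import (
	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

// connectJetStream shows the new-style client setup; the legacy
// equivalent was js, err := nc.JetStream().
func connectJetStream(url string) (jetstream.JetStream, error) {
	nc, err := nats.Connect(url)
	if err != nil {
		return nil, err
	}
	return jetstream.New(nc)
}
```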

As it pertains to this issue, it will support HA mode without needing to change anything in k3s itself. This is a simpler option and a better outcome IMO, given how intertwined etcd is as a dependency (outside of Kine).

Regarding backup/restore, this can be achieved out-of-band using standard NATS utilities. If there is a strong desire to get them baked into k3s utilities, I am happy to move that along.
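For example, with the standard nats CLI, something along these lines should work; the stream name backing the Kine bucket is an assumption here, and exact arguments can vary between CLI versions:

```sh
# List streams to find the one backing Kine's data, then back it up.
nats stream ls
nats stream backup KV_kine ./kine-backup   # stream name is an assumption
# Later, restore into a fresh server/cluster.
nats stream restore ./kine-backup
```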

@bruth
Contributor Author

bruth commented Nov 1, 2023

Converted k3s-io/kine#194 to ready for review. There are some final bits to clean up and a couple of failure cases to test, but it is in a good spot. Docs will come in the next couple of days.

@brandond brandond moved this from Backlog to To Test in K3s Development Nov 14, 2023
@brandond brandond modified the milestones: Backlog, v1.28.4+k3s1 Nov 14, 2023
@brandond brandond self-assigned this Nov 14, 2023
@brandond brandond moved this from To Test to Next Up in K3s Development Nov 16, 2023
@brandond brandond modified the milestones: v1.28.4+k3s1, v1.29.0+k3s1 Nov 16, 2023
@brandond
Member

brandond commented Nov 16, 2023

Bumping this back out; embedded NATS support is still disabled behind a build flag. We'll need to add -tags nats to the K3s build flags to enable it.

At the moment nats only supports external servers.
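For reference, at the Go tool level the tag just needs to reach the compiler; how the K3s build scripts pass extra tags through is not shown here and would need checking:

```sh
# Illustrative only: the embedded NATS driver in Kine is compiled in when the
# `nats` build tag is set. Real builds go through the normal K3s build scripts.
go build -tags nats ./...
```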

@bruth
Contributor Author

bruth commented Nov 16, 2023

@brandond Other than documentation, what would be helpful to have this be supported in v1.29?

@brandond
Member

Docs would be good, and maybe get a PR open now to add the build flag so we can see what the current size impact is?

@dereknola
Member

Looks like it adds about 2MB to the K3s size. I'm seeing the binary go from 58MB to 60MB:

derek@degion:~/rancher/k3s$ ls -lh ./dist/artifacts/
total 247M
-rwxr-xr-x 1 derek derek  60M Nov 16 09:54 k3s

@VestigeJ

Testing note: currently stalled for the December or January releases.

@caroline-suse-rancher caroline-suse-rancher moved this from Next Up to Stalled in K3s Development Jan 3, 2024
@brandond brandond modified the milestones: v1.29.2+k3s1, v1.29.3+k3s1 Feb 13, 2024
@caroline-suse-rancher caroline-suse-rancher removed this from the v1.29.3+k3s1 milestone Mar 27, 2024
@m3nowak

m3nowak commented Oct 12, 2024

@brandond Is this feature still planned?

@brandond
Member

Conformance tests need to pass first:

Labels: none yet
Project status: Stalled
9 participants