-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: prevent false positives in already running k0s instance detection #5400
Conversation
e19014e
to
de62d7c
Compare
The main bugfix is to check if a possibly running pid from the runtime config not only exists as a process, but also is the same excutable image that is checking. All platforms use a common library for ps-style checks now in this implementation. This changes the behavior for all platforms to use a common library to perform the "is k0s running" checks on startup. The advantage is, that the code is much more simple, the disadvantage that an additional library is now required with additional sub-dependencies. This commit also refactors the function checkPid into checkInstanceRunning, since the previous implementation did not allow for more complex error handling and would have hidden errors. More complex error handling is now required because the implementation is more complex. Signed-off-by: Peter Honeder <[email protected]>
de62d7c
to
0f4f23e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! And double thanks for providing a PR with thorough comments.
When running k0s in one of our deployments, we noticed that during startup with an existing runtime config file, it thinks it's running, but it isn't since the system just started up from scratch.
This is interesting, as the runtime config path is usually stored below /run
, which is usually a tmpfs and shouldn't survive reboots.
So what happens is, that if any other process during startup takes up that pid value from the runtime config, k0s thinks it is already running although it isn't.
The good ol' problem with PID file races. This has been addressed in in other parts of k0s already, but it's still a problem here.
I guess an even better method to solve this would be to hold a lock file instead of writing a pid value in the yaml, [...]
This is related to #1434, which is about replacing the runtime config file entirely by using the k0s status socket.
[...] but the fix right now addresses the topic by checking if the pid value is from the same executable image, and if not, it will allow k0s to be started although the pid value is taken by another process.
Agree. Left some comments in the review.
// (infinity is not available on default macOS zsh), this simulates | ||
// a different running process with the same pid as stored in the runtime config, but | ||
// from a different executable image | ||
cmd := getBackgroundCommand() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these kinds of tests, you could use internal/testutil/pingpong
, which provides a no-op testing process and might help avoid some boilerplate here.
// - actually test if k0s is still running by checking against: | ||
// - the same executable with the same pid as the runtime config | ||
// - a different executable with the same pid as the runtime config | ||
// - a pid that is not running anymore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. It would be super helpful to have all of those in separate tests.
@@ -1,7 +1,7 @@ | |||
//go:build unix | |||
|
|||
/* | |||
Copyright 2023 k0s authors | |||
Copyright 2025 k0s authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can leave the copyright date as is.
|
||
// check that the executable is the same in order to issue an error, if it would be a different | ||
// than it can't be a running instance of k0s, which the check is all about | ||
if exeTarget != currentExe { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's os.SameFile
which might help here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take this approach, there is a chance that another running copy of the k0s binary won't be detected as already running. Say k0s was started by the init system and is running in the background. Then someone grabs a fresh binary and runs sudo ./k0s controller
. This will certainly cause some problems. I am not sure how contrived this scenario is, though.
|
||
// Get the executable path of the target pid | ||
// - no need to resolve symlinks here, it is already a resolved path | ||
exeTarget, err := proc.Exe() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather hardcode something like os.Stat(filepath.Join("/proc", pid, "exe"))
here rather than pulling in a rather large dependency. This wouldn't work on Windows, but I'd revisit this another time. If runtime.GOOS
is not "linux"
, you could simply skip this function call, or return successful unconditionally, or return an error that will unwrap to errors.ErrUnsupported
and check for this on the call sites ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe os.Args[0]
would be a platform-agnostic alternative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually my first PR commit had just a linux impl using the procfs. @twz123 the reason for this using procfs or something else to find the executable path, is actually that you need to find the executable path of a foreign pid too, not just your own.
I also think it would be much better to just implement something for Linux, or maybe even drop the PR completely, if you plan to change the runtime config parts anyway soon. Here was the original minimal change to checkPid that will only work on Linux:
e19014e#diff-406aeff8e9fcab83e1f62a534a0b5f9292b2cad0bf7d874a526fc15a9f6e8fe8L27
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to find the executable path of a foreign pid too, not just your own.
... of course 🤦♂️
or maybe even drop the PR completely, if you plan to change the runtime config parts anyway soon.
Not sure about the soon part. It's definitely something worth doing, but that issue is lying around for years already, and is not really on the roadmap. You could try to tackle it, If you're interested! OTOH, we could also try to get this PR merged in the meantime.
This pull request has merge conflicts that need to be resolved. |
I had a look at this and my general thinking is that this is a complicated and unreliable solution. I think we should simply drop the pid check completely and use other ways to prevent multiple instances of k0s, for example using flock(2). |
I have an alternative way to solve this by simply removing the pid from config and use flock instead to ensure unique instance of k0s: #5435 |
I like that much more and I think that's the better solution! |
closing because no reason to continue |
Description
When running k0s in one of our deployments, we noticed that during startup with an existing runtime config file, it thinks it's running, but it isn't since the system just started up from scratch. The root cause is, that the detection in the linux runtime works by checking if the pid from the runtime config is a running process on the system, but not if the pid is actually the same executable image.
So what happens is, that if any other process during startup takes up that pid value from the runtime config, k0s thinks it is already running although it isn't.
I guess an even better method to solve this would be to hold a lock file instead of writing a pid value in the yaml, but the fix right now addresses the topic by checking if the pid value is from the same executable image, and if not, it will allow k0s to be started although the pid value is taken by another process.
Fixes #5399
Type of change
How Has This Been Tested?
The tests extend the current runtime tests by 3 more cases:
Checklist: