feat: add support for safe driver loading feature #600
Force-pushed from d72fe32 to 5054f10.
/retest-all
@ykulazhenkov For the kind CI, a rebase is needed so that the revert of the tighting patch is applied. For the helm CI, the MOFED container seems to have been slow to come up, which caused the CI to hit its 10-minute timeout. Do you think we need to increase the CI timeout that waits for the NIC cluster policy to become ready?
Force-pushed from 5054f10 to de36e05.
Force-pushed from a1b8ea0 to 05bcf3e.
Force-pushed from 37834e9 to 6cf4d7a.
/retest-all
Force-pushed from 2d368ce to 18d4400.
Force-pushed from 09dbdff to 598d765.
I updated the PR code to use network-operator-init-container 0.0.2.
Force-pushed from 598d765 to 4de254d.
/retest-nic_operator_helm
Force-pushed from c7b7be3 to 3eca43a.
/retest-nic_operator_helm
Force-pushed from f8a9925 to 76776e3.
Force-pushed from 76776e3 to c27eff8.
I added a check to the admission controller to make sure that initContainer and autoUpgrade are enabled when a user tries to enable the safeLoad feature.
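For readers following along, the kind of check described in this comment could look roughly like the sketch below. This is not the code from the PR: `safeLoadSettings` and `validateSafeLoad` are hypothetical names, and the real admission controller works on the actual CR types rather than on a flattened struct.

```go
package main

import (
	"errors"
	"fmt"
)

// safeLoadSettings is a hypothetical, flattened view of the three options
// mentioned in the comment. It is not a type from the operator.
type safeLoadSettings struct {
	SafeLoad             bool
	AutoUpgrade          bool
	InitContainerEnabled bool
}

// validateSafeLoad rejects a configuration that enables safeLoad without
// the two features it depends on. It returns the first problem it finds.
func validateSafeLoad(s safeLoadSettings) error {
	if !s.SafeLoad {
		// Nothing to check when the feature is off.
		return nil
	}
	if !s.AutoUpgrade {
		return errors.New("safeLoad requires autoUpgrade to be enabled")
	}
	if !s.InitContainerEnabled {
		return errors.New("safeLoad requires the init container to be enabled")
	}
	return nil
}

func main() {
	// safeLoad is requested, but the init container is not enabled.
	err := validateSafeLoad(safeLoadSettings{SafeLoad: true, AutoUpgrade: true})
	fmt.Println("validation result:", err)
}
```

Returning the first error keeps the sketch short; a real webhook might prefer to report every violated precondition at once.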
Force-pushed from c27eff8 to e578586.
/retest-nic_operator_helm
Force-pushed from 7fb0594 to 003c802.
@adrianchiris @rollandf @almaslennikov I made a big change in the PR after a discussion with Adrian: we now use an environment variable to configure the image for the init container. Please take another look.
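A minimal sketch of the approach described in this comment, for illustration only. The variable name `EXAMPLE_INIT_CONTAINER_IMAGE`, the `initContainerConfig` type, and the rule that an unset or empty value disables the init container are all assumptions made for the example, not details taken from the PR.

```go
package main

import (
	"fmt"
	"os"
)

// initContainerImageEnv is a hypothetical variable name used only here.
const initContainerImageEnv = "EXAMPLE_INIT_CONTAINER_IMAGE"

// initContainerConfig is a hypothetical stand-in for the rendered config.
type initContainerConfig struct {
	Enabled bool
	Image   string
}

// initContainerConfigFromEnv builds the config from an environment lookup.
// The lookup function is passed in so that tests can supply a fake one.
// Assumption: an unset or empty variable means the init container is off.
func initContainerConfigFromEnv(lookup func(string) (string, bool)) initContainerConfig {
	image, ok := lookup(initContainerImageEnv)
	if !ok || image == "" {
		return initContainerConfig{}
	}
	return initContainerConfig{Enabled: true, Image: image}
}

func main() {
	// os.LookupEnv has the signature expected by the helper above.
	cfg := initContainerConfigFromEnv(os.LookupEnv)
	fmt.Printf("%+v\n", cfg)
}
```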
Small nit, otherwise LGTM
pkg/state/state_ofed.go
Outdated
OSName: nodeAttr[nodeinfo.AttrTypeOSName],
OSVer: nodeAttr[nodeinfo.AttrTypeOSVer],
MOFEDImageName: s.getMofedDriverImageName(cr, nodeAttr, reqLogger),
InitContainerConfig: s.getInitContainerConfig(cr, reqLogger, config.FromEnv().OFEDState.InitContainerImage),
I believe we would generally want to work with the staticconfig provider [1], or just delete it and work with config.FromEnv()...
[1] Line 85 in c40b096:
staticInfoProvider := staticconfig.NewProvider(staticconfig.StaticConfig{CniBinDirectory: cniBinDir})
The same goes for the namespace we use in runtimespec (here and in other states).
Thoughts? (I'll create a task to align this, so it will not affect this PR.)
yeah, I was thinking about using a static provider, but I didn't find any benefits from it.
Maybe we don't need the static provider at all, because it doesn't make sense to cache env variables (we already read them only once)
Maybe it will be useful for unit tests? (One less source of information.)
That is, a state gets the CR, some providers, and a renderer, and does the work.
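To make the unit-test argument concrete, here is a rough sketch of a state that receives its configuration through a provider instead of reading the environment itself. All names (`imageProvider`, `staticProvider`, `exampleState`) are hypothetical and do not mirror the real staticconfig package.

```go
package main

import "fmt"

// imageProvider is a hypothetical interface: the state asks it for values
// and never touches the process environment directly.
type imageProvider interface {
	InitContainerImage() string
}

// staticProvider returns a fixed value, which is all a unit test needs.
type staticProvider struct {
	image string
}

func (p staticProvider) InitContainerImage() string { return p.image }

// exampleState stands in for a state that gets its inputs injected.
type exampleState struct {
	provider imageProvider
}

// describeInitImage stands in for the rendering work done by a state.
func (s exampleState) describeInitImage() string {
	return "init container image: " + s.provider.InitContainerImage()
}

func main() {
	s := exampleState{provider: staticProvider{image: "registry.example/init:test"}}
	fmt.Println(s.describeInitImage())
}
```

With this shape a test constructs the state with a fixed provider and needs no environment setup, which is the "one less source of information" point; the trade-off raised earlier in the thread, that the provider adds little when the variables are read only once, still applies.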
Created #698; please add your preference there.
Signed-off-by: Yury Kulazhenkov <[email protected]>
Force-pushed from 003c802 to b2398da.
/retest-nic_operator_helm
LGTM
On Node startup, the OFED container takes some time to compile and load the driver.
During that time, workloads might get scheduled on that Node.
When OFED is loaded, all existing PODs that use NVIDIA NICs will lose their network interfaces.
Some such PODs might silently fail or hang.
To avoid such a situation, before the OFED container is loaded,
the Node should get Cordoned and Drained to ensure all workloads are rescheduled.
The Node should be un-cordoned when the driver is ready on it.
The safe driver loading feature is implemented as a part of the upgrade flow,
meaning safe driver loading is a special scenario of the upgrade procedure,
where we upgrade from the inbox driver to the containerized OFED.
depends on: #685