
server start failed for reason 'oversized message' #102

Open
yylt opened this issue Aug 13, 2024 · 13 comments

yylt (Contributor) commented Aug 13, 2024

nri-daemon log

time="2024-08-10T06:23:28Z" level=info msg="Configuring plugin 05-Controller for runtime containerd/v1.7.16"
time="2024-08-10T06:23:28Z" level=info msg="Started plugin 05-Controller..."
time="2024-08-10T06:23:29Z" level=error msg="error receiving message" error="failed to discard after receiving oversized message: cannot allocate memory"

containerd log

time="2024-08-10T14:25:55.098248241+08:00" level=info msg="synchronizing plugin 05-Controller"
time="2024-08-10T14:25:57.098681078+08:00" level=info msg="failed to synchronize plugin: context deadline exceeded"
time="2024-08-10T14:25:57.098762853+08:00" level=info msg="plugin \"05-Controller\" connected"

The issue likely lies in the ttrpc repository: there is a size check on incoming messages in ttrpc's receive path. https://github.com/containerd/ttrpc/blob/655622931dab8c39a563e8c82ae90cdc748f72a1/channel.go#L126
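For reference, here is a paraphrased, self-contained sketch of the guard behind that error; the authoritative version, including the exact constant and error wording, is in the linked channel.go, and ttrpc's default cap is 4 MiB:

```go
package sketch

import (
	"bufio"
	"fmt"
)

// messageLengthMax mirrors ttrpc's default per-message cap (4 MiB).
const messageLengthMax = 4 << 20

// checkIncoming sketches the receive-path guard from the linked channel.go:
// if the frame header announces a payload larger than the cap, the reader
// discards the payload (so the stream stays usable) and returns an
// "oversized message" error instead of processing it. The "cannot allocate
// memory" in the log above is the error coming out of that discard step.
func checkIncoming(br *bufio.Reader, declaredLen uint32) error {
	if declaredLen > messageLengthMax {
		if _, err := br.Discard(int(declaredLen)); err != nil {
			return fmt.Errorf("failed to discard after receiving oversized message: %w", err)
		}
		return fmt.Errorf("message length %d exceeds maximum message size of %d", declaredLen, messageLengthMax)
	}
	return nil
}
```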

However, this raises some questions:

  • Is ttrpc the appropriate RPC framework for this context?
  • Should the plugin (or the connection) be terminated when such an error is encountered?

/cc @klihub

klihub (Member) commented Aug 13, 2024

@yylt Is that a reproducible problem in your environment? Can you give a bit more details? I'd be interested at least in the number of pods and containers you have running in your system.

yylt (Contributor, Author) commented Aug 13, 2024

There are about 150 pods, mostly in the same namespace ("default"); the problem might be related to the number of environment variables.

The issue can be consistently reproduced in my environment; reproducing it probably requires a certain number of pods.

klihub (Member) commented Aug 13, 2024

And how many containers do you have altogether in those pods?

yylt (Contributor, Author) commented Aug 13, 2024

> And how many containers do you have altogether in those pods?

The number of containers per pod does not seem to affect this, as each sync operation is independent of the others.

klihub (Member) commented Aug 13, 2024

Not per pod; the total number of containers across all pods. I assume we hit the ttrpc messageLengthMax limit with the sync request, so what matters is both the total number of pods and the total number of containers. That's why I'd like to know it. In other words, what does crictl ps | grep -v CONTAINER | wc -l report on the failing host?
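To illustrate why the totals matter: the synchronization request NRI sends to a plugin carries every pod sandbox and every container in a single ttrpc message, so its encoded size grows with the whole node's state. The types below are only a simplified stand-in for the real ones in github.com/containerd/nri/pkg/api (field names abbreviated, not the actual definitions):

```go
package sketch

// Simplified stand-ins for the NRI synchronization types: the real request
// bundles the full list of pod sandboxes and containers (labels, annotations,
// env, mounts, resources, ...) into one message, so the encoded size is
// roughly proportional to the total number of pods plus containers on the
// node, not to any per-pod count.
type PodSandbox struct {
	ID          string
	Name        string
	Namespace   string
	Labels      map[string]string
	Annotations map[string]string
}

type Container struct {
	ID           string
	PodSandboxID string
	Name         string
	Env          []string
	Mounts       []string
}

type SynchronizeRequest struct {
	Pods       []*PodSandbox
	Containers []*Container
}
```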

klihub (Member) commented Aug 13, 2024

Also, it would be interesting to see the output of these:

  • crictl pods -o json | wc -c
  • crictl ps -o json | wc -c

yylt (Contributor, Author) commented Aug 13, 2024

# ctr -n k8s.io c ls  |wc -l
643

# crictl ps | grep -v CONTAINER | wc -l
156

# crictl pods -o json | wc -c
256810

# crictl ps -o json | wc -c
200467

klihub (Member) commented Aug 14, 2024

@yylt I have a branch with a fix that kicks a plugin out if synchronization fails, which alone already provides more graceful behavior.

I also have an initial fix attempt for the size overflow, and a containerd v1.7.16 tree redirected to compile with those fixes. With that in place, the error my local test used to trigger is gone and the plugin registers successfully. Would you be able to give it a try: compile it and drop it into your test cluster to see if it gets rid of the problems on your side, too? I could then polish/finalize it a bit more and file PRs with the fixes.
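For context, one generic way to avoid the overflow is to split the synchronization payload into multiple messages, each kept safely below the transport limit. The sketch below only illustrates that batching idea under an assumed 4 MiB cap; it is not the actual patch in klihub's branch:

```go
package sketch

// Assumed per-message cap (ttrpc's default) plus some headroom for the
// message envelope; both values are illustrative.
const (
	msgSizeLimit = 4 << 20
	sizeMargin   = msgSizeLimit / 8
)

// sizer is anything that can report its encoded size, e.g. a generated
// protobuf type exposing Size().
type sizer interface {
	Size() int
}

// splitIntoBatches groups items so each batch stays below the cap, letting a
// large synchronization payload be sent as several messages instead of one
// oversized one.
func splitIntoBatches[T sizer](items []T) [][]T {
	var (
		batches [][]T
		current []T
		used    int
	)
	for _, item := range items {
		sz := item.Size()
		if len(current) > 0 && used+sz > msgSizeLimit-sizeMargin {
			batches = append(batches, current)
			current, used = nil, 0
		}
		current = append(current, item)
		used += sz
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}
```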

yylt (Contributor, Author) commented Aug 14, 2024

> @yylt I have a branch with a fix that kicks a plugin out if synchronization fails, which alone already provides more graceful behavior.
>
> I also have an initial fix attempt for the size overflow, and a containerd v1.7.16 tree redirected to compile with those fixes. With that in place, the error my local test used to trigger is gone and the plugin registers successfully. Would you be able to give it a try: compile it and drop it into your test cluster to see if it gets rid of the problems on your side, too? I could then polish/finalize it a bit more and file PRs with the fixes.

OK. Is https://github.com/klihub/nri/tree/fixes/yylt-sync-failure the right branch?

klihub (Member) commented Aug 14, 2024

@yylt Yes, but I have a directly patched containerd 1.7.16 tree pointing at that NRI version and re-vendored here, so it's easier to just compile and use that:

https://github.com/klihub/containerd/tree/fixes/yylt-sync-failure

klihub (Member) commented Aug 14, 2024

Oh, and you will need to recompile your plugin against that NRI tree as well. Otherwise the runtime side will detect that the plugin does not have the necessary support compiled in and will kick it out during synchronization.
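For anyone following along, pointing a plugin's NRI dependency at the patched tree typically comes down to a go.mod replace plus a rebuild, roughly along these lines (an assumed workflow, not steps given in this thread; if your Go toolchain does not resolve the branch name, substitute the branch's commit hash):

```
# Redirect the NRI module to the patched fork and rebuild the plugin.
go mod edit -replace github.com/containerd/nri=github.com/klihub/nri@fixes/yylt-sync-failure
go mod tidy
go build ./...
```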

yylt (Contributor, Author) commented Aug 14, 2024

After replacing both nri-daemon and containerd, the sync error no longer occurs upon restart.

If nri-daemon is replaced individually, the issue still persists.

klihub (Member) commented Aug 14, 2024

> After replacing both nri-daemon and containerd, the sync error no longer occurs upon restart.
>
> If nri-daemon is replaced individually, the issue still persists.

Yes, that is the expected behavior. And if you only update containerd, but run with an old plugin ('nri-daemon' I believe in your case), then the plugin should get disconnected during synchronization...

mikebrow added this to the 1.0 milestone Aug 22, 2024