timeouts preventing pxe booting? #32
Also, using the pxe command line we can take advantage of what appears to be some sort of caching in pixiecore's /_/ipxe endpoint by doing:
If you time it right, it will work. Of course this is not very automatable, but it shows that things would work if the ipxe endpoint had a faster response time, or even cached the response long enough that the pxe client could try again after it reboots.
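(The manual retry at the pxe command line could be approximated in code. A rough Go sketch, with assumed placeholder host, MAC, and arch values rather than anything from this thread:)

```go
// Poll Pixiecore's /_/ipxe endpoint until it answers promptly, mimicking the
// manual retries at the pxe command line. Host, MAC, and arch are placeholders.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	url := "http://pixiecore.example/_/ipxe?arch=0&mac=52:54:00:12:34:56"
	client := &http.Client{Timeout: 2 * time.Second}
	for attempt := 1; attempt <= 10; attempt++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			fmt.Printf("got iPXE script on attempt %d\n", attempt)
			return
		}
		fmt.Printf("attempt %d failed: %v\n", attempt, err)
		time.Sleep(time.Second)
	}
}
```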
In API mode, each request to the ipxe endpoint translates to exactly one request to the upstream API server. Pixiecore does not do any caching at all, because caching would make it hard or impossible to implement certain workflows. So, any delay or variance you see in a /_/ipxe request should be caused by a slow API server, not by Pixiecore.

One thing that is strange is that Pixiecore defaults to a 5s timeout on the upstream API server request, so it should be impossible to get a successful response after 10s... You should be getting a timeout error after 5s.

Give me a few minutes, I'll add some debug logging to the API server codepath, so we can examine more closely what's happening. What API server are you using? Is it an open-source one like waitron, or something custom that you built?
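(To illustrate the shape of that codepath — this is a sketch, not the actual Pixiecore source: one upstream GET per /_/ipxe request, against Pixiecore's documented `/v1/boot/<mac>` API contract, bounded by the 5-second default timeout.)

```go
package pixiesketch

import (
	"fmt"
	"net/http"
	"time"
)

// fetchBootSpec sketches API mode's upstream lookup: each /_/ipxe request
// triggers exactly one GET to the API server, bounded by a 5s overall
// client timeout (Pixiecore's documented default). Illustrative only.
func fetchBootSpec(apiURL, mac string) (*http.Response, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	return client.Get(fmt.Sprintf("%s/v1/boot/%s", apiURL, mac))
}
```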
Okay, I've added some timing logging to the ipxe codepath. Please reproduce the slow boot using the latest code, and add the resulting logs here.

If you're using an automated build, all autobuilders (quay.io, Docker Hub, packagecloud Debian packages) have updated to the latest code. Or, of course, you can build from source.

Thanks!
Thanks for the speedy reply :) Apologies for not getting back to you sooner. I've pulled latest and unfortunately the timings aren't present in the log I'm seeing until I manually force the pxe boot via the command line. Below are the relevant logs for both the automated boot and the manual boot.

Automated pxe boot

pixiecore logs
pixiecore API server logs
I added response times for the API server endpoint as well to see if there's something I was missing. It's a pretty simple Flask app (also running in a container) that looks up known MACs and returns the kernel, initrd, and cmdline JSON as required (a sketch of that contract follows these logs).

Manual pxe boot

Simulated via the pxe command line.

pixiecore logs
pixiecore API server logs
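(The Flask app itself isn't shown above; for illustration, here is a minimal Go sketch of an API server speaking Pixiecore's `/v1/boot/<mac>` contract. The MAC table and file paths are hypothetical; the :9091 port matches the one mentioned later in this thread.)

```go
// Minimal Pixiecore API server sketch: look up a known MAC and return the
// kernel/initrd/cmdline JSON that Pixiecore expects from /v1/boot/<mac>.
package main

import (
	"encoding/json"
	"net/http"
	"strings"
)

type bootSpec struct {
	Kernel  string   `json:"kernel"`
	Initrd  []string `json:"initrd"`
	Cmdline string   `json:"cmdline"`
}

// Hypothetical MAC -> boot spec table; a real deployment would be dynamic.
var known = map[string]bootSpec{
	"52:54:00:12:34:56": {
		Kernel:  "file:///images/vmlinuz",
		Initrd:  []string{"file:///images/initrd.img"},
		Cmdline: "ks=http://example.internal/kickstart.cfg",
	},
}

func main() {
	http.HandleFunc("/v1/boot/", func(w http.ResponseWriter, r *http.Request) {
		mac := strings.TrimPrefix(r.URL.Path, "/v1/boot/")
		spec, ok := known[mac]
		if !ok {
			// A 404 tells Pixiecore to ignore this machine.
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(spec)
	})
	http.ListenAndServe(":9091", nil)
}
```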
I've also made a screen capture of the Proxmox boot, along with two shells showing the logs for pixiecore and the API server, if that would be helpful. It shows the net0 device getting network info from DHCP and then about a 10-second delay before pixiecore logs anything, which is odd to me and might point to the issue.
I just ran a tcpdump to catch what's going on during a pxe boot. We noticed there was a good amount of time spent waiting on DNS before that boot filename packet came by, so we started poking around. strace on pixiecore showed some DNS activity, so as a test we changed the pixiecore container's nameserver to 127.0.0.1 and it booted right up. We'll keep looking at this on our end, but if you have any thoughts/ideas please let us know! Appreciate the time and help :)
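(One quick way to see that resolver latency directly from inside the container is to time a lookup. A minimal Go probe, in the spirit of the strace/tcpdump digging above; the hostname is an assumption — use whatever the container actually dials:)

```go
// Time how long the container's resolver takes to answer a single lookup.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	start := time.Now()
	addrs, err := net.DefaultResolver.LookupHost(context.Background(), "localhost")
	fmt.Printf("lookup took %s, addrs=%v, err=%v\n", time.Since(start), addrs, err)
}
```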
Hmm, interesting. Looking at the logs (the first set, where the machine is trying to boot autonomously), Pixiecore never progresses past the ProxyDHCP stage when you boot autonomously. Your API server is responding ~instantly when Pixiecore queries it, so that implies (as you discovered) that the delay is between the DHCPDISCOVER and Pixiecore's DHCPOFFER response. The only blocking thing that happens in that interval is the request to the API server.

The DNS nameserver configuration is a very good candidate for the problem. My guess at what's happening is that something in the DNS resolution path is failing, and after a 10s timeout the query falls through to something else that succeeds (e.g. a second nameserver that resolves correctly). I can't prove it because Pixiecore's logs don't record timing data about the API request in the DHCP codepath :(. I've pushed another change that adds logging around the API request made during DHCP; if you retry with 63c4bab included, you should get some data about that request as well.

My theory on what's happening is that your container's DNS configuration is pointing to a non-responsive resolver, so when Pixiecore tries to dial "localhost:9091", the DNS resolution takes 10s to time out. I'm not sure why Pixiecore would then succeed after 10s; I'm guessing it either falls through to a second resolver that works, or ends up getting a useful answer somewhere in /etc/nsswitch.conf. It's strange that this (suspected) DNS timeout doesn't trigger the 5s timeout on the HTTP client; the documentation says that it should... My guess is the documentation is inaccurate, and I should be providing dial timeouts to the underlying HTTP transport as well.

What container kind are you using (docker, rkt, systemd...)? Can you share its configuration (and if applicable, how you build pixiecore)? I'd like to replicate the configuration where you were seeing DNS issues, so I can look more precisely at what's going on. Also, what was the contents of /etc/resolv.conf in the container before you changed it to 127.0.0.1? Was there >1 nameserver line?
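(For reference, wiring a dial timeout into a Go HTTP client looks like this — a sketch of the kind of change being described, not the actual commit:)

```go
// A client whose dial phase (DNS resolution + TCP connect) is bounded
// separately from the overall request deadline. net.Dialer.Timeout covers
// name resolution, which plain http.Client.Timeout alone may not cut short
// in the way you'd expect if the resolver hangs.
package pixiesketch

import (
	"net"
	"net/http"
	"time"
)

var client = &http.Client{
	Timeout: 5 * time.Second, // whole-request deadline
	Transport: &http.Transport{
		DialContext: (&net.Dialer{
			Timeout: 5 * time.Second, // bounds DNS + connect
		}).DialContext,
	},
}
```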
I'll take a look at your recent commit and build the image to test in our environment. For completeness, here's a screenshot of the Wireshark stream: http://imgur.com/a/te44X
We're using Docker in a small Kubernetes cluster.
Just using the latest danderson/pixiecore image from Docker Hub at the moment.
We're running in Kubernetes, but the pod manifest specifies the following arguments to pixiecore, along with two mounted volumes: one for the kernel and initrd images we want to serve using the file:// protocol, and the other for overriding /etc/resolv.conf (a temporary solution at the moment).
It was our internal lab nameserver.
Here's an updated log of the automated net boot that's timing out due to DNS, using the latest pixiecore container:
We've updated our workflow to override the pixiecore container's /etc/resolv.conf with just 127.0.0.1 as a nameserver. It works every time... We would love to be able to figure this out without hacking that in, but at least it lets us move forward while we investigate.
Hi, we're seeing some weirdness with pixiecore in API mode on Proxmox hosts, and we think it's related to the time it takes to provide the ipxe boot script, which we can test like so:
If we run the curl again, it is very fast (<<1s).
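(The exact curl command isn't shown above; as a stand-in, a Go probe that times two back-to-back fetches of the iPXE script — host, MAC, and arch values are placeholders — would demonstrate the same slow-then-fast behavior:)

```go
// Time two consecutive fetches of the iPXE script from /_/ipxe to reproduce
// the slow first request followed by a fast second one.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	url := "http://pixiecore.example/_/ipxe?arch=0&mac=52:54:00:12:34:56"
	for i := 1; i <= 2; i++ {
		start := time.Now()
		resp, err := http.Get(url)
		if err != nil {
			fmt.Printf("request %d failed after %s: %v\n", i, time.Since(start), err)
			continue
		}
		io.Copy(io.Discard, resp.Body) // drain the script body before timing
		resp.Body.Close()
		fmt.Printf("request %d took %s\n", i, time.Since(start))
	}
}
```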
What's happening is our VM comes up and attempts to pxe boot from net0, fails, attempts to boot from net1, fails, and then reboots. This will continue to happen without much success. We can drop into the pxe command line and force things to work:
This will run through our expected kickstart install and all is well, so I believe we can assume that pixiecore and the API are working OK. Is there some way to debug what we're seeing? Can the ProxyDHCP bits be affected by slowness in receiving the boot image/script?
Also, I should say that if we use pixiecore's static mode it works every time, but we lose the ability to dynamically provide a kickstart or control which MAC addresses we care about.