Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Go TLS tracing not working on Amazon Linux 2023 #1986

Closed
SamuraiPrinciple opened this issue Aug 18, 2024 · 24 comments
Closed

Go TLS tracing not working on Amazon Linux 2023 #1986

SamuraiPrinciple opened this issue Aug 18, 2024 · 24 comments

Comments

@SamuraiPrinciple
Copy link

SamuraiPrinciple commented Aug 18, 2024

Describe the bug
Go TLS tracing does not work on Amazon Linux 2023.

To Reproduce
Provision EKS cluster (v1.30) with two node groups, one running AL2, one running AL2023. Deploy pixie. Run a Go (1.21 or 1.22) app (pod/deployment) that's making outbound HTTPS requests (HTTP 1.1). Go to px/http_data and try and observe said outbound traffic. Only traffic initiated by the pods running on AL2 node group would appear.

Expected behavior
Be able to see traffic initiated by pods scheduled on node group running AL2023.

Logs
Please attach the logs by running the following command:

./px collect-logs

App information (please complete the following information):

  • Pixie version: 0.14.10
  • K8s cluster version: EKS 1.30
  • Node Kernel version: 6.1.97-104.177.amzn2023.x86_64
  • Browser version: -

pem-AL2.log
pem-al2023.log

@ddelnano
Copy link
Member

@SamuraiPrinciple thanks for reporting this. I'm aiming to try to reproduce the issue later this week and will report back when I have more information.

@SamuraiPrinciple
Copy link
Author

SamuraiPrinciple commented Aug 20, 2024

Hi @ddelnano - many thanks for the quick reply, it really is appreciated. Do let me know if I could be of any help (more logs, screenshare, ...).

Take care and wish you a great day.

@ddelnano
Copy link
Member

This is still on my radar. I'm expecting to have time to look into this at the end of the week.

@SamuraiPrinciple
Copy link
Author

That's great, thanks for the update. I've tried with the latest pixie version (0.14.11) and it's still not working.

@SamuraiPrinciple
Copy link
Author

Hi @ddelnano hope all is well :) Wondering if you've managed to reproduce the issue? Happy to provide more details (EKS AMIs, pixie deployment config and sample go app) in case you didn't manage to get a repro...

Many thanks,
damjan

@ddelnano
Copy link
Member

@SamuraiPrinciple sorry for missing to follow up on this! I'll be giving this some attention tomorrow or Thursday.

@SamuraiPrinciple
Copy link
Author

No, no, no problem at all - just let me know if you're struggling to reproduce it, very happy to make it as easy as possible for you.

Thanks for the reply and your hard work!

@ddelnano
Copy link
Member

Since our existing Go tls tracing test hasn't been updated for Go 1.22 yet, I wanted to rule out any obvious incompatibility. We will eventually have that version under test, but until that is upstream I've created this branch which adds coverage for Go 1.22. The test passes on that branch, which is a good sign. My next step is to reproduce the issue on EKS w/ AL2023.

@SamuraiPrinciple while I haven't tried the EKS steps yet, can you able to share an example Go application that exhibits the problem and the compiler version?

@SamuraiPrinciple
Copy link
Author

Hi, thanks for picking this up! I will provide you with the sample app later, but just as quick note - if I rotate EKS nodes so that they use AL2 AMIs, the tracing then works with everything else being exactly the same.

@ddelnano
Copy link
Member

No problem! Looking forward to seeing the example app as I didn't have luck with the reproduction on EKS.

@SamuraiPrinciple
Copy link
Author

@SamuraiPrinciple
Copy link
Author

The repo contains a simple go program that polls https://example.com as well as a Dockerfile that can be used to build an image. There is also a very simple k8s deployment manifest that runs two containers (one for HTTP 1.1 and the other for HTTP/2).

@ddelnano
Copy link
Member

Thanks for the quick response. Does the HTTP/2 case work on AL2? Pixie's HTTP/2 tracing should only work for GRPC and when Go binaries have debug symbols (docs).

screen

As for the HTTP 1.1 case, can you explain more on the success case (when running on AL2)? I crafted a sample application very similar to yours and didn't see any traffic for AL2 or AL2023. I didn't consider my case a reproduction since I thought it was more similar to #899, which was never fully understood.

I'll be able to continue debugging this on my end, but I'd still like to understand the AL2 case to help narrow down where the problem might be.

@SamuraiPrinciple
Copy link
Author

SamuraiPrinciple commented Dec 12, 2024

Hi, mega-thanks for the quick reply! To be honest, we're not using HTTP/2 (i.e. we're actively disabling it - long story, but it's problematic for various reasons, unrelated to pixie), so I've only included that 'just in case'. I can confirm that on AL2 I can indeed see payloads for outbound TLS requests (HTTP 1.1).

@ddelnano
Copy link
Member

ddelnano commented Dec 12, 2024

For your AL2 environment, can you run the PEM with --stirling_conn_trace_pid=${pid_of_working_app} and upload the PEM log to this issue?

You can add that arg to the DaemonSet even though the PID won't match on the other instances. Just make sure to grab the logs from the instance that has the application with the PID.

@SamuraiPrinciple
Copy link
Author

SamuraiPrinciple commented Dec 12, 2024

Will get on it. In the meantime, here is what I can see for the node group running AL2

Screenshot 2024-12-12 at 22 56 15

@SamuraiPrinciple
Copy link
Author

I'm deploying pixie using a helm chart - do you happen to know what the quickest way is to pass an argument to pem daemonset? or do I have to manually update the pem daemonset?

@SamuraiPrinciple
Copy link
Author

vizier-pem-fvgxt.log

I think I've got it...

@ddelnano
Copy link
Member

For completeness, the patches config setting is probably the only way to do that and a helm example can be found here.

Thanks for the logs! That is exactly what I am looking for. Need some time to further investigate, but I believe I have the information I need for now!

@ddelnano
Copy link
Member

ddelnano commented Dec 13, 2024

Ok, I've tracked down the issue. The get_goid function is broken on AL23. I made the following changes to print out all the variables and here is the output that I'm seeing.

diff --git a/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h b/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h
index 02cc46a3c..3667a490b 100644
--- a/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h
+++ b/src/stirling/source_connectors/socket_tracer/bcc_bpf/go_trace_common.h
@@ -95,12 +95,14 @@ static inline uint64_t get_goid(struct pt_regs* ctx) {
   uint64_t id = bpf_get_current_pid_tgid();
   uint32_t tgid = id >> 32;
   struct go_common_symaddrs_t* common_symaddrs = go_common_symaddrs_map.lookup(&tgid);
+  bpf_trace_printk("get_goid: tgid=%d, common_symaddrs=%p\n", tgid, common_symaddrs);
   if (common_symaddrs == NULL) {
     return 0;
   }

   // Get fsbase from `struct task_struct`.
   const struct task_struct* task_ptr = (struct task_struct*)bpf_get_current_task();
+  bpf_trace_printk("get_goid: task_ptr=%p\n", task_ptr);
   if (!task_ptr) {
     return 0;
   }
@@ -118,6 +120,7 @@ static inline uint64_t get_goid(struct pt_regs* ctx) {
   size_t g_addr;
   bpf_probe_read_user(&g_addr, sizeof(void*), (void*)(fs_base + common_symaddrs->g_addr_offset));
   bpf_probe_read_user(&goid, sizeof(void*), (void*)(g_addr + common_symaddrs->g_goid_offset));
+  bpf_trace_printk("fs_base=%p get_goid: g_addr=%p, goid=%d\n", fs_base, g_addr, goid);
   return goid;
 }

  http-requester-7557    [001] ...11 14174.901214: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0
  http-requester-7557    [001] ...11 14174.901218: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0
             olm-3807    [000] .N.11 14177.957893: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0
             olm-3807    [000] .N.11 14177.958153: bpf_trace_printk: fs_base=0000000000000000 get_goid: g_addr=0000000000000000, goid=0

bpf-printk-logs.txt

@ddelnano
Copy link
Member

@SamuraiPrinciple can you try installing the kernel-devel.x86_64 package on all AL23 hosts and see if tracing works?

I was able to verify that the issue above is because Pixie's prepackaged headers (what are used if kernel headers aren't available on host) is causing this thread lookup to fail. The fsbase is essential for locating the goid, so incorrectly reading that value breaks tracing.

Essentially AL23 has enough kernel differences (back ports, etc) compared to vanilla linux that Pixie's pre-packaged headers aren't compatible. Our docs mention that it's highly recommended to install linux headers although we haven't had a good way to surface this to end users. Unfortunately these kernel header incompatibilities manifest in strange ways, so it's usually time consuming to identify the problem.

Coincidentally I've been working on #2051 and it is getting close to making it in a release. I intend to make that check part of the Pixie cli and helm install process in addition to the px collect-logs command. That would have allowed us to catch this much sooner and rule out the headers being a problem.

@SamuraiPrinciple
Copy link
Author

Hi! Many thanks for the investigation and the info. I can confirm that installing linux headers package indeed resolves the issue. The HTTP 1.1 tls connections can be traced (I have noticed that gzipped responses are not rendered in UI - "resp_body: <removed: non-text content-type>,"). Happy to close this issue.

@ddelnano
Copy link
Member

No problem, sorry it took some time for me to dig into this and greatly appreciate your help throughout the process!

I'm hopeful that surfacing the lack of linux headers in Pixie's diagnostic tools will help uncover and fix these problems before they become month long bugs! That tooling should be making its way in a release in the next few weeks (across vizier, operator, cli, etc).

@SamuraiPrinciple
Copy link
Author

Glad I could be useful :) and many, many thanks for your help with this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants