Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: JGF Name was removed, and build with distroless destroyed logging #85

Merged
merged 2 commits into from
Oct 10, 2024

Conversation

vsoch
Copy link
Member

@vsoch vsoch commented Oct 6, 2024

There were a few problems here, and I tried to get at least 2 into separate commits.

Logging Not Working

The base image for the scheduler used to use alpine, which was a good strategy to have a minimal build (image size wise) and still have a filesystem. We used a filesystem to write logs to /tmp/fluence.log. So when this was changed to the current distroless, I basically saw no logging, and it's not clear if there were other errors being hidden beyond the initial startup in the entrypoint. I just saw nothing printed, anywhere, which made debugging hard, so I looked to the Dockerfile and found that. Then I was able to see the next layer of the error onion 🧅

I1006 21:44:23.452232       1 fluxion.go:91] [Fluence] There are no errors
	----Match Allocate output---
jobid: 1
reserved: false
allocated: {"graph": {"nodes": [{"id": "3", "metadata": {"type": "core", "id": 0, "rank": -1, "exclusive": true, "paths": {"containment": "/k8scluster0/0/kind-worker1/core0"}}}, {"id": "2", "metadata": {"type": "node", "basename": "kind-worker", "id": 1, "rank": -1, "paths": {"containment": "/k8scluster0/0/kind-worker1"}}}, {"id": "1", "metadata": {"type": "subnet", "basename": "", "id": 0, "rank": -1, "paths": {"containment": "/k8scluster0/0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "k8scluster", "id": 0, "rank": -1, "paths": {"containment": "/k8scluster0"}}}], "edges": [{"source": "2", "target": "3"}, {"source": "1", "target": "2"}, {"source": "0", "target": "1"}]}}

at: 0
overhead: 0.000442
panic: interface conversion: interface {} is nil, not string

goroutine 52 [running]:
github.com/flux-framework/flux-k8s/flux-plugin/fluence/utils.ParseAllocResult({0xc00067e000, 0x2ad}, {0xc00063c300, 0x29})
	/go/src/fluence/utils/utils.go:295 +0x785
github.com/flux-framework/flux-k8s/flux-plugin/fluence/fluxion.(*Fluxion).Match(0xc0003a6ba0, {0x12b54a0?, 0xc000099ad8?}, 0xc0000c03c0)
	/go/src/fluence/fluxion/fluxion.go:110 +0x2a8
github.com/flux-framework/flux-k8s/flux-plugin/fluence/fluxcli-grpc._FluxcliService_Match_Handler({0x12b54a0, 0xc0003a6ba0}, {0x15f4e48, 0xc0001ea990}, 0xc00058e300, 0x0)
	/go/src/fluence/fluxcli-grpc/fluxcli_grpc.pb.go:95 +0x1a6
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000481180, {0x15f9b98, 0xc000002000}, 0xc0005a2900, 0xc0001ea3c0, 0x1f17240, 0x0)
	/go/src/fluence/vendor/google.golang.org/grpc/server.go:1286 +0xc5e
google.golang.org/grpc.(*Server).handleStream(0xc000481180, {0x15f9b98, 0xc000002000}, 0xc0005a2900, 0x0)
	/go/src/fluence/vendor/google.golang.org/grpc/server.go:1609 +0x9da
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	/go/src/fluence/vendor/google.golang.org/grpc/server.go:934 +0x8d
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 51
	/go/src/fluence/vendor/google.golang.org/grpc/server.go:932 +0x226

Discussed next.

Name removed

The interface conversion issue hinted at a change in flux, and looking at the data as presented (and what we expect to parse) it was quick to see. The "Name" field was removed from the response from fluxion:

jobid: 1
reserved: false
allocated: {"graph": {"nodes": [{"id": "3", "metadata": {"type": "core", "id": 0, "rank": -1, "exclusive": true, "paths": {"containment": "/k8scluster0/0/kind-worker1/core0"}}}, {"id": "2", "metadata": {"type": "node", "basename": "kind-worker", "id": 1, "rank": -1, "paths": {"containment": "/k8scluster0/0/kind-worker1"}}}, {"id": "1", "metadata": {"type": "subnet", "basename": "", "id": 0, "rank": -1, "paths": {"containment": "/k8scluster0/0"}}}, {"id": "0", "metadata": {"type": "cluster", "basename": "k8scluster", "id": 0, "rank": -1, "paths": {"containment": "/k8scluster0"}}}], "edges": [{"source": "2", "target": "3"}, {"source": "1", "target": "2"}, {"source": "0", "target": "1"}]}}

Note that basename is present in the above, but not name. This led to an interface conversion error, where nil was attempted to be converted to a string.

at: 0
overhead: 0.000442
panic: interface conversion: interface {} is nil, not string

goroutine 52 [running]:

Notably, another piece of this was that the bindings for fluxion-go were pinned to 0.32.0, and yet we were cloning the latest flux-sched. This is why I updated the bindings version to the latest, which has nicely been releasing itself for quite some time now :)

I'll ping for review when tests pass.

vsoch added 2 commits October 6, 2024 16:11
Problem: name was removed (and replaced with Subsystem) and this is problematic
because the field comes through as nil and triggers an error. We do not
currently have subsystem support in fluence, and so I am opting to remove it
entirely for the simplest solution.
Solution: remove the Name field, which was not used aside from a print statement.

Signed-off-by: vsoch <[email protected]>
Problem: the sig-scheduler-plugins Dockerfile was updated to use
a distroless image over alpine, and this destroyed our logging
strategy so it was impossible to see printed messages from fluence.
Since this has bit us before, it will be a simpler strategy moving
forward to put this build under our control, and resort back to using
the previous alpine base.
Solution: add the src/build/scheduler/Dockerfile to our maintained
files, and ensure we use an alpine base that has a filesystem.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch requested review from cmisale and milroy October 6, 2024 22:30
Copy link
Member

@milroy milroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good; thanks! As a note, we can expect more changes to JGF in the near future that will likely break Fluence.

@vsoch
Copy link
Member Author

vsoch commented Oct 10, 2024

This looks good; thanks! As a note, we can expect more changes to JGF in the near future that will likely break Fluence.

Totally OK, I am here to unbreak our very dapper Fluence. :bowtie:

@vsoch vsoch merged commit 29f411e into main Oct 10, 2024
8 checks passed
@vsoch vsoch deleted the control-builds branch October 10, 2024 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault in PreFilter
2 participants