Metrics collector fails to create watcher #1769

Closed
drawesomenic opened this issue Jan 11, 2022 · 19 comments

@drawesomenic

/kind bug

What steps did you take and what happened:
I started Katib runs using Kale. About 50% of the pipelines succeed and about 50% fail at random, with the following error from the "metrics-logger-and-collector" container:

Mon, Jan 10 2022 4:07:47 pm | I0110 15:07:47.414005 20 main.go:342] Trial Name: test-dev-blo6q-ptpgnwzg
Mon, Jan 10 2022 4:07:47 pm | 2022/01/10 15:07:47 FATAL -- failed to create Watcher
Mon, Jan 10 2022 4:07:47 pm | goroutine 34 [running]:
Mon, Jan 10 2022 4:07:47 pm | runtime/debug.Stack()
Mon, Jan 10 2022 4:07:47 pm | /usr/local/go/src/runtime/debug/stack.go:24 +0x65
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/util.Fatal({0xcc1a11, 0x0}, {0x0, 0x0, 0x0})
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x97
Mon, Jan 10 2022 4:07:47 pm | github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc0000bc000)
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x68
Mon, Jan 10 2022 4:07:47 pm | created by github.com/hpcloud/tail/watch.glob..func1
Mon, Jan 10 2022 4:07:47 pm | /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x173

What did you expect to happen:
In the succeeding pipelines no error is thrown, and the container shows normal output instead:

Wed, Jan 5 2022 8:26:38 pm | I0105 19:26:37.970244 16 main.go:342] Trial Name: test-dev-gtbb0-847s8svl
Wed, Jan 5 2022 8:26:39 pm | I0105 19:26:39.075769 16 main.go:136] 2022-01-05 19:26:39 Kale kfputils:176 [INFO] Creating KFP experiment 'test-dev-gtbb0'...

Anything else you would like to add:
I also tried increasing the resources via katib-config, but that did not resolve the issue. The error is not tied to specific pipeline parameters; it happens randomly. The workflow itself completes successfully, but because the "metrics-logger-and-collector" container fails, the related job and trial fail as well.
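
To quantify how often the failure occurs across trials, saved collector logs can be scanned for the fatal line shown above. The following is a minimal sketch, assuming each container's log has already been captured as text (for example with kubectl logs); the function name summarize_collector_log is hypothetical and not part of Katib or Kale.

```python
import re

# Patterns taken from the log lines quoted in this issue.
TRIAL_RE = re.compile(r"Trial Name: (\S+)")
FATAL_MARKER = "FATAL -- failed to create Watcher"


def summarize_collector_log(log_text):
    """Return (trial_name, watcher_failed) for one collector container log.

    trial_name is None when no "Trial Name:" line is present.
    """
    match = TRIAL_RE.search(log_text)
    trial_name = match.group(1) if match else None
    return trial_name, FATAL_MARKER in log_text
```

Running this over the logs of every trial in an experiment gives a failure rate that can be compared before and after a configuration change.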

Environment:

  • Katib version (check the Katib controller image version): 0.12.0
  • Kubernetes version: (kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
  • OS (uname -a): Linux dashboard-shell-w5nrd 5.4.0-88-generic 99-Ubuntu SMP Thu Sep 23 17:29:00 UTC 2021 x86_64 Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

@jbottum

jbottum commented Jan 11, 2022

/kind question
/priority p2
/area katib

@andreyvelich
Member

Thank you for creating this @drawesomenic.
Did you try using the File metrics collector instead of StdOut?
Also, can you show me the Entrypoint command for the Trial training job container?

@andreyvelich
Member

It might be this issue: hpcloud/tail#151 (comment).
Did you build your own Metrics Collector image on aarch64?

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@knkski
Contributor

knkski commented May 13, 2022

I'm also getting this error periodically with the default metrics collector image on x86:

I0513 15:40:02.000547      66 main.go:394] Trial Name: orbit-dlt-g6gm7-6qwczzt9
2022/05/13 15:40:02 FATAL -- failed to create Watcher
goroutine 18 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/hpcloud/tail/util.Fatal({0xd083fb?, 0x0?}, {0x0, 0x0, 0x0})
	/go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x97
github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc000132040)
	/go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x68
created by github.com/hpcloud/tail/watch.glob..func1
	/go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x16e

This was with the StdOut collector, though it looks like I can also replicate it with the File metrics collector:

metricsCollectorSpec:
  collector:
    kind: StdOut

If it matters, this is running on MicroK8s on my laptop.
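
For anyone comparing the two collector kinds mentioned here, the sketch below builds the metricsCollectorSpec fragment as a Python dict for either kind. It is an illustration only: the helper name metrics_collector_spec and the example log path are hypothetical, and the File variant's source.fileSystemPath layout reflects my understanding of the Katib v1beta1 Experiment API, so it should be checked against the Katib documentation before use.

```python
def metrics_collector_spec(kind="StdOut", log_path=None):
    """Build a metricsCollectorSpec fragment for an Experiment manifest.

    kind:     "StdOut" or "File".
    log_path: file the training container writes metrics to (File kind only).
    """
    spec = {"collector": {"kind": kind}}
    if kind == "File":
        if not log_path:
            raise ValueError("the File collector needs a log_path")
        # Assumed layout of the File source; verify against the Katib docs.
        spec["source"] = {"fileSystemPath": {"path": log_path, "kind": "File"}}
    return {"metricsCollectorSpec": spec}
```

Calling metrics_collector_spec() reproduces the StdOut fragment quoted above, while metrics_collector_spec("File", "/var/log/katib/metrics.log") adds the source block with a placeholder path.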

@stale stale bot removed the lifecycle/stale label May 13, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

Sorry for the late reply. Are you still experiencing this issue?


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


github-actions bot commented Jan 1, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

@AndersBennedsgaard

@andreyvelich could this be re-opened? I also hit this with docker.io/kubeflowkatib/file-metrics-collector:v0.16.0 at random intervals.

@gigabyte132

+1 I would like this reopened, I am also running into the same issues at times with docker.io/kubeflowkatib/file-metrics-collector:v0.16.0

@andreyvelich
Member

@AndersBennedsgaard @gigabyte132 Could you please try the latest Katib release, v0.17.0?
We migrated to a different file watcher in #2375, which may fix your issue.

@gigabyte132

Hi @andreyvelich, I seem to be running into the same issue with docker.io/kubeflowkatib/file-metrics-collector:v0.17.0 as well. This is with the StdOut collector.

I0924 12:55:20.295898      14 main.go:396] Trial Name: enas-cpu-vxglz6wk
2024/09/24 12:55:20 FATAL -- failed to create Watcher
goroutine 18 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x5e
github.com/hpcloud/tail/util.Fatal({0xdce7f6?, 0x0?}, {0x0, 0x0, 0x0})
        /go/pkg/mod/github.com/hpcloud/[email protected]/util/util.go:22 +0x8b
github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc000296000)
        /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:219 +0x65
created by github.com/hpcloud/tail/watch.init.func1 in goroutine 17
        /go/pkg/mod/github.com/hpcloud/[email protected]/watch/inotify_tracker.go:54 +0x14e

@andreyvelich
Member

@gigabyte132 But it should be nxadm, not hpcloud, doing the tailing.
Can you show me the output of kubectl get trial enas-cpu-vxglz6wk -n <NAMESPACE>?

@gigabyte132

@andreyvelich it seems like the change from hpcloud to nxadm is not in the 0.17 release, see
https://github.com/kubeflow/katib/blob/v0.17.0/cmd/metricscollector/v1beta1/file-metricscollector/main.go#L52

@andreyvelich
Member

Oh, you are right, we haven't cherry-picked this change into the 0.17 release.
Can you try using the latest commit for your image: 867c40a

This is the image tag: https://hub.docker.com/layers/kubeflowkatib/file-metrics-collector/v1beta1-867c40a/images/sha256-3ab68e0932dd6c2028592dd7a7443ba4970e54f91ab145d6d35828112780eb0a?context=explore

@gigabyte132

Sadly it seems to be the same with nxadm

2024/09/24 13:19:08 FATAL -- failed to create Watcher
goroutine 18 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:26 +0x5e
github.com/nxadm/tail/util.Fatal({0xe14a9b?, 0xc000282000?}, {0x0, 0x0, 0x0})
        /go/pkg/mod/github.com/nxadm/[email protected]/util/util.go:23 +0x8b
github.com/nxadm/tail/watch.(*InotifyTracker).run(0xc0002b6000)
        /go/pkg/mod/github.com/nxadm/[email protected]/watch/inotify_tracker.go:220 +0x68
created by github.com/nxadm/tail/watch.init.func1 in goroutine 17
        /go/pkg/mod/github.com/nxadm/[email protected]/watch/inotify_tracker.go:55 +0x14e
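
Both the hpcloud and nxadm traces end in InotifyTracker.run with "failed to create Watcher", so the failure happens while the tail library sets up its inotify-based watcher. One thing that may be worth checking, as an assumption rather than a cause confirmed in this thread, is whether the node is running into the kernel's inotify limits. The sketch below reads those limits from the standard Linux sysctl files; the function name read_inotify_limits is hypothetical.

```python
from pathlib import Path


def read_inotify_limits(sysctl_dir="/proc/sys/fs/inotify"):
    """Read the kernel's inotify limits from sysctl files.

    Returns a dict with whichever limits could be read; the dict is
    empty on systems where the directory does not exist.
    """
    limits = {}
    base = Path(sysctl_dir)
    for name in ("max_user_instances", "max_user_watches", "max_queued_events"):
        try:
            limits[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not an integer: skip this limit.
            continue
    return limits


if __name__ == "__main__":
    print(read_inotify_limits())
```

Running it on the node that hosts a failing trial (or in a pod, which sees the node's values) shows the limits to compare against the number of pods and watchers on that node.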

@andreyvelich
Member

I see, thanks for testing it.
Could you please create a dedicated issue for it, @gigabyte132?

@gigabyte132

@andreyvelich I have opened a new issue, #2434. Let me know if you need any more information from me.


6 participants