steps not displaying in UI after v3.5.1 #12165
Comments
Nodes missing issue was also mentioned here #11948 (comment) |
@agilgur5 Could this be caused by the UI refactoring? |
UI changelog analysis
So the node graph specifically hasn't changed between 3.5.0-rc1 and 3.5.1. The only recent change to it was the "Check All" checkbox from #11132, but that existed in 3.5.0-rc1 as well. #11948 (comment) also says it didn't occur in 3.4.11 and only occurred after. 3.4.12 and 3.4.13 have no UI refactors and almost no UI changes either (just date filter fixes). So I'm not so sure that this is a UI bug.
more information needed
@rwong2888 your screenshots seem to be of two different Workflows (perhaps the same underlying template, but two different Workflows nonetheless). An ideal comparison would be if the same exact, already completed Workflow shows up differently in 3.5.0-rc1 vs. 3.5.1. Also it seems you may perhaps have two different servers as well (as one has "staging" in its name)? It would be good to see the graph filters that are being used, especially as those are saved to |
Here is staging. Here are the pods that ran for the workflow. Notice how update-appset-and-pr, get-argocd-cluster, perform-dns-transaction, and argocd-sync do not appear in the UI. Here are my filters. I reverted to v3.5.0-rc1 and resubmitted the workflow; this is how it is supposed to look. Let me know if this helps or if anything else is required. If necessary, I can figure out grabbing all our templates... Here are my filters on rc1 |
@rwong2888 Can you try v3.5.0-rc2? It would also be great if you could provide a smaller example that produces this. Can you also check if the missing nodes (from the UI) exist in your nodeStatus in workflow YAML? |
@terrytangyuan , I tried with v3.5.0-rc2 and it displays as expected. I do see the non displayed nodes in the workflow yaml. I will see if I can come up with a smaller example. |
As I mentioned above -- it would be good to see without resubmission. If it's a UI bug, an already completed Workflow should show up differently in 3.5.1 vs. 3.5.0-rc1. The same exact Workflow, not a resubmitted one. Based on how you wrote it, my guess is that the same exact Workflow actually looked the same in 3.5.1 vs. 3.5.0-rc1 (and that's why you resubmitted it?). That would actually be indicative of a Controller bug -- as in, the version you originally ran the Workflow on matters.
If it's a Controller bug, this would suggest that the Pods ran but weren't tracked by the Controller in the Workflow resource for some reason. (which would be a pretty big bug)
This suggests it is being tracked though... 🤔 Although I'm not sure if that was the Workflow that was originally run in 3.5.0-rc1. We'd want to see the node status in the completed Workflow YAML (i.e. If the node status appears in the completed YAML on a Workflow that ran in 3.5.0-rc1, then I'd have some follow-up questions:
That may be an issue with the event stream if so (potentially a bug in the Server)
I see that
That would narrow it down to a bugfix that's in both 3.5.1 and {3.4.12 or 3.4.13} |
Here's the common changes: 3.4.13: 3.4.12: I crossed out the ones that are around other parts of the codebase ("oss list bucket") or are pure logging changes. @carolkao wondering if you or your team might be able to confirm if the steps not displaying happens in 3.4.12 as well, or if it's only limited to 3.4.13? |
My misunderstanding. The workflow submitted on 3.5.1 displays the same for the same workflow when reverted to 3.5.0-rc2 and 3.5.0-rc1.
Nodes are missing on 3.5.1, 3.5.0-rc2, and 3.5.0-rc1.
Yes
I suspect this is the case. The last steps are failures, but with ignore Failure steps with
Here is the yaml for deploy-hello-secret-world-staging-d68kk, which was submitted on v3.5.0-rc1 deploy-hello-secret-world-staging-d68kk.txt And for comparison, the yaml for deploy-hello-secret-world-staging-bq8zb, which was submitted on v3.5.1 |
Big thanks for compiling together all the debugging details! That helps a lot! ❤️
That rules out a UI bug and a Server bug actually. So this is likely a Controller bug -- the YAML's gotta be different in some way then.
Thanks, that's potentially helpful to narrow it down.
One problem here -- per the screenshot in the UI, it looks like this Workflow does correctly show the steps after Also I'm now thinking we may want to add an annotation of the Controller version a Workflow (or other resource) was submitted with for debugging cases like these 💭 |
The issue is what happens after the snyk-scan. The gate and the entire cd nodes disappear. I am actually trying to reproduce on a smaller scale without success. One thing I notice is that the workflow onExit is showing up as a separate block instead of a contiguous line. |
Yea I got that. The screenshot of EDIT: woops, 3.5.0-rc1 is working correctly, 3.5.1 is the broken one. |
Correct, d68kk which was fired on the rc is how I'd expect the workflow to appear. bq8zb is the one fired on 3.5.1, which has the missing nodes, e.g. |
The controller bug will almost certainly be a missing child entry from the last node that correctly displays. |
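A minimal sketch of how one might check for that kind of missing child link, assuming the Workflow's `status.nodes` map has already been loaded into a plain Python dict keyed by node ID. The function name `unreachable_nodes`, the toy node IDs, and the choice of root IDs are all hypothetical; in a real Workflow you would pass the root node ID (and the onExit node ID, if you want it treated as a second root).

```python
from collections import deque


def unreachable_nodes(nodes, roots):
    """Return sorted IDs of nodes that cannot be reached from `roots`
    by following `children` links (hypothetical helper, not Argo code)."""
    seen = set()
    queue = deque(r for r in roots if r in nodes)
    while queue:
        node_id = queue.popleft()
        if node_id in seen:
            continue
        seen.add(node_id)
        # A node with no `children` key is treated as a leaf.
        for child in nodes[node_id].get("children", []):
            if child in nodes:
                queue.append(child)
    return sorted(set(nodes) - seen)


# Toy data (made up): "wf-1" has lost its link to "wf-2".
nodes = {
    "wf": {"displayName": "wf", "children": ["wf-1"]},
    "wf-1": {"displayName": "snyk-scan"},
    "wf-2": {"displayName": "gate-snyk", "children": ["wf-3"]},
    "wf-3": {"displayName": "cd"},
}
orphans = unreachable_nodes(nodes, roots=["wf"])
print(orphans)
```

Nodes that exist in the map but are never reached this way are the ones a graph walk starting from the root would not draw, which matches the symptom of Pods that ran but do not appear in the UI.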
Oh right, I mixed up the versions. I've been staring at these a little too much 😅
|
Ok it looks like there are indeed missing children (as expected) on all the
deploy-hello-secret-world-staging-d68kk-4191180866:
  boundaryID: deploy-hello-secret-world-staging-d68kk-484182009
  children:
  - deploy-hello-secret-world-staging-d68kk-3662124613
  displayName: snyk-container
deploy-hello-secret-world-staging-d68kk-370978563:
  boundaryID: deploy-hello-secret-world-staging-d68kk-484182009
  children:
  - deploy-hello-secret-world-staging-d68kk-3662124613
  displayName: snyk-test
deploy-hello-secret-world-staging-d68kk-3726739734:
  boundaryID: deploy-hello-secret-world-staging-d68kk-484182009
  children:
  - deploy-hello-secret-world-staging-d68kk-3662124613
  displayName: snyk-code
deploy-hello-secret-world-staging-d68kk-1010882296:
  boundaryID: deploy-hello-secret-world-staging-d68kk-3836212511
  children:
  - deploy-hello-secret-world-staging-d68kk-130060219
  displayName: gate-snyk
deploy-hello-secret-world-staging-d68kk-714360992:
  boundaryID: deploy-hello-secret-world-staging-d68kk-3732999220
  children:
  - deploy-hello-secret-world-staging-d68kk-4275241402
  displayName: cd

deploy-hello-secret-world-staging-bq8zb-1243511735:
  boundaryID: deploy-hello-secret-world-staging-bq8zb-2683914452
  displayName: snyk-container
deploy-hello-secret-world-staging-bq8zb-424322676:
  boundaryID: deploy-hello-secret-world-staging-bq8zb-2683914452
  displayName: snyk-test
deploy-hello-secret-world-staging-bq8zb-2914944037:
  boundaryID: deploy-hello-secret-world-staging-bq8zb-2683914452
  displayName: snyk-code
deploy-hello-secret-world-staging-bq8zb-3342569505:
  boundaryID: deploy-hello-secret-world-staging-bq8zb-244736368
  displayName: gate-snyk
deploy-hello-secret-world-staging-bq8zb-2523295271:
  boundaryID: deploy-hello-secret-world-staging-bq8zb-1491720383
  children:
  - deploy-hello-secret-world-staging-bq8zb-2670969283
  displayName: cd

This confirms that it is a Controller bug. Now we have to figure out why it's missing children. |
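The comparison above can also be done mechanically. Below is a rough sketch, assuming both Workflows' `status.nodes` maps are available as Python dicts and that `displayName` is unique enough to pair nodes across the two runs (that is an assumption; it will not hold for every Workflow). The helper names are hypothetical, and the data is just the excerpt quoted in this thread with `boundaryID` omitted.

```python
def child_counts(nodes):
    """Map displayName -> number of children (hypothetical helper)."""
    return {
        node.get("displayName", node_id): len(node.get("children", []))
        for node_id, node in nodes.items()
    }


def lost_children(good_nodes, bad_nodes):
    """Names of nodes that have children in the good run but none in the bad run."""
    good = child_counts(good_nodes)
    bad = child_counts(bad_nodes)
    return sorted(
        name for name, count in good.items()
        if count > 0 and name in bad and bad[name] == 0
    )


G = "deploy-hello-secret-world-staging-d68kk"  # submitted on v3.5.0-rc1
B = "deploy-hello-secret-world-staging-bq8zb"  # submitted on v3.5.1

good_nodes = {
    f"{G}-4191180866": {"displayName": "snyk-container", "children": [f"{G}-3662124613"]},
    f"{G}-370978563": {"displayName": "snyk-test", "children": [f"{G}-3662124613"]},
    f"{G}-3726739734": {"displayName": "snyk-code", "children": [f"{G}-3662124613"]},
    f"{G}-1010882296": {"displayName": "gate-snyk", "children": [f"{G}-130060219"]},
    f"{G}-714360992": {"displayName": "cd", "children": [f"{G}-4275241402"]},
}
bad_nodes = {
    f"{B}-1243511735": {"displayName": "snyk-container"},
    f"{B}-424322676": {"displayName": "snyk-test"},
    f"{B}-2914944037": {"displayName": "snyk-code"},
    f"{B}-3342569505": {"displayName": "gate-snyk"},
    f"{B}-2523295271": {"displayName": "cd", "children": [f"{B}-2670969283"]},
}

missing = lost_children(good_nodes, bad_nodes)
print(missing)
```

On this excerpt the four snyk-related nodes lose their `children` entries while `cd` keeps its own, which is consistent with everything after snyk-scan disappearing from the graph.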
Are the actual changes in v3.5.1 the same as what the changelog lists? I also see some retry logs in the submitted log. (I wonder why they are printed; I could not find where this log is emitted.)
|
We have another report from a different user/team of it not working in 3.4.13 when it previously worked in 3.4.11. So the changes I listed above were the common changes between 3.5.1 and 3.4.12-3.4.13. Technically speaking though, it could also be due to a combination of different commits which could make it appear in both versions, although that would be a real head-turner. |
Hi @agilgur5, from my testing, the UI displays normally in 3.4.12; the missing nodes issue happens only in the 3.4.13 Argo UI. |
@carolkao What’s your Argo CLI version? Can you try running v3.4.13 argo-server and v3.4.12 controller together and see if the problem persists? Do you have a small reproducible example to share? |
My Argo CLI version is 3.4.11. You can reproduce the issue with this example: ReproduceExample.zip |
Big thanks for the confirmation and the repro @carolkao ! So this confirms that it is a Controller bug, and it is in 3.4.13 and 3.5.1. |
Ok after staring at the code a decent bit yesterday and today, I think I figured it out... this may be the case of a subtle logic bug from years-old code that eventually got exposed completely accidentally... #12130 removed this line. That line was also removed in #11379. One tiny problem I discovered: Now that code is for DAGs, while this is occurring on steps. Well there's pretty much identical code for steps. This line from #693 added similar code to steps and #12130 / #11379 were similar in modifying both. Steps's
I have to step out for a bit, but I'm gonna run some tests with some fixes for that and hopefully will be back with a PR fix to this gnarly root cause! |
@terrytangyuan have you been able to test if that fixes it? I haven't been at my computer since my last comment (unfortunately not been feeling well), so haven't had a chance to confirm precisely (I am about 99% sure in theory though). Otherwise yes, I will rework it and fix the root cause in a follow-up. I left a comment on the PR that we can have a more targeted revert as well, by only reverting the |
This is a partial revert of argoproj#12130 and fixes argoproj#12165 Offered as an alternative to a full revert. Signed-off-by: Alan Clucas <[email protected]>
Yes, I have verified it, but it looks like @Joibel sent a fix. Let's use that instead. |
This test is providing a regression test for argoproj#12165. As promised in It verifies that a child link is correctly made in the case of a step with outputs. Signed-off-by: Alan Clucas <[email protected]>
Pre-requisites
:latest
What happened/what you expected to happen?
v3.5.1 steps after snyk-scan no longer displayed. they do still run.
v3.5.0-rc1 shows all the steps after snyk-scan.
snyk-scan has an exit handler and ignore errors on it. the entire workflow also has an exit handler.
note: logs have been truncated to 65000 chars.
Version
v3.5.1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
n/a
Logs from the workflow controller
Logs from in your workflow's wait container