Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(filter): unsubscribe waitgroup, execute async, and guard against calling funtions while the protocol is not started #692

Merged
merged 5 commits into from
Sep 4, 2023

Conversation

richard-ramos
Copy link
Member

@richard-ramos richard-ramos commented Aug 28, 2023

In #691 it was reported that in status-go, go-waku fails when unsubscribing from a filter full node. Looking at the different scenarios where this could happen, it seems that this could be caused due to a race condition between calling Unsubscribe and stopping the node. This s because in status-go, ideally the logout from an account should be fast, and it cannot wait for confirmations that the unsubscription was successful.

To avoid such scenarios, and keep the API unaffected, this PR introduces new options to allow executing the unsubscribe functions in a fire and forget fashion, to optionally allow specifying a Waitgroup, as well as adding some guard conditions to return an error if the filter client has not been started

Issue

Part of #634

@status-im-auto
Copy link

status-im-auto commented Aug 28, 2023

Jenkins Builds

Click to see older builds (62)
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ 1c4c3d0 #1 2023-08-28 19:21:13 ~1 min linux 📦deb
✔️ 1c4c3d0 #1 2023-08-28 19:22:04 ~2 min nix-flake 📄log
✔️ 1c4c3d0 #1 2023-08-28 19:23:10 ~3 min tests 📄log
✔️ 1c4c3d0 #1 2023-08-28 19:23:14 ~3 min tests 📄log
✔️ 1c4c3d0 #1 2023-08-28 19:23:50 ~3 min android 📦tgz
✔️ 1c4c3d0 #1 2023-08-28 19:24:21 ~4 min ios 📦tgz
✔️ 912b4c9 #2 2023-08-28 19:28:02 ~33 sec linux 📦deb
✔️ 912b4c9 #2 2023-08-28 19:28:56 ~1 min tests 📄log
✔️ 912b4c9 #2 2023-08-28 19:29:22 ~1 min nix-flake 📄log
✔️ 912b4c9 #2 2023-08-28 19:29:25 ~1 min tests 📄log
✔️ 912b4c9 #2 2023-08-28 19:30:41 ~3 min android 📦tgz
✔️ 912b4c9 #2 2023-08-28 19:30:43 ~3 min ios 📦tgz
✔️ 32e051e #3 2023-08-28 19:48:53 ~1 min linux 📦deb
✔️ 32e051e #3 2023-08-28 19:49:01 ~1 min tests 📄log
✔️ 32e051e #3 2023-08-28 19:49:30 ~1 min tests 📄log
✔️ 32e051e #3 2023-08-28 19:49:34 ~1 min nix-flake 📄log
✔️ 32e051e #3 2023-08-28 19:50:36 ~3 min android 📦tgz
✔️ 32e051e #3 2023-08-28 19:50:54 ~3 min ios 📦tgz
✖️ 7f3763a #4 2023-08-28 20:08:24 ~23 sec tests 📄log
✔️ 7f3763a #4 2023-08-28 20:08:32 ~31 sec linux 📦deb
✖️ 7f3763a #4 2023-08-28 20:09:30 ~1 min tests 📄log
✔️ 7f3763a #4 2023-08-28 20:09:52 ~1 min nix-flake 📄log
✔️ 7f3763a #4 2023-08-28 20:11:40 ~3 min android 📦tgz
✔️ 7f3763a #4 2023-08-28 20:12:04 ~4 min ios 📦tgz
✔️ d64a04f #5 2023-08-28 20:15:46 ~44 sec linux 📦deb
✔️ d64a04f #5 2023-08-28 20:16:59 ~1 min nix-flake 📄log
✔️ d64a04f #5 2023-08-28 20:17:54 ~2 min tests 📄log
✔️ d64a04f #5 2023-08-28 20:17:54 ~2 min tests 📄log
✔️ d64a04f #5 2023-08-28 20:18:09 ~3 min android 📦tgz
✔️ d64a04f #5 2023-08-28 20:18:19 ~3 min ios 📦tgz
✔️ 625b2c0 #6 2023-08-29 15:13:19 ~1 min linux 📦deb
✖️ 625b2c0 #6 2023-08-29 15:13:30 ~1 min tests 📄log
✖️ 625b2c0 #6 2023-08-29 15:13:34 ~1 min tests 📄log
✔️ 625b2c0 #6 2023-08-29 15:14:00 ~1 min nix-flake 📄log
✔️ 625b2c0 #6 2023-08-29 15:15:32 ~3 min android 📦tgz
✔️ 625b2c0 #6 2023-08-29 15:16:20 ~4 min ios 📦tgz
✖️ 14b5766 #7 2023-08-29 15:14:39 ~18 sec tests 📄log
✔️ 14b5766 #7 2023-08-29 15:14:48 ~32 sec linux 📦deb
✖️ 14b5766 #7 2023-08-29 15:15:39 ~1 min tests 📄log
✔️ 14b5766 #7 2023-08-29 15:16:39 ~2 min nix-flake 📄log
✔️ 14b5766 #7 2023-08-29 15:18:24 ~4 min ios 📦tgz
✔️ 14b5766 #7 2023-08-29 15:18:38 ~3 min android 📦tgz
✔️ ae0f000 #8 2023-08-29 15:16:51 ~31 sec linux 📦deb
✔️ ae0f000 #8 2023-08-29 15:18:10 ~1 min tests 📄log
✔️ ae0f000 #8 2023-08-29 15:18:47 ~2 min nix-flake 📄log
✔️ ae0f000 #8 2023-08-29 15:19:19 ~2 min tests 📄log
✔️ ae0f000 #8 2023-08-29 15:19:48 ~3 min ios 📦tgz
✔️ ae0f000 #8 2023-08-29 15:21:34 ~2 min android 📦tgz
✔️ f90ea71 #9 2023-08-29 15:34:09 ~33 sec linux 📦deb
✔️ f90ea71 #9 2023-08-29 15:36:10 ~2 min nix-flake 📄log
✔️ f90ea71 #9 2023-08-29 15:37:19 ~3 min android 📦tgz
✔️ f90ea71 #9 2023-08-29 15:37:21 ~3 min ios 📦tgz
✖️ f90ea71 #9 2023-08-29 15:39:22 ~5 min tests 📄log
✖️ f90ea71 #9 2023-08-29 15:41:46 ~8 min tests 📄log
✖️ f90ea71 #10 2023-08-29 17:54:55 ~5 min tests 📄log
✖️ f90ea71 #10 2023-08-29 18:12:00 ~20 min tests 📄log
✔️ 9d648cc #10 2023-08-29 18:14:33 ~44 sec linux 📦deb
✔️ 9d648cc #11 2023-08-29 18:15:20 ~1 min tests 📄log
✔️ 9d648cc #10 2023-08-29 18:15:43 ~1 min nix-flake 📄log
✔️ 9d648cc #11 2023-08-29 18:15:47 ~1 min tests 📄log
✔️ 9d648cc #10 2023-08-29 18:17:38 ~3 min android 📦tgz
✔️ 9d648cc #10 2023-08-29 18:18:44 ~4 min ios 📦tgz
Commit #️⃣ Finished (UTC) Duration Platform Result
✔️ b1acd53 #11 2023-08-31 20:38:24 ~30 sec linux 📦deb
✔️ b1acd53 #12 2023-08-31 20:39:15 ~1 min tests 📄log
✔️ b1acd53 #11 2023-08-31 20:39:46 ~1 min nix-flake 📄log
✔️ b1acd53 #12 2023-08-31 20:39:49 ~1 min tests 📄log
✔️ b1acd53 #11 2023-08-31 20:40:58 ~3 min android 📦tgz
✔️ b1acd53 #11 2023-08-31 20:41:40 ~3 min ios 📦tgz
✔️ a6a7922 #12 2023-09-03 23:39:18 ~30 sec linux 📦deb
✔️ a6a7922 #13 2023-09-03 23:40:08 ~1 min tests 📄log
✔️ a6a7922 #12 2023-09-03 23:40:40 ~1 min nix-flake 📄log
✔️ a6a7922 #13 2023-09-03 23:40:43 ~1 min tests 📄log
✔️ a6a7922 #12 2023-09-03 23:42:04 ~3 min ios 📦tgz
✔️ a6a7922 #12 2023-09-03 23:42:13 ~3 min android 📦tgz

@richard-ramos richard-ramos changed the title refactor(filter): unsubscribe waitgroup and async refactor(filter): unsubscribe waitgroup, execute async, and guard against calling funtions while the protocol is not started Aug 28, 2023
@richard-ramos
Copy link
Member Author

@chaitanyaprem, @harsh-98 if this pattern looks fine, in future PRs i'll add the same logic to other protocols that could run into the same issue.

Copy link
Collaborator

@cammellos cammellos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I would maybe add some tests to verify the behavior?

@richard-ramos richard-ramos force-pushed the async-filter-wg branch 2 times, most recently from 7f3763a to d64a04f Compare August 28, 2023 20:14
@richard-ramos
Copy link
Member Author

Added some tests to verify behaviour!

Copy link
Collaborator

@chaitanyaprem chaitanyaprem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from minor comments.

waku/v2/protocol/filter/options.go Outdated Show resolved Hide resolved
waku/v2/protocol/filter/filter_test.go Show resolved Hide resolved
waku/v2/protocol/filter/client.go Outdated Show resolved Hide resolved
@chaitanyaprem
Copy link
Collaborator

@chaitanyaprem, @harsh-98 if this pattern looks fine, in future PRs i'll add the same logic to other protocols that could run into the same issue.

Yup, makes sense.
If so, rather than copying this pattern..would it be possible to have this common pattern applied as part of some decorator rather than duplicating it completely.

@cammellos
Copy link
Collaborator

@richard-ramos with regards of the race condition between Start/Stop, a panic can still be replicated:

https://github.com/waku-org/go-waku/compare/async-filter-wg-test?expand=1

2023-08-29T11:20:23.492+0100    INFO    gowaku.filterv2-lightnode       filter/client.go:85     starting again
2023-08-29T11:20:23.492+0100    INFO    gowaku.filterv2-lightnode       filter/client.go:89     starting
2023-08-29T11:20:23.492+0100    INFO    gowaku.filterv2-lightnode       filter/client.go:113    stopping node
2023-08-29T11:20:23.492+0100    INFO    gowaku.filterv2-lightnode       filter/client.go:92     setting cancel
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe89bbb]

goroutine 302 [running]:
github.com/waku-org/go-waku/waku/v2/protocol/filter.(*WakuFilterLightNode).Stop(0x0?)
        /home/cammellos/documents/programming/status/go-waku/waku/v2/protocol/filter/client.go:115 +0x5b
github.com/waku-org/go-waku/waku/v2/protocol/filter.(*FilterTestSuite).TestStartStop.func1()
        /home/cammellos/documents/programming/status/go-waku/waku/v2/protocol/filter/filter_test.go:366 +0x38
created by github.com/waku-org/go-waku/waku/v2/protocol/filter.(*FilterTestSuite).TestStartStop
        /home/cammellos/documents/programming/status/go-waku/waku/v2/protocol/filter/filter_test.go:382 +0x119
exit status 2

That would happen if you call Start for the first time, and before it finishes Stop gets called.
Similarly with Unsubscribe

goroutine 320 [running]:
github.com/waku-org/go-waku/waku/v2/protocol/filter.(*WakuFilterLightNode).Unsubscribe(0xc000410c80, {0x14e0d68?, 0xc000467020}, {{0x1140fb8, 0x9}, {0xc0003f7640, 0x1, 0x1}}, {0xc00047df98, 0x1, ...})
        /home/cammellos/documents/programming/status/go-waku/waku/v2/protocol/filter/client.go:382 +0xdf
github.com/waku-org/go-waku/waku/v2/protocol/filter.(*FilterTestSuite).TestStartStop.func1()
        /home/cammellos/documents/programming/status/go-waku/waku/v2/protocol/filter/filter_test.go:369 +0xd9
created by github.com/waku-org/go-waku/waku/v2/protocol/filter.(*FilterTestSuite).TestStartStop
        /home/cammellos/documents/programming/status/go-waku/waku/v2/protocol/filter/filter_test.go:382 +0x1b7
exit status 2

You can replicate using the command:

go test -run TestFilterSuite -testify.m TestStartStop

and you can check for detected race conditions using

go test -run TestFilterSuite -testify.m TestStartStop -race

The issue is that CompareAndSwap only protect the beginning of the function, but after that, all bets are off.
This currently only happens between the first Start and Stop, if Start successfully finishes, then I could not replicate race conditions, since state is not nullled, though there might be some inconsistencies, as the node might not be fully set up once one of these calls are issued.

Probably worth discussing how to address in the call, locks are obviously a solution, but we need to make sure we don't deadlock, but we can discuss later.

@richard-ramos
Copy link
Member Author

That would happen if you call Start for the first time, and before it finishes Stop gets called.

The issue is that CompareAndSwap only protect the beginning of the function, but after that, all bets are off.
This currently only happens between the first Start and Stop, if Start successfully finishes, then I could not replicate race conditions, since state is not nullled, though there might be some inconsistencies, as the node might not be fully set up once one of these calls are issued.

🤔 interesting. We do use CompareAndSwap in discv5, so I wonder if this scenario could happen there too.
cc: @harsh-98

@richard-ramos richard-ramos force-pushed the async-filter-wg branch 2 times, most recently from 14b5766 to ae0f000 Compare August 29, 2023 15:16
@richard-ramos
Copy link
Member Author

@cammellos with latest commit, I copied your test and replaced the instances of CompareAndSwap in filter client and it is not panicking anymore!

Copy link
Collaborator

@cammellos cammellos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great @richard-ramos , thank you!

waku/v2/protocol/filter/client.go Show resolved Hide resolved
@richard-ramos richard-ramos force-pushed the async-filter-wg branch 3 times, most recently from 1c82a01 to 9d648cc Compare August 29, 2023 18:10
Copy link
Collaborator

@chaitanyaprem chaitanyaprem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
A comment regarding probable impact on performance due to this lock.

waku/v2/protocol/filter/client.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@chaitanyaprem chaitanyaprem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@richard-ramos richard-ramos merged commit 8aa1c4a into master Sep 4, 2023
11 checks passed
@richard-ramos richard-ramos deleted the async-filter-wg branch September 4, 2023 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants