Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Flash failure on CI test #2964

Merged
merged 2 commits into from
Feb 16, 2022

Conversation

linked0
Copy link
Contributor

@linked0 linked0 commented Feb 4, 2022

The unit tests in agora.test.Flash fail often on CI because there is
a timing issue. Unit tests expect update_txs not to be sent as soon as
a block is externalized when the allow_publish is set to false. But
the update_tx is sent as soon as the allow_publish is to set true
by calling setPublishEnable, which is not to be supposed to be sent.

Fixes #2425

The error log on CI was following,
2022-01-28T02:49:34.7716990Z 2022-01-28 02:49:32,15 �[32mInfo�[0m [agora.flash.Channel] - Alice: Publishing update tx 4: 0x5fdc75. Result: { status: Status.Rejected, reason: "Double spend comes with a less-than-acceptable fee increase" }

This was because a charlie node succeeded to send an update tx that has the same input as the above tx, but the tx should not be sent because allow_publish is thought to have a false value at the very moment.

@codecov
Copy link

codecov bot commented Feb 4, 2022

Codecov Report

Merging #2964 (1f2ce33) into v0.x.x (edf2052) will increase coverage by 1.16%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           v0.x.x    #2964      +/-   ##
==========================================
+ Coverage   79.00%   80.16%   +1.16%     
==========================================
  Files         209      202       -7     
  Lines       18586    18329     -257     
==========================================
+ Hits        14683    14693      +10     
+ Misses       3903     3636     -267     
Flag Coverage Δ
integration 18.12% <ø> (-2.04%) ⬇️
unittests 87.44% <100.00%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
source/agora/flash/Channel.d 86.41% <100.00%> (+0.02%) ⬆️
source/agora/flash/Node.d 80.67% <100.00%> (ø)
source/agora/test/Flash.d 95.23% <100.00%> (+0.02%) ⬆️
source/scpd/scp/SCPDriver.d 60.00% <0.00%> (-15.00%) ⬇️
source/agora/utils/Log.d 48.00% <0.00%> (-10.67%) ⬇️
source/agora/consensus/protocol/Nominator.d 91.70% <0.00%> (-0.38%) ⬇️
...odules/ocean/src/ocean/core/ExceptionDefinitions.d
agora/submodules/ocean/src/ocean/core/Test.d
agora/submodules/ocean/src/ocean/core/Buffer.d
agora/submodules/ocean/src/ocean/core/BitArray.d
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update edf2052...1f2ce33. Read the comment docs.

@linked0
Copy link
Contributor Author

linked0 commented Feb 4, 2022

All green!

@@ -1924,6 +1924,10 @@ unittest
network.clients[0].postTransaction(update_tx);
assert(!network.clients[0].hasTransactionHash(update_tx.hashFull()));

// make the `charlie` node not to send `update tx 6` unintentionally
// with `allow_publish` set to `false`, which cause failure to this test
Thread.sleep(1.seconds);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I... Don't understand how that would solve the bug ?
Usually our test do:
a) Action
b) Test
c) Another action
d) Another test

Etc... Those "test" usually check a certain value (the state).
Whenever we have a timing issue, we place a sleep between an action and a test, allowing more time for the action to take effect (and change the state). This sleep is between a "Check" and an "Action", meaning it should not have any impact, unless a previous action has not completed by the time another action/test combo starts ?

Copy link
Contributor Author

@linked0 linked0 Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me explain why this would solve the bug.
This test expects that the calls, alice.setPublishEnable(false) and charlie.setPublishEnable(false), keep the update transaction from being published right after the fourth block is externalized. But in a failing situation, some time is delayed after the block is externalized, and the charlie.setPublishEnable(true) is called before charlie's call of publishing the update tx start, which results in the UTXO being already used which was supposed to be used by alice.

We could compare the logs of a succeeding situation and a failing situation as follows. The point is that sending the tx by charlie should be skipped in the normal situation.

Succeeding situation

2022-02-03 17:11:22,482 Info [agora.flash.Node] - Block #4 is externalized...
2022-02-03 17:11:22,483 Info [agora.flash.Channel] - Charlie: Update tx externalized(0x448d53) on height 4
2022-02-03 17:11:22,488 Info [agora.flash.Node] - Skipping publishing { inputs: [{ utxo: 0x1760ae909bfa48da9a19c5afc33640ab24ab791c1df7b4dcf34d59abcffe5c4a9e7e6d89c5c33bcf529002689a3b98f241c349cfffff36536e97d02f2804ec8e, unlock: 0x4903af0f1cf5ca5c00ce8008b0fa4995455a221c24c9e33d3465ac6c14114e310a4bebd1f4bac5a6c85a8b5749da0e01d617c518e4c23f45861cdf2bdf5da790370e010000000000000008060000000000000054, unlock_age: 0 }, { utxo: 0xaeb07e3b72cba7307957f0d776d00377e6b08760f4d8d679c4e17cc58a6071a1f8c91ded90d44742b1adeed208d81869d8c1529e232cc8c704ac4ce09b99c22a, unlock: 0x254f7660f3bed39bc636197ecaf321d35139cddda7db5cfa7294b0fcf32055021ae3a6dea515cdb1ee77a104048954b581c6470fc759538f7e4cb46e6da89bbe01, unlock_age: 0 }], outputs: [{ type: const(OutputType).Payment, lock: boa1xrc00kar2yqa3jzve9cm4cvuaa8duazkuwrygmqgpcuf0gqww8ye7ua9lkl, value: 594999999514200 }, { type: const(OutputType).Payment, lock: Lock(Script, 55200dadde3844d852b657dc946cd7d979e0b8882f0488d0e9184719a4241579863008070000000000000064545704030000005e203daf29f097ee46a48ffd19cb1cc4759f401cdf690214e447a88d4144b614f2f9080600000000000000645458), value: 10000 }], payload: [], lock_height: 0 }
2022-02-03 17:11:22,488 Info [agora.flash.Channel] - Charlie: Publishing update tx 6: 0xfe730c. Result: { status: Status.Accepted, reason: "" }

Failing situation

2022-01-28T02:49:34.7804380Z 2022-01-28 02:49:31,922 [32mInfo[0m [agora.flash.Node] - Block #4 is externalized...
2022-01-28T02:49:34.7805050Z 2022-01-28 02:49:31,925 [32mInfo[0m [agora.flash.Channel] - Charlie: Update tx externalized(0x448d53) on height 4
2022-01-28T02:49:34.7805700Z 2022-01-28 02:49:32,12 [32mInfo[0m [agora.flash.Channel] - Charlie: Publishing update tx 6: 0xfe730c. Result: { status: Status.Accepted, reason: "" }

@linked0 linked0 force-pushed the 2425-fix-flash-error branch 5 times, most recently from fbd1cca to 2502743 Compare February 10, 2022 05:43
@linked0 linked0 changed the title Fix failure on CI test Fix Flash failure on CI test Feb 10, 2022
@linked0 linked0 force-pushed the 2425-fix-flash-error branch 6 times, most recently from fa9465f to 4d0f431 Compare February 10, 2022 07:03
@linked0
Copy link
Contributor Author

linked0 commented Feb 11, 2022

:large_green_circle:
This PR does not fix all the failures, but I can solve other failures on Flash unit tests based on this PR.

@linked0 linked0 force-pushed the 2425-fix-flash-error branch 5 times, most recently from 0c96ba2 to 6694d1c Compare February 13, 2022 07:20
@hewison-chris
Copy link
Contributor

Do we really need to add Height to the production code to enable testing that publishing old updates are handled correctly?Updating TestFlashNode and the unit test should be enough. I will take a closer look into this today.

We are gonna use the height for checking when states has changed.
The unittests in `agora.test.Flash` fails often on CI because there is
a timing issue. Unittests expect `update_tx`s not to be sent as soon as
a block is externalized when the `allow_publish` is set to `false`. But
the `update_tx` is sent as soon as the `allow_publish` is to set `true`
by calling `setPublishEnable`, which is not to be supposed to be sent.
@linked0
Copy link
Contributor Author

linked0 commented Feb 15, 2022

Modified the code after being reviewed and improved more on the tests about Flash. There was a timing issue where an update tx was being sent before the validator being referenced did not accept a new block. I think this PR should be merged for other PRs if there's no big problem although this modification does not solve the root cause which is about the logic of publishing update txs on which I will file a new issue. I may as well focus on issue #2971 now which is about other failures on CI.

@hewison-chris I don't think there's any variable being added to the production code, but I will look into more if you let me know about that.

@hewison-chris
Copy link
Contributor

@hewison-chris I don't think there's any variable being added to the production code, but I will look into more if you let me know about that.

Sorry I didn't realize Flash.Listener.State is test code. In that case I think it is ok to merge this.

Copy link
Contributor

@hewison-chris hewison-chris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hewison-chris hewison-chris merged commit 4544d62 into bosagora:v0.x.x Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flash unittests sometimes crashes
3 participants