Remove abusable code from subnet template #88
…hotkey exists if it exists in metagraph)
IMO it would be better just to handle these edge cases explicitly; otherwise you will get errors in logs where it's clear it was just someone sending malformed requests. And if that was you, then you would also benefit from a proper description of that.
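A sketch of what handling those edge cases explicitly could look like. The function signature and the `KNOWN_HOTKEYS` stand-in are assumptions for illustration, not the actual template code (the real template works against bittensor's `Synapse` and metagraph objects):

```python
from typing import Optional, Tuple

# Stand-in for metagraph.hotkeys; purely illustrative.
KNOWN_HOTKEYS = {"hotkey-A", "hotkey-B"}

def blacklist_fn(hotkey: Optional[str]) -> Tuple[bool, str]:
    """Return (blacklisted, reason) with a clear message for each edge case,
    instead of letting an exception propagate into the logs."""
    if not hotkey:
        # Malformed request: no hotkey attached at all.
        return True, "Missing dendrite hotkey"
    if hotkey not in KNOWN_HOTKEYS:
        # Unknown caller: not registered on this subnet.
        return True, "Unrecognized hotkey"
    return False, "Hotkey recognized"
```

With this shape, a malformed request produces a one-line, self-describing rejection rather than a traceback.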
I must admit I don't get this.
a) Overhead of uncaught exception: isn't that exactly how blacklist operates, by throwing an exception? https://github.com/opentensor/bittensor/blob/16f26047bbcce7c8e283599ad9523de0638891d3/bittensor/axon.py#L1318
b) Amplification of response: wasn't the response just the string value of the exception text? Why was it bigger? Do you have an example?
Btw, in bittensor v7 there are some changes in how exceptions are handled, and even less information is actually returned to the client: https://github.com/opentensor/bittensor/pull/1822/files#diff-7b2c4a5751ea6956b701861216059a5cef7c3c3e236e77e9295b0b8f6a6438e2R978
So it should make amplification even less possible.
Logs show amplification in response size with the ValueError response vs the non-registered rejection, and increased delay in processing. After moving that line below, I saw a ~90% reduction in latency, down to only a few ms per request. The volume was thousands of requests per second, so I think any additional overhead could cause issues.
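A sketch of the reordering described above (names and the exact lookup are assumptions): the membership check short-circuits first, so the `.index()`-style lookup that raises `ValueError` for unknown hotkeys is never reached on malformed or unregistered requests.

```python
from typing import Optional, Tuple

HOTKEYS = ["hotkey-A", "hotkey-B"]  # stand-in for metagraph.hotkeys

def blacklist_fn(hotkey: Optional[str]) -> Tuple[bool, str]:
    # Check membership BEFORE calling .index(); list.index raises
    # ValueError for unknown values, which is the kind of unhandled
    # exception seen in the logs above.
    if hotkey not in HOTKEYS:
        return True, "Unrecognized hotkey"
    uid = HOTKEYS.index(hotkey)  # guaranteed not to raise here
    return False, f"Hotkey recognized, uid {uid}"
```

With the checks in this order, hostile traffic exits on the cheap branch and never triggers exception handling at all.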
(In reply to Maciej Urbański's comment above, Wed, May 22, 2024, 5:08 PM.)
Co-authored-by: Maciej Urbański <[email protected]>
Reopening soon with a more comprehensive pull.
Examples of the attacks:
2024-05-22 06:59:25.684 | TRACE | axon | --> | 33 B | Dummy | None | 104.54.246.177:49350 | 500 | None is not in list
Example of the missing minimum stake attack: https://discord.com/channels/799672011265015819/1220504695404236800/1239769414585286738
Addressed, and added a theory on why this might be. Also added the minimum stake example, with an explanation of why I think this is needed (based on real-world exploitation in sn2) and why it should be offered by default rather than simply mentioning that you may want to do this. This isn't a MAY, it's a MUST, in my opinion, hence the change in the documentation as well, unless otherwise demonstrated by their subnet design pattern. Anyway, you shouldn't have unhandled exceptions (especially of generic types like ValueError) that can be triggered by users, even if they are effectively caught upstream.
- Minimum stake and other blacklisting options are great and needed additions, no doubt about that :)
- As for "response size amplification": if we are talking about a difference measured in under 100 bytes, that is for sure not what caused the initial problem. It will be hard to tell without logs from a miner that was DoSed into restart, showing what was happening right before that. In fact, the ~50 B response size shown by the logs (it's not the real response size, just the size of the Synapse object in memory) is what you get when the blacklist is functioning correctly.
- The stacktrace being printed should differ only slightly in size (it's one extra frame and three extra lines being printed, so it's more like 1/3 bigger, not 5x as you theorize). Whether that pushes it over the edge is hard to say; it is possible, though unlikely. But I do think the overeager exception trace is a problem, which hopefully can be addressed by less verbose handled synapse exceptions bittensor#1928.
I don't think 1/3 bigger is accurate, FYI, @mjurbanski-reef. Perhaps I'm wrong, but you can't just count functions and say okay, it's one extra. It's dependent on the total depth of nested calls and the complexity of the execution path of the function that throws the exception. Not that this matters too much; they straight up just shouldn't print a greedy stacktrace like that, as you've mentioned, especially if it's triggerable by an outside user.
The default subnet code has been reused on major mainnet subnets, leading to predictable issues: a lack of checks against stakeless validators, and a triggerable exception with a longer stacktrace than a normal blacklist exception.
Forcing an error in the blacklist_fn leads to an amplification of the normal blacklist response. For instance, a normal blacklist response will state a short reason, e.g. "not registered". The error blacklist response will include the full error message. Some error messages are longer than others (and the ValueError one is), leading to an amplification over the normal response. The effective amplification is around 2-3x, depending on the original return text of the blacklist_fn.
Utilizing these flaws, attackers were able to do two things:
Subnet 2 (missing stake check)
On subnet 2, a validator's minimum stake was not being checked. As a result, a malicious miner was able to register a low-stake validator and prompt miners with challenges. The purpose of these challenges was to slow a miner's average response time relative to its normal response time to a real validator. In addition, the attacker employed several sneaky methods, such as monitoring other validators' requests in order to sync their own request with a real validator's, forcing a concurrent request on the innocent miner.
As a result, I believe it is best practice to leave in a default stake blacklist as an example, and allow subnet devs to REMOVE it if they wish, not the other way around. As best practice, the safest solution should be offered as the example, not a barebones one. If the goal was for subnet devs to inspect the blacklist function and tailor it to their use case, this has been demonstrated not to happen on at least 6 subnets (likely more). In most subnets, a stakeless validator has no actual reason to call a miner's synapse. In practice, this lack of a default example has caused issues in various subnets that are often not picked up on right away.
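A sketch of the kind of default minimum-stake check argued for above. The threshold value, function name, and the stake table are assumptions for illustration; the real code would read stake from the metagraph for the caller's hotkey:

```python
from typing import Tuple

MIN_STAKE_TAO = 1_000  # assumed threshold; each subnet should tune this
# Stand-in for per-hotkey stake from the metagraph (illustrative values).
STAKE_BY_HOTKEY = {"validator-A": 50_000, "low-stake-B": 10}

def blacklist_low_stake(hotkey: str) -> Tuple[bool, str]:
    """Shipped ON by default: reject callers below the minimum stake.
    Subnet devs can remove or loosen this, rather than having to add it
    after an incident."""
    stake = STAKE_BY_HOTKEY.get(hotkey, 0)
    if stake < MIN_STAKE_TAO:
        return True, "Insufficient stake"
    return False, "Stake OK"
```

Shipping this enabled by default means a stakeless "validator" like the one in the sn2 incident is rejected before it can issue challenges.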
Subnet 31 (triggerable unhandled exception)
Other affected subnets with a similar issue to 31 included subnets 33, 28, 25, 16, 11, and 5. Here's my leading theory on why this leads to exhaustion rather than the normal blacklist behavior, which, as @mjurbanski-reef has shown, already throws an exception normally. Let's compare both scenarios, specifically looking at the traceback that will be printed here:
https://github.com/opentensor/bittensor/blob/master/bittensor/axon.py#L959
Unhandled exception in blacklist_fn:
blacklist_fn -> blacklist -> dispatch
Normal handled exception stacktrace:
blacklist -> dispatch
As you can see, forcing an error in the blacklist_fn in the subnet (note: not the blacklist fn in axon.py) leads to a larger stacktrace. This stacktrace costs additional resources to gather and print, and could even include stack trace information from the asyncio upstream, metagraph calls, etc. The stacktrace should be roughly 3-5x the size due to the deeper call stack.
A normal exception thrown by blacklist with normal behavior (i.e. the person is blacklisted), by contrast, leads to a much shorter stacktrace: only blacklist, then caught by dispatch.
It's my belief this is part one of why this leads to exhaustion. The second part is the increased response size when throwing an error from the blacklist_fn. The combination of both means greater CPU/memory pressure and greater networking pressure. When hundreds to thousands of requests are coming in, small inefficiencies like these can play a big part in exhausting your miner. From testing, removing the unhandled exception from blacklist_fn allowed the miner to keep operating while under attack, while not handling the exception and letting it pass downstream caused server reboots on various miners' machines and a total DoS of the axon. Why would this happen? I can only guess: differences in stacktrace/call stack length.
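One way to guarantee the short, handled-exception path regardless of what the inner logic does is to wrap the subnet's blacklist function in a catch-all. This is a sketch under assumed names (the decorator and the one-item hotkey list are made up for illustration); any unexpected exception is converted into a plain rejection, so the deep blacklist_fn -> blacklist -> dispatch traceback described above is never generated for attacker-controlled input:

```python
import functools
import logging
from typing import Callable, Tuple

def never_raise(fn: Callable[..., Tuple[bool, str]]) -> Callable[..., Tuple[bool, str]]:
    """Turn any unexpected exception in a blacklist function into a
    short rejection instead of a propagated traceback."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs) -> Tuple[bool, str]:
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Log quietly without printing a full stacktrace.
            logging.debug("blacklist_fn raised; rejecting request")
            return True, "Blacklisted"
    return wrapper

@never_raise
def blacklist_fn(hotkey: str) -> Tuple[bool, str]:
    # list.index raises ValueError for unknown hotkeys, reproducing
    # the unhandled-exception pattern discussed in this thread.
    uid = ["hotkey-A"].index(hotkey)
    return False, f"Recognized, uid {uid}"
```

This keeps the failure mode identical to a normal blacklist rejection from the attacker's point of view: same response size, same shallow call stack.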