Pubsub flood due to same message propagated multiple times #524
Comments
Do you have any sort of message validation running in your network? It seems that this is the real root of your problem. |
Sorry @vyzo can you clarify what you mean by this? Is this a gossipsub feature we need to enable? If you mean application-level message validation we already do validate that message conform to the expected schema, and we even have caches around most of the behaviors that are triggered when applying messages. All these messages are valid messages, they are just getting delivered multiple times. |
Caches, like the seen cache, are just that... caches, not a security feature or sufficient replay protection at scale. Pubsub, the library, supports an elaborate application-level validation scheme. This allows blockchains to intervene and reject or ignore messages at the application level using eventual consistency models or message pool protection rules (factoring in origin peers in transaction topics), and this is why this problem does not generally arise there. You absolutely must provide a validator to the library if you want to scale. This is not a problem that can be solved with a bandaid, and the pubsub library cannot do more than it already does... The best we can do is give you an API for writing validators, and we already provide that. Please use it. (edit: fixed some typos) |
Note that your validator needs to go beyond merely syntactic validation. At the very minimum, you need to have a concept of a per-peer nonce if you don't have a natural filter for your topic (such a filter could be block height or miner/validator permit, for instance). |
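To make the per-peer nonce idea concrete, here is a minimal sketch of the bookkeeping such a validator might do. It is plain Python rather than the go-libp2p-pubsub validator API, and every name in it (`PerPeerNonceValidator`, `validate`, `ACCEPT`, `IGNORE`) is hypothetical. It assumes each message carries an origin peer identifier and an integer `seqno` that increases with every message that peer publishes.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical result values; a real pubsub library defines its own.
ACCEPT = "accept"
IGNORE = "ignore"


@dataclass
class PerPeerNonceValidator:
    """Remembers the highest seqno accepted from each origin peer."""

    # origin peer id -> highest seqno accepted so far
    last_seen: Dict[str, int] = field(default_factory=dict)

    def validate(self, origin_peer: str, seqno: int) -> str:
        highest = self.last_seen.get(origin_peer, -1)
        if seqno <= highest:
            # Replay or stale message: drop it without forwarding.
            return IGNORE
        self.last_seen[origin_peer] = seqno
        return ACCEPT


validator = PerPeerNonceValidator()
print(validator.validate("peerA", 1))  # first sighting of seqno 1 from peerA
print(validator.validate("peerA", 1))  # same seqno again from peerA
```

Unlike a time-bounded cache, this state does not expire, so a replayed message is ignored however late it arrives. The trade-off in this sketch is that a strictly increasing check also drops legitimate messages from the same origin that arrive out of order, and the table grows with the number of origin peers.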
Interesting. Our model is a bit different than a standard blockchain but I can definitely work with the team to think about what additional types of validator rules would make sense for our use case. The per peer nonce seems like a reasonable and simple place to start. I'm really surprised that kubo doesn't provide that level of simple validation out of the box though. In any case, I'll open an issue with kubo to see what we can do there. So that seems like a very useful direction to pursue for the answer to my fourth question - "what can we do to protect ourselves in the future?". Using more libp2p level validators, either through kubo configuration or by moving away from kubo and using libp2p directly, seems like a good direction to investigate. Thank you for that information! But I'm still stuck on what we can do in the short term to mitigate the issue we are actively experiencing at the moment. I.e. my first and second questions in my original post. Any suggestions on how we can get back to a stable state would be very much appreciated! |
Yes, please do work with the ipfs people on the validator interface. I am not sure what you can do to stop your current "diameter exceeds timecache boundaries" flood, short of shutting down your topic for sufficient time to quiet down. And this would be only temporary; the storm might come back. Maybe restarting with a very long timecache window (assuming this is not a replay attack but a natural occurrence) could help. |
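The "diameter exceeds timecache boundaries" condition can be illustrated with a toy time-bounded seen cache. This is a sketch under stated assumptions, not the library's implementation: the class `TimeCache`, its method `seen_before`, and the 120-second window are illustrative, and expiry is measured from the first sighting of a message.

```python
from typing import Dict


class TimeCache:
    """Toy seen cache: remembers a message id for ttl seconds after first sighting."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._expiry: Dict[str, float] = {}  # message id -> expiry time

    def seen_before(self, msg_id: str, now: float) -> bool:
        # Evict entries whose window has elapsed.
        for expired in [m for m, t in self._expiry.items() if t <= now]:
            del self._expiry[expired]
        if msg_id in self._expiry:
            return True  # duplicate inside the window: suppress
        self._expiry[msg_id] = now + self.ttl
        return False  # treated as new: would be processed and forwarded again


cache = TimeCache(ttl_seconds=120)
print(cache.seen_before("msg-1", now=0))    # first delivery
print(cache.seen_before("msg-1", now=60))   # duplicate within the window
print(cache.seen_before("msg-1", now=130))  # same message, window has elapsed
```

Once a copy of a message is still circulating after the window has elapsed, the cache no longer recognizes it and the node forwards it as if it were new. A longer window narrows that gap but does not remove it, which is why a validator is recommended as the durable fix.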
See ipfs/kubo#9665 (comment) for a possible quick solution to the current predicament. It will require patching ipfs, however, and longer term we will need to allow multiple validators per topic in pubsub (the internals are ready for it nonetheless; it is just a matter of user API, and we can do this quickly when the need arises). |
closing in favor of ipfs/kubo#9665 |
Incident report from 3Box Labs (Ceramic) Team
Incident summary
The Ceramic pubsub topic has been experiencing a flood of pubsub messages beyond our usual load for the last several days. We log every pubsub message we receive on the nodes that we run, and analysis of those logs using LogInsights shows that we are receiving messages with the exact same `seqno` multiple times: one message can show up upwards of 15 times in an hour. During normal operation we do not see seqnos showing up multiple times. This dramatic increase in the number of messages that need processing is putting excess load on our nodes and causing major performance problems, even with as much caching and de-duplication as we can do at our layer.
Evidence of the issue
Graph of our incoming pubsub message activity showing how the number of messages spiked way up a few days ago. The rate before 2/20 was our normal, expected amount of traffic:
AWS LogInsights Query demonstrating how the majority of this increased traffic is due to seeing the same message (with the same seqno) re-delivered multiple times. Before the spike we never saw a `msg_count` greater than 2.
Steps to reproduce
Connect to the gossipsub topic `/ceramic/mainnet`. Observe the messages that come in, and keep track of the number of times you see a message with each `seqno`. You'll see that over the span of an hour you see the same message with the same `seqno` delivered multiple times.
Historical context
We have seen this happen before. In fact, it's happened to us several times over the last year, and we've reported it to PL multiple times. You can see our original report here (at the time we were still using js-ipfs): https://github.com/libp2p/js-libp2p/issues/1043. When this happened again after we had migrated to go-ipfs, we reported it again, this time on slack: https://filecoinproject.slack.com/archives/C025ZN5LNV8/p1661459082059149?thread_ts=1661459082.059149&cid=C025ZN5LNV8
We have since discovered a bug in how go-libp2p-pubsub maintained the seenMessage cache and worked to get a fix into kubo 0.18.1: #502
We have updated our nodes to 0.18.1, but of course we have no direct control over what versions of ipfs/kubo the rest of the nodes on the Ceramic network are running. Even if the above bugfix would resolve the issue once every single node on the network upgraded to it, we have no real way to enforce that and no idea how long it will be (if ever) before there are no older ipfs nodes participating in our pubsub topic. Not to mention the possibility of a malicious node connecting to our pubsub topic and publishing a large volume of bogus messages (or re-broadcasting valid messages). So no matter what, we need a way to respond to incidents like this that goes beyond "get your users to upgrade to the newest kubo and pray that that makes the problem go away", which is what we've been told every time we've reported this issue so far.
Our request from Protocol Labs
This is an extremely severe incident that has affected us multiple times over the last year. It strikes without warning and leaves our network crippled. Every previous time this happened it cleared up on its own within a day or so, but this one has been going on for 5 days now without letting up. We need some way to respond to incidents like this, and to potential malicious attacks in the future where someone intentionally floods our network with pubsub traffic.
So our questions for PL are:
Thank you for your time and attention to this important issue!
-Spencer, Ceramic Protocol Engineer