Feasibility discussion: The community message history problem #420
Comments
Thanks @John-44 for the detailed and clear problem statement and set of requirements! Some of the listed problems can be addressed through the WAKU2-Store protocol, which I explain below. Additionally, I have included our current limitations and ongoing research.
WAKU2-Store protocol
WAKU2 Fault-Tolerant Store protocol
This protocol addresses the issue of offline nodes (both store nodes and querying nodes).
Current Limitations
Ongoing Research
We can also consider utilizing other existing p2p storage systems like IPFS. Please let me know your questions/comments @John-44. |
@John-44 Can you please elaborate on the second requirement i.e., scalability. Especially about this part :
I am not sure about the scalability that is mentioned in this part and the concern about the history size. My understanding is that the solution should give an option to the user to fetch a subset of the history but not necessarily the entire history, is this correct? |
May I add that, similarly to the discovery issue, the store node needs to be reachable. If Alice enabled chat history on her Status Desktop laptop, which accesses the internet from behind a NAT, then for Bob to connect to Alice and retrieve historical messages, either:
I believe we have not yet reviewed this in the context of Waku v2 and we may need some work to decide what method can and should be implemented. |
NAT is addressed here https://github.com/status-im/nim-waku/blob/master/waku/v2/node/config.nim#L44-L47 and follows the same pattern that Nimbus uses, which is working for Eth2. |
@staheri14 thanks for your reply! I've got some questions to help me better understand exactly what is possible if we use waku store nodes to solve this problem, and to also hopefully flesh out the requirements of the community message history problem a bit more.
From the above, is it correct to assume that waku store nodes can be configured so that an individual node only stores the message history for one or more specific communities? You mention that waku store nodes are supposed to have high availability. However, could they also function in a similar manner to a torrent swarm, so that if there are, say, 20 individual waku store nodes with intermittent connectivity that are fetching and serving messages related to a specific community, then:
Can the upload bandwidth used to serve the history of a specific community be throttled? Can the download bandwidth used to fetch the history of a specific community be throttled? Can a max storage size (denominated in units of storage, say MB or GB) be set for a specific community? E.g. as a user, I would like to dedicate a maximum of 1GB to storing the history of the StatusPunks community. When a user-set storage limit for a specific community is hit, would it be possible to keep the storage used within that limit with a first in, first out (FIFO) rule (when new messages arrive, the oldest messages are automatically deleted)? Can a store node which is syncing the history for, say, 3 different communities have different upload bandwidth, download bandwidth, and max storage size throttles for each community?
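To make the throttling questions above concrete, here is a minimal sketch of per-community upload throttling using a token bucket. This is not an existing Waku feature; the class name, the `upload_limits` mapping, and the chosen rates are all assumptions made for illustration. The caller passes the current time explicitly so the behaviour is easy to reason about.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Allows `rate_bytes_per_s` on average, with bursts up to `burst_bytes`."""
    rate_bytes_per_s: float
    burst_bytes: float
    tokens: float = 0.0
    last_refill: float = 0.0

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is allowed.
        self.tokens = self.burst_bytes

    def try_consume(self, n_bytes: int, now: float) -> bool:
        """Spend tokens and return True if `n_bytes` may be sent at time `now`."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_refill = now
        if n_bytes <= self.tokens:
            self.tokens -= n_bytes
            return True
        return False


# Hypothetical per-community settings: one upload bucket per community,
# so each community can have its own throttle.
upload_limits = {
    "StatusPunks": TokenBucket(rate_bytes_per_s=100_000, burst_bytes=200_000),
    "OtherCommunity": TokenBucket(rate_bytes_per_s=10_000, burst_bytes=10_000),
}
```

A download throttle could use a second bucket per community in the same way; a node would call `try_consume` before serving or requesting each chunk of history and defer the transfer when it returns `False`.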
I understand that there is no limit on the number of days for message persistence and no guarantee either. But what I’m asking here is slightly different - would using waku store nodes to serve and fetch historical community messages reduce the maximum aggregate waku messaging bandwidth that is available to Status mobile and desktop users for realtime communication? E.g. Let's say Status mobile and desktop becomes fantastically popular and everybody in the world stops using WhatsApp, Messenger, Telegram, etc… and switches to only using Status for all their messaging communication. What’s the maximum number of real time person to person messages per second that Status Mobile and Desktop (using Waku 2) could support in this scenario? And does this maximum number of realtime person to person messages per second that waku2 can support drop if the waku2 protocol is also being used to fetch and serve historical community message history? Does that question make sense? For clarity, I'm not looking for actual numbers here; what I'm interested in is whether using waku store nodes to solve the community message history problem reduces the global max number of messages per second that Status mobile and desktop can use for near realtime p2p messaging.
Ahh, so from this it sounds like it’s possible for waku store nodes to backfill gaps in each other’s history stores :-) The requirement for time synchronization makes me a bit nervous though, especially as historic message data should be immutable. A timing dependency without a global source of time truth that we can rely on makes me worry that small amounts of historic messages could be lost over time (which could add up to the message history of a community degrading more and more with time, and would lead to users feeling that they couldn’t rely on historic messages being available). It might be worth pointing out that the requirements of ‘near-realtime global messaging’ and ‘p2p storage and distribution of the historic message history for specific communities’ are different in several important aspects. Historic message history is a finalized set of data that will never change in the future. The recently developed ‘edit message’ feature could be capped to only allow the editing of messages posted up to, say, 3 days in the past. After this point a community’s message history should become immutable. Historic message history doesn’t need to be delivered in near realtime. Serving and fetching historic message history is far more bandwidth intensive than sending and receiving near realtime messages. The maximum number of members in any given community will always be smaller than the maximum number of people in the world using Status mobile and desktop. Because of these attributes, if it was useful or advantageous to do so, historic message history could perhaps even be treated as chunked binary blobs and distributed over a p2p distributed file sharing protocol! There are several quite radically different approaches we could take to solving this community message history problem, so I think it’s useful to weigh up the pros and cons of each approach before we choose a specific implementation direction.
If the guarantee of the availability of full message history has the same properties as the guarantee of data availability that the torrent protocol can give for a binary blob, then we are good. Status communities are responsible for themselves, we just need to provide the tools that ensure that if there are a sufficient number of community members who wish to store and serve the history of a specific community, then the history of that specific community will be preserved. There is a sliding scale here between the availability of nodes and the number of nodes needed. Some communities might put effort into making sure that there are always 2 or 3 Status Desktop nodes with good bandwidth almost always online to serve message history for their community. Or a large community might just rely on the fact that out of thousands of members, there will be enough members with community history serving turned on at any one time to reliably serve that community’s history (e.g. a far greater number of nodes, each online for a much smaller percentage of the time should equal a small number of nodes that are almost always online with good bandwidth).
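The trade-off between node count and per-node uptime can be sketched with a simple probability model. This is only a rough illustration: it assumes each node is online independently with the same probability, which real community members (who share time zones and habits) will not satisfy exactly. The function name and the example numbers are hypothetical.

```python
def availability(node_count: int, uptime: float) -> float:
    """Probability that at least one of `node_count` nodes is online,
    assuming each node is online independently with probability `uptime`."""
    if node_count < 0 or not 0.0 <= uptime <= 1.0:
        raise ValueError("node_count must be >= 0 and uptime within [0, 1]")
    # All nodes are offline with probability (1 - uptime) ** node_count.
    return 1.0 - (1.0 - uptime) ** node_count


# A few well-run nodes versus many casually-run nodes (illustrative numbers).
few_reliable = availability(node_count=3, uptime=0.90)
many_flaky = availability(node_count=20, uptime=0.30)
```

Under this model, 20 nodes that are each online 30% of the time give slightly better availability than 3 nodes that are each online 90% of the time (both come out above 99.9%), which matches the intuition in the paragraph above.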
This scares me a bit as well. However, any solution that delivers the end goal of preserving and serving reliable, immutable community message history is fine, even if this is achieved by relying on multiple nodes and by fetching community history progressively over time. E.g. even if the initial message history sync is unreliable, that’s ok as long as the message history self heals over time by fetching missing messages when they become available. Note it shouldn't be possible for somebody to insert false messages into the message history; this would be very bad. Treating the message history for each channel in each community as a chunked binary blob and delivering this over an existing p2p distributed file sharing protocol could perhaps sidestep some of these reliability concerns? But it would probably have a bunch of different downsides. humm...
Because we are focusing on how to fetch, store, preserve and serve the history of specific communities, perhaps there are other aspects of how members of a community communicate that we could utilize for discovery? The only people who would be motivated to fetch and serve history for a specific community are folks who have an interest in that specific community, which means they are probably members of that community. Conversely, somebody who is not a member of a community is very unlikely to be fetching and serving the history of that community (and doing this won’t be an option in the UI - only a user who is a member of a community will be able to turn on history fetching and serving for that community in the UI).
I think this would be very useful, as it would provide benchmarks against which we can measure the properties of possible solutions. In addition to IPFS, how about torrent? Torrent is old, but it works, it’s super reliable, it’s fast, and numerous open source torrent implementations exist, including torrent libraries written in many languages. As the message history of each channel in a community could be treated as an immutable chunked binary blob, perhaps there is another p2p file distribution protocol that already exists and meets all our requirements? Many thanks for looking into this problem! :-) |
As the message history of a community will include every message and image posted to every channel in the community, the message history for a large and active community (and especially a community where a lot of images are posted) can and will grow to a large size (multiple GBs)! This is why it's important that any solution to this problem should let users set a max limit (in MB or GB) of the amount of storage they are willing to dedicate to storing the history of a specific community. When a user sets this limit, the default behaviour should be to start from the present and work chronologically backwards fetching the community's history until the storage limit the user has set is hit. Once this storage limit is hit, whatever solution we choose needs to be able to stay inside this limit by deleting the oldest messages as new messages arrive. |
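The storage behaviour described above (fill backwards from the present until the limit is hit, then evict the oldest messages as new ones arrive) can be sketched as follows. This is a minimal in-memory illustration, not a proposal for the actual storage layer; `BoundedHistory` and its method names are hypothetical.

```python
from __future__ import annotations

from collections import deque


class BoundedHistory:
    """Keeps one community's history within `max_bytes`, oldest message first."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._messages = deque()  # (timestamp, payload), oldest on the left

    def append(self, timestamp: int, payload: bytes) -> None:
        """Store a newly arrived message, evicting the oldest ones if needed."""
        if len(payload) > self.max_bytes:
            return  # a message larger than the whole budget is skipped
        self._messages.append((timestamp, payload))
        self.used_bytes += len(payload)
        while self.used_bytes > self.max_bytes:
            _, evicted = self._messages.popleft()
            self.used_bytes -= len(evicted)

    def backfill(self, timestamp: int, payload: bytes) -> bool:
        """Add an older message while walking backwards in time.

        Returns False once the budget is full, which tells the caller to
        stop fetching older history.
        """
        if self.used_bytes + len(payload) > self.max_bytes:
            return False
        self._messages.appendleft((timestamp, payload))
        self.used_bytes += len(payload)
        return True

    def timestamps(self) -> list[int]:
        return [ts for ts, _ in self._messages]
```

The split between `append` and `backfill` mirrors the two phases in the comment: backfilling never evicts anything, while live messages always win over the oldest stored ones.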
It depends on how waku v2 is utilized, but the simple answer is yes. If each community has a specific and distinct content topic or pubsub topic (that is, all the messages created within that community carry that content topic or pubsub topic), then messages within each community can be distinguished from those of other communities, hence store nodes would be able to decide whether or not to persist messages for specific communities.
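As a rough sketch of that idea, a store node could keep a set of topics it has opted into and drop everything else. The types below are simplified stand-ins rather than the nim-waku data structures, and the topic strings are made-up examples, not a real Waku topic naming scheme.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    content_topic: str  # assumed to identify the community
    payload: bytes


@dataclass
class SelectiveStore:
    """Persists only messages whose content topic the node has opted into."""
    persisted_topics: set
    messages: list = field(default_factory=list)

    def on_message(self, msg: Message) -> bool:
        """Return True if the message was persisted, False if it was ignored."""
        if msg.content_topic not in self.persisted_topics:
            return False
        self.messages.append(msg)
        return True


# Hypothetical topic names, for illustration only.
store = SelectiveStore(persisted_topics={"community/status-punks"})
kept = store.on_message(Message("community/status-punks", b"gm"))
ignored = store.on_message(Message("community/other", b"hello"))
```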
I am answering this question based on the current state of the store protocol (assuming that there is no synchronization between store nodes).
Similar to my prior comment, the answer is yes given that the address of those nodes is publicly known.
Depending on the situation, the answer is different:
The aggregate bandwidth is available, but there is no load balancing technique in place. It all depends on how users query those nodes, as such one node might be overloaded while others are idle.
Yes they can, but they must have initially started as a full store protocol, I think nodes would not be able to change their mounted protocols afterward (or we currently do not support such behavior) cc: @oskarth
Can you please explain more about this "And if so, can Waku store nodes do this without impacting the overall capacity of the waku network?", specifically, what sort of impact you mean?
It can be, but we do not have support for it right now. cc: @oskarth from my point of view it should be feasible, wdyt?
I think it should be feasible, but we do not have support for it right now. cc: @oskarth re feasibility, wdyt?
From my point of view, it should be fairly doable, but we do not have support for it right now. @oskarth wdyt about feasibility?
IMO, it should be feasible, but we do not have support for it right now. @oskarth wdyt about feasibility?
The answer is no (it does not reduce the global max number of messages per second), or at least IMO the impact should not be noticeable. The reason is that the regular nodes, which are likely resource-constrained devices and constitute a large portion of the p2p messaging ecosystem, are different from the nodes that run the store protocol and serve historical messages. The load of maintaining historical messages is all on the store nodes, while the rest of the nodes (which we call light nodes) can perform their regular operations without being affected by the store protocol overhead. The full nodes that engage in the store protocol will have a higher load, which will consume their resources, but they are supposed to have enough computational, storage, and bandwidth capacity not to be overwhelmed by providing the storage service. Even for full nodes, we should be able to control the load based on consumed bandwidth, the number of persisted messages, etc. My answer is tentative; a concrete answer requires benchmarking and more investigation.
That is right! using other existing file-sharing protocols is a sane approach and may even save time on the development side.
With the store synchronization protocol (which is our future work), the full history will be reconstructed on all the store nodes, given that store nodes collectively have the entire history. The immutability requirement is a somewhat separate feature from full history availability. The waku store protocol does not account for it, neither in its current state nor in the prospective store synchronization protocol. This will need more thought and may have to be addressed in a separate protocol, or may require some changes to the existing proposal for the store synchronization protocol. cc: @oskarth
That is indeed a concern, but it is more of an access control problem and should be treated separately from history persistence and synchronization. At a high level, digital signatures can be used to prevent malicious message injection (and if the anonymity of the users matters, then zero-knowledge proofs can be utilized). Let me know if you would like to discuss it further. cc: @oskarth
I agree that the solution can be based on IPFS and Torrent, this would require a clear integration idea that takes all the community requirements into account. Designing a hybrid solution based on the existing tools like Torrent requires more study and investigation though.
|
I see your point! Totally! Limiting the storage of each store node would be easy from the configuration point of view, but that would impact full history availability. Depending on whether an existing file-sharing tool is going to be used or just the store protocol, we can think of a solution for that as well. |
Thanks @staheri14 and @John-44 for the discussion! Some partial answers to things mentioned:
It might be possible to do dynamically, but the easiest way to deal with this is to either 1) start with this capability or 2) restart the node when changing the preference.
Indeed shouldn't be an issue if this is something we need. Also see SWAP protocol for a more economic solution, but this can be complementary.
Technically not an issue, assuming communities are segregated by pubsub and/or content topic (which they should be), but it has to be implemented. Doesn't seem to be high prio to me personally but more something for later iterations?
Conceptually correct, by "Waku network" usually we mean "Relay network". This can be segregated by pubsub topic to reduce load. For historical messages and store nodes this happens in a req/resp manner, so the load is on an individual node. There are some additional things to take into account here, such as what would happen for store synchronization across multiple Store nodes. This wouldn't impact regular nodes though, and we can bound this by O(C) where C is community size.
I think this can be complementary, where it might make sense to piggyback on IPFS/BitTorrent for static archives as an implementation/optimization detail, and then make sure the interface is clean / we use existing flow for dynamic portion. Or something like this.
Messages on their own are immutable. Depending on how the community protocol works, and if there's a closed group (i.e. not public chat) then there might be some sequence order / merkle tree stuff we can do here. In general, it is worth pointing out that e.g. Bittorrent is meant for static content, and a chat history is usually dynamic. To have a canonical immutable representation requires either a source of truth and/or some consensus mechanism. Notice that this is true regardless of what specific method we end up using (who decides the new IPFS hash? etc). I do think the store synchronization protocol should help with this, but I think it'd be useful to have a closer look at this requirement by looking at what the Community spec looks like.
Agree, my understanding is that community has some access control mechanism so there should be some way to address this. If there's a strong need for some form of validation at store level (as opposed to once you've fetched stuff down) we could consider having that as a plugin. Notice that store nodes currently don't have access to the contents of messages, and in order to do filtering of this nature you'd need some form of access. |
what I think is important to begin with is to unpack the problem into smaller components, and see what can be done with each component individually, with regards to decentralization:
the history problem is usually modelled as a distributed log where a mechanism exists to correlate messages - this can either be a merkle tree, a list of back-pointers to previous messages (or siblings), counters, lamport clocks and so on. Once a graph is established, the contents of messages can also be fetched - in many forms of communication, knowing that a message is missing fulfills quality requirements even if content is not available - the content can maybe be filled in later as long as there's proof that it was indeed part of the stream to begin with. Key however is that every feature has a cost - being decentralised means amplifying that cost, usually - the first step must thus be to precisely define what the absolute minimum requirements are that will keep users happy, and whether there exist loopholes that can be used to emulate otherwise critical functionality - to name a trivial example, latency in text communication is often mentioned as a hard requirement, but if you merely show a progress indicator, most users are happy to wait, because they know "something" is happening, and the latency problem becomes an optimization point rather than an up-front requirement. |
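a tiny sketch of the "knowing that a message is missing" idea, assuming each message carries back-pointers to the ids of the messages that preceded it. The data layout and function name here are made up for illustration and are not taken from any Status or Waku spec.

```python
from __future__ import annotations


def find_missing(known: dict[str, list[str]]) -> set[str]:
    """Return ids that are referenced by some message but not held locally.

    `known` maps a message id to the ids of the previous messages it
    points back to.
    """
    referenced = {parent for parents in known.values() for parent in parents}
    return referenced - set(known)


# m4 points back to m3, which this client never received.
local_log = {
    "m1": [],
    "m2": ["m1"],
    "m4": ["m3"],
}
gaps = find_missing(local_log)
```

here `gaps` contains only `"m3"`: the client can show a placeholder for it and fetch the content later, which is the "fill in later" behaviour described above.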
As promised, here is a link to some work in progress designs for the community history service management screens. A few things to note:
With that said, hopefully the designs give an idea of the types of controls and statistics feedback that a user could find useful. |
To provide a bit more detail regarding some of the points discussed above:
|
Just to be clear, neither of these functions are really "implementable" except as a smoke screen that hides the original message and the ui would do well to .. explain this to users - ie once you go into immutable datastores territory, you can't really remove things (an edit is a delete + add) - in general "N days" considerations are not really in scope, it's more of a paint job on top later unless we're building a centralized solution where history access is gated / permissioned in which case one might be able to get as far as constraining new members access to content. |
1000% agree that every feature has a cost and that being decentralised usually means amplifying that cost! With this in mind, here is a very rough sketch of a possible solution that I think could deliver on the requirements stated at the top of this thread and allow the implementation of the designs linked above with the least amount of effort:
I think this approach can work for Status communities, because with Status communities we have an actor who can produce the canonical history for distribution (the community owner), and a bunch of other actors in community members and community admins who are motivated to download this history (so they can access it themselves), store this history (so they can continue to access it themselves) and serve this history (because they have an interest in their community being successful). |
Right, yes fully agreed the current edit and delete functionality is just a 'smoke screen/paint job' as you described it. Still useful, but not a function that can be relied on for privacy. |
hey @arnetheduck , I was thinking something similar along those lines, i.e. to implement this as a torrent client of sorts, with the altruistic users functioning as seeders and the users requesting old history functioning as leeches. A specific community log time period (let's say each 24 hours) could be represented through a magnet link (or something similar) that is determined based on the community id and the time period (e.g. 2015-01-10). An issue is that the final file of all messages for that period needs to be the same for everyone, and this implies some ordering and determinism. A possible way to do this is to modify the status application-level protocol to include a hash value that points to the previous (known) message, so a client will check what the last message received in that channel is, and include the id of that previous message and a hash of itself with the previous message id. Depending on how much concurrent activity there is, this could introduce some 'forks' or orphan messages; to handle this we could use something like Ethereum's Greedy Heaviest Observed Subtree or Inclusive protocol. Thoughts? |
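A minimal sketch of the chaining idea above, assuming SHA-256 and a single linear chain. It deliberately ignores forks and orphan messages, so it does not implement GHOST or any other fork-choice rule. The key scheme in `archive_key` and all function names are hypothetical, and the key is just a deterministic label, not a real magnet link.

```python
import hashlib


def archive_key(community_id: str, day: str) -> str:
    """Deterministic label for one community's archive for one day."""
    return hashlib.sha256(f"{community_id}/{day}".encode()).hexdigest()


def message_id(prev_id: str, payload: bytes) -> str:
    """Id of a message: hash of the previous message id plus its own payload."""
    return hashlib.sha256(prev_id.encode() + payload).hexdigest()


def build_chain(payloads, genesis=""):
    """Link payloads into (id, prev_id, payload) entries, oldest first."""
    chain = []
    prev = genesis
    for payload in payloads:
        mid = message_id(prev, payload)
        chain.append((mid, prev, payload))
        prev = mid
    return chain


def verify_chain(chain, genesis="") -> bool:
    """Check every entry points at its predecessor and its id matches."""
    prev = genesis
    for mid, claimed_prev, payload in chain:
        if claimed_prev != prev or message_id(claimed_prev, payload) != mid:
            return False
        prev = mid
    return True


chain = build_chain([b"hello", b"world"])
key = archive_key("status-punks", "2015-01-10")
```

With this structure, changing, inserting, or reordering any message changes every later id, so two clients that end the day with the same final id hold the same file for that period.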
One other consideration for any potential solution to take into account - currently images posted in chats are hosted on IPFS, however we don't want to pin all images that are posted in all communities to IPFS forever, therefore IPFS is a bad long term storage solution for images posted in community chats. So if possible it could be good if the community history service includes the images posted to community channels in the history blocks that will be stored and distributed via a p2p file share protocol. |
This conversation has gotten quite long and touches many different concerns at the same time. @staheri14 did a great breakdown here vacp2p/research#82 You can see all issues here https://github.com/vacp2p/research/milestone/8 I suggest we continue the discussion in the most specific issue possible. For example, how to optionally offload archival data can be discussed here: vacp2p/research#78 etc. If some concern is missing, it can be covered in vacp2p/research#82 |
Issue moved here |
What is the problem we are trying to solve?
Users who are used to using a centralized group chat platform like Discord have expectations that:
Status Communities needs to offer equivalent behaviour to meet user expectations.
Why is it important that we solve this problem?
1) Chat history availability is important for productivity
Using Status’s own internal Discord as an example, it is sometimes very useful to be able to search for and/or scroll up to conversations that may have taken place 6 months or even a year or more ago, for many purposes:
2) Losing access to group chat history could feel like a catastrophic event for a user
Group chat history will frequently become an invaluable resource for the user, which leads to the problem of how can the user retain their stored group chat history when for example the user buys a new computer, or if their computer breaks or is stolen. If a user suddenly discovers that they have lost the chat history of the Status Communities they are members of when they change their computer, saying that they might get angry could be an understatement!
3) Chat history availability is important for user equality
If new users who join a Status community don’t have access to the historical discussions that took place before they joined, this will place new users at an informational disadvantage compared to users who have access to this history. In addition, whenever a user encounters a link to a previous discussion that occurred prior to them joining the community, they would encounter an error while longer standing members of the community will be able to access and read the historical thread.
Not having access to historical discussions will place newcomers to a group chat community at a significant disadvantage.
4) Chat history availability is important for community monetization
Historical group chat discussions can provide value to users, and Status will be providing communities with tools to monetize this user value. To give one example, a hypothetical Beeple Status community could exist and there could be a channel in that community where each day Beeple posts his daily art with a short commentary. New users may well be happy to pay to access this channel, in part because accessing this channel would allow them to scroll back through Beeple’s history of daily art posts. This requires all chat history in a community to be available to all users, irrespective of when a user joins.
Status communities today
In the current incarnation of Status communities, when a user first joins a Status Community a maximum of 30 days history is available, and as long as the user logs into Status once a week the user’s client will log and store the chat history in all channels of that community going forward.
If for example a user goes on holiday and doesn’t launch Status for say 2 weeks, when the user does launch Status, unless they navigate to each individual channel and manually fetch the history for each individual channel in every Status community they are a member of, there will be gaps in their community chat history. After 4 weeks have passed, these gaps become permanent and no mechanism currently exists to fetch the missing data (due to the 30 day mail server message persistence limit).
At the moment a new user who joins a community has no way of accessing any community history that is more than 30 days older than the date they joined. As time progresses, they then lose access to any community history their device has not downloaded.
Currently, if a Status Communities user needs to install Status on a new device, the new device will not contain the community chat history that the user has stored on their previous device.
Summary of problem
Not having the full chat history of a Community available to any member who joins a community at any time undermines a basic trust assumption that users will have as a result of their experience of using centralized group chat platforms. Delivering a solution to fulfill this expectation is of critical importance if Status communities is going to succeed as a product.
Requirements that any solution to this problem must fulfill
This problem needs to be solved in a decentralised way. This means that any prospective solution must not require servers and/or require hosting expenditure. However it’s absolutely fine for any solution to rely on motivated members of a community (who altruistically care about the history of their community being preserved and served) running a number of Status desktop instances that are online roughly 90% of the time.
This problem needs to be solved without putting a load on the Waku network that will reduce the max number of users Waku can scale to (e.g. we don't want to trade Waku network scalability to solve this problem). The history of a community can over time grow large, think multiple GBs and it might take a new community member days to download the full history of an established and active community.
The solution needs to ideally be implementable in 3 to 6 months max, with 1 or 2 full time devs (as the need for this problem to be solved will become urgent soon).
Users need to be able to throttle the resources consumed by their Status desktop nodes to provide this service. At a minimum, users should be able to:
A strawman UI is being designed that exposes the maximum number of controls, graphs and status indications that we may wish to include. Don’t worry if some of these controls or graphs are not feasible to implement, once we have decided on an implementation direction we will go through these designs to work out what is practically possible and update the designs.
After a solution to this problem has been implemented, a next step might be to build a tool that enables a community's Discord chat history and channels to be imported into Status communities. This is tracked separately in the following issue: https://github.com/status-im/status-desktop/issues/2849
Community owners should have a toggle that lets them disable message serving for their community. If switched off, members of the community will not be able to switch on message history serving for that community. A community owner may wish to switch off message serving if privacy is very important for their community.
In the future (but not in the initial implementation), we may also wish to provide community admins with a setting to limit the availability of community history for privacy reasons. Such a setting would let a community admin mark a channel as “all messages in the channel auto-delete after X days”. If switched on, all clients would automatically delete all messages in the specified channel after X days. This is of course a very imperfect privacy feature, as there is nothing to stop users saving their own local copies of channel content via manually screenshotting or cutting and pasting, or via a more sophisticated mechanism, e.g. a modified Status client.