This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Feasibility discussion: The community message history problem #420

Closed
John-44 opened this issue Jul 5, 2021 · 19 comments

Comments

@John-44

John-44 commented Jul 5, 2021

What is the problem we are trying to solve?

Users who are used to a centralized group chat platform like Discord expect that:

  1. new users joining a community will have access to all historical messages in that community (including messages that predate the user joining)
  2. historical messages will remain available to users after they re-install their group chat app on a new device.

Status Communities needs to offer equivalent behaviour to meet user expectations.

Why is it important that we solve this problem?

1) Chat history availability is important for productivity

Using Status’s own internal Discord as an example, it is sometimes very useful to be able to search for and/or scroll up to conversations that may have taken place 6 months or even a year or more ago, for many purposes:

  • to get up to speed on a debate that happened previously
  • to retrieve a piece of information that a user has only just realized they need but which they know is available somewhere in the past group chat history
  • to see if a claim being made about something that happened in the past actually matches up with what happened in the past
  • To follow a link to a conversation that occurred in the past, that perhaps predates the user joining the group chat.
  • Etc, etc…

2) Losing access to group chat history could feel like a catastrophic event for a user

Group chat history will frequently become an invaluable resource for the user, which leads to the problem of how can the user retain their stored group chat history when for example the user buys a new computer, or if their computer breaks or is stolen. If a user suddenly discovers that they have lost the chat history of the Status Communities they are members of when they change their computer, saying that they might get angry could be an understatement!

3) Chat history availability is important for user equality

If new users who join a Status community don’t have access to the historical discussions that took place before they joined, this will place new users at an informational disadvantage compared to users who have access to this history. In addition, whenever a user encounters a link to a previous discussion that occurred prior to them joining the community, they would encounter an error while longer standing members of the community will be able to access and read the historical thread.

Not having access to historical discussions will place newcomers to a group chat community at a significant disadvantage.

4) Chat history availability is important for community monetization

Historical group chat discussions can provide value to users, and Status will be providing communities with tools to monetize this user value. To give one example, a hypothetical Beeple Status community could exist and there could be a channel in that community where each day Beeple posts his daily art with a short commentary. New users may well be happy to pay to access this channel, in part because accessing this channel would allow them to scroll back through Beeple’s history of daily art posts. This requires all chat history in a community to be available to all users, irrespective of when a user joins.

Status communities today

In the current incarnation of Status communities, when a user first joins a Status Community a maximum of 30 days history is available, and as long as the user logs into Status once a week the user’s client will log and store the chat history in all channels of that community going forward.

If for example a user goes on holiday and doesn’t launch Status for say 2 weeks, when the user does launch Status, unless they navigate to each individual channel and manually fetch the history for each individual channel in every Status community they are a member of, there will be gaps in their community chat history. After 4 weeks have passed, these gaps become permanent and no mechanism currently exists to fetch the missing data (due to the 30 day mail server message persistence limit).

At the moment a new user who joins a community has no way of accessing any community history that is more than 30 days older than the date they joined. As time progresses, they then lose access to any community history their device has not downloaded.

Currently, if a Status Communities user needs to install Status on a new device, the new device will not contain the community chat history that the user has stored on their previous device.

Summary of problem

Not having the full chat history of a Community available to any member who joins a community at any time undermines a basic trust assumption that users will have as a result of their experience of using centralized group chat platforms. Delivering a solution to fulfill this expectation is of critical importance if Status communities is going to succeed as a product.

Requirements that any solution to this problem must fulfill

  1. This problem needs to be solved in a decentralised way. This means that any prospective solution must not require servers and/or require hosting expenditure. However it’s absolutely fine for any solution to rely on motivated members of a community (who altruistically care about the history of their community being preserved and served) running a number of Status desktop instances that are online roughly 90% of the time.

  2. This problem needs to be solved without putting a load on the Waku network that will reduce the max number of users Waku can scale to (e.g. we don't want to trade Waku network scalability to solve this problem). The history of a community can over time grow large, think multiple GBs and it might take a new community member days to download the full history of an established and active community.

  3. The solution needs to ideally be implementable in 3 to 6 months max, with 1 or 2 full time devs (as the need for this problem to be solved will become urgent soon).

  4. Users need to be able to throttle the resources consumed by their Status desktop nodes to provide this service. At a minimum, users should be able to:

    • Switch this functionality on for only a subset of the communities they are members of
    • Throttle the max upload and download bandwidth used by this service, both per community and globally for all the communities they are members of.
    • Limit the maximum storage used by this service, both per community and globally for all the communities they are members of.
    • In cases where the user has set a max storage limit that is less than the size of the history for a community, the community history needs to be updated in a FIFO (first in first out) manner, e.g. once the max storage limit is reached, the oldest history should be deleted as new history is created in order to stay under the max limit.
    • Monitoring of the CPU and Memory used by this service (both globally and per community) would be desirable (but not essential).
    • Ideally the user should be able to throttle both memory and CPU usage as well, but this is ‘nice to have’, not essential.
  5. A strawman UI is being designed that exposes the maximum number of controls, graphs and status indications that we may wish to include. Don’t worry if some of these controls or graphs are not feasible to implement, once we have decided on an implementation direction we will go through these designs to work out what is practically possible and update the designs.

  6. After a solution to this problem has been implemented, a next step might be to build a tool that enables a community's Discord chat history and channels to be imported into Status communities. This is tracked separately in the following issue: https://github.com/status-im/status-desktop/issues/2849

  7. Community owners should have a toggle that lets them disable message serving for their community. If switched off, members of the community will not be able to switch on message history serving for that community. A community owner may wish to switch off message serving if privacy is very important for their community.

    In the future (but not in the initial implementation), we may also wish to provide community admins with a setting to limit the availability of community history for privacy reasons. Such a setting would let a community admin mark a channel as “all messages in the channel auto-delete after X days”. If switched on, all clients would automatically delete all messages in the specified channel after X days. This is of course a very imperfect privacy feature, as there is nothing to stop users saving their own local copies of channel content via manually screenshotting or cutting and pasting, or via a more sophisticated mechanism, e.g. a modified Status client.

@staheri14
Contributor

staheri14 commented Jul 7, 2021

Thanks @John-44 for the detailed and clear problem statement and set of requirements!

Some of the listed problems can be addressed through the WAKU2-Store protocol which I explain below. Additionally, I have included our current limitations and ongoing research.

WAKU2-Store protocol

  1. It is a decentralized solution: the aim is to provide message history in a p2p manner. This complies with the first item of the requirement list, i.e.,

This problem needs to be solved in a decentralized way.

  2. In this protocol, nodes can voluntarily act as a storage server for the rest of the network. Essentially, they listen to the network traffic and store messages. To provide reliable storage service, store nodes are supposed to have high availability and online time.

  3. Store nodes decide on the type of messages to persist. Messages are associated with two types of topics, namely pubsub topic and content topic. A store node may persist messages on an arbitrary combination of these topics. This property can be utilized to address the following item of the requirement list, i.e.,

Users need to be able to throttle the consumed resources e.g., Switch this functionality on for only a subset of the communities they are members of

  4. Other nodes can query store nodes and fetch a subset of historical messages. Queries can be based on the conjunction of the time range, content topic, and pubsub topic. I imagine the possibility of subset queries should address the second item of the requirement list, i.e.,

This problem needs to be solved without putting a load on the Waku network that will reduce the max number of users Waku can scale to (e.g. we don't want to trade Waku network scalability to solve this problem). The history of a community can over time grow large, think multiple GBs and it might take a new community member days to download the full history of an established and active community.

  5. There is no limit on the number of days for message persistence, and no guarantee either.
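
To make the persistence and query behaviour described in the list above concrete, here is a minimal in-memory sketch. It is not the WAKU2-Store wire protocol or the nim-waku API; the `Message` and `StoreNode` names, the field layout, and the example topic strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Message:
    pubsub_topic: str
    content_topic: str
    timestamp: float  # sender time in seconds (hypothetical field layout)
    payload: bytes


@dataclass
class StoreNode:
    # Content topics this node chose to persist; an empty set means "persist everything".
    persisted_content_topics: Set[str] = field(default_factory=set)
    messages: List[Message] = field(default_factory=list)

    def on_message(self, msg: Message) -> bool:
        """Persist the message only if its content topic is one this node serves."""
        if self.persisted_content_topics and msg.content_topic not in self.persisted_content_topics:
            return False
        self.messages.append(msg)
        return True

    def query(self, start: float, end: float,
              content_topic: Optional[str] = None,
              pubsub_topic: Optional[str] = None) -> List[Message]:
        """Return stored messages matching the conjunction of all supplied filters."""
        matches = [
            m for m in self.messages
            if start <= m.timestamp <= end
            and (content_topic is None or m.content_topic == content_topic)
            and (pubsub_topic is None or m.pubsub_topic == pubsub_topic)
        ]
        return sorted(matches, key=lambda m: m.timestamp)


# Hypothetical topic names: one content topic per community channel.
node = StoreNode(persisted_content_topics={"/community-a/general"})
node.on_message(Message("/waku/2/default", "/community-a/general", 100.0, b"hello"))
node.on_message(Message("/waku/2/default", "/community-b/general", 110.0, b"not stored"))
recent = node.query(0.0, 200.0, content_topic="/community-a/general")
```

Because the node filters at persistence time, a volunteer only pays storage for the communities they opted into, and a querying node only downloads the slice of history it asks for.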

WAKU2 Fault-Tolerant Store protocol

This protocol addresses the issue regarding the offline nodes (both store nodes and querying nodes).
That is, this protocol is designed to allow nodes to go offline and yet be able to fill the gaps in their message history, given that they know at least one other store node that has been online for the same period. The querying node (which has been offline for, let's say, 2 weeks) needs to know one store node in the system which has been online for that 2-week duration. Then, it can fetch the historical messages from that store node using a time-based query. In this solution, we assume the store node and the querying node have clocks synchronized to within ±20 seconds.
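
One plausible way to account for the clock tolerance is to widen the time-based query on both ends by the assumed skew and discard duplicates on the client. The sketch below only computes the widened window; the function name and the way the tolerance is applied are assumptions, not taken from the protocol specification.

```python
from typing import Tuple

CLOCK_TOLERANCE_SECONDS = 20.0  # the 20-second synchronization assumption from the text


def backfill_window(went_offline_at: float, came_online_at: float,
                    tolerance: float = CLOCK_TOLERANCE_SECONDS) -> Tuple[float, float]:
    """Return the (start, end) time range to request from a store node.

    The range is widened by `tolerance` on both sides so that messages
    timestamped by a slightly skewed clock still fall inside the query.
    """
    if came_online_at < went_offline_at:
        raise ValueError("came_online_at must not precede went_offline_at")
    return (went_offline_at - tolerance, came_online_at + tolerance)


# A node that was offline between t=1000 s and t=2000 s (Unix-style seconds).
start, end = backfill_window(1000.0, 2000.0)
```

Widening the window means some messages near the boundaries will be fetched twice, so the querying node needs to deduplicate what it receives.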

Current Limitations

  1. No guarantee on the availability of full message history. The availability of the history entirely relies on the volunteer store nodes to stay online 24/7 and persist all the messages.
  2. The history provided by the individual store nodes is not necessarily reliable and does not necessarily reflect the full history. This is because some of the messages may never reach a store node due to network or topology issues. This means the querying node cannot rely on the message history provided by a single store node and may have to fetch the history from multiple store nodes.
  3. No discovery mechanism, that is, the store nodes are not discoverable by the rest of the network. Currently, the querying nodes have to know the store nodes out of band and configure them statically in the protocol.
  4. No support for items 4-7 of the requirements list.
    • Item 7 requires more research and investigation especially given the solution should be decentralized.
    • Item 4 seems to be easy to add to the existing store protocol.

Ongoing Research

  1. We are working on a store synchronization protocol to allow store nodes to sync their message history frequently. As a result, all the store nodes will have a consistent view of the message history and essentially they will all have replicated message states. This is an attempt towards items 1 and 2 of the Current Limitations section.
    • As such, each store node can reliably provide message history to the network.
    • This would also enable the construction of full message history out of a set of partially available store nodes with partial message history. Specifically, if the aggregate of store nodes (not individual nodes) is online 24/7, then using the synchronization protocol we can guarantee the construction of the full message history at all the store nodes.
    • Note that while this solution enables store nodes to eventually obtain the full message history, it does not guarantee the availability of the full history at any time.
  2. We are also working on a distributed discovery method so that store nodes would be able to find each other in a decentralized manner. This addresses item 3 of the Current Limitations.
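
As a rough illustration of how full history could be assembled from store nodes that each hold only part of it, the sketch below merges several partial histories and drops duplicates by content hash. This is not the synchronization protocol under research; the tuple layout and the choice of SHA-256 over (timestamp, content topic, payload) as a message identifier are assumptions.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Msg = Tuple[float, str, bytes]  # (timestamp, content_topic, payload), hypothetical layout


def message_id(msg: Msg) -> str:
    """Derive a stable identifier so the same message from two nodes is counted once."""
    timestamp, content_topic, payload = msg
    h = hashlib.sha256()
    h.update(repr(timestamp).encode("utf-8"))
    h.update(b"\x00")
    h.update(content_topic.encode("utf-8"))
    h.update(b"\x00")
    h.update(payload)
    return h.hexdigest()


def merge_histories(histories: Iterable[Iterable[Msg]]) -> List[Msg]:
    """Union several partial histories, deduplicate, and order by timestamp."""
    unique: Dict[str, Msg] = {}
    for history in histories:
        for msg in history:
            unique.setdefault(message_id(msg), msg)
    return sorted(unique.values())


node_a = [(1.0, "/c/general", b"one"), (2.0, "/c/general", b"two")]
node_b = [(2.0, "/c/general", b"two"), (3.0, "/c/general", b"three")]
full = merge_histories([node_a, node_b])
```

The same merge works whether it runs on a querying node that contacts several store nodes, or between store nodes that exchange what the other is missing.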

We can also consider utilizing other existing p2p storage systems like IPFS.

Please let me know your questions/comments @John-44.
cc: @oskarth

@staheri14
Contributor

@John-44 Can you please elaborate on the second requirement i.e., scalability. Especially about this part :

The history of a community can over time grow large, think multiple GBs and it might take a new community member days to download the full history of an established and active community.

I am not sure about the scalability that is mentioned in this part and the concern about the history size. My understanding is that the solution should give an option to the user to fetch a subset of the history but not necessarily the entire history, is this correct?

@D4nte
Contributor

D4nte commented Jul 9, 2021

May I add that similarly to the discovery issue, the store node needs to be reachable.

If Alice enables chat history on her Status Desktop laptop, which accesses the internet from behind NAT, then to allow Bob to connect to Alice to retrieve historical messages, either:

I believe we have not yet reviewed this in the context of Waku v2 and we may need some work to decide what method can and should be implemented.

@oskarth
Contributor

oskarth commented Jul 9, 2021

NAT is addressed here https://github.com/status-im/nim-waku/blob/master/waku/v2/node/config.nim#L44-L47 and follows same pattern as Nimbus is doing, which is working for Eth2.

@John-44
Author

John-44 commented Jul 9, 2021

@staheri14 thanks for your reply! I've got some questions to help me better understand exactly what is possible if we use waku store nodes to solve this problem, and to also hopefully flesh out the requirements of the community message history problem a bit more.

  1. In this protocol, nodes can voluntarily act as a storage server for the rest of the network. What they essentially do is that they listen to the network traffic and store messages. To provide reliable storage service, store nodes are supposed to have high availability and online time.
  1. Store nodes decide on the type of messages to persist. Messages are basically associated with two types of topics namely, pubsub topic and content topic. A store node may persist messages on an arbitrary combination of these topics. This property can be utilized to address the following item of the requirement list, i.e.,

From the above, is it correct to assume that waku store nodes can be configured in such a way so that an individual node only stores the message history for one or multiple specific community(s)?

You mention that waku store nodes are supposed to have high availability. However, could they also function in a similar manner to a torrent swarm, so that if there are say 20 individual waku store nodes, with intermittent connectivity, that are fetching and serving messages related to a specific community, then:

  • if only a single store node that is storing 100% of the history for a specific community is online, would that community's history be available to be served to all members of that community, with the limiting factor being the available upload bandwidth of the single store node that is online in this scenario?

  • If there are say 4 store nodes online, and each of these 4 nodes store a non-overlapping 25% of the history for a specific community (contrived scenario for the purpose of discussion), would this mean that 100% of the history for the specific community would be available to be served to all members of that community?

  • In the example above with 4 store nodes that start in a state where they are each storing a non-overlapping 25% of the history of the same specific community, will these 4 store nodes start fetching the portions of the community history they are missing from each other as soon as they go online, so that over time each of the 4 nodes in this example would end up storing 100% of the community history?

  • In the example above, once the 4 store nodes have completed downloading the missing history for a specific community from each other, does this mean that the aggregate upload bandwidth of all 4 nodes would be available to serve any new incoming requests for any specific moment in that community’s history?

  • As store nodes download portions of the history for a specific community from other store nodes, could they start serving this newly downloaded history to other store nodes immediately?

  • Whenever a store node comes online, can it look for gaps in the history it is storing for a specific community, and attempt to backfill these gaps if this information is available to be downloaded from other store nodes?

  • If waku store nodes can (or could be developed so that they could) do the above, would it be correct to say that waku store nodes could serve the history of a specific community in a reliable way, even if the connectivity of all the individual store nodes (that are serving the history for a specific community) is very much intermittent, as long as the nodes that are online at any one time collectively store 100% of the community's history, and each of them has sufficient bandwidth? To put it another way, I'm asking can waku store nodes be made as reliable as a torrent swarm would be with nodes going on and off line all the time? And if so, can Waku store nodes do this without impacting the overall capacity of the waku network?
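
For discussion purposes, the gap detection asked about above could look something like the sketch below: a node keeps a record of the time intervals it has already covered and computes the missing intervals it would need to request from other store nodes. The interval bookkeeping and the `find_gaps` name are assumptions; nothing here reflects an existing Waku feature.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds, with start <= end


def find_gaps(covered: List[Interval], start: float, end: float) -> List[Interval]:
    """Return the sub-intervals of [start, end] not covered by any interval in `covered`."""
    gaps: List[Interval] = []
    cursor = start
    for lo, hi in sorted(covered):
        if hi <= cursor:
            continue  # entirely before the part we still need
        if lo >= end:
            break  # entirely after the range of interest
        if lo > cursor:
            gaps.append((cursor, lo))
        cursor = max(cursor, hi)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# The node was online during [0, 10] and [20, 30] and wants history for [0, 40].
missing = find_gaps([(0.0, 10.0), (20.0, 30.0)], 0.0, 40.0)
```

Each returned interval can then be turned into a time-range query against whichever store nodes are reachable.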

  1. Other nodes can query store nodes and fetch subset of historical messages. Queries can be based on the conjunction of the time range, content topic, and pubsub topic. I imagine the possibility of subset query should address the second item of the requirement list i.e.,

Can the upload bandwidth used to serve the history of a specific community be throttled?

Can the download bandwidth used to fetch the history of a specific community be throttled?

Can a max storage size (denominated in units of storage say MB or GB) be set for a specific community? e.g. as a user, I would like to dedicate a maximum of 1GB to storing the history of the StatusPunks community. When a user set storage limit for a specific community is hit, would it be possible to keep the storage used within the limit the user has set with a first in first out (fifo) rule (when new messages arrive, the oldest messages are automatically deleted)?

Can a store node which is syncing the history for say 3 different communities, have different upload bandwidth, download bandwidth, and max storage size throttles for each community?
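
One common way to implement the kind of per-community and global bandwidth limits asked about here is a token bucket. The sketch below is an illustration of that general technique only; the class names, the rates, and the "statuspunks" key are hypothetical, and it says nothing about what waku store nodes support today.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TokenBucket:
    rate: float        # bytes added per second
    capacity: float    # maximum burst size in bytes
    tokens: float = 0.0
    updated_at: float = 0.0

    def refill(self, now: float) -> None:
        """Add tokens for the time elapsed since the last refill, capped at capacity."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now


class UploadThrottle:
    """Allows a send only if both the global and the per-community budget permit it."""

    def __init__(self, global_bucket: TokenBucket, per_community: Dict[str, TokenBucket]):
        self.global_bucket = global_bucket
        self.per_community = per_community

    def try_send(self, community: str, n_bytes: int, now: float) -> bool:
        buckets = [self.global_bucket, self.per_community[community]]
        for bucket in buckets:
            bucket.refill(now)
        if any(bucket.tokens < n_bytes for bucket in buckets):
            return False  # check first so a refusal consumes nothing
        for bucket in buckets:
            bucket.tokens -= n_bytes
        return True


throttle = UploadThrottle(
    global_bucket=TokenBucket(rate=100.0, capacity=100.0),
    per_community={"statuspunks": TokenBucket(rate=50.0, capacity=50.0)},
)
first = throttle.try_send("statuspunks", 40, now=1.0)   # within both budgets
second = throttle.try_send("statuspunks", 40, now=1.0)  # community budget exhausted
third = throttle.try_send("statuspunks", 40, now=2.0)   # budget refilled after one second
```

A second `UploadThrottle`-style instance could govern downloads in the same way, giving separate upload and download limits for each community.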

This problem needs to be solved without putting a load on the Waku network that will reduce the max number of users Waku can scale to (e.g. we don't want to trade Waku network scalability to solve this problem). The history of a community can over time grow large, think multiple GBs and it might take a new community member days to download the full history of an established and active community.

  1. There is no limit on the number of days for message persistence, and no guarantee either.

I understand that there is no limit on the number of days for message persistence and no guarantee either. But what I’m asking here is slightly different - would using waku store nodes to serve and fetch historical community messages reduce the maximum aggregate waku messaging bandwidth that is available to Status mobile and desktop users for realtime communication? E.g. Let's say Status mobile and desktop becomes fantastically popular and everybody in the world stops using WhatsApp, Messenger, Telegram, etc… and switches to only using Status for all their messaging communication. What’s the maximum number of real time person to person messages per second that Status Mobile and Desktop (using Waku 2) could support in this scenario? And does this maximum number of realtime person to person messages per second that waku2 can support drop if the waku2 protocol is also being used to fetch and serve historical community message history? Does that question make sense? For clarity, I'm not looking for actual numbers here; what I'm interested in is whether using waku store nodes to solve the community message history problem reduces the global max number of messages per second that Status mobile and desktop can use for near realtime p2p messaging.

WAKU2 Fault-Tolerant Store protocol

This protocol addresses the issue regarding the offline nodes (both store nodes and querying nodes).
That is, this protocol is designed to allow nodes to go offline and yet be able to fill the gaps in their message history given that they know at least one other store node that has been online for the same period. The querying node (who has been off-line for let's say 2 weeks) needs to know one store node in the system which has been online for that 2-week duration. Then, it can fetch the historical messages from that store node using a time-based query. In this solution, we assume the store node and the querying node have synchronized clocks +-20 seconds.

Ahh, so from this it sounds like it’s possible for waku store nodes to backfill gaps in each other’s history stores :-) The requirement for time synchronization makes me a bit nervous though, especially as historic message data should be immutable, and a timing dependency without a global source of time truth that we can rely on makes me worry that relying on timing could lead to small amounts of historic messages being lost over time (which could add up to the message history of a community degrading more and more with time, and would lead to users feeling that they couldn’t rely on historic messages being available).

It might be worth pointing out that the requirements of ‘near-realtime global messaging’ and ‘p2p storage and distribution of the historic message history for specific communities’ are different in several important aspects:

Historic message history is a finalized set of data that will never change in the future. The recently developed ‘edit message’ feature could be capped to only allow the editing of messages posted up to say 3 days in the past. After this point a community’s message history should become immutable.

Historic message history doesn’t need to be delivered in near realtime

Serving and fetching historic message history is far more bandwidth intensive than sending and receiving near realtime messages.

The maximum number of members in any given community will always be smaller than the maximum number of people in the world using Status mobile and desktop.

Because of these attributes, if it was useful or advantageous to do so, historic message history could perhaps even be treated as chunked binary blobs and distributed over a p2p distributed file sharing protocol! There are several quite radically different approaches we could take to solving this community message history problem, so I think it’s useful to weigh up the pros and cons of each approach before we choose a specific implementation direction.

  1. No guarantee on the availability of full message history. The availability of the history entirely relies on the volunteer store nodes to stay online 24/7 and persist all the messages.

If the guarantee of the availability of full message history has the same properties as the guarantee of data availability that the torrent protocol can give for a binary blob, then we are good. Status communities are responsible for themselves, we just need to provide the tools that ensure that if there are a sufficient number of community members who wish to store and serve the history of a specific community, then the history of that specific community will be preserved. There is a sliding scale here between the availability of nodes and the number of nodes needed. Some communities might put effort into making sure that there are always 2 or 3 Status Desktop nodes with good bandwidth almost always online to serve message history for their community. Or a large community might just rely on the fact that out of thousands of members, there will be enough members with community history serving turned on at any one time to reliably serve that community’s history (e.g. a far greater number of nodes, each online for a much smaller percentage of the time should equal a small number of nodes that are almost always online with good bandwidth).
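
The trade-off between node count and per-node uptime can be put into rough numbers. Under the simplifying assumptions that every serving node holds the full history and that nodes are online independently of each other with the same probability p, the chance that at least one of n nodes is online is 1 - (1 - p)^n. Real uptimes are correlated (time zones, working hours), so treat this as a back-of-the-envelope sketch only.

```python
import math


def availability(n_nodes: int, p_online: float) -> float:
    """Probability that at least one of n independent nodes is online."""
    if n_nodes < 0 or not 0.0 <= p_online <= 1.0:
        raise ValueError("need n_nodes >= 0 and 0 <= p_online <= 1")
    return 1.0 - (1.0 - p_online) ** n_nodes


def nodes_needed(p_online: float, target: float) -> int:
    """Smallest number of independent nodes whose combined availability reaches `target`."""
    if not 0.0 < p_online <= 1.0 or not 0.0 <= target < 1.0:
        raise ValueError("need 0 < p_online <= 1 and 0 <= target < 1")
    if p_online == 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p_online)))


few_reliable = availability(3, 0.9)      # three nodes that are online 90% of the time
many_casual = nodes_needed(0.1, 0.999)   # nodes online 10% of the time, same target
```

With these assumptions, three nodes at 90% uptime give roughly 99.9% availability, and matching that with nodes that are online only 10% of the time takes on the order of 66 nodes, which is well within reach for a community with thousands of members.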

2. The history provided by the individual store nodes is not necessarily reliable and does not necessarily reflect the full history. This is because some of the messages may never reach a store node due to network or topology issues. This means the querying node cannot rely on the message history provided by a single store node and may have to fetch the history from multiple store nodes.

This scares me a bit as well. However, any solution that delivers the end goal of preserving and serving reliable, immutable community message history is fine, even if this is achieved by relying on multiple nodes and with the fetching of community history happening progressively over time. E.g. even if the initial message history sync is unreliable, that’s ok as long as the message history self-heals over time by fetching missing messages when they become available.

Note it shouldn't be possible for somebody to insert false messages into the message history, this would be very bad.

Treating the message history for each channel in each community as a chunked binary blob and delivering this over an existing p2p distributed file sharing protocol could perhaps sidestep some of these reliability concerns? But would probably have a bunch of different downsides. humm...
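
To illustrate the chunked-blob idea, the sketch below splits a serialized history into fixed-size chunks and records a SHA-256 hash per chunk, similar in spirit to the piece hashes used by torrent-style protocols. A peer that holds a trusted list of hashes can then reject any chunk that was altered or fabricated. How that list is itself authenticated (for example, by a signature from the community owner) is outside this sketch, and all function names are hypothetical.

```python
import hashlib
from typing import List


def split_into_chunks(blob: bytes, chunk_size: int) -> List[bytes]:
    """Cut the blob into consecutive chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]


def build_manifest(chunks: List[bytes]) -> List[str]:
    """One SHA-256 hex digest per chunk, in order."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def verify_chunk(index: int, chunk: bytes, manifest: List[str]) -> bool:
    """Accept a downloaded chunk only if it matches the hash recorded for its position."""
    return 0 <= index < len(manifest) and hashlib.sha256(chunk).hexdigest() == manifest[index]


history_blob = b"channel history serialized as bytes" * 10
chunks = split_into_chunks(history_blob, chunk_size=64)
manifest = build_manifest(chunks)
ok = verify_chunk(0, chunks[0], manifest)
tampered = verify_chunk(0, b"forged content", manifest)
```

Because each chunk is verified independently, chunks can be fetched from different peers in any order and reassembled once all of them pass verification.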

3. No discovery mechanism, that is, the store nodes are not discoverable by the rest of the network. Currently, the querying nodes have to know the store nodes out of band and configure them statically in the protocol.

Because we are focusing on how to fetch, store, preserve and serve the history of specific communities, perhaps there are other aspects of how members of a community communicate that we could utilize for discovery? The only people who would be motivated to fetch and serve history for a specific community are folks who have an interest in that specific community, which means they are probably members of that community. Conversely, somebody who is not a member of a community is very unlikely to be fetching and serving the history of that community (and doing this won’t be an option in the UI - only a user who is a member of a community will be able to turn on history fetching and serving for that community in the UI).

We can also consider utilizing other existing p2p storage systems like IPFS.

I think this would be very useful, as it would provide benchmarks against which we can measure the properties of possible solutions. In addition to IPFS, how about torrent? Torrent is old, but it works, it’s super reliable, it’s fast, and numerous open source torrent implementations exist, including torrent libraries written in many languages. As the message history of each channel in a community could be treated as an immutable chunked binary blob, perhaps there is another p2p file distribution protocol that already exists and meets all our requirements?

Many thanks for looking into this problem! :-)

@John-44
Author

John-44 commented Jul 9, 2021

@John-44 Can you please elaborate on the second requirement i.e., scalability. Especially about this part :

The history of a community can over time grow large, think multiple GBs and it might take a new community member days to download the full history of an established and active community.

I am not sure about the scalability that is mentioned in this part and the concern about the history size. My understanding is that the solution should give an option to the user to fetch a subset of the history but not necessarily the entire history, is this correct?

As the message history of a community will include every message and image posted to every channel in the community, the message history for a large and active community (and especially a community where a lot of images are posted) can and will grow to a large size (multiple GBs)!

This is why it's important that any solution to this problem should let users set a max limit (in MB or GB) of the amount of storage they are willing to dedicate to storing the history of a specific community. When a user sets this limit, the default behaviour should be to start from the present and work chronologically backwards fetching the community's history until the storage limit the user has set is hit. Once this storage limit is hit, whatever solution we choose needs to be able to stay inside this limit by deleting the oldest messages as new messages arrive.
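
The storage behaviour described above (fill backwards from the present until the limit is hit, then evict the oldest messages as new ones arrive) can be sketched with a double-ended queue. The `BoundedHistory` class is hypothetical, and for simplicity it measures size by payload bytes only and assumes new messages arrive in chronological order while backfilled messages arrive newest-first.

```python
from collections import deque
from typing import Deque, List, Tuple


class BoundedHistory:
    """Keeps one community's history under a byte limit, evicting oldest first."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._messages: Deque[Tuple[float, bytes]] = deque()  # oldest on the left

    def append(self, timestamp: float, payload: bytes) -> None:
        """Store a newly arrived message, then evict oldest messages until under the limit."""
        self._messages.append((timestamp, payload))
        self.used_bytes += len(payload)
        while self.used_bytes > self.max_bytes and self._messages:
            _, evicted = self._messages.popleft()
            self.used_bytes -= len(evicted)

    def backfill(self, timestamp: float, payload: bytes) -> bool:
        """Store an older message while fetching backwards; refuse once the limit would be exceeded."""
        if self.used_bytes + len(payload) > self.max_bytes:
            return False
        self._messages.appendleft((timestamp, payload))
        self.used_bytes += len(payload)
        return True

    def timestamps(self) -> List[float]:
        return [ts for ts, _ in self._messages]


history = BoundedHistory(max_bytes=10)
history.append(1.0, b"aaaa")
history.append(2.0, b"bbbb")
history.append(3.0, b"cccc")               # exceeds 10 bytes, so the oldest message is evicted
refused = history.backfill(0.5, b"zzzz")   # would exceed the limit, so backfilling stops
```

A real implementation would also count images and metadata towards the limit, and would apply a separate global cap across all communities.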

@staheri14
Contributor

@staheri14 thanks for your reply! I've got some questions to help me better understand exactly what is possible if we use waku store nodes to solve this problem, and to also hopefully flesh out the requirements of the community message history problem a bit more.

  1. In this protocol, nodes can voluntarily act as a storage server for the rest of the network. What they essentially do is that they listen to the network traffic and store messages. To provide reliable storage service, store nodes are supposed to have high availability and online time.
  1. Store nodes decide on the type of messages to persist. Messages are basically associated with two types of topics namely, pubsub topic and content topic. A store node may persist messages on an arbitrary combination of these topics. This property can be utilized to address the following item of the requirement list, i.e.,

From the above, is it correct to assume that waku store nodes can be configured in such a way so that an individual node only stores the message history for one or multiple specific community(s)?

It depends on how waku v2 is utilized, but the simple answer is yes. If each community has a specific and distinct content topic or pubsub topic (that is, all the messages created within that community carry that content topic or pubsub topic), then messages within each community can be distinguished from those of other communities, hence store nodes would be able to decide on persisting/not persisting messages for specific communities.
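To make the topic-based selection concrete, here is a minimal Python sketch of a store node that only persists messages for the communities it cares about. It is an illustration, not the nim-waku implementation: the `Message` and `StoreNode` classes and the topic strings are hypothetical, and it assumes one content topic per community.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Message:
    pubsub_topic: str
    content_topic: str
    timestamp: float
    payload: bytes


@dataclass
class StoreNode:
    # Content topics this node has agreed to persist (assumed: one per community).
    persisted_content_topics: Set[str]
    messages: List[Message] = field(default_factory=list)

    def on_message(self, msg: Message) -> bool:
        """Persist the message only if its content topic is one we store."""
        if msg.content_topic not in self.persisted_content_topics:
            return False
        self.messages.append(msg)
        return True


# Hypothetical topic names, for illustration only.
node = StoreNode({"/community-a/1/chat/proto"})
node.on_message(Message("/waku/2/default-waku/proto", "/community-a/1/chat/proto", 1.0, b"hi"))
node.on_message(Message("/waku/2/default-waku/proto", "/community-b/1/chat/proto", 2.0, b"yo"))
print(len(node.messages))
```

The same filter could be applied on the pubsub topic instead, if communities end up being segregated at that level.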

You mention that waku store nodes are supposed to have high availability. However, could they also function in a similar manner to a torrent swarm, so that if there are say 20 individual waku store nodes, with intermittent connectivity, that are fetching and serving messages related to a specific community, then:

  • if only a single store node that is storing 100% of the history for a specific community is online, would that community's history be available to be served to all members of that community, with the limiting factor being the available upload bandwidth of the single store node that is online in this scenario?

I am answering this question based on the current state of the store protocol (assuming that there is no synchronization between store nodes).
In this case, the node with 100% availability should be publicly known to the entire system. For example, currently we have a set of static nodes with 100% availability, and their addresses are publicly available. However, this is a centralized solution that we are moving away from by designing and developing the store synchronization protocol.

  • If there are say 4 store nodes online, and each of these 4 nodes store a non-overlapping 25% of the history for a specific community (contrived scenario for the purpose of discussion), would this mean that 100% of the history for the specific community would be available to be served to all members of that community?

Similar to my prior comment, the answer is yes given that the address of those nodes is publicly known.
Alternatively, a capability discovery method can be utilized to find nodes based on their availability. But we do not have a discovery method implemented yet.

  • In the example above with 4 store nodes that start in a state where they are each storing a non-overlapping 25% of the history of the same specific community, will these 4 store nodes start fetching the portions of the community history they are missing from each other as soon as they go online, so that over time each of the 4 nodes in this example would end up storing 100% of the community history?

Depending on the situation, the answer is different:

  • If those nodes are connected to each other (they know each other) then they can fetch the history using FT-store protocol. So the answer is yes.
  • If those nodes are not connected (they do not know each other) then the answer is no.
  • The answer would be yes when we finish the development of the store synchronization protocol.

  • In the example above, once the 4 store nodes have completed downloading the missing history for a specific community from each other, does this mean that the aggregate upload bandwidth of all 4 nodes would be available to serve any new incoming requests for any specific moment in that community’s history?

The aggregate bandwidth is available, but there is no load balancing technique in place. It all depends on how users query those nodes; one node might be overloaded while others are idle.

  • As store nodes download portions of the history for a specific community from other store nodes, could they start serving this newly downloaded history to other store nodes immediately?

Yes they can, but they must have initially started as a full store protocol, I think nodes would not be able to change their mounted protocols afterward (or we currently do not support such behavior) cc: @oskarth

  • Whenever a store node comes online, can it look for gaps in the history it is storing for a specific community, and attempt to backfill these gaps if this information is available to be downloaded from other store nodes?

Yes, it can do so by knowing another node that has been available for that time.

  • If waku store nodes can (or could be developed so that they could) do the above, would it be correct to say that waku store nodes could serve the history of a specific community in a reliable way, even if the connectivity of all the individual store nodes (that are serving the history for a specific community) is very much intermittent, as long as the nodes that are online at any one time collectively store 100% of the community's history, and each of them has sufficient bandwidth? To put it another way, I'm asking can waku store nodes be made as reliable as a torrent swarm would be with nodes going on and off line all the time? And if so, can Waku store nodes do this without impacting the overall capacity of the waku network?

Right now, under the current state of the store protocol, the answer to this question is no. But with the store synchronization protocol, the answer will be yes.
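As a rough illustration of the gap detection discussed above, the Python sketch below computes which time ranges a node still needs to backfill, given the intervals during which it was online. The function name `find_gaps` and the idea of tracking coverage as `(start, end)` intervals are assumptions made for this example; the store protocol does not prescribe this bookkeeping.

```python
def find_gaps(covered, start, end):
    """Return the sub-ranges of [start, end) not covered by any interval.

    `covered` is a list of (lo, hi) timestamps during which the node was
    online and persisting messages (an assumed bookkeeping format).
    """
    gaps = []
    cursor = start
    for lo, hi in sorted(covered):
        if hi <= cursor:
            continue  # interval lies entirely before the cursor
        if lo >= end:
            break  # interval lies entirely after the range of interest
        if lo > cursor:
            gaps.append((cursor, lo))  # uncovered stretch before this interval
        cursor = max(cursor, hi)
    if cursor < end:
        gaps.append((cursor, end))  # uncovered tail
    return gaps


# The node was online during two stretches and wants history for [0, 40).
print(find_gaps([(0, 10), (20, 30)], 0, 40))
```

Each returned range could then be turned into a time-based history query against another store node that was online during that range.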

Can you please explain more about this "And if so, can Waku store nodes do this without impacting the overall capacity of the waku network?", specifically, what sort of impact you mean?

  1. Other nodes can query store nodes and fetch subset of historical messages. Queries can be based on the conjunction of the time range, content topic, and pubsub topic. I imagine the possibility of subset query should address the second item of the requirement list i.e.,

Can the upload bandwidth used to serve the history of a specific community be throttled?

It can be, but we do not have support for it right now. cc: @oskarth from my point of view it should be feasible, wdyt?

Can the download bandwidth used to fetch the history of a specific community be throttled?

I think it should be feasible, but we do not have support for it right now. cc: @oskarth re feasibility, wdyt?

Can a max storage size (denominated in units of storage, say MB or GB) be set for a specific community? E.g. as a user, I would like to dedicate a maximum of 1GB to storing the history of the StatusPunks community. When a user-set storage limit for a specific community is hit, would it be possible to keep the storage used within the limit the user has set with a first in first out (fifo) rule (when new messages arrive, the oldest messages are automatically deleted)?

From my point of view, it should be fairly doable, but we do not have support for it right now. @oskarth wdyt about feasibility?
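A minimal Python sketch of the first in first out rule described in the question, assuming messages arrive in chronological order and that only payload bytes count towards the limit. `CappedHistory` is a hypothetical name; nothing like it exists in the store protocol today.

```python
from collections import deque


class CappedHistory:
    """Keeps one community's history within a user-set byte limit (FIFO)."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._messages = deque()  # oldest message on the left

    def add(self, timestamp: float, payload: bytes) -> None:
        self._messages.append((timestamp, payload))
        self.used_bytes += len(payload)
        # Evict the oldest messages until we are back inside the limit.
        while self.used_bytes > self.max_bytes and self._messages:
            _, old_payload = self._messages.popleft()
            self.used_bytes -= len(old_payload)

    def oldest_timestamp(self):
        return self._messages[0][0] if self._messages else None


history = CappedHistory(max_bytes=10)
for ts in (1, 2, 3):
    history.add(ts, b"abcd")  # 4 bytes each; the third add evicts the first
print(history.used_bytes, history.oldest_timestamp())
```

A real implementation would also have to account for images and database overhead, which is likely where most of the storage goes.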

Can a store node which is syncing the history for say 3 different communities, have different upload bandwidth, download bandwidth, and max storage size throttles for each community?

IMO, it should be feasible, but we do not have support for it right now. @oskarth wdyt about feasibility?
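One possible shape for per-community throttling is a token bucket per community, sketched below in Python. This is only an assumption about how such a feature might look: `TokenBucket`, `upload_limits` and `may_upload` are hypothetical names, the rates are made-up numbers, and the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allows `rate_bytes_per_s` on average, with bursts up to `burst_bytes`."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float, now: float = 0.0):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last_refill = now

    def try_consume(self, nbytes: int, now: float) -> bool:
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Hypothetical per-community upload limits, keyed by community id.
upload_limits = {
    "community-a": TokenBucket(rate_bytes_per_s=100, burst_bytes=200),
    "community-b": TokenBucket(rate_bytes_per_s=1000, burst_bytes=1000),
}


def may_upload(community_id: str, nbytes: int, now: float) -> bool:
    return upload_limits[community_id].try_consume(nbytes, now)
```

A second dictionary of buckets could cover download bandwidth in the same way, with the storage limit handled separately.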

  1. There is no limit on the number of days for message persistence, and no guarantee either.

I understand that there is no limit on the number of days for message persistence and no guarantee either. But what I’m asking here is slightly different - would using waku store nodes to serve and fetch historical community messages reduce the maximum aggregate waku messaging bandwidth that is available to Status mobile and desktop users for realtime communication? E.g. let's say Status mobile and desktop becomes fantastically popular and everybody in the world stops using WhatsApp, Messenger, Telegram, etc… and switches to only using Status for all their messaging communication. What’s the maximum number of realtime person to person messages per second that Status Mobile and Desktop (using Waku 2) could support in this scenario? And does this maximum number of realtime person to person messages per second that waku2 can support drop if the waku2 protocol is also being used to fetch and serve historical community message history? Does that question make sense? For clarity, I'm not looking for actual numbers here; what I'm interested in is whether using waku store nodes for solving the community message history problem reduces the global max number of messages per second that Status mobile and desktop can use for near realtime p2p messaging.

The answer is no (it does not reduce the global max number of messages per second), or at least IMO the impact should not be noticeable. The reason is that the regular nodes, which are likely resource-constrained devices and constitute a large portion of the p2p messaging ecosystem, are different from the nodes that run the store protocol and serve historical messages. The load of maintaining historical messages is all on the store nodes while the rest of the nodes (that we call light nodes) can perform their regular operations without being affected by the store protocol overhead.

The full nodes that engage in the store protocol will have a higher load, which will consume their resources, but they are supposed to have higher computational, storage, and bandwidth capacity so that they are not overwhelmed by providing the storage service. Even for full nodes, we should be able to control the load based on consumed bandwidth, the number of persisted messages, etc.

My answer is tentative, a concrete answer requires benchmarking and more investigation.
cc: @oskarth

WAKU2 Fault-Tolerant Store protocol
This protocol addresses the issue regarding the offline nodes (both store nodes and querying nodes).
That is, this protocol is designed to allow nodes to go offline and yet be able to fill the gaps in their message history, given that they know at least one other store node that has been online for the same period. The querying node (which has been offline for, let's say, 2 weeks) needs to know one store node in the system which has been online for that 2-week duration. Then, it can fetch the historical messages from that store node using a time-based query. In this solution, we assume the store node and the querying node have clocks synchronized to within ±20 seconds.
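To illustrate the time-based query, here is a small Python sketch. It assumes the querying node widens its query window by the 20-second clock tolerance on both sides and deduplicates by message id when merging; the function names and the `(msg_id, timestamp)` tuple format are hypothetical, chosen for this example only.

```python
MAX_CLOCK_SKEW_SECONDS = 20  # the tolerance mentioned above


def backfill_window(offline_from, online_at, skew=MAX_CLOCK_SKEW_SECONDS):
    """Widen the offline period by the allowed clock skew on both sides."""
    return (offline_from - skew, online_at + skew)


def answer_query(stored, start, end):
    """What a store node would return for a time-range query."""
    return [m for m in stored if start <= m[1] <= end]


def merge(local, fetched):
    """Merge fetched messages into local history, deduplicating by id."""
    by_id = {m[0]: m for m in local}
    for m in fetched:
        by_id.setdefault(m[0], m)
    return sorted(by_id.values(), key=lambda m: (m[1], m[0]))


# Messages are (msg_id, timestamp) tuples. The node went offline at t=100
# and came back at t=1000.
local = [("a", 100)]
store = [("a", 100), ("b", 95), ("c", 500), ("d", 1015), ("e", 2000)]
start, end = backfill_window(100, 1000)
print([m[0] for m in merge(local, answer_query(store, start, end))])
```

Messages "b" and "d" fall slightly outside the offline period but inside the widened window, which is the case the tolerance is meant to cover.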

Ahh, so from this it sounds like it’s possible for waku store nodes to backfill gaps in each other’s history stores :-) The requirement for time synchronization makes me a bit nervous though, especially as historic message data should be immutable. A timing dependency without a global source of time truth that we can rely on makes me worry that relying on timing could lead to small amounts of historic messages being lost over time (which could add up to the message history of a community degrading more and more with time, and would lead to users feeling that they couldn’t rely on historic messages being available).

It might be worth pointing out that the requirements of ‘near-realtime global messaging’ and ‘p2p storage and distribution of the historic message history for specific communities’ are different in several important aspects:

Historic message history is a finalized set of data that will never change in the future. The recently developed ‘edit message’ feature could be capped to only allow the editing of messages posted up to say 3 days in the past. After this point a community’s message history should become immutable.

Historic message history doesn’t need to be delivered in near realtime

Serving and fetching historic message history is far more bandwidth intensive than sending and receiving near realtime messages.

The maximum number of members in any given community will always be smaller than the maximum number of people in the world using Status mobile and desktop.

Because of these attributes, if it was useful or advantageous to do so, historic message history could perhaps even be treated as chunked binary blobs and distributed over a p2p distributed file sharing protocol! There are several quite radically different approaches we could take to solving this community message history problem, so I think it’s useful to weigh up the pros and cons of each approach before we choose a specific implementation direction.

That is right! Using other existing file-sharing protocols is a sane approach and may even save time on the development side.

  1. No guarantee on the availability of full message history. The availability of the history entirely relies on the volunteer store nodes to stay online 24/7 and persist all the messages.

If the guarantee of the availability of full message history has the same properties as the guarantee of data availability that the torrent protocol can give for a binary blob, then we are good. Status communities are responsible for themselves, we just need to provide the tools that ensure that if there are a sufficient number of community members who wish to store and serve the history of a specific community, then the history of that specific community will be preserved. There is a sliding scale here between the availability of nodes and the number of nodes needed. Some communities might put effort into making sure that there are always 2 or 3 Status Desktop nodes with good bandwidth almost always online to serve message history for their community. Or a large community might just rely on the fact that out of thousands of members, there will be enough members with community history serving turned on at any one time to reliably serve that community’s history (e.g. a far greater number of nodes, each online for a much smaller percentage of the time should equal a small number of nodes that are almost always online with good bandwidth).

  1. The history provided by the individual store nodes is not necessarily reliable and does not necessarily reflect the full history. This is because some of the messages may never reach a store node due to network or topology issues. This means the querying node cannot rely on the message history provided by a single store node and may have to fetch the history from multiple store nodes.

This scares me a bit as well. However, any solution that delivers the end goal of preserving and serving reliable, immutable community message history is fine, even if this is achieved by relying on multiple nodes and by fetching community history progressively over time. E.g. even if the initial message history sync is unreliable, that’s ok as long as the message history heals itself over time by fetching missing messages when they become available.

With the store synchronization protocol (that is our future work) the full history will be reconstructed on all the store nodes given that store nodes collectively have the entire history.

The immutability requirement is a somewhat separate feature from the full history availability. The waku store protocol does not account for it, neither in its current state nor in the prospective store synchronization protocol. This will need more thought and may have to be addressed in a separate protocol or cause some changes on the existing proposal of store synchronization protocol. cc: @oskarth

Note it shouldn't be possible for somebody to insert false messages into the message history, this would be very bad.

That is indeed a concern, but that is more of an access control problem and should be treated isolated from history persistence and synchronization. At a high level, digital signatures can be used to prevent malicious message injection (but if the anonymity of the users matter then zero-knowledge proofs can be utilized). Let me know if you would like to discuss it further. cc: @oskarth

Treating the message history for each channel in each community as a chunked binary blob and delivering this over an existing p2p distributed file sharing protocol could perhaps sidestep some of these reliability concerns? But would probably have a bunch of different downsides. humm...

  1. No discovery mechanism, that is, the store nodes are not discoverable by the rest of the network. Currently, the querying nodes have to know the store nodes out of band and configure them statically in the protocol.

Because we are focusing on how to fetch, store, preserve and serve the history of specific communities, perhaps there are other aspects of how members of a community communicate that we could utilize for discovery? The only people who would be motivated to fetch and serve history for a specific community are folks who have an interest in that specific community, which means they are probably members of that community. Conversely, somebody who is not a member of a community is very unlikely to be fetching and serving the history of that community (and doing this won’t be an option in the UI - only a user who is a member of a community will be able to turn on history fetching and serving for that community in the UI).

We can also consider utilizing other existing p2p storage systems like IPFS.

I think this would be very useful, to provide benchmarks against which we can measure the properties of possible solutions. In addition to IPFS, how about torrent? Torrent is old, but it works, it’s super reliable, it’s fast, and numerous open source torrent implementations exist, including torrent libraries written in many languages. As the message history of each channel in a community could be treated as an immutable chunked binary blob, perhaps there is another p2p file distribution protocol that already exists and meets all our requirements?

I agree that the solution can be based on IPFS and Torrent, this would require a clear integration idea that takes all the community requirements into account. Designing a hybrid solution based on the existing tools like Torrent requires more study and investigation though.

Many thanks for looking into this problem! :-)
You are very welcome! :)
Hope that I could address your questions!
Let me know if there are further concerns.

@staheri14
Contributor

staheri14 commented Jul 13, 2021

As the message history of a community will include every message and image posted to every channel in the community, the message history for a large and active community (and especially a community where a lot of images are posted) can and will grow to a large size (multiple GBs)!

This is why any solution to this problem must let users set a max limit (in MB or GB) of the amount of storage they are willing to dedicate to storing the history of a specific community. When a user sets this limit, the default behaviour should be to start from the present and work chronologically backwards fetching the community's history until the storage limit the user has set is hit. Once this storage limit is hit, whatever solution we choose needs to be able to stay inside this limit by deleting the oldest messages as new messages arrive.

I see your point! Totally! Limiting the storage of each store node would be easy from the configuration point of view; however, that would impact the full history availability. Depending on whether an existing file-sharing tool is going to be used or just the store protocol, we can think of a solution for that as well.

@oskarth oskarth changed the title The community message history problem Feasiblity discussion: The community message history problem Jul 14, 2021
@oskarth
Contributor

oskarth commented Jul 16, 2021

Thanks @staheri14 and @John-44 for the discussion! Some partial answers to things mentioned:

Yes they can, but they must have initially started as a full store protocol, I think nodes would not be able to change their mounted protocols afterward (or we currently do not support such behavior) cc: @oskarth

It might be possible to do dynamically, but the easiest way to deal with this is to either 1) start with this capability or 2) restart the node when changing preference.

Upload/download throttling
It can be, but we do not have support for it right now. cc: @oskarth from my point of view it should be feasible, wdyt?

Indeed shouldn't be an issue if this is something we need. Also see SWAP protocol for a more economic solution, but this can be complementary.

Can a store node which is syncing the history for say 3 different communities, have different upload bandwidth, download bandwidth, and max storage size throttles for each community?
IMO, it should be feasible, but we do not have support for it right now. @oskarth wdyt about feasibility?

Technically not an issue, assuming communities are segregated by pubsub and/or content topic (which they should be), but it has to be implemented. Doesn't seem to be high prio to me personally but more something for later iterations?

Load
The answer is no (it does not reduce the global max number of messages per second), or at least IMO the impact should not be noticeable. [...] My answer is tentative, a concrete answer requires benchmarking and more investigation.

Conceptually correct, by "Waku network" usually we mean "Relay network". This can be segregated by pubsub topic to reduce load. For historical messages and store nodes this happens in a req/resp manner, so the load is on an individual node.

There are some additional things to take into account here, such as what would happen with store synchronization across multiple Store nodes. This wouldn't impact regular nodes though, and we can bound this by O(C) where C is community size.

That is right! Using other existing file-sharing protocols is a sane approach and may even save time on the development side.

I think this can be complementary, where it might make sense to piggyback on IPFS/BitTorrent for static archives as an implementation/optimization detail, and then make sure the interface is clean / we use existing flow for dynamic portion. Or something like this.

Immutability, etc
This will need more thought and may have to be addressed in a separate protocol or cause some changes on the existing proposal of store synchronization protocol.

Messages on their own are immutable. Depending on how the community protocol works, and if there's a closed group (i.e. not public chat) then there might be some sequence order / merkle tree stuff we can do here.

In general, it is worth pointing out that e.g. Bittorrent is meant for static content, and a chat history is usually dynamic. To have a canonical immutable representation requires either a source of truth and/or some consensus mechanism. Notice that this is true regardless of what specific method we end up using (who decides the new IPFS hash? etc). I do think the store synchronization protocol should help with this, but I think it'd be useful to have a closer look at this requirement by looking at what the Community spec looks like.

That is indeed a concern, but that is more of an access control problem and should be treated isolated from history persistence and synchronization.

Agree, my understanding is that community has some access control mechanism so there should be some way to address this. If there's a strong need for some form of validation at store level (as opposed to once you've fetched stuff down) we could consider having that as a plugin. Notice that right now store nodes don't have access to the contents of messages, and in order to do filtering of this nature you'd need some form of access.

@arnetheduck

Status Communities needs to offer equivalent behaviour to meet user expectations.

what I think is important to begin with is to unpack the problem into smaller components, and see what can be done with each component individually, with regards to decentralization:

  • a torrent, or any DHT really, is able to store immutable content durably, if there are altruistic nodes around, as others have pointed out
  • a history needs a way to establish (at least partial) ordering of messages
    • a weak history function allows traversing a history up to the first missing item
    • a stronger history function allows reconstructing the message graph even in the face of partial information (ie a simple counter will tell you more or less how many messages you missed, but has other problems)
  • it's important to recognize that history functions constitute metadata that can be used to classify communication

the history problem is usually modelled as a distributed log where a mechanism exists to correlate messages - this can either be a merkle tree, a list of back-pointers to previous messages (or siblings), counters, lamport clocks and so on.

Once a graph is established, the contents of messages can also be fetched - in many forms of communication, knowing that a message is missing fulfills quality requirements even if content is not available - the content can maybe be filled in later as long as there's proof that it was indeed part of the stream to begin with.
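As a small illustration of the "weak history function" described above, the Python sketch below walks a list of back-pointers from the newest message towards the oldest and stops at the first missing item. The `parents` mapping (message id to previous message id) and the function name `walk_back` are assumptions for this example; real messages would carry hashes rather than short ids.

```python
def walk_back(parents, head_id):
    """Follow back-pointers from `head_id`.

    `parents` maps each known message id to the id of its predecessor
    (None for the first message). Returns (chain, missing_id), where
    `missing_id` is the first referenced message we do not have, or None
    if the chain is complete.
    """
    chain = []
    seen = set()  # guards against accidental cycles in malformed data
    current = head_id
    while current is not None and current not in seen:
        if current not in parents:
            return chain, current
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain, None


# "m2" was never received, so traversal stops there.
known = {"m4": "m3", "m3": "m2", "m1": None}
print(walk_back(known, "m4"))
```

Knowing that "m2" exists but is missing is exactly the kind of proof mentioned above: the client can show a placeholder and fetch the content later.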

Key however is that every feature has a cost - being decentralised means amplifying that cost, usually - the first step must thus be to precisely define what the absolute minimum requirements are that will keep users happy, and whether there exist loopholes that can be used to emulate otherwise critical functionality - to name a trivial example, latency in text communication is often mentioned as a hard requirement, but if you merely show a progress indicator, most users are happy to wait, because they know "something" is happening, and the latency problem becomes an optimization point rather than an up-front requirement.

@John-44
Author

John-44 commented Jul 16, 2021

As promised, here is a link to some work in progress designs for the community history service management screens. A few things to note:

  • We will probably get rid of the memory cap control because most likely it will not be needed. So ignore this in the designs.
  • These designs aren't final, they have some design pattern issues that need to be looked at next week plus we will want to revisit the designs once we've decided on an implementation direction.

With that said, hopefully the designs give an idea of the types of controls and statistics feedback that a user could find useful.

@John-44
Author

John-44 commented Jul 16, 2021

In general, it is worth pointing out that e.g. Bittorrent is meant for static content, and a chat history is usually dynamic. To have a canonical immutable representation requires either a source of truth and/or some consensus mechanism. Notice that this is true regardless of what specific method we end up using (who decides the new IPFS hash? etc). I do think the store synchronization protocol should help with this, but I think it'd be useful to have a closer look at this requirement by looking at what the Community spec looks like.

To provide a bit more detail regarding some of the points discussed above:

  • The only ways in which the message history can be mutated are via the 'edit message' and 'delete message' functions. The majority of message editing and deleting takes place minutes after a message has been posted, with almost all edits taking place within 3 days of a message being posted (true on any group messaging platform). So we were thinking of only allowing users to delete community messages that are less than 6 days old (note this doesn't apply to 1on1 and ad-hoc group chat messages). So all community message history that is 7 or more days old can be considered as immutable static content.

  • You quite rightly point out that "To have a canonical immutable representation requires either a source of truth and/or some consensus mechanism." Luckily in the case of the Communities functionality, I think we have a simple straightforward option for producing canonical immutable representations of the history of any community. I think this method should be absolutely fine for MVP launch, and then we can always improve upon this simple behaviour in the future. With Communities, there is already a requirement for the owner of each community to keep a Status client online so it can perform various management tasks. So for the MVP of this functionality, the 'community owner's node' could also be responsible for packaging history that is 7 or more days old into binary blocks to be served by a p2p file sharing mechanism. Only binary blocks signed as produced by the 'community owner's node' would be trusted and distributed. If the "community owner's node" gets taken offline for any reason, the community owner can start another node somewhere else. If the owner's node goes offline and the community owner never starts a replacement "community owner's node" for that community, then the community is dead anyhow for other reasons.

@arnetheduck

The only ways in which the message history can be mutated are via the 'edit message' and 'delete message' functions. The majority of message editing and deleting takes place minutes after a message has been posted, with almost all edits taking place within 3 days of a message being posted (true on any group messaging platform). So we were thinking of only allowing users to delete community messages that are less than 6 days old (note this doesn't apply to 1on1 and ad-hoc group chat messages). So all community message history that is 7 or more days old can be considered as immutable static content.

Just to be clear, neither of these functions are really "implementable" except as a smoke screen that hides the original message and the ui would do well to .. explain this to users - ie once you go into immutable datastores territory, you can't really remove things (an edit is a delete + add) - in general "N days" considerations are not really in scope, it's more of a paint job on top later unless we're building a centralized solution where history access is gated / permissioned in which case one might be able to get as far as constraining new members access to content.

@John-44
Author

John-44 commented Jul 16, 2021

Key however is that every feature has a cost - being decentralised means amplifying that cost, usually - the first step must thus be to precisely define what the absolute minimum requirements are that will keep users happy, and whether there exist loopholes that can be used to emulate otherwise critical functionality - to name a trivial example, latency in text communication is often mentioned as a hard requirement, but if you merely show a progress indicator, most users are happy to wait, because they know "something" is happening, and the latency problem becomes an optimization point rather than an up-front requirement.

1000% agree that every feature has a cost and that being decentralised usually means amplifying that cost! With this in mind, here is a very rough sketch of a possible solution that I think could deliver on the requirements stated at the top of this thread and allow the implementation of the designs linked above with the least amount of effort:

  1. Each community's 'owner node' (every community will have one of these) periodically packages chat history older than 7 days into binary chunks and signs these with the community's private key.

  2. These binary chunks are distributed between members of the community using an existing p2p file sharing protocol like IPFS or torrent. The interface between Status desktop and whatever file sharing protocol is used is abstracted, so that we can replace whatever file sharing protocol we initially use with Dagger when Dagger is ready.

  3. Status desktop clients unpack the binary chunks of community history that they download and use this information to re-populate their message databases.

I think this approach can work for Status communities, because with Status communities we have an actor who can produce the canonical history for distribution (the community owner), and a bunch of other actors in community members and community admins who are motivated to download this history (so they can access it themselves), store this history (so they can continue to access it themselves) and serve this history (because they have an interest in their community being successful).
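The three steps above can be sketched roughly as follows in Python. This is a sketch under stated assumptions, not a proposed wire format: messages are plain dicts serialized as canonical JSON, and the signing scheme is injected as a pair of callables. The `demo_sign`/`demo_verify` helpers use HMAC-SHA256 purely as a stand-in so the example runs with the standard library; a real owner node would use a public-key signature made with the community's private key, so members can verify without holding the secret.

```python
import hashlib
import hmac
import json

SEVEN_DAYS = 7 * 24 * 3600


def package_archive(messages, now, sign):
    """Step 1: the owner node packs messages older than 7 days and signs them."""
    old = sorted(
        (m for m in messages if now - m["timestamp"] >= SEVEN_DAYS),
        key=lambda m: (m["timestamp"], m["id"]),
    )
    # Canonical serialization so every packager produces identical bytes.
    blob = json.dumps(old, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "blob": blob,
        "digest": hashlib.sha256(blob).hexdigest(),
        "signature": sign(blob),
    }


def unpack_archive(archive, verify):
    """Step 3: a member verifies the owner's signature before importing."""
    if not verify(archive["blob"], archive["signature"]):
        raise ValueError("archive is not signed by the community owner")
    return json.loads(archive["blob"])


# Stand-in signing scheme (HMAC with a demo key); NOT a real signature.
_DEMO_KEY = b"demo-key"


def demo_sign(blob):
    return hmac.new(_DEMO_KEY, blob, hashlib.sha256).digest()


def demo_verify(blob, signature):
    return hmac.compare_digest(demo_sign(blob), signature)
```

Step 2, the distribution of the signed blob, is deliberately left out: the archive is just bytes plus a signature, so it could be handed to whichever file sharing protocol is chosen.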

@John-44
Author

John-44 commented Jul 16, 2021

The only ways in which the message history can be mutated are via the 'edit message' and 'delete message' functions. The majority of message editing and deleting takes place minutes after a message has been posted, with almost all edits taking place within 3 days of a message being posted (true on any group messaging platform). So we were thinking of only allowing users to delete community messages that are less than 6 days old (note this doesn't apply to 1on1 and ad-hoc group chat messages). So all community message history that is 7 or more days old can be considered as immutable static content.

Just to be clear, neither of these functions is really "implementable" except as a smoke screen that hides the original message, and the UI would do well to explain this to users. I.e. once you go into immutable-datastore territory, you can't really remove things (an edit is a delete + add). In general, "N days" considerations are not really in scope; they are more of a paint job on top later, unless we're building a centralized solution where history access is gated / permissioned, in which case one might be able to get as far as constraining new members' access to content.

Right, yes fully agreed the current edit and delete functionality is just a 'smoke screen/paint job' as you described it. Still useful, but not a function that can be relied on for privacy.

@iurimatias

iurimatias commented Jul 19, 2021

hey @arnetheduck, I was thinking something along similar lines, i.e. to implement this as a torrent client of sorts, with the altruistic users functioning as seeders and the users requesting old history functioning as leechers. A specific community log time period (let's say each 24 hours) could be represented through a magnet link (or something similar) that is determined based on the community id and the time period (e.g. 2015-01-10).
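
A rough sketch of the "determined based on the community id and the time period" part, in Python. Note the assumptions: `archive_key` and `period_for` are hypothetical names, and the result is only a deterministic lookup key, not a real BitTorrent infohash (a real magnet link is derived from the content itself, so some lookup from this key to the actual content hash would still be needed).

```python
import hashlib
from datetime import date, datetime, timezone


def period_for(timestamp):
    """Map a Unix timestamp to its 24-hour log period (a UTC calendar day)."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date()


def archive_key(community_id, period):
    """Derive a deterministic key from the community id and the time period.

    ASSUMPTION: SHA-256 over "<community id>|<YYYY-MM-DD>"; any client that
    knows both values computes the same key without talking to anyone.
    """
    material = f"{community_id}|{period.isoformat()}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

For example, `archive_key("my-community", date(2015, 1, 10))` gives the same 64-character hex string on every client, which is what lets a new member ask the network for a given day without first obtaining a link from someone.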

An issue is that the final file of all messages for that period needs to be the same for everyone, and this implies some ordering and determinism. A possible way to do this is to modify the Status application-level protocol to include a hash value that points to the previous (known) message: a client checks what the last message received in that channel is, then includes the id of that previous message and a hash of its own message together with the previous message id.
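
Roughly what the hash-linking could look like, as a Python sketch. The field names (`prev`, `sender`, `text`, `id`), the all-zero `GENESIS` marker, and the choice of SHA-256 are assumptions for illustration, not the actual Status protocol.

```python
import hashlib

GENESIS = "0" * 64  # ASSUMPTION: marker used as "previous id" for the first message


def message_id(prev_id, sender, text):
    """Hash of the message together with the id of the previous message."""
    h = hashlib.sha256()
    for part in (prev_id, sender, text):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))  # length prefix avoids ambiguity
        h.update(data)
    return h.hexdigest()


def append_message(log, sender, text):
    """Link a new message to the last message this client has seen."""
    prev_id = log[-1]["id"] if log else GENESIS
    msg = {
        "prev": prev_id,
        "sender": sender,
        "text": text,
        "id": message_id(prev_id, sender, text),
    }
    log.append(msg)
    return msg


def verify_log(log):
    """Return True if every message points at, and hashes over, its predecessor."""
    prev_id = GENESIS
    for msg in log:
        expected = message_id(msg["prev"], msg["sender"], msg["text"])
        if msg["prev"] != prev_id or msg["id"] != expected:
            return False
        prev_id = msg["id"]
    return True
```

Because each id covers the previous id, changing or reordering any earlier message changes every later id, so two clients holding the same final id hold the same history for that channel.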

Depending on how much concurrent activity there is, this could introduce some 'forks' or orphan messages. To handle this we could use something like Ethereum's Greedy Heaviest Observed Subtree (GHOST) or the Inclusive protocol.
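
As a very simplified illustration of a GHOST-style rule applied to messages (a sketch only): starting from the first message, at every fork follow the reply whose subtree contains the most messages. The input shape (a dict of message id to previous message id) and the tie-break on the smaller id are assumptions made here so that every client picks the same chain.

```python
from collections import defaultdict


def canonical_chain(prev_of, root_id):
    """Pick one chain through a forked message tree, GHOST-style.

    prev_of maps message id -> id of the message it points to (None for root).
    """
    children = defaultdict(list)
    for msg_id, prev_id in prev_of.items():
        children[prev_id].append(msg_id)

    # Visit parents before children, then compute weights in reverse order.
    order = []
    stack = [root_id]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    weight = {}  # number of messages in the subtree rooted at each message
    for node in reversed(order):
        weight[node] = 1 + sum(weight[child] for child in children[node])

    # Greedy descent: heaviest subtree wins, smaller id breaks ties.
    chain = [root_id]
    node = root_id
    while children[node]:
        node = min(children[node], key=lambda c: (-weight[c], c))
        chain.append(node)
    return chain
```

Messages that end up off the chosen chain are the 'orphans'; an Inclusive-style approach would still merge them into the final file in some deterministic order rather than drop them, which this sketch does not attempt.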

Thoughts?

@John-44
Author

John-44 commented Jul 22, 2021

One other consideration for any potential solution to take into account: currently images posted in chats are hosted on IPFS, but we don't want to pin all images that are posted in all communities to IPFS forever, so IPFS is a bad long-term storage solution for images posted in community chats. So, if possible, it would be good if the community history service included the images posted to community channels in the history blocks that will be stored and distributed via a p2p file sharing protocol.

@oskarth
Contributor

oskarth commented Jul 28, 2021

This conversation has gotten quite long and touches many different concerns at the same time. @staheri14 did a great breakdown here vacp2p/research#82

You can see all issues here https://github.com/vacp2p/research/milestone/8

I suggest we continue the discussion in the most specific issue possible. For example, how to optionally offload archival data can be discussed here: vacp2p/research#78, etc.

If some concern is missing, this can be covered in vacp2p/research#82

@jimstir
Contributor

jimstir commented Jun 13, 2024

Issue moved here

@jimstir closed this as not planned on Jun 13, 2024