Retrievability of Open data stored through Fil+ #883

dkkapur · 2023-05-12T16:59:46Z

dkkapur
May 12, 2023
Maintainer

Open data stored in verified deals comes with the claim that the data should be retrievable. In theory, this is driven both by the DataCap applicants' own demands and claims (see the application template, where they specify whether the data is open and retrievable by anyone) and by the value proposition of open data to begin with. Historically, however, Fil+ has neither checked or validated this nor had a declared retrievability metric or policy. As part of the push in the Quality Phase, this is a good time to kick-start a conversation on the challenges, potential pathways, and tools that we can implement to better assess whether onboarded open data is actually retrievable.

IMO (will keep this updated as we continue having the conversation), what we need to do is:

align on the definition of retrievability. This includes: (i) SPs being set up correctly to meet both client needs and Fil+ program policies/standards, (ii) defining whether retrievals happen over Graphsync (the current default in Filecoin) or something else (HTTP is likely a much better option here), and (iii) defining a path for when clients should react to the data presented and what that means for data replication in the network + SP selection over time
align on the policies around it: do we want to set % success rates or something else per DataCap application? How do we help both notaries and clients get a good view into this over time?
align on a plan for getting there: tools and processes that we can rely on and iterate on over time
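
To make the "% success rates per DataCap application" idea concrete, here is a minimal sketch in Python. It assumes retrieval attempts have already been collected as (application id, succeeded) pairs; the function names, the record shape, and the threshold value are all hypothetical and are not part of any existing Fil+ tooling or policy.

```python
from collections import defaultdict


def success_rates(attempts):
    """Return {application_id: fraction of successful retrievals}.

    `attempts` is an iterable of (application_id, succeeded) pairs.
    This record shape is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for app_id, succeeded in attempts:
        totals[app_id] += 1
        if succeeded:
            successes[app_id] += 1
    return {app_id: successes[app_id] / totals[app_id] for app_id in totals}


def below_threshold(rates, threshold):
    """List the applications whose success rate is under `threshold`.

    The threshold is a hypothetical policy knob, not an agreed Fil+ value.
    """
    return sorted(app_id for app_id, rate in rates.items() if rate < threshold)
```

A notary-facing view could then be as simple as calling `below_threshold(success_rates(attempts), threshold)` with whatever threshold the community settles on.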

For those who don't believe that we should be doing this, here's why I think we have to: at a baseline, this is a relatively good test of whether clients are actually onboarding data with SPs that meet their claimed needs. If they are not, then this should be a low-false-positive path to identifying potential abuse, or at least a good way to encourage clients to consider SPs that can actually serve their needs over time. As a reminder, here's the definition of quality data today (see https://filplus.storage/ for more details, scope, roadmap, goals):

Quality data is all content that meets local regulatory requirements and:

the data owner wants to see on the network, including private/encrypted data

or is open and retrievable

or demonstrates proof of concept or utility of the network, such as efforts to improve onboarding

stuberman · 2023-05-12T22:10:46Z

stuberman
May 12, 2023

In my opinion, open data should be retrievable. As an SP, there needs to be some SLA about retrievability which includes the maximum amount of data that a system is required to support in GiB/hour in order to not overwhelm each system. One problem we see is that our percentage of polled retrievals drops when too much data is polled or retrieved. That needs to be taken into account. Now if a system like Saturn reduces the frequency of data retrieval then that problem may be mitigated. The second critical aspect of retrievals is whether Fast Retrievals (unsealed copies) are required. That should also be part of the SLA.

0 replies

FILDCKabat · 2023-05-15T10:56:05Z

FILDCKabat
May 15, 2023

In my opinion, retrieval should be possible; that is also the main difference from the EFIL+ program. But performance shouldn't be critical, as this is a free-of-charge program.

0 replies

herrehesse · 2023-05-16T08:59:03Z

herrehesse
May 16, 2023

Circling back to the answer I already gave: #880 (comment)

0 replies

MegaFil · 2023-05-16T09:20:18Z

MegaFil
May 16, 2023

High-quality datasets must be the powerful engines for future Filecoin growth, but any attempt to define high-quality data concretely will add friction and internal conflict to the project, which we really wouldn't want to see. Mechanically asking SPs with DataCap to turn on the retrieval function isn't the final solution. Using incentives to make SPs willingly become retrieval servers reduces internal conflict and friction rather than adding controversy.

0 replies

xmcai2016 · 2023-05-16T18:30:39Z

xmcai2016
May 16, 2023
Collaborator

Great discussion topic. I would like to push from an angle that data clients hold the SPs accountable for serving retrievals. An abundant number of datacap applications indicated that the clients wanted their data retrievable. I am aligned with many of the opinions expressed above, and drafted a proposal for Fil+ Retrieval Guidelines & Requirements for Data Clients below. Why HTTP over Graphsync or Bitswap is a separate topic but the TLDR is that it's the shortest path to enable retrievals on Filecoin today due to its ability to retrieve from PieceCIDs.

Fil+ Retrieval Guidelines & Requirements for Data Clients:

The data clients participating in the Fil+ program commit to ensuring the retrievability of open datasets through HTTP from their SPs.
Fil+ data clients are advised to meticulously choose Storage Providers that align with their specific data retrieval requirements.
Fil+ clients can enhance their reputation by diligently holding their Storage Providers accountable for ensuring the accessibility of data. This facilitates acquisition of additional datacap from Notaries in the future.
Fil+ Notaries allocate DataCap to clients who actively engage in data retrieval and hold SPs accountable.
Multiple Storage Providers can share a single unsealed copy of data with the same CID. This practice is deemed acceptable as it optimizes time and resource utilization.
Data clients and SPs should be aware of the risk of network overload or attacks, and mitigate these risks by employing rate limiting tools to set a maximum number of requests per second allowed.
Data clients and SPs should agree on a throttling limit that determines the maximum bandwidth a single retrieving client can consume at any given time. The SP should implement this limit in their tooling to protect themselves.
Data clients should carefully select SPs that offer access control mechanisms for datasets. While Fil+ open datasets are intended for public retrieval and viewing, clients should prioritize SPs that provide the ability to manage access and refuse data to suspicious users.
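
As an illustration of the rate limiting mentioned in the guidelines above, here is a minimal token-bucket sketch in Python. It is only one of several possible techniques and is not taken from any SP tooling; the class name, its parameters, and the injectable clock are assumptions made for this example. The same bucket could be used for a requests-per-second cap (cost of 1 per request) or for a bandwidth cap (cost equal to the bytes about to be sent).

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter for retrieval requests.

    `rate` is the number of tokens refilled per second and `capacity`
    is the largest burst allowed. Both values would be agreed between
    the data client and the SP; nothing here is a Fil+ default.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the burst capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False
```

An SP-side HTTP handler would call `allow()` once per incoming request and reject or delay the request when it returns `False`.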

2 replies

Aaronn85 May 17, 2023

No offense, but this is absurd. Neither the notary nor the client can guarantee that an SP will always be in a retrievable state during its lifetime unless they themselves are SPs.

herrehesse May 17, 2023

While we can't provide an absolute guarantee, the role of notaries and clients in this scenario is critical. Their responsibility is to ensure that the storage providers, who are predominantly "they themselves" or a "closely associated group/entity", are making their utmost efforts to maintain availability and retrievability of verified deals. If this standard isn't met, all parties involved should be held accountable.

Without this, or at least some level of accountability, we cannot effectively curtail individuals who behave abusively, and abuse becomes uncontrollable.

Carohere · 2023-05-17T07:12:25Z

Carohere
May 17, 2023

I have received some feedback from Korean SPs. Due to the language barrier, I'm sharing here on their behalf.

"Retrieval takes up bandwidth. Uploading CAR files will almost fill my bandwidth, which could affect my WindowPoSt submission and WinningPoSt, i.e. block rewards. As a small SP, this is not good for my participation."

"In the long run, retrieval should be available to everyone but at a cost. Whoever wants the data must pay for the retrieval to reduce the risk that miners may face."

1 reply

herrehesse May 17, 2023

Should a specific storage provider fail to manage retrieval requests, for whatever reason, they simply should not be granted verified deals. This is non-negotiable.

In the broader perspective, I wholeheartedly agree that our goal should be transitioning towards a retrieval system that is more manageable and incorporates a payment mechanism.

TakiChain · 2023-05-17T08:38:46Z

TakiChain
May 17, 2023

We have noticed a lot of challenges and controversies about retrieval success in the community. Frankly, I think the investment needed to verify the retrieval success rate of nodes in the early phase of Filecoin far outweighs the benefits. Moreover, this introduces huge internal friction because network conditions differ so much from place to place. Retrieving from the same node may produce different results that retrievers will have difficulty agreeing on. This will keep increasing community friction and drive more followers away from the project.

1 reply

herrehesse May 17, 2023

I find myself in agreement with the majority of your points, except for one. In my candid view, the "benefits" greatly surpass the investments made in enabling retrieval capabilities. The tenfold multiplier serves as a potent method to scale an operation, making it increasingly efficient and profitable. In return, the least a storage provider MUST do is to safely store valuable data and keep an unsealed copy for quick retrieval.

In our own practice, we retain more than 10PiB of unsealed copies, as that is currently the most appropriate course of action. Until CDNs and retrieval methods are developed further, this is the only proper method.

lilisy90 · 2023-05-17T10:22:24Z

lilisy90
May 17, 2023

Supporting retrieval puts a lot of pressure on both clients and SPs, whether in terms of the cost of technology or of communication.

1 reply

herrehesse May 17, 2023

Not having support for quick retrieval really puts the community in a tough spot when it comes to checking on the right behaviour of clients, notaries, and storage providers. It's simply impossible. Keeping an unsealed copy of the data isn't just a choice; it's a responsibility for which SPs are greatly compensated with a tenfold sector multiplier.

BobbyChoii · 2023-05-17T10:55:47Z

BobbyChoii
May 17, 2023

I would like to draw your attention to the fact that over 1EiB validation data is currently stored offline (not online) into the Filecoin Network by default. Why would we have to expect data retrieval and downloading to be done online?

Until real retrieval network nodes are built, any online retrieval directly from the client to the Storage Network is only a technical experiment, and it will definitely pose risks to the stability of Storage Network nodes. Treating it as a required review condition for Fil+ is obviously not appropriate.

I suggest that this rule be changed as soon as possible.

1 reply

herrehesse May 17, 2023

Could you clarify what you mean by "1EiB validation data"? As per the rules of Filecoin+, storage providers should be mandated to maintain an unsealed sector for retrieval until a suitable CDN network is ready to take over this specific task. If we don't do this, the risk is that the community becomes unable to spot abusive conduct within the Filecoin+ ecosystem. The importance of retrievals cannot be overstated in terms of verifying client authenticity and detecting abusive behaviour within FIL+. This is a hard line.

I agree that what we need is an effective retrieval method that doesn't overburden a storage provider's network or their ability to execute PoSt. This point is crucial. However, it doesn't detract from the fact that a storage provider needs to be accessible and hold an unsealed copy.

herrehesse · 2023-05-17T10:59:45Z

herrehesse
May 17, 2023

@BobbyChoii @TakiChain Both of you are in public dispute, on GitHub and Slack. Why aren't you responsive on either platform but your opinions are voiced here?

2 replies

TakiChain May 18, 2023

We commented on applications days ago. We do not accept any kind of slander. Your dispute about our LDN signature is due to the retrievability of the data. There is no better place to share our views than here.

TakiChain May 18, 2023

As a newly joined notary, we at TakiChain take any potentially controversial behavior seriously and are willing to share our perspective. Our door is always open for discussions; feel free to reach out under the application.

Wengeding · 2023-05-17T12:32:52Z

Wengeding
May 17, 2023

The screenshot below should point us in the right direction. Currently, the Filecoin network's greatest value is to help clients store real and useful data, not to retrieve it online. It's enough that clients who have stored data can copy and recover it when needed.

We shouldn't give Filecoin an unbearable responsibility. If there are redundant barriers, we should remove them.

In my opinion, retrieval based on the Filecoin network is a great project. But it will most likely be implemented in Filecoin's layer 2 network, not at the present moment.

5 replies

herrehesse May 17, 2023

The multiplier provided by Filecoin+ places a significant strain on the entire community. Without a proper retrieval mechanism in place, we can't gauge its "usefulness" and risk shifting from a verifiable system to one based purely on trust. This outcome is simply unacceptable and it is bound to lead to widespread abuse, as is evidently unfolding at this very moment. As long as individuals rely on the multiplier for growth and revenue, they should comply with the rules, which involve keeping their sectors unsealed and retrievable.

If SPs choose not to comply and make verified deals retrievable, they should focus on regular deals or committed-capacity (CC) sectors, but refrain from engaging in verified deals involving DataCap.

nothingjustaminer Jul 26, 2023

This screenshot is disappointing and embarrassing to me and maybe to many members of this project.

Filecoin, which strives to be infrastructure in the web3 space, needs to be neutral. Filecoin is not supposed to decide whether a sector is useful or useless. No one, not even the founder, who represents the Filecoin community to some extent, should embed this arrogant bias into the Filecoin project. It's a huge mistake.

I strongly disagree with Juan's definition that Filecoin's mission is to onboard "useful" data.
This could be Fil+'s mission, but it cannot be Filecoin's mission. Be aware.

willscott Jul 26, 2023

Committed Capacity, where there is no data at all, is not "infrastructure in web3 space" - there's no infrastructure because there's nothing being stored. How does that help build a storage layer or anything of value?

Supporting committed capacity providers is also a strain on the community / network. We are rewarding actors who are not helping advance the sustainability story for a long term healthy network. The sustainability story for filecoin needs to be around the storage layer - no external party would want to pay you simply to prove that you have hard drives, but they would pay for you to store data.

nothingjustaminer Jul 26, 2023

My point is, the market will choose what is useful and what is not. Not me, not you, and certainly not Filecoin as an infrastructure.

herrehesse Jul 26, 2023

@nothingjustaminer The party responsible for payment is the one demonstrating commitment and assigning value to something. Currently, datacap represents a cost for every individual miner out there, they are the paying party. It is crucial to enforce the idea that worthless, empty, or fake data carries a cost for them, and as a result, these entities should not receive datacap.

willscott · 2023-05-17T15:57:37Z

willscott
May 17, 2023

Retrieval is something that in the long term filecoin needs to support. There is much more data in the world that is useful when it can be gotten back when needed.

I think one of the major stumbling blocks today is that we talk about "retrieval" as a binary thing, a yes-or-no.

Instead, I think we will have more success in thinking about retrieval as a sliding scale, because different data is going to need different amounts of retrieval, and the costs should be proportional to that.

I hope ultimately we can get some "tiers" of retrieval to think about, so e.g.

'archival' data needs to be retrievable within 1 day of request, with reserved throughput of 10 Mbps, at most once/month.
'warm' data needs to be retrievable with a latency of less than 5 seconds, with reserved throughput of 10 Mbps, with a 95% availability SLA.
'hot' data needs to be retrievable with a latency of less than 2 seconds, with reserved throughput of 100 Mbps, with a 99% availability SLA.

These reservations of bandwidth can make it clearer what the underlying infrastructure costs are going to be and can let SPs price deals to offset the costs of the retrieval burden they're taking on.
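
As a rough sketch of how such tiers could be written down and checked, the Python below encodes the three example tiers from this comment. The class, field, and function names are hypothetical, the latency bounds are treated as inclusive for simplicity, and the archival "at most once/month" request limit is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievalTier:
    name: str
    max_latency_s: float               # time allowed to start serving data
    reserved_mbps: int                 # reserved throughput
    min_availability: Optional[float]  # None where no availability SLA is stated


# Values copied from the example tiers above; they are illustrative, not policy.
TIERS = {
    "archival": RetrievalTier("archival", 24 * 60 * 60, 10, None),
    "warm": RetrievalTier("warm", 5, 10, 0.95),
    "hot": RetrievalTier("hot", 2, 100, 0.99),
}


def meets_tier(tier, latency_s, availability):
    """Check observed latency and availability against one tier."""
    if latency_s > tier.max_latency_s:
        return False
    if tier.min_availability is not None and availability < tier.min_availability:
        return False
    return True


def best_tier(latency_s, availability):
    """Return the most demanding tier the observations satisfy, or None."""
    for name in ("hot", "warm", "archival"):
        if meets_tier(TIERS[name], latency_s, availability):
            return name
    return None
```

For example, an SP observed at 3 seconds of latency and 96% availability would land in the 'warm' tier under these assumed numbers.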

0 replies

hannahhoward · 2023-05-17T23:02:16Z

hannahhoward
May 17, 2023

To augment @willscott 's point, I've noticed retrieval is perceived primarily as a burden on SPs.

But I want to point out that data that can be stored and retrieved, at any of the levels @willscott describes, is valuable to a much wider market of potential customers than data that is stored but impossible to get back.

Even offering 'archival' data that can be retrieved reliably with the parameters @willscott describes converts an SP's offering into a business that is objectively valuable to large enterprises needing crash recovery.

Getting to 'warm' retrieval opens the market of potential clients much further. It also makes an SP an appropriate "L3" backup for Saturn.

But it's probably also sufficient for many SPs to simply be archival storage, cause a lot of people need that.

Ultimately, I think what we want is alignment between a Fil+ client's needs and the service an SP is offering. And I think it may be worth distinguishing certain 'tranches' of SPs & clients between the tiers @willscott describes.

0 replies

brendalee · 2023-05-17T23:57:33Z

brendalee
May 17, 2023

An additional perspective - it is beneficial to the Filecoin network to have a working end-to-end data storage and retrieval flow (regardless of retrieval tier, as @willscott mentions), not just a pure storage solution with no retrieval. The different teams working on Filecoin have made significant improvements in the retrieval part, but if SPs choose not to serve retrievals, many data use cases on Filecoin will be considered incomplete.

Furthermore, although there are many retrieval related projects in progress (such as Saturn, Rhea, etc.) to help with shouldering the majority of retrieval requests on the network - these projects all are dependent on Filecoin SPs serving retrievals, as Filecoin is the layer where we have incentives/guarantees that data is stored (and ideally available to be retrieved).

0 replies

xmcai2016 · 2023-05-18T14:50:54Z

xmcai2016
May 18, 2023
Collaborator

I know the current macroeconomic environment is hard for all of us, and I am trying not to put more burden on anyone in our community. I want us to all succeed in the long term. Having an end-to-end solution to storage is imperative to Filecoin's long-term success.
Without retrievals we cannot generate organic storage demand on Filecoin. Echoing the above, we should have different tiers of retrieval SLAs, and SPs serving archival storage should not be penalized for the extra time they need to unseal etc. Slow but reliable retrievals for archival storage absolutely count as valid retrievals. The egress cost can be further reduced by dataset replicas sharing one unsealed copy / HTTP endpoint.
This is why I come to you all - please help me help you and all of us align on guidelines that put the minimal burden on SPs while not sacrificing the long-term success of the community. And I am not expecting us to all turn on retrievals overnight. What I wanted to push for is to align on a mutual direction to go toward as a community to ensure our future success.

3 replies

BobbyChoii May 18, 2023

Thanks for your warm input; I couldn't agree more. These are the toughest times for the network, and we need to give SPs serving archival storage more understanding and support!
Good to see that lots of our community members are already practicing, which is exactly what we are doing.

herrehesse May 18, 2023

@BobbyChoii I would suggest that you stop signing applications that are under investigation by another notary.

BobbyChoii May 25, 2023

@herrehesse Thanks for your reminder, but you need to gradually learn to respect the decisions of the community and other notaries. Every notary has their own independent judgment. We have to accept that if all the rules were easy to judge, there would be no need for the role of notary at all.

MegaFil · 2023-05-24T08:35:41Z

MegaFil
May 24, 2023

Retrievability is bound to be a highly controversial point: for the same SP, clients from different regions often get totally different results when they try to retrieve. Whose results should we take?

As far as I know, many notaries from China have experienced obstacles when using the public Internet to download data from other countries. According to the rules of Fil+, do we need to restrict non-Chinese SPs from participating in the Fil+ project?

4 replies

cryptowhizzard May 24, 2023

@HiFil We should indeed restrict SPs who are unable to serve retrievals, no matter their region.

willscott May 24, 2023

This is why there are a number of entities regularly attempting retrievals, and we describe retrievability as making data available at some percentage level of availability.

There are not so many notaries or measurements from China that an SP that cannot serve that specific region would be considered ineligible.
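
To illustrate the point with made-up numbers, here is a small Python sketch that pools retrieval attempts from several vantage points into a single availability figure. The function name, the input shape, and the region labels are hypothetical and do not describe how any existing retrieval bot reports results.

```python
def pooled_availability(results_by_region):
    """Pool (successes, attempts) counts from every region into one rate.

    `results_by_region` maps a region label to a (successes, attempts)
    tuple. Both the labels and the shape are assumptions for this sketch.
    """
    successes = sum(s for s, _ in results_by_region.values())
    attempts = sum(n for _, n in results_by_region.values())
    if attempts == 0:
        raise ValueError("no retrieval attempts recorded")
    return successes / attempts
```

With invented counts such as `{"eu": (48, 50), "us": (47, 50), "cn": (0, 5)}`, the pooled figure is 95 successes out of 105 attempts, so a region that contributes few measurements moves the overall percentage only slightly.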

Suyanj May 25, 2023

@willscott I don't think so.
Filecoin needs to be compatible with storage forms in all regions. We should not restrict European and American SPs from joining the Fil+ project, even though these SPs do not meet the retrieval requirements of Chinese notaries and clients; and of course, we should not restrict SPs from Chinese regions either, even though European and American notaries cannot retrieve from them effectively.

xmcai2016 May 26, 2023
Collaborator

SPs will not be penalized for serving retrievals slowly, as long as they serve retrievals. Serving retrievals, no matter the latency, confirms the quality of data stored and helps the community grow in the long term.
We have a network of Retrieval Bots testing for successful retrievals and can consider adding a few nodes in China as well.

xmcai2016 · 2023-07-20T15:50:26Z

xmcai2016
Jul 20, 2023
Collaborator

Thank you, Fil+ community, for the feedback & comments both on this thread and in our Slack threads. We have added the final version of the retrieval guidelines to the README at https://github.com/filecoin-project/filecoin-plus-large-datasets/tree/main under Retrieval Guidelines for Data Clients.

1 reply

herrehesse Jul 20, 2023

For the program to work, each stakeholder will need to play their parts in a truthful manner.

This is a problem.

Replies: 17 comments · 21 replies
