Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarified scope of lock_id in InitiateFileUploadRequest #226

Merged
merged 2 commits into from
Feb 16, 2024

Conversation

glpatcern
Copy link
Member

@glpatcern glpatcern commented Feb 14, 2024

The rationale of this PR is that an InitiateFileUpload request shall only provide an endpoint, for the client to perform the upload as a separate step. Therefore, any lock validation MUST be performed again at the actual upload time, when the lock_id is to be passed again, and as such there's no need to include it in the payload of InitiateFileUplod.

How is Reva/edge dealing with atomically ensuring the lock is valid currently? @wkloucek and/or @micbar can you shed some light? My only guess is that you'd validate the lock again during the TUS upload. In that case, we may just shortcut.

For Reva/master, this is in the makings (at relatively low priority), that's why I realized only now about this.

@glpatcern glpatcern changed the title Removed lock_id from InitiateFileUploadRequest Remove lock_id from InitiateFileUploadRequest Feb 14, 2024
glpatcern added a commit to glpatcern/reva that referenced this pull request Feb 14, 2024
@micbar
Copy link
Member

micbar commented Feb 14, 2024

@butonic @dragotin fyi

glpatcern added a commit to glpatcern/reva that referenced this pull request Feb 14, 2024
@glpatcern
Copy link
Member Author

From @micbar:

Regarding the lock checking: it should be done during initiateFileUpload and on any later step too.

@glpatcern
Copy link
Member Author

glpatcern commented Feb 15, 2024

Fair enough, after all lock_id is optional and can stay, but the issue remains: the ultimate place where to check is the actual upload.

So possibly a question to @butonic : how is Reva edge validating the lock context on PUT? Which headers do you use/enforce on the clients for them to pass the lock_id? Can we agree on them so to properly implement cs3org/wopiserver#139 ?

@glpatcern glpatcern changed the title Remove lock_id from InitiateFileUploadRequest Clarified scope of lock_id in InitiateFileUploadRequest Feb 15, 2024
glpatcern added a commit to glpatcern/reva that referenced this pull request Feb 15, 2024
@glpatcern
Copy link
Member Author

I'm merging this as it's only a clarification in the comments, the semantic does not change.

The details how to enforce locking by the data transfer protocol returned in response are yet to be followed up in cs3org/wopiserver#139

@glpatcern glpatcern merged commit 6b01495 into cs3org:main Feb 16, 2024
2 checks passed
@butonic
Copy link
Contributor

butonic commented Feb 21, 2024

So possibly a question to @butonic : how is Reva edge validating the lock context on PUT? Which headers do you use/enforce on the clients for them to pass the lock_id? Can we agree on them so to properly implement cs3org/wopiserver#139 ?

There are no direct PUT uploads to the dataprovider. Uploads MUST start with a call to InitiateFIleUpload. That will start an UploadSession, which also carries the Lock. When the UploadSession is finished, and the postprocessing outcome allows the upload to be registered as a new file revision (or as a new file) the lock will be checked before aquiring the low level file lock on the file metadata.

Upload sessions can always be written to, regardless of the protocol used to transfer the bytes. edge uses asynchronous postprocessing, so we check the lock twice: once in ̀InitiateUpload and once before creating the revision after postprocessing. Since all upload protocols only write to a temporary location, they all check the lock twice, because the lock is not relevant for the byte transfer. Only at the beginning (to prevent unnecessary uploads) and at the end for the actual check.

@glpatcern
Copy link
Member Author

Thanks @butonic for these insights. Granted that the dataprovider does not serve direct uploads, we do exactly the same (and we also have an atomic move after uploading to a temp location, though this is an implementation detail). Yet that does not change the question: if I understand correctly, your UploadSession must actually be persisted, along with the InitiateFileUpload.lock_id metadata, such that the lock check in your postprocessing hook can be done - which is the check that ultimately rules, the initial one being an optimization to avoid unnecessary uploads, as you said. Is that the case?

What I implied with my proposal was to "leave the burden" of keeping the state (the lock metadata) on the client side, and enforce that clients pass it on the actual upload (see how I implemented that in the cs3org/wopiserver#139 PR), for the dataprovider to dispatch it to e.g. the UploadSession. Does that make sense?

@butonic
Copy link
Contributor

butonic commented Feb 22, 2024

Yes that is the case. UploadSessions are persisted as a separate json file to keep track of the parent id, the name, the nodeid, the lock, any metadata assigned with the upload, the current offfset and full file size ... regarding the implementation detail we use googles renameio package for the atomicity.

I like the idea of using a JWT as the upload id! Unfortunately, the upload session also has to keep track of the postprocessing state. When all bytes have been transferred we start ASYNC postprocessing and may receive a virus detected event or a delete event. Also if postprocessing is interrupted we need to be able to pick up and restart unfinished uploads. We need to keep track of that upload state. Furthermore, I want to be able to show an incoming upload in the web ui. This requires keeping track of state on the server side.

@glpatcern
Copy link
Member Author

Unfortunately, the upload session also has to keep track of the postprocessing state. When all bytes have been transferred we start ASYNC postprocessing and may receive a virus detected event or a delete event. Also if postprocessing is interrupted we need to be able to pick up and restart unfinished uploads. We need to keep track of that upload state. Furthermore, I want to be able to show an incoming upload in the web ui. This requires keeping track of state on the server side.

Understood, it appears you have much more state to be kept server-side than we do with Reva master, where actually some events (e.g. a concurrent delete) are handled by the storage, and we don't store anything else within Reva.

Anyway, decorating the data transfer with the lock context via dedicated HTTP headers would not cause any harm - you're relying on the URL returned by the dataprovider to be unguessable, like we do, and you're not going to read that additional payload for enforcing the lock at the end of the transfer, but for the protocol specification I still see it makes sense to not impose a stateful implementation.

glpatcern added a commit to glpatcern/reva that referenced this pull request Feb 27, 2024
glpatcern added a commit to glpatcern/reva that referenced this pull request Mar 7, 2024
glpatcern added a commit to glpatcern/reva that referenced this pull request Mar 29, 2024
glpatcern added a commit to glpatcern/reva that referenced this pull request Apr 8, 2024
glpatcern added a commit to glpatcern/reva that referenced this pull request Apr 15, 2024
glpatcern added a commit to glpatcern/reva that referenced this pull request Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants