[SVCS-548] Chunked Uploads for CloudFiles #289
base: develop
Conversation
…erbutler into cloudfiles * 'develop' of https://github.com/CenterForOpenScience/waterbutler: Add PR template.
Tried to mostly keep the review to things related to the chunked-upload commit. I think there were maybe one or two style things I commented on that may have been from your original CloudFiles ticket.
Also, maybe add QA notes to the JIRA ticket, so that whoever is testing knows how to trigger the chunked uploads.
waterbutler/core/provider.py
Outdated
@@ -323,7 +323,6 @@ def request(self, *args, **kwargs):
        """
        assert src_path.is_dir, 'src_path must be a directory'
        assert asyncio.iscoroutinefunction(func), 'func must be a coroutine'

No need to delete this blank line
@pytest.mark.asyncio
@pytest.mark.aiohttpretty
async def test_chunked_upload_segment_size(self,
                                           connected_provider,
What is this test testing over the one above it?
aiohttpretty.register_json_uri('GET', revision_url, body=revision_list)


metadata_url = connected_provider.build_url(path.path)
aiohttpretty.register_uri('HEAD', metadata_url, status=200, headers=file_header_metadata)
2 blank lines
""" | ||
if stream.size > self.SEGMENT_SIZE: |
Question: Does cloudfiles need a call to handle_naming?
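For reference, the dispatch in upload presumably looks something like this (a sketch only; the truncated diff shows just the size check):

async def upload(self, stream, path, **kwargs):
    # Anything larger than one segment goes through the chunked path;
    # smaller files take the ordinary single-request upload.
    if stream.size > self.SEGMENT_SIZE:
        return await self.chunked_upload(stream, path, **kwargs)
    ...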
async def delete(self, path, **kwargs):
async def chunked_upload(self, stream, path, check_created=True, fetch_metadata=True):

    created = not (await self.exists(path)) if check_created else None
In upload, you have it as:
if check_created:
    created = not (await self.exists(path))
else:
    created = None
which is much more readable. Maybe go back to that version instead of the more confusing one-liner?
Readability is pretty subjective; if you recognize this as a ternary expression, it's a pretty simple statement. You can read more about ternary operators here: http://book.pythontips.com/en/latest/ternary_operators.html
Hm, I might agree with Addison here. If it were a simpler ternary I would agree, but self.exists(path) is a little opaque, and as soon as the await is added it gets a little unclear.
created
is a value that isn't used until after the requests are made. If the requests fail, this result is thrown away. It also means the upload is delayed until this result is awaited. This should go after the requests.
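Note that literally moving the exists() call after the PUTs would change what created means (the object will exist by then), so here is a sketch of one way to avoid the delay while keeping the pre-upload semantics, with hypothetical helper names standing in for the segment and manifest requests:

import asyncio

async def chunked_upload(self, stream, path, check_created=True, fetch_metadata=True):
    # Start the existence check concurrently so it doesn't delay the
    # upload; it still observes the state from before the manifest lands.
    exists_task = asyncio.ensure_future(self.exists(path)) if check_created else None

    await self._upload_segments(stream, path)   # hypothetical helper
    await self._upload_manifest(path)           # hypothetical helper

    # Only await the result once the requests have succeeded, so a failed
    # upload never blocks on (or throws away) an awaited exists() result.
    created = not (await exists_task) if check_created else None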
QA Notes
Other than that, ready for the next phase.
I'd like to review this; I'll try later this week.
created = not (await self.exists(path)) if check_created else None

for i, _ in enumerate(range(0, stream.size, self.SEGMENT_SIZE)):
All of these happen sequentially; it would be better for them to happen in parallel.
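A minimal sketch of one way to do that with asyncio.gather, assuming the provider's make_request helper and an illustrative zero-padded segment-naming scheme (the stream reads still have to happen in order, but the PUTs can all be in flight together):

import asyncio

segment_requests = []
for i, _ in enumerate(range(0, stream.size, self.SEGMENT_SIZE)):
    data = await stream.read(self.SEGMENT_SIZE)  # reads stay sequential
    segment_requests.append(self.make_request(
        'PUT',
        self.build_url(path.path + '/' + str(i).zfill(5)),  # assumed segment name
        data=data,
        headers={'Content-Length': str(len(data))},
        expects=(200, 201),
        throws=exceptions.UploadError,
    ))

# All segment PUTs are now in flight together; wait for every one.
for resp in await asyncio.gather(*segment_requests):
    await resp.release()

Note this buffers every segment in memory at once; a semaphore-bounded variant would cap that.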
for i, _ in enumerate(range(0, stream.size, self.SEGMENT_SIZE)):
    data = await stream.read(self.SEGMENT_SIZE)
    resp = await self.make_request(
This needs a comment explaining what this request does in the upload.
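Something along these lines, assuming this request PUTs the i-th chunk as its own segment object (segment_url is illustrative):

# Upload one segment of the file.  Cloud Files stores each chunk as a
# separate object (e.g. '<name>/00000', '<name>/00001', ...); the
# manifest written after the loop stitches them back into one file.
resp = await self.make_request(
    'PUT',
    segment_url,  # assumed: URL of the i-th segment object
    data=data,
    headers={'Content-Length': str(len(data))},
    expects=(200, 201),
    throws=exceptions.UploadError,
)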
)
await resp.release()

resp = await self.make_request(
Same here: this needs a comment explaining what this request does in the upload.
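Likely something like the following, assuming this is the Swift-style dynamic-large-object manifest (the container and prefix layout are illustrative):

# Create the manifest object.  A zero-byte PUT carrying an
# 'X-Object-Manifest' header tells Cloud Files / Swift to serve the
# concatenation of every object under that prefix as this one file.
resp = await self.make_request(
    'PUT',
    self.build_url(path.path),
    headers={
        'Content-Length': '0',
        'X-Object-Manifest': '{}/{}/'.format(self.container, path.path),
    },
    expects=(200, 201),
    throws=exceptions.UploadError,
)
await resp.release()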
if data.get('subdir'):
    return CloudFilesFolderMetadata(data)
elif data['content_type'] == 'application/directory':
    return CloudFilesFolderMetadata({'subdir': data['name'] + '/'})
return CloudFilesFileMetadata(data)

@ensure_connection
async def create_folder(self, path, **kwargs):
Does this have anything to do with multipart upload, or is there more than one ticket here?
Note (Added by Longze)
This PR contains code from: #283.
Ticket
https://openscience.atlassian.net/browse/SVCS-548
Purpose
This PR allows one to upload files larger than 5 GB.
Changes
Adds a new chunked-upload method to the Cloud Files provider, with tests, and raises the max body limit to an arbitrarily high 1 terabyte.
Side effects
None that I know of.
QA Notes
It will take a long time to upload a > 5 GB file; have something else to do while you wait.
Deployment Notes
None that I know of.