Make s3.request_timeout configurable #1568
Thanks @metadaddy for adding this, I left one comment regarding S3FS, apart from that it looks good to me 👍
pyiceberg/io/fsspec.py
Outdated
@@ -150,6 +151,9 @@ def _s3(properties: Properties) -> AbstractFileSystem:
    if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
        config_kwargs["connect_timeout"] = float(connect_timeout)

    if request_timeout := properties.get(S3_REQUEST_TIMEOUT):
        config_kwargs["request_timeout"] = float(request_timeout)
I think this should be `read_timeout`, looking at: https://github.com/fsspec/s3fs/blob/51e3c80ef380a82081a171de652e2b699753be2b/s3fs/core.py#L473-L479
@Fokko You're entirely correct - thanks for catching that! I'll make the correction and push the updated code.
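The correction agreed on here can be sketched as follows. This is a minimal, standalone illustration rather than the merged code: `build_s3fs_config_kwargs` is a hypothetical helper, and the property key strings are assumed from the `s3.connect-timeout` and `s3.request-timeout` names used in the documentation. The PyIceberg property keeps its name, while the keyword passed through s3fs to botocore's config becomes `read_timeout`.

```python
from typing import Dict

# Property keys assumed from the PyIceberg documentation table.
S3_CONNECT_TIMEOUT = "s3.connect-timeout"
S3_REQUEST_TIMEOUT = "s3.request-timeout"


def build_s3fs_config_kwargs(properties: Dict[str, str]) -> Dict[str, float]:
    """Hypothetical helper: translate PyIceberg properties into s3fs config_kwargs."""
    config_kwargs: Dict[str, float] = {}

    if connect_timeout := properties.get(S3_CONNECT_TIMEOUT):
        config_kwargs["connect_timeout"] = float(connect_timeout)

    # botocore's Config has no request_timeout option; the read-side
    # equivalent is read_timeout, so the property is mapped to that name.
    if request_timeout := properties.get(S3_REQUEST_TIMEOUT):
        config_kwargs["read_timeout"] = float(request_timeout)

    return config_kwargs
```

Properties that are absent simply produce no entry, so the library defaults stay in effect unless a timeout is set explicitly.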
@@ -394,6 +395,9 @@ def _initialize_oss_fs(self) -> FileSystem:
    if connect_timeout := self.properties.get(S3_CONNECT_TIMEOUT):
        client_kwargs["connect_timeout"] = float(connect_timeout)

    if request_timeout := self.properties.get(S3_REQUEST_TIMEOUT):
        client_kwargs["request_timeout"] = float(request_timeout)
    if request_timeout := self.properties.get(S3_REQUEST_TIMEOUT):
        client_kwargs["request_timeout"] = float(request_timeout)
@@ -116,6 +116,7 @@ For the FileIO there are several configuration options available:
| s3.region | us-west-2 | Configure the default region used to initialize an `S3FileSystem`. `PyArrowFileIO` attempts to automatically resolve the region for each S3 bucket, falling back to this value if resolution fails. |
| s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. |
| s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. |
| s3.request-timeout | 60.0 | Configure socket read timeouts on Windows and macOS, in seconds. |
I couldn't find a Java equivalent, so I'm fine with introducing this one 👍
I found `connect-timeout`, which I think is different from `request-timeout`: https://github.com/apache/iceberg-go/blob/4b645d698fffaa99c235f54bf33f4340a4414bc5/io/s3.go#L47-L53
1675f74 to 87fcad5
Hi @Fokko - I implemented and pushed your suggested correction. Thanks!
Looks like there's a lint issue, can you make
87fcad5 to 3d53f42
@kevinjqliu Ah - it wanted imports in alphabetical order - I'd just inserted
LGTM
Similarly to #218, we see occasional timeout errors when writing data to S3-compatible object storage:
[I don't believe the issue is specific to the fact that I'm using Backblaze B2 rather than Amazon S3 - I saw references to similar error messages with the latter as I was researching this issue.]
The issue happens when the underlying `PUT` operation takes longer than the request timeout, which is set to a default of 3 seconds in the AWS C++ SDK used by Arrow via PyArrow.

The changes in this PR allow configuration of `s3.request_timeout` when working directly or indirectly with `pyiceberg.io.pyarrow.PyArrowFileIO`, just as #218 allowed configuration of `s3.connect_timeout`.

For example, when creating a catalog:
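A minimal sketch of what such a catalog configuration could look like, and not necessarily the snippet the author used. The helper names `catalog_properties` and `create_catalog`, the catalog name, the REST catalog type, and the URI and endpoint values are all placeholders chosen for illustration. The hyphenated key `s3.request-timeout` follows the documentation table in this PR.

```python
from typing import Dict


def catalog_properties(request_timeout_seconds: float = 60.0) -> Dict[str, str]:
    """Hypothetical helper: catalog properties with a longer S3 request timeout."""
    return {
        "type": "rest",  # assumed catalog type, for illustration only
        "uri": "http://localhost:8181",  # placeholder catalog URI
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.endpoint": "https://s3.example.com",  # placeholder S3-compatible endpoint
        "s3.connect-timeout": "60.0",
        # Raise the request timeout so slow PUT operations do not time out.
        "s3.request-timeout": str(request_timeout_seconds),
    }


def create_catalog():
    """Create the catalog; needs PyIceberg installed and a reachable catalog service."""
    # Imported lazily so catalog_properties() can be used without PyIceberg.
    from pyiceberg.catalog import load_catalog

    return load_catalog("default", **catalog_properties())
```

Keeping the properties in a plain dictionary makes it easy to try different timeout values against a slow endpoint before settling on one.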