S3 requests fail after the hyper S3 connection timeout duration during large table operations #2301
@liamphmurphy I'm happy I wasn't the only one facing this. Was questioning our infra setup
Haha I'm also glad to know I'm not completely crazy! This may be relevant: awslabs/aws-sdk-rust#1079 (comment) If this fix is related, I've noticed the hyper version
@liamphmurphy just to double check, do you also see the behavior on 0.15.3? That's what we use at the moment
The linked code you shared is actually not out yet in a Python release. It was recently refactored by @mightyshazam to use the AWS SDKs.
@ion-elgreco Tested with 0.15.3 and yep I get that second more generic error still. And ahh okay understood, guess that explains why changing that env var wasn't doing anything haha. I'll bring that down locally and see if that fixes it
@liamphmurphy we actually see it happen in 1.5 minutes already:
The above exception was caused by the following exception:
_internal.DeltaError: Generic DeltaTable error: External error: External error: Arrow error: External error: Parquet error: ParquetObjectReader::get_byte_ranges error: Generic S3 error: request or response body error: operation timed out
I did some debugging and found where this error occurs (at least for me). It happens here:
So it appears that it finds the right checkpoint file (which for me is around 800 MB) and detects any unaccounted-for logs made after that checkpoint, but then fails on scanning this stream.
And here are some additional logs in case they're helpful:
@liamphmurphy very useful! So for you it's reading these huge checkpoints then. I think in our case I figured out what's happening: sometimes our incoming data has 0 rows, which was causing the merge to scan the whole table. I'll create something separate for that, but there is crossover in the connection issues.
@ion-elgreco Yeah, appears to be. I presume the reason a stream is used here is to accommodate large checkpoints like this without having to load it all in memory, but then that hyper client connection closes before it can finish. 🤔
@ion-elgreco we're seeing a similar issue using Azure. When we run the following code, it fails:

```python
import os

os.environ["RUST_LOG"] = "debug"

from deltalake import DeltaTable

blob_path = "az://<redacted path>"
storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "<redacted sa>",
    "AZURE_CONTAINER_NAME": "<redacted container>",
    "use_azure_cli": "true",
}

dt = DeltaTable(blob_path, storage_options=storage_options)
dt.optimize.z_order(["StatusDateTime"])
```

The table consists of 12 parquet files of roughly 400 MB each. You can see the full error log here:
@thomasfrederikhoeck can you try this? I believe the default is quite low: `storage_options = {"timeout": "120s"}`
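For context, a minimal sketch of passing the client timeout through `storage_options` (the `timeout` key comes from the suggestion above; the `connect_timeout` key is an assumption and should be checked against the installed deltalake/object_store versions):

```python
from deltalake import DeltaTable

# Client options are plain strings in storage_options and are forwarded to
# the underlying object_store HTTP client alongside the credentials.
storage_options = {
    "AZURE_STORAGE_ACCOUNT_NAME": "<redacted sa>",
    "use_azure_cli": "true",
    "timeout": "120s",         # overall request timeout (default is lower)
    "connect_timeout": "30s",  # assumption: also accepted as a client option
}

dt = DeltaTable("az://<redacted path>", storage_options=storage_options)
dt.optimize.z_order(["StatusDateTime"])
```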
@ion-elgreco I just ran it with that timeout setting and I see the following:
@liamphmurphy @thomasfrederikhoeck can you try this branch? #2378 Likely not the issue, but worthwhile to try. @thomasfrederikhoeck on your side, can you check the logs of your object store? Your error msg indicates that the connection got closed from that side.
@ion-elgreco just tried #2378. I still get the error with the following logs. Could this be related to something like apache/arrow-rs#5366 and pola-rs/polars#14384 (comment)? With regards to the closing, it only happens when I set the timeout to 120s. I'm not 100% sure how to get logs out of the Azure Storage Account.
@thomasfrederikhoeck @liamphmurphy can you guys please try out this PR: #2392
@ion-elgreco it did not solve it. :-/ I see the following 5 times:
How long is your timeout set to?
@ion-elgreco That is without the timeout explicitly set. If I set the timeout to 120s, I get the error about the connection being closed by the remote host.
@thomasfrederikhoeck I am not really sure what the issue could be, because your previous errors showed closed by remote host, so that sounds more like a server-side kill of the connection.
@ion-elgreco if I use the default timeout I don't see the error about the remote closing the connection, but in both cases I still see an error. I just tried from an Azure VM to maximize network throughput and that was successful. Is there some way to limit the concurrency, to test whether I'm overloading my network when I try from my laptop? I used to have similar issues in polars until this was merged: pola-rs/polars#15083 (see pola-rs/polars#14384 (comment) and pola-rs/polars#14384 (comment)).
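One way to test that locally might be to cap the optimize concurrency. A rough sketch, assuming the Python `optimize.z_order` API exposes a `max_concurrent_tasks` parameter (worth verifying against the installed deltalake version):

```python
from deltalake import DeltaTable

dt = DeltaTable("az://<table path>", storage_options={"use_azure_cli": "true"})

# Assumption: max_concurrent_tasks bounds how many files are rewritten in
# parallel, which in turn limits concurrent downloads/uploads on the network.
dt.optimize.z_order(["StatusDateTime"], max_concurrent_tasks=4)
```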
@thomasfrederikhoeck do you mind checking if you see the timeout behavior with 0.15.1? In 0.15.2 the log replay introduced this new replay stream concept using futures; it might be the cause. Ah, I see now it happens in Optimize for you, which makes sense I guess. The RecordBatchStream is created using a futures::stream. Let me see if I can stretch my knowledge of async Rust here..
@ion-elgreco I tried 0.15.1 but that unfortunately didn't help. I wish I could assist, but my knowledge of Rust is very limited, async or not :-D Let me know if there is anything I can test out. I also want to mention that when running the script above (#2301 (comment)) I sometimes see the following in the logs:
@ion-elgreco I can maybe hire a consultant to look at this, if you would have the time to review a PR, given that I can't really contribute myself.
@thomasfrederikhoeck for sure, I am happy to free up time to review a PR :)
@ion-elgreco I have someone working on it now, but given that object_store is changing around AsyncWrite, do you think it makes sense to continue? Won't it maybe be better to wait for #2436, since it looks like a new release of the arrow crate might not be too far away: apache/arrow-rs#5688 For reference:
Yes, it's better to first upgrade the object store to use the Upload trait and see how much of an impact that has. Perhaps that person can help with refactoring to those APIs. We are however stuck because we can't upgrade beyond Datafusion 37, see here: #2395, apache/datafusion#10181
@thomasfrederikhoeck can you try #2581, which bumps the object_store dep? I refactored the object store wrapper on the Python side to use an internal buffer with the Upload trait that occasionally flushes; it may solve this issue.
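Illustratively, the buffering idea is roughly the following. This is a simplified Python sketch of the general pattern, not the actual PR code; `upload` is a hypothetical handle exposing `put_part`/`complete`:

```python
class BufferedUpload:
    """Accumulate writes in memory and flush them in large parts instead of
    issuing many small requests that each risk hitting the client timeout."""

    def __init__(self, upload, flush_size: int = 5 * 1024 * 1024):
        self._upload = upload          # hypothetical multipart-upload handle
        self._buffer = bytearray()
        self._flush_size = flush_size

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)
        if len(self._buffer) >= self._flush_size:
            self._flush()

    def close(self) -> None:
        if self._buffer:
            self._flush()
        self._upload.complete()        # finalize the multipart upload

    def _flush(self) -> None:
        self._upload.put_part(bytes(self._buffer))
        self._buffer.clear()
```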
@abhiaagarwal Nice! I will test today or tomorrow :-)
@thomasfrederikhoeck it's merged, try main :)
@abhiaagarwal it's actually already released 😉
@abhiaagarwal @ion-elgreco It looks like it has solved it (not getting any timeouts). NICE! However, instead I got the following:
Edit: #2592
Hmm, @liamphmurphy can you check if this has been solved on the AWS side? The error you're getting @thomasfrederikhoeck is the exact same one that was reported above; the timeouts seem to be solved. Digging a bit deeper, the error is originating from here, where it's performing an HTTP range request to get the parquet metadata without downloading the whole file (DuckDB does something similar). I wonder if the error still lies with object_store's handling of HTTP range requests?
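For reference, the range-request pattern being described looks roughly like the following standalone sketch of how a reader fetches just the parquet footer; the bucket, key, and use of boto3 here are illustrative, not what delta-rs actually calls:

```python
import struct

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "path/to/part-00000.parquet"  # placeholders

# A parquet file ends with: <footer metadata> <4-byte footer length> b"PAR1".
tail = s3.get_object(Bucket=bucket, Key=key, Range="bytes=-8")["Body"].read()
footer_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1", "not a parquet file"

# A second suffix-range request fetches only the footer metadata, so the
# whole (potentially multi-hundred-MB) file is never downloaded.
footer = s3.get_object(
    Bucket=bucket, Key=key, Range=f"bytes=-{footer_len + 8}"
)["Body"].read()[:footer_len]
```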
@abhiaagarwal you didn't update the Rust writer IIRC?
@ion-elgreco yep, but the failure has to do with a read. Perhaps there's a path where it reads the parquet metadata before it performs the delete operation?
@abhiaagarwal the new error I'm getting now is not the one reported above? Previously I was getting the timeouts as reported above, but your PR has fixed this (I tried a few times and didn't see it a single time). So I believe #2592 is new :-)
It might be related to reading the checkpoint, and we also often scan the parquet files.
@thomasfrederikhoeck yeah, I'm just curious to see whether the same error is observed on the AWS side. The error here seems to be "the internal parquet reader can't decode the metadata returned by the HTTP range request" (where it previously would time out even trying to execute the range request), which could point to an implementation flaw in the upstream arrow dependency or maybe in the object store implementation for Azure.
I'm curious if there's a PR that closed this? Was it the object store bump?
Environment
Delta-rs version: python v0.16.0
Binding: Python
Environment:
Bug
What happened:
On three of our large Delta tables (out of 500, most of the others are a lot smaller) I've repeatedly been seeing errors related to an S3 timeout. I get two kinds of errors; one indicates an error reading data from a parquet file, the other is just a much more generic error. Here they are:
This can happen in various contexts, but the prominent example in my case is needing to scan the entire Delta table for records matching a value and deleting those matches. I've added logging right before this delete op runs, and the time delta between that log and the error is usually around 5 minutes. This time delta is relevant.
The second one in particular comes up a lot. This is on a table that gets around 250,000 rows added every hour and is partitioned by date (e.g. 2024-03-18) and hour (e.g. 15). My initial thought was that this was due to a failure to download a larger file such as a checkpoint, but those are "only" 668 MB, and that size is certainly not unheard of; the various AWS SDKs will gladly move larger files.
Our current theory is that this timeout instead indicates that the `hyper` client makes a request, then the library does some in-memory operations (such as scanning a large Delta table) that take a long time, and by the time delta-rs is ready to reach out to S3 again, it has exceeded some kind of keep-alive timeout and fails. This is why the 5-minute time delta I mentioned earlier is important. This seems related to this comment:
delta-rs/crates/aws/src/storage.rs, lines 401 to 406 in 785667e
And finally, I think this sort of timeout is expected from the `hyper` client, per this (admittedly old) ticket, but it explains this exact situation in another context: hyperium/hyper#763
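If that theory holds, one possible mitigation while debugging would be to raise the relevant client options via `storage_options`. This is a hedged sketch; whether keys such as `pool_idle_timeout` are actually forwarded to the underlying object_store client for S3 should be verified against the versions in use:

```python
from deltalake import DeltaTable

storage_options = {
    "AWS_REGION": "us-east-1",
    # Assumption: these string keys map onto the underlying HTTP client
    # options and let connections survive long in-memory phases.
    "timeout": "600s",            # overall request timeout
    "pool_idle_timeout": "600s",  # keep idle pooled connections alive longer
}

dt = DeltaTable("s3://<bucket>/<table>", storage_options=storage_options)
```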
What you expected to happen:
Even on long-running operations, such as a scan over the whole table for deletion, the operation would eventually succeed, barring other issues.
How to reproduce it:
IF I AM CORRECT: perform a long-running Delta operation, such as a delete on a table with millions of rows, and have it try to make a final reach out to S3 at the end.
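A rough reproduction sketch under that assumption; the table URI and predicate are placeholders:

```python
from deltalake import DeltaTable

# Assumption: a table large enough that the scan/planning phase takes several
# minutes before the final commit requests go back out to S3.
dt = DeltaTable("s3://<bucket>/<large-table>")

# Long-running operation: scan the whole table and delete matching rows.
dt.delete("customer_id = 'some-value'")
```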
More details:
Note this is just a theory. I wanted to get this issue out there to see what others think; I'm going to dig in and see if I can figure out whether it's an issue with the library or something else.