slow transfer speeds from URL sources #113
I am working on ingesting the RPV2 dataset into GCS buckets using GCP Storage Transfer jobs. Speeds are extremely slow (on the order of 100 KB/s to 1 MB/s), and at this rate it will take weeks to transfer the files. The bottleneck could still be on my end, but increasingly it looks like the host is either throttling connections or overloaded on I/O.
Can you shed any light on how this dataset is hosted, or what the best transfer methods would be at scale? I've already prototyped a small pipeline on sampled data and would like to scale it up in a reasonable timeframe.
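(For anyone following a similar route: Storage Transfer Service jobs that pull from HTTP sources consume a tab-separated URL list that starts with a `TsvHttpData-1.0` header line. Below is a minimal sketch for generating such a manifest from a plain text file of download URLs; the file names are placeholders, not paths documented for RPV2.)

```python
# Sketch: build a URL-list manifest ("TsvHttpData-1.0") for a GCP Storage
# Transfer Service job from a plain-text file of dataset URLs.
# Input/output paths below are hypothetical placeholders.

def write_tsv_manifest(url_file: str, manifest_path: str) -> None:
    with open(url_file) as src, open(manifest_path, "w") as dst:
        dst.write("TsvHttpData-1.0\n")  # required header line
        for line in src:
            url = line.strip()
            if url:
                # Size and MD5 columns are optional; with only the URL,
                # the service cannot validate checksums after transfer.
                dst.write(f"{url}\n")

if __name__ == "__main__":
    write_tsv_manifest("rpv2_urls.txt", "manifest.tsv")
```

The manifest itself then has to be hosted somewhere the transfer job can read it; if the optional size/MD5 columns are filled in, the service can validate each transferred object.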
1 MB/s sounds extremely slow -- how many connections/requests per second are you sending to our endpoint? We do have throttling mechanisms that kick in if too many requests are made. Do you see any 429 errors on your end?
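(One quick way to answer that from the client side is to probe a handful of sample URLs and tally the HTTP status codes. The sketch below uses `requests` and a placeholder URL; it is an illustration, not part of the original thread.)

```python
# Sketch: fetch a few sample URLs and count HTTP status codes to check
# whether the slowdown is caused by 429 (Too Many Requests) responses.
from collections import Counter
import requests

def probe(urls):
    counts = Counter()
    for url in urls:
        try:
            resp = requests.get(url, stream=True, timeout=30)
            counts[resp.status_code] += 1
            resp.close()
        except requests.RequestException as exc:
            counts[type(exc).__name__] += 1
    return counts

if __name__ == "__main__":
    sample = ["https://example.com/part-000.json.gz"]  # placeholder URL
    print(probe(sample))
```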
@mauriceweber can you comment at all on the hosting architecture, or the most efficient way to initiate file transfers? Are these files hosted on a cloud storage solution like GCS, S3, or CloudFront? Are they hosted on a single large machine? My previous transfer jobs were not successful, and I'll need to start a new transfer job this week. Knowing how these are hosted will help me form a reasonable estimate of how long the jobs should take, and inform which transfer method I choose.
Hi @axelmagn, apologies for the late answer! The files are hosted on cloud storage and are only publicly accessible via HTTP; requests are rate limited, so it is your responsibility to cap the number of requests you send in order not to get throttled. We are looking into other solutions for large-scale downloads, to make it more convenient to access the full dataset.
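(Since the thread doesn't state a concrete limit, a defensive client-side approach is to keep concurrency low and back off on HTTP 429. The sketch below assumes a small worker pool and exponential backoff; the specific numbers are guesses, not limits from the maintainers.)

```python
# Sketch: client-side throttling with bounded concurrency and exponential
# backoff on HTTP 429. Worker count and retry limits are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

MAX_WORKERS = 4   # keep the number of parallel connections small
MAX_RETRIES = 5

def fetch(url: str, dest: str) -> None:
    for attempt in range(MAX_RETRIES):
        resp = requests.get(url, stream=True, timeout=60)
        if resp.status_code == 429:
            # honor Retry-After (seconds) if present, else back off exponentially
            retry_after = resp.headers.get("Retry-After")
            wait = float(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
            time.sleep(wait)
            continue
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
        return
    raise RuntimeError(f"giving up on {url} after {MAX_RETRIES} attempts")

def download_all(pairs):  # pairs: iterable of (url, local_path)
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(lambda p: fetch(*p), pairs))
```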
No worries, and thanks for the reply. Edit: how many requests is too many?