
slow transfer speeds from URL sources #113

Open
axelmagn opened this issue Mar 29, 2024 · 5 comments

Comments

@axelmagn

I am working on ingesting the RPV2 dataset into GCS buckets using GCP Storage Transfer jobs. Speeds seem incredibly slow (on the order of 100 KB/s to 1 MB/s), and at this rate it will take weeks to transfer the files. There's still a possibility that the bottleneck is on my end, but it increasingly looks like the host is either throttling connections or overloaded on I/O.

Can you shed any light on how this dataset is hosted, or what the best transfer methods would be at scale? I've already prototyped a small pipeline on sampled data, and would like to scale it up in a reasonable timeframe.
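For context, Storage Transfer Service jobs that pull from URL sources consume a TSV URL list. Below is a minimal sketch of generating one, assuming the standard `TsvHttpData-1.0` format; the URLs are hypothetical placeholders, not the real RPV2 paths.

```python
# Build a URL list for a Storage Transfer Service job. The file starts
# with a "TsvHttpData-1.0" header line, then one row per object. Only
# the URL column is required; object size (bytes) and a Base64 MD5 may
# be appended as extra tab-separated columns.

urls = [
    "https://data.example.com/rpv2/part-0000.json.gz",  # hypothetical
    "https://data.example.com/rpv2/part-0001.json.gz",  # hypothetical
]

with open("url_list.tsv", "w") as f:
    f.write("TsvHttpData-1.0\n")
    for url in urls:
        f.write(f"{url}\n")
```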

@mauriceweber
Collaborator

1MB/s sounds extremely slow -- how many connections/requests per second are you sending to our endpoint? We do have throttling mechanisms if too many requests are made. Do you see any 429 errors on your end?
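One quick way to check for throttling from a single client is to probe one file and inspect the status code: 429 means "Too Many Requests". A minimal sketch, with a hypothetical placeholder URL:

```python
import requests

# Hypothetical placeholder URL for a single dataset file.
url = "https://data.example.com/rpv2/sample.json.gz"

resp = requests.get(url, stream=True)
if resp.status_code == 429:
    # If present, Retry-After says how long the server wants clients to wait.
    print("throttled; Retry-After =", resp.headers.get("Retry-After", "not set"))
else:
    print("status:", resp.status_code)
```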

@axelmagn
Author

axelmagn commented Apr 2, 2024

Unfortunately, because I am using GCP transfer jobs, I don't know the exact number of concurrent connections. However, I am running 180 concurrent jobs, and they may be sharing IP addresses. No 429 errors have been reported.

The throughput has been quite variable, and has recovered quite a bit since the time of posting:

[screenshot: throughput graph]

Is this dataset hosted on a single server, or is it distributed across nodes in some way?

@axelmagn
Author

axelmagn commented Apr 8, 2024

@mauriceweber can you comment at all on the hosting architecture, or the most efficient way to initiate file transfers? Are these files hosted on a cloud storage service like GCS, S3, or CloudFront? Are they hosted on a single larger machine? My previous transfer jobs were not successful, and I'll need to start a new transfer job this week. Knowing how the files are hosted will help me form a reasonable estimate of how long the jobs should take and inform which transfer method I choose.

@mauriceweber
Collaborator

Hi @axelmagn, apologies for the late answer! The files are hosted on cloud storage and are publicly accessible only via HTTP, and requests are rate limited -- it is your responsibility to cap the number of requests you make so that you don't get throttled. We are looking into other solutions for larger-scale downloads, to make accessing the full dataset more convenient.
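Given plain HTTP access with server-side rate limiting, one reasonable client-side pattern is to bound concurrency and back off on 429s. A minimal sketch follows; `MAX_CONCURRENCY`, the retry schedule, and the URL scheme are assumptions to tune against the host's actual limits.

```python
import threading
import time

import requests

MAX_CONCURRENCY = 8  # hypothetical cap; tune against the host's real limits
_slots = threading.Semaphore(MAX_CONCURRENCY)

def download(url: str, dest: str, max_retries: int = 5) -> None:
    """Fetch one file, backing off exponentially on 429 responses."""
    with _slots:  # bound the number of simultaneous connections
        delay = 1.0
        for _ in range(max_retries):
            resp = requests.get(url, stream=True)
            if resp.status_code == 429:
                time.sleep(delay)  # server is throttling us; wait and retry
                delay *= 2
                continue
            resp.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
            return
        raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Called from a thread pool, this keeps the total connection count bounded no matter how many files are queued, which is the property a fleet of 180 independent transfer jobs can't guarantee.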

@axelmagn
Author

axelmagn commented Apr 9, 2024

No worries and thanks for the reply.

Edit: how many is too many?
