
How to copy large list of files/blobs #2905

Open

olsgaard opened this issue Jan 4, 2025 · 2 comments

Comments


olsgaard commented Jan 4, 2025

I have a list of ~37,000 blob names stored in a txt file.

I'd like to download them, taking advantage of azcopy's parallel processing.

Here is what I tried:

Pass all blob names, separated by semicolons, to azcopy's --include-path parameter, as documented here

$ azcopy --version

azcopy version 10.27.1

$ include_path=$(paste -sd';' blob_list.txt)
$ azcopy copy \
    "https://${storage_account_name}.blob.core.windows.net/${container_name}?${sas_key}" \
    "$dest_dir" \
    --include-path "$include_path"

/usr/bin/azcopy: Argument list too long

(The combined argument string exceeds the kernel's ARG_MAX limit, so the shell refuses to exec azcopy at all.)

I can't use --include-pattern because the blob names are too varied to be reduced to a reasonable number of patterns.

I can loop over all the lines in my blob list, but then it's not parallel.
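
To be concrete, by looping I mean something like the following, which starts a fresh azcopy process per blob (variable names as above; the stdin redirect is my own precaution, see below):

# One azcopy invocation per blob; stdin is redirected so azcopy
# cannot consume the lines that `read` is supposed to see.
while read -r blob; do
    azcopy copy \
        "https://${storage_account_name}.blob.core.windows.net/${container_name}/${blob}?${sas_key}" \
        "$dest_dir" </dev/null
done < blob_list.txt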

For some reason ChatGPT thinks I can pass "@blob_list.txt" to the --from-to parameter, but that doesn't match anything in the documentation. I can also see someone on StackOverflow trying azcopy jobs create, but that looks like it has been deprecated since at least version 10.

vibhansa-msft (Member) commented

You can either filter by name or by date/time (e.g. all files updated after a given time), if that is possible in your case.
If filters are not possible, then running a script that fires multiple azcopy commands is the only option here.
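
A minimal sketch of that approach (assuming bash 4.3+ for wait -n and GNU coreutils for split; chunk size and parallelism are illustrative):

# Split the blob list into chunks of 500 names, then run one azcopy
# per chunk, with a bounded number of parallel processes.
split -l 500 blob_list.txt chunk_

for chunk in chunk_*; do
    include_path=$(paste -sd';' "$chunk")
    azcopy copy \
        "https://${storage_account_name}.blob.core.windows.net/${container_name}?${sas_key}" \
        "$dest_dir" \
        --include-path "$include_path" </dev/null &
    # Block whenever 4 azcopy processes are already running.
    while [ "$(jobs -rp | wc -l)" -ge 4 ]; do wait -n; done
done
wait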


olsgaard commented Jan 9, 2025

Hi Vibhansa,

Thank you for the reply.

I'm not sure if you mean looping over each file individually. That is not really feasible for tens of thousands of files, as azcopy spends a few seconds starting up before each download begins.

It is also not at all obvious how to loop over batches of files with azcopy.

Using xargs to chunk the list of filenames and looping over those chunks causes weird behaviour in azcopy, where it prints hundreds, if not thousands, of lines of "INFO: Discarding incorrectly formatted input message" per iteration.

xargs -a filelist.txt | tr ' ' ';' | while read -r chunk; do
    azcopy copy \
        "https://${storage_account_name}.blob.core.windows.net/${container_name}?${sas_key}" \
        "$dest_dir" \
        --include-path "$chunk"
done

Surprisingly, if you run echo | azcopy ... inside the while body, those messages are no longer printed; I think that is related to issue #974.
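
If stdin consumption really is the cause, a possible workaround (my assumption, not a confirmed fix) is to detach azcopy from the pipe that feeds the loop:

xargs -a filelist.txt | tr ' ' ';' | while read -r chunk; do
    # /dev/null on stdin keeps azcopy from swallowing the chunks
    # that the while loop's `read` is supposed to consume.
    azcopy copy \
        "https://${storage_account_name}.blob.core.windows.net/${container_name}?${sas_key}" \
        "$dest_dir" \
        --include-path "$chunk" </dev/null
done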

Lastly, there appears to be an undocumented flag, --list-of-files, described at https://github.com/Azure/azure-storage-azcopy/wiki/Listing-specific-files-to-transfer, but it is not recommended because it is slow for many files.
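
For reference, based on that wiki page the usage would look something like this (untested on my end; the file is expected to contain one relative path per line):

azcopy copy \
    "https://${storage_account_name}.blob.core.windows.net/${container_name}?${sas_key}" \
    "$dest_dir" \
    --list-of-files blob_list.txt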

Let me know if there are any recommended solutions to this.
