
Unable to List Blobs in Azure via S3Proxy #717

Open
samuel-davis opened this issue Nov 8, 2024 · 6 comments


samuel-davis commented Nov 8, 2024

Problem:
Using S3Proxy configured to talk to Azure Blob Storage (ABS), I cannot use the marker parameter when I attempt to ListBlobs. My codebase uses an underlying Python library called deltalake, and it is deltalake that calls S3Proxy.

The incoming request to S3Proxy from my code is :

http://df-s3-proxy/df-bucket?list-type=2&prefix=av2%2Fsilver8%2F_delta_log%2F&start-after=av2%2Fsilver8%2F_delta_log%2F00000000000000000000

The transformed request that is sent to jclouds, and then on to ABS in the cloud, is:

https://vulcanforgetest.blob.core.windows.net/df-bucket?restype=container&comp=list&prefix=av2/silver8/_delta_log/&marker=av2/silver8/_delta_log/00000000000000000000&maxresults=1000&include=metadata
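For context, the translation between the two query strings above can be sketched as follows. This is an illustrative mapping written for this discussion, not S3Proxy's actual code, and the function name `map_list_params` is hypothetical:

```python
from urllib.parse import unquote

def map_list_params(s3_params):
    """Illustrative sketch (not S3Proxy's real code) of how an S3
    ListObjectsV2 query string maps onto Azure's List Blobs operation."""
    azure = {"restype": "container", "comp": "list"}
    if "prefix" in s3_params:
        azure["prefix"] = unquote(s3_params["prefix"])
    # start-after is passed through as Azure's marker. This only works
    # if Azure accepts the value as a continuation token, which it does
    # not for arbitrary object keys.
    if "start-after" in s3_params:
        azure["marker"] = unquote(s3_params["start-after"])
    azure["maxresults"] = s3_params.get("max-keys", "1000")
    return azure

params = map_list_params({
    "list-type": "2",
    "prefix": "av2%2Fsilver8%2F_delta_log%2F",
    "start-after": "av2%2Fsilver8%2F_delta_log%2F00000000000000000000",
})
```

The `start-after` value is copied into `marker` essentially verbatim, which is where the error below originates.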

Finally, the exception caused by the marker parameter is:

org.jclouds.azure.storage.AzureStorageResponseException: command [method=org.jclouds.azureblob.AzureBlobClient.public abstract org.jclouds.azureblob.domain.ListBlobsResponse org.jclouds.azureblob.AzureBlobClient.listBlobs(java.lang.String,org.jclouds.azureblob.options.ListBlobsOptions[])[df-bucket, [Lorg.jclouds.azureblob.options.ListBlobsOptions;@41358ef0], request=GET https://vulcanforgetest.blob.core.windows.net/df-bucket?restype=container&comp=list&prefix=av2/silver8/_delta_log/&marker=av2/silver8/_delta_log/00000000000000000000&maxresults=1000&include=metadata HTTP/1.1] failed with code 400, error: AzureError{requestId='9606be37-201e-0072-4006-3223a9000000', code='InvalidQueryParameterValue', message='Value for one of the query parameters specified in the request URI is invalid.
RequestId:9606be37-201e-0072-4006-3223a9000000
Time:2024-11-08T17:46:11.9123328Z', context='{QueryParameterValue=av2/silver8/_delta_log/00000000000000000000, QueryParameterName=marker, Reason=Invalid ListBlobs marker.}'}

You can see above that it basically boils down to this:

[image: screenshot of the InvalidQueryParameterValue / Invalid ListBlobs marker error]

I've looked through both the S3Proxy and jclouds code, but because the error is so abstract and doesn't say why the marker parameter is invalid, I'm reaching out for some help.

I'm more than happy to submit a PR if you can point me to where this can be resolved.

I should also say that I attempted using the azureblob-sdk provider, and while it DOES write data and gets past this error, the data that is written cannot be read back correctly afterwards: deltalake reports that the file sizes are smaller than they should be. This implies to me that the write operation isn't working correctly even with azureblob-sdk.

Environment where the failure is seen:
Azure Blob Storage account, configured with azureSharedKey in S3Proxy
provider: azureblob

@gaul gaul added the azure label Nov 10, 2024
gaul (Owner) commented Nov 12, 2024

I believe that your S3 client is using a marker that was not supplied by the object store. S3 supports any arbitrary string as a marker, for example:

S3 has objects [a, b, d, f]
list maxResults = 1, marker = a -> returns b
list maxResults = 1, marker = b -> returns d
list maxResults = 1, marker = c -> returns d

However, Azure's markers are opaque tokens rather than simple strings. Thus if you issue the same set of operations:

Azure has objects [a, b, d, f]
list maxResults = 1, marker = a -> returns b
list maxResults = 1, marker = b -> returns d
list maxResults = 1, marker = c -> emits error
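The contrast above can be sketched in a few lines of Python. This is a toy model of the two semantics for this discussion, not either service's real behavior:

```python
def s3_list(keys, marker=None, max_results=1):
    # S3 semantics: any string is a valid marker; return keys that sort
    # strictly after it lexicographically.
    remaining = [k for k in sorted(keys) if marker is None or k > marker]
    return remaining[:max_results]

def azure_list(keys, marker=None, issued_markers=(), max_results=1):
    # Azure semantics (simplified): the marker must be a continuation
    # token the service previously handed out; anything else fails with
    # InvalidQueryParameterValue.
    if marker is not None and marker not in issued_markers:
        raise ValueError("InvalidQueryParameterValue: Invalid ListBlobs marker.")
    remaining = [k for k in sorted(keys) if marker is None or k > marker]
    return remaining[:max_results]

keys = ["a", "b", "d", "f"]
print(s3_list(keys, marker="c"))  # arbitrary marker is fine for S3
try:
    azure_list(keys, marker="c", issued_markers=("a", "b"))
except ValueError as e:
    print(e)  # "c" was never issued as a token, so Azure rejects it
```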

How have you configured your client such that it is using an unexpected marker? Sometimes clients use the last key instead of the marker while listing but S3Proxy has some fixup logic for this. Also some clients try to list large buckets in parallel using random keys but S3Proxy cannot support this for either azureblob or azureblob-sdk.
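One way such fixup logic can work is to remember the last key returned on each page alongside the opaque token the backend issued for it. The sketch below is illustrative only, not S3Proxy's actual implementation, and the class name `MarkerFixup` is hypothetical:

```python
class MarkerFixup:
    """Illustrative sketch (not S3Proxy's implementation): map the last
    key returned on a listing page back to the opaque continuation token
    the backend issued for that page."""

    def __init__(self):
        self._token_for_last_key = {}

    def record_page(self, last_key, backend_token):
        # Called after returning each page: remember which backend token
        # continues the listing after this key.
        self._token_for_last_key[last_key] = backend_token

    def translate(self, client_marker):
        # If the client echoed the last key we returned, substitute the
        # real token; otherwise pass the marker through unchanged and let
        # the backend reject it if it is not a valid token.
        return self._token_for_last_key.get(client_marker, client_marker)

fixup = MarkerFixup()
fixup.record_page(last_key="b", backend_token="opaque-token-123")
print(fixup.translate("b"))    # substituted with the backend token
print(fixup.translate("zzz"))  # unknown marker, passed through as-is
```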

Please open a separate issue for the azureblob-sdk error. While it will likely have the same marker limitation, I am more likely to fix something if data is not being written properly. Note that many changes have landed in it over the last few weeks, so please ensure you use the latest version.

@gaul gaul added the needinfo label Nov 12, 2024
samuel-davis (Author)

Thank you for the explanation.

In your example above about opaque markers in Azure, this particular case stands out:

list maxResults = 1, marker = c -> emits error

Deltalake is attempting to determine what number to supply to the log before it writes data. I believe it is using 0000000000000 so it can get a list and then use the highest number.

However, in this case, on the first write operation there is nothing in the directory yet, so an invalid marker error is thrown. Does this line up with your explanation?
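For context, the version-discovery step being described can be sketched like this. It is an illustrative guess at the logic, not deltalake's actual code, and `next_log_version` is a hypothetical name:

```python
def next_log_version(log_keys):
    """List the _delta_log/ prefix and take the highest numeric version;
    the next commit gets max + 1, or 0 for a brand-new table.
    Illustrative sketch, not deltalake's actual implementation."""
    versions = []
    for key in log_keys:
        # e.g. "_delta_log/00000000000000000000.json" -> "00000000000000000000"
        stem = key.rsplit("/", 1)[-1].split(".", 1)[0]
        if stem.isdigit():
            versions.append(int(stem))
    return max(versions) + 1 if versions else 0

print(next_log_version(["_delta_log/00000000000000000000.json"]))  # -> 1
print(next_log_version([]))  # -> 0, the first-write case described above
```

The first-write case is the problem: the listing would come back empty, but the client still supplies a synthetic start-after key, which S3Proxy forwards as Azure's marker and Azure rejects.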

As for the separate issue, I'll write it up using the azureblob provider with Azurite as an emulator (it works perfectly there, since Azurite treats all invalid markers as an empty string), get the file sizes that were written, compare them to the azureblob-sdk provider with the same emulator, and then note the difference when pointing at real Azure Blob Storage.

samuel-davis (Author)

Lastly,

I have no control over the underlying S3 requests formulated by deltalake in my Python code, as they are encapsulated by the library itself.

This library works perfectly when targeting either S3 storage or MinIO, so I feel like the requests being made are correct, but I am honestly unsure.

gaul (Owner) commented Nov 14, 2024

I don't mean that your code is incorrect, but that the underlying S3 library may be doing an operation that S3Proxy with Azure cannot support. Can you share specifically which library you use and how you call it?

@jdunham-openai

I'm having a similar issue using this for Azure Blob Storage: my client isn't reusing the marker parameter from the previous response and is instead using the last key returned as the marker for the next request, failing in exactly this way. The last key does actually exist. I'm curious about:

"Sometimes clients use the last key instead of the marker while listing but S3Proxy has some fixup logic for this"

I am running the latest S3Proxy. I can try to dive in and see exactly how this S3 client is being used; we don't own it, it's part of an open source project.

@jdunham-openai

I found that my issue is fixed by #569; I hope we can get it merged.
