Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ObjectStore::get_opts Incorrectly Returns Response Size not Object Size #5272

Closed
CarlKCarlK opened this issue Jan 2, 2024 · 8 comments · Fixed by #5276
Closed

ObjectStore::get_opts Incorrectly Returns Response Size not Object Size #5272

CarlKCarlK opened this issue Jan 2, 2024 · 8 comments · Fixed by #5276
Labels
bug object-store Object Store Interface

Comments

@CarlKCarlK
Copy link

get_opts is documented to return the metadata of the object. When used on InMemory it does the right thing. However, when used on an S3 store, it can return the metadata of a region rather than the object.

Here is repro showing the issue:

use std::collections::HashMap;

use bytes::Bytes;
use object_store::{memory::InMemory, path::Path as StorePath, GetOptions, ObjectStore};
use tokio::runtime;
use url::Url;

fn main() {
    let rt = runtime::Runtime::new().unwrap();
    rt.block_on(async {
        // Create an in-memory object store
        let object_store = InMemory::new();
        let store_path = StorePath::from("data/file1");
        let bytes = Bytes::from_static(b"hello");
        object_store.put(&store_path, bytes).await.unwrap();

        // Ask for the object's metadata
        let object_meta = object_store.head(&store_path).await.unwrap();
        assert_eq!(object_meta.size, 5);

        // We read a region and get the metadata in one go?
        let get_options = GetOptions {
            range: Some(0..2),
            ..Default::default()
        };
        let get_result = object_store
            .get_opts(&store_path, get_options)
            .await
            .unwrap();

        // ============================================================
        // NOTE: The metadata is the same as the one we got from `head`
        // ============================================================
        assert_eq!(get_result.meta.size, 5);
        let bytes = get_result.bytes().await.unwrap();
        assert_eq!(bytes, Bytes::from_static(b"he"));

        // Next, we try to read from a public S3 bucket

        // arn:aws:s3:::ubc-sunflower-genome
        // AWS Region
        // us-west-2
        // AWS CLI Access (No AWS account required)
        // aws s3 ls --no-sign-request s3://ubc-sunflower-genome/
        // >aws s3 ls --no-sign-request s3://ubc-sunflower-genome/sequence/sra/PRJNA322345/SRR3648257.sra
        // 2020-04-05 16:22:18 7064209461 SRR3648257.sra

        // Any AWS credentials will do
        use rusoto_credential::{ProfileProvider, ProvideAwsCredentials};
        let credentials = ProfileProvider::new().unwrap().credentials().await.unwrap();
        let url = "s3://ubc-sunflower-genome/sequence/sra/PRJNA322345/SRR3648257.sra";
        let url = Url::parse(url).unwrap();
        let options: HashMap<&str, &str> = [
            ("aws_access_key_id", credentials.aws_access_key_id()),
            ("aws_secret_access_key", credentials.aws_secret_access_key()),
            ("aws_region", "us-west-2"),
        ]
        .iter()
        .cloned()
        .collect();
        let (object_store, store_path) = object_store::parse_url_opts(&url, options).unwrap();

        // Get the metadata
        let object_meta = object_store.head(&store_path).await.unwrap();
        assert_eq!(object_meta.size, 7064209461);

        // Can read a region and get the metadata in one go?
        let get_options = GetOptions {
            range: Some(0..2),
            ..Default::default()
        };
        let get_result = object_store
            .get_opts(&store_path, get_options)
            .await
            .unwrap();

        // ============================================================
        // NOTE: The metadata is NOT the same as the one we got from `head`.
        // `meta` is documented as "The [`ObjectMeta`] for this object".
        // However, it is not the metadata for the object, but the metadata
        // for the region we read.
        // ============================================================
        assert_eq!(get_result.meta.size, 2);
        let bytes = get_result.bytes().await.unwrap();
        assert_eq!(bytes, Bytes::from_static(b"NC"));
    })
}
[package]
name = "object-store-issue"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1.35.0", features = ["full"] }
object_store = { version = "0.8.0", features = ["aws"] }
bytes = "1.5.0"
url = "2.5.0"
rusoto_credential = "0.48.0"
@CarlKCarlK CarlKCarlK added the bug label Jan 2, 2024
@tustvold
Copy link
Contributor

tustvold commented Jan 2, 2024

I think the correct behaviour here is what S3 is doing, if you specify a Range this will be reflected in the returned payload. S3 doesn't provide a header containing the object length, just the ContentLength of the returned payload. We should probably document this behaviour more clearly and make InMemory and any other implementations behave similarly. Thank you for the report, this is a good find

@CarlKCarlK
Copy link
Author

Thanks for your response. The API could, of course, work either way. Let me add a vote to the InMemory (and, I think, File) way:

  • If a user needs the length of the returned region, they can use get_result.range.len(). They don't need get_result.meta.size. (I like thinking of meta has something that is fixed, not changing on each read.)
  • When reading a file for the first time, it is often the case that we want to know the length of the total file and some header bytes. It would be nice to be able to do this in one call. For example, I'm reading a genomics file called PLINK Bed. I need the file's total length and to check that it has the correct magic numbers https://en.wikipedia.org/wiki/List_of_file_signatures and to see which of the two BED flavors I have.

I can also, of course argue that other side, too. :-)

  • Perhaps, AWS, etc. actually requires two calls to get the object length and the value of a region, so the current S3 way would be both easier to implement and a better reflection of the behind-the-scenes work.

@tustvold
Copy link
Contributor

tustvold commented Jan 2, 2024

I think we are forced to follow the S3 approach, as that is all the HTTP APIs provide and we cannot perform multiple requests as this would not only be inefficient but would break atomicity.

If a user needs the length of the returned region

FWIW it is actually the etag and version metadata that are important to be bundled with the response, and the motivator for returning ObjectMeta.

@clbarnes
Copy link
Contributor

clbarnes commented Jan 3, 2024

The Content-Range header in the response can contain the length of the full resource (but doesn't have to) as well as the range returned - does anyone familiar with the stores know whether the servers deign to give us that information? In #5222 we just toss it but that can be changed. I think most stores return the range size (S3 way), so some behaviour is changing either way. Maybe if we decide to use the range size we just deprecate the field and have people do object_meta.range.len() (hoping nobody's started iterating through it, as len here represents the remaining items in ExactSizeIterator); if we want to return the resource size, we could change the name to total_size and make it an Option in case a range is requested and we get a header like Content-Range: 0-100/*.

One possible wrinkle is the way that responses in the local file system are handled - in all other stores, Bytes representing the requested range are returned, but for the local store a handle to the whole file is returned, which then needs to be sliced using ObjectMeta.range. Does that change the mental model of what .size means?

@Xuanwo
Copy link
Member

Xuanwo commented Jan 3, 2024

The Content-Range header in the response can contain the length of the full resource (but doesn't have to) as well as the range returned - does anyone familiar with the stores know whether the servers deign to give us that information?

Yep: storage services like s3, azblob and gcs will return the total size in content-range header.

@tustvold
Copy link
Contributor

tustvold commented Jan 3, 2024

If the full length is available to us, we should definitely return that in the metadata, that would be much less confusing. I'll have a play

@tustvold
Copy link
Contributor

tustvold commented Jan 3, 2024

#5276 appears to pass against the emulators at least, will run some tests against the real services. Thank you all for the input 👍

@tustvold tustvold changed the title object_store's S3 get_opts incorrectly returns meta of region rather than object. ObjectStore::get_opts Incorrectly Returns Response Size not Object Size Jan 3, 2024
tustvold added a commit that referenced this issue Jan 3, 2024
* Fix ObjectMeta::size for range requests (#5272)

* Docs

* Update object_store/src/lib.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Add tests

---------

Co-authored-by: Andrew Lamb <[email protected]>
@tustvold tustvold added the object-store Object Store Interface label Jan 5, 2024
@tustvold
Copy link
Contributor

tustvold commented Jan 5, 2024

label_issue.py automatically added labels {'object-store'} from #5222

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug object-store Object Store Interface
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants