
Navigating sub-directories/buckets? #61

Open
justinperkins opened this issue Mar 2, 2012 · 9 comments

@justinperkins

Once you dig into a bucket and retrieve the object contents, you just get a giant list of everything. When you're trying to read/list contents on a per-directory basis this proves difficult.

Any way or future plan to allow navigating through sub-directories (or buckets, if that's what they really are)?
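(For context: S3 has no true directories. Every object key is a flat string, and "folders" are just shared prefixes up to a delimiter, usually "/". A minimal, network-free sketch of that grouping, using made-up keys rather than a real bucket:)

```ruby
# S3 keys are flat strings; "directories" are only shared prefixes.
# Grouping keys by their first path segment client-side mimics what
# S3's delimiter-based listing does server-side.
keys = [
  "images/cats/1.jpg",
  "images/cats/2.jpg",
  "images/dogs/1.jpg",
  "docs/readme.txt"
]

top_level = keys.group_by { |k| k.split("/").first }

top_level.keys.sort # => ["docs", "images"]
```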

@qoobaa
Owner

qoobaa commented Mar 2, 2012

Hm, that'd be a nice thing to have. If you have an idea how to implement it, please submit a pull request.

Cheers.

@chouck

chouck commented Mar 3, 2012

I had the same issue yesterday. The comment about sending a :delimiter to find_all in objects_extension.rb implies that's the way to do it, but the code doesn't parse the returned data correctly.

I ended up adding the following code locally to gain this functionality:

module S3
  class Bucket
    def directory_list(options = {})
      options = {:delimiter => "/"}.merge(options)
      response = bucket_request(:get, :params => options)
      parse_directory_list_result(response.body)
    end

    def parse_directory_list_result(xml)
      names = []
      rexml_document(xml).elements.each("ListBucketResult/CommonPrefixes/Prefix") { |e| names << e.text }
      names
    end
  end
end

And then just call it with

bucket.directory_list :prefix => "foo/bar/baz/"

Sorry it's not a full-blown pull request, but I don't have the time to make one right now. You are, of course, welcome to do whatever you want with this. I'd also suggest removing the documentation about :delimiter from objects_extension.rb, since it's a bit of a red herring.

Thanks for writing this module in the first place,
-Chris

@qoobaa
Owner

qoobaa commented Mar 5, 2012

Chouck, can you try to add some tests, and create a pull request for that?

@justinperkins
Author

I'm not sure that patch solves the issue I was experiencing. I want to list just the top-level directories within a given bucket; from there the patch becomes effective, since you can take each directory/sub-bucket and pass it into the directory_list method.

UPDATE: This works wonderfully. Sorry for the preemptive comment.

@justinperkins
Author

(sorry for spam)

To get objects within a given subdirectory, this patch doesn't totally solve the problem: you still have to iterate over the entire collection and select just the objects you care about, à la ...

all_objects_in_my_bucket = s3_service.buckets.find('some bucket').objects
objects_grouped_by_sub_dir = s3_service.buckets.find('some bucket').directory_list(:prefix => 'some directory with many sub directories').inject({}) { |memo, dir|
  memo[dir] = all_objects_in_my_bucket.select { |o| o.key.include?(dir) }
  memo
}

There's got to be a better way.
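(A network-free sketch of the same client-side grouping, with hypothetical keys standing in for bucket.objects. Note that start_with? avoids the false positives include? would give when a directory name happens to appear in the middle of a key:)

```ruby
# Hypothetical stand-ins for the bucket's flat object list and the
# prefixes returned by directory_list.
keys = ["foo/a.txt", "foo/b.txt", "bar/c.txt"]
dirs = ["foo/", "bar/"]

# Same shape as the inject above, but matching on the key's prefix
# rather than substring inclusion.
grouped = dirs.each_with_object({}) do |dir, memo|
  memo[dir] = keys.select { |k| k.start_with?(dir) }
end

grouped["foo/"] # => ["foo/a.txt", "foo/b.txt"]
```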

@chouck

chouck commented Mar 8, 2012

True, but I think you and I are trying to solve different problems.

I have a huge tree of data and I wanted a list of sub-directories three layers down (which are dynamically generated, so I don't have a fixed list). I don't want all of the files in each of those sub-directories; in fact, I only want one or two out of the thousands in each sub-tree.

If I'm understanding what you are saying, it sounds like you want something more like:

my_bucket = s3_service.buckets.find('some bucket')
prefix_list = my_bucket.directory_list(:prefix => 'some directory with many sub directories')
prefix_list.each { |prefix| objects_grouped_by_sub_dir[prefix] = my_bucket.objects.find_all(:prefix => prefix)}

@justinperkins
Author

Yes! Guess I should've dug in on the source some more. Thanks.

@qoobaa qoobaa added the bug label Jul 6, 2015
@ericmwalsh

Sorry to resurrect an old thread, but I needed this feature very much. Any progress on this, or a linked PR? I wouldn't mind creating one!

@ericmwalsh

ericmwalsh commented Feb 9, 2017

Also, I expanded on what @chouck created:

module S3
  class Bucket
    # This method recurses while the response coming back
    # from S3 includes a truncation flag (IsTruncated == 'true'),
    # then parses the combined response bodies' XML
    # for CommonPrefixes/Prefix, AKA directories.
    def directory_list(options = {}, responses = [])
      options = {:delimiter => "/"}.merge(options)
      response = bucket_request(:get, :params => options)

      if is_truncated?(response.body)
        directory_list(options.merge({:marker => next_marker(response.body)}), responses << response.body)
      else
        parse_xml_array(responses + [response.body], options)
      end
    end

    private

    def parse_xml_array(xml_array, options = {}, clean_path = true)
      names = []
      xml_array.each do |xml|
        rexml_document(xml).elements.each("ListBucketResult/CommonPrefixes/Prefix") do |e|
          if clean_path
            names << e.text.gsub((options[:prefix] || ''), '').gsub((options[:delimiter] || ''), '')
          else
            names << e.text
          end
        end
      end
      names
    end

    def next_marker(xml)
      marker = nil
      rexml_document(xml).elements.each("ListBucketResult/NextMarker") {|e| marker ||= e.text }
      if marker.nil?
        raise StandardError, "response was truncated but contained no NextMarker"
      else
        marker
      end
    end

    def is_truncated?(xml)
      is_truncated = nil
      rexml_document(xml).elements.each("ListBucketResult/IsTruncated") {|e| is_truncated ||= e.text }
      is_truncated == 'true'
    end
  end
end

This handles listing out directories when you run into a key limit (due to the S3 API MaxKeys hard limit of 1000 keys). The request will recurse and grab all responses before parsing and returning them. I also added the ability to return "clean directory names" (folder names only) in lieu of returning the entire key/path.
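(The pagination control flow above can be sketched without the network by stubbing the pages. The hash below is a made-up stand-in for S3's XML responses, not the gem's API; it only demonstrates the follow-NextMarker-while-truncated loop:)

```ruby
# Each stubbed "page" mirrors one S3 list response: the prefixes it
# contains, whether it was truncated, and the marker for the next page.
PAGES = {
  nil  => { prefixes: ["a/", "b/"], truncated: true,  next_marker: "b/" },
  "b/" => { prefixes: ["c/"],       truncated: false, next_marker: nil  }
}

# Recurse while a page is truncated, accumulating prefixes, exactly
# like directory_list accumulates response bodies before parsing.
def directory_list(marker = nil, acc = [])
  page = PAGES.fetch(marker)
  acc += page[:prefixes]
  page[:truncated] ? directory_list(page[:next_marker], acc) : acc
end

directory_list # => ["a/", "b/", "c/"]
```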
