Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

image/cas: Add a generic CAS interface #40

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wking
Copy link
Contributor

@wking wking commented Oct 11, 2016

And implement that interface for tarballs based on the specs image-layout.

Also add a new oci-cas command so folks can access the new read functionality from the command line.

The Context interface follows the pattern recommended here, allowing callers to cancel long running actions (e.g. push/pull over the network for engine implementations that communicate with a remote store).

This is the current first commit in #5, to see if nibbling away at that PR one commit at a time makes review any easier. I have a directory-based backend in #5 if you want to see what that looks like. In another commit in #5 I replace the image/walker.go functionality with this new API.

@hustcat
Copy link
Contributor

hustcat commented Nov 18, 2016

Great job!

IMO, we should add List API to return all digests for Engine interface, such as:

// List Return all digests
List(ctx context.Context) (digests []string, err error)

This is needed for GC in the future.

@hustcat
Copy link
Contributor

hustcat commented Nov 18, 2016

And more, I'd like to add Exist API:

// Exist Return true if digest is exist, else return false
Exist(ctx context.Context, digest string) (bool,  error)

@wking
Copy link
Contributor Author

wking commented Nov 18, 2016

On Thu, Nov 17, 2016 at 07:49:24PM -0800, Ye Yin wrote:

IMO, we should add List API to return all digest for Engine interface, such as:

List(ctx context.Context) (digest []string, err error)

This is needed for GC in the future.

Listing is expensive, and I'd rather have GC be an opaque function
1. But we can certainly revisit this if/when the rest of the CAS
API lands, when it will be easier to push concrete GC implementations.
In the mean time, leaving list off is safer, because adding List is a
backwards compatible change while removing it is a backwards
incompatible change.

@wking
Copy link
Contributor Author

wking commented Nov 18, 2016

On Thu, Nov 17, 2016 at 08:25:25PM -0800, Ye Yin wrote:

And more, I'd like to add Exist API:

// Exist Return true if digest is exist, else return false
Exist(ctx context.Context, digest string) (bool,  error)

Is this for “I want to put this blob, whose hash I already know, but
don't want to bother if that blob is already in storage”? Maybe in
the context of copying between CAS engines? And you think that
calling Get and immediately closing any succesfully returned reader
would be too expensive (e.g. if the engine was backed by HTTP and you
could do a HEAD instead of a GET)? I think there might be a solid
case in there, but I'd rather wait and get the more obviously critical
portions of the API landed first. Then we can circle back and look at
this sort of optimization in a second pass.

@hustcat
Copy link
Contributor

hustcat commented Nov 18, 2016

Is this for “I want to put this blob, whose hash I already know, but
don't want to bother if that blob is already in storage”? Maybe in
the context of copying between CAS engines? And you think that
calling Get and immediately closing any succesfully returned reader
would be too expensive ...

Yeah, exactly:)

@hustcat
Copy link
Contributor

hustcat commented Nov 18, 2016

Listing is expensive, and I'd rather have GC be an opaque function
[1]. But we can certainly revisit this if/when the rest of the CAS
API lands, when it will be easier to push concrete GC implementations.
In the mean time, leaving list off is safer, because adding List is a
backwards compatible change while removing it is a backwards
incompatible change...

IMO, List is a very basic API, GC is one case. Get all digests is very common requirement, such as statistics, just like list all references. Whether expensive is related with underlay implementation. For directory walker, maybe not a expensive thing?

@wking
Copy link
Contributor Author

wking commented Nov 18, 2016

On Thu, Nov 17, 2016 at 10:46:44PM -0800, Ye Yin wrote:

IMO, List is a very basic API, GC is one case. Get all digests
is very common requirement, such as statistics, just like list all
references…

Listing all references is a necessary evil ;). To collect statistics,
you can use the ref-engine's List to walk refs (or subsets thereof),
and then walk the Merkle DAG down from each ref to discover blobs. If
you want a particular blob to show up in lists (and not be garbage
collected), just add a new ref to pin it. I don't see a use case for
listing blobs in the absence of refs outside of GC.

Whether expensive is related with underlay implementation. For
directory walker, maybe not a expensive thing?

For a small, local, directory-backed CAS engine, blob-list performance
would be acceptable. However, my goal is to write a generic API so
higher-level tools can manipulate CAS without having to care about the
engine implementation. If your CAS engine involves millions of blobs
accessed via a distributed hash table or sharded across many hosts,
walking significant chunks of that is going to be expensive. And
garbage collection is likely to be more sophisticated than
stop-the-world mark and sweep. Unless we have a very convincing use
case for all CAS engines to provide a List API, I'd rather avoid it.

And again, the CAS API I'm proposing here covers all the use-cases we
currently have inside image-tools. If we land this without List and
later decide that List is necessary, a follow-up PR can add it later.
It should only matter for this PR if your argument is that you need
List for a use-case this repository already addresses.

@xiekeyang
Copy link
Contributor

xiekeyang commented Nov 18, 2016

@wking I'm reading it but not debug yet.

Can we use Get(ctx context.Context, desc descriptor), to instead Get(ctx context.Context, digest string). I select descriptor as parameter in image/reader.go because it bring helpful information, like:
if hdr.Name == filepath.Join("blobs", desc.algo(), desc.hash()) instead your:

targetName := fmt.Sprintf("./blobs/%s/%s", algorithm, hash)
if header.Name == targetName{
}

Some variables in Get() is likely superfluous.

@wking
Copy link
Contributor Author

wking commented Nov 18, 2016

On Fri, Nov 18, 2016 at 12:20:55AM -0800, xiekeyang wrote:

Can we use Get(ctx context.Context, desc descriptor), to instead
Get(ctx context.Context, digest string).

I've avoided going with the image-spec Descriptor because while size
information may be useful (e.g. you may want to compare the expected
size against the retrieved blob), the media type information is not
useful (the CAS engine should not care or need to track the types of
stored blobs). I'd really like to avoid the image-tools descriptor,
because:

  • It's currently private, so I can't use it from image/cas.
  • Converting between image-spec and image-tools descriptor types is
    an unnecessary headache for users.
  • The desc.algo() and desc.hash() helpers are only really interesting
    to CAS-engine implementers. For everyone else, the digest might as
    well be an opaque string.

In a later #5 commit I add blobPath 1 and refPath 2 helpers to
abstract these out. If it would make you more comfortable with this
commit I can move those definitions over here.

@xiekeyang
Copy link
Contributor

In a later #5 commit I add blobPath [1] and refPath [2] helpers to
abstract these out. If it would make you more comfortable with this
commit I can move those definitions over here.

Good!

@xiekeyang
Copy link
Contributor

@wking

Also add a new oci-cas command so folks can access the new read functionality from the command line.

I think CAS engine interface is great, but I'm not sure that command oci-cas get is useful. Do we really need this command? (I might misunderstand the new read functionality)

@wking
Copy link
Contributor Author

wking commented Nov 18, 2016

On Fri, Nov 18, 2016 at 03:34:16AM -0800, xiekeyang wrote:

In a later #5 commit I add blobPath [1] and refPath [2] helpers to
abstract these out. If it would make you more comfortable with
this commit I can move those definitions over here.

Good!

Is that “yes, squash those into this commit”? Or is it “ah, then this
commit is ok because after we chew through the rest of #5 we'll be in
a good place”?

@xiekeyang
Copy link
Contributor

Is that “yes, squash those into this commit”? Or is it “ah, then this
commit is ok because after we chew through the rest of #5 we'll be in
a good place”?

Sorry, I mean could you please squash those of blobPath and refPath to this commit?

@wking
Copy link
Contributor Author

wking commented Nov 18, 2016

On Fri, Nov 18, 2016 at 03:40:35AM -0800, xiekeyang wrote:

I think CAS engine interface is great, but I'm not sure that command
oci-cas get is useful.

As #5 continues, oci-cas fills in to cover the rest of the interface
(put and delete). Again, I'm happy to shuffle things around if you'd
like all of that pulled into this commit, or all of the command line
stuff pushed out into a later commit.

Do we really need this command?

“Really need” is hard to put a finger on. None of our Go code shells
out to oci-cas ;). But the command line wrappers are easy to write,
and I like providing a non-Go API for folks who are interested in
interacting with OCI-compatible CAS engines. I certainly don't think
oci-cas is going to be the highest-level command line tool we offer,
but I don't see a reason to not offer it. For an analogy in another
plumbing-rich context, see Git's cat-file 1 and hash-object 2.

@wking
Copy link
Contributor Author

wking commented Nov 18, 2016

On Fri, Nov 18, 2016 at 03:57:19AM -0800, xiekeyang wrote:

Sorry, I mean could you please squash those of blobPath and
refPath to this commit?

blobPath shifted in with 23d8fb7fbae6d9. The ref interface is in a
follow-up commit, but I've adjusted all of #5 to take this
early-*Path-helper approach if you want to check that out.

@xiekeyang
Copy link
Contributor

xiekeyang commented Nov 23, 2016

In general I agree this PR. cc @stevvooe, WDYT?

@stevvooe
Copy link
Contributor

@xiekeyang I don't think this is the right model. It doesn't have a transaction interface for verifying and setting blobs. Half of the interface is useless for the application of tar files and I am not sure what problems this solves for image-tools.

For this to be useful, we'd have to refactor the entire thing. As I said when this PR was in image, please use the docker/distribution repo, and the myriad other Go CAS projects, to inform the API.

@wking
Copy link
Contributor Author

wking commented Dec 1, 2016 via email

@wking wking mentioned this pull request Jan 20, 2017
@bsatlas
Copy link

bsatlas commented Feb 16, 2017

What's the progress on this? I could really use this right about now.

@stevvooe
Copy link
Contributor

@PointMicro This approach has some major design problems. What are you trying to do exactly? Maybe I can point your towards something useful.

@wking
Copy link
Contributor Author

wking commented Feb 16, 2017 via email

@stevvooe
Copy link
Contributor

I'm still not clear on what these design problems are, but I'm happy
to fix them if you point them out.

And I am not going to sit down and reiterate them again. I have provided several examples, there are multiple projects to draw from (umoci, containerd, docker/distribution, camlistore, etc.) that all have addressed these issues.

Furthermore, it treats tar archives as a random access storage medium and makes multiple poor assumptions.

@wking
Copy link
Contributor Author

wking commented Feb 16, 2017 via email

@stevvooe
Copy link
Contributor

@wking The problem here is that you are too focused on "image layout" which is a minor part of the image format.

Or are you fine with my API, and just concerned with the
implementation?

No. You're API is naive, racy and memory intensive.

I don't see anyone else implementing tar-backed
image-layout support, but please point me at them if I've missed them.

Exactly. This makes no sense. The only way to do this with a CAS store would be to have a transactional addition of blobs. Effectively, you have a transaction for each blob addition, then a transaction that creates the tar file.

@wking
Copy link
Contributor Author

wking commented Feb 16, 2017 via email

@wking
Copy link
Contributor Author

wking commented Jul 27, 2017

I've just pushed 5df5883fe6a35b rebasing this work onto master, since @q384566678 expressed renewed interest in #5, of which this PR is the next component. I didn't make any changes to image/cas, and most of the rebase was adjusting to the oci-image-tools consolidation from #103.

@wking
Copy link
Contributor Author

wking commented Jul 27, 2017

And I've just pushed fe6a35bcda19fe fixing some go fmt issues and replacing the deprecated os.SEEK_SET with io.SeekStart to fix this lint error.

And implement that interface for tarballs based on the specs
image-layout.  I plan on adding other backends and methods later, but
this is enough for a proof of concept getter.

Also add a new 'oci-image-tool cas get ...' subcommand so folks can
access the new read functionality from the command line.

In a subsequent commit, I'll replace the image/walker.go functionality
with this new API.

The Context interface follows the pattern recommended in [1], allowing
callers to cancel long running actions (e.g. push/pull over the
network for engine implementations that communicate with a remote
store).  Passing a Context instance along to NewEngine gives us a way
to cancel engine initialization which happens to take too long.
That's unlikely to happen for NewTarEngine, but it's easy enough to
pass it on down just in case.

blobPath's separator argument will allow us to use
string(os.PathSeparator)) once we add directory support.

[1]: https://blog.golang.org/context

Signed-off-by: W. Trevor King <[email protected]>
@wking
Copy link
Contributor Author

wking commented Jul 27, 2017

I've pushed cda19fe0c9e84e fixing a missing error check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants