-
Notifications
You must be signed in to change notification settings - Fork 503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing an external storage driver #1551
Conversation
Ref #1539, where the discussion of storage drivers came up most recently. cc/ @marwan-at-work @wburningham since you've been in discussion on this issue This is a draft PR currently with just the gRPC definitions. I'd like some feedback on how the API looks before I go implement anything more. Thanks all! 😄 |
proto/external_storage.proto
Outdated
|
||
service ProxyStorage { | ||
|
||
rpc GetModule (ModuleMetadata) returns (ModuleData) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an expert in proto. But does this mean @info
or @mod
will download zip_file
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that matched the interface, at the time (before #406 landed).
This service definition would need to be changed to match the storage interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that Athens never downloads module@version metadata without also downloading the zip file. That said, does it make more sense to split this int GetModuleZip
and GetModuleMetadata
?
proto/external_storage.proto
Outdated
message ModuleData { | ||
ModuleMetadata metadata = 1; | ||
string mod_file = 2; | ||
string info = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should info be a json?
message ModuleInfo {
string version = 1;
string timstamp =2;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it should. Thanks @xytan0056 !
@Kunde21 @xytan0056 I updated the interface according to your comments. Do you mind taking another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, left you a comment for CI diff check
@@ -0,0 +1,555 @@ | |||
// Code generated by protoc-gen-go. DO NOT EDIT. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Should we add a CI check that diffs
the commited auto-generated with the one that is getting generated by the proto file on CI. So this way we know that auto-generate code that we commit is up to date with what is expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@geototti21 good idea, yes I think we should
@arschles I see this is still in draft, lemme know if you want me to give this a review now or later :) |
@marwan-at-work yes please! I'd really like your input on the interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of design questions off the bat.
But I have a few thoughts (that I'm not sure about them myself just yet):
-
Why gRPC in particular? The only reason I can think of is that we would then be able to provide a new "library" for them to import and implement the interface it exposes...I kinda like that, but I wonder if we can do the same thing without having to go with gRPC. This way people don't have to stick to gRPC which is http/2 only...people can implement a storage in good old http, which will work nicely behind aws/gcp load balancers and whatnot.
-
What about "extending" the download protocol? The download protocol already gives us 4/6 of these operations, so why not extend it with POST equivalents to persist things:
POST /{module}/@v/{version}.info
and so on and so forth.
Anyway, I'm just brainstorming as I write, but nice work so far!
|
||
// ModuleZip is the zip file of a module's source code | ||
message ModuleZip { | ||
bytes zip_file = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be a big performance problem? Go by default loads 10 dependencies concurrently as far as I recall, and if you have multiple go clients, this can quickly become a problem.
Would be interesting to test it against a couple of concurrent go mod download
s and see how it performs.
The only option I see is to use the stream
clause
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marwan-at-work if the problem you're talking about is how many concurrent calls to GetZip
Athens is making, I think we can use GoGetWorkers
and the accompanying machinery to limit the number of times we hit that endpoint. The other side could also decide how many concurrent calls that it wants to serve concurrently.
I don't think we get any worse from a memory usage perspective if we limit concurrent calls to the GetZip
like we do with go mod download
. Streaming definitely doesn't hurt us though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the problem you're talking about is how many concurrent calls to GetZip Athens is making
That's just one of the problems :) I think the biggest one is the fact that we'd have to completely load the module zip in memory on its way in and on its way out. Even if we limit save/get operations, then the Go command will time out because it would simply take too long to try and process less than 10 gets at a time.
And removing redundant exists field. Just using the error to indicate existance
@marwan-at-work I have some answers 😄 For (1), I chose gRPC because its code generation (for servers) & versioning come for free. If we end up streaming per #1551 (comment), then that will be another reason (we wouldn't need to roll a websocket solution ourselves). There are other code generation options, but gRPC is the most widely used and supported that I know of. Second to gRPC would be Swagger/OpenAPI, which Kubernetes uses and, from my understanding, has had to hack a bit to get it to work for its purposes. I haven't seen any major cloud load balancers that don't support HTTP/2. If someone has on-prem LB's that don't, we can include instructions in our docs on how to use grpc-web with their server. For (2), the major reason I didn't roll a custom solution is because gRPC more widely supported and understood than our homegrown HTTP client/server would be. And at the risk of sounding redundant, the code generation will end up saving us a lot of time when people start building servers 😄 |
We wouldn't need to do websockets for streaming. You can still stream with pure http in the same way that the Go Download Protocol does for zips, and we can use multi-part for sending files up. Definitely don't want you to change this PR though to completely get rid of gRPC, we can potentially jam on both ideas and see which one provides a better user experience :) My hope is that users wouldn't have to learn something completely new to them just in order to implement a storage for Athens. So what we can also do, whether we go gRPC or http, is provide some helper functions where we'd hide away the transport itself. |
Long response ahead, sorry in advance 😄
Oh, I misunderstood your suggestion. I'd be cool with doing the the as-is HTTP chunked streaming as we do now, yes! I'm also cool with streaming server -> client with gRPC (the docs for client streaming show roughly how we could do it) if we go this route.
I'm with you there. I'll ditch gRPC in an instant if it ends up being hard on server authors (I am going to be one of those authors right after this gets merged 😉) I just auto-generated the server stub and checked it into The downside of all of this is that gRPC is fairly easy to learn and the tooling ecosystem is decent, but at the end of the day, it's still new underlying technology for most people. Obviously I think the tradeoff is worth it. @Kunde21 @tylerchr @marpio @xytan0056 @geototti21 you have all either built alternative backends for Athens, or are involved with this topic in some other way. Would any of you be up for giving feedback on how comfortable you'd be writing and running your own gRPC-based Athens backend?
I'm cool with that. Would you be up with PRing something small showing what you're thinking for the HTTP protocol? I appreciate you all talking this through with me! 💚 |
What is the problem I am trying to address?
Over time, requests have come in for a storage driver for "X". Over time, the Athens codebase and config has grown as we've added more drivers
How is the fix applied?
I've added a new "external" storage driver that uses a gRPC API to communicate with any external server for storage. This is so that anybody can generate gRPC stubs for their language of choice, build their own storage driver, and then configure Athens to use it. This can be done without compiling additional code into Athens.
Mention the issue number it fixes or add the details of the changes if it doesn't have a specific issue.
We have had considerable discussion and PRs on this topic (external storage drivers) over time, and I am hoping that this PR can move us toward a solution 😄
Closes #1353
Fixes #1110
Fixes #1130
Fixes #1131