-
Notifications
You must be signed in to change notification settings - Fork 6
Add backup and restore for bootstrapping #75
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a good approach -- do a backup on a schedule and POST it to an endpoint, which then would take care of the rest. Other than one thing I saw in the code I have a few ideas.
First, I see that the changeservers do the backup on a schedule. What if the backup attempt fails? Can each one log the error and try again in less time using an exponential backoff? That way we might survive a temporary glitch better.
Same on a boot of a new changeserver -- if it can't get the backup, what does it do?
I also see that the backup server is trying to manage storage by having a bunch of separate files. GCP isn't like a regular filesystem -- you can overwrite the same "file" and the whole thing will either change atomically or fail. So maybe we only just need one GCS "file" that we keep overwriting.
(Also maybe we should checksum the file first, so we can skip the GCS update if nothing changed since the last backup. Otherwise we keep paying money to overwrite the file over and over. I think that GCS has a way to do this easily.)
Finally, I'm not sure I like the idea of generating a "secret" and putting it in a parameter. At the very least we'll need a way to rotate it. I'm not sure what else to do though -- we'll have to think about it. (One thing we can do is use GCP IAM for this.)
I think this is a clever usage of Cloud Functions. Is there some way that we can unit-test that code using regular node?
bootstrap/bootstrap.go
Outdated
if err != nil { | ||
return err, false | ||
} | ||
defer req.Body.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean "res.body.Close()" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably. :) Thanks.
Yeah, I intentionally just allowed the startup to fail if was told to restore, but can't. It could to a retry... or some external process could do a retry. As for the backup, yes, as backoff would definitely make sense. What you see with the backup server isn't it using different files for different versions... it will actually overwrite the same file as you suggest. The ability to have multiple file IDs is to allow more than one cluster of change server (ie. with different tenant information) to backup to the same bucket. But, yes, a checksum could potentially be used to avoid rewriting the file... or perhaps the change servers could realize that they have nothing new to backup and not even try. Finally, I agree the "secret" thing isn't a fantastic solution. It was just a hack to enable some kind of basic level of authz. I'm not sure of the correct solution here. |
Still playing with an automated presubmit check. It failed because this branch is based on an older master -- not a real failure yet. |
Scott -- I haven't merged this because it says "not ready yet." Do we have a plan to get it ready? |
It still needs unit tests at a minimum. Also probably needs a more thoughtful approach to authz. There is no plan to finish this at present. |
Submitted for feedback
Untested. Don't merge.