This repository has been archived by the owner on Feb 17, 2022. It is now read-only.

Schedule automatic Cassandra backup #29

Open
seanshahkarami opened this issue Sep 5, 2017 · 3 comments

Comments

@seanshahkarami
Member

seanshahkarami commented Sep 5, 2017

Since we're about to start a lot of work on beehive, we should make sure we have a Cassandra backup process in place.

I went ahead and built a tool to pull datasets; we just need to schedule it and have a place to keep the backups: https://github.com/waggle-sensor/beehive-server/tree/master/data-exporter

The missing half right now is a complementary restore script, but at least we have the raw data available.
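A rough sketch of the scheduling piece, assuming a cron-driven wrapper around the exporter (the `export_datasets.py` command and the backup path below are placeholders, not the actual data-exporter entrypoint):

```python
# Placeholder wrapper around the data-exporter, meant to be run nightly from cron.
# The exporter command and backup location below are stand-ins for the real ones.
import datetime
import pathlib
import subprocess

BACKUP_ROOT = pathlib.Path("/mnt/backups/cassandra")  # placeholder location

def run_export():
    # One dated directory per run so old snapshots can be kept or pruned easily.
    dest = BACKUP_ROOT / datetime.date.today().isoformat()
    dest.mkdir(parents=True, exist_ok=True)
    # Placeholder invocation; the real exporter may take different arguments.
    subprocess.run(["python", "export_datasets.py", "--output", str(dest)], check=True)

if __name__ == "__main__":
    run_export()
```

A crontab entry along the lines of `0 3 * * * python /path/to/nightly_backup.py` would then take care of the actual scheduling.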

@seanshahkarami
Member Author

seanshahkarami commented Sep 6, 2017

The main thing blocking this is choosing a reliable storage location. One interesting point: if we're willing to consider S3 as a redundancy location, we automatically get an access-controlled, directory-like interface to the data stored there. In other words, we could go ahead and start pointing some people to it for pulling datasets.
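For example, pushing an export up to S3 with boto3 could look something like the sketch below (the bucket name and layout are just assumptions); the same bucket, with suitable IAM policies, is what would give people that access-controlled, directory-like view:

```python
# Sketch: mirror an exported dataset directory into S3, keeping the directory
# layout as the key prefix so the data stays browsable. Bucket name is made up.
import pathlib
import boto3

s3 = boto3.client("s3")

def upload_export(export_dir, bucket="beehive-backups"):
    export_dir = pathlib.Path(export_dir)
    for path in export_dir.rglob("*"):
        if path.is_file():
            key = str(path.relative_to(export_dir))
            s3.upload_file(str(path), bucket, key)

upload_export("/mnt/backups/cassandra/2017-09-06")
```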

@seanshahkarami
Member Author

After thinking this over last night, I don't think this is the right thing to have for the real deployment. It was more of an emergency reaction to how beehive1 is currently deployed. Cassandra is set up as a single node, so you don't get any of Cassandra's resilience guarantees during a failure...

Cassandra is designed specifically around replication and eventual consistency between cluster nodes. In production, you'd run a number of nodes in a cluster so you can lose a certain number of them at any time and still keep running without data loss. Building on top of that reliability feature is the right way to go if we're using Cassandra.
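For reference, the replication factor is just a keyspace property. A minimal sketch with the DataStax Python driver (contact points and keyspace name below are made up):

```python
# Minimal sketch: create a keyspace whose data is replicated across 2 nodes.
# With replication_factor = 2, every row lives on two nodes, so losing any
# single node doesn't lose data.
from cassandra.cluster import Cluster

cluster = Cluster(["node1", "node2", "node3"])  # placeholder contact points
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS beehive_test
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
```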

@seanshahkarami
Member Author

seanshahkarami commented Sep 28, 2017

Turns out clustering works beautifully in the example I tried... It only took 10 minutes to set up a 3-node cluster on my own machine and load it with some test data. Using a keyspace with a replication factor of 2, things worked as expected: any single node could be taken completely offline and I still had access to the entire dataset. Just something to think about for the production deployment...
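Roughly how that availability check can be reproduced (contact points, keyspace, and table name below are placeholders, not beehive's schema): with one of the three nodes stopped, reads at consistency ONE through the surviving nodes still return everything.

```python
# Sketch: with one node of the 3-node cluster offline, read the table at
# consistency level ONE through the remaining nodes.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["node1", "node2"])      # third node intentionally offline
session = cluster.connect("beehive_test")  # keyspace from the sketch above

query = SimpleStatement(
    "SELECT COUNT(*) FROM sensor_data",    # placeholder table name
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(query).one())
```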
