Shoots is a dataframe storage server for Pandas.
- Designed with ease of use for Pandas users as the primary design goal.
- Supports multiple clients reading and writing simultaneously.
- Has built in functions for resampling.
- Uses Apache Parquet files for efficiency on disk.
- Uses Apache Arrow in memory for efficiency in memory and data transfers.
Shoots is hosted on GitHub. Issues and contributions are welcome.
The server tries to be a fairly faithful Apache Arrow Flight server, meaning that you should be able to use the Apache Arrow Flight client libraries directly. It is built entirely upon the upstream Apache Arrow project.
The client pieces wrap the Apache FlightClient to offer an interface for pandas developers, abstracting away the Apache Arrow and Flight concepts.
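Because the server speaks Arrow Flight, a plain pyarrow client should be able to talk to it without the Shoots client. The following is a minimal sketch, not part of Shoots: flight_uri and list_remote_flights are hypothetical helper names, and it assumes the server answers the standard Flight list_flights call on the default port.

```python
def flight_uri(host, port, use_tls=False):
    """Build an Arrow Flight URI; grpc+tls is the scheme pyarrow uses for TLS."""
    scheme = "grpc+tls" if use_tls else "grpc+tcp"
    return f"{scheme}://{host}:{port}"


def list_remote_flights(host="localhost", port=8081):
    """Return (descriptor, schema) pairs advertised by a running Flight server."""
    # Imported lazily so the URI helper works even without pyarrow installed.
    import pyarrow.flight as flight

    client = flight.connect(flight_uri(host, port))
    try:
        # How Shoots names its descriptors is not specified here, so they are
        # returned unmodified for inspection.
        return [(info.descriptor, info.schema) for info in client.list_flights()]
    finally:
        client.close()
```

With a server running locally, list_remote_flights() would return whatever the server advertises; for everyday use the ShootsClient wrapper described below is the simpler route.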
There is a PyPI package, so you can install Shoots with pip (or pip3):
pip3 install shoots
This lets you run the Shoots server directly from the command line, with any of the arguments documented below:
shoots-server
Running the server is a simple matter of running the Python module; the interpreter name depends on your system:
python shoots_server.py
or
python3 shoots_server.py
ShootsServer supports the following CLI arguments:
- --port: Port number to run the Flight server on. Defaults to 8081.
- --bucket_dir: Path to the bucket directory. Defaults to ./buckets.
- --host: Host IP address for where the server will run. Defaults to localhost.
For example, to run on localhost, but for a different port and bucket directory:
python3 shoots_server.py --port=8082 --bucket_dir="/foo/bar"
To enable TLS on the server, provide an SSL certificate and key.
- --cert_file: Path to the certificate file for TLS. Defaults to None.
- --key_file: Path to the key file for TLS. Defaults to None.
To enable JWT-based authentication, provide a secret string:
- --secret: A secret string used to generate a JWT and to authorize clients that present that JWT. TLS must be enabled.
These options can also be set via environment variables.
SHOOTS_PORT
SHOOTS_BUCKET_DIR
SHOOTS_HOST
SHOOTS_CERT_FILE
SHOOTS_KEY_FILE
SHOOTS_SECRET
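As a sketch of how these variables might be used, the snippet below builds an environment for the server and launches it as a subprocess. It assumes shoots-server is on your PATH (as after the pip install above); shoots_env and launch_server are hypothetical helper names, not part of Shoots.

```python
import os
import subprocess


def shoots_env(port=8081, bucket_dir="./buckets", host="localhost", base_env=None):
    """Return a copy of the environment with the Shoots variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "SHOOTS_PORT": str(port),  # environment values must be strings
        "SHOOTS_BUCKET_DIR": bucket_dir,
        "SHOOTS_HOST": host,
    })
    return env


def launch_server(env):
    """Start shoots-server with the given environment (assumes it is on PATH)."""
    return subprocess.Popen(["shoots-server"], env=env)
```

Calling launch_server(shoots_env(port=8082, bucket_dir="/foo/bar")) would then mirror the command line example above without passing any arguments.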
You can also start up the server in Python. It is best to start it on a thread or you won't be able to cleanly shut it down.
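import threading
from pyarrow.flight import Location
from shoots_server import ShootsServer  # assumes the class lives in shoots_server.py
location = Location.for_grpc_tcp("localhost", 8081)
server = ShootsServer(location, bucket_dir="/foo/bar")  # bucket_dir is optional
# Run the server on its own thread so the main thread stays free to shut it down.
server_thread = threading.Thread(target=server.run)
server_thread.start()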
To run the server with TLS enabled, start it with both a certificate and a key. Pass the contents of the certificate and key as a tuple of strings.
# cert_file and key_file are paths to the PEM files
with open(cert_file, 'r') as cert_file_content:
    cert_data = cert_file_content.read()
with open(key_file, 'r') as key_file_content:
    key_data = key_file_content.read()
server = ShootsServer(location,
                      bucket_dir="/foo/bar",
                      certs=(cert_data, key_data))
server.start()
Note that if you are using a self-signed certificate, you should create the certificate and key from a root certificate that can be shared with the client, so that the client can verify that the server responding is the one it expects.
The server can require a JWT from the client in order to authenticate it. This requires TLS to be enabled so that the JWT is not sent in clear text. To enable JWT authentication, supply a secret to the server and generate a JWT; the client can then use that token to authenticate with the server.
server = ShootsServer(location,
                      bucket_dir="/foo/bar",
                      certs=(cert_data, key_data),
                      secret="some_secret_to_generate_a_token")
token = server.generate_admin_jwt() # give the token to the client
server.start()
Note that a token will also be generated and printed to standard output at startup.
See below for how to create a client that uses the token.
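To illustrate why the secret stays on the server while only the token travels to the client, here is a minimal sketch of signing and verifying a JWT using only the standard library. It is not Shoots' implementation: the HS256 algorithm, the claim contents, and the helper names sign_hs256 and verify_hs256 are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _signature(signing_input: str, secret: str) -> str:
    digest = hmac.new(secret.encode("utf-8"), signing_input.encode("utf-8"),
                      hashlib.sha256).digest()
    return _b64url(digest)


def sign_hs256(claims: dict, secret: str) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [_b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, claims)]
    signing_input = ".".join(parts)
    return signing_input + "." + _signature(signing_input, secret)


def verify_hs256(token: str, secret: str) -> bool:
    """Recompute the signature with the secret and compare in constant time."""
    try:
        signing_input, signature = token.rsplit(".", 1)
    except ValueError:
        return False  # no signature segment at all
    expected = _signature(signing_input, secret)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Anyone holding the token can present it, but only a party holding the secret can mint or validate one, which is why the token should only ever be sent over TLS.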
Shoots supports a shutdown action. You can call it from the Shoots client:
from shoots_client import ShootsClient
shoots = ShootsClient("localhost", 8081)
shoots.shutdown()
On the server object itself, the shutdown function handles the threading, so you can simply call it:
server.shutdown()
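When the server is started from Python, it can be convenient to tie the background thread and the shutdown call together. The context manager below is one possible sketch: running_in_background is a hypothetical helper, and it relies only on the run() and shutdown() methods shown above.

```python
import threading
from contextlib import contextmanager


@contextmanager
def running_in_background(server, join_timeout=10):
    """Run server.run() on a thread and always call server.shutdown() on exit."""
    thread = threading.Thread(target=server.run, daemon=True)
    thread.start()
    try:
        yield server
    finally:
        # shutdown() is called even if the body raised, then we wait for run() to return.
        server.shutdown()
        thread.join(timeout=join_timeout)
```

Usage would look like "with running_in_background(ShootsServer(location)) as server: ...", with client calls made inside the block.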
ShootsClient requires a host name and port number:
shoots = ShootsClient("localhost", 8081)
You can enable TLS by setting the TLS argument to True:
shoots = ShootsClient("localhost", 8081, True)
If the server is using a self-signed certificate for TLS, you need to provide the root certificate to the client. You do this by passing in the certificate as a string.
root_cert = ""
with open(path_to_root_cert) as root_cert_file:
    root_cert = root_cert_file.read()
shoots = ShootsClient("localhost", 8081, True, root_cert)
If the server requires token authentication, then assuming you have the token, use it when you create the client object.
client = ShootsClient("localhost",
port,
True,
root_cert,
token=token)
client.ping()
Use the client library to create an instance of the client, and put() a dataframe. Assuming you are running locally:
from shoots_client import ShootsClient
from shoots_client import PutMode
import pandas as pd
shoots = ShootsClient("localhost", 8081)
df = pd.read_csv('sensor_data.csv')
shoots.put("sensor_data", dataframe=df, mode=PutMode.REPLACE)
You can simply get a dataframe back by using its name:
df0 = shoots.get("sensor_data")
print(df0)
You can also submit a SQL query to bring back a subset of the data:
sql = 'select "Sensor_1" from sensor_data where "Sensor_2" < .2'
df1 = shoots.get("sensor_data", sql=sql)
print(df1)
Shoots uses Apache DataFusion to execute SQL. The DataFusion SQL dialect is well documented.
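For pandas users, it may help to see what the query above selects. The sketch below applies the equivalent filter locally to a small made-up dataframe (the values are illustrative, not real sensor data). Note that the column names are double-quoted in the SQL because DataFusion lowercases unquoted identifiers.

```python
import pandas as pd

# Illustrative data only; the column names match the query above.
df = pd.DataFrame({
    "Sensor_1": [0.9, 0.4, 0.7],
    "Sensor_2": [0.1, 0.5, 0.15],
})

# Local equivalent of: select "Sensor_1" from sensor_data where "Sensor_2" < .2
subset = df.loc[df["Sensor_2"] < 0.2, ["Sensor_1"]]
print(subset)
```

The advantage of sending the SQL to the server instead is that only the matching rows and columns are transferred to the client.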
You can retrieve a list of dataframes and their schemas using the list() method.
results = shoots.list()
print("dataframes stored:")
for r in results:
    print(r["name"])
dataframes stored:
sensor_data
You can delete a dataframe using the delete() method:
shoots.delete("sensor_data")
You can resample (aka "downsample") dataframes on the server either by sending SQL for any arbitrary dataframe, or by sending a resampling rule for a time series dataframe.
self.client.resample(source="my_source_dataframe",
target="my_resampled_dataframe",
sql="SELECT * FROM my_source_dataframe LIMIT 10",
mode=PutMode.APPEND)
self.client.resample(source="my_source_dataframe",
target="my_resampled_dataframe",
rule="10s",
time_col="timestamp",
aggregation_func="mean",
mode=PutMode.APPEND)
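The rule, time_col, and aggregation_func parameters read like the arguments to pandas' own resampling. Assuming the server-side behavior mirrors DataFrame.resample (an assumption; it is not confirmed here), the time series call corresponds roughly to the local sketch below. resample_locally is a hypothetical helper and the data is made up.

```python
import pandas as pd

# Four samples taken 5 seconds apart (illustrative data).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00",
        "2024-01-01 00:00:05",
        "2024-01-01 00:00:10",
        "2024-01-01 00:00:15",
    ]),
    "value": [1.0, 3.0, 5.0, 7.0],
})


def resample_locally(frame, rule, time_col, aggregation_func):
    """Bucket rows into fixed windows on time_col and aggregate each window."""
    return frame.resample(rule, on=time_col).agg(aggregation_func)


resampled = resample_locally(df, rule="10s", time_col="timestamp",
                             aggregation_func="mean")
print(resampled)
```

Here the four 5-second samples collapse into two 10-second rows whose values are the means of each window (2.0 and 6.0).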
You can organize your dataframes in buckets. This is essentially a directory where your dataframes are stored.
Buckets are implicitly created as needed if you use the "bucket" parameter in put():
shoots.put("sensor_data", dataframe=df, mode=PutMode.REPLACE, bucket="my-bucket")
df1 = shoots.get("sensor_data", bucket="my-bucket")
print(df1)
You can use the buckets() method to list available buckets:
print("buckets:")
print(shoots.buckets())
buckets:
['my-bucket', 'foo']
You can delete buckets with the delete_bucket() method. You can force deletion of all the dataframes contained in a bucket by using BucketDeleteMode.DELETE_CONTENTS; otherwise you need to delete all of the dataframes first:
print("buckets before deletion:")
print(shoots.buckets())
shoots.delete_bucket("my-bucket", mode=BucketDeleteMode.DELETE_CONTENTS)
print("buckets after deletion:")
print(shoots.buckets())
buckets before deletion:
['my-bucket', 'foo']
buckets after deletion:
['foo']
There are tests for running in insecure mode (no TLS or JWT), TLS only (no JWT), and JWT (also with TLS). The tests are designed to run without extra dependencies.
To run the tests, navigate to the project directory, add the project directory to your python path, and run the tests (from the project directory):
$ export PYTHONPATH="/path/to/shoots:$PYTHONPATH"
$ python3 tests/run_tests.py
This will run all 3 test cases in parallel.
To run a single test, you can drop into the tests directory and run the test directly:
$ cd tests
tests $ python3 -m unittest tls_test.TLSTest
There is an additional test for large data sets which is, by necessity, slow. As such, it is not included in the run_tests.py program. To run this test:
tests $ python3 -m unittest large_datasets_test.LargeDatasetsTest
This edition of the code is licensed under the MIT license.
I intend to work on the following in the coming weeks, in no particular order:
- add a runtime option for the root bucket directory, use it for testing
- pip packaging
- pattern matching for list()
- downsampling via sql on the server
- combining dataframes on the server
- compressing and cleaning dataframes on the server
- authentication
- UI with SQL tree view browser and editor