Make sure your gcloud credentials have been set up:
gcloud auth application-default login
Download the datamon binary for Mac or for Linux from the Releases Page, or use the shell wrapper described below.
Example:
tar -zxvf datamon.mac.tgz
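As a rough sketch, the extracted binary can then be placed on your PATH and checked (the archive is assumed to extract a binary named datamon, and /usr/local/bin is only a conventional install location; adjust to your setup):

```bash
# extract the release archive (assumed to contain a binary named "datamon")
tar -zxvf datamon.mac.tgz
# move it somewhere on the PATH; /usr/local/bin is only a convention
sudo mv datamon /usr/local/bin/
# verify the install
datamon version
```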
For non-Kubernetes use, it's necessary to supply gcloud credentials.
# Replace the path to the gcloud credential file. Use an absolute path.
% datamon config create --email [email protected] --name "Ritesh H Shukla" --credential /Users/ritesh/.config/gcloud/application_default_credentials.json
Inside a Kubernetes pod, Datamon will use Kubernetes service credentials.
% datamon config create --name "Ritesh Shukla" --email [email protected]
Check the config file; the credential file will not be set in a Kubernetes deployment.
% cat ~/.datamon/datamon.yaml
metadata: datamon-meta-data
blob: datamon-blob-data
email: [email protected]
name: Ritesh H Shukla
credential: /Users/ritesh/.config/gcloud/application_default_credentials.json
Datamon repos are analogous to git repos.
% datamon repo create --description "Ritesh's repo for testing" --repo ritesh-datamon-test-repo
Upload a bundle. The last line of output prints the bundle id (commit hash). If the optional --label is omitted, the commit hash will be needed to download the bundle.
% datamon bundle upload --path /path/to/data/folder --message "The initial commit for the repo" --repo ritesh-test-repo --label init
Uploaded bundle id:1INzQ5TV4vAAfU2PbRFgPfnzEwR
List all the bundles in a particular repo.
% datamon bundle list --repo ritesh-test-repo
Using config file: /Users/ritesh/.datamon/datamon.yaml
1INzQ5TV4vAAfU2PbRFgPfnzEwR , 2019-03-12 22:10:24.159704 -0700 PDT , Updating test bundle
List all the labels in a particular repo.
% datamon label list --repo ritesh-test-repo
Using config file: /Users/ritesh/.datamon/datamon.yaml
init , 1INzQ5TV4vAAfU2PbRFgPfnzEwR , 2019-03-12 22:10:24.159704 -0700 PDT
Download a bundle by either hash
datamon bundle download --repo ritesh-test-repo --destination /path/to/folder/to/download --bundle 1INzQ5TV4vAAfU2PbRFgPfnzEwR
or label
datamon bundle download --repo ritesh-test-repo --destination /path/to/folder/to/download --label init
List all files in a bundle
datamon bundle list files --repo ritesh-test-repo --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml
The --label flag can also be used as an alternate way to specify the bundle in question.
Download a single file from a bundle
datamon bundle download file --file datamon/cmd/repo_list.go --repo ritesh-test-repo --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml --destination /tmp
The --label flag can also be used as an alternate way to specify the particular bundle.
Set a label on an existing bundle.
% datamon label set --repo ritesh-test-repo --label anotherlabel --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml
Uploaded bundle id:1INzQ5TV4vAAfU2PbRFgPfnzEwR
Labels are a mapping from human-readable strings to commit hashes.
There is one such map per repo, so setting a label (or uploading a bundle with a label) that already exists overwrites the commit hash previously associated with that label: at most one commit hash can be associated with a label. Conversely, multiple labels can refer to the same bundle via its commit hash.
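To illustrate these semantics (the bundle ids below are simply the examples from above, reused for illustration):

```bash
# point the label "init" at one bundle
datamon label set --repo ritesh-test-repo --label init --bundle 1INzQ5TV4vAAfU2PbRFgPfnzEwR
# re-point the same label at a newer bundle; the previous association is overwritten
datamon label set --repo ritesh-test-repo --label init --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml
# several labels may refer to the same bundle
datamon label set --repo ritesh-test-repo --label production --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml
```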
Current use of Datamon at One Concern within Argo workflows relies on the Kubernetes sidecar pattern: a shared volume serves as the transport layer for application-level communication to coordinate between the main container, where a data-science program accesses data provided by Datamon and produces data for Datamon to upload, and the sidecar container, where Datamon provides data for access (as hierarchical filesystems, as SQL databases, etc.). After the main container's DAG-node-specific data-science program outputs data (to the shared Kubernetes volume, to a PostgreSQL instance in the sidecar, and so on), the sidecar container uploads the results of the data-science program to GCS.
Ensuring that data is ready for access (sidecar-to-main-container messaging), as well as signaling that the data-science program has produced output data to upload (main-container-to-sidecar messaging), is the responsibility of a few shell scripts shipped as part of the Docker images that constitute the sidecars. While there is exactly one application container per Argo node, a Kubernetes container created from an arbitrary image, sidecars are additional containers in the same Kubernetes pod -- or Argo DAG node, approximately synonymously -- that coordinate Datamon-based data-ferrying with the application container.
Aside: as additional kinds of data sources and sinks are added, we may also refer to "sidecars" as "batteries," and so on as semantic drift of the shell scripts shears away feature creep in the application binary.
There are currently two batteries-included® images:
- gcr.io/onec-co/datamon-fuse-sidecar provides hierarchical filesystem access
- gcr.io/onec-co/datamon-pg-sidecar provides PostgreSQL database access
Both are versioned along with GitHub releases of the desktop binary. To access recent releases listed on the GitHub releases page, use the git tag as the Docker image tag. At the time of writing, v0.7 is the latest release tag, and (with some elisions)
spec:
  ...
  containers:
    - name: datamon-sidecar
      image: gcr.io/onec-co/datamon-fuse-sidecar:v0.7
  ...
would be the corresponding Kubernetes YAML to access the sidecar container image.
Aside: historically, and in case it's necessary to roll back to a now-ancient version of the sidecar image, releases were tagged in git without the v prefix, while Docker tags prepended v to the git tag. For instance, 0.4 is listed on the GitHub releases page, while the tag v0.4, as in gcr.io/onec-co/datamon-fuse-sidecar:v0.4, was used when writing Dockerfiles or Kubernetes YAML to access the sidecar container image.
Users need only place the wrap_application.sh script, located in the root directory of each of the sidecar containers, within the main container. This can be accomplished via an initContainer without duplicating the version of the Datamon sidecar image in both the main application Dockerfile and the YAML.
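For instance, a hedged sketch of such an initContainer (the volume name, destination mount path, and container name are illustrative; only the sidecar image and the script's location at the image root come from the text above):

```yaml
initContainers:
  - name: copy-wrap-application
    image: gcr.io/onec-co/datamon-fuse-sidecar:v0.7
    # copy the wrapper script out of the sidecar image into a shared volume
    command: ["cp", "/wrap_application.sh", "/scripts/wrap_application.sh"]
    volumeMounts:
      - name: scripts
        mountPath: /scripts
volumes:
  - name: scripts
    emptyDir: {}
```

The main container then mounts the same scripts volume and invokes the copied script.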
When using a block-storage GCS product, we might've specified a data-science application's
Argo DAG node with something like
command: ["app"]
args: ["param1", "param2"]
whereas with wrap_application.sh
in place, this would be something to the effect of
command: ["/path/to/wrap_application.sh"]
args: ["-c", "/path/to/coordination_directory", "-b", "fuse", "--", "app", "param1", "param2"]
That is, wrap_application.sh
has the following usage
wrap_application.sh -c <coordination_directory> -b <sidecar_kind> -- <application_command>
where
- <coordination_directory> is an empty directory in a shared volume (an emptyDir using memory-backed storage suffices). Each coordination directory (not necessarily the volume) corresponds to a particular DAG node (i.e. Kubernetes pod) and vice versa.
- <sidecar_kind> is in correspondence with the containers specified in the YAML and may be among
  - fuse
  - postgres
- <application_command> is the data-science application command exactly as it would appear without the wrapper script. That is, the wrapper script relies on the conventional UNIX -- syntax for stating that options to a command are done being declared.
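Putting the pieces above together, a hedged Kubernetes YAML sketch (container and volume names, the application image, and mount paths are all illustrative; the command/args and the memory-backed emptyDir come from the text above):

```yaml
containers:
  - name: main
    image: gcr.io/onec-co/my-data-science-app:latest   # hypothetical application image
    command: ["/path/to/wrap_application.sh"]
    args: ["-c", "/tmp/coord", "-b", "fuse", "--", "app", "param1", "param2"]
    volumeMounts:
      - name: coord
        mountPath: /tmp/coord
  - name: datamon-sidecar
    image: gcr.io/onec-co/datamon-fuse-sidecar:v0.7
    volumeMounts:
      - name: coord
        mountPath: /tmp/coord
volumes:
  - name: coord
    emptyDir:
      medium: Memory   # memory-backed storage suffices for the coordination directory
```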
Meanwhile, each sidecar's datamon-specific batteries have their corresponding usages.
The FUSE sidecar (datamon-fuse-sidecar) provides filesystem representations (i.e. a folder) of datamon bundles.
Since bundles' filelists are serialized filesystem representations,
the wrap_datamon.sh
interface is tightly coupled to that of the self-documenting
datamon
binary itself.
./wrap_datamon.sh -c <coord_dir> -d <bin_cmd_I> -d <bin_cmd_J> ...
- -c the same coordination directory passed to wrap_application.sh
- -d all parameters, exactly as passed to the datamon binary, except as a single scalar (quoted) parameter, for one of the following commands:
  - config sets user information associated with any bundles created by the node
  - bundle mount provides sources for data-science applications
  - bundle upload provides sinks for data-science applications
Multiple bundle mount and bundle upload commands (or none) may be specified, and at most one config command is allowed, so that an example wrap_datamon.sh YAML might be
command: ["./wrap_datamon.sh"]
args: ["-c", "/tmp/coord", "-d", "config create --name \"Coord\" --email [email protected]", "-d", "bundle upload --path /tmp/upload --message \"result of container coordination demo\" --repo ransom-datamon-test-repo --label coordemo", "-d", "bundle mount --repo ransom-datamon-test-repo --label testlabel --mount /tmp/mount --stream"]
or from the shell
./wrap_datamon.sh -c /tmp/coord -d 'config create --name "Coord" --email [email protected]' -d 'bundle upload --path /tmp/upload --message "result of container coordination demo" --repo ransom-datamon-test-repo --label coordemo' -d 'bundle mount --repo ransom-datamon-test-repo --label testlabel --mount /tmp/mount --stream'
The Postgres sidecar (datamon-pg-sidecar) provides Postgres databases as bundles and vice versa.
Since the datamon binary does not include any Postgres-specific notions, the UI here is more decoupled than that of the FUSE sidecar's wrap_datamon.sh. The UI is specified via environment variables such that wrap_datamon.sh is invoked without parameters.
The script looks for precisely one dm_pg_opts
environment variable
specifying global options for the entire script and any number of
dm_pg_db_<db_id>
variables, one per database.
Aside on the serialization format: each of these environment variables contains a serialized dictionary according to the following format
<entry_separator><key_value_separator><entry_1><entry_separator><entry_2>...
where <entry_separator> and <key_value_separator> are each a single character, anything other than a ., and each <entry> is of one of two forms, either <option> or <option><key_value_separator><arg>.
So for example
;:a;b:c
expresses something like a Python map
{'a': True, 'b' : 'c'}
or shell option args
<argv0> -a -b c
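As an illustration only (this is not the sidecar's actual parsing code), the format can be decoded along these lines:

```bash
#!/usr/bin/env bash
# decode ";:a;b:c" into key/value pairs: prints "a=true" and "b=c"
serialized=';:a;b:c'
entry_sep=${serialized:0:1}   # first character is the entry separator
kv_sep=${serialized:1:1}      # second character is the key/value separator
body=${serialized:2}
IFS="$entry_sep" read -r -a entries <<< "$body"
for entry in "${entries[@]}"; do
  if [[ "$entry" == *"$kv_sep"* ]]; then
    echo "${entry%%"$kv_sep"*}=${entry#*"$kv_sep"}"   # <option><key_value_separator><arg>
  else
    echo "$entry=true"                                # bare <option>
  fi
done
```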
Every database created in the sidecar corresponding to a dm_pg_db_<db_id>
env var is uploaded to datamon and optionally initialized by a previously
uploaded database.
The opts available to specify in the above serialization format, which affect the availability of the database from the application container and the upload of the database to datamon, are
- p : IP port used to connect to the database
- m : message written to the database's bundle
- l : label written to the bundle
- r : repo containing the bundle
- sr : repo containing the source bundle
- sl : label of the source bundle
- sb : source bundle id
Meanwhile, dm_pg_opts uses the options (a sketch of both variables follows this list)
- c : the <coord_dir>, as in the FUSE sidecar
- V : whether to ignore a Postgres version mismatch, either true or false (for internal use)
- S : without an <arg>, causes the wrapper script to sleep instead of exiting, which can be useful for debugging
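Combining the lists above, a hedged sketch of how these variables might be set (the separator characters follow the earlier example; the port, repo, label, message, and db id are illustrative; check the sidecar script for authoritative semantics):

```bash
# global options: entry separator ';', key/value separator ':', coordination dir /tmp/coord
export dm_pg_opts=';:c:/tmp/coord'
# one database "mydb": port 5432, destination repo, bundle label, and commit message
export dm_pg_db_mydb=';:p:5432;r:example-pg-repo;l:pg-demo;m:database_backup'
```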
The recommended way to install datamon in your local environment is to use the deploy/datamon.sh wrapper script. This script is responsible for downloading the datamon binary from the Releases Page, keeping a local cache of binaries, and exec-ing the binary. Parameterization of the shell script is therefore the same as parameterization of the binary: the shell script is transparent.
Download the script, set it to be executable, and then try the version
verb in the
wrapped binary to verify that the binary is installed locally. There are several auxiliary programs required by the shell script, such as grep and wget. If these are not installed, the script will exit with a descriptive message, and the missing utility can be installed via brew or otherwise.
curl https://raw.githubusercontent.com/oneconcern/datamon/master/deploy/datamon.sh -o datamon
chmod +x datamon
./datamon version
It's probably most convenient to have the wrapper script placed somewhere on your shell's path, of course.
As with the Kubernetes sidecar guide, this section covers a particular operationalization of Datamon at One Concern wherein we use the program along with some auxiliary programs, all parameterized via a shell script and shipped in a Docker image, in order to periodically back up a shared block store and remove files according to their modify time.
The Docker image is called gcr.io/onec-co/datamon-datamover
and is tagged with
versions just as the Kubernetes sidecar, v<release_number>
, where v0.7
is the first
tag that will apply to the Datamover.
The datamover
image contains two shell wrappers, backup
and datamover
.
Both fulfill approximately the same purpose, backing up files from an NFS share
to datamon. The main difference is that backup
uses standard *nix utils,
while datamover uses an auxiliary util maintained alongside datamon.
Their respective parameters are as follows:
For the backup script (an example invocation follows this list):
- -d : backup directory. Required if -f is not present. This is the recommended way to specify files to back up from a Kubernetes job.
- -f : backup filelist. A list of files to back up.
- -u : unlinkable filelist. When specified, files that can be safely deleted after the backup are written to this list. When unspecified, files are deleted by backup.
- -t : set to true or false in order to run in test mode, which at present does nothing more than specify the datamon repo to use.
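For example (the paths are illustrative; consult the script itself for authoritative usage):

```bash
# back up everything under /nfs/share; files safe to delete afterwards are
# written to /tmp/unlinkable.list instead of being deleted by the script
backup -d /nfs/share -u /tmp/unlinkable.list -t false
```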
For the datamover script (an example invocation follows this list):
- -d : backup directory. Required.
- -l : bundle label. Defaults to datamover-<timestamp>.
- -t : timestamp filter before. A timestamp string in system local time, among several formats, including
  - <Year>-<Month>-<Day> as in 2006-Jan-02
  - <Year><Month><Day><Hour><Minute> as in 0601021504
  - <Year><Month><Day><Hour><Minute><Second> as in 060102150405
  Defaults to 090725000000.
- -f : filelist directory. Defaults to /tmp and is the location to write
  - upload.list, the files that datamon will attempt to upload as part of the backup
  - uploaded.list, the files that have been successfully uploaded as part of the backup
  - removable.list, the files that have been successfully uploaded and that have a modify time before the specified timestamp filter
- -c : concurrency factor. Defaults to 200. Tune this down in case the NFS share is being hammered by too many reads during backup.
- -u : unlink, a boolean toggle. Whether to unlink the files in removable.list as part of the datamover script. Defaults to off/false/not present.
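Similarly, a hedged datamover invocation (the paths, label, and timestamp are illustrative):

```bash
# back up /nfs/share, label the bundle, keep filelists in /tmp,
# and mark files unmodified since 2019-Jan-01 as removable;
# -u is omitted here, so nothing is unlinked
datamover -d /nfs/share -l nightly-backup -t 2019-Jan-01 -f /tmp -c 100
```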
Please file GitHub issues for features desired in addition to any bugs encountered.
Datamon is a data-science tool that helps manage data at scale. The primary goal of Datamon is to allow versioned data creation, access, and tracking in an environment where data repositories and their lifecycles are linked.
Datamon links the various sources of data and how they are processed, and tracks the output/new data that is generated from the existing data.
Datamon is composed of
- Datamon Core
- Datamon Content Addressable Storage
- Datamon Metadata
- Data access layer
- CLI
- FUSE
- SDK based tools
- Data consumption integrations.
- CLI
- Kubernetes integration
- InPod Filesystem
- GIT LFS
- Jupyter notebook
- JWT integration
- ML/AI pipeline run metadata: Captures the end-to-end metadata for ML/AI pipeline runs.
- Datamon Query: Allows introspection on pipeline runs and data repos.
Datamon includes:
- Blob storage: Deduplicated storage layer for raw data
- Metadata storage: A metadata storage and query layer
- External storage: Pluggable storage sources that are referenced in bundles.
For blob and metadata storage, Datamon guarantees geo-redundant replication of data and is able to withstand region-level failures.
For external storage, the redundancy and ability to access can vary depending on the external source.
- Repo: Analogous to git repos. A repo in Datamon is a dataset that has a unified lifecycle.
- Bundle: A point-in-time, read-only view of a repo:branch, composed of individual files. Analogous to a commit in git.
- Label: A name given to a bundle, analogous to tags in git. Examples: latest, production.
Planned features:
- Branch: A branch represents the various lifecycles data might undergo within a repo.
- Runs: ML pipeline run metadata that includes the versions of compute and data in use for a given run of a pipeline.
The data access layer is implemented in 3 form factors:
- CLI: Datamon can be used as a standalone CLI, provided the developer has access privileges to the backend storage. A developer can always set up Datamon to host their own private instance for managing and tracking their own data.
- Filesystem: A bundle can be mounted as a filesystem in Linux or Mac, and new bundles can be generated as well.
- SDK-based tools: Specialized tooling can be written for specific use cases. Example: parallel ingest into a bundle for high scaled-out throughput.
Datamon integrates with Kubernetes to allow for pod access to data and pod execution synchronization based on dependency on data. Datamon also caches data within the cluster and informs the placement of pods based on cache locality.
Datamon will act as a backend for GIT LFS (oneconcern#79).
Datamon allows Jupyter notebooks to read bundles in a repo, process them, and create new bundles based on the data generated.
The Datamon API/tooling can be used to write custom services to ingest large data sets into Datamon. These services can be deployed in Kubernetes to manage the long-duration ingest.
This was used to move data from AWS to GCP.