M3D stands for Metadata Driven Development. It is a cloud- and platform-agnostic framework for the automated creation, management and governance of metadata and data flows from multiple source systems to multiple target systems. The main features and design goals of M3D are:
- Cloud and platform agnostic
- Enforcement of a global data model, including speaking names and business objects
- Governance by conventions instead of maintaining state and logic
- Lightweight and easy to use
- Flexible development of new features
- Stateless execution with minimal external dependencies
- Enable self-service
- Possibility to extend to multiple destination systems (currently AWS EMR)
M3D consists of two components: m3d-api, which we provide in this repo, and m3d-engine, which contains the main logic and uses Apache Spark.
M3D can be used for:
- Creation of data lake environments
- Management and governance of metadata
- Data flows from multiple sources
- Data flows to multiple target systems
- Algorithms as data frame transformations
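To illustrate the last point: conceptually, an algorithm in m3d-engine is a transformation from input DataFrames to output DataFrames. The following Scala sketch is purely illustrative; the trait, object and S3 paths are hypothetical and do not reflect the actual m3d-engine API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical illustration of an "algorithm as a data frame transformation".
// Trait and object names are made up for this sketch.
trait Algorithm {
  def transform(input: DataFrame): DataFrame
}

// A lake-to-lake style algorithm: deduplicate rows by a key column.
object DeduplicateByKey extends Algorithm {
  override def transform(input: DataFrame): DataFrame =
    input.dropDuplicates("order_id")
}

object AlgorithmRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("algorithm-sketch").getOrCreate()
    // Illustrative lake-layer paths; real locations are derived from M3D metadata.
    val input = spark.read.parquet("s3://your-lake-bucket/test/orders")
    DeduplicateByKey.transform(input)
      .write.mode("overwrite")
      .parquet("s3://your-lake-bucket/test/orders_deduplicated")
  }
}
```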
adidas is not responsible for the usage of this software for purposes other than the ones described in the use cases above.
M3D is based on a layered architecture, using AWS S3 buckets as storage and Spark/Scala for processing. Using the M3D API you can create data lake environments in a reproducible way. These are the layers defined in the M3D architecture:
- At the lowest level we have the inbound layer, where raw data is uploaded by source systems. The format of the source data is not fixed and a number of formats are supported by M3D. Only this layer is accessible by external non-M3D governed systems.
- On top of the inbound layer, we have the landing layer, in which archived raw data from the inbound layer is stored together with the metadata that is used for further loading to the lake. It can be used for exploration on the raw files and for reprocessing but does not provide a Hive schema.
- The next layer is the lake layer, where data is persisted in parquet format for consumption by applications. This layer should be accessed using Hive. There are also lake-to-lake algorithms that read from and write to this layer.
- The top layer is the lake-out layer which is a virtual layer for globally standardized semantic names.
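As an illustration of how these layers might map onto storage, the sketch below shows hypothetical S3 locations; the bucket names and prefixes are placeholders, not the actual M3D conventions:

```
s3://your-inbound-bucket/test/data.csv          # inbound: raw uploads from source systems
s3://your-landing-bucket/test/archive/data.csv  # landing: archived raw data plus load metadata
s3://your-lake-bucket/test/test_table/          # lake: parquet data, accessed through Hive
lake_out.test_table                             # lake-out: virtual view with standardized names
```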
Graphically, the architecture of M3D looks like this:

(architecture diagram)

To set up M3D, you will need the following:
- You will need to create four S3 buckets: inbound, landing, lake and application. The latter will contain the jar artifact from the M3D-engine.
- An account for managing clusters in the AWS console.
- A host machine with internet access.
- An access key with permissions to write to the specified buckets and to create/delete EMR clusters.
- Databases for landing, lake and lake_out.
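As a sketch, the buckets from this list could be created with the AWS CLI; the bucket names and region below are placeholders for your own setup:

```bash
# Create the four buckets required by M3D; names and region are illustrative.
aws s3 mb s3://your-inbound-bucket --region eu-west-1
aws s3 mb s3://your-landing-bucket --region eu-west-1
aws s3 mb s3://your-lake-bucket --region eu-west-1
aws s3 mb s3://your-application-bucket --region eu-west-1
```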
The quickest way to get started with the M3D API is to use the GUI installer, which is available for different platforms (Windows/Linux/Mac). With the GUI installer you can set up m3d-api and m3d-engine on a remote host (or on localhost if you have a Unix-based system) and load local tables into an AWS EMR environment right out of the box. This of course requires an active AWS account, which can be created by visiting this link. If you already have an AWS account, make sure to get your access key and secret access key for successful installation and deployment of the environment in EMR. You can go to this repository to build the installer UI for your preferred OS.
After the installation completes, the final steps of the GUI installer are:
- Display sample data to be uploaded from on premises storage to AWS Cloud
- Display the structure of tables in on premises database to be created in AWS cloud
- Create the environment in the AWS Cloud
- Upload the data to an S3 inbound bucket
- Start the EMR cluster
- Execute the FullLoad Spark algorithm contained in m3d-engine to put the data in the lake layer
- Shutdown EMR resources
For advanced users, M3D can also be installed with conda by entering the following command in your terminal: `conda install -c some-channel m3d-api`.
- create_table: Creates a table in the AWS environment based on TCONX Files.
- drop_table: Drops a table in the AWS environment. The files will remain in storage.
- truncate_table: Removes all files of a table from storage.
- create_lake_out_view: Executes an HQL statement to generate a view in the AWS environment.
- drop_lake_out_view: Removes a given view in the AWS environment.
- load_table: Loads a table in AWS from a specified source.
- run_algorithm: Executes an algorithm available in m3d-engine.
- create_emr_cluster: Initializes an EMR cluster in AWS.
- delete_emr_cluster: Terminates an EMR cluster in AWS.
- -function: Name of the function to execute.
- -config: Location of the configuration JSON file. An example configuration file is provided below:
{ "emails": [ "[email protected]" ], "dir_exec": "/tmp/", "python": { "main": "m3d_main.pyc", "base_package": "m3d" }, "subdir_projects": { "m3d_engine": "m3d-engine/target/scala-2.11/", "m3d_api": "m3d-api/" }, "tags": { "full_load": "full_load", "delta_load": "delta_load", "append_load": "append_load", "table_suffix_stage": "_stg1", "table_suffix_swap": "_swap", "config": "config", "system": "system", "algorithm": "algorithm", "table": "table", "view": "view", "upload": "upload", "pushdown": "pushdown", "aws": "aws", "hdfs": "hdfs", "file": "file" }, "data_dict_delimiter": "|" }
- -cluster_mode: Specifies whether the function should execute in a cluster or on a single node.
- -destination_system: Name of the system to which data will be loaded.
- -destination_database: Name of the destination database.
- -destination_environment: Name of the target environment (test, dev, preprod, prod, etc.).
- -destination_table: Name of the table in the destination_database of the destination_system where data will be written to.
- -algorithm_instance: Name of the algorithm from m3d-engine to be executed.
- -load_type: Type of the load algorithm to be executed (FullLoad, DeltaLoad, or AppendLoad).
- -ext_params: Parameters in JSON format expected by an algorithm in m3d-engine.
- -spark_params: Spark parameters in JSON format.
- -core_instance_count: Number of executor nodes in the EMR cluster.
- -core_instance_type: AWS node instance type for each executor node in the EMR cluster.
- -master_instance_type: AWS node instance type for the master node in the EMR cluster.
- -emr_version: Version of EMR to use for EMR clusters.
- -emr_cluster_id: Identifier of an already running EMR cluster on which the function should execute.
Not all arguments are mandatory for API calls. Please check the source code to identify required parameters for the API you would like to use.
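For instance, a call that only needs the table coordinates might look like the following sketch; the paths and names are illustrative, and the exact set of required arguments should be checked in the source code as noted above:

```bash
python m3d_main.py -function drop_table \
    -config /relative/to/m3d-api/config/m3d/config.json \
    -destination_system emr \
    -destination_database emr_database \
    -destination_environment test \
    -destination_table table_name \
    -emr_cluster_id id-of-started-cluster
```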
As an example of M3D's capabilities, the following walkthrough loads data from data files into AWS. Prerequisites: `cd` into the working directory where you have m3d-api and m3d-engine copied, whether from conda or from the GUI installer. For m3d-engine, you will need the compiled jar, or you can build it manually with SBT (see the sketch below).
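A minimal build sketch, assuming m3d-engine uses the sbt-assembly plugin to produce the fat jar referenced by `subdir_projects.m3d_engine` in config.json (the exact task name may differ in your checkout):

```bash
cd m3d-engine
# Build the fat jar; it should end up under target/scala-2.11/,
# matching the "subdir_projects" entry in config.json.
sbt assembly
```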
Before you proceed, make sure you have everything in the prerequisites section completed and that the entries in the config.json file have been adjusted to match your setup. Also, make sure the relevant information is in the tconx file, such as column names, lake table name, destination database, etc. Note that for the example below, destination_database is set to emr_database, destination_system is emr and destination_environment is test. For table_name, we use test_table. The database names in the M3D layers should match the names you defined in the prerequisites section.
The steps are the following:
- Upload a csv file containing the data to be loaded into the lake. You can use the AWS CLI to place the file in the inbound bucket:

  ```bash
  aws s3 cp data.csv s3://your-inbound-bucket/test/data.csv
  ```
- Create an EMR cluster instance:

  ```bash
  python m3d_main.py -function create_emr_cluster \
      -core_instance_type m4.large \
      -master_instance_type m4.large \
      -core_instance_count 3 \
      -destination_system emr \
      -destination_database emr_database \
      -destination_environment test \
      -config /relative/to/m3d-api/config/m3d/config.json \
      -emr_version emr-5.23.0
  ```
- Create the environment in AWS by invoking the create_table API:

  ```bash
  python m3d_main.py -function create_table \
      -config /relative/to/m3d-api/config/m3d/config.json \
      -destination_system emr \
      -destination_database emr_database \
      -destination_environment test \
      -destination_table table_name \
      -emr_cluster_id id-of-started-cluster
  ```
- Trigger the FullLoad algorithm in m3d-engine to load the data from the inbound layer into the lake layer:

  ```bash
  python m3d_main.py -function load_table \
      -config /relative/to/m3d-api/config/m3d/config.json \
      -destination_system emr \
      -destination_database emr_database \
      -destination_environment test \
      -destination_table table_name \
      -load_type FullLoad \
      -emr_cluster_id id-of-started-cluster
  ```
- OPTIONAL: Shut down the EMR cluster. Normally you will stop the EMR cluster after a load job completes, but if you would like to connect to the cluster afterwards, you can skip this final API call to keep it running. You can then open HUE to query the data via Hive on the running EMR cluster, by connecting to the master instance if it was configured as suggested in this guide (see the tunnel sketch after this list):

  ```bash
  python m3d_main.py -function delete_emr_cluster \
      -config /relative/to/m3d-api/config/m3d/config.json \
      -destination_system emr \
      -destination_database emr_database \
      -destination_environment test \
      -emr_cluster_id id-of-started-cluster
  ```
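As a sketch of how you might reach HUE if you keep the cluster running, assuming you have SSH access to the master node with the key pair used at cluster creation (HUE listens on port 8888 on EMR by default; the host and key file below are placeholders):

```bash
# Forward the HUE web UI from the EMR master node to your machine.
ssh -i /path/to/key.pem -N -L 8888:localhost:8888 hadoop@<master-public-dns>
# Then open http://localhost:8888 in your browser.
```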
© adidas AG
adidas AG publishes this software and accompanying documentation (if any) subject to the terms of the Apache 2.0 license with the aim of helping the community with our tools and libraries which we think can be also useful for other people. You will find a copy of the Apache 2.0 license in the root folder of this package. All rights not explicitly granted to you under the Apache 2.0 license remain the sole and exclusive property of adidas AG.
NOTICE: The software has been designed solely for the purpose of automated creation, management and governance of metadata and data flows. The software is NOT designed, tested or verified for productive use whatsoever, nor for any use related to high-risk environments, such as health care, highly or fully autonomous driving, power plants, or other critical infrastructures or services.
If you want to contact adidas regarding the software, you can mail us at [email protected].
For further information open the adidas terms and conditions page.
- What is a TCONX file? It is a JSON file containing the definition of a table to be created in a Hadoop environment. Entries in the file include the destination database, the table name in the lake, the table columns, and the names of the columns in the different M3D layers. For an example of what a TCONX file looks like, take a look at the samples subdirectory in this repo. It is important to note that the parameters mentioned above (table name, environment, etc.) are part of the TCONX file naming convention. In samples/tconx-(emr)-(emr_database)-(test)-(prefix)_(table_name).json, we can find the following parts in parentheses:
- emr - this is the destination system
- emr_database - this is the destination database
- test - this is the destination environment
- prefix - this is the name of the source system generating the data
- table_name - this is the name of the table for which the tconx file was generated
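To give an idea of the shape of such a file, here is a heavily simplified, hypothetical sketch; the field names are illustrative only and do not reflect the actual TCONX schema, so always refer to the files in the samples subdirectory:

```json
{
  "_comment": "Hypothetical sketch only; see samples/ for the real TCONX schema.",
  "destination_database": "emr_database",
  "destination_table": "test_table",
  "columns": [
    { "name": "order_id", "type": "string", "lake_out_name": "order_identifier" },
    { "name": "created_at", "type": "timestamp", "lake_out_name": "order_created_at" }
  ]
}
```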