This program transfers tables from MySQL-like databases to Amazon Redshift. It requires:
- Apache Spark
- Mysql Or Amazon RDS
- Amazon S3
- Amazon Redshift
Build the project using SBT and create a fat jar using `sbt assembly`.
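A minimal build sketch, assuming the standard sbt project layout (the Scala version in the output path below is an assumption; adjust it to your build):

```shell
# Build the fat jar; sbt assembly bundles all dependencies into a single jar.
sbt clean assembly

# The assembled jar lands under target/ (path assumes Scala 2.10; your version may differ).
ls target/scala-2.10/sqlshift-assembly-0.1.jar
```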
This is a Spark application and requires a platform that can run Spark, such as YARN, a standalone Spark cluster, or Mesos.
- Submitting a Spark job on YARN:
```shell
spark-submit \
--class com.goibibo.sqlshift.SQLShift \
--master yarn \
sqlshift-assembly-0.1.jar -td table-details.json
```
- Submitting a Spark job locally:
```shell
spark-submit \
--class com.goibibo.sqlshift.SQLShift \
--master local[*] \
sqlshift-assembly-0.1.jar -td table-details.json
```
- More Options
```
MySQL to Redshift DataPipeline
Usage: SQLShift [options]

  -td, --table-details <path to Json>
        Table details json file path including
  -mail, --mail-details <path to properties file>
        Mail details property file path(For enabling mail)
  -aof, --alert-on-failure
        Alert only when fails
  -rc, --retry-count <count>
        How many times to retry on failed transfers
  -lmr, --log-metrics-reporter
        Enable metrics reporting in logs
  -jmr, --jmx-metrics-reporter
        Enable metrics reporting through JMX
  -mws, --metrics-window-size <value>
        Metrics window size in seconds. Default: 5 seconds
  --help
        Usage Of arguments
```
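For example, a sketch of a run that also sends mail alerts and retries failed transfers; the `mail.properties` file name and the retry count are just illustrative:

```shell
spark-submit \
--class com.goibibo.sqlshift.SQLShift \
--master yarn \
sqlshift-assembly-0.1.jar \
-td table-details.json \
-mail mail.properties \
--alert-on-failure \
--retry-count 3
```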
- Table configuration: The JSON schema can be found here. A basic template is:
```json
[
    {
        "mysql": {
            "db": "",
            "hostname": "",
            "portno": "",
            "username": "",
            "password": ""
        },
        "tables": [
            {
                "name": "",
                "incremental": "",
                "mergeKey": "",
                "addColumn": "",
                "partitions": "",
                "isSplittable": true / false,
                "isAppendOnly": true / false,
                "preLoadCmd": "",
                "postLoadCmd": ""
            }
        ],
        "redshift": {
            "schema": "",
            "hostname": "",
            "portno": "",
            "username": "",
            "password": ""
        },
        "s3": {
            "location": "",
            "accessKey": "",
            "secretKey": ""
        }
    }
]
```
- Mail configuration for alerting:
```properties
alert.host =
alert.port = 25
alert.username =
alert.password =
alert.to =
alert.cc =
alert.subject =
```
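A filled-in sketch with purely illustrative values (the host, addresses, and credentials below are assumptions; use your own SMTP details):

```properties
# SMTP server used for alert mails (example values only).
alert.host = smtp.example.com
alert.port = 25
alert.username = alerts
alert.password = secret
alert.to = data-team@example.com
alert.cc = oncall@example.com
alert.subject = SQLShift transfer report
```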
- Full Dump: Let's say a local MySQL server is running on `localhost`, listening on port `3306`. The server has user `root` with `admin` as the password. Now we want to migrate the complete table `test` from the `default` database to Redshift, into schema `base`, which is running on a remote host.
```json
[
    {
        "mysql": {
            "db": "default",
            "hostname": "localhost",
            "portno": 3306,
            "username": "root",
            "password": "admin"
        },
        "tables": [
            {
                "name": "test"
            }
        ],
        "redshift": {
            "schema": "base",
            "hostname": "redshift.amazonaws.com",
            "portno": 5439,
            "username": "admin",
            "password": "admin"
        },
        "s3": {
            "location": "s3a://bucket/",
            "accessKey": "access key",
            "secretKey": "secret key"
        }
    }
]
```
Note[1]- A full dump of a table works fine, but if the table is too large (tens of GBs) it can take a long time. If you only want to lift a small part of the table, you can do it using the incremental facility provided.
- Incremental: For an incremental load, just add a `where` condition (without the `WHERE` keyword) to the configuration. Let's say table `test` has a `modified_at` timestamp column that changes whenever a record is created or updated, and we need one day's worth of data. The value of `incremental` should be SQL compatible with MySQL.
```json
[
    {
        "mysql": {...},
        "tables": [
            {
                "name": "test",
                "incremental": "modified_at BETWEEN '2016-12-30 00:00:00' AND '2016-12-30 23:59:59'"
            }
        ],
        "redshift": {...},
        "s3": {...}
    }
]
```
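For recurring daily loads, one possible approach (a sketch, not a feature of SQLShift itself) is to keep a template with a date placeholder and substitute yesterday's date before each run; `DATE_PLACEHOLDER` and the file names below are purely illustrative:

```shell
# Template contains: "incremental": "modified_at BETWEEN 'DATE_PLACEHOLDER 00:00:00' AND 'DATE_PLACEHOLDER 23:59:59'"
YESTERDAY=$(date -d "yesterday" +%F)   # GNU date; on BSD/macOS use: date -v-1d +%F
sed "s/DATE_PLACEHOLDER/${YESTERDAY}/g" table-details.template.json > table-details.json

spark-submit \
--class com.goibibo.sqlshift.SQLShift \
--master yarn \
sqlshift-assembly-0.1.jar -td table-details.json
```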
Note[2]- The program works better if the MySQL table has only one integer primary key; otherwise you have to provide a distkey, which should be an integer and unique. The distkey helps the program split the data and process it in parallel. If your table has neither of these options, add the `"isSplittable": false` field to the JSON so the table will not be split.

Note[3]- It is better to have an auto-increment column in MySQL as the only PRIMARY KEY, because it helps the application process data in parallel.
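For illustration, a MySQL table of the kind note[3] recommends (the column names here are assumptions, not taken from the schema above):

```sql
-- Auto-increment integer as the only PRIMARY KEY: the table can be split on it for parallel reads.
CREATE TABLE test (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    payload     VARCHAR(255),
    modified_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);
```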
- Non-Splittable: If a table is non-splittable, add `"isSplittable": false`. Large tables, however, need to be split, otherwise it becomes impossible to migrate them in anything close to real time. By default this field is `false` if not passed.
```json
[
    {
        "mysql": {...},
        "tables": [
            {
                "name": "test",
                "isSplittable": true
            }
        ],
        "redshift": {...},
        "s3": {...}
    }
]
```
- Add Columns: Use this if you want an additional column in the Redshift table, computed with some processing that can be expressed in SQL compatible with Redshift. Suppose you want to extract `year` and `month` from the `created_at` column; you can do it in the following way:
```json
[
    {
        "mysql": {...},
        "tables": [
            {
                "name": "test",
                "addColumn": "EXTRACT(year from created_at)::INT AS year, EXTRACT(month from created_at)::INT AS month"
            }
        ],
        "redshift": {...},
        "s3": {...}
    }
]
```
Please use the following support channels:
- Issues - For bug reporting or feature enhancements.
- E-Mail - Mail any of us on any issue.