pg2bq is made for one thing: exporting tables from PostgreSQL to Google BigQuery.
It's useful for keeping the data in both places in sync (using cron, Airflow, or any other scheduler to run the export on a regular basis). If your metadata live in PostgreSQL but your real-time data are in BigQuery, chances are you will want to join them.
Note: internally, pg2bq uses the Spark framework for the sake of simplicity, but no Hadoop cluster is needed. Spark is configured as a "local" cluster by default, meaning the application runs standalone.
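For reference, "local" mode amounts to something like the snippet below (a minimal sketch of the idea, not pg2bq's actual code):

```scala
import org.apache.spark.sql.SparkSession

// "local[*]" runs Spark inside the application's own JVM using all available cores:
// no Hadoop/YARN cluster is involved.
val spark = SparkSession.builder()
  .appName("pg2bq")
  .master("local[*]")
  .getOrCreate()
```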
- Download the release from GitHub: pg2bq-1.0.3.zip
- Create a Google Cloud Service Account with access to GCS and BigQuery, and generate a JSON key for it
- Create a configuration file `configuration.conf` so pg2bq knows where to grab the data and where to put it (HOCON format; a sketch of how it can be loaded follows these steps):
jdbc {
  url      = "jdbc:postgresql://mypg:5432/mydb"
  user     = "myuser"
  password = "mypwd"
  tables   = [ "user", "campaign", "website" ]
}

gcloud {
  project                  = "gcloud-project-id"
  service-account-key-path = "/path/to/service-account-key.json"
  bq.dataset               = "mypg"
  gcs.tmp-bucket           = "pg-export-tmp"
}
- Run the application specifying the config file:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json ./bin/pg2bq -Dconfig.file=configuration.conf
- Done!
Add this to a scheduler every 10min and enjoy your JOINs in BigQuery.
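A side note on the configuration: the `-Dconfig.file` flag follows the Typesafe Config (HOCON) conventions, so the keys above map to paths like these (an illustrative sketch, not pg2bq's actual code):

```scala
import com.typesafe.config.ConfigFactory

// ConfigFactory.load() picks up the file passed with -Dconfig.file=configuration.conf
val conf = ConfigFactory.load()

val jdbcUrl   = conf.getString("jdbc.url")
val jdbcUser  = conf.getString("jdbc.user")
val tables    = conf.getStringList("jdbc.tables")      // java.util.List[String]
val project   = conf.getString("gcloud.project")
val dataset   = conf.getString("gcloud.bq.dataset")
val tmpBucket = conf.getString("gcloud.gcs.tmp-bucket")
```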
Under the hood, here is what pg2bq does:
- It exports the data from the tables into Spark DataFrames.
- It saves them into GCS as `.avro` files, keeping the schema along with the data: this avoids having to specify/create the BigQuery table schemas beforehand.
- It starts BigQuery load jobs to import those `.avro` files into the respective BigQuery tables.
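Put together, the flow looks roughly like the sketch below, using Spark's JDBC reader, the spark-avro format, and the google-cloud-bigquery client (with the GCS connector on the classpath). The connection details and names are taken from the example config above; this illustrates the approach, not pg2bq's actual code:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession
import com.google.cloud.bigquery.{BigQueryOptions, FormatOptions, JobInfo, LoadJobConfiguration, TableId}

val spark = SparkSession.builder().appName("pg2bq").master("local[*]").getOrCreate()

val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "mypwd")

val bigquery = BigQueryOptions.getDefaultInstance().getService()

Seq("user", "campaign", "website").foreach { table =>
  // 1. Export the table into a DataFrame via JDBC.
  val df = spark.read.jdbc("jdbc:postgresql://mypg:5432/mydb", table, props)

  // 2. Save it to GCS as Avro, so the schema travels with the data
  //    (requires the spark-avro module and the GCS connector).
  val gcsPath = s"gs://pg-export-tmp/$table"
  df.write.format("avro").mode("overwrite").save(gcsPath)

  // 3. Start a BigQuery load job that imports the Avro files into the target table.
  val loadConfig = LoadJobConfiguration.newBuilder(TableId.of("mypg", table), s"$gcsPath/*.avro")
    .setFormatOptions(FormatOptions.avro())
    .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
    .build()

  bigquery.create(JobInfo.of(loadConfig)).waitFor()
}
```

WRITE_TRUNCATE is just one possible write disposition here; it replaces the table contents on each run, which matches the "keep both places in sync" use case.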
To run the application in dev mode, create a proper `application.conf` and run the app:
- via the IDE, just run `Main`
- via sbt, by pre-packaging the whole thing:
$ sbt stage && ./target/universal/stage/bin/pg2bq