Transposing (also called pivoting or rotating) a table is a common task in standard report-generation workflows. While some relational databases provide a built-in pivot function, pivoting can also be done with standard SQL.
As an example, a table with the fields id, class, and sales can be pivoted using the following BigQuery Standard SQL:
```sql
SELECT
  id,
  MAX(CASE WHEN class = 'HVAC' THEN sales END) AS HVAC_SALES,
  MAX(CASE WHEN class = 'GENERATORS' THEN sales END) AS GENERATORS_SALES
FROM
  `project-id.dataset_id.table_id`
GROUP BY
  id;
```
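For concreteness, a hypothetical input and its pivoted output might look like the following (the row values are illustrative only, not from a real dataset):

| id | class      | sales |
|----|------------|-------|
| 1  | HVAC       | 100   |
| 1  | GENERATORS | 200   |
| 2  | HVAC       | 50    |

| id | HVAC_SALES | GENERATORS_SALES |
|----|------------|------------------|
| 1  | 100        | 200              |
| 2  | 50         | NULL             |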
However, this can get significantly more complicated as:
- The number of pivot fields increases (a single pivot field, class, in the above example).
- The number of distinct values in the pivot fields increases (two distinct values, HVAC and GENERATORS, in the above example).
- The number of pivot values increases (a single pivot value, sales, in the above example).
The most common approach to pivoting a complex table is a two-step process (a sketch of the first step follows the list):
- Run a custom script that analyzes the table and generates a SQL statement such as the one above.
- Run the dynamically generated SQL to pivot the table and write the output to another table.
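As a minimal sketch of the first step, assuming the distinct pivot values have already been collected (for example via a SELECT DISTINCT query), the generator script might look like this; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical generator for a pivot SQL statement (illustrative only). */
public class PivotSqlGenerator {

  static String generatePivotSql(
      String table, String keyField, String pivotField, String valueField,
      List<String> distinctValues) {
    // One MAX(CASE ...) column per distinct pivot value,
    // e.g. MAX(CASE WHEN class = 'HVAC' THEN sales END) AS HVAC_SALES.
    String cases =
        distinctValues.stream()
            .map(v -> String.format(
                "MAX(CASE WHEN %s = '%s' THEN %s END) AS %s_%s",
                pivotField, v, valueField, v, valueField.toUpperCase()))
            .collect(Collectors.joining(",\n  "));
    return String.format(
        "SELECT\n  %s,\n  %s\nFROM\n  `%s`\nGROUP BY\n  %s;",
        keyField, cases, table, keyField);
  }

  public static void main(String[] args) {
    System.out.println(
        generatePivotSql(
            "project-id.dataset_id.table_id", "id", "class", "sales",
            Arrays.asList("HVAC", "GENERATORS")));
  }
}
```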
Alternatively, the same result can be achieved with the Dataflow pipeline described below.
Pivot - A Dataflow pipeline that pivots a BigQuery table across any number of pivot fields and values. The user specifies a comma-separated list of key fields along which the table is rotated, as well as comma-separated lists of the fields and values to rotate. Specifically, the user can specify:
- Key fields along which the table is rotated (id in the above example).
- Pivot fields that should be rotated (class in the above example).
- Pivot values that should be rotated (sales in the above example).
The pipeline performs the following steps to complete the pivot (a simplified sketch of the per-record pivot step follows the list):
- Validate that the specified fields exist and have the correct data types.
- Read the data from an input BigQuery table.
- Analyze the pivot fields and dynamically generate the correct schema.
- Pivot every record based on the dynamically generated schema.
- Write the pivoted records into a target BigQuery table.
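To illustrate the per-record pivot step, here is a deliberately simplified Beam DoFn that hard-codes the field names from the first example (id, class, sales); it is a sketch under those assumptions, not the actual pipeline code:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.DoFn;

/** Simplified, illustrative pivot of a single record (not the real pipeline code). */
class PivotRowFn extends DoFn<TableRow, TableRow> {

  @ProcessElement
  public void processElement(@Element TableRow row, OutputReceiver<TableRow> out) {
    TableRow pivoted = new TableRow();
    // Key fields are copied through unchanged.
    pivoted.set("id", row.get("id"));
    // Derive the pivoted column name from the pivot field's value,
    // e.g. class = 'HVAC' and value field 'sales' -> HVAC_SALES.
    String pivotFieldValue = (String) row.get("class");
    pivoted.set(pivotFieldValue + "_SALES", row.get("sales"));
    out.output(pivoted);
  }
}
```

Records pivoted this way would still need to be merged per key (for example with a Combine.perKey transform) so that each key yields a single output row.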
Building and running the pipeline requires:
- Java 8
- Maven 3
Build the entire project using the Maven compile command.
```sh
mvn clean compile
```
Run all unit tests.
```sh
mvn clean test
```
A slightly more complex example is an input table with the fields id, locid, class, on_sale, state, sale_price, count, and desc. To pivot this table, we use the following inputs:
- keyFields = id,locid
- pivotFields = class,on_sale,state
- pivotValues = sale_price,count
The desc field is ignored and will not be in the output table.
The Pivot pipeline will create a new pivot table based on the inputs.
Execute the pipeline using the Maven exec:java command.
```sh
MY_PROJECT=my-project-id
MY_STAGING_BUCKET=my-staging-bucket-name
MY_DATASET_ID=my-dataset-id
MY_SOURCE_TABLE_ID=my-source-table-id
MY_TARGET_TABLE_ID=my-target-table-id

mvn compile exec:java \
  -Dexec.mainClass=com.google.cloud.pso.pipeline.Pivot \
  -Dexec.cleanupDaemonThreads=false \
  -Dexec.args=" \
  --project=$MY_PROJECT \
  --runner=DataflowRunner \
  --stagingLocation=gs://${MY_STAGING_BUCKET}/staging \
  --tempLocation=gs://${MY_STAGING_BUCKET}/tmp \
  --inputTableSpec=${MY_PROJECT}:${MY_DATASET_ID}.${MY_SOURCE_TABLE_ID} \
  --outputTableSpec=${MY_PROJECT}:${MY_DATASET_ID}.${MY_TARGET_TABLE_ID} \
  --keyFields=id,locid \
  --pivotFields=class,on_sale,state \
  --valueFields=sale_price,count"
```
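For local testing, the same command can be run with `--runner=DirectRunner`, which executes the pipeline on the local machine instead of on Dataflow.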