With Google Cloud Dataproc, you can deploy various Hive versions and use them like on-premise. One of the challenges in migration is authorization. It is possible to deploy Apache Ranger on Dataproc to preserve the authorization rules. Ranger-Hive plugin works as it is, and recently Ranger-GCS plugin has been introduced to replace Ranger-HDFS rules.
After an initial migration, many teams wants to adopt:
- BigQuery datawarehouse for its serverless architecture instead of maintaining Hive infrastructure
- BigLake API as a single interface to data stored in GCS, AWS S3 or Azure Data Lake. BigLake provides the exact fine grained access control rules as BigQuery.
One of the challenges in this, in addition to data migration and SQL translation, is the adaptation of Apache Ranger authorization rules to BigLake/BigQuery IAM. In this work, we provide an assessment tool for you to see what can be copied as it is and what you have to re-architecture.
The table below compares these authorization systems.
<style> #greenrow { text-align: left; background-color: #A8DAB5; color: #5F6368 } #redrow { text-align: left; background-color: #F6AEA9; color: #5F6368 } #yellowrow { text-align: left; background-color: #FDE293; color: #5F6368 } #rowheader { text-align: center; font-weight: bold; color: #3C4043; background-color: #E8EAED; } </style>BigLake/BigQuery IAM | Ranger Hive Plugin | |
---|---|---|
Users and Groups | supported | supported |
Permission Mapping |
|
|
Authorization Granularity | Dataset, Table, Column, Row | |
Data Masking on Columns | supported | supported |
Policy Labels | supported | supported |
Tag Based Access Control |
|
supported |
Allow Exceptions | not supported | supported |
Deny Policies | not supported. Preview for GCS. Please check the latest information. | supported |
Policy Priorities (Priorities of 0 or 1) | Not supported since IAM is additive. In Ranger, policy priorities are used to allow access temporarily by overriding other policies. We can achieve temporary access by using group memberships |
supported |
Wildcard | Wildcard (*) are not supported in IAM.
For instance, we cannot create IAM rules for finance_* However, we can group resources in Projects, Folders etc. |
supported |
Validity Period | Validity period is not supported since
BigQuery IAM does not support conditions yet (please check the latest information). Nevertheless, temporary access can be given by temporary group membership, which is also the recommended since it does not require changes in the IAM policy and hence no need to run a CI/CD pipeline. |
supported |
The assessment tool is in fact a dashboard where we:
- load Ranger policies to BigQuery,
- apply transformations via views
- display the results in a Looker studio dashboard
- Export Ranger policies for hive
- Convert them to newline delimited json format so that BQ can parse. You may use
jq
:
cat <RANGER EXPORT>.json | jq -c '.policies[]' > ranger_policies.jsonl
-
We should create or re-use a GCS bucket to upload the ranger_polices.jsonl. Then we need BigQuery tables and views for analysis. Under terraform folder you find all the necessary scripts to create all.
cd terraform gcloud auth application-default login # Create and set the parameters in terraform.tfvars file # vi terraform.tfvars terraform apply
-
copy the policy export to GCS:
export PROJECT_ID=<your project id> gsutil cp ranger_policies.jsonl gs://$PROJECT_ID-ranger-assessment/
-
load into BQ:
export REGION=<region for the dataset> bq load --location=$REGION \ --source_format=NEWLINE_DELIMITED_JSON \ $DATASET_ID.hive_ranger_policies \ gs://$PROJECT_ID-ranger-assessment/ranger_policies.jsonl
-
If you have used the terraform you may skip creating the views. Otherwise: Create
hive_policies_view
from the BigQuery console:create or replace durmusio_experiements.hive_policies_view as ( select *, array_length(validitySchedules) > 0 as temporary_policy, (array_length(denyPolicyItems) > 0 or isDenyAllElse) as hasDenyPolicy, array_length(policyItems) > 0 as hasAllowPolicy, array_length(allowExceptions) > 0 as hasAllowExceptions, (resources.udf is not null or resources.url is not null) as isNonBQResource, (array_length(resources.column.values) > 0 and regexp_contains(ARRAY_TO_STRING(resources.column.values, ','), r'[0-9A-Za-z_]' ) ) as isColumnBased, (array_length(resources.table.values) > 0 and regexp_contains(ARRAY_TO_STRING(resources.table.values, ','), r'[0-9A-Za-z_]' ) and regexp_contains(ARRAY_TO_STRING(resources.column.values, ','), r'\*' ) ) as isTableBased, (array_length(resources.database.values) > 0 and not regexp_contains(ARRAY_TO_STRING(resources.table.values, ','), r'[0-9A-Za-z_]' ) and not regexp_contains(ARRAY_TO_STRING(resources.column.values, ','), r'[0-9A-Za-z_]' ) ) as isDatabaseBased, (resources.column.isExcludes or resources.database.isExcludes or resources.table.isExcludes) as hasExcludes, ["Authorization", "Column Masking", "Row Level Filtering"][OFFSET(policyType)] as policyTypeName, from `[YOUR PROJECT ID].ranger_assessment.fk_hive_policies` )
Then create statistics view:
create or replace view ranger_assessment.hive_policy_stats (AllPolicies, ValidPolicies, Deny, AllowExceptions, Temporary, HighPriority, HasExcludes) as ( select (select count(1) from `YOUR_PROJECT_ID.ranger_assessment.hive_policies_view`) as number_of_policies, (select count(1) from `YOUR_PROJECT_ID.ranger_assessment.hive_policies_view` where not ( hasDenyPolicy = true or hasAllowExceptions = true or hasExcludes = true or temporary_policy = true or policyPriority = 1 )) as number_of_valid_policies, (select count(1) from `YOUR_PROJECT_ID.ranger_assessment.hive_policies_view` where hasDenyPolicy = true ) as number_of_deny_policies, (select count(1) from `YOUR_PROJECT_ID.ranger_assessment.hive_policies_view` where hasAllowExceptions = true ) as number_of_allow_exceptions, (select count(1) from `YOUR_PROJECT_ID.ranger_assessment.hive_policies_view` where temporary_policy = true ) as number_of_temporary_policies, (select count(1) from `YOUR_PROJECT_ID.ranger_assessment.hive_policies_view` where policyPriority > 0 ) as number_of_high_priority_policies, (select count(1) from `YOUR_PROJECT_ID.ranger_assessment.hive_policies_view` where HasExcludes = true ) as number_of_policies_with_excludes, )
-
Finally make a copy of the data studio example and change the source to your own BigQuery dataset. If you don't have access to dashboard to copy, please request access.
Copyright 2023 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.