Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable users to store sample-lookup-optimized tables #573

Merged
merged 4 commits into from
May 19, 2020

Conversation

samanvp
Copy link
Contributor

@samanvp samanvp commented Mar 31, 2020

This PR includes the following changes:

  • Add a new input flag and validate its value.
  • Add BQ queries to flatten call column.
  • Extract the schema of the flatten table.
  • Add unit tests to verify the correctness of extracted schema.
  • Create tables using extracted schema which are partitioned on call_sample_id column.
  • Copy date from variant-lookup-optimized tables to sample-lookup-optimized tables.

We will add integration tests in a follow up PR.

@samanvp samanvp force-pushed the sample_opt_tables branch from 4d177c5 to 6cb91f7 Compare March 31, 2020 21:55
@samanvp samanvp force-pushed the sample_opt_tables branch 10 times, most recently from c74ad88 to 279906d Compare April 25, 2020 05:30
@samanvp samanvp force-pushed the sample_opt_tables branch 14 times, most recently from 7dcad29 to 3f3666a Compare May 14, 2020 04:04
@samanvp samanvp requested a review from tneymanov May 14, 2020 04:12
@@ -42,9 +42,7 @@ steps:
- '--project ${PROJECT_ID}'
- '--image_tag ${COMMIT_SHA}'
- '--run_unit_tests'
- '--run_preprocessor_tests'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please restore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

sharding_config_path, append):
if (output_table_base_name !=
bigquery_util.get_table_base_name(output_table_base_name)):
raise ValueError(('Output table cannot contain "{}" we reserve this '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit from previous implementation: 'Output table cannot contain "{}". We reserve this ' (remove 1 space at the end, stop sentence at {})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
_run_table_creation_command(bq_command)

class FlattenCallColumn(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is getting unruly, don't you think? Maybe move this to a new file? If you do so, please make sure to update copyright accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'd like to move both this class and LoadAvro to a new module. Let's do this change in a separate PR, this one is big already.
Submitted #598

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
_run_table_creation_command(bq_command)

class FlattenCallColumn(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

_GCS_DELETE_FILES_COMMAND = 'gsutil -m rm -f -R {ROOT_PATH}'
_BQ_LOAD_JOB_NUM_RETRIES = 5
_BQ_NUM_RETRIES = 3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we reducing retry times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug artifact, removed.

break
logging.info('Copy to table query was successful: %s', output_table_id)

def _create_temp_flatten_table(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm naming doesn't really represent what's happening here, but there is no really good name...

How about _create_temp_flatten_table_with_1_row at least?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

SCHEMA_FILE_PATH=schema_file_path)
result = os.system(bq_command)
if result != 0:
logging.error('Failed to extract flatten table schema using "%s" command',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We log errors but don't throw exceptions? Below makes more sense, but here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method returns a boolean to indicate the successful completion of its task. I throw an exception in the caller vcf_to_bq.py.

parser.add_argument(
'--sample_lookup_optimized_output_table',
default='',
help=('In addition to the default output tables (which are optimized '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to allow either output_table, sample_lookup_optimized_output_table or both instead of enforcing output_table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, we fill sample_lookup_optimized_output_table using BQ queries based on the schema and content of output_table.
On a more conceptual level, we assume VT is expected to always create variant lookup optimized tables while sample lookup optimized tables are needed by only some of the users.

client, parsed_args.output_table,
parsed_args.sharding_config_path, parsed_args.append)

if parsed_args.sample_lookup_optimized_output_table:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure sample_lookup_optimized_output_table does not equal output_table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this corner case!

help=('In addition to the default output tables (which are optimized '
'for variant look up queries), you can store a second copy of '
'your data in BigQuery tables that are optimized for sample '
'look up queries. Note that setting this option will double your '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the tables are going to be unflattened, are their storage costs going to be more than the original output_tables? Or do they end up roughly equal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If data is not joint genotyped then it's approximately equal.
I updated the help to make this issue clear, thanks for spotting it.

Copy link
Contributor Author

@samanvp samanvp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All done, thanks!

_GCS_DELETE_FILES_COMMAND = 'gsutil -m rm -f -R {ROOT_PATH}'
_BQ_LOAD_JOB_NUM_RETRIES = 5
_BQ_NUM_RETRIES = 3
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug artifact, removed.

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
_run_table_creation_command(bq_command)

class FlattenCallColumn(object):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'd like to move both this class and LoadAvro to a new module. Let's do this change in a separate PR, this one is big already.
Submitted #598

@@ -42,9 +42,7 @@ steps:
- '--project ${PROJECT_ID}'
- '--image_tag ${COMMIT_SHA}'
- '--run_unit_tests'
- '--run_preprocessor_tests'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -482,6 +501,156 @@ def create_sample_info_table(output_table_id):
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
_run_table_creation_command(bq_command)

class FlattenCallColumn(object):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

break
logging.info('Copy to table query was successful: %s', output_table_id)

def _create_temp_flatten_table(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

SCHEMA_FILE_PATH=schema_file_path)
result = os.system(bq_command)
if result != 0:
logging.error('Failed to extract flatten table schema using "%s" command',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method returns a boolean to indicate the successful completion of its task. I throw an exception in the caller vcf_to_bq.py.

parser.add_argument(
'--sample_lookup_optimized_output_table',
default='',
help=('In addition to the default output tables (which are optimized '
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, we fill sample_lookup_optimized_output_table using BQ queries based on the schema and content of output_table.
On a more conceptual level, we assume VT is expected to always create variant lookup optimized tables while sample lookup optimized tables are needed by only some of the users.

help=('In addition to the default output tables (which are optimized '
'for variant look up queries), you can store a second copy of '
'your data in BigQuery tables that are optimized for sample '
'look up queries. Note that setting this option will double your '
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If data is not joint genotyped then it's approximately equal.
I updated the help to make this issue clear, thanks for spotting it.

client, parsed_args.output_table,
parsed_args.sharding_config_path, parsed_args.append)

if parsed_args.sample_lookup_optimized_output_table:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this corner case!

sharding_config_path, append):
if (output_table_base_name !=
bigquery_util.get_table_base_name(output_table_base_name)):
raise ValueError(('Output table cannot contain "{}" we reserve this '
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@samanvp samanvp force-pushed the sample_opt_tables branch from 4f6e0f6 to 0197e0b Compare May 19, 2020 03:16
@samanvp samanvp force-pushed the sample_opt_tables branch 2 times, most recently from 243eea9 to fab844d Compare May 19, 2020 03:50
samanvp added 2 commits May 19, 2020 07:37
 + Add input flag and validate its value.
 + Add BQ queries to flatten call column.
 + Extract the schema of the flatten table.
 + Add unit tests to verify the correctness of extracted schema.
@samanvp samanvp force-pushed the sample_opt_tables branch 3 times, most recently from 924ff45 to 79bbd6e Compare May 19, 2020 11:55
@samanvp samanvp force-pushed the sample_opt_tables branch from 79bbd6e to d4d290e Compare May 19, 2020 11:57
Copy link
Collaborator

@tneymanov tneymanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, aside from minor changes.

@@ -508,6 +527,199 @@ def create_sample_info_table(output_table_id):
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
_run_table_creation_command(bq_command)

class FlattenCallColumn(object):
"""Flattens call column to convert varinat opt tables to sample opt tables."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/varinat/variant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
cp_query += ' LIMIT 1' # We need this table only to extract its schema.
self._copy_to_flatten_table(full_output_table_id, cp_query)
logging.info('A new table with 1 row was crated: %s', full_output_table_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/crated/created

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

select_list = []
for column in column_names:
if column != ColumnKeyConstants.CALLS:
select_list.append(_MAIN_TABLE_ALIAS + '.' + column + ' AS `'+
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done in template.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

MAIN_TABLE_ALIAS=_MAIN_TABLE_ALIAS,
CALL_COLUMN=ColumnKeyConstants.CALLS,
CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
cp_query += ' LIMIT 1' # We need this table only to extract its schema.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this into const too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will be used in any other method in this module. So for now, if you don't mine, I will keep it here.

Copy link
Contributor Author

@samanvp samanvp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All done, thanks!

@@ -508,6 +527,199 @@ def create_sample_info_table(output_table_id):
SCHEMA_FILE_PATH=SAMPLE_INFO_TABLE_SCHEMA_FILE_PATH)
_run_table_creation_command(bq_command)

class FlattenCallColumn(object):
"""Flattens call column to convert varinat opt tables to sample opt tables."""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

select_list = []
for column in column_names:
if column != ColumnKeyConstants.CALLS:
select_list.append(_MAIN_TABLE_ALIAS + '.' + column + ' AS `'+
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

MAIN_TABLE_ALIAS=_MAIN_TABLE_ALIAS,
CALL_COLUMN=ColumnKeyConstants.CALLS,
CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
cp_query += ' LIMIT 1' # We need this table only to extract its schema.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will be used in any other method in this module. So for now, if you don't mine, I will keep it here.

CALL_TABLE_ALIAS=_CALL_TABLE_ALIAS)
cp_query += ' LIMIT 1' # We need this table only to extract its schema.
self._copy_to_flatten_table(full_output_table_id, cp_query)
logging.info('A new table with 1 row was crated: %s', full_output_table_id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@samanvp samanvp merged commit 0154b55 into googlegenomics:master May 19, 2020
@samanvp samanvp deleted the sample_opt_tables branch May 19, 2020 20:57
tneymanov pushed a commit to tneymanov/gcp-variant-transforms that referenced this pull request Jun 1, 2020
)

* Enable users to store sample look up optimized tables

 + Add input flag and validate its value.
 + Add BQ queries to flatten call column.
 + Extract the schema of the flatten table.
 + Add unit tests to verify the correctness of extracted schema.

* First round of comments

* Sync with googlegenomics#596

* Second round of comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants