Variant Transforms splits the output table into several smaller tables each containing variants of a specific region of a genome. We call this procedure sharding. Sharding the output table significantly reduces query costs and also optimizes the cost and time of running Variant Transforms' pipeline. In fact, without sharding, importing large inputs such as 1000Genomes to BigQuery would be very costly and time consuming, especially if variant merging options are set.
Variant Transforms uses a config file to perform sharding. Our default sharding config file splits variants into 25 tables, one for each chromosome. Following figure schematically shows how Variant Transforms performs sharding:
Splitting output tables based on the reference_name
guarantees that
per-chromosome queries will only process variants of the chromosome under study.
Also, sharding is required if we want to utilize BigQuery
partitioining.
You can define your shards to be more finely grained than the default config we have provided (which is at the chromosome level). For example based on your use case you might have multiple tables per chromosome. Also, as the name suggests, homo_sapiens_default is designed for the human genome. If you need a sharding config for organisms other than human, you can modify the default config file. In the next section, we explain the details of our config file.
If you need help modifying the default config file for your use case, then please reach out to us by submitting a Github issue so we can help.
The config file is set using the --sharding_config_path
flag and is
formatted as a YAML file with a
straightforward structure. Here
you can find the default config file that splits the output table into 25 tables,
one per chromosome plus an extra residual shard.
We use this config file as the default for all runs of Variant Transforms.
In other words, --sharding_config_path
default value points to this default
config file. Here is a snippet of that file:
- output_table:
table_name_suffix: "chr1"
regions:
- "chr1"
- "1"
partition_range_end: 249,240,615
This config file defines a shard that will include all variants whose
reference_name
is equal to one of the given values for regions field,
namely chr1
or 1
. Note that the reference_name
string is case-sensitive,
so if your variants have Chr1
or CHR1
they will not be matched to this shard.
The final output table name for this shard will have __chr1
suffix. More precisely,
if you use the following flag to set the name of your output table:
--output_table my-project:my_dataset.my_table
then the output table corresponding to this shard (which contains chromosome 1 variants) will be stored at:
my-project:my_dataset.my_table__chr1
You can use any string as a suffix for your table names. Here, for simplicity,
we used the same string (chr1
) for both reference_name
matching
(under region
field) and table suffix.
Other than table_name_suffix
and regions
, each output_table
entry
in the sharding config file has another field: partition_range_end
.
The value of this field is used for conducting integer range partitioning
on the output table. We use this number and then split it into 3999 partitions,
which is the maximum allowed number of partitions minus 1. For example,
in the previous example, partition_range_end is set to 249,240,615
.
Based on this number we choose the
width of partition intervals
to be 62340
. This means:
- All variants with
start_position
in[0, 62340)
range will be in the first partition. - All variants with
start_position
in[62340, 2*62340)
will be in the second partition. - ...
- All variants with
start_position
in[3998*62340, 3999*62340)
will be in the 3999th partition.
As one can see, the partition range end value ends up 3999 * 62340 = 249,297,660
which is larger than the original value given in the sharding config 249,240,615
.
We deliberately made the range end a bit larger to make sure we accommodate cases
where the user does not exactly know the size of a chromosome. Please refer to
calculate_optimal_range_interval
method in
partitioning.py
module for more details. We only use 3999 partitions because we reserve the
last one as the default partition. The default partition contains any variants
that do not belong to any of the other 3999 aforementioned partitions.
Examining our default sharding config file
shows that we use different partition_range_end
values for each chromosome.
Following table summarizes these values for a couple of chromosomes:
Chromosome | Partition Range End |
---|---|
chr1 |
249,240,615 |
chr2 |
243,189,284 |
chr3 |
198,233,091 |
chr21 |
48,119,890 |
chr22 |
51,244,528 |
chrX |
156,030,808 |
chrY |
57,227,415 |
These values correspond to the size of each chromosome. We found these values by
examining the gnomAD dataset and finding the largest start_position
value for
each chromosome using gnomAD on GCP Marketplace
BigQuery tables. Using these partition_range_end
values enables us to perfectly
fit 3999 partitions to each table and fully take advantage of cost-saving benefits
of BigQuery partitioning. We call this adaptive partitioning, because each
chromosome has a different partitioning interval.
Output tables with adaptive partitioning have one downside: BigQuery does not allow to create a view over tables with different partitioning intervals. Similarly, BigQuery does not allow querying multiple tables using wildcard if selected tables have different partitioning intervals. Both of these two cases fail with this error message:
"errors": [
{
"message": "Wildcard matched incompatible partitioning tables,
first table PROJECT_ID:DATASET_ID.variants__chr1,
first incompatible table PROJECT_ID:DATASET_ID.variants__chr10.",
"domain": "global",
"reason": "invalid"
}
],
Because these two features might be critical to the workflow of some of our users,
we offer another partitioning config file
that utilizes fixed partitioning. This means we use the same partition_range_end
value for all chromosomes, namely the largest chromosome: chr1
. We know this
slightly degrades the query cost reduction potential of BigQuery partitioning,
especially for smaller chromosomes such as chr21
or chr22
, but doing so allows
our users to create views and run queries over multiple tables. In order to run
Variant Transforms with fixed partitioning, please add the following flag to your
vcf_to_bq
command:
--sharding_config_path gcp_variant_transforms/data/sharding_configs/homo_sapiens_fixed_partitions.yaml
As we mentioned earlier, sharding can be done at a more fine grained level and does not have to be limited to chromosomes. For example, the following config defines two shards that contain variants of chromosome X:
- output_table:
table_name_suffix: "chrX_first_100M"
regions:
- "chrX:0-100,000,000"
partition_range_end: 100,000,000
- output_table:
table_name_suffix: "chrX_remaining"
regions:
- "chrX:100,000,000-999,999,999"
partition_range_end: 156,030,808
If the start position of a variant on chromosome X is less than 100,000,000
then it will be assigned to chrX_first_100M
table, otherwise it will be
assigned to chrX_remaining
table.
All shards defined in a config file follow the same principle: variants will
be assigned to them based on their defined regions. The only exception is the
residual
shard, which acts as the default shard: all variants not assigned
to any shard will end up in this shard. For example, consider the following
config file:
- output_table:
table_name_suffix: "first_50M"
regions:
- "chr1:0-50,000,000"
- "chr2:0-50,000,000"
- "chr3:0-50,000,000"
partition_range_end: 50,000,000
- output_table:
table_name_suffix: "second_50M"
regions:
- "chr1:50,000,000-100,000,000"
- "chr2:50,000,000-100,000,000"
- "chr3:50,000,000-100,000,000"
partition_range_end: 100,000,000
- output_table:
table_name_suffix: "all_remaining"
regions:
- "residual"
This config file splits all the variants into 3 tables:
first_50M
: all variants ofchr1
,chr2
, andchr3
whosestart_position
is< 50M
second_50M
: all variants ofchr1
,chr2
, andchr3
whosestart_position
is>= 50M
and< 100M
all_remaining
: all remaining variants. This includes:- All variants of
chr1
,chr2
, andchr3
with start position>= 100M
- All variants of other chromosomes.
- All variants of
Using the residual
shard, you can make sure your output tables will include
all input variants. However, if you don't need the residual variants, then
you can simply remove the last shard from your config file. If you removed
the last shard from the previous example, then you would have only 2 tables
as output, and variants that do not match either of those two shards would
be dropped from the final output.
This feature can be used more broadly for filtering out unwanted variants from the output tables. Filtering reduces the cost of running Variant Transforms.