Please see the set of transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.
This is a simple transform that resizes the input tables to a specified size. Resizing can be done
- based on the in-memory size of the tables, or
- based on the number of rows in the tables.

Tables can either be split into smaller tables or aggregated into larger ones, as sketched below.
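To make the two operations concrete, here is a minimal pyarrow sketch of splitting a table by row count and aggregating small tables back together. This is an illustration only, not this transform's actual implementation, and the helpers `split_table` and `aggregate_tables` are hypothetical names:

```python
import pyarrow as pa

def split_table(table: pa.Table, max_rows: int) -> list[pa.Table]:
    # Zero-copy slices of at most max_rows rows each.
    return [table.slice(offset, max_rows)
            for offset in range(0, table.num_rows, max_rows)]

def aggregate_tables(tables: list[pa.Table]) -> pa.Table:
    # Combine several same-schema tables into one larger table.
    return pa.concat_tables(tables)

table = pa.table({"document": [f"doc-{i}" for i in range(10)]})
chunks = split_table(table, max_rows=4)   # tables of 4, 4, and 2 rows
merged = aggregate_tables(chunks)         # back to a single 10-row table
```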
A Dockerfile is provided and can be used to build the docker image. You can use:

```shell
make build
```
The set of dictionary keys holding ResizeTransform configuration values is as follows:

- max_rows_per_table - specifies the maximum number of rows (documents) per table.
- max_mbytes_per_table - specifies the maximum size of a table in MB, measured according to the size_type value.
- size_type - indicates how table size is measured. Can be one of:
    - memory - table size is measured by the in-process memory used by the table.
    - disk - table size is estimated as the on-disk size of the parquet files. This is only an estimate, as files are generally compressed on disk, and so it may not be exact due to varying compression ratios. This is the default.

Only one of max_rows_per_table and max_mbytes_per_table may be specified.
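As a rough illustration of the difference between the two size_type measures, the sketch below computes both with pyarrow. The transform's internal accounting may differ, and `table_size_mbytes` is a hypothetical helper:

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

def table_size_mbytes(table: pa.Table, size_type: str = "disk") -> float:
    if size_type == "memory":
        # In-process memory footprint of the table's buffers.
        return table.nbytes / (1024 * 1024)
    # "disk": write to an in-memory parquet file and measure the result.
    # Only an estimate -- actual compression ratios vary with the data.
    buffer = io.BytesIO()
    pq.write_table(table, buffer)
    return buffer.getbuffer().nbytes / (1024 * 1024)
```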
When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the options provided by the launcher and map to the configuration keys above.
```
  --resize_max_rows_per_table RESIZE_MAX_ROWS_PER_TABLE
        Max number of rows per table
  --resize_max_mbytes_per_table RESIZE_MAX_MBYTES_PER_TABLE
        Max table size (MB). Size is measured according to the --resize_size_type parameter.
  --resize_size_type {disk,memory}
        Determines how table size is measured when using the --resize_max_mbytes_per_table option.
        'memory' measures the in-process memory footprint and
        'disk' makes an estimate of the resulting parquet file size.
```
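For example, the transform can be driven programmatically through the Ray launcher roughly as follows. This is a sketch only: the import paths and the names `RayTransformLauncher`, `ResizeRayTransformConfiguration`, and `ParamsUtils` are assumptions based on the common pattern for transforms in this project, so adjust them to match your checkout:

```python
import sys

# Assumed imports -- adjust to the actual module layout of this project.
from data_processing.utils import ParamsUtils
from data_processing_ray.runtime.ray import RayTransformLauncher
from resize_transform_ray import ResizeRayTransformConfiguration

local_conf = {"input_folder": "test-data/input", "output_folder": "output"}
params = {
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # Only one of the two sizing options may be given; here we resize by rows.
    "resize_max_rows_per_table": 125,
}
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = RayTransformLauncher(ResizeRayTransformConfiguration())
launcher.launch()
```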
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.