All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Add support for AWS Glue v4.0
- Updated dependencies
- Added support for --additional-python-modules
- Fix Glue version setter
- Add support for Glue version 3.0
- Allow tags to be passed to Glue jobs
- Add support for `binary` type
- Set the default AWS Glue version to 2.0
- Added `partitions` and `primary_key` properties to table metadata
- Update CHANGELOG and pyproject.toml missed in release v7.1.0
- Allow use of decimal data type
- Added `sensitivity` and `redacted` properties to column metadata
- Added `sensitivity` property to table metadata
- Added the ability to automatically generate a TableMeta object from parquet metadata, using `tablemeta_from_parquet_meta`
- Added the ability to update an existing database with new tables - see `meta.get_existing_database_from_glue_catalogue` and `DatabaseMeta.update_glue_database`
- Fixed bug that meant the use of complex types (arrays and structs) didn't actually work in Athena
- Users can now include jars in a `glue_jars` folder, and they will be uploaded to S3 and made available in the Glue environment
- GlueJob now sets a timeout parameter for Glue jobs. This can be set to a specific time (in minutes) using the `timeout_override_minutes` property
- Relaxed package requirements on jsonschema
- Removed `requirements.txt` as no longer used
- Removed validator from column description as it was too strict.
- ETL Manager now points to a web schema for tables. If the schema web link cannot be accessed, the schema is taken from the package and a warning is output.
- Updated package setup to `pyproject.toml`
- Replaced Travis with GitHub Actions
- Glue jobs now run using Python 3 and Spark 2.4 by default
- ETL manager now allows use of STRUCT and ARRAY col types in your hive metadata tables.
- Method function in TableMeta `refresh_paritions` renamed to `refresh_partitions`. The `refresh_partitions` function now waits for Athena to complete the query. This should avoid errors where you hit limits of concurrent Athena queries (max 4) when using `refresh_all_table_partitions` (from DatabaseMeta class).
- Two new input arguments to GlueJob method function `wait_for_completion`.
  - Input `back_off_retries` now is the number of retries to boto API to avoid Throttling Error. Retries are done with exponential back off.
  - Input `cleanup_if_successful` will delete the glue job if the `wait_for_completion` doesn't raise an error. i.e. Glue job completes successfully.
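The exponential back-off behaviour described for `back_off_retries` can be illustrated with a minimal, generic sketch. This is not etl_manager's implementation: `call_with_back_off`, `ThrottlingError`, `base_delay` and the injectable `sleep` argument are hypothetical names used only for illustration, and the delay schedule (doubling from a base delay) is an assumption.

```python
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for a throttling error raised by an API client."""


def call_with_back_off(func, back_off_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func, retrying on ThrottlingError with exponentially growing waits.

    The wait before retry n (0-based) is base_delay * 2 ** n seconds.
    The sleep argument is injectable so the sketch can be run without waiting.
    """
    for attempt in range(back_off_retries + 1):
        try:
            return func()
        except ThrottlingError:
            if attempt == back_off_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

Doubling the wait after each failed attempt reduces pressure on a throttled API, while the retry cap keeps a persistent failure from looping forever.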
- Fixed issues 91 and 92
- Improved python format
- Refactored to Python 3.6
- Fixed unknown issue where arguments passed into function were not copied (same memory location)
- Added argument `wait_seconds` to `GlueJob` class function `wait_for_job_completion()` to set number of seconds between job status checks. Default unchanged.
- Updated output from `GlueJob` class function `wait_for_job_completion()` (when verbose is set to True), now states how long Glue has been running the job.
- Fixed bug where glue_specific would not write to json or be a key in dictionary from TableMeta class `to_dict()` method.
- Fixed bug where default table ddl templates would be overwritten causing mixed table definitions (see issue no. 80 for specific example and fix).
- If the meta partition property is None or an empty list, it will no longer be passed to dict (and therefore not to json)
- If the meta glue_specific property is None or an empty dict, it will no longer be passed to dict (and therefore not to json)
- DatabaseMeta method function `test_column_types_align` now tests that all column types match across all tables in database object.
- Fixed bug that meant the new nullable column property was only being set if nullable was True.
- Now allows newline json files as Athena compatible tables (note still does not support struct or array column types - still on the todo list)
- Improved `delete_glue_database` method function to only catch/allow specific error (database does not exist)
- Meta data cols now have `enum`, `pattern` and `nullable` properties
- `wait_for_completion` method function now has verbose input param that prints out status with time stamp every time boto checks on the glue job
- `update_column` method function of `TableMeta` class now takes kwargs that match the properties of the column. (Input params of `new_type`, `new_name`, etc will no longer work). e.g. new functionality works as `tab.update_column('col1', type = 'int')`.
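A rough sketch of the kwargs pattern this entry describes, using a toy stand-in class rather than the real `TableMeta`. `ToyTable`, its column layout, and the set of allowed properties are assumptions made for illustration only.

```python
class ToyTable:
    """Hypothetical stand-in for a table metadata object; not the real TableMeta."""

    # Assumed set of column properties that may be updated
    ALLOWED = {"name", "type", "description", "enum", "pattern", "nullable"}

    def __init__(self, columns):
        # Each column is a dict such as {"name": "col1", "type": "character"}
        self.columns = [dict(c) for c in columns]

    def update_column(self, column_name, **kwargs):
        """Update properties of one column; keyword names must be column properties."""
        unknown = set(kwargs) - self.ALLOWED
        if unknown:
            raise ValueError(f"Unknown column properties: {sorted(unknown)}")
        for col in self.columns:
            if col["name"] == column_name:
                col.update(kwargs)
                return
        raise ValueError(f"Column {column_name!r} not found")


tab = ToyTable([{"name": "col1", "type": "character"}])
tab.update_column("col1", type="int")
```

In this sketch an old-style keyword such as `new_type` is rejected with an error instead of being silently ignored, which mirrors the note that those input params no longer work.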
- Changed back end execution of `MSCK REPAIR TABLE` call to Athena. Have moved from `pyathenajdbc` to `boto3` to reduce number of package dependencies. etl_manager no longer requires `pyathenajdbc` (which also means Java no longer needs to be installed).
- removed check that throws error for `-` in job parameter name due to the new Glue parameter `enable-metrics`
- `--conf` allowed as job param to enable spark configuration for AWS Glue
- Database meta class will now throw error if database already exists when calling create_glue_database
- setup.py now installs package dependencies
- wait_for_completion method in GlueJob class now raises error if glue job was manually stopped
- updated setup.py to match github version
- Initial release