
Commit
Merge pull request #45 from bxparks/develop
merge v1.0 into master
bxparks authored Apr 4, 2020
2 parents e37cec6 + 3bf559d commit 15b68d1
Showing 11 changed files with 233 additions and 110 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/pythonpackage.yml
@@ -0,0 +1,48 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: BigQuery Schema Generator CI

on:
push:
branches: [ develop ]
pull_request:
branches: [ develop ]

jobs:
build:

runs-on: ubuntu-latest
strategy:
matrix:
# 3.5 does not support f-strings
python-version: [3.6, 3.7, 3.8]

steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v1
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# pip install -r requirements.txt
- name: Lint with flake8
run: |
pip install flake8
# Stop the build for most python errors.
# W503 and W504 are both enabled by default and contradictory, so we
# have to suppress one of them.
# E501 complains that 80 > 79 columns, but 80 is the default line wrap
# in vim.
flake8 . --count --ignore E501,W503 --show-source --statistics
# Exit-zero treats all errors as warnings. Vim editor defaults to 80.
# The complexity warning is not useful... in fact the whole thing is
# not useful, so turn it off.
# flake8 . --count --exit-zero --max-complexity=10 --max-line-length=80
# --statistics
- name: Test with unittest
run: |
python -m unittest
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,16 @@
# Changelog

* Unreleased
* 1.0 (2020-04-04)
* Fix `--sanitize_names` for recursive RECORD fields (Thanks riccardomc@,
see #43).
* Clean up how unit tests are run, trying my best to figure out
Python's convoluted package importing mechanism.
* Add GitHub Actions continuous integration pipelines with flake8 checks and
automated unit testing.
* 0.5.1 (2019-06-17)
* Add `--sanitize_names` to convert invalid characters in column names and
to shorten them if too long. (See #33; thanks @jonwarghed).
to shorten them if too long. (See #33; thanks jonwarghed@).
* 0.5 (2019-06-06)
* Add input and output parameters to run() to allow the client code using
`SchemaGenerator` to redirect the input and output files. (See #30).
4 changes: 4 additions & 0 deletions Makefile
@@ -0,0 +1,4 @@
.PHONY: tests

tests:
python3 -m unittest
43 changes: 31 additions & 12 deletions README.md
@@ -12,7 +12,7 @@ $ generate-schema < file.data.json > file.schema.json
$ generate-schema --input_format csv < file.data.csv > file.schema.json
```

Version: 0.5.1 (2019-06-19)
Version: 1.0 (2020-04-04)

## Background

@@ -44,18 +44,33 @@ the input dataset.

## Installation

Install from [PyPI](https://pypi.python.org/pypi) repository using `pip3`.
If you want to install the package for your entire system globally, use
Install from the [PyPI](https://pypi.python.org/pypi) repository using `pip3`. There
are too many ways to install packages in Python. The following are listed from
most to least recommended:

1) If you are using a virtual environment (such as
[venv](https://docs.python.org/3/library/venv.html)), then use:
```
$ sudo -H pip3 install bigquery_schema_generator
$ pip3 install bigquery_schema_generator
```
If you are using a virtual environment (such as
[venv](https://docs.python.org/3/library/venv.html)), then you don't need
the `sudo` command, and you can just type:

2) If you aren't using a virtual environment you can install into
your local Python directory:

```
$ pip3 install bigquery_schema_generator
$ pip3 install --user bigquery_schema_generator
```

3) If you want to install the package for your entire system globally, use
```
$ sudo -H pip3 install bigquery_schema_generator
```
but realize that you will be running code from PyPI as `root`, so this has
security implications.

Sometimes, your Python environment gets into a complete mess and the `pip3`
command won't work. Try typing `python3 -m pip` instead.

A successful install should print out something like the following (the version
number may be different):
```
@@ -644,16 +659,20 @@ took 67s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @

## System Requirements

This project was initially developed on Ubuntu 17.04 using Python 3.5.3. I have
tested it on:
This project was initially developed on Ubuntu 17.04 using Python 3.5.3, but it
now requires Python 3.6 or higher, mostly due to its use of f-strings.

I have tested it on:

* Ubuntu 18.04, Python 3.7.7
* Ubuntu 18.04, Python 3.6.7
* Ubuntu 17.10, Python 3.6.3
* Ubuntu 17.04, Python 3.5.3
* Ubuntu 16.04, Python 3.5.2
* MacOS 10.14.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
* MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)

The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
and 3.8.

## Changelog

See [CHANGELOG.md](CHANGELOG.md).
124 changes: 74 additions & 50 deletions bigquery_schema_generator/generate_schema.py
@@ -73,6 +73,9 @@ class SchemaGenerator:
# Detect floats inside quotes.
FLOAT_MATCHER = re.compile(r'^[-]?\d+\.\d+$')

# Matches characters that are invalid in BigQuery field names.
FIELD_NAME_MATCHER = re.compile(r'[^a-zA-Z0-9_]')

def __init__(self,
input_format='json',
infer_mode=False,
@@ -114,8 +117,8 @@ def __init__(self,

# This option generally wants to be turned on as any inferred schema
# will not be accepted by `bq load` when it contains illegal characters.
# Characters such as #, / or -. Neither will it be accepted if the column name
# in the schema is larger than 128 characters.
# Characters such as #, / or -. Neither will it be accepted if the
# column name in the schema is larger than 128 characters.
self.sanitize_names = sanitize_names
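The sanitization described in the comment above can be sketched in isolation using the `FIELD_NAME_MATCHER` pattern this commit introduces; `sanitize` is a hypothetical standalone helper mirroring the substitution and truncation applied later in `flatten_schema_map`:

```python
import re

# Pattern introduced by this commit: matches characters that are NOT
# valid in a BigQuery column name.
FIELD_NAME_MATCHER = re.compile(r'[^a-zA-Z0-9_]')

def sanitize(name):
    # Replace each invalid character with '_' and truncate, mirroring
    # the substitution applied under --sanitize_names.
    return FIELD_NAME_MATCHER.sub('_', name)[0:127]

print(sanitize('total-price/usd'))  # total_price_usd
```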

def log_error(self, msg):
@@ -323,7 +326,6 @@ def get_schema_entry(self, key, value):
if not value_mode or not value_type:
return None

# yapf: disable
if value_type == 'RECORD':
# recursively figure out the RECORD
fields = OrderedDict()
@@ -332,39 +334,48 @@
else:
for val in value:
self.deduce_schema_for_line(val, fields)
schema_entry = OrderedDict([('status', 'hard'),
('filled', True),
('info', OrderedDict([
('fields', fields),
('mode', value_mode),
('name', key),
('type', value_type),
]))])
# yapf: disable
schema_entry = OrderedDict([
('status', 'hard'),
('filled', True),
('info', OrderedDict([
('fields', fields),
('mode', value_mode),
('name', key),
('type', value_type),
])),
])
elif value_type == '__null__':
schema_entry = OrderedDict([('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'NULLABLE'),
('name', key),
('type', 'STRING'),
]))])
schema_entry = OrderedDict([
('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'NULLABLE'),
('name', key),
('type', 'STRING'),
])),
])
elif value_type == '__empty_array__':
schema_entry = OrderedDict([('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'REPEATED'),
('name', key),
('type', 'STRING'),
]))])
schema_entry = OrderedDict([
('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'REPEATED'),
('name', key),
('type', 'STRING'),
])),
])
elif value_type == '__empty_record__':
schema_entry = OrderedDict([('status', 'soft'),
('filled', False),
('info', OrderedDict([
('fields', OrderedDict()),
('mode', value_mode),
('name', key),
('type', 'RECORD'),
]))])
schema_entry = OrderedDict([
('status', 'soft'),
('filled', False),
('info', OrderedDict([
('fields', OrderedDict()),
('mode', value_mode),
('name', key),
('type', 'RECORD'),
])),
])
else:
# Empty fields are returned as empty strings, and must be treated as
# a (soft String) to allow clobbering by subsequent non-empty fields.
@@ -374,13 +385,15 @@ def get_schema_entry(self, key, value):
else:
status = 'hard'
filled = True
schema_entry = OrderedDict([('status', status),
('filled', filled),
('info', OrderedDict([
('mode', value_mode),
('name', key),
('type', value_type),
]))])
schema_entry = OrderedDict([
('status', status),
('filled', filled),
('info', OrderedDict([
('mode', value_mode),
('name', key),
('type', value_type),
])),
])
# yapf: enable
return schema_entry

@@ -435,8 +448,8 @@ def infer_value_type(self, value):
# Implement the same type inference algorithm as 'bq load' for
# quoted values that look like ints, floats or bools.
if self.INTEGER_MATCHER.match(value):
if int(value) < self.INTEGER_MIN_VALUE or \
self.INTEGER_MAX_VALUE < int(value):
if (int(value) < self.INTEGER_MIN_VALUE
or self.INTEGER_MAX_VALUE < int(value)):
return 'QFLOAT' # quoted float
else:
return 'QINTEGER' # quoted integer
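The range check in this hunk can be exercised on its own. The int64 bounds below are assumptions (the values of the class constants are not shown in this diff, though BigQuery's INTEGER type is 64-bit), and `classify_quoted_integer` is a hypothetical helper:

```python
INTEGER_MIN_VALUE = -2**63      # assumed int64 lower bound
INTEGER_MAX_VALUE = 2**63 - 1   # assumed int64 upper bound

def classify_quoted_integer(value):
    # A quoted value that looks like an integer but overflows int64 is
    # demoted to a quoted float, matching the logic in this hunk.
    if (int(value) < INTEGER_MIN_VALUE
            or INTEGER_MAX_VALUE < int(value)):
        return 'QFLOAT'   # quoted float
    return 'QINTEGER'     # quoted integer

print(classify_quoted_integer('12345'))                 # QINTEGER
print(classify_quoted_integer('99999999999999999999'))  # QFLOAT
```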
@@ -618,11 +631,13 @@ def is_string_type(thetype):
]


def flatten_schema_map(schema_map,
keep_nulls=False,
sorted_schema=True,
infer_mode=False,
sanitize_names=False):
def flatten_schema_map(
schema_map,
keep_nulls=False,
sorted_schema=True,
infer_mode=False,
sanitize_names=False,
):
"""Converts the 'schema_map' into a flattened version which is
compatible with a BigQuery schema.
@@ -647,7 +662,8 @@
else schema_map.items()
for name, meta in map_items:
# Skip over fields which have been explicitly removed
if not meta: continue
if not meta:
continue

status = meta['status']
filled = meta['filled']
@@ -679,16 +695,24 @@
else:
# Recursively flatten the sub-fields of a RECORD entry.
new_value = flatten_schema_map(
value, keep_nulls, sorted_schema, sanitize_names)
schema_map=value,
keep_nulls=keep_nulls,
sorted_schema=sorted_schema,
infer_mode=infer_mode,
sanitize_names=sanitize_names,
)
elif key == 'type' and value in ['QINTEGER', 'QFLOAT', 'QBOOLEAN']:
# Convert QINTEGER -> INTEGER, similarly for QFLOAT and QBOOLEAN.
new_value = value[1:]
elif key == 'mode':
if infer_mode and value == 'NULLABLE' and filled:
new_value = 'REQUIRED'
else:
new_value = value
elif key == 'name' and sanitize_names:
new_value = re.sub('[^a-zA-Z0-9_]', '_', value)[0:127]
new_value = SchemaGenerator.FIELD_NAME_MATCHER.sub(
'_', value,
)[0:127]
else:
new_value = value
new_info[key] = new_value
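The recursive call in this hunk switched from positional to keyword arguments; before the fix, `sanitize_names` was passed into the `infer_mode` slot. A minimal sketch with a hypothetical `flatten` helper shows why the keyword form prevents this misbinding:

```python
def flatten(schema_map, keep_nulls=False, sorted_schema=True,
            infer_mode=False, sanitize_names=False):
    # Stand-in for flatten_schema_map: just report which flags it received.
    return (keep_nulls, sorted_schema, infer_mode, sanitize_names)

# Positional call like the pre-fix recursion: the fourth argument
# (meant to be sanitize_names=True) lands in the infer_mode slot.
print(flatten({}, False, True, True))
# -> (False, True, True, False)

# Keyword call like the fixed recursion: every flag reaches the
# parameter it was intended for.
print(flatten({}, keep_nulls=False, sorted_schema=True,
              infer_mode=False, sanitize_names=True))
# -> (False, True, False, True)
```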
31 changes: 16 additions & 15 deletions setup.py
@@ -4,28 +4,29 @@
try:
import pypandoc
long_description = pypandoc.convert('README.md', 'rst', format='md')
except:
except: # noqa: E722
# If unable to convert, try inserting the raw README.md file.
try:
with open('README.md', encoding="utf-8") as f:
long_description = f.read()
except:
except: # noqa: E722
# If all else fails, use some reasonable string.
long_description = 'BigQuery schema generator.'

setup(name='bigquery-schema-generator',
version='0.5.1',
description='BigQuery schema generator from JSON or CSV data',
long_description=long_description,
url='https://github.com/bxparks/bigquery-schema-generator',
author='Brian T. Park',
author_email='[email protected]',
license='Apache 2.0',
packages=['bigquery_schema_generator'],
python_requires='~=3.5',
entry_points={
'console_scripts': [
setup(
name='bigquery-schema-generator',
version='1.0',
description='BigQuery schema generator from JSON or CSV data',
long_description=long_description,
url='https://github.com/bxparks/bigquery-schema-generator',
author='Brian T. Park',
author_email='[email protected]',
license='Apache 2.0',
packages=['bigquery_schema_generator'],
python_requires='~=3.6',
entry_points={
'console_scripts': [
'generate-schema = bigquery_schema_generator.generate_schema:main'
]
}
},
)
