
Commit
Merge pull request #45 from bxparks/develop
merge v1.0 into master
bxparks authored Apr 4, 2020
2 parents e37cec6 + 3bf559d commit 15b68d1
Showing 11 changed files with 233 additions and 110 deletions.
48 changes: 48 additions & 0 deletions .github/workflows/pythonpackage.yml
@@ -0,0 +1,48 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: BigQuery Schema Generator CI

on:
push:
branches: [ develop ]
pull_request:
branches: [ develop ]

jobs:
build:

runs-on: ubuntu-latest
strategy:
matrix:
# 3.5 does not support f-strings
python-version: [3.6, 3.7, 3.8]

steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v1
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
# pip install -r requirements.txt
- name: Lint with flake8
run: |
pip install flake8
# Stop the build for most python errors.
# W503 and W504 are both enabled by default and contradictory, so we
# have to suppress one of them.
# E501 complains that 80 > 79 columns, but 80 is the default line wrap
# in vim.
flake8 . --count --ignore E501,W503 --show-source --statistics
# Exit-zero treats all errors as warnings. Vim editor defaults to 80.
# The complexity warning is not useful... in fact the whole thing is
# not useful, so turn it off.
# flake8 . --count --exit-zero --max-complexity=10 --max-line-length=80
# --statistics
- name: Test with unittest
run: |
python -m unittest
9 changes: 8 additions & 1 deletion CHANGELOG.md
@@ -1,9 +1,16 @@
# Changelog

* Unreleased
* 1.0 (2020-04-04)
* Fix `--sanitize_names` for recursive RECORD fields (Thanks riccardomc@,
see #43).
* Clean up how unit tests are run, trying my best to figure out
Python's convoluted package importing mechanism.
* Add GitHub Actions continuous integration pipelines with flake8 checks and
automated unit testing.
* 0.5.1 (2019-06-17)
* Add `--sanitize_names` to convert invalid characters in column names and
to shorten them if too long. (See #33; thanks @jonwarghed).
to shorten them if too long. (See #33; thanks jonwarghed@).
* 0.5 (2019-06-06)
* Add input and output parameters to run() to allow the client code using
`SchemaGenerator` to redirect the input and output files. (See #30).
4 changes: 4 additions & 0 deletions Makefile
@@ -0,0 +1,4 @@
.PHONY: tests

tests:
python3 -m unittest
43 changes: 31 additions & 12 deletions README.md
@@ -12,7 +12,7 @@ $ generate-schema < file.data.json > file.schema.json
$ generate-schema --input_format csv < file.data.csv > file.schema.json
```

Version: 0.5.1 (2019-06-19)
Version: 1.0 (2020-04-04)

## Background

@@ -44,18 +44,33 @@ the input dataset.

## Installation

Install from [PyPI](https://pypi.python.org/pypi) repository using `pip3`.
If you want to install the package for your entire system globally, use
Install from the [PyPI](https://pypi.python.org/pypi) repository using `pip3`. There
are too many ways to install packages in Python. The following are listed from
most to least recommended:

1) If you are using a virtual environment (such as
[venv](https://docs.python.org/3/library/venv.html)), then use:
```
$ sudo -H pip3 install bigquery_schema_generator
$ pip3 install bigquery_schema_generator
```
If you are using a virtual environment (such as
[venv](https://docs.python.org/3/library/venv.html)), then you don't need
the `sudo` command, and you can just type:

2) If you aren't using a virtual environment you can install into
your local Python directory:

```
$ pip3 install bigquery_schema_generator
$ pip3 install --user bigquery_schema_generator
```

3) If you want to install the package for your entire system globally, use
```
$ sudo -H pip3 install bigquery_schema_generator
```
but realize that you will be running code from PyPI as `root`, so this has
security implications.

Sometimes, your Python environment gets into a complete mess and the `pip3`
command won't work. Try typing `python3 -m pip` instead.

A successful install should print out something like the following (the version
number may be different):
```
@@ -644,16 +659,20 @@ took 67s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @

## System Requirements

This project was initially developed on Ubuntu 17.04 using Python 3.5.3. I have
tested it on:
This project was initially developed on Ubuntu 17.04 using Python 3.5.3, but it
now requires Python 3.6 or higher, mostly due to its use of f-strings.

I have tested it on:

* Ubuntu 18.04, Python 3.7.7
* Ubuntu 18.04, Python 3.6.7
* Ubuntu 17.10, Python 3.6.3
* Ubuntu 17.04, Python 3.5.3
* Ubuntu 16.04, Python 3.5.2
* MacOS 10.14.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
* MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)

The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
and 3.8.

## Changelog

See [CHANGELOG.md](CHANGELOG.md).
124 changes: 74 additions & 50 deletions bigquery_schema_generator/generate_schema.py
@@ -73,6 +73,9 @@ class SchemaGenerator:
# Detect floats inside quotes.
FLOAT_MATCHER = re.compile(r'^[-]?\d+\.\d+$')

# Matches characters that are invalid in BigQuery field names.
FIELD_NAME_MATCHER = re.compile(r'[^a-zA-Z0-9_]')

def __init__(self,
input_format='json',
infer_mode=False,
@@ -114,8 +117,8 @@ def __init__(self,

# This option generally wants to be turned on as any inferred schema
# will not be accepted by `bq load` when it contains illegal characters.
# Characters such as #, / or -. Neither will it be accepted if the column name
# in the schema is larger than 128 characters.
# Characters such as #, / or -. Neither will it be accepted if the
# column name in the schema is larger than 128 characters.
self.sanitize_names = sanitize_names
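The sanitization described in the comment above can be sketched in isolation using the `FIELD_NAME_MATCHER` pattern this commit introduces; `sanitize` is a hypothetical standalone helper mirroring the substitution and truncation applied later in `flatten_schema_map`:

```python
import re

# Pattern introduced by this commit: matches characters that are NOT
# valid in a BigQuery column name.
FIELD_NAME_MATCHER = re.compile(r'[^a-zA-Z0-9_]')

def sanitize(name):
    # Replace each invalid character with '_' and truncate, mirroring
    # the substitution applied under --sanitize_names.
    return FIELD_NAME_MATCHER.sub('_', name)[0:127]

print(sanitize('total-price/usd'))  # total_price_usd
```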

def log_error(self, msg):
@@ -323,7 +326,6 @@ def get_schema_entry(self, key, value):
if not value_mode or not value_type:
return None

# yapf: disable
if value_type == 'RECORD':
# recursively figure out the RECORD
fields = OrderedDict()
@@ -332,39 +334,48 @@
else:
for val in value:
self.deduce_schema_for_line(val, fields)
schema_entry = OrderedDict([('status', 'hard'),
('filled', True),
('info', OrderedDict([
('fields', fields),
('mode', value_mode),
('name', key),
('type', value_type),
]))])
# yapf: disable
schema_entry = OrderedDict([
('status', 'hard'),
('filled', True),
('info', OrderedDict([
('fields', fields),
('mode', value_mode),
('name', key),
('type', value_type),
])),
])
elif value_type == '__null__':
schema_entry = OrderedDict([('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'NULLABLE'),
('name', key),
('type', 'STRING'),
]))])
schema_entry = OrderedDict([
('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'NULLABLE'),
('name', key),
('type', 'STRING'),
])),
])
elif value_type == '__empty_array__':
schema_entry = OrderedDict([('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'REPEATED'),
('name', key),
('type', 'STRING'),
]))])
schema_entry = OrderedDict([
('status', 'soft'),
('filled', False),
('info', OrderedDict([
('mode', 'REPEATED'),
('name', key),
('type', 'STRING'),
])),
])
elif value_type == '__empty_record__':
schema_entry = OrderedDict([('status', 'soft'),
('filled', False),
('info', OrderedDict([
('fields', OrderedDict()),
('mode', value_mode),
('name', key),
('type', 'RECORD'),
]))])
schema_entry = OrderedDict([
('status', 'soft'),
('filled', False),
('info', OrderedDict([
('fields', OrderedDict()),
('mode', value_mode),
('name', key),
('type', 'RECORD'),
])),
])
else:
# Empty fields are returned as empty strings, and must be treated as
# a (soft String) to allow clobbering by subsequent non-empty fields.
@@ -374,13 +385,15 @@ def get_schema_entry(self, key, value):
else:
status = 'hard'
filled = True
schema_entry = OrderedDict([('status', status),
('filled', filled),
('info', OrderedDict([
('mode', value_mode),
('name', key),
('type', value_type),
]))])
schema_entry = OrderedDict([
('status', status),
('filled', filled),
('info', OrderedDict([
('mode', value_mode),
('name', key),
('type', value_type),
])),
])
# yapf: enable
return schema_entry

@@ -435,8 +448,8 @@ def infer_value_type(self, value):
# Implement the same type inference algorithm as 'bq load' for
# quoted values that look like ints, floats or bools.
if self.INTEGER_MATCHER.match(value):
if int(value) < self.INTEGER_MIN_VALUE or \
self.INTEGER_MAX_VALUE < int(value):
if (int(value) < self.INTEGER_MIN_VALUE
or self.INTEGER_MAX_VALUE < int(value)):
return 'QFLOAT' # quoted float
else:
return 'QINTEGER' # quoted integer
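The range check in this hunk can be exercised on its own. The int64 bounds below are assumptions (the values of the class constants are not shown in this diff, though BigQuery's INTEGER type is 64-bit), and `classify_quoted_integer` is a hypothetical helper:

```python
INTEGER_MIN_VALUE = -2**63      # assumed int64 lower bound
INTEGER_MAX_VALUE = 2**63 - 1   # assumed int64 upper bound

def classify_quoted_integer(value):
    # A quoted value that looks like an integer but overflows int64 is
    # demoted to a quoted float, matching the logic in this hunk.
    if (int(value) < INTEGER_MIN_VALUE
            or INTEGER_MAX_VALUE < int(value)):
        return 'QFLOAT'   # quoted float
    return 'QINTEGER'     # quoted integer

print(classify_quoted_integer('12345'))                 # QINTEGER
print(classify_quoted_integer('99999999999999999999'))  # QFLOAT
```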
@@ -618,11 +631,13 @@ def is_string_type(thetype):
]


def flatten_schema_map(schema_map,
keep_nulls=False,
sorted_schema=True,
infer_mode=False,
sanitize_names=False):
def flatten_schema_map(
schema_map,
keep_nulls=False,
sorted_schema=True,
infer_mode=False,
sanitize_names=False,
):
"""Converts the 'schema_map' into a flattened version which is
compatible with a BigQuery schema.
@@ -647,7 +662,8 @@
else schema_map.items()
for name, meta in map_items:
# Skip over fields which have been explicitly removed
if not meta: continue
if not meta:
continue

status = meta['status']
filled = meta['filled']
@@ -679,16 +695,24 @@
else:
# Recursively flatten the sub-fields of a RECORD entry.
new_value = flatten_schema_map(
value, keep_nulls, sorted_schema, sanitize_names)
schema_map=value,
keep_nulls=keep_nulls,
sorted_schema=sorted_schema,
infer_mode=infer_mode,
sanitize_names=sanitize_names,
)
elif key == 'type' and value in ['QINTEGER', 'QFLOAT', 'QBOOLEAN']:
# Convert QINTEGER -> INTEGER, similarly for QFLOAT and QBOOLEAN.
new_value = value[1:]
elif key == 'mode':
if infer_mode and value == 'NULLABLE' and filled:
new_value = 'REQUIRED'
else:
new_value = value
elif key == 'name' and sanitize_names:
new_value = re.sub('[^a-zA-Z0-9_]', '_', value)[0:127]
new_value = SchemaGenerator.FIELD_NAME_MATCHER.sub(
'_', value,
)[0:127]
else:
new_value = value
new_info[key] = new_value
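The recursive call in this hunk switched from positional to keyword arguments; before the fix, `sanitize_names` was passed into the `infer_mode` slot. A minimal sketch with a hypothetical `flatten` helper shows why the keyword form prevents this misbinding:

```python
def flatten(schema_map, keep_nulls=False, sorted_schema=True,
            infer_mode=False, sanitize_names=False):
    # Stand-in for flatten_schema_map: just report which flags it received.
    return (keep_nulls, sorted_schema, infer_mode, sanitize_names)

# Positional call like the pre-fix recursion: the fourth argument
# (meant to be sanitize_names=True) lands in the infer_mode slot.
print(flatten({}, False, True, True))
# -> (False, True, True, False)

# Keyword call like the fixed recursion: every flag reaches the
# parameter it was intended for.
print(flatten({}, keep_nulls=False, sorted_schema=True,
              infer_mode=False, sanitize_names=True))
# -> (False, True, False, True)
```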
31 changes: 16 additions & 15 deletions setup.py
@@ -4,28 +4,29 @@
try:
import pypandoc
long_description = pypandoc.convert('README.md', 'rst', format='md')
except:
except: # noqa: E722
# If unable to convert, try inserting the raw README.md file.
try:
with open('README.md', encoding="utf-8") as f:
long_description = f.read()
except:
except: # noqa: E722
# If all else fails, use some reasonable string.
long_description = 'BigQuery schema generator.'

setup(name='bigquery-schema-generator',
version='0.5.1',
description='BigQuery schema generator from JSON or CSV data',
long_description=long_description,
url='https://github.com/bxparks/bigquery-schema-generator',
author='Brian T. Park',
author_email='[email protected]',
license='Apache 2.0',
packages=['bigquery_schema_generator'],
python_requires='~=3.5',
entry_points={
'console_scripts': [
setup(
name='bigquery-schema-generator',
version='1.0',
description='BigQuery schema generator from JSON or CSV data',
long_description=long_description,
url='https://github.com/bxparks/bigquery-schema-generator',
author='Brian T. Park',
author_email='[email protected]',
license='Apache 2.0',
packages=['bigquery_schema_generator'],
python_requires='~=3.6',
entry_points={
'console_scripts': [
'generate-schema = bigquery_schema_generator.generate_schema:main'
]
}
},
)
