Py3 upgrade and Pacer Refactoring (#171)
* lots of changes to bring the code into line with Python 3.6, using six and other tricks. added tox for testing. there is still an issue with the title case function due to how Python 3 handles unicode strings.

* found a possible fix for the unicode issue in py3. bit of a hack, but it tries to detect whether a string is unicode or not.

* added Python 3.5 and 3.6 to the Travis file.

* turning off Debug in the title case test.

* pinning the requirements to exact versions for now. put a py2/3 compatibility wrapper function around calls to the requests response objects (a short sketch follows this list).

* set requests to a new version that works locally. fixed an issue with the mock not closing a connection. removed my broken non-fix for test_pacer.py.

* refactored cookie creation to be a bit more explicit about setting a cookie jar instance. refactored out POSTs to PACER, as it turns out you need some black magic voodoo to form the POST body into something it will accept.

* bumped the requests version back down to the same version as CL for now.
added the mock dependency for unit tests (to tox.ini and requirements-dev.txt).
started refactoring some of the PACER stuff into a PacerSession class that extends requests.Session to handle PACER nuances (see the sketch after this list).
tests pass locally with tox using a free login.

* cleaned up setup.py and moved some test requirements out of the base requirements.txt file. still need to update README.rst about these changes.
refactored BadLoginException into the juriscraper.pacer.http module, as it fits better next to the code that raises it.
added a default timeout of 300 to PACER sessions, since that value seemed commonly set elsewhere.

* relaxing the error condition for logins.

* attempt to refactor the PACER login to use the central auth service while still supporting the legacy test site, which does not seem to be supported at the moment.

* slimming down the tests to focus on key functionality vs. breadth of courts.

* changes to README.rst, minor tweaks related to code review.

* segregated the Python 2- and Python 3-specific regexes due to issues with unicode raw string literals (see the sketch after this list). minor tweaks per code review.

* added a new exception class to distinguish bad PACER credentials; login to the test site is now based on the "psc" court_id instead of the tr1234 username.
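
The PacerSession refactoring described above boils down to a requests.Session subclass that owns an explicit cookie jar entry, applies the default timeout of 300, and re-encodes POST bodies the way PACER expects. A minimal sketch under those assumptions; the cookie name, domain, encoding details, and the BadPacerCredentials name are illustrative rather than lifted from the diff::

    import requests


    class BadLoginException(Exception):
        """Raised when a PACER login attempt fails outright."""


    class BadPacerCredentials(BadLoginException):
        """Hypothetical name for the rejected-credentials case."""


    class PacerSession(requests.Session):
        """requests.Session subclass that papers over PACER nuances."""

        def __init__(self, pacer_token=None):
            super(PacerSession, self).__init__()
            if pacer_token is not None:
                # Be explicit: put the auth token in a real cookie jar
                # entry rather than passing bare dicts around.
                self.cookies.set('PacerSession', pacer_token,
                                 domain='.uscourts.gov', path='/')

        def post(self, url, data=None, json=None, **kwargs):
            # Default timeout of 300, as noted in the messages above.
            kwargs.setdefault('timeout', 300)
            if data:
                # The "voodoo": PACER appears to want multipart/form-data,
                # which requests emits when each value is a (filename,
                # value) tuple with a None filename.
                multipart = dict((k, (None, v)) for k, v in data.items())
                return super(PacerSession, self).post(url, files=multipart,
                                                      **kwargs)
            return super(PacerSession, self).post(url, data=data, json=json,
                                                  **kwargs)

Scraper code can then post plain dicts without knowing about the encoding, and the login path can raise the two exceptions to keep outright failures separate from bad credentials.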
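
The py2/3 compatibility wrapper around requests response objects reduces to choosing between .text and .content per runtime, the same six.PY2 guard visible in the AbstractSite diff below; the helper name here is hypothetical::

    import six


    def response_payload(response):
        """Return the response body in the form this runtime's parsers expect.

        Python 2 code paths consume response.text, while on Python 3 the
        raw bytes from response.content avoid a double decode before the
        lxml-based cleanup runs.
        """
        if six.PY2:
            return response.text
        return response.content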
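
The regex segregation is forced by syntax rather than behavior: ur'...' (unicode raw) literals were removed in Python 3, so a Python 2 pattern that relies on one cannot even be parsed alongside Python 3 code. A sketch of the per-runtime split, using an illustrative pattern rather than one from the commit::

    import re

    import six

    if six.PY2:
        # No raw prefix here: in a plain unicode literal, \u00a7 resolves
        # to the section sign before re sees it, and the regex escapes
        # are doubled by hand instead of using ur''.
        SECTION_RE = re.compile(u'\u00a7\\s*\\d+', re.UNICODE)
    else:
        # Python 3 raw strings are already unicode, and its re module
        # resolves \u escapes inside the pattern itself.
        SECTION_RE = re.compile(r'\u00a7\s*\d+', re.UNICODE)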
voutilad authored Feb 2, 2017
1 parent 7fbe326 commit 33756fc
Showing 34 changed files with 642 additions and 313 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -15,3 +15,5 @@ juriscraper.egg-info/
# Private PACER stuff and test fixtures
juriscraper/pacer/private_settings.py
tests/fixtures/cassettes/

.tox
2 changes: 2 additions & 0 deletions .travis.yml
@@ -2,6 +2,8 @@ sudo: false
language: python
python:
- '2.7'
- '3.5'
- '3.6'
script: python setup.py test
install: pip install -U setuptools ; pip install .
cache: pip
42 changes: 24 additions & 18 deletions README.rst
@@ -44,15 +44,21 @@ First step: Install Python 2.7.x, then:

::

# install the dependencies
sudo apt-get install libxml2-dev libxslt-dev libyaml-dev

# Install PhantomJS
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
tar -x -f phantomjs-1.9.7-linux-x86_64.tar.bz2
sudo mkdir -p /usr/local/phantomjs
sudo mv phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/phantomjs
rm -r phantomjs-1.9.7* # Cleanup
# -- Install the dependencies
# On Ubuntu/Debian Linux:
sudo apt-get install libxml2-dev libxslt-dev libyaml-dev
# On macOS with Homebrew <https://brew.sh>:
brew install libyaml

# -- Install PhantomJS
# On Ubuntu/Debian Linux
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
tar -x -f phantomjs-1.9.7-linux-x86_64.tar.bz2
sudo mkdir -p /usr/local/phantomjs
sudo mv phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/phantomjs
rm -r phantomjs-1.9.7* # Cleanup
# On macOS with Homebrew:
brew install phantomjs

# Finally, install the code.
pip install juriscraper
@@ -74,15 +80,15 @@ We also generally use Intellij with PyCharm installed. These are useful because

For scrapers to be merged:

- ``python setup.py test`` must pass, listing the results for any new
scrapers. This will be run automatically by
- Running tests via ``tox`` must pass, listing the results for any new
scrapers. The test suite will be run automatically by
`Travis-CI <https://travis-ci.org/freelawproject/juriscraper>`__. If changes are being made to the pacer code, the pacer tests must also pass when run. These tests are skipped by default. To run them, set environment variables for PACER_USERNAME and PACER_PASSWORD.
- a \*\_example\* file must be included in the ``tests/examples``
- A \*\_example\* file must be included in the ``tests/examples``
directory (this is needed for the tests to run your code).
- your code should be
- Your code should be
`PEP8 <http://www.python.org/dev/peps/pep-0008/>`__ compliant with no
major Pylint problems or Intellij inspection issues.
- your code should efficiently parse a page, returning no exceptions or
- Your code should efficiently parse a page, returning no exceptions or
speed warnings during tests on a modern machine.

When you're ready to develop a scraper, get in touch, and we'll find you
@@ -117,8 +123,8 @@ Instead of installing Juriscraper via pip, do the following:
::

git clone https://github.com/freelawproject/juriscraper.git .
python setup.py install

pip install -r requirements.txt
python setup.py test

Usage
=====
@@ -188,8 +194,8 @@ Tests
=====

We got that! You can (and should) run the tests with
``python setup.py test``. This will iterate over all of the
``*_example*`` files and run the scrapers against them.
``tox``. This will run ``python setup.py test`` for all supported Python runtimes,
iterating over all of the ``*_example*`` files and run the scrapers against them.

In addition, we use `Travis-CI <https://travis-ci.org/>`__ to
automatically run the tests whenever code is committed to the repository
19 changes: 11 additions & 8 deletions juriscraper/AbstractSite.py
@@ -1,9 +1,8 @@
import re
import json
import certifi
import hashlib
import requests

import six

from datetime import date, datetime
from requests.adapters import HTTPAdapter
@@ -139,7 +138,7 @@ def _clean_attributes(self):
if attr == 'download_urls':
sub_item = sub_item.strip()
else:
if isinstance(sub_item, basestring):
if isinstance(sub_item, six.string_types):
sub_item = clean_string(sub_item)
elif isinstance(sub_item, datetime):
sub_item = sub_item.date()
@@ -178,7 +177,7 @@ def _check_sanity(self):
for attr in self._all_attrs:
if self.__getattribute__(attr) is not None:
lengths[attr] = len(self.__getattribute__(attr))
values = lengths.values()
values = list(lengths.values())
if values.count(values[0]) != len(values):
# Are all elements equal?
raise InsanityException("%s: Scraped meta data fields have differing"
@@ -236,10 +235,10 @@ def _date_sort(self):
obj_list_attrs = [self.__getattribute__(attr) for attr in
self._all_attrs if
isinstance(self.__getattribute__(attr), list)]
zipped = zip(*obj_list_attrs)
zipped = list(zip(*obj_list_attrs))
zipped.sort(reverse=True)
i = 0
obj_list_attrs = zip(*zipped)
obj_list_attrs = list(zip(*zipped))
for attr in self._all_attrs:
if isinstance(self.__getattribute__(attr), list):
self.__setattr__(attr, obj_list_attrs[i][:])
@@ -249,7 +248,7 @@ def _make_hash(self):
"""Make a unique ID. ETag and Last-Modified from courts cannot be
trusted
"""
self.hash = hashlib.sha1(str(self.case_names)).hexdigest()
self.hash = hashlib.sha1(str(self.case_names).encode()).hexdigest()

def _get_adapter_instance(self):
"""Hook for returning a custom HTTPAdapter
@@ -339,7 +338,7 @@ def _return_request_text_object(self):
if 'json' in self.request['request'].headers.get('content-type', ''):
return self.request['request'].json()
else:
text = self._clean_text(self.request['request'].text)
payload = self.request['request'].content
if six.PY2:
payload = self.request['request'].text

text = self._clean_text(payload)
html_tree = self._make_html_tree(text)
html_tree.rewrite_links(fix_links_in_lxml_tree,
base_href=self.request['url'])
2 changes: 1 addition & 1 deletion juriscraper/OpinionSite.py
@@ -1,4 +1,4 @@
from AbstractSite import AbstractSite
from juriscraper.AbstractSite import AbstractSite


class OpinionSite(AbstractSite):
2 changes: 1 addition & 1 deletion juriscraper/OralArgumentSite.py
@@ -1,4 +1,4 @@
from AbstractSite import AbstractSite
from juriscraper.AbstractSite import AbstractSite


class OralArgumentSite(AbstractSite):
8 changes: 4 additions & 4 deletions juriscraper/lib/date_utils.py
@@ -108,11 +108,11 @@ def parse_dates(s, debug=False, sane_start=datetime.datetime(1750, 1, 1),

# Ditch unicode (_timelex() flips out on unicode if the system has
# cStringIO installed -- the default)
if isinstance(s, unicode):
s = s.encode('ascii', 'ignore')
#if isinstance(s, six.text_type):
# s = s.encode('ascii', 'ignore')

# Fix misspellings
for i, j in MISSPELLINGS.iteritems():
for i, j in MISSPELLINGS.items():
s = s.replace(i, j)


Expand All @@ -127,7 +127,7 @@ def parse_dates(s, debug=False, sane_start=datetime.datetime(1750, 1, 1),
hit_default_day_and_month = (d.month == DEFAULT.month and d.day == DEFAULT.day)
if not any([hit_default_year, hit_default_day_and_month]):
if debug:
print "Item %s parsed as: %s" % (item, d)
print("Item %s parsed as: %s" % (item, d))
if sane_start < d < sane_end:
dates.append(d)
except OverflowError:
12 changes: 8 additions & 4 deletions juriscraper/lib/html_utils.py
@@ -1,7 +1,7 @@
#!/usr/bin/env python
# encoding: utf-8
from urlparse import urlsplit
from urlparse import urlunsplit
from six import text_type
from six.moves.urllib.parse import urlsplit, urlunsplit

import re
from lxml import html
@@ -78,7 +78,11 @@ def set_response_encoding(request):
# HTTP headers. This way it is done before r.text is accessed
# (which would do it with vanilla chardet). This is a big
# performance boon, and can be removed once requests is upgraded
request.encoding = chardet.detect(request.content)['encoding']
if isinstance(request.content, text_type):
as_bytes = request.content.encode()
request.encoding = chardet.detect(as_bytes)['encoding']
else:
request.encoding = chardet.detect(request.content)['encoding']


def clean_html(text):
@@ -100,7 +104,7 @@ def clean_html(text):
# attribute, but we remove it in all cases, as there's no downside to
# removing it. This moves our encoding detection to chardet, rather than
# lxml.
if isinstance(text, unicode):
if isinstance(text, text_type):
text = re.sub(r'^\s*<\?xml\s+.*?\?>', '', text)

# Fix </br>
6 changes: 3 additions & 3 deletions juriscraper/lib/importer.py
@@ -34,9 +34,9 @@ def find_all_attr_or_punt(court_id):
# juriscraper.opinions.united_states.federal_appellate.ca1,
# therefore, we add it to our list!
module_strings.append(court_id)
except ImportError, e:
except ImportError as e:
# Something has gone wrong with the import
print "Import error: %s" % e
print("Import error: %s" % e)
return []

find_all_attr_or_punt(court_id)
@@ -51,5 +51,5 @@ def site_yielder(iterable, mod):
try:
site._download_backwards(i)
yield site
except HTTPError, e:
except HTTPError as e:
continue
18 changes: 9 additions & 9 deletions juriscraper/lib/log_tools.py
@@ -24,28 +24,28 @@ def make_default_logger(file_path=LOG_FILENAME):
maxBytes=5120000,
backupCount=7
)
except IOError, e:
except IOError as e:
if e.errno == 2:
print "\nWarning: %s: %s. " \
print("\nWarning: %s: %s. " \
"Have you created the directory for the log?" % (
e.strerror,
file_path,
)
))
elif e.errno == 13:
print "\nWarning: %s: %s. " \
print("\nWarning: %s: %s. " \
"Cannot access file as user: %s" % (
e.strerror,
file_path,
getpass.getuser(),
)
))
else:
print "\nIOError [%s]: %s\n%s" % (
print("\nIOError [%s]: %s\n%s" % (
e.errno,
e.strerror,
traceback.format_exc()
)
print "Juriscraper will continue to run, and all logs will be " \
"sent to stdout."
))
print("Juriscraper will continue to run, and all logs will be " \
"sent to stdout.")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s: %(message)s')