Py3 upgrade and Pacer Refactoring (#171)
* lots of changes to bring the code into line with Python 3.6, using six and other tricks. added tox for testing. there is still an issue with the title case function due to how Python 3 handles unicode strings.

* found a possible fix for the unicode issue in py3. bit of a hack, but it tries to detect whether a string is unicode or not.

* added Python 3.5 and 3.6 to the Travis file.

* turning off Debug in the title case test.

* pinning the requirements to exact versions for now. put a py2/3 compatibility wrapper function around calls to the requests response objects (a short sketch follows this list).

* set requests to a new version that works locally. fixed an issue with the mock not closing a connection. removed my broken non-fix for test_pacer.py.

* refactored cookie creation to be a bit more explicit about setting a cookie jar instance. refactored out POSTs to PACER, as it turns out you need some black magic voodoo to form the POST body into something it will accept.

* bumped the requests version back down to the same version as CL for now.
added the mock dependency for unit tests (to tox.ini and requirements-dev.txt).
started refactoring some of the PACER stuff into a PacerSession class that extends requests.Session to handle PACER nuances (see the sketch after this list).
tests pass locally with tox using a free login.

* cleaned up setup.py and moved some test requirements out of the base requirements.txt file. still need to update README.rst about these changes.
refactored BadLoginException into the juriscraper.pacer.http module, as it fits better next to the code that raises it.
added a default timeout of 300 to PACER sessions, since that value seemed commonly set elsewhere.

* relaxing the error condition for logins.

* attempt to refactor the PACER login to use the central auth service while still supporting the legacy test site, which does not seem to be supported at the moment.

* slimming down the tests to focus on key functionality vs. breadth of courts.

* changes to README.rst, minor tweaks related to code review.

* segregated the Python 2- and Python 3-specific regexes due to issues with unicode raw string literals (see the sketch after this list). minor tweaks per code review.

* added a new exception class to distinguish bad PACER credentials; login to the test site is now based on the "psc" court_id instead of the tr1234 username.
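
The PacerSession refactoring described above boils down to a requests.Session subclass that owns an explicit cookie jar entry, applies the default timeout of 300, and re-encodes POST bodies the way PACER expects. A minimal sketch under those assumptions; the cookie name, domain, encoding details, and the BadPacerCredentials name are illustrative rather than lifted from the diff::

    import requests


    class BadLoginException(Exception):
        """Raised when a PACER login attempt fails outright."""


    class BadPacerCredentials(BadLoginException):
        """Hypothetical name for the rejected-credentials case."""


    class PacerSession(requests.Session):
        """requests.Session subclass that papers over PACER nuances."""

        def __init__(self, pacer_token=None):
            super(PacerSession, self).__init__()
            if pacer_token is not None:
                # Be explicit: put the auth token in a real cookie jar
                # entry rather than passing bare dicts around.
                self.cookies.set('PacerSession', pacer_token,
                                 domain='.uscourts.gov', path='/')

        def post(self, url, data=None, json=None, **kwargs):
            # Default timeout of 300, as noted in the messages above.
            kwargs.setdefault('timeout', 300)
            if data:
                # The "voodoo": PACER appears to want multipart/form-data,
                # which requests emits when each value is a (filename,
                # value) tuple with a None filename.
                multipart = dict((k, (None, v)) for k, v in data.items())
                return super(PacerSession, self).post(url, files=multipart,
                                                      **kwargs)
            return super(PacerSession, self).post(url, data=data, json=json,
                                                  **kwargs)

Scraper code can then post plain dicts without knowing about the encoding, and the login path can raise the two exceptions to keep outright failures separate from bad credentials.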
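
The py2/3 compatibility wrapper around requests response objects reduces to choosing between .text and .content per runtime, the same six.PY2 guard visible in the AbstractSite diff below; the helper name here is hypothetical::

    import six


    def response_payload(response):
        """Return the response body in the form this runtime's parsers expect.

        Python 2 code paths consume response.text, while on Python 3 the
        raw bytes from response.content avoid a double decode before the
        lxml-based cleanup runs.
        """
        if six.PY2:
            return response.text
        return response.content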
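
The regex segregation is forced by syntax rather than behavior: ur'...' (unicode raw) literals were removed in Python 3, so a Python 2 pattern that relies on one cannot even be parsed alongside Python 3 code. A sketch of the per-runtime split, using an illustrative pattern rather than one from the commit::

    import re

    import six

    if six.PY2:
        # No raw prefix here: in a plain unicode literal, \u00a7 resolves
        # to the section sign before re sees it, and the regex escapes
        # are doubled by hand instead of using ur''.
        SECTION_RE = re.compile(u'\u00a7\\s*\\d+', re.UNICODE)
    else:
        # Python 3 raw strings are already unicode, and its re module
        # resolves \u escapes inside the pattern itself.
        SECTION_RE = re.compile(r'\u00a7\s*\d+', re.UNICODE)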
voutilad authored Feb 2, 2017
1 parent 7fbe326 commit 33756fc
Showing 34 changed files with 642 additions and 313 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -15,3 +15,5 @@ juriscraper.egg-info/
# Private PACER stuff and test fixtures
juriscraper/pacer/private_settings.py
tests/fixtures/cassettes/

.tox
2 changes: 2 additions & 0 deletions .travis.yml
@@ -2,6 +2,8 @@ sudo: false
language: python
python:
- '2.7'
- '3.5'
- '3.6'
script: python setup.py test
install: pip install -U setuptools ; pip install .
cache: pip
42 changes: 24 additions & 18 deletions README.rst
@@ -44,15 +44,21 @@ First step: Install Python 2.7.x, then:

::

# install the dependencies
sudo apt-get install libxml2-dev libxslt-dev libyaml-dev

# Install PhantomJS
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
tar -x -f phantomjs-1.9.7-linux-x86_64.tar.bz2
sudo mkdir -p /usr/local/phantomjs
sudo mv phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/phantomjs
rm -r phantomjs-1.9.7* # Cleanup
# -- Install the dependencies
# On Ubuntu/Debian Linux:
sudo apt-get install libxml2-dev libxslt-dev libyaml-dev
# On macOS with Homebrew <https://brew.sh>:
brew install libyaml

# -- Install PhantomJS
# On Ubuntu/Debian Linux
wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
tar -x -f phantomjs-1.9.7-linux-x86_64.tar.bz2
sudo mkdir -p /usr/local/phantomjs
sudo mv phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/phantomjs
rm -r phantomjs-1.9.7* # Cleanup
# On macOS with Homebrew:
brew install phantomjs

# Finally, install the code.
pip install juriscraper
@@ -74,15 +80,15 @@ We also generally use Intellij with PyCharm installed. These are useful because

For scrapers to be merged:

- ``python setup.py test`` must pass, listing the results for any new
scrapers. This will be run automatically by
- Running tests via ``tox`` must pass, listing the results for any new
scrapers. The test suite will be run automatically by
`Travis-CI <https://travis-ci.org/freelawproject/juriscraper>`__. If changes are being made to the pacer code, the pacer tests must also pass when run. These tests are skipped by default. To run them, set environment variables for PACER_USERNAME and PACER_PASSWORD.
- a \*\_example\* file must be included in the ``tests/examples``
- A \*\_example\* file must be included in the ``tests/examples``
directory (this is needed for the tests to run your code).
- your code should be
- Your code should be
`PEP8 <http://www.python.org/dev/peps/pep-0008/>`__ compliant with no
major Pylint problems or Intellij inspection issues.
- your code should efficiently parse a page, returning no exceptions or
- Your code should efficiently parse a page, returning no exceptions or
speed warnings during tests on a modern machine.

When you're ready to develop a scraper, get in touch, and we'll find you
@@ -117,8 +123,8 @@ Instead of installing Juriscraper via pip, do the following:
::

git clone https://github.com/freelawproject/juriscraper.git .
python setup.py install

pip install -r requirements.txt
python setup.py test

Usage
=====
@@ -188,8 +194,8 @@ Tests
=====

We got that! You can (and should) run the tests with
``python setup.py test``. This will iterate over all of the
``*_example*`` files and run the scrapers against them.
``tox``. This will run ``python setup.py test`` for all supported Python runtimes,
iterating over all of the ``*_example*`` files and run the scrapers against them.

In addition, we use `Travis-CI <https://travis-ci.org/>`__ to
automatically run the tests whenever code is committed to the repository
19 changes: 11 additions & 8 deletions juriscraper/AbstractSite.py
@@ -1,9 +1,8 @@
import re
import json
import certifi
import hashlib
import requests

import six

from datetime import date, datetime
from requests.adapters import HTTPAdapter
@@ -139,7 +138,7 @@ def _clean_attributes(self):
if attr == 'download_urls':
sub_item = sub_item.strip()
else:
if isinstance(sub_item, basestring):
if isinstance(sub_item, six.string_types):
sub_item = clean_string(sub_item)
elif isinstance(sub_item, datetime):
sub_item = sub_item.date()
@@ -178,7 +177,7 @@ def _check_sanity(self):
for attr in self._all_attrs:
if self.__getattribute__(attr) is not None:
lengths[attr] = len(self.__getattribute__(attr))
values = lengths.values()
values = list(lengths.values())
if values.count(values[0]) != len(values):
# Are all elements equal?
raise InsanityException("%s: Scraped meta data fields have differing"
@@ -236,10 +235,10 @@ def _date_sort(self):
obj_list_attrs = [self.__getattribute__(attr) for attr in
self._all_attrs if
isinstance(self.__getattribute__(attr), list)]
zipped = zip(*obj_list_attrs)
zipped = list(zip(*obj_list_attrs))
zipped.sort(reverse=True)
i = 0
obj_list_attrs = zip(*zipped)
obj_list_attrs = list(zip(*zipped))
for attr in self._all_attrs:
if isinstance(self.__getattribute__(attr), list):
self.__setattr__(attr, obj_list_attrs[i][:])
@@ -249,7 +248,7 @@ def _make_hash(self):
"""Make a unique ID. ETag and Last-Modified from courts cannot be
trusted
"""
self.hash = hashlib.sha1(str(self.case_names)).hexdigest()
self.hash = hashlib.sha1(str(self.case_names).encode()).hexdigest()

def _get_adapter_instance(self):
"""Hook for returning a custom HTTPAdapter
@@ -339,7 +338,7 @@ def _return_request_text_object(self):
if 'json' in self.request['request'].headers.get('content-type', ''):
return self.request['request'].json()
else:
text = self._clean_text(self.request['request'].text)
payload = self.request['request'].content
if six.PY2:
payload = self.request['request'].text

text = self._clean_text(payload)
html_tree = self._make_html_tree(text)
html_tree.rewrite_links(fix_links_in_lxml_tree,
base_href=self.request['url'])
2 changes: 1 addition & 1 deletion juriscraper/OpinionSite.py
@@ -1,4 +1,4 @@
from AbstractSite import AbstractSite
from juriscraper.AbstractSite import AbstractSite


class OpinionSite(AbstractSite):
2 changes: 1 addition & 1 deletion juriscraper/OralArgumentSite.py
@@ -1,4 +1,4 @@
from AbstractSite import AbstractSite
from juriscraper.AbstractSite import AbstractSite


class OralArgumentSite(AbstractSite):
8 changes: 4 additions & 4 deletions juriscraper/lib/date_utils.py
@@ -108,11 +108,11 @@ def parse_dates(s, debug=False, sane_start=datetime.datetime(1750, 1, 1),

# Ditch unicode (_timelex() flips out on unicode if the system has
# cStringIO installed -- the default)
if isinstance(s, unicode):
s = s.encode('ascii', 'ignore')
#if isinstance(s, six.text_type):
# s = s.encode('ascii', 'ignore')

# Fix misspellings
for i, j in MISSPELLINGS.iteritems():
for i, j in MISSPELLINGS.items():
s = s.replace(i, j)


Expand All @@ -127,7 +127,7 @@ def parse_dates(s, debug=False, sane_start=datetime.datetime(1750, 1, 1),
hit_default_day_and_month = (d.month == DEFAULT.month and d.day == DEFAULT.day)
if not any([hit_default_year, hit_default_day_and_month]):
if debug:
print "Item %s parsed as: %s" % (item, d)
print("Item %s parsed as: %s" % (item, d))
if sane_start < d < sane_end:
dates.append(d)
except OverflowError:
12 changes: 8 additions & 4 deletions juriscraper/lib/html_utils.py
@@ -1,7 +1,7 @@
#!/usr/bin/env python
# encoding: utf-8
from urlparse import urlsplit
from urlparse import urlunsplit
from six import text_type
from six.moves.urllib.parse import urlsplit, urlunsplit

import re
from lxml import html
@@ -78,7 +78,11 @@ def set_response_encoding(request):
# HTTP headers. This way it is done before r.text is accessed
# (which would do it with vanilla chardet). This is a big
# performance boon, and can be removed once requests is upgraded
request.encoding = chardet.detect(request.content)['encoding']
if isinstance(request.content, text_type):
as_bytes = request.content.encode()
request.encoding = chardet.detect(as_bytes)['encoding']
else:
request.encoding = chardet.detect(request.content)['encoding']


def clean_html(text):
@@ -100,7 +104,7 @@ def clean_html(text):
# attribute, but we remove it in all cases, as there's no downside to
# removing it. This moves our encoding detection to chardet, rather than
# lxml.
if isinstance(text, unicode):
if isinstance(text, text_type):
text = re.sub(r'^\s*<\?xml\s+.*?\?>', '', text)

# Fix </br>
6 changes: 3 additions & 3 deletions juriscraper/lib/importer.py
@@ -34,9 +34,9 @@ def find_all_attr_or_punt(court_id):
# juriscraper.opinions.united_states.federal_appellate.ca1,
# therefore, we add it to our list!
module_strings.append(court_id)
except ImportError, e:
except ImportError as e:
# Something has gone wrong with the import
print "Import error: %s" % e
print("Import error: %s" % e)
return []

find_all_attr_or_punt(court_id)
@@ -51,5 +51,5 @@ def site_yielder(iterable, mod):
try:
site._download_backwards(i)
yield site
except HTTPError, e:
except HTTPError as e:
continue
18 changes: 9 additions & 9 deletions juriscraper/lib/log_tools.py
@@ -24,28 +24,28 @@ def make_default_logger(file_path=LOG_FILENAME):
maxBytes=5120000,
backupCount=7
)
except IOError, e:
except IOError as e:
if e.errno == 2:
print "\nWarning: %s: %s. " \
print("\nWarning: %s: %s. " \
"Have you created the directory for the log?" % (
e.strerror,
file_path,
)
))
elif e.errno == 13:
print "\nWarning: %s: %s. " \
print("\nWarning: %s: %s. " \
"Cannot access file as user: %s" % (
e.strerror,
file_path,
getpass.getuser(),
)
))
else:
print "\nIOError [%s]: %s\n%s" % (
print("\nIOError [%s]: %s\n%s" % (
e.errno,
e.strerror,
traceback.format_exc()
)
print "Juriscraper will continue to run, and all logs will be " \
"sent to stdout."
))
print("Juriscraper will continue to run, and all logs will be " \
"sent to stdout.")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
logging.Formatter('%(asctime)s - %(levelname)s: %(message)s')